# Publications

At Thomson Reuters, we place high value on being active members of the research community. We benefit from published research in our work, so we try to contribute back by sharing some of our research and findings through publications. Publishing papers in scientific conferences and workshops helps ensure that our work remains aligned with the state of the art in our fields.

## 2019

### Statutory entailment using similarity features and decomposable attention models

John Hudzina, Thomas Vacek, Kanika Madan, Tonya Custis, and Frank Schilder.   Statutory entailment using similarity features and decomposable attention models.  Proceedings of the Competition on Legal Information Extraction/Entailment (COLIEE), COLIEE-2019 Workshop, June 21, 2019, at the International Conference on Artificial Intelligence and Law (ICAIL), 2019.

### Textual entailment using word embeddings and linguistic similarity

Kanika Madan, John Hudzina, Thomas Vacek, Frank Schilder, and Tonya Custis.   Textual entailment using word embeddings and linguistic similarity.  Proceedings of the Competition on Legal Information Extraction/Entailment (COLIEE), COLIEE-2019 Workshop, June 21, 2019, at the International Conference on Artificial Intelligence and Law (ICAIL), 2019.

### Exploiting Search Logs to Aid in Training and Automating Infrastructure for Question Answering in Professional Domains

Filippo Pompili, Jack G. Conrad, and Carter Kolbeck.   Exploiting Search Logs to Aid in Training and Automating Infrastructure for Question Answering in Professional Domains.  Proceedings of the 17th International Conference on Artificial Intelligence and Law (ICAIL), 2019.

### Litigation Analytics: Case Outcomes Extracted from US Federal Court Dockets

Thomas Vacek, Ronald Teo, Dezhao Song, Timothy Nugent, Conner Cowling, and Frank Schilder.   Litigation Analytics: Case Outcomes Extracted from US Federal Court Dockets.  Proceedings of the Natural Legal Language Processing Workshop 2019, 45--54, 2019.

Dockets contain a wealth of information for planning a litigation strategy, but the information is locked up in semi-structured text. Manually deriving the outcomes for each party (e.g., settlement, verdict) would be very labor-intensive. Having such information available for every past court case, however, would be very useful for developing a strategy because it potentially reveals tendencies and trends of judges, courts, and opposing counsel. We used Natural Language Processing (NLP) techniques and deep learning methods, which allowed us to scale the automatic analysis to millions of US federal court dockets. The automatically extracted information is fed into a Litigation Analytics tool that lawyers use to plan how they approach concrete litigations.

### Litigation Analytics: Extracting and querying motions and orders from US federal courts

Thomas Vacek, Dezhao Song, Hugo Molina-Salgado, Ronald Teo, Conner Cowling, and Frank Schilder.   Litigation Analytics: Extracting and querying motions and orders from US federal courts.  Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 116--121, 2019.

Legal litigation planning can benefit from statistics collected from past decisions made by judges. Information on the typical duration for a submitted motion, for example, can give valuable clues for developing a successful strategy. Such information is encoded in semi-structured documents called dockets. In order to extract and aggregate this information, we deployed various information extraction and machine learning techniques. The aggregated data can be queried in real time within the Westlaw Edge search engine. In addition to a keyword search for judges, lawyers, law firms, parties and courts, we also implemented a question answering interface that offers targeted questions in order to get to the respective answers quicker.

### Sentence Boundary Detection in Legal Text

George Sanchez.   Sentence Boundary Detection in Legal Text.  Proceedings of the Natural Legal Language Processing Workshop 2019, 31--38, 2019.
https://www.aclweb.org/anthology/W19-2204

In this paper, we examine several algorithms for detecting sentence boundaries in legal text. Legal text challenges sentence tokenizers because of its varied punctuation and syntax. Out-of-the-box algorithms perform poorly on it, which degrades any further analysis of the text. A novel, domain-specific approach is needed to detect sentence boundaries reliably. We present the results of our investigation in this paper.
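To make the difficulty concrete, here is a minimal sketch of why punctuation-based splitting breaks on legal citations. It is not the paper's method: the abbreviation list, the function names, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this
# sketch); a real system would need a much larger, domain-specific resource.
LEGAL_ABBREVIATIONS = {"v.", "U.S.", "U.S.C.", "F.", "Supp.", "Cir.", "No.", "Inc."}


def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


def abbreviation_aware_split(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token not in LEGAL_ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


sample = "See Smith v. Jones, 12 F. Supp. 3d 45. The court denied the motion."
print(naive_split(sample))               # the citation is cut into fragments
print(abbreviation_aware_split(sample))  # the citation stays in one sentence
```

The naive splitter treats every period in the citation as a boundary, while the abbreviation-aware variant recovers the two sentences. A fixed list like this is brittle, which is consistent with the abstract's point that a domain-specific approach is needed.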

### Westlaw Edge AI Features Demo: KeyCite Overruling Risk, Litigation Analytics, and WestSearch Plus

Tonya Custis, Frank Schilder, Thomas Vacek, Gayle McElvain, and Hector Martinez Alonso.   Westlaw Edge AI Features Demo: KeyCite Overruling Risk, Litigation Analytics, and WestSearch Plus.  Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, ICAIL '19, 256--257, 2019.
http://doi.acm.org/10.1145/3322640.3326739

### WestSearch Plus: A Non-factoid Question-Answering System for the Legal Domain

Gayle McElvain, George Sanchez, Sean Matthews, Don Teo, Filippo Pompili, and Tonya Custis.   WestSearch Plus: A Non-factoid Question-Answering System for the Legal Domain.  Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'19, 1361--1364, 2019.
http://doi.acm.org/10.1145/3331184.3331397

## 2018

### A Comparison of Two Paraphrase Models for Taxonomy Augmentation

Vassilis Plachouras, Fabio Petroni, Timothy Nugent, and Jochen L. Leidner.   A Comparison of Two Paraphrase Models for Taxonomy Augmentation.  Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 315--320, 2018.
https://www.aclweb.org/anthology/N18-2051

Taxonomies are often used to look up the concepts they contain in text documents (for instance, to classify a document). The more comprehensive the taxonomy, the higher the recall of the application that uses it. In this paper, we explore automatic taxonomy augmentation with paraphrases. We compare two state-of-the-art paraphrase models, one based on Moses, a statistical Machine Translation system, and one based on a sequence-to-sequence neural network, both trained on paraphrase data, with respect to their abilities to add novel nodes to an existing taxonomy from the risk domain. We conduct component-based and task-based evaluations. Our results show that paraphrasing is a viable method to enrich a taxonomy with more terms, and that Moses consistently outperforms the sequence-to-sequence neural...
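The link between taxonomy coverage and recall can be illustrated with a small sketch. Everything here is assumed for illustration: the two concepts, the added paraphrase terms, the sample document, and the simple substring matching are not taken from the paper.

```python
def find_concepts(document, taxonomy):
    """Return the concepts that have at least one surface term in the document.

    taxonomy maps a concept name to a list of surface terms; matching is a
    case-insensitive substring test (an assumption made for this sketch).
    """
    text = document.lower()
    return {
        concept
        for concept, terms in taxonomy.items()
        if any(term.lower() in text for term in terms)
    }


def recall(found, relevant):
    """Fraction of the relevant concepts that were found."""
    return len(found & relevant) / len(relevant)


# Hypothetical risk-domain taxonomy before and after adding paraphrases.
base = {
    "bribery": ["bribery"],
    "money laundering": ["money laundering"],
}
augmented = {
    "bribery": ["bribery", "kickback"],
    "money laundering": ["money laundering", "laundering of funds"],
}

doc = "The executive was convicted of bribery and of laundering of funds through shell companies."
relevant = {"bribery", "money laundering"}

print(recall(find_concepts(doc, base), relevant))
print(recall(find_concepts(doc, augmented), relevant))
```

In this toy case the base taxonomy misses the second concept because the document uses a different wording, and the added paraphrase recovers it.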

### attr2vec: Jointly Learning Word and Contextual Attribute Embeddings with Factorization Machines

Fabio Petroni, Vassilis Plachouras, Timothy Nugent, and Jochen L. Leidner.   attr2vec: Jointly Learning Word and Contextual Attribute Embeddings with Factorization Machines.  Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 453--462, 2018.
https://www.aclweb.org/anthology/N18-1042

The widespread use of word embeddings is associated with the recent successes of many natural language processing (NLP) systems. The key approach of popular models such as word2vec and GloVe is to learn dense vector representations from the context of words. More recently, other approaches have been proposed that incorporate different types of contextual information, including topics, dependency relations, n-grams, and sentiment. However, these models typically integrate only limited additional contextual information, and often in ad hoc ways. In this work, we introduce attr2vec, a novel framework for jointly learning embeddings for words and contextual attributes based on factorization machines. We perform experiments with different types of contextual information. Our experimental...

### TipMaster: A Knowledge Base of Authoritative Local News Sources on Social Media

Xin Shuai, Xiaomo Liu, Armineh Nourbakhsh, Sameena Shah, and Tonya Custis.   TipMaster: A Knowledge Base of Authoritative Local News Sources on Social Media.  13th Conference on Innovative Applications of Artificial Intelligence, IAAI-2018, 2018.

### Introduction to the special issue on legal text analytics

Jack G. Conrad and Luther Karl Branting.   Introduction to the special issue on legal text analytics.  Artif. Intell. Law, 26, 99--102, 2018.
https://doi.org/10.1007/s10506-018-9227-z

### The E2E NLG Challenge: A Tale of Two Systems

Charese Smiley, Elnaz Davoodi, Dezhao Song, and Frank Schilder.   The E2E NLG Challenge: A Tale of Two Systems.  Proceedings of the 11th International Conference on Natural Language Generation, 472--477, 2018.

### An Extensible Event Extraction System With Cross-Media Event Resolution

Fabio Petroni, Natraj Raman, Tim Nugent, Armineh Nourbakhsh, Žarko Panić, Sameena Shah, and Jochen L. Leidner.   An Extensible Event Extraction System With Cross-Media Event Resolution.  Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, 626--635, 2018.
http://doi.acm.org/10.1145/3219819.3219827

## 2017

### Scenario analytics: analyzing jury verdicts to evaluate legal case outcomes

Jack G. Conrad and Khalid Al-Kofahi.   Scenario analytics: analyzing jury verdicts to evaluate legal case outcomes.  Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12-16, 2017, 29--37, 2017.
https://doi.org/10.1145/3086512.3086516

### Say the right thing right: Ethics issues in natural language generation systems

Charese Smiley, Frank Schilder, Vassilis Plachouras, and Jochen L. Leidner.   Say the right thing right: Ethics issues in natural language generation systems.  Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 103--108, 2017.

### Building and querying an enterprise knowledge graph

Dezhao Song, Frank Schilder, Shai Hertz, Giuseppe Saltini, Charese Smiley, Phani Nivarthi, Oren Hazai, Dudi Landau, Mike Zaharkin, Tom Zielund, et al.  Building and querying an enterprise knowledge graph.  IEEE Transactions on Services Computing, 2017.

### A sequence approach to case outcome detection

Tom Vacek and Frank Schilder.   A sequence approach to case outcome detection.  Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, 209--215, 2017.

### A Multidimensional Investigation of the Effects of Publication Retraction on Scholarly Impact

Xin Shuai, Jason Rollins, Isabelle Moulinier, Tonya Custis, Mathilda Edmunds, and Frank Schilder.   A Multidimensional Investigation of the Effects of Publication Retraction on Scholarly Impact.  Journal of the Association for Information Science & Technology, 68, 2225--2236, 2017.

### Hashtag Mining: Discovering Relationship Between Health Concepts and Hashtags

Quanzhi Li, Sameena Shah, Rui Fang, Armineh Nourbakhsh, and Xiaomo Liu (2017).  Hashtag Mining: Discovering Relationship Between Health Concepts and Hashtags.  In Public Health Intelligence and the Internet (pp. 75--85). Springer.

### Reuters Tracer: Toward Automated News Production Using Large Scale Social Media Data

Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Sameena Shah, Robert Martin, and John Duprey.   Reuters Tracer: Toward Automated News Production Using Large Scale Social Media Data.  2017 IEEE International Conference on Big Data, 2017.

### Mapping the echo-chamber: detecting and characterizing partisan networks on Twitter

Armineh Nourbakhsh, Xiaomo Liu, Quanzhi Li, and Sameena Shah.   Mapping the echo-chamber: detecting and characterizing partisan networks on Twitter.  International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2017.

### "Breaking" Disasters: Predicting and Characterizing the Global News Value of Natural and Man-made Disasters

Armineh Nourbakhsh, Quanzhi Li, Xiaomo Liu, and Sameena Shah.   "Breaking" Disasters: Predicting and Characterizing the Global News Value of Natural and Man-made Disasters.  KDD Workshop on Data Science + Journalism, 2017.

### funSentiment at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs Using Word Vectors Built from StockTwits and Twitter

Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Rui Fang, and Xiaomo Liu.   funSentiment at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs Using Word Vectors Built from StockTwits and Twitter.  Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, 852--856, 2017.

### funSentiment at SemEval-2017 Task 4: Topic-Based Message Sentiment Classification by Exploiting Word Embeddings, Text Features and Target Contexts

Quanzhi Li, Armineh Nourbakhsh, Xiaomo Liu, Rui Fang, and Sameena Shah.   funSentiment at SemEval-2017 Task 4: Topic-Based Message Sentiment Classification by Exploiting Word Embeddings, Text Features and Target Contexts.  Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, 741--746, 2017.

### Data Sets: Word Embeddings Learned from Tweets and General Data

Quanzhi Li, Sameena Shah, Xiaomo Liu, and Armineh Nourbakhsh.   Data Sets: Word Embeddings Learned from Tweets and General Data.  The 11th International Conference on Weblogs and Social Media (ICWSM), 2017.

### Real-time novel event detection from social media

Quanzhi Li, Armineh Nourbakhsh, Sameena Shah, and Xiaomo Liu.   Real-time novel event detection from social media.  2017 IEEE 33rd International Conference on Data Engineering (ICDE), 1129--1139, 2017.

## 2016

### Fifteenth International Conference on Artificial Intelligence and Law (ICAIL 2015)

Katie Atkinson, Jack G. Conrad, Anne Gardner, and Ted Sichelman.   Fifteenth International Conference on Artificial Intelligence and Law (ICAIL 2015).  AI Magazine, 37, 107--108, 2016.
http://www.aaai.org/ojs/index.php/aimagazine/article/view/2633

### Semi-Supervised Events Clustering in News Retrieval

Jack G. Conrad and Michael Bender.   Semi-Supervised Events Clustering in News Retrieval.  Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval (ECIR 2016), Padua, Italy, March 20, 2016, 21--26, 2016.
http://ceur-ws.org/Vol-1568/paper4.pdf

### When to Plummet and When to Soar: Corpus Based Verb Selection for Natural Language Generation

Charese Smiley, Vassilis Plachouras, Frank Schilder, Hiroko Bretz, Jochen Leidner, and Dezhao Song.   When to Plummet and When to Soar: Corpus Based Verb Selection for Natural Language Generation.  Proceedings of the 9th International Natural Language Generation conference, 36--39, 2016.

### Interacting with financial data using natural language

Vassilis Plachouras, Charese Smiley, Hiroko Bretz, Ola Taylor, Jochen L. Leidner, Dezhao Song, and Frank Schilder.   Interacting with financial data using natural language.  Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 1121--1124, 2016.

### Witness identification in twitter

Rui Fang, Armineh Nourbakhsh, Xiaomo Liu, Sameena Shah, and Quanzhi Li.   Witness identification in twitter.  Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, 65--73, 2016.

### Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter

Xiaomo Liu, Quanzhi Li, Armineh Nourbakhsh, Rui Fang, Merine Thomas, Kajsa Anderson, Russ Kociuba, Mark Vedder, Steven Pomerville, Ramdev Wudali, et al.   Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter.  Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 207--216, 2016.

### Hashtag recommendation based on topic enhanced embedding, tweet entity data and learning to rank

Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Xiaomo Liu, and Rui Fang.   Hashtag recommendation based on topic enhanced embedding, tweet entity data and learning to rank.  Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2085--2088, 2016.

### Tweetsift: Tweet topic classification based on entity knowledge base and topic enhanced word embedding

Quanzhi Li, Sameena Shah, Xiaomo Liu, Armineh Nourbakhsh, and Rui Fang.   Tweetsift: Tweet topic classification based on entity knowledge base and topic enhanced word embedding.  Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2429--2432, 2016.

### Tweet topic classification using distributed language representations

Quanzhi Li, Sameena Shah, Xiaomo Liu, Armineh Nourbakhsh, and Rui Fang.   Tweet topic classification using distributed language representations.  2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 81--88, 2016.

### Tweet Sentiment Analysis by Incorporating Sentiment-Specific Word Embedding and Weighted Text Features

Quanzhi Li, Sameena Shah, Rui Fang, Armineh Nourbakhsh, and Xiaomo Liu.   Tweet Sentiment Analysis by Incorporating Sentiment-Specific Word Embedding and Weighted Text Features.  2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 568--571, 2016.

### Sentiment Analysis of Political Figures across News and Social Media

Quanzhi Li, Armineh Nourbakhsh, Rui Fang, Xiaomo Liu, and Sameena Shah.   Sentiment Analysis of Political Figures across News and Social Media.  2016 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2016.

### How Much Data Do You Need? Twitter Decahose Data Analysis

Quanzhi Li, Sameena Shah, Merine Thomas, Kajsa Anderson, Xiaomo Liu, Armineh Nourbakhsh, and Rui Fang.   How Much Data Do You Need? Twitter Decahose Data Analysis.  2016 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2016.

### Discovering Relevant Hashtags for Health Concepts: A Case Study of Twitter

Quanzhi Li, Sameena Shah, Rui Fang, Armineh Nourbakhsh, and Xiaomo Liu.   Discovering Relevant Hashtags for Health Concepts: A Case Study of Twitter.  AAAI Workshop: WWW and Population Health Intelligence, 2016.

### User Behaviors in Newsworthy Rumors: A Case Study of Twitter

Quanzhi Li, Xiaomo Liu, Rui Fang, Armineh Nourbakhsh, and Sameena Shah.   User Behaviors in Newsworthy Rumors: A Case Study of Twitter.  The 10th International Conference on Weblogs and Social Media (ICWSM), 627--630, 2016.

### Georeferencing

Jochen L. Leidner (2016).  Georeferencing.  In Wiley International Encyclopedia of Geography. Oxford, England, UK: Wiley-Blackwell.

### Newton: Building an authority-driven company tagging and resolution system

Merine Thomas, Hiroko Bretz, Thomas Vacek, Benjamin Hachey, Sudhanshu Singh, and Frank Schilder (2016).  Newton: Building an authority-driven company tagging and resolution system.  In Emma Tonkin and Stephanie Taylor (Eds.), Working With Text: Tools, Techniques and Approaches for Text Mining (pp. 159--187). Chandos Publishing.
https://www.sciencedirect.com/science/article/pii/B9781843347...

## 2015

### Quantifying socio-economic indicators in developing countries from mobile phone communication data: applications to Côte d'Ivoire

Huina Mao, Xin Shuai, Yong-Yeol Ahn, and Johan Bollen.   Quantifying socio-economic indicators in developing countries from mobile phone communication data: applications to Côte d'Ivoire.  EPJ Data Science, 4, 2015.
http://dx.doi.org/10.1140/epjds/s13688-015-0053-1

### Natural Language Question Answering and Analytics for Diverse and Interlinked Datasets

Dezhao Song, Frank Schilder, Charese Smiley, and Chris Brew.   Natural Language Question Answering and Analytics for Diverse and Interlinked Datasets.  Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, 101--105, 2015.
http://www.aclweb.org/anthology/N15-3021

### Newsworthy rumor events: A case study of twitter

Armineh Nourbakhsh, Xiaomo Liu, Sameena Shah, Rui Fang, Mohammad Mahdi Ghassemi, and Quanzhi Li.   Newsworthy rumor events: A case study of twitter.  2015 IEEE International Conference on Data Mining Workshop (ICDMW), 27--32, 2015.

### Real-time rumor debunking on twitter

Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui Fang, and Sameena Shah.   Real-time rumor debunking on twitter.  Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM), 1867--1870, 2015.

### The Role of Evaluation in AI and Law: An Examination of Its Different Forms in the AI and Law Journal

Jack G. Conrad and John Zeleznikow.   The Role of Evaluation in AI and Law: An Examination of Its Different Forms in the AI and Law Journal.  Proceedings of the 15th International Conference on Artificial Intelligence and Law, ICAIL '15, 181--186, 2015.
http://doi.acm.org/10.1145/2746090.2746116

### Information Extraction of Regulatory Enforcement Action: From Anti-Money Laundering Compliance to Countering Terrorism Finance

Vassilis Plachouras and Jochen L. Leidner.   Information Extraction of Regulatory Enforcement Action: From Anti-Money Laundering Compliance to Countering Terrorism Finance.  International Symposium on Open Source Intelligence and Security Informatics, FOSINT-SI, 2015.

### Multimodal Entity Coreference for Cervical Dysplasia Diagnosis

Dezhao Song, Edward Kim, Xiaolei Huang, Joseph Patruno, Héctor Muñoz-Avila, Jeff Heflin, L. Rodney Long, and Sameer Antani.   Multimodal Entity Coreference for Cervical Dysplasia Diagnosis.  IEEE Transactions on Medical Imaging (IEEE TMI), 34, 229--245, 2015.

### TR Discover: A Natural Language Interface for Querying and Analyzing Interlinked Datasets

Dezhao Song, Frank Schilder, Charese Smiley, Chris Brew, Tom Zielund, Hiroko Bretz, Robert Martin, Chris Dale, John Duprey, Tim Miller, and Johanna Harrison (2015).  TR Discover: A Natural Language Interface for Querying and Analyzing Interlinked Datasets.  In The Semantic Web - ISWC 2015 (pp. 21--37). Springer International Publishing.
http://dx.doi.org/10.1007/978-3-319-25010-6_2

Currently, the dominant technology for providing non-technical users with access to Linked Data is keyword-based search. This is problematic because keywords are often inadequate as a means for expressing user intent. In addition, while a structured query language can provide convenient access to the information needed by advanced analytics, unstructured keyword-based search cannot meet this extremely common need. This makes it harder than necessary for non-technical users to generate analytics. We address these difficulties by developing a natural language-based system that allows non-technical users to create well-formed questions. Our system, called TR Discover, maps from a fragment of English into an intermediate First Order Logic representation, which is in turn mapped into SPARQL...

## 2014

### Text Analytics at Thomson Reuters

Jochen L. Leidner.   Text Analytics at Thomson Reuters.  Invited Talk, London Text Analytics Meetup, London, England, 2014-10-16, 2014.
http://www.meetup.com/textanalytics/events/207765012/

Thomson Reuters is an information company that develops and sells information products to professionals in verticals such as Finance, Risk/Compliance, News, Law, Tax, Accounting, Intellectual Property, and Science. In this talk, I will describe how making money from information differs from making money from advertising, and the role of state-of-the-art text analytics techniques in the process will be described using some case studies. In addition, I will compare and contrast our industry research work with academic research.

### Research and Development in Information Access at Thomson Reuters Corporate R&D

Jochen L. Leidner.   Research and Development in Information Access at Thomson Reuters Corporate R&D.  Invited Talk, Language and Computation Day (LAC), University of Essex, Colchester, England, 2014-10-06, 2014.
http://lac.essex.ac.uk/language-and-computation-day-2014

Thomson Reuters is a modern information company. In this talk, I characterise the nature of carrying out research, development and innovation activities as part of its Corporate R&D group that add value to end customers and translate into additional revenue. A couple of R&D projects in the area of natural language processing, information retrieval and applied machine learning will be described, covering the legal, scientific, financial and news areas. The talk will conclude with a cautious outlook of what the near future may hold. Additionally, I will attempt a comparison of doing research in a company with pursuing academic research at a university.

### A practical SIM learning formulation with margin capacity control

Thomas Vacek.   A practical SIM learning formulation with margin capacity control.  Proceedings of 2014 International Joint Conference on Neural Networks (IJCNN), 4160--4167, 2014.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6889963

Given a finite i.i.d. dataset of the form $(y_i, x_i)$, the Single Index Model (SIM) learning problem is to estimate a regression of the form $u \circ f(x_i)$, where $u$ is some Lipschitz-continuous nondecreasing function and $f$ is a linear function. This paper applies Vapnik's Structural Risk Minimization principle to SIM learning. I show that a risk structure for the space of model functions $f$ gives a risk structure for the space of functions $u \circ f$. Second, I provide a practical learning formulation for SIM using a risk structure defined by margin-based capacity control. The new learning formulation is compared with support vector regression.
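For readers who prefer the model written out, the setup described in the abstract can be stated as follows. This is a sketch in assumed notation: the weight vector $w$, the Lipschitz constant $L$, and the reading of "regression" as a conditional expectation are not given in the abstract.

```latex
\[
  \mathbb{E}[\, y_i \mid x_i \,] = (u \circ f)(x_i) = u\bigl(f(x_i)\bigr),
  \qquad f(x) = \langle w, x \rangle ,
\]
\[
  s \le t \;\Rightarrow\; u(s) \le u(t),
  \qquad |u(s) - u(t)| \le L \, |s - t| .
\]
```

The first line composes the linear index function $f$ with the link $u$; the second line states the two conditions on $u$ named in the abstract, monotonicity and Lipschitz continuity.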

### Winning by Following the Winners: Mining the Behaviour of Stock Market Experts in Social Media

Wenhui Liao, Sameena Shah, and Masoud Makrehchi.   Winning by Following the Winners: Mining the Behaviour of Stock Market Experts in Social Media.  Proceedings of the International Social Computing, Behavioral-Cultural Modeling and Prediction Conference (SBP 2014), 2014.

A novel yet simple method is proposed for trading in the stock market by following successful stock market experts in social media. The problem of "how and where to invest" is translated into "who to follow in my investment". In other words, the search for a stock market investment strategy is converted into a search for stock market experts. Fortunately, many stock market experts are active in social media and openly express their opinions about the market. By analyzing their behaviour, mining their opinions and suggested actions on Twitter, and virtually trading on their suggestions, we are able to score each expert based on his/her performance. Using this scoring system, the experts with the most successful trades are recommended. The main objective in this research is to identify traders that...

### Social Informatics: Revised Selected Papers from SocInfo 2013 International Workshops, QMC and HISTOINFORMATICS, Kyoto, Japan, November 25, 2013

http://www.springer.com/computer/database+management+%26+info...

This book constitutes the refereed post-proceedings of two workshops held at the 5th International Conference on Social Informatics, SocInfo 2013, in Kyoto, Japan, in November 2013: the First Workshop on Quality, Motivation and Coordination of Open Collaboration, QMC 2013, and the First International Workshop on Histoinformatics, HISTOINFORMATICS 2013. The 11 revised papers presented at the workshops were carefully reviewed and selected from numerous submissions. They cover specific areas of social informatics. The QMC 2013 workshop attracted papers on new algorithms and methods to improve the quality or to increase the motivation of open collaboration, to reduce the cost of financial motivation or to decrease the time needed to finish collaborative tasks. The papers presented at...

### Exploring Linked Data with contextual tag clouds

Xingjian Zhang, Dezhao Song, Sambhawa Priya, Zachary Daniels, Kelly Reynolds, and Jeff Heflin.   Exploring Linked Data with contextual tag clouds.  Web Semantics: Science, Services and Agents on the World Wide Web, 24, 33--39, 2014.
http://www.sciencedirect.com/science/article/pii/S1570826814000055

In this paper we present the contextual tag cloud system: a novel application that helps users explore a large scale RDF dataset. Unlike folksonomy tags used in most traditional tag clouds, the tags in our system are ontological terms (classes and properties), and a user can construct a context with a set of tags that defines a subset of instances. Then in the contextual tag cloud, the font size of each tag depends on the number of instances that are associated with that tag and all tags in the context. Each contextual tag cloud serves as a summary of the distribution of relevant data, and by changing the context, the user can quickly gain an understanding of patterns in the data. Furthermore, the user can choose to include RDFS taxonomic and/or domain/range entailment...
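The counting rule in the abstract (a tag's size depends on how many instances carry that tag together with every tag in the context) can be sketched as below. This is an illustration only, not the system's implementation: the function names, the toy data, the linear font scaling, and the choice to leave context tags out of the cloud are assumptions.

```python
def contextual_tag_counts(instances, context):
    """Count, per tag, the instances that carry the tag and all context tags.

    instances maps an instance id to its set of tags (classes and properties).
    Tags that are already part of the context are left out of the result,
    which is a choice made for this sketch.
    """
    context = set(context)
    counts = {}
    for tags in instances.values():
        if context <= tags:  # instance matches every tag in the context
            for tag in tags - context:
                counts[tag] = counts.get(tag, 0) + 1
    return counts


def font_sizes(counts, min_pt=10, max_pt=32):
    """Scale counts linearly into point sizes (linear scaling is an assumption)."""
    if not counts:
        return {}
    top = max(counts.values())
    return {tag: min_pt + (max_pt - min_pt) * n / top for tag, n in counts.items()}


# Hypothetical toy dataset: instance id -> ontological tags.
instances = {
    "a": {"Person", "Author", "worksAt"},
    "b": {"Person", "Author"},
    "c": {"Person"},
    "d": {"Organization"},
}

counts = contextual_tag_counts(instances, {"Person"})
sizes = font_sizes(counts)
print(counts)
print(sizes)
```

Changing the context, for example to `{"Person", "Author"}`, narrows the matching instances and so changes every count, which is how the cloud summarizes the distribution of the selected data.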

## 2013

### A Statistical NLG Framework for Aggregated Planning and Realization

Ravi Kondadadi, Blake Howald, and Frank Schilder.   A Statistical NLG Framework for Aggregated Planning and Realization.  Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1406--1415, 2013.
http://www.aclweb.org/anthology/P13-1138

We present a hybrid natural language generation (NLG) system that consolidates macro and micro planning and surface realization tasks into one statistical learning process. Our novel approach is based on deriving a template bank automatically from a corpus of texts from a target domain. First, we identify domain specific entity tags and Discourse Representation Structures on a per sentence basis. Each sentence is then organized into semantically similar groups (representing a domain specific concept) by k-means clustering. After this semi-automatic processing (human review of cluster assignments), a number of corpus-level statistics are compiled and used as features by a ranking SVM to develop model weights from a training corpus. At generation time, a set of input data, the collection...

### GenNext: A Consolidated Domain Adaptable NLG System

Frank Schilder, Blake Howald, and Ravi Kondadadi.   GenNext: A Consolidated Domain Adaptable NLG System.  Proceedings of the 14th European Workshop on Natural Language Generation, 178--182, 2013.
http://www.aclweb.org/anthology/W13-2124

We introduce GenNext, an NLG system designed specifically to adapt quickly and easily to different domains. Given a domain corpus of historical texts, GenNext allows the user to generate a template bank organized by semantic concept via derived discourse representation structures in conjunction with general and domain-specific entity tags. Based on various features collected from the training corpus, the system statistically learns template representations and document structure and produces well-formed texts (as evaluated by crowdsourced and expert evaluations). In addition to domain adaptation, the GenNext hybrid approach significantly reduces complexity as compared to traditional NLG systems by relying on templates (consolidating micro-planning and surface realization) and...

### Domain Adaptable Semantic Clustering in Statistical NLG

Blake Howald, Ravikumar Kondadadi, and Frank Schilder.   Domain Adaptable Semantic Clustering in Statistical NLG.  Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) -- Long Papers, 143--154, 2013.
http://www.aclweb.org/anthology/W13-0113

We present a hybrid natural language generation system that utilizes Discourse Representation Structures (DRSs) for statistically learning syntactic templates from a given domain of discourse in sentence micro planning. In particular, given a training corpus of target texts, we extract semantic predicates and domain general tags from each sentence and then organize the sentences using supervised clustering to represent the conceptual meaning of the corpus. The sentences, additionally tagged with domain specific information (determined separately), are reduced to templates. We use a SVM ranking model trained on a subset of the corpus to determine the optimal template during generation. The combination of the conceptual unit, a set of ranked syntactic templates, and a given set of...

### Next Generation Legal Search - It's Already Here

Qiang Lu and Jack G. Conrad  Next Generation Legal Search - It's Already Here.  Vox Populii blog, Legal Information Institute (LII), Cornell University, 2013.
http://blog.law.cornell.edu/voxpop/2013/03/28/next-generation

Editor's Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), "Bringing order to legal documents: An issue-based recommendation system via cluster association", and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances...

### Evaluating Entity Linking with Wikipedia

Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran  Evaluating Entity Linking with Wikipedia.  Artificial Intelligence, 194, 130-150, 2013.
http://www.sciencedirect.com/science/article/pii/S0004370212000446

Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or nil. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets. We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account...

### The Significance of Evaluation in AI and Law: A Case Study Re-examining ICAIL Proceedings

Jack G. Conrad and John Zeleznikow.   The Significance of Evaluation in AI and Law: A Case Study Re-examining ICAIL Proceedings.  Proceedings of the 14th International Conference on Artificial Intelligence and Law (ICAIL), 186-191, 2013.

This paper examines the presence of performance evaluation in works published at ICAIL conferences since 2000. As such, it is a self-reflexive, meta-level study that investigates the proportion of works that include some form of performance assessment in their contribution. It also reports on the categories of evaluation present as well as their degree. In addition the paper compares current trends in performance measurement with those of earlier ICAILs, as reported in the Hall and Zeleznikow work on the same topic (ICAIL 2001). The paper also develops an argument for why evaluation in formal Artificial Intelligence and Law reports such as ICAIL proceedings is imperative. It underscores the importance of answering the question: how good is the system?, how reliable is the approach?,...

### Ants find the shortest path: A mathematical Proof

Jayadeva, Sameena Shah, A. Bhaya, R. Kothari, and S. Chandra  Ants find the shortest path: A mathematical Proof.  Swarm Intelligence, 7, 43-62, 2013.

In the most basic application of Ant Colony Optimization (ACO), a set of artificial ants find the shortest path between a source and a destination. Ants deposit pheromone on paths they take, preferring paths that have more pheromone on them. Since shorter paths are traversed faster, more pheromone accumulates on them in a given time, attracting more ants and leading to reinforcement of the pheromone trail on shorter paths. This is a positive feedback process that can also cause trails to persist on longer paths, even when a shorter path becomes available. To counteract this persistence on a longer path, ACO algorithms employ remedial measures, such as using negative feedback in the form of uniform evaporation on all paths. Obtaining high performance in ACO algorithms typically requires...

### Making Structured Data Searchable via Natural Language Generation with an Application to ESG Data

Jochen L. Leidner and Darya Kamkova.   Making Structured Data Searchable via Natural Language Generation with an Application to ESG Data.  Proceedings of the 10th International Conference on Flexible Query Answering Systems (FQAS 2013), Granada, Spain, September 18-20, 2013, Lecture Notes in Computer Science, 8132, 495--506, 2013.

Relational Databases are used to store structured data, which is typically accessed using report builders based on SQL queries. To search, forms need to be understood and filled out, which demands a high cognitive load. Due to the success of Web search engines, users have become acquainted with the easier mechanism of natural language search for accessing unstructured data. However, such keyword-based search methods are not easily applicable to structured data, especially where structured records contain non-textual content such as numbers. We present a method to make structured data, including numeric data, searchable with a Web search engine-like keyword search access mechanism. Our method is based on the creation of surrogate text documents using Natural Language Generation (NLG)...

### Stock Prediction Using Event-based Sentiment Analysis

M Makrehchi, Sameena Shah, and W. Liao.   Stock Prediction Using Event-based Sentiment Analysis.  Proceedings of IEEE/ACM International Conference on Web Intelligence, 2013.

We propose a novel approach to label social media text using significant stock market events (big losses or gains). Since stock events are easily quantifiable using returns from indices or individual stocks, they provide meaningful and automated labels. We extract significant stock movements and collect appropriate pre, post and contemporaneous text from social media sources (for example, tweets from Twitter). Subsequently, we assign the respective label (positive or negative) for each tweet. We train a model on this collected set and make predictions for labels of future tweets. We aggregate the net sentiment per day (amongst other metrics) and show that it holds significant predictive power for subsequent stock market movement. We create successful trading strategies based on...

### Benchmarks for Enterprise Linking: Thomson Reuters R&D at TAC 2013

Thomas Vacek, Hiroko Bretz, Frank Schilder, and Ben Hachey.   Benchmarks for Enterprise Linking: Thomson Reuters R&D at TAC 2013.  Proceedings of the Text Analysis Conference (TAC), 2013.

## 2012

### Event Linking: Grounding Event Reference in a News Archive

Joel Nothman, Matthew Honnibal, Ben Hachey, and James R. Curran.   Event Linking: Grounding Event Reference in a News Archive.  Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 228-232, 2012.
http://www.aclweb.org/anthology/P12-2045

Interpreting news requires identifying its constituent events. Events are complex linguistically and ontologically, so disambiguating their reference is challenging. We introduce event linking, which canonically labels an event reference with the article where it was first reported. This implicitly relaxes coreference to co-reporting, and will practically enable augmenting news archives with semantic hyperlinks. We annotate and analyse a corpus of 150 documents, extracting 501 links to a news archive with reasonable inter-annotator agreement.

### Bringing Order to Legal Documents - An Issue-based Recommendation System Via Cluster Association

Qiang Lu and Jack G. Conrad.   Bringing Order to Legal Documents - An Issue-based Recommendation System Via Cluster Association.  KEOD, 76-88, 2012.

The task of recommending content to professionals (such as attorneys or brokers) differs greatly from the task of recommending news to casual readers. A casual reader may be satisfied with a couple of good recommendations, whereas an attorney will demand precise and comprehensive recommendations from various content sources when conducting legal research. Legal documents are intrinsically complex and multi-topical, contain carefully crafted, professional, domain specific language, and possess a broad and unevenly distributed coverage of issues. Consequently, a high quality content recommendation system for legal documents requires the ability to detect significant topics from a document and recommend high quality content accordingly. Moreover, a litigation attorney preparing for a case...

### A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law

Trevor J.M. Bench-Capon, Michal Araszkiewicz, Kevin D. Ashley, Katie Atkinson, Floris Bex, Filipe Borges, Daniele Bourcier, Paul Bourgine, Jack G. Conrad, Enrico Francesconi, Thomas F. Gordon, Guido Governatori, Jochen L. Leidner, David D. Lewis, Ronald Prescott Loui, L. Thorne McCarty, Henry Prakken, Frank Schilder, Erich Schweighofer, Paul Thompson, Alex Tyrrell, Bart Verheij, Douglas N. Walton, and Adam Zachary Wyner  A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law.  Artif. Intell. Law, 20, 215-319, 2012.
http://dx.doi.org/10.1007/s10506-012-9131-x

### Convergence of the Dynamic Load Balancing Problem to Nash Equilibrium using Distributed Local Interactions

Sameena Shah and R. Kothari  Convergence of the Dynamic Load Balancing Problem to Nash Equilibrium using Distributed Local Interactions.  Information Sciences, 221, 297-305, 2012.

Load balancers distribute workload across multiple nodes based on a variation of the round robin algorithm, or a more complex algorithm that optimizes a specified objective or allows for horizontal scalability and higher availability. In this paper, we investigate whether robust load balancing can be achieved using a local co-operative mechanism between the resources (nodes). The local aspect of the mechanism implies that each node interacts with a small subset of the nodes that define its neighborhood. The co-operative aspect of the mechanism implies that a node may offload some of its load to neighbor nodes that have lesser load or accept jobs from neighbor nodes that have higher load. Each node is thus only aware of the state of its neighboring nodes and there is no central entity...

## 2011

### The Role of HLT in High-end Search and the Persistent Need for Advanced HLT Technologies

Jack G. Conrad  The Role of HLT in High-end Search and the Persistent Need for Advanced HLT Technologies.  Invited Talk, Workshop on Applying Human Language Technologies to Law (AHLTL 2011), held in conjunction with The Thirteenth International Conference on Artificial Intelligence and Law (ICAIL11), Pittsburgh, PA, 2011.

This talk will first address the multiple 'views' into legal materials that are harnessed by today's high-end legal search engines. These dimensions include the traditional document view (e.g., tf.idf scoring of a document's terms relative to a query), the taxonomic view (the classification of a candidate document using an expansive legal taxonomy such as the Key Number System), the citation network view (where legal documents are characterized by numerous citations, both in-bound and out-bound, some of which remain based on solid decisions and some of which may be weakened by subsequent judicial opinions), and the user view (records of thousands of user interactions with candidate documents including views, prints, cites, etc.). This is hardly a Saltonian search engine applied to legal...

### Public Record Aggregation Using Semi-supervised Entity Resolution

Jack G. Conrad, Christopher Dozier, Hugo Molina-Salgado, Merine Thomas, and Sriharsha Veeramachaneni.   Public Record Aggregation Using Semi-supervised Entity Resolution.  Proceedings of the 13th International Conference on Artificial Intelligence and Law (ICAIL 2011), 239-248, 2011.
http://www.law.pitt.edu/events/2011/06/icail-2011-the-thirtee...

This paper describes a highly scalable state of the art record aggregation system and the backbone infrastructure developed to support it. The system, called PeopleMap, allows legal professionals to effectively and efficiently explore a broad spectrum of public records databases by way of a single person-centric search. The backbone support system, called Concord, is a toolkit that allows developers to economically create record resolution solutions. The PeopleMap system is capable of linking billions of public records to a master data set consisting of hundreds of millions of person records. It was constructed using successive applications of Concord to link disparate public record data sets to a central person authority file. To our knowledge, the PeopleMap system is the largest of...

### Review of: Handbook of Natural Language Processing (second edition) Nitin Indurkhya and Fred J. Damerau (editors) (University of New South Wales; IBM Thomas J. Watson Research Center) Boca Raton, FL: CRC Press, 2010, xxxiii+678 pp; hardbound, ISBN 978-1-4200-8592-1

Jochen L. Leidner  Review of: Handbook of Natural Language Processing (second edition) Nitin Indurkhya and Fred J. Damerau (editors) (University of New South Wales; IBM Thomas J. Watson Research Center) Boca Raton, FL: CRC Press, 2010, xxxiii+678 pp; hardbound, ISBN 978-1-4200-8592-1.  Computational Linguistics, 37, 395--397, 2011.
http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_r_00048

### Detecting geographical references in the form of place names and associated spatial natural language

Jochen L. Leidner and Michael D. Lieberman  Detecting geographical references in the form of place names and associated spatial natural language.  SIGSPATIAL Special, 3, 5--11, 2011.

### Legal Document Clustering With Built-in Topic Segmentation

Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, and William Keenan.   Legal Document Clustering With Built-in Topic Segmentation.  Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM-11), 2011.

Clustering is a useful tool for helping users navigate, summarize, and organize large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with various degrees of success. Some unique characteristics of legal content as well as the nature of the legal domain present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession is still largely a fee-for-service field...

### Summarize this! - Recipes for multi-lingual automatic summarization

Frank Schilder and Liang Zhou.   Summarize this! - Recipes for multi-lingual automatic summarization.  In Multilingual Natural Language Applications: From Theory to Practice, Imed Zitouni and Daniel M. Bikel (Eds.), IBM Press, 2011.

## 2010

### Using Cross-Lingual Projections to Generate Semantic Role Labeled Annotated Corpus for Urdu - A Resource Poor Language

Smruthi Mukund, Debanjan Ghosh, and Rohini Srihari.   Using Cross-Lingual Projections to Generate Semantic Role Labeled Annotated Corpus for Urdu - A Resource Poor Language.  Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 797--805, 2010.
http://www.aclweb.org/anthology/C10-1090

In this paper we explore the possibility of using cross lingual projections that help to automatically induce role-semantic annotations in the PropBank paradigm for Urdu, a resource poor language. This technique provides annotation projections based on word alignments. It is relatively inexpensive and has the potential to reduce human effort involved in creating semantic role resources. The projection model exploits lexical as well as syntactic information on an English-Urdu parallel corpus. We show that our method generates reasonably good annotations with an accuracy of 92% on short structured sentences. Using the automatically generated annotated corpus, we conduct preliminary experiments to create a semantic role labeler for Urdu. The...

### Hunting for the Black Swan: Risk Mining from Text

Jochen Leidner and Frank Schilder.   Hunting for the Black Swan: Risk Mining from Text.  Proceedings of the ACL 2010 System Demonstrations, 54--59, 2010.
http://www.aclweb.org/anthology/P10-4010

In the business world, analyzing and dealing with risk permeates all decisions and actions. However, to date, risk identification, the first step in the risk management cycle, has always been a manual activity with little to no intelligent software tool support. In addition, although companies are required to list risks to their business in their annual SEC filings in the USA, these descriptions are often very high-level and vague. In this paper, we introduce Risk Mining, which is the task of identifying a set of risks pertaining to a business area or entity. We argue that by combining Web mining and Information Extraction (IE) techniques, risks can be detected automatically before they materialize, thus providing valuable business intelligence. We describe a system that induces a risk...

### Brain connectivity analysis by reduction to pair classification

Emanuele Olivetti, Sriharsha Veeramachaneni, Susanne Greiner, and Paolo Avesani.   Brain connectivity analysis by reduction to pair classification.  Proceedings of 2nd International Workshop on Cognitive Information Processing (CIP), 275-280, 2010.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5604101

Brain connectivity studies aim at describing the connections within the brain. Diffusion and functional MRI techniques provide different kinds of information to understand brain connectivity non-invasively. Fiber tract segmentation is the task of identifying pathways of neuronal axons connecting different brain areas from MRI data. In this work we propose a method to investigate the role of both diffusion and functional MRI data for supervised tract segmentation based on learning the pairwise relationships between streamlines. Experiments on real data demonstrate the promise of the approach.

### Concord - A Tool that Automates the Construction of Record Resolution Systems

Christopher Dozier, Hugo Molina-Salgado, Merine Thomas, and Sriharsha Veeramachaneni.   Concord - A Tool that Automates the Construction of Record Resolution Systems.  Proceedings of the Workshop on Named Entity Resolution at the Eighth International Conference on Language Resources and Evaluation (LREC 2010), 2010.

We describe an application we created called Concord that enables software engineers to build and execute Java based record resolution systems (RRS) quickly. Concord allows developers to interactively configure an RRS by specifying match feature functions, blocking functions, and unsupervised machine learning methods for a specific resolution problem. From the developer's defined configuration parameters, Concord creates a Java based RRS that generates training data, learns a matching model and resolves the records in the input files. As far as we know, Concord is unique among RRS generators in that it allows users to select feature functions which are customized for particular field types and in that it allows users to create matching models in a novel unsupervised way using a...

### Book Review: Representation and Management of Narrative Information: Theoretical Principles and Implementation

Frank Schilder  Book Review: Representation and Management of Narrative Information: Theoretical Principles and Implementation.  Computational Linguistics, 36, 151-156, 2010.
http://www.mitpressjournals.org/doi/abs/10.1162/coli.2010.36.1.36105

Gian Piero Zarri's book summarizes more than a decade of his research on knowledge representation for narrative text. The centerpiece of Zarri's work is the Narrative Knowledge Representation Language (NKRL), which he describes and compares to other competing theories. In addition, he discusses how to model the meaning of narrative text by giving many real-world examples. NKRL provides three different components or capabilities: (a) a representation system, (b) inferencing, and (c) an implementation. It is implemented via a Java-based system that shows how a representational theory can be applied to narrative texts.

### Building and Operating a Hadoop/MapReduce Cluster from Commodity Components: A Case Study

Jochen L. Leidner and Gary Berosik  Building and Operating a Hadoop/MapReduce Cluster from Commodity Components: A Case Study.  ;login:, 26--37, 2010.

This tutorial presents a recipe for the construction of a compute cluster for processing large volumes of data, using cheap, easily available personal computer hardware (Intel/AMD based PCs) and freely available open source software (Ubuntu Linux, Apache Hadoop).

### E-Discovery Revisited: the Need for Artificial Intelligence beyond Information Retrieval

Jack G. Conrad  E-Discovery Revisited: the Need for Artificial Intelligence beyond Information Retrieval.  Artificial Intelligence and Law, 18, 1-25, 2010.
http://dx.doi.org/10.1007/s10506-010-9096-6

In this work, we provide a broad overview of the distinct stages of E-Discovery. We portray them as an interconnected, often complex workflow process, while relating them to the general Electronic Discovery Reference Model (EDRM). We start with the definition of E-Discovery. We then describe the very positive role that NIST's Text REtrieval Conference (TREC) has added to the science of E-Discovery, in terms of the tasks involved and the evaluation of the legal discovery work performed. Given the critical nature that data analysis plays at various stages of the process, we present a pyramid model, which complements the EDRM model: for gathering and hosting; indexing; searching and navigating; and finally consolidating and summarizing E-Discovery findings. Next we discuss where the...

### Filter-based Data Partitioning for Training Multiple Classifier Systems

Rozita A. Dara, Masoud Makrehchi, and Mohamed S. Kamel  Filter-based Data Partitioning for Training Multiple Classifier Systems.  IEEE Transactions on Knowledge and Data Engineering, 22, 508-522, 2010.

Data partitioning methods such as bagging and boosting have been extensively used in multiple classifier systems. These methods have shown a great potential for improving classification accuracy. This study is concerned with the analysis of training data distribution and its impact on the performance of multiple classifier systems. In this study, several feature-based and class-based measures are proposed. These measures can be used to estimate statistical characteristics of the training partitions. To assess the effectiveness of different types of training partitions, we generated a large number of disjoint training partitions with distinctive distributions. Then, we empirically assessed these training partitions and their impact on the performance of the system by utilizing the...

### Simultaneous measurement of RBC velocity, flux, hematocrit and shear rate in vascular networks

Walid S Kamoun, Sung-Suk Chae, Delphine A Lacorre, James A Tyrrell, Mariela Mitre, Marijn A Gillissen, Dai Fukumura, Rakesh K Jain, and Lance L Munn  Simultaneous measurement of RBC velocity, flux, hematocrit and shear rate in vascular networks.  Nature Methods, 7, 655-660, 2010.
http://www.nature.com/nmeth/journal/v7/n8/full/nmeth.1475.html

Not all tumor vessels are equal. Tumor-associated vasculature includes immature vessels, regressing vessels, transport vessels undergoing arteriogenesis and peritumor vessels influenced by tumor growth factors. Current techniques for analyzing tumor blood flow do not discriminate between vessel subtypes and only measure average changes from a population of dissimilar vessels. We developed methodologies for simultaneously quantifying blood flow (velocity, flux, hematocrit and shear rate) in extended networks at single-capillary resolution in vivo. Our approach relies on deconvolution of signals produced by labeled red blood cells as they move relative to the scanning laser of a confocal or multiphoton microscope and provides fully resolved three-dimensional flow profiles within vessel...

### Unsupervised Learning for Reranking-based Patent Retrieval

Wenhui Liao and Sriharsha Veeramachaneni.   Unsupervised Learning for Reranking-based Patent Retrieval.  3rd International Workshop on Patent Information Retrieval, in 19th ACM Conference on Information and Knowledge Management (CIKM), 2010.

We present a reranking-based patent retrieval system where the query text is a patent claim, which may be from an existing patent. The novelty of our approach is the automatic generation of training data for learning the ranker. The ranking is based on several features of the candidate patent, such as the text similarity to the claim, international patent code overlap, and internal citation structure of the candidates. Our approach more than doubles the average number of relevant patents in the top 5 over a strong baseline retrieval system.

### An Information Theoretic Approach to Generating Fuzzy Hypercubes for If-Then Classifiers

Masoud Makrehchi and M.S. Kamel  An Information Theoretic Approach to Generating Fuzzy Hypercubes for If-Then Classifiers.  Journal of Intelligent and Fuzzy Systems, 21, 2010.

In this paper, a framework for automatic generation of fuzzy membership functions and fuzzy rules from training data is proposed. The main focus of this paper is designing fuzzy if-then classifiers; however the proposed method can be employed in designing a wide range of fuzzy system applications. After the fuzzy membership functions are modeled by their supports, an optimization technique, based on a multi-objective real coded genetic algorithm with adaptive cross over and mutation probabilities, is implemented to find near optimal supports. Employing interpretability constraint in parameter representation and encoding, we ensure that the generated fuzzy membership function does have a semantic meaning. The fitness function of the genetic algorithm, which estimates the quality of the...

## 2009

### The Semantic Web: Organizing by Meaning, not Words -- What it is, what it is not, and what it can do for the Healthcare Sector

Jochen L. Leidner  The Semantic Web: Organizing by Meaning, not Words -- What it is, what it is not, and what it can do for the Healthcare Sector.  Invited Talk, Meeting of the Healthcare Executives Leadership Network (HELN), Chicago, IL, 2009.

The Semantic Web (SemWeb) or "data Web" is a vision to extend the technology stack behind the World Wide Web (WWW) in order to make content (more) machine-interpretable. In this talk, I present an outline of the SemWeb program and its objectives, describe the technical and conceptual obstacles encountered and the current state of the initiative and its technologies. I then proceed to survey current attempts of applying SemWeb ideas in the healthcare sector, and conclude with some suggestions and tentative trends identified.

### Quantitative Analysis and Modeling of Tumor Microvasculature on a GPU Architecture

Alex Tyrrell  Quantitative Analysis and Modeling of Tumor Microvasculature on a GPU Architecture.  Invited Talk at Harvard Medical School, 2009.

Recent advances in optical microscopy have provided new insights into tumor development and treatment response. A primary challenge is that the resulting image datasets are increasingly large and complex, making computer-aided analysis attractive. The demand for speed and accuracy must be met with more sophisticated image processing, coupled with more powerful hardware platforms. As methodologies in quantitative image analysis are improved, one area of potential impact is in the development and validation of mathematical tumor models. Here again, efforts are often hampered by the high computational overhead associated with these models. This talk will focus on efforts to achieve high performance computing on a GPU architecture in order to analyze and model 3-D tumor microvasculature.

### Three-dimensional microscopy of the tumor microenvironment in vivo using optical frequency domain imaging

Benjamin J Vakoc, Ryan M Lanning, James A Tyrrell, Timothy P Padera, Lisa A Bartlett, Triantafyllos Stylianopoulos, Lance L Munn, Guillermo J Tearney, Dai Fukumura, Rakesh K Jain, and Brett E Bouma  Three-dimensional microscopy of the tumor microenvironment in vivo using optical frequency domain imaging.  Nature Medicine, 2009.
http://www.nature.com/nm/journal/vaop/ncurrent/abs/nm.1971.html

Intravital multiphoton microscopy has provided powerful mechanistic insights into health and disease and has become a common instrument in the modern biological laboratory. The requisite high numerical aperture and exogenous contrast agents that enable multiphoton microscopy, however, limit the ability to investigate substantial tissue volumes or to probe dynamic changes repeatedly over prolonged periods. Here we introduce optical frequency domain imaging (OFDI) as an intravital microscopy that circumvents the technical limitations of multiphoton microscopy and, as a result, provides unprecedented access to previously unexplored, crucial aspects of tissue biology. Using unique OFDI-based approaches and entirely intrinsic mechanisms of contrast, we present rapid and repeated...

### Approximate Nonmyopic Sensor Selection Via Submodularity and Partitioning

Wenhui Liao, Qiang Ji, and W. A. Wallace  Approximate Nonmyopic Sensor Selection Via Submodularity and Partitioning.  IEEE Transactions on Systems, Man, and Cybernetics Part A, 39, 782-794, 2009.
http://portal.acm.org/citation.cfm?id=1656589

As sensors become more complex and prevalent, they present their own issues of cost effectiveness and timeliness. It becomes increasingly important to select sensor sets that provide the most information at the least cost and in the most timely and efficient manner. Two typical sensor selection problems appear in a wide range of applications. The first type involves selecting a sensor set that provides the maximum information gain within a budget limit. The other type involves selecting a sensor set that optimizes the tradeoff between information gain and cost. Unfortunately, both require extensive computations due to the exponential search space of sensor subsets. This paper proposes efficient sensor selection algorithms for solving both of these sensor selection problems. The...

### Hearing improvement after bevacizumab in patients with neurofibromatosis type 2

Scott R Plotkin, Anat O Stemmer-Rachamimov, Fred G Barker 2nd, Chris Halpin, Timothy P Padera, Alex Tyrrell, A Gregory Sorensen, Rakesh K Jain, and Emmanuelle di Tomaso  Hearing improvement after bevacizumab in patients with neurofibromatosis type 2.  New England Journal of Medicine, 361, 358-67, 2009.
http://content.nejm.org/cgi/content/abstract/361/4/358

BACKGROUND: Profound hearing loss is a serious complication of neurofibromatosis type 2, a genetic condition associated with bilateral vestibular schwannomas, benign tumors that arise from the eighth cranial nerve. There is no medical treatment for such tumors. METHODS: We determined the expression pattern of vascular endothelial growth factor (VEGF) and three of its receptors, VEGFR-2, neuropilin-1, and neuropilin-2, in paraffin-embedded samples from 21 vestibular schwannomas associated with neurofibromatosis type 2 and from 22 sporadic schwannomas. Ten consecutive patients with neurofibromatosis type 2 and progressive vestibular schwannomas who were not candidates for standard treatment were treated with bevacizumab, an anti-VEGF monoclonal antibody. An imaging response was defined...

### Query-based opinion summarization for legal blog entries

Jack G. Conrad, Jochen L. Leidner, Frank Schilder, and Ravi Kondadadi.   Query-based opinion summarization for legal blog entries.  Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL), 167-176, 2009.
http://doi.acm.org/10.1145/1568234.1568253

We present the first report of automatic sentiment summarization in the legal domain. This work is based on processing a set of legal questions with a system consisting of a semi-automatic Web blog search module and FastSum, a fully automatic extractive multi-document sentiment summarization system. We provide quantitative evaluation results of the summaries using legal expert raters. We report baseline evaluation results for query-based sentiment summarization for legal blogs: on a five-point scale, average responsiveness and linguistic quality are slightly higher than 2 (at human inter-annotator agreement kappa=0.75). To the best of our knowledge, this is the first evaluation of sentiment summarization in the legal blogosphere.

### The TempEval Challenge: Identifying Temporal Relations in Text

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Jessica Moszkowicz, and James Pustejovsky  The TempEval Challenge: Identifying Temporal Relations in Text.  Language Resources and Evaluation, Special Issue on Computational Semantic Analysis of Language: SemEval-2007 and Beyond, 43, 161-179, 2009.
http://dx.doi.org/10.1007/s10579-009-9086-z

TempEval is a framework for evaluating systems that automatically annotate texts with temporal relations. It was created in the context of the SemEval 2007 workshop and uses the TimeML annotation language. The evaluation consists of three subtasks of temporal annotation: anchoring an event to a time expression in the same sentence, anchoring an event to the document creation time, and ordering main events in consecutive sentences. In this paper we describe the TempEval task and the systems that participated in the evaluation. In addition, we describe how further task decomposition can bring even more structure to the evaluation of temporal relations.

### Named Entity Recognition and Resolution in Legal Text

Christopher Dozier, Ravi Kondadadi, Marc Light, Arun Vachher, Sriharsha Veeramachaneni, and Ramdev Wudali.   Named Entity Recognition and Resolution in Legal Text.  Semantic Processing of Legal Texts, Lecture Notes in Computer Science, 2009.

Named entities in text are persons, places, companies, etc. that are explicitly mentioned in text using proper nouns. The process of finding named entities in a text and classifying them by semantic type is called named entity recognition. Resolution of named entities is the process of linking a mention of a name in text to a pre-existing database entry. This grounds the mention in something analogous to a real world entity. For example, a mention of a judge named Mary Smith might be resolved to a database entry for a specific judge of a specific district of a specific state. This recognition and resolution of named entities can be leveraged in a number of ways including providing hypertext links to information stored about a particular judge: their education, who appointed them,...

### Multilingual Information Access: Three Heretical Questions

Jochen L. Leidner  Multilingual Information Access: Three Heretical Questions.  2009.

Based on experience from two different environments, a European SME developing analytics software and a large corporate environment in the U.S. focusing on professional information services, three provocative theses are put to the panel and the audience for discussion, namely: (1.) Are you solving a real problem? (2.) What's the right price of a linguistic resource? and (3.) Why do academics stop working on a problem just when it gets really interesting?

### Building and Installing a Hadoop/MapReduce Cluster from Commodity Components

Jochen L. Leidner and Gary Berosik  Building and Installing a Hadoop/MapReduce Cluster from Commodity Components.  Thomson Reuters Corporation, 2009.

This tutorial presents a recipe for the construction of a compute cluster for processing large volumes of data, using cheap, easily available personal computer hardware (Intel/AMD based PCs) and freely available open source software (Ubuntu Linux, Apache Hadoop).

### Learning Bayesian Network Parameters under Incomplete Data with Qualitative Domain Knowledge

Wenhui Liao and Qiang Ji  Learning Bayesian Network Parameters under Incomplete Data with Qualitative Domain Knowledge.  Pattern Recognition, 42, 3046-3056, 2009.
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V1.

Bayesian networks (BNs) have gained increasing attention in recent years. One key issue in Bayesian networks is parameter learning. When training data is incomplete or sparse or when multiple hidden nodes exist, learning parameters in Bayesian networks becomes extremely difficult. Under these circumstances, the learning algorithms are required to operate in a high-dimensional search space and they could easily get trapped among copious local maxima. This paper presents a learning algorithm to incorporate domain knowledge into the learning to regularize the otherwise ill-posed problem, to limit the search space, and to avoid local optima. Unlike the conventional approaches that typically exploit the quantitative domain knowledge such as prior probability distribution, our method...

### Integrating High Precision Rules with Statistical Sequence Classifiers for Accuracy and Speed

Wenhui Liao, Marc Light, and Sriharsha Veeramachaneni.   Integrating High Precision Rules with Statistical Sequence Classifiers for Accuracy and Speed.  Software engineering, testing, and quality assurance for natural language processing, Workshop of North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT), 74-77, 2009.
http://www.aclweb.org/anthology/W/W09/W09-1512.pdf

Integrating rules and statistical systems is a challenge often faced by natural language system builders. A common subclass is integrating high precision rules with a Markov statistical sequence classifier. In this paper we suggest that using such rules to constrain the sequence classifier decoder results in superior accuracy and efficiency. In a case study of a named entity tagging system, we provide evidence that this method of combination does prove more efficient than other methods. The accuracy was the same.
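
As a rough illustration of what constraining a decoder with high-precision rules can look like, the sketch below runs Viterbi decoding over log-scores and skips every label that a rule has ruled out at a position. It is not the paper's implementation: the function name `constrained_viterbi`, the `forced` list, and all scores are hypothetical, and the "rule" is reduced to a label that is simply fixed for a token.

```python
import math
from typing import Dict, List, Optional, Tuple

NEG_INF = float("-inf")

def constrained_viterbi(
    emissions: List[Dict[str, float]],            # emissions[t][label] = log-score
    transitions: Dict[Tuple[str, str], float],    # transitions[(prev, cur)] = log-score
    labels: List[str],
    forced: List[Optional[str]],                  # forced[t] = label fixed by a rule, or None
) -> List[str]:
    def allowed(t: int, label: str) -> bool:
        return forced[t] is None or forced[t] == label

    # best[t][label] = best log-score of any path ending in `label` at position t
    best = [{lab: emissions[0][lab] if allowed(0, lab) else NEG_INF for lab in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            if not allowed(t, cur):
                # Labels excluded by a rule are never expanded; this is where time is saved.
                scores[cur] = NEG_INF
                pointers[cur] = labels[0]
                continue
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical two-label tagger over three tokens.
labels = ["O", "PER"]
emissions = [
    {"O": math.log(0.6), "PER": math.log(0.4)},
    {"O": math.log(0.7), "PER": math.log(0.3)},
    {"O": math.log(0.9), "PER": math.log(0.1)},
]
transitions = {
    ("O", "O"): math.log(0.8), ("O", "PER"): math.log(0.2),
    ("PER", "O"): math.log(0.5), ("PER", "PER"): math.log(0.5),
}
unconstrained = constrained_viterbi(emissions, transitions, labels, [None, None, None])
constrained = constrained_viterbi(emissions, transitions, labels, [None, "PER", None])
```

With these made-up scores, fixing the middle token to `PER` also changes the label chosen for the first token, which shows how a single rule can propagate through the decoded sequence.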

### Feature Engineering on Event-centric Surrogate Documents to Improve Search Results

Wenhui Liao and Isabelle Moulinier.   Feature Engineering on Event-centric Surrogate Documents to Improve Search Results.  Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), 2009.

We investigate the task of re-ranking search results based on query log information. Prior work has considered this problem as either the task of learning document rankings using features based on user behavior, or as the task of enhancing documents and queries using log data. Our contribution combines both. We distill log information into event-centric surrogate documents (ESDs), and extract features from these ESDs to be used in a learned ranking function. Our experiments on a legal corpus demonstrate that features engineered on surrogate documents lead to improved rankings, in particular when the original ranking is of poor quality.

### A Simple Semi-supervised Algorithm For Named Entity Recognition

Wenhui Liao and Sriharsha Veeramachaneni.   A Simple Semi-supervised Algorithm For Named Entity Recognition.  Semi-supervised Learning for Natural Language Processing, Workshop of North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT), 58-65, 2009.
http://www.aclweb.org/anthology/W/W09/W09-2208.pdf

We present a simple semi-supervised learning algorithm for named entity recognition (NER) using conditional random fields (CRFs). The algorithm is based on exploiting evidence that is independent from the features used for a classifier, which provides high-precision labels to unlabeled data. Such independent evidence is used to automatically extract high-accuracy and non-redundant data, leading to a much improved classifier at the next iteration. We show that our algorithm achieves an average improvement of 12 in recall and 4 in precision compared to the supervised algorithm. We also show that our algorithm achieves high accuracy when the training and test sets are from different domains.

### A metric for automatically evaluating coherent summaries via context chains

Frank Schilder and Ravi Kondadadi.   A metric for automatically evaluating coherent summaries via context chains.  Proceedings of the International Conference on Semantic Computing (ICSC), 2009.
http://www.icsi.berkeley.edu/icsc/

This paper introduces a new metric for automatically evaluating summaries called ContextChain. Based on an in-depth analysis of the TAC 2008 update summarization results, we show that previous automatic metrics such as ROUGE-2 and BE cannot reliably predict strong-performing systems. We introduce two new terms called Correlation Recall and Correlation Precision and discuss how they cast more light on the coverage and the correctness of the respective metric. Our newly proposed metric called ContextChain incorporates findings from Giannakopoulos et al. (2008) and Barzilay and Lapata (2008). We show that our metric correlates with responsiveness scores for the top n systems ($n \ge 15$) that participated in the TAC 2008 update summarization task, whereas ROUGE-2 and BE do not show...

### Thomson Reuters at TAC 2009: Context Chain and Fractional Conditional Compressibility of Models

Frank Schilder, Ravikumar Kondadadi, and Sriharsha Veeramachaneni.   Thomson Reuters at TAC 2009: Context Chain and Fractional Conditional Compressibility of Models.  Proceedings of the Second Text Analysis Conference (TAC 2009), 358-366, 2009.

This paper contains the result for the TAC 2009 main task -- update summarization -- for the FastSum system and a simple baseline system we propose. For the pilot task of Automatically Evaluating Summaries of Peers (AESOP), we present two novel metrics. The first metric called ContextChain is an extension of a recently proposed metric AutoSummENG that is based on comparing n-gram graphs of the model summaries and the automatically generated summaries. Our modification of the generated n-gram graphs is based on co-reference chains extracted from the summaries. The n-gram graph is then generated from the context information of these referents. Our second metric called Fractional Conditional Compressibility of Models (FraCC) is based on the Burrows-Wheeler compression algorithm. For this...

### Learning to Interpret Cognitive States from fMRI Brain Images

Diego Sona, Sriharsha Veeramachaneni, Emanuele Olivetti, and Paolo Avesani  Learning to Interpret Cognitive States from fMRI Brain Images.  Computational Intelligence and Bioengineering, 21-35, 2009.

Over the last few years, functional Magnetic Resonance Imaging (fMRI) has emerged as a new and powerful method to map the cognitive states of a human subject to specific functional areas of the subject's brain. Although fMRI has been widely used to determine average activation in different brain regions, the problem of automatically decoding the cognitive state from instantaneous brain activations has received little attention. In this paper, we study this prediction problem on a complex time-series dataset that relates fMRI data (brain images) with the corresponding cognitive states of the subjects while watching three 20-minute movies. This work describes the process we used to reduce the extremely high-dimensional feature space and a comparison of the models used for prediction. To...

### Surrogate Learning -- From Feature Independence to Semi-Supervised Classification

Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.   Surrogate Learning -- From Feature Independence to Semi-Supervised Classification.  NAACL Workshop on Semi-Supervised Learning, 2009.
http://www.aclweb.org/anthology/W/W09/W09-2202.pdf

We consider the task of learning a classifier from the feature space $X$ to the set of classes $Y = \{0, 1\}$, when the features can be partitioned into class-conditionally independent feature sets $X_1$ and $X_2$. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $X_2$ to $X_1$ (in the sense of estimating the probability $P(x_1|x_2)$) and 2) learning the class-conditional distribution of the feature set $X_1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.
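
The decomposition described in this abstract can be written out in a few lines. The derivation below is a sketch that follows from the stated class-conditional independence alone, not a reproduction of the paper's notation, and it assumes a value $x_1$ for which $P(x_1 \mid y{=}1) \neq P(x_1 \mid y{=}0)$:

$$
\begin{aligned}
P(x_1 \mid x_2) &= \sum_{y \in \{0,1\}} P(x_1 \mid y, x_2)\, P(y \mid x_2) \\
&= P(x_1 \mid y{=}0)\,\bigl(1 - P(y{=}1 \mid x_2)\bigr) + P(x_1 \mid y{=}1)\, P(y{=}1 \mid x_2), \\
P(y{=}1 \mid x_2) &= \frac{P(x_1 \mid x_2) - P(x_1 \mid y{=}0)}{P(x_1 \mid y{=}1) - P(x_1 \mid y{=}0)}.
\end{aligned}
$$

The second line uses $P(x_1 \mid y, x_2) = P(x_1 \mid y)$, which is exactly the independence assumption. The numerator's first term needs only unlabeled samples, while the remaining terms are the class-conditional distribution of $X_1$.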

## 2008

### Exploiting Qualitative Domain Knowledge for Learning Bayesian Network Parameters with Incomplete Data

Wenhui Liao and Qiang Ji.   Exploiting Qualitative Domain Knowledge for Learning Bayesian Network Parameters with Incomplete Data.  Proceedings of the 19th International Conference on Pattern Recognition (ICPR), 2008.

When a large amount of data are missing, or when multiple hidden nodes exist, learning parameters in Bayesian networks (BNs) becomes extremely difficult. This paper presents a learning algorithm to incorporate qualitative domain knowledge to regularize the otherwise ill-posed problem, limit the search space, and avoid local optima. Specifically, the problem is formulated as a constrained optimization problem, where an objective function is defined as a combination of the likelihood function and penalty functions constructed from the qualitative domain knowledge. Then, a gradient-descent procedure is systematically integrated with the E-step and M-step of the EM algorithm, to estimate the parameters iteratively until it converges. The experiments show our algorithm improves the accuracy...

### Professional Credibility: Authority on the Web

Jack G. Conrad, Jochen Leidner, and Frank Schilder.   Professional Credibility: Authority on the Web.  Proceedings of the 2nd Workshop on Information Credibility on the Web (WICOW 2008), 2008.
http://www.dl.kuis.kyoto-u.ac.jp/wicow2/

Opinion mining techniques add another dimension to search and summarization technology by actually identifying the author's opinion about a subject, rather than simply identifying the subject itself. Given the dramatic explosion of the blogosphere, both in terms of its data and its participants, it is becoming increasingly important to be able to measure the authority of these participants, especially when professional application areas are involved. After having performed preliminary investigations into sentiment analysis in the legal blogosphere, we are beginning a new direction of work which addresses representing, measuring, and monitoring the degree of authority and thus presumed credibility associated with various types of blog participants. In particular, we explore the...

### Linking, mapping, and clustering entity records in information-based solutions for business and professional customers

Jack G. Conrad, Tonya Custis, Christopher Dozier, Terry Heinze, Marc Light, and Sriharsha Veeramachaneni.   Linking, mapping, and clustering entity records in information-based solutions for business and professional customers.  Proceedings of the LREC workshop Resources & Evaluation for Identity Matching, Entity Resolution & Entity Management, 2008.

This is a position paper that describes a number of use cases and their corresponding evaluation metrics. We discuss three types of resolution problems: linking entity mentions in text to records in a database, mapping records in one database to those in another database, and clustering records in a single database. The use cases arose at the Thomson Corporation and the systems developed support a number of products.

### Increasing Maintainability of NLP Evaluation Modules Through Declarative Implementations

Terry Heinze and Marc Light.   Increasing Maintainability of NLP Evaluation Modules Through Declarative Implementations.  Proceedings of the ACL workshop Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 2008.

Computing precision and recall metrics for named entity tagging and resolution involves classifying text spans as true positives, false positives, or false negatives. There are many factors that make this classification complicated for real world systems. We describe an evaluation system that attempts to control this complexity through a set of rules and a forward chaining inference engine.
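
For orientation, the simplest version of this classification is exact matching, sketched below. This is a baseline under stated assumptions, not the rule-based system the paper describes: the `Span` tuple layout, the function `score_spans`, and the sample spans are all hypothetical, and a predicted span counts as correct only when its offsets and type both match a gold span.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)

def score_spans(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Classify spans by exact match and derive precision and recall."""
    tp = len(gold & predicted)   # predicted and present in gold
    fp = len(predicted - gold)   # predicted but absent from gold
    fn = len(gold - predicted)   # in gold but never predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

gold = {(0, 10, "PERSON"), (25, 40, "ORG"), (50, 58, "LOC")}
predicted = {(0, 10, "PERSON"), (25, 38, "ORG"), (50, 58, "LOC"), (60, 66, "ORG")}
result = score_spans(gold, predicted)
```

In this sample the `ORG` span with a boundary off by two characters is counted as both a false positive and a false negative. Deciding whether such near-misses deserve partial credit is the kind of complication that motivates a more flexible, rule-driven scorer.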

### A Multi-criteria Convex Quadratic Programming model for Credit Data Analysis

Yi Peng, Gang Kou, Yong Shi, and Zhengxin Chen  A Multi-criteria Convex Quadratic Programming model for Credit Data Analysis.  Decision Support Systems, 44, 1016-1030, 2008.

Speed and scalability are two essential issues in data mining and knowledge discovery. This paper proposed a mathematical programming model that addresses these two issues and applied the model to Credit Classification Problems. The proposed Multi-criteria Convex Quadratic Programming (MCQP) model is highly efficient (computing time complexity $O(n^{1.5})$ to $O(n^{2})$) and scalable to massive problems (size of $O(10^9)$) because it only needs to solve linear equations to find the global optimal solution. Kernel functions were introduced to the model to solve nonlinear problems. In addition, the theoretical relationship between the proposed MCQP model and SVM was discussed.

### Experiences with UIMA for online information extraction at Thomson Corporation

Terry Heinze, Marc Light, and Frank Schilder.   Experiences with UIMA for online information extraction at Thomson Corporation.  Proceedings of the LREC workshop Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP, 2008.

We have built a pair of information extraction systems using UIMA (Unstructured Information Management Architecture). These systems have very low latency and run on financial news. We outline the implementation of these systems and report on our web service injection process, our type system, and an ANTLR (ANother Tool for Language Recognition) wrapper we implemented. We conclude with a list of UIMA strengths from our perspective and a wish list for future releases.

### Perivascular nitric oxide gradients normalize tumor vasculature

Satoshi Kashiwagi, Kosuke Tsukada, Lei Xu, Junichi Miyazaki, Sergey V. Kozin, James A. Tyrrell, William C. Sessa, Leo E. Gerweck, Rakesh K. Jain, and Dai Fukumura  Perivascular nitric oxide gradients normalize tumor vasculature.  Nature Medicine, 14, 255-257, 2008.
http://dx.doi.org/10.1038/nm1730

Normalization of tumor vasculature is an emerging strategy to improve cytotoxic therapies. Here we show that eliminating nitric oxide (NO) production from tumor cells via neuronal NO synthase silencing or inhibition establishes perivascular gradients of NO in human glioma xenografts in mice and normalizes the tumor vasculature, resulting in improved tumor oxygenation and response to radiation treatment. Creation of perivascular NO gradients may be an effective strategy for normalizing abnormal vasculature.

### Multilingual Information Access: in the Lab and in the Wild

Jochen L. Leidner  Multilingual Information Access: in the Lab and in the Wild.  Invited Talk (2008-10-03), TrebleCLEF Consortium, Zurich University of Applied Sciences, Winterthur, Switzerland, 2008.

It has long been argued that Multilingual Information Access (MLIA) methods such as cross language information retrieval (CLIR) are essential tools in the 21st century, especially crucial to a multilingual Europe, and the Cross Language Evaluation Forum (CLEF) has pioneered a series of evaluations to improve the state of the art. In this presentation, I present the case for MLIA from both an academic and industry perspective, and argue that both have distinct expectations when approaching e.g. CLIR research. I contrast the stages before and after CLEF on the one hand, and between CLEF and an imaginary ideal state on the other. By comparing CLEF with this yet unrealized destination on a virtual MLIA roadmap, I try to provide a constructive criticism of what CLEF has achieved to...

### Efficient Non-myopic Value-of-Information Computation For Influence Diagrams

Wenhui Liao and Qiang Ji  Efficient Non-myopic Value-of-Information Computation For Influence Diagrams.  International journal of approximate reasoning, 2008.
http://dx.doi.org/10.1016/j.ijar.2008.04.003

In an influence diagram (ID), value-of-information (VOI) is defined as the difference between the maximum expected utilities with and without knowing the outcome of an uncertainty variable prior to making a decision. It is widely used as a sensitivity analysis technique to rate the usefulness of various information sources, and to decide whether pieces of evidence are worth acquisition before actually using them. However, due to the exponential time complexity of exactly computing VOI of multiple information sources, decision analysts and expert-system designers focus on the myopic VOI, which assumes observing only one information source, even though several information sources are available. In this paper, we present an approximate algorithm to compute non-myopic VOI efficiently by...
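
The definition in the first sentence can be made concrete with a toy, single-variable case. The sketch below computes the myopic value of perfectly observing one uncertain variable before acting; it does not implement the paper's non-myopic approximation, and the function names, the rain/umbrella scenario, and all probabilities and utilities are invented for illustration.

```python
from typing import Dict, List, Tuple

def expected_utility(action: str, prior: Dict[str, float],
                     utility: Dict[Tuple[str, str], float]) -> float:
    """Expected utility of one action when the state is still unknown."""
    return sum(p * utility[(action, state)] for state, p in prior.items())

def value_of_perfect_information(prior: Dict[str, float],
                                 utility: Dict[Tuple[str, str], float],
                                 actions: List[str]) -> float:
    # Best the decision maker can do when committing before the state is known.
    eu_without = max(expected_utility(a, prior, utility) for a in actions)
    # Best achievable when the state is revealed first: choose the best action per state.
    eu_with = sum(p * max(utility[(a, state)] for a in actions)
                  for state, p in prior.items())
    return eu_with - eu_without

# Hypothetical decision: carry an umbrella or not, with a 30% chance of rain.
prior = {"rain": 0.3, "dry": 0.7}
actions = ["umbrella", "none"]
utility = {
    ("umbrella", "rain"): 70.0, ("umbrella", "dry"): 80.0,
    ("none", "rain"): 0.0, ("none", "dry"): 100.0,
}
voi = value_of_perfect_information(prior, utility, actions)
```

Here the best uninformed action has expected utility 77 and the informed policy has expected utility 91, so the observation is worth about 14 utility units. With several information sources, the corresponding sum ranges over every joint outcome, which is the exponential cost the abstract refers to.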

### Adaptive and Interactive Approaches to Document Analysis

George Nagy and Sriharsha Veeramachaneni (2008).  Adaptive and Interactive Approaches to Document Analysis.  In Simone Marinai and Hiromichi Fujisawa (Eds.), Machine Learning in Document Analysis and Recognition (pp. 221-257). Springer.
http://dx.doi.org/10.1007/978-3-540-76280-5_9

This chapter explores three aspects of learning in document analysis: (1) field classification, (2) interactive recognition, and (3) portable and networked applications. Context in document classification conventionally refers to language context, i.e., deterministic or statistical constraints on the sequence of letters in syllables or words, and on the sequence of words in phrases or sentences. We show how to exploit other types of statistical dependence, specifically the dependence between the shape features of several patterns due to the common source of the patterns within a field or a document. This type of dependence leads to field classification, where the features of some patterns may reveal useful information about the features of other patterns from the same source but not...

### Privacy-Preserving for Medical Data: Application of Data Partition Methods

Yi Peng, Gang Kou, Yong Shi, and Zhengxin Chen (2008).  Privacy-Preserving for Medical Data: Application of Data Partition Methods.  (pp. 331-340). Springer Heidelberg.

Medical data mining has been a popular data mining topic of late. Compared with other data mining applications, medical data mining has some unique characteristics. Since medical records are related to human subjects, privacy protection is taken more seriously than in other data mining tasks. This paper applied two data separation techniques -- vertical and horizontal partition -- to preserve privacy in medical data classification. In the vertical partition approach, each site uses a portion of the attributes to compute its results and the distributed results are assembled at a central trusted party using a majority-vote ensemble method. In the horizontal partition approach, data are distributed among several sites. Each site computes its own data and a central trusted party integrates these...

### FastSum: Fast and accurate query-based multi-document summarization

Frank Schilder and Ravikumar Kondadadi.   FastSum: Fast and accurate query-based multi-document summarization.  Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008.
http://www.ling.ohio-state.edu/acl08/

We present a fast query-based multi-document summarizer called FastSum based solely on word-frequency features of clusters, documents and topics. Summary sentences are ranked by a regression SVM. The summarizer does not use any expensive NLP techniques such as parsing, tagging of names or even part of speech information. Still, the achieved accuracy is comparable to the best systems presented in recent academic competitions (i.e., Document Understanding Conference (DUC)). Because of a detailed feature analysis using Least Angle Regression (LARS), FastSum can rely on a minimal set of features leading to fast processing times: 1250 news documents in 60 seconds.

### Thomson Reuters at TAC 2008: Aggressive Filtering with FastSum for Update and Opinion Summarization

Frank Schilder, Ravikumar Kondadadi, Jochen L. Leidner, and Jack G. Conrad.   Thomson Reuters at TAC 2008: Aggressive Filtering with FastSum for Update and Opinion Summarization.  Proceedings of the First Text Analysis Conference (TAC 2008), 396-405, 2008.

In TAC 2008 we participated in the main task (Update Summarization) as well as the Sentiment Summarization pilot task. We modified the FastSum system (Schilder and Kondadadi, 2008) and added more aggressive filtering in order to adapt the system to update summarization and sentiment summarization. For the Update Summarization task, we show that a classifier that identifies sentences that are similar to typical first sentences of a news article improves the overall linguistic quality of the generated summaries. For the Sentiment Summarization pilot task, we use a simple sentiment classifier based on a gazetteer of positive and negative sentiment words derived from the General Inquirer and other sources to produce opinion-based summaries for a collection of blog posts given a set of...

### Polarity Filtering for Sentiment Summarization

Frank Schilder, Jochen L. Leidner, Jack G. Conrad, and Ravikumar Kondadadi  Polarity Filtering for Sentiment Summarization.  Poster presented at the First Text Analysis Conference (TAC 2008), NIST, Gaithersburg, MD, USA, 2008.

### Cost-Sensitive Learning in Answer Extraction

Michael Wiegand, Jochen L. Leidner, and Dietrich Klakow.   Cost-Sensitive Learning in Answer Extraction.  Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), 2008.
http://www.lrec-conf.org/lrec2008/

One problem of data-driven answer extraction in open-domain factoid question answering is that the class distribution of labeled training data is fairly imbalanced. This imbalance has a deteriorating effect on the performance of resulting classifiers. In this paper, we propose a method to tackle class imbalance by applying some form of cost-sensitive learning which is preferable to sampling. We present a simple but effective way of estimating the misclassification costs on the basis of the class distribution. This approach offers three benefits. Firstly, it maintains the distribution of the classes of the labeled training data. Secondly, this form of meta-learning can be applied to a wide range of common learning algorithms. Thirdly, this approach can be easily implemented with the...

## 2007

### Feature Selection Via Least Squares Support Feature Machine

Jianping Li, Zhenyu Chen, Liwei Wei, Weixuan Xu, and Gang Kou  Feature Selection Via Least Squares Support Feature Machine.  International Journal of Information Technology and Decision Making, 6, 671-686, 2007.

In many applications such as credit risk management, data are represented as high-dimensional feature vectors. This makes feature selection necessary to reduce computational complexity and to improve generalization ability and interpretability. In this paper, we present a novel feature selection method, the "Least Squares Support Feature Machine" (LS-SFM). The proposed method has two advantages compared with the conventional Support Vector Machine (SVM) and LS-SVM. First, the convex combinations of basic kernels are used as the kernel and each basic kernel makes use of a single feature. This transforms the feature selection problem, which cannot be solved in the context of SVM, into an ordinary multiple-parameter learning problem. Second, all parameters are learned by a two-stage iterative...

### Categorical Attribute Transformation Technique for Multiple Criteria Quadratic Programming Classification Model

Yi Peng and Gang Kou.   Categorical Attribute Transformation Technique for Multiple Criteria Quadratic Programming Classification Model.  Proceedings of the Seventh IEEE International Conference on Data Mining - Workshops (ICDMW'07), 237-242, 2007.

Categorical attributes exist in a great variety of business and scientific data sets and are often useful in prediction and classification. However, there are methods, such as Multiple Criteria Quadratic Programming (MCQP), that can only handle numeric inputs. The conventional MCQP algorithms usually convert categorical attributes to binary vectors or simply ignore them. The goal of this paper is to present a probability estimation-based transformation technique for the MCQP model. An empirical study using real-life credit card transaction data demonstrated that the transformation scheme produced higher classification accuracy and a lower error rate.

### Facial Action Unit Recognition by Exploiting Their Dynamic and Semantic Relationships

Yan Tong, Wenhui Liao, and Qiang Ji  Facial Action Unit Recognition by Exploiting Their Dynamic and Semantic Relationships.  IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29, 1683-1699, 2007.
http://ieeexplore.ieee.org/iel5/34/4293197/04293201.pdf?isnum...

A system that could automatically analyze the facial actions in real time has applications in a wide range of different fields. However, developing such a system is always challenging due to the richness, ambiguity, and dynamic nature of facial actions. Although a number of research groups attempt to recognize facial action units (AUs) by improving either the facial feature extraction techniques or the AU classification techniques, these methods often recognize AUs or certain AU combinations individually and statically, ignoring the semantic relationships among AUs and the dynamics of AUs. Hence, these approaches cannot always recognize AUs reliably, robustly, and consistently. In this paper, we propose a novel approach that systematically accounts for the relationships among AUs and...

### Inferring Cognition from fMRI Brain Images

Diego Sona, Sriharsha Veeramachaneni, Emanuele Olivetti, and Paolo Avesani.   Inferring Cognition from fMRI Brain Images.  Proceedings of the International Conference on Artificial Neural Networks, 2007.

Over the last few years, functional Magnetic Resonance Imaging (fMRI) has emerged as a new and powerful method to map the cognitive states of a human subject to specific functional areas of the subject's brain. Although fMRI has been widely used to determine average activation in different brain regions, the problem of automatically decoding the cognitive state from instantaneous brain activations has received little attention. In this paper, we study this prediction problem on a complex time-series dataset that relates fMRI data (brain images) with the corresponding cognitive states of the subjects while watching three 20-minute movies. This work describes the process we used to reduce the extremely high-dimensional feature space and a comparison of the models used for prediction. To...

### Privacy-Preserving Data Mining of Medical Data Using Data Separation-Based Techniques

Gang Kou, Yi Peng, Yong Shi, and Zhengxin Chen.   Privacy-Preserving Data Mining of Medical Data Using Data Separation-Based Techniques.  Data Science Journal, 6, S429-S434, 2007.

Data mining is concerned with the extraction of useful knowledge from various types of data. Medical data mining has been a popular data mining topic of late. Compared with other data mining areas, medical data mining has some unique characteristics. Because medical files are related to human subjects, privacy concerns are taken more seriously than in other data mining tasks. This paper applies data separation-based techniques to preserve privacy in the classification of medical data. We take two approaches to protect privacy: one approach is to vertically partition the medical data and mine these partitioned data at multiple sites; the other approach is to horizontally split data across multiple sites. In the vertical partition approach, each site uses a portion of the attributes to compute...

### Analytical Results on Style-Constrained Bayesian Classification of Pattern Fields

Sriharsha Veeramachaneni and George Nagy.   Analytical Results on Style-Constrained Bayesian Classification of Pattern Fields.  IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.

We formalize the notion of style context, which accounts for the increased accuracy of the field classifiers reported in this journal recently. We argue that style context forms the basis of all order-independent field classification schemes. We distinguish between intraclass style, which underlies most adaptive classifiers, and interclass style, which is a manifestation of interpattern dependence between the features of the patterns of a field. We show how style-constrained classifiers can be optimized either for field error (useful for short fields like zip codes) or for singlet error (for long fields, like business letters). We derive bounds on the reduction of error rate with field length and show that the error rate of the optimal style-constrained field classifier converges...

### E-Discovery Revisited: A Broader Perspective for IR Researchers

Jack G. Conrad.   E-Discovery Revisited: A Broader Perspective for IR Researchers.  Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL 2007), 2007.
http://my.thomson.com/portal/server.pt/gateway/PTARGS_0_49733...

It is a very positive development that NIST's Text REtrieval Conference (TREC) has added a track focusing on the legal (discovery) domain. Its organizers should be acknowledged for their commitment and hard work in establishing preliminary tasks and arranging initial assessments. In order to ensure that the track evolves into a realistic and relevant field of study, future tracks will need to accurately reflect the nature and scope of the actual E-Discovery task, or series of tasks, at hand.

### Fast Tagging of Medical Terms in Legal Text

Christopher Dozier, Ravi Kondadadi, Khalid Al-Kofahi, Mark Chaudhary, and Xi Guo.   Fast Tagging of Medical Terms in Legal Text.  Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL 2007), 2007.
http://my.thomson.com/portal/server.pt/gateway/PTARGS_0_49733...

Medical terms occur across a wide variety of legal, medical, and news corpora. Documents containing these terms are of particular interest to legal professionals operating in such fields as medical malpractice, personal injury, and product liability. This paper describes a novel method of tagging medical terms in legal, medical, and news text that is very fast and also has high recall and precision. To date, most research in medical term spotting has been confined to medical text and has approached the problem by extracting noun phrases from sentences and mapping them to a list of medical concepts via a fuzzy lookup. The medical term tagging described in this paper relies on a fast finite state machine that finds within sentences the longest contiguous sets of words associated with...
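As a rough illustration of longest-match term tagging with a finite-state structure, the sketch below walks a word-level trie over a tokenized sentence and keeps the longest term found at each position. It is not the system described in the paper. The term list, function names, and example sentence are hypothetical.

```python
_END = None  # Sentinel key marking a complete term; tokens are always strings.


def build_trie(terms):
    """Build a word-level trie from a list of (possibly multi-word) terms."""
    root = {}
    for term in terms:
        node = root
        for word in term.lower().split():
            node = node.setdefault(word, {})
        node[_END] = term
    return root


def tag_longest(tokens, trie):
    """Return (start, end, term) spans, keeping the longest match per position."""
    spans = []
    i = 0
    while i < len(tokens):
        node = trie
        best = None
        j = i
        # Follow trie transitions as far as the tokens allow.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if _END in node:
                best = (i, j, node[_END])
        if best:
            spans.append(best)
            i = best[1]  # Continue after the matched term.
        else:
            i += 1
    return spans


# Hypothetical term list and sentence.
terms = ["heart attack", "heart", "congestive heart failure"]
tokens = "He suffered a heart attack after congestive heart failure".split()
spans = tag_longest(tokens, build_trie(terms))
```

Each token is examined a bounded number of times, so tagging time grows roughly linearly with sentence length. Speed is the property the abstract emphasizes.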

### SemEval-2007 Task 15: TempEval Temporal Relation Identification

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky.   SemEval-2007 Task 15: TempEval Temporal Relation Identification.  Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 75--80, 2007.

The TempEval task proposes a simple way to evaluate automatic extraction of temporal relations. It avoids the pitfalls of evaluating a graph of inter-related labels by defining three subtasks that allow pairwise evaluation of temporal relations. The task not only allows straightforward evaluation, it also avoids the complexities of full temporal parsing.

### Epsilon-Support Vector and Large-scale data mining problems

Gang Kou, Yi Peng, Yong Shi, and Zhengxin Chen (2007).  Y. Shi et al. (Eds.), Epsilon-Support Vector and Large-scale data mining problems.  (pp. 874 - 881). Springer-Verlag Berlin Heidelberg.

Data mining and knowledge discovery have made great progress during the last fifteen years. As one of the major tasks of data mining, classification has wide business and scientific applications. Among a variety of proposed methods, mathematical programming based approaches have been proven to be excellent in terms of classification accuracy, robustness, and efficiency. However, there are several difficult issues. Two of these issues are of particular interest to this research. The first issue is that it is challenging to find optimal solutions for large-scale datasets in mathematical programming problems due to the computational complexity. The second issue is that many mathematical programming problems require specialized codes or programs such as CPLEX or LINGO. The objective of this...

### Application of Classification Methods to Health Insurance Fraud Detection

Yi Peng, Gang Kou, A. Sabatka, J. Matza, Zhengxin Chen, D. Khazanchi, and Yong Shi (2007).  Y. Shi et al. (Eds.), Application of Classification Methods to Health Insurance Fraud Detection.  (pp. 852--858). Springer-Verlag Berlin Heidelberg.

As the number of electronic insurance claims increases each year, it is difficult to detect insurance fraud in a timely manner by manual methods alone. The objective of this study is to use classification modeling techniques to identify suspicious policies to assist manual inspections. The predictive models can label high-risk policies and help investigators to focus on suspicious records and accelerate the claim-handling process. The study uses health insurance data with some known suspicious and normal policies. These known policies are used to train the predictive models. Missing values and irrelevant variables are removed before building predictive models. Three predictive models: Naïve Bayes (NB), decision tree, and Multiple Criteria Linear Programming (MCLP), are trained...

### Self-healing systems --- survey and synthesis

Debanjan Ghosh, Raj Sharman, H. Raghav Rao, and Shambhu Upadhyaya.   Self-healing systems --- survey and synthesis.  Decision Support Systems, 42, 2164--2185, 2007.
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V8...

### Interaction for style-constrained OCR

Sriharsha Veeramachaneni and George Nagy.   Interaction for style-constrained OCR.  Proceedings of SPIE -- Volume 6500, Document Recognition and Retrieval XIV, 2007.

The error rate can be considerably reduced on a style-consistent document if its style is identified and the right style-specific classifier is used. Since in some applications both machines and humans have difficulty in identifying the style, we propose a strategy to improve the accuracy of style-constrained classification by enlisting the human operator to identify the labels of some characters selected by the machine. We present an algorithm to select the set of characters that is likely to reduce the error rate on unlabeled characters by utilizing the labels to reclassify the remaining characters. We demonstrate the efficacy of our algorithm on simulated data.

### Essential Deduplication Functions for Transactional Databases in Law Firms

Jack G. Conrad and Edward L. Raymond, Jr.   Essential Deduplication Functions for Transactional Databases in Law Firms.  Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL 2007), 9 pgs, 2007.

As massive document repositories and knowledge management systems continue to expand, in proprietary environments as well as on the Web, the need for duplicate detection becomes increasingly important. In business enterprises such as law firms, effective retrieval applications depend upon such functionality. Today's Internet-savvy users are not interested in search results containing numerous sets of duplicate documents, whether exact duplicates or near variants. This report addresses our work in the domain of legal information retrieval, working with a large, transactional knowledge management system. We specifically explore the occurrence and treatment of identical, near-identical, and fuzzy duplicate sub-documents ('clauses') in a contracts database. To our knowledge, we are the...

### Opinion Mining in Legal Blogs

Jack G. Conrad and Frank Schilder.   Opinion Mining in Legal Blogs.  Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL 2007), 5 pgs, 2007.

We perform a survey into the scope and utility of opinion mining in legal Weblogs (a.k.a. blawgs). The number of 'blogs' in the legal domain is growing at a rapid pace and many potential applications for opinion detection and monitoring are arising as a result. We summarize current approaches to opinion mining before describing different categories of blawgs and their potential impact on the law and the legal profession. In addition to educating the community on recent developments in the legal blog space, we also conduct some introductory opinion mining trials. We first construct a Weblog test collection containing blog entries that discuss legal search tools. We subsequently examine the performance of a language modeling approach deployed for both subjectivity analysis (i.e., is the...

### A New Approach for Evaluating Query Expansion: Query-Document Term Mismatch

Tonya Custis and Khalid Al-Kofahi.   A New Approach for Evaluating Query Expansion: Query-Document Term Mismatch.  Proceedings of the 30th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR-07), 575--582, 2007.
http://portal.acm.org/citation.cfm?id=1277840

The effectiveness of information retrieval (IR) systems is influenced by the degree of term overlap between user queries and relevant documents. Query-document term mismatch, whether partial or total, is a fact that must be dealt with by IR systems. Query Expansion (QE) is one method for dealing with term mismatch. IR systems implementing query expansion are typically evaluated by executing each query twice, with and without query expansion, and then comparing the two result sets. While this measures an overall change in performance, it does not directly measure the effectiveness of IR systems in overcoming the inherent issue of term mismatch between the query and relevant documents, nor does it provide any insight into how such systems would behave in the presence of query-document...

### Active Learning of Feature Relevance

Emanuele Olivetti, Sriharsha Veeramachaneni, and Paolo Avesani.   Active Learning of Feature Relevance.  Computational Methods of Feature Selection, 2007.

This chapter deals with active feature value acquisition for feature relevance estimation in domains where feature values are expensive to measure.

### E-Business Intelligence Via MCMP-Based Data Mining Methods

Yi Peng, Yong Shi, Xingsen Li, Zhengxin Chen, and Gang Kou (2007).  N. Zhong et al. (Eds.), E-Business Intelligence Via MCMP-Based Data Mining Methods.  (pp. 443-453). Springer-Verlag Berlin Heidelberg.

Organizations gain competitive advantages and benefits through e-Business Intelligence (e-BI) technologies at all levels of business operations. E-BI gathers, processes, and analyzes large volumes of relevant data to help enterprises make better decisions. Data mining, which utilizes methods and tools from various fields to extract useful knowledge from large amounts of data, provides significant support to e-BI applications. This paper gives an overview of a data mining approach: Multiple Criteria Mathematical Programming (MCMP); describes a real-life application using MCMP; and explains how business users at different levels can benefit from the results of MCMP.

### Event Extraction and Temporal Reasoning in Legal Documents

Frank Schilder (2007).  In Annotating, Extracting and Reasoning about Time and Events, Frank Schilder and Graham Katz and James Pustejovsky (Eds.), Event Extraction and Temporal Reasoning in Legal Documents.  Springer Verlag.

This paper presents a prototype system that extracts events from the United States Code on U.S. immigration and nationality and links these events to temporal constraints, such as in "entered the United States before December 31, 2005". In addition, the paper provides an overview of what kinds of other temporal information can be found in different types of legal documents. In particular, it discusses how one could do further reasoning with the extracted temporal information for case law and statutes.

### Annotating, Extracting and Reasoning about Time and Events: An Overview of Recent Approaches

Frank Schilder, Graham Katz, and James Pustejovsky (2007).  In Annotating, Extracting and Reasoning about Time and Events, Frank Schilder and Graham Katz and James Pustejovsky (Eds.), Annotating, Extracting and Reasoning about Time and Events: An Overview of Recent Approaches.  Springer Verlag.

The main focus of the Dagstuhl seminar 05151 was on TimeML-based temporal annotation and reasoning. We were concerned with three main points: how effectively one can use the TimeML language for consistent annotation, how useful such annotation is for further processing, and what modifications should be applied to the standard to make it more useful for applications such as question-answering and information retrieval.

## 2006

### Mining Legal Text to Create a Litigation History Database

Mark Chaudhary, Christopher Dozier, Gregory Atkinson, Gary Berosik, Xi Guo, and Steve Samler.   Mining Legal Text to Create a Litigation History Database.  Proceedings of IASTED International Conference on Law and Technology (Lawtech 2006), 2006.

### Managing Déjà Vu: Collection Building for Identifying Nonidentical Duplicate Documents

Jack G. Conrad and Cindy P. Schriber.   Managing Déjà Vu: Collection Building for Identifying Nonidentical Duplicate Documents.  Journal of the American Society for Information Science and Technology (JASIST), 57, 921-932, 2006.

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near...

### Encyclopedia of Language & Linguistics

Peter Jackson and Frank Schilder (2006).  Natural Language Processing: Overview. Keith Brown (Ed.), Encyclopedia of Language & Linguistics.  (pp. 503-517). Oxford: Elsevier.

### What is the Future of Multi-lingual Information Access?

Isabelle Moulinier and Frank Schilder.   What is the Future of Multi-lingual Information Access?.  Working Notes of the New Directions in Multilingual Information Access Workshop at SIGIR 2006, 2006.

### Word and tree-based similarities for textual entailment

Frank Schilder and Bridget Thomson McInnes.   Word and tree-based similarities for textual entailment.  Proceedings of the second PASCAL workshop on Recognising Textual Entailment, 140-145, 2006.

### TLR at DUC 2006: Approximate tree similarity and a new evaluation regime

Frank Schilder and Bridget Thomson McInnes.   TLR at DUC 2006: Approximate tree similarity and a new evaluation regime.  Proceedings of the Document Understanding Conference (DUC-2006), 2006.

### Evaluating a summarizer for legal text with a large text collection

Frank Schilder and Hugo Molina-Salgado.   Evaluating a summarizer for legal text with a large text collection.  3rd Midwestern Computational Linguistics Colloquium (MCLC), 2006.

## 2005

### Thomson Legal and Regulatory Experiments at CLEF-2005

Isabelle Moulinier and Ken Williams.   Thomson Legal and Regulatory Experiments at CLEF-2005.  Working Notes of the Cross-Language Evaluation Forum (CLEF) Workshop 2005, 2005.
http://www.clef-campaign.org/2005/working%5Fnotes/

### Mining Text for Expert Witnesses

Christopher Dozier and Peter Jackson.   Mining Text for Expert Witnesses.  IEEE Software, 94--100, 2005.

Text mining is a relatively new research area associated with the creation of novel information resources from electronic text repositories. An expert-witness database based on text from legal, medical, and news documents demonstrates the successful application of text-mining techniques.

### Temporal information extraction from legal documents

Frank Schilder and Andrew McCulloh.   Temporal information extraction from legal documents.  Proceedings of Dagstuhl Seminar on Annotating, Extracting and Reasoning about Time and Events, 9, 2005.

### Artificial intelligence and information retrieval

Peter Jackson.   Artificial intelligence and information retrieval.  Searcher, 13, 29--33, 2005.

### Effective Document Clustering for Large Heterogeneous Law Firm Collections

Jack G. Conrad, Khalid Al-Kofahi, Ying Zhao, and George Karypis.   Effective Document Clustering for Large Heterogeneous Law Firm Collections.  Proceedings of the 10th International Conference on Artificial Intelligence and Law, 177--187, 2005.

Computational resources for research in legal environments have historically implied remote access to large databases of legal documents such as case law, statutes, law reviews and administrative materials. Today, by contrast, there exists enormous growth in lawyers' electronic work product within these environments, specifically within law firms. Along with this growth has come the need for accelerated knowledge management---automated assistance in organizing, analyzing, retrieving and presenting this content in a useful and distributed manner. In cases where a relevant legal taxonomy is available, together with representative labeled data, automated text classification tools can be applied. In the absence of these resources, document clustering offers an alternative approach to...

### Report on Thomson Legal and Regulatory Experiments at CLEF-2004

Isabelle Moulinier and Ken Williams.   Report on Thomson Legal and Regulatory Experiments at CLEF-2004.  Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, Bath, UK, September 15-17, 2004, Revised Selected Papers, 3491, 2005.
http://www.springeronline.com/sgw/cda/frontpage/0,11855,4-102...

### Thomson Legal and Regulatory at NTCIR-5: Japanese and Korean Experiments

Isabelle Moulinier and Ken Williams.   Thomson Legal and Regulatory at NTCIR-5: Japanese and Korean Experiments.  Proceedings of the Fifth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, 2005.
http://www.mt-archive.info/NTCIR-2005-Moulinier.pdf

Thomson Legal and Regulatory participated in the CLIR task of the NTCIR-5 workshop. We submitted formal runs for monolingual retrieval in Japanese and Korean, as well as for bilingual English-to-Japanese retrieval. We employed enhanced tokenization for our Japanese and Korean runs and applied a novel selective pseudo-relevance feedback scheme for Japanese. Our bilingual search participation was a straightforward application of an off-the-shelf Machine Translation system to transform an English query into a Japanese query. Unfortunately we cannot draw many conclusions from our participation, as our experiments were hampered by technical difficulties, particularly with our tokenization and stemming components.

### Temporal anaphoric expressions in German news messages

Frank Schilder.   Temporal anaphoric expressions in German news messages.  Time and Event Recognition in Natural Language, 2005.

### The Language of Time: A Reader

Frank Schilder and Christopher Habel (2005).  From temporal expressions to temporal information: Semantic tagging of news messages. Inderjeet Mani and James Pustejovsky and Robert Gaizauskas (Eds.), The Language of Time: A Reader.  (pp. 533--544). Oxford University Press.

### TLR at DUC: Tree Similarity

Frank Schilder, Andrew McCulloh, Bridget Thomson McInnes, and Alex Zhou.   TLR at DUC: Tree Similarity.  Proceedings of the Document Understanding Conference (DUC) 2005, 8, 2005.

## 2004

### Constructing a Text Corpus for Inexact Duplicate Detection

Jack G. Conrad and Cindy P. Schriber.   Constructing a Text Corpus for Inexact Duplicate Detection.  Proceedings of the 27th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR-04), 582--583, 2004.

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.

### Cross Document Co-Reference Resolution Applications for People in the Legal Domain

Christopher Dozier and Thomas Zielund.   Cross Document Co-Reference Resolution Applications for People in the Legal Domain.  Proceedings of the ACL 2004 Workshop on Reference Resolution and its Applications, 9--16, 2004.

By combining information extraction and record linkage techniques, we have created a repository of references to attorneys, judges, and expert witnesses across a broad range of text sources. These text sources include news, caselaw, law reviews, Medline abstracts, and legal briefs among others. We briefly describe our cross document co-reference resolution algorithm and discuss the applications these resolved references enable. Among these applications is one that shows summaries of relationship chains between individuals based on their document co-occurrence and cross document co-references.

### A Web Application Using RDF/RDFS for Metadata Navigation

Xi S. Guo, Mark Chaudhary, Christopher Dozier, Yohendran Arumainayagam, and Venkatesan Subramanian.   A Web Application Using RDF/RDFS for Metadata Navigation.  Proceedings of the 4th Workshop on NLP and XML (NLPXML-04), 17--23, 2004.
http://my.thomson.com/portal/server.pt/gateway/PTARGS_0_24991...

This paper describes using RDF/RDFS/XML to create and navigate a metadata model of relationships among entities in text. The metadata we create is roughly an order of magnitude smaller than the content being modeled; it provides the end user with context-sensitive information about the hyperlinked entities in focus. These entities at the core of the model are originally found and resolved using a combination of information extraction and record linkage techniques. The RDF/RDFS metadata model is then used to "look ahead" and navigate to related information. An RDF-aware front-end web application streamlines the presentation of information to the end user.

### Thomson Legal and Regulatory at NTCIR-4: Monolingual and Pivot-language Retrieval Experiments

Isabelle Moulinier.   Thomson Legal and Regulatory at NTCIR-4: Monolingual and Pivot-language Retrieval Experiments.  Proceedings of the Fourth NTCIR Workshop, 1--8, 2004.

Thomson Legal and Regulatory participated in the CLIR task of the NTCIR-4 workshop. We submitted formal runs for monolingual retrieval in Japanese, Chinese and Korean. Our bilingual runs from Chinese and Korean to Japanese rely on English as a pivot language. During our monolingual experiments, we compared building stopword lists using query logs to building stopword lists from collection statistics with further manual editing. We investigated decompounding for Korean, more precisely partial credit of compound parts. Finally we incorporated pseudo-relevance feedback in our Japanese runs. Our bilingual approach was an experiment to construct a system within a short timeframe using publicly available resources. The low quality of retrieval suggests that such an approach is not viable...

### Extracting Spatial Information: Grounding, Classifying and Linking Spatial Expressions

Frank Schilder, Yannick Versley, and Christopher Habel.   Extracting Spatial Information: Grounding, Classifying and Linking Spatial Expressions.  Extended Abstract, 3, 2004.

This paper is concerned with tagging spatial expressions in German newspaper articles: assigning a meaning to each expression, classifying its usage, and linking the derived referent to an event description. In our system, we implemented the activation of concepts in a very simple fashion: a concept is activated once (with a cost depending on the item that activated it) and is left activated thereafter. As an example, a city also activates the nodes for the region and the country it is part of, so that cities from one country are chosen over cities from different countries. Several disambiguation strategies were tested on a corpus of 12 German newspaper articles. Disambiguation was carried out via a beam search to find an approximately...

### Learning Transformation Rules for Semantic Role Labeling

Ken Williams, Christopher Dozier, and Andrew McCulloh.   Learning Transformation Rules for Semantic Role Labeling.  Proceedings of the 8th Conference on Computational Natural Language Learning (CoNLL-2004), 134--137, 2004.

This paper presents our work on Semantic Role Labeling using a Transformation-Based Error-Driven approach in the style of Eric Brill (Brill, 1995). Our approach achieved an overall F1 score of 43.48 on non-verb annotations. We believe our approach is noteworthy because of its novelty in this area and because it produces short lists of human-understandable transformation rules as its output.

## 2003

### Automatic categorization of questions for a mathematics education service

Ken Williams, Rafael A. Calvo, and David Bell.   Automatic categorization of questions for a mathematics education service.  Proceedings of the 11th International Conference on Artificial Intelligence in Education, 2003.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.3493

This paper describes a new approach to managing a stream of questions about mathematics by integrating a text categorization framework into a relational database management system. The corpus studied is based on unstructured submissions to an ask-an-expert service in learning mathematics. The classification system has been tested using a Naïve Bayes learner built into the framework. The performance results of the classifier are also discussed. The framework was integrated into a PostgreSQL database through the use of procedural trigger functions.
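For readers unfamiliar with Naïve Bayes text categorization, the following is a minimal, self-contained sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It is not the framework or the learner used in the paper, and it does not cover the PostgreSQL trigger integration. The category names and example questions are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect per-category document and word counts from (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in examples:
        words = text.lower().split()
        class_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log posterior for the given text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training questions.
questions = [
    ("how do i factor this polynomial", "algebra"),
    ("solve the equation for x", "algebra"),
    ("find the area of a triangle", "geometry"),
    ("what is the angle of this triangle", "geometry"),
]
model = train_naive_bayes(questions)
predicted = classify("area of a right triangle", model)
```

In a database-integrated setting such as the one the abstract describes, a function like `classify` would presumably be invoked when a new question row is inserted. That wiring is outside the scope of this sketch.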

### Validation: A Critical First Step in the Evaluation of Systems for Legal Corpus Determination

Jack G. Conrad and Joanne S. Claussen.   Validation: A Critical First Step in the Evaluation of Systems for Legal Corpus Determination.  Proceedings of Workshop on Evaluation of Legal Reasoning and Problem-Solving Systems (ICAIL-03), 1--9, 2003.

The continued growth of very large data environments, both proprietary and Web-based, increases the importance of effective and efficient legal corpus selection and searching. Current "database selection" research focuses largely on completely autonomous and automatic selection, searching, and results merging in distributed environments. This fully automatic approach has significant deficiencies, including reliance upon thresholds below which data sets with relevant documents are not searched (compromised recall). It also merges result sets, often from disparate data sources, some that users may have discarded before their source selection task completed (diluted precision). We examine the impact that user interaction can have on the process of legal corpus selection. After analyzing...

### Client-System Collaboration for Legal Corpus Selection in an Online Production Environment

Jack G. Conrad and Joanne S. Claussen.   Client-System Collaboration for Legal Corpus Selection in an Online Production Environment.  Proceedings of the 9th International Conference on Artificial Intelligence and Law (ICAIL-03), 262--273, 2003.

The continued growth of very large data environments such as Westlaw and Dialog, in addition to the World Wide Web, increases the importance of effective and efficient database selection and searching. Current research focuses largely on completely autonomous and automatic selection, searching, and results merging in distributed environments. This fully automatic approach has significant deficiencies, including reliance upon thresholds below which databases with relevant documents are not searched (compromised recall). It also merges result sets, often from disparate data sources that users may have discarded before their source selection task proceeded (diluted precision). We examine the impact that early user interaction can have on the process of database selection. After analyzing...

### Early User-System Interaction for Database Selection in Massive Domain-specific Online Environments

Jack G. Conrad and Joanne S. Claussen.   Early User-System Interaction for Database Selection in Massive Domain-specific Online Environments.  ACM Transactions on Information Systems (TOIS), 21, 94--131, 2003.

The continued growth of very large data environments such as Westlaw and Dialog, in addition to the World Wide Web, increases the importance of effective and efficient database selection and searching. Current research focuses largely on completely autonomous and automatic selection, searching, and results merging in distributed environments. This fully automatic approach has significant deficiencies, including reliance upon thresholds below which databases with relevant documents are not searched (compromised recall). It also merges documents, often from disparate data sources that users may have discarded before their source selection task proceeded (diluted precision). We examine the impact that early user interaction can have on the process of database selection. After analyzing...

### Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment

Jack G. Conrad, Xi S. Guo, and Cindy P. Schriber.   Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment.  Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM-03), 243--252, 2003.

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a "fingerprint" of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the...

### Creation of an Expert Witness Database through Text Mining

Christopher Dozier, Peter Jackson, Xi S. Guo, Mark Chaudhary, and Yohendran Arumainayagam.   Creation of an Expert Witness Database through Text Mining.  Proceedings of the 9th International Conference on Artificial Intelligence and Law (ICAIL-2003), 177--184, 2003.

This paper describes how an online directory of expert witnesses was created from jury verdict and settlement documents using text mining techniques. We have created an expert witness directory that contains over 100,000 expert profiles, based on approximately 300,000 jury verdict and settlement documents, publicly available professional license information, an expertise taxonomy, and automatic text mining techniques. This directory can be browsed by area of expertise as well as by location and name. In addition, expert profiles are automatically linked to Medline articles and jury verdict and settlement documents. The supporting technologies that made this application possible include information extraction from text via regular expression parsing, record linkage through Bayesian...

### Combining Record Linkage and Information Extraction to Mine Text

Christopher Dozier, Peter Jackson, Isabelle Moulinier, Xi S. Guo, Mark Chaudhary, and Yohendran Arumainayagam.   Combining Record Linkage and Information Extraction to Mine Text.  Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation (KDD-03), 43, 2003.

We have created an expert witness directory that contains over 100,000 expert profiles, based on approximately 300,000 jury verdict and settlement documents, publicly available professional license information, an expertise taxonomy, and automatic text mining techniques. This directory can be browsed by area of expertise as well as by location and name. In addition, expert profiles are automatically linked to Medline articles and jury verdict and settlement documents. The supporting technologies that made this application possible include information extraction from text via cascaded finite state transducers, record linkage through Bayesian based matching, and automatic rule-based classification. To the best of our knowledge, this is the largest expert witness directory of its kind and...

### Information Extraction from Case Law and Retrieval of Prior Cases

Peter Jackson, Khalid Al-Kofahi, Alex Tyrrell, and Arun Vachher.   Information Extraction from Case Law and Retrieval of Prior Cases.  Artificial Intelligence, 150, 239--290, 2003.

We describe an information extraction and retrieval system, called History Assistant, which extracts rulings from court opinions and retrieves relevant prior cases from a citator database. The technology employed is similar to that adopted in the Message Understanding Conferences, but attempts a fuller parse in order to distinguish current rulings from previous rulings reported in a case. In addition, we employ a combination of information retrieval and machine learning techniques to link each new case to related documents that it may impact. We present experimental results, in terms of precision and recall, for all tasks performed by the extraction and linking programs. Part of the finished system has been deemed worthy of further development into a computer-assisted database update...

### A framework for text categorization

Ken Williams (2003).  A framework for text categorization.  Master's thesis, University of Sydney.

The field of automatic Text Categorization (TC) concerns the creation of categorizer functions, usually involving Machine Learning techniques, to assign labels from a pre-defined set of categories to documents based on the documents' content. Because of the many variations on how this can be achieved and the diversity of applications in which it can be employed, creating specific TC applications is often a difficult task. This thesis concerns the design, implementation, and testing of an Object-Oriented Application Framework for Text Categorization. By encoding expertise in the architecture of the framework, many of the barriers to creating TC applications are eliminated. Developers can focus on the domain-specific aspects of their applications, leaving the generic aspects of...

## 2002

### Automatic categorization of announcements on the Australian stock exchange

Rafael A. Calvo and Ken Williams.   Automatic categorization of announcements on the Australian stock exchange.  Proceedings of the 7th Australasian Document Computing Symposium, 2002.

This paper compares the performance of several machine learning algorithms for the automatic categorization of corporate announcements in the Australian Stock Exchange (ASX) Signal G data stream. The article also describes some of the applications that the categorization of corporate announcements may enable. We have performed tests on two categorization tasks: market sensitivity, which indicates whether an announcement will have an impact on the market, and report type, which classifies each announcement into one of the report categories defined by the ASX. We have tried Neural Networks, a Naive Bayes classifier, and Support Vector Machines and achieved good results.

### A framework for text categorization

Ken Williams and Rafael A. Calvo.   A framework for text categorization.  Proceedings of the 7th Australasian Document Computing Symposium, 2002.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.4369

In this paper we discuss the architecture of an object-oriented application framework (OOAF) for text categorization. We describe the system requirements and the software engineering strategies that form the basis of the design and implementation of the framework. We show how designing a highly reusable OOAF architecture facilitates the development of new applications. We also highlight the key text categorization features of the framework, as well as practical considerations for application developers.

### Embedding Perl in HTML with Mason

Dave Rolsky and Ken Williams (2002). Embedding Perl in HTML with Mason. O'Reilly and Associates.
http://masonbook.com

Although using Mason isn't difficult, creating a Mason-based site can be tricky. Embedding Perl in HTML with Mason shows you how to create large, complex, dynamically driven web sites that look good and are a snap to maintain. This concise book covers Mason's features from several angles, and includes a study of the authors' sample site where these features are used. You'll learn how to visualize multiple Mason-based solutions to any given problem and select among them. The book covers the latest line of Mason development, 1.1x, which has many new features, including line number reporting based on source files, sub-requests, and easier use as a CGI. The only book to cover this important tool, Embedding Perl in HTML with Mason is essential reading for any Perl programmer who wants to...

### Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment

Jack G. Conrad, Xi S. Guo, Peter Jackson, and Monem Meziou.   Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment.  Proceedings of the 28th International Conference on Very Large Databases (VLDB-02), 71--82, 2002.

The continued growth of very large data environments such as Westlaw, Dialog, and the World Wide Web, increases the importance of effective and efficient database selection and searching. Recent research has focused on autonomous and automatic collection selection, searching, and results merging in distributed environments. These studies often rely on TREC data and queries for experimentation. We have extended this work to West's online production environment where thousands of legal, financial and news databases are accessed by up to a quarter-million professional users each day. Using the WIN natural language search engine, a cousin to UMass's INQUERY, along with a collection retrieval inference network (CORI) to provide database scoring, we examine the effect that a set of optimized...

### Effective Collection Metasearch in a Hierarchical Environment: Global vs. Localized Retrieval Performance

Jack G. Conrad, Changwen Yang, and Joanne S. Claussen.   Effective Collection Metasearch in a Hierarchical Environment: Global vs. Localized Retrieval Performance.  Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR-02), 371--372, 2002.

We compare standard global IR searching with user-centric localized techniques to address the *database selection problem*. We conduct a series of experiments to compare the retrieval effectiveness of three separate search modes applied to a hierarchically structured data environment of textual database representations. The data environment is represented as a tree-like directory containing over 15,000 unique databases and over 100,000 total leaf nodes. Our search modes consist of varying degrees of browse and search, from a global search at the root node to a refined search at a subnode using dynamically-calculated inverse document frequencies (idfs) to score candidate databases for probable relevance. Our findings indicate that a browse and search approach that relies upon...

### Natural Language Processing for Online Applications: Text Retrieval, Extraction & Categorization

Peter Jackson and Isabelle Moulinier (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction & Categorization. Amsterdam: John Benjamins.

### Thomson Legal and Regulatory Experiments for CLEF 2002

Isabelle Moulinier and Hugo Molina-Salgado.   Thomson Legal and Regulatory Experiments for CLEF 2002.  Proceedings of the CLEF 2002 Conference, 1--6, 2002.

Thomson Legal and Regulatory participated in the monolingual, the bilingual and the multilingual tracks. Our monolingual runs added Swedish to the languages we had submitted in previous participations. Our bilingual and multilingual efforts used English as the query language. We experimented with dictionaries and similarity thesauri for the bilingual task, while we used machine translations in our multi-lingual runs. Our various merging strategies had limited success compared to a simple round robin.

### Thomson Legal and Regulatory at NTCIR-3: Japanese, Chinese and English Retrieval Experiments

Isabelle Moulinier, Hugo Molina-Salgado, and Peter Jackson.   Thomson Legal and Regulatory at NTCIR-3: Japanese, Chinese and English Retrieval Experiments.  Proceedings of the Third NTCIR Workshop, Part II, 59--64, 2002.

Thomson Legal and Regulatory participated in the CLIR task of the NTCIR-3 workshop. We submitted formal runs for monolingual retrieval in Japanese and Chinese, and for bilingual retrieval from English to Japanese. Our main focus was on Japanese retrieval. We compared word-based and character-based indexing, as well as query formulation using characters and character bigrams. Our results show that word-based and bigram-based retrieval show similar performance for most query formulation approaches, while they outperform character-based retrieval. For Chinese retrieval, we compared using single characters with using character bigrams. We also introduced a structured query to leverage both. Our results are consistent with previous work, where character bigrams were shown to have better...

## 2001

### A Machine Learning Approach to Prior Case Retrieval

Khalid Al-Kofahi, Alex Tyrrell, Arun Vachher, Tim Travers, and Peter Jackson.   A Machine Learning Approach to Prior Case Retrieval.  Proceedings of the 8th International Conference on Artificial Intelligence and Law (ICAIL-01), 88--93, 2001.

We describe a system that processes court opinions and retrieves related cases from a citator database, so that new cases can be linked to earlier ones that they impact. The design of the system combines information extraction, information retrieval and machine learning techniques in a novel way. The fully implemented program is capable of performing prior case retrieval at human levels of recall and acceptable levels of precision.

### Combining Multiple Classifiers for Text Categorization

Khalid Al-Kofahi, Alex Tyrrell, Arun Vachher, Tim Travers, and Peter Jackson.   Combining Multiple Classifiers for Text Categorization.  Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM-01), 97--104, 2001.

A major problem facing online information services is how to index and supplement large document collections with respect to a rich set of categories. We focus upon the routing of case law summaries to various secondary law volumes in which they should be cited. Given the large number (> 13,000) of closely related categories, this is a challenging task that is unlikely to succumb to a single algorithmic solution. Our fully implemented and recently deployed system shows that a superior classification engine for this task can be constructed from a combination of classifiers. The multi-classifier approach helps us leverage all the relevant textual features and meta data, and appears to generalize to related classification tasks.

### Automatic Recognition of Distinguishing Negative Indirect History Language in Judicial Opinions

Jack G. Conrad and Daniel P. Dabney.   Automatic Recognition of Distinguishing Negative Indirect History Language in Judicial Opinions.  Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM-01), 287--294, 2001.

### A Cognitive Approach to Judicial Opinion Structure: Applying Domain Expertise to Component Analysis

Jack G. Conrad and Daniel P. Dabney.   A Cognitive Approach to Judicial Opinion Structure: Applying Domain Expertise to Component Analysis.  Proceedings of the 8th International Conference on Artificial Intelligence and Law (ICAIL-01), 1--11, 2001.

Empirical research on basic *components* of American judicial opinions has only scratched the surface. Lack of a coordinated pool of legal experts or adequate computational resources are but two reasons responsible for this deficiency. We have undertaken a study to uncover fundamental components of judicial opinions found in American case law. The study was aided by a team of twelve expert attorney-editors with a combined total of 135 years of legal editing experience. The scientific hypothesis underlying the experiment was that after years of working closely with thousands of judicial opinions, expert attorneys would develop a refined and internalized schema of the content and structure of legal cases. In this study participants were permitted to describe both concept-related...

### Assigning Belief Scores to Names in Queries

Christopher Dozier.   Assigning Belief Scores to Names in Queries.  Proceedings of the Human Language Technology Conference (HLT-01), 35--39, 2001.

Assuming that the goal of a person name query is to find references to a particular person, we argue that one can derive better relevance scores using probabilities derived from a language model of personal names than one can using corpus based occurrence frequencies such as inverse document frequency (idf). We present here a method of calculating person name match probability using a language model derived from a directory of legal professionals. We compare how well name match probability and idf predict search precision of word proximity queries derived from names of legal professionals and major league baseball players. Our results show that name match probability is a better predictor of relevance than idf. We also indicate how rare names with high match probability can be used as...

### Empirical Methods for Exploiting Parallel Texts

I. Dan Melamed (2001). Empirical Methods for Exploiting Parallel Texts. Cambridge, MA: MIT Press.

Parallel texts (bitexts) are a goldmine of linguistic knowledge, because the translation of a text into another language can be viewed as a detailed annotation of what that text means. Knowledge about translational equivalence, which can be gleaned from bitexts, is of central importance for applications such as manual and machine translation, cross-language information retrieval, and corpus linguistics. The availability of bitexts has increased dramatically since the advent of the Web, making their study an exciting new area of research in natural language processing. This book lays out the theory and the practical techniques for discovering and applying translational equivalence at the lexical level. It is a start-to-finish guide to designing and evaluating many translingual applications.

### Thomson Legal and Regulatory at CLEF 2001: Monolingual and Bilingual Experiments

Hugo Molina-Salgado, Isabelle Moulinier, Mark Knutson, Elizabeth Lund, and Kirat Sekhon.   Thomson Legal and Regulatory at CLEF 2001: Monolingual and Bilingual Experiments.  Proceedings of the CLEF 2001 Conference, 1--6, 2001.

Thomson Legal and Regulatory participated in the monolingual track for all five languages and in the bilingual track with Spanish-English runs. Our monolingual runs for Dutch, Spanish and Italian use settings and rules derived from our runs in French and German last year. Our bilingual runs compared merging strategies for query translation resources.

### Automatic Categorization of Case Law

Paul Thompson.   Automatic Categorization of Case Law.  Proceedings of the 8th International Conference on Artificial Intelligence and Law (ICAIL-01), 70--77, 2001.

This paper describes a series of automatic text categorization experiments with case law documents. Cases are categorized into 40 broad, high-level categories. These results are compared to an existing operational process using Boolean queries manually constructed by domain experts. In this categorization process recall is considered more important than precision. This paper investigates three algorithms that potentially could automate this categorization process: 1) a nearest neighbor-like algorithm, 2) C4.5rules, a machine learning decision tree algorithm; and 3) Ripper, a machine learning rule induction algorithm. The results obtained by Ripper surpass those of the operational process.

## 2000

### Automatic Extraction and Linking of Person Names in Legal Text

Christopher Dozier and Robert Haschart.   Automatic Extraction and Linking of Person Names in Legal Text.  Proceedings of RIAO 2000 (Recherche d'Information Assistee par Ordinateur), 1305--1321, 2000.
http://www.sigmod.org/dblp/db/conf/riao/riao2000.html

This paper describes an application that creates hypertext links in text from named individuals to personal biographies. Our system creates these links by extracting MUC-style templates from text and linking them to biographical information in a relational database. The linking technique we use is based on a naïve Bayesian inference network. In particular, our application involves the extraction of attorney and judge names from American caselaw and the creation of links between the names and a file containing their biographies. It is a real world commercial application that involves the automatic creation of millions of reliable hypertext links in millions of documents. The techniques described in this paper could be applied to other domains besides law. Our experiments show that,...

## 1999

### Name Recognition and Retrieval Performance

Paul Thompson and Christopher Dozier (1999).  Name Recognition and Retrieval Performance.  In Tomek Strzalkowski (Ed.), Natural Language Information Retrieval (pp. 261--272). Dordrecht: Kluwer Academic.
http://www.amazon.com/gp/product/0792356853

The main application of name searching has been name matching in a database of names. This paper discusses a different application: improving information retrieval through name recognition. It investigates name recognition accuracy, and the effect on retrieval performance of indexing and searching personal names differently from non-name terms in the context of ranked retrieval. The main conclusions are: that name recognition in text can be effective; that names occur frequently enough in a variety of domains, including those of legal documents and news databases, to make recognition worthwhile; and that retrieval performance can be improved using name searching.

### Genetic Algorithms

Ken Williams and Brad Murray.   Genetic Algorithms.  The Perl Journal, 4, 1999.
http://www.foo.be/docs/tpj/issues/vol4_3/tpj0403-0005.html

Evolving algebraic expressions.

## 1998

### The Structure of Judicial Opinions: Identifying Internal Components and their Relationships

Jack G. Conrad and Daniel P. Dabney.   The Structure of Judicial Opinions: Identifying Internal Components and their Relationships.  Proceedings of the 5th International ISKO Conference (ISKO-98), Structures and Relations in Knowledge Organization, 413 ff., 1998.

Empirical research on basic components of American judicial opinions has only scratched the surface. Lack of a coordinated pool of legal experts or adequate computational resources are but two reasons responsible for this deficiency. We have undertaken a three phase study to uncover fundamental components of judicial opinions found in American case law. The study was aided by a team of twelve expert attorney-editors with a combined total of 135 years of legal editing experience. The hypothesis underlying the experiment was that after years of working closely with thousands of judicial opinions, expert attorneys would develop a refined and internalized schema of the content and structure of legal cases. In this study participants were permitted to describe both concept-related and...

## 1997

### Name Searching and Information Retrieval

Paul Thompson and Christopher Dozier.   Name Searching and Information Retrieval.  Proceedings of 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP 1997), 134--140, 1997.

The main application of name searching has been name matching in a database of names. This paper discusses a different application: improving information retrieval through name recognition. It investigates name recognition accuracy, and the effect on retrieval performance of indexing and searching personal names differently from non-name terms in the context of ranked retrieval. The main conclusions are: that name recognition in text can be effective; that names occur frequently enough in a variety of domains, including those of legal documents and news databases, to make recognition worthwhile; and that retrieval performance can be improved using name searching.

## 1996

### Uncertainty in Information Retrieval Systems

Howard R. Turtle and W. Bruce Croft (1996).  Uncertainty in Information Retrieval Systems.  In Uncertainty Management in Information Systems (pp. 189-224).

Any effective retrieval system includes three major components: the identification and representation of document content, the acquisition and representation of the information need, and the specification of a matching function that selects relevant documents based on these representations. Uncertainty must be dealt with in each of these components.

## 1995

### Text Retrieval in the Legal World

Howard R. Turtle.   Text Retrieval in the Legal World.  Artificial Intelligence and Law, 3, 5-54, 1995.

The ability to find relevant materials in large document collections is a fundamental component of legal research. The emergence of large machine-readable collections of legal materials has stimulated research aimed at improving the quality of the tools used to access these collections. Important research has been conducted within the traditional information retrieval, the artificial intelligence, and the legal communities with varying degrees of interaction between these groups. This article provides an introduction to text retrieval and surveys the main research related to the retrieval of legal materials.

### Query Evaluation: Strategies and Optimizations

Howard R. Turtle and James Flood.   Query Evaluation: Strategies and Optimizations.  Information Processing & Management, 31, 831-850, 1995.
http://dx.doi.org/10.1016/0306-4573(95)00020-H

This paper discusses the two major query evaluation strategies used in large text retrieval systems and analyzes the performance of these strategies. We then discuss several optimization techniques that can be used to reduce evaluation costs and present simulation results to compare the performance of these optimization techniques when evaluating natural language queries with a collection of full text legal materials.

## 1994

### A System for Discovering Relationships by Feature Extraction from Text Databases

Jack G. Conrad and Mary Hunter Utt.   A System for Discovering Relationships by Feature Extraction from Text Databases.  Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), 260-270, 1994.

A method for accessing text-based information using domain-specific features rather than documents alone is presented. The basis of this approach is the ability to automatically extract features from large text databases, and identify statistically significant relationships or associations between those features. The techniques supporting this approach are discussed, and examples from an application using these techniques, named the Associations System, are illustrated using the Wall Street Journal database. In this particular application, the features extracted are company and person names. The series of tests run on the Associations System demonstrate that feature extraction can be quite accurate, and that the relationships generated are reliable. In addition to conventional measures...

### TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System

Paul Thompson, Howard R. Turtle, Bokyung Yang, and James Flood.   TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System.  TREC, 1-7, 1994.

The WIN retrieval engine is West's implementation of the inference network retrieval model. The inference net model ranks documents based on the combination of different evidence, e.g., text representations, such as words, phrases, or paragraphs, in a consistent probabilistic framework. WIN is based on the same retrieval model as the INQUERY system that has been used in previous TREC competitions. The two retrieval engines have common roots but have evolved separately -- WIN has focused on the retrieval of legal materials from large (>50 gigabyte) collections in a commercial online environment that supports both Boolean and natural language retrieval. For TREC-3 we decided to run an essentially unmodified version of WIN to see how well a state-of-the-art commercial system compares to...

### Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance

Howard R. Turtle.   Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance.  Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), 212-220, 1994.

The results of experiments comparing the relative performance of natural language and Boolean query formulations are presented. The experiments show that on average a current generation natural language system provides better retrieval performance than expert searchers using a Boolean retrieval system when searching full-text legal materials. Methodological issues are reviewed and the effect of database size on query formulation strategy is discussed.