
Publications 

At Thomson Reuters, we place high value on being active members of the research community. Publishing papers in scientific conferences and workshops helps ensure that our work stays aligned with the state of the art in our fields.

2021

Multilingual hope speech detection for code-mixed and transliterated texts

Chinnappa, D. (2021). dhivya-hope-detection@LT-EDI-EACL2021: Multilingual hope speech detection for code-mixed and transliterated texts. In Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, pages 73–78, Kyiv. Association for Computational Linguistics.

In recent years, several systems have been developed to regulate the spread of negativity and eliminate aggressive, offensive or abusive content from online platforms. Nevertheless, only limited research has been carried out to identify positive, encouraging and supportive content. In this work, our goal is to identify whether a social media post/comment contains hope speech or not.
https://arxiv.org/abs/2103.00464

Extracting possessions from text: Experiments and error analysis

Chinnappa, D. and Blanco, E. (2021). Extracting possessions from text: Experiments and error analysis. Natural Language Engineering, pages 1–22.

This paper presents a corpus and experiments to mine possession relations from text. Specifically, we target alienable and control possessions and assign temporal anchors indicating when a possession relation holds between the possessor and possessee. 
https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/extracting-possessions-from-text-experiments-and-error-analysis/A1CC6321F5944A8C52C3EBB7CDCB56FB

Tamil lyrics corpus: Analysis and experiments

Chinnappa, D. and Dhandapani, P. (2021). Tamil lyrics corpus: Analysis and experiments. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 1–9, Kyiv. Association for Computational Linguistics.

In this paper, we present a new Tamil lyrics corpus extracted from Tamil movies captured across a range of 65 years (1954 to 2019). We present a detailed corpus analysis showing the nature of Tamil lyrics with respect to lyricists and the year in which they were written. We also present similarity scores across different lyricists based on their song lyrics. We present experimental results based on the SOTA BERT Tamil models to identify the lyricists of a song. Finally, we present future research directions encouraging researchers to pursue Tamil NLP research.
https://www.aclweb.org/anthology/2021.dravidianlangtech-1.1/

Information Extraction & Entailment of Common Law & Civil Code

Hudzina, J., Madan, K., Chinnappa, D., Harmouche, J., Bretz, H., Vold, A., and Schilder, F. (2021). Information Extraction & Entailment of Common Law & Civil Code. In New Frontiers in Artificial Intelligence, pages 162–175. Springer International Publishing.

With the recent advancements in machine learning models, we have seen improvements in Natural Language Inference (NLI) tasks, but legal entailment has been challenging, particularly for supervised approaches.

In this paper, we evaluate different approaches for handling entailment tasks on the small domain-specific data sets provided in the Competition on Legal Information Extraction/Entailment (COLIEE). This year COLIEE had four tasks, which focused on legal information processing and finding textual entailment on legal data. We participated in all four tasks this year, and evaluated different kinds of approaches, including classification, ranking, and transfer learning approaches against the entailment tasks. In some of the tasks, we achieved competitive results when compared to simpler rule-based approaches, which have dominated the competition for the last six years.

Towards Explainable AI: Assessing the Usefulness and Impact of Added Explainability Features in Legal Document Summarization

Norkute, M., Herger, N., Michalak, L., Mulder, A., and Gao, S. (2021). Towards Explainable AI: Assessing the Usefulness and Impact of Added Explainability Features in Legal Document Summarization. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, New York, NY, USA. Association for Computing Machinery.

This study tested two different approaches for adding an explainability feature to the implementation of a legal text summarization solution based on a Deep Learning (DL) model. Both approaches aimed to show the reviewers where the summary originated from by highlighting portions of the source text document. The participants had to review summaries generated by the DL model with two different types of text highlights and with no highlights at all.
https://dl.acm.org/doi/10.1145/3411763.3443441

Multi-label legal document classification: A deep learning-based approach with label-attention and domain-specific pre-training

Song, D., Vold, A., Madan, K., and Schilder, F. (2021). Multi-label legal document classification: A deep learning-based approach with label-attention and domain-specific pre-training. Information Systems, page 101718.

Multi-label document classification has a broad range of applicability to various practical problems, such as news article topic tagging, sentiment analysis, medical code classification, etc. A variety of approaches (e.g., tree-based methods, neural networks and deep learning systems that are specifically based on pre-trained language models) have been developed for multi-label document classification problems and have achieved satisfying performance on different datasets. In the legal domain, however, one is often faced with several key challenges when working with multi-label classification tasks. One critical challenge is the lack of high-quality human labeled datasets, which prevents researchers and practitioners from achieving decent performance on respective tasks. Also, existing methods on multi-label classification typically focus on the majority classes, which results in an unsatisfying performance for other important classes that do not have sufficient training samples. In order to tackle the above challenges, in this paper, we first present POSTURE50K, a novel legal extreme multi-label classification dataset, which we will release to the research community. The dataset contains 50,000 legal opinions and their manually labeled legal procedural postures. Labels in this dataset follow a Zipfian distribution, leaving many of the classes with only a few samples. Furthermore, we propose a deep learning architecture that adopts domain-specific pre-training and a label-attention mechanism for multi-label document classification. We evaluate our proposed architecture on POSTURE50K and another legal multi-label dataset EUROLEX57K, and show that our approach achieves better performances than two baseline systems and another four recent state-of-the-art methods on both datasets.
https://www.sciencedirect.com/science/article/abs/pii/S0306437921000016

2020

Quick Check: A Legal Research Recommendation System

Merine Thomas, Thomas Vacek, Xin Shuai, Wenhui Liao, George Sanchez, Paras Sethia, Don Teo, Kanika Madan, and Tonya Custis.  Quick Check: A Legal Research Recommendation System.  Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 2020.

Finding relevant sources of law that discuss a specific legal issue and support a favorable decision is an onerous and time-consuming task for litigation attorneys. In this paper, we present Quick Check, a system that extracts the legal arguments from a user’s brief and recommends highly relevant case law opinions. Using a combination of full-text search, citation network analysis, clickstream analysis, and a hierarchy of ranking models trained on a set of over 10K annotations, the system is able to effectively recommend cases that are similar in both legal issue and facts. Importantly, the system leverages a detailed legal taxonomy and an extensive body of editorial summaries of case law. We demonstrate how recommended cases from the system are surfaced through a user interface that enables a legal researcher to quickly determine the applicability of a case with respect to a given legal issue.
http://ceur-ws.org/Vol-2645/short3.pdf

Regularizing Pattern Recognition with Conditional Probability Estimates

Thomas Vacek.  Regularizing Pattern Recognition with Conditional Probability Estimates.  Proceedings of the 2020 International Joint Conference on Neural Networks, 2020.

Recent contributions in non-parametric statistical pattern recognition have investigated augmenting the task with information about the conditional probability distribution P(Y|X) away from the 0.5 level set, i.e. the decision boundary. Many hypothesis spaces satisfy generous smoothness criteria, so the behavior of a function away from the decision boundary can serve as a regularizer for its behavior at the decision boundary. This paper proposes a paradigm to capture observable information about the conditional distribution and describes a learning formulation that can take advantage of it. Finally, it investigates why conditional probability can be an effective regularizer for inseparable pattern recognition problems.
https://ieeexplore.ieee.org/abstract/document/9207004

Beyond Possession Existence: Duration and Co-possession

Dhivya Chinnappa, Srikala Murugan, and Eduardo Blanco. Beyond Possession Existence: Duration and Co-possession. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.

This paper introduces two tasks: determining (a) the duration of possession relations and (b) co-possessions, i.e., whether multiple possessors possess a possessee at the same time. We present new annotations on top of corpora annotating possession existence and experimental results. Regarding possession duration, we derive the time spans we work with empirically from annotations indicating lower and upper bounds. Regarding co-possessions, we use a binary label. Cohen’s kappa coefficients indicate substantial agreement, and experimental results show that text is more useful than the image for solving these tasks.
https://www.aclweb.org/anthology/2020.acl-main.739/

WikiPossessions: Possession timeline generation as an evaluation benchmark for machine reading comprehension of long texts

Dhivya Chinnappa, Alexis Palmer, and Eduardo Blanco. WikiPossessions: Possession timeline generation as an evaluation benchmark for machine reading comprehension of long texts. Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), 2020.

This paper presents WikiPossessions, a new benchmark corpus for the task of temporally-oriented possession (TOP), or tracking objects as they change hands over time. We annotate Wikipedia articles for 90 different well-known artifacts (paintings, diamonds, and archaeological artifacts), producing 799 artifact-possessor relations with associated attributes. For each article, we also produce a full possession timeline. The full version of the task combines straightforward entity-relation extraction with complex temporal reasoning, as well as verification of textual support for the relevant types of knowledge. Specifically, to complete the full TOP task for a given article, a system must do the following: a) identify possessors; b) anchor possessors to times/events; c) identify temporal relations between each temporal anchor and the possession relation it corresponds to; d) assign certainty scores to each possessor and each temporal relation; and e) assemble individual possession events into a global possession timeline. In addition to the corpus, we release evaluation scripts and a baseline model for the task.
https://www.aclweb.org/anthology/2020.lrec-1.140/

A smart system to generate and validate question answer pairs for COVID-19 literature

Bhambhoria, R., Feng, L., Sepehr, D., Chen, J., Cowling, C., Kocak, S., and Dolatabadi, E. (2020). A smart system to generate and validate question answer pairs for COVID-19 literature. In Proceedings of the First Workshop on Scholarly Document Processing, pages 20–30, Online. Association for Computational Linguistics.

Automatically generating question answer (QA) pairs from the rapidly growing coronavirus-related literature is of great value to the medical community. Creating high quality QA pairs would allow researchers to build models to address scientific queries for answers which are not readily available in support of the ongoing fight against the pandemic. QA pair generation is, however, a very tedious and time consuming task requiring domain expertise for annotation and evaluation. In this paper we present our contribution in addressing some of the challenges of building a QA system without gold data. We first present a method to create QA pairs from a large semi-structured dataset through the use of transformer and rule-based models. Next, we propose a means of engaging subject matter experts (SMEs) for annotating the QA pairs through the usage of a web application. Finally, we demonstrate some experiments showcasing the effectiveness of leveraging active learning in designing a high performing model with a substantially lower annotation effort from the domain experts.
https://www.aclweb.org/anthology/2020.sdp-1.4/

Determining event outcomes: The case of #fail

Murugan, S., Chinnappa, D., and Blanco, E. (2020). Determining event outcomes: The case of #fail. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4021–4033, Online. Association for Computational Linguistics.

This paper targets the task of determining event outcomes in social media. We work with tweets containing either #cookingFail or #bakingFail, and show that many of the events described in them resulted in something edible. Tweets that contain images are more likely to result in edible albeit imperfect outcomes. Experimental results show that edibility is easier to predict than outcome quality.
https://www.aclweb.org/anthology/2020.findings-emnlp.359/

Customizing contextualized language models for legal document reviews

Shaghaghian, S., Feng, L. Y., Jafarpour, B., and Pogrebnyakov, N. (2020). Customizing contextualized language models for legal document reviews. In 2020 IEEE International Conference on Big Data (Big Data), pages 2139–2148. IEEE.

Inspired by the inductive transfer learning on computer vision, many efforts have been made to train contextualized language models that boost the performance of natural language processing tasks. These models are mostly trained on large general-domain corpora such as news, books, or Wikipedia. Although these pre-trained generic language models well perceive the semantic and syntactic essence of a language structure, exploiting them in a real-world domain-specific scenario still needs some practical considerations to be taken into account such as token distribution shifts, inference time, memory, and their simultaneous proficiency in multiple tasks. In this paper, we focus on the legal domain and present how different language models trained on general-domain corpora can be best customized for multiple legal document reviewing tasks. We compare their efficiencies with respect to task performances and present practical considerations.
https://arxiv.org/abs/2102.05757

2019

Statutory entailment using similarity features and decomposable attention models

John Hudzina, Thomas Vacek, Kanika Madan, Tonya Custis, and Frank Schilder. Statutory entailment using similarity features and decomposable attention models. Proceedings of the Competition on Legal Information Extraction/Entailment (COLIEE), COLIEE-2019 Workshop, June 21, 2019, at the International Conference on Artificial Intelligence and Law (ICAIL), 2019.

Textual entailment using word embeddings and linguistic similarity

Kanika Madan, John Hudzina, Thomas Vacek, Frank Schilder, and Tonya Custis. Textual entailment using word embeddings and linguistic similarity. Proceedings of the Competition on Legal Information Extraction/Entailment (COLIEE), COLIEE-2019 Workshop, June 21, 2019, at the International Conference on Artificial Intelligence and Law (ICAIL), 2019.

Exploiting Search Logs to Aid in Training and Automating Infrastructure for Question Answering in Professional Domains

Filippo Pompili, Jack G. Conrad, and Carter Kolbeck.   Exploiting Search Logs to Aid in Training and Automating Infrastructure for Question Answering in Professional Domains.  Proceedings of the 17th International Conference on Artificial Intelligence and Law (ICAIL), 2019.

Litigation Analytics: Case Outcomes Extracted from US Federal Court Dockets

Thomas Vacek, Ronald Teo, Dezhao Song, Timothy Nugent, Conner Cowling, and Frank Schilder. Litigation Analytics: Case Outcomes Extracted from US Federal Court Dockets. Proceedings of the Natural Legal Language Processing Workshop 2019, 45–54, 2019.

Dockets contain a wealth of information for planning a litigation strategy, but the information is locked up in semi-structured text. Manually deriving the outcomes for each party (e.g., settlement, verdict) would be very labor intensive. Having such information available for every past court case, however, would be very useful for developing a strategy because it potentially reveals tendencies and trends of judges and courts and the opposing counsel. We used Natural Language Processing (NLP) techniques and deep learning methods allowing us to scale the automatic analysis of millions of US federal court dockets. The automatically extracted information is fed into a Litigation Analytics tool that is used by lawyers to plan how they approach concrete litigations.

Litigation Analytics: Extracting and querying motions and orders from US federal courts

Thomas Vacek, Dezhao Song, Hugo Molina-Salgado, Ronald Teo, Conner Cowling, and Frank Schilder. Litigation Analytics: Extracting and querying motions and orders from US federal courts. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 116–121, 2019.

Legal litigation planning can benefit from statistics collected from past decisions made by judges. Information on the typical duration for a submitted motion, for example, can give valuable clues for developing a successful strategy. Such information is encoded in semi-structured documents called dockets. In order to extract and aggregate this information, we deployed various information extraction and machine learning techniques. The aggregated data can be queried in real time within the Westlaw Edge search engine. In addition to a keyword search for judges, lawyers, law firms, parties and courts, we also implemented a question answering interface that offers targeted questions in order to get to the respective answers quicker.

Sentence Boundary Detection in Legal Text

George Sanchez. Sentence Boundary Detection in Legal Text. Proceedings of the Natural Legal Language Processing Workshop 2019, 31–38, 2019.
https://www.aclweb.org/anthology/W19-2204

In this paper, we examined several algorithms to detect sentence boundaries in legal text. Legal text presents challenges for sentence tokenizers because of its varied punctuation and syntax. Out-of-the-box algorithms perform poorly on legal text, which affects further analysis of the text. A novel, domain-specific approach is needed to detect sentence boundaries so that legal text can be analyzed further. We present the results of our investigation in this paper.

Westlaw Edge AI Features Demo: KeyCite Overruling Risk, Litigation Analytics, and WestSearch Plus

Tonya Custis, Frank Schilder, Thomas Vacek, Gayle McElvain, and Hector Martinez Alonso. Westlaw Edge AI Features Demo: KeyCite Overruling Risk, Litigation Analytics, and WestSearch Plus. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, ICAIL '19, 256–257, 2019.
http://doi.acm.org/10.1145/3322640.3326739

WestSearch Plus: A Non-factoid Question-Answering System for the Legal Domain

Gayle McElvain, George Sanchez, Sean Matthews, Don Teo, Filippo Pompili, and Tonya Custis. WestSearch Plus: A Non-factoid Question-Answering System for the Legal Domain. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'19, 1361–1364, 2019.
http://doi.acm.org/10.1145/3331184.3331397

2018

A Comparison of Two Paraphrase Models for Taxonomy Augmentation

Vassilis Plachouras, Fabio Petroni, Timothy Nugent, and Jochen L. Leidner. A Comparison of Two Paraphrase Models for Taxonomy Augmentation. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 315–320, 2018.
https://www.aclweb.org/anthology/N18-2051

Taxonomies are often used to look up the concepts they contain in text documents (for instance, to classify a document). The more comprehensive the taxonomy, the higher the recall of the application that uses it. In this paper, we explore automatic taxonomy augmentation with paraphrases. We compare two state-of-the-art paraphrase models based on Moses, a statistical Machine Translation system, and a sequence-to-sequence neural network, trained on a paraphrase dataset, with respect to their abilities to add novel nodes to an existing taxonomy from the risk domain. We conduct component-based and task-based evaluations. Our results show that paraphrasing is a viable method to enrich a taxonomy with more terms, and that Moses consistently outperforms the sequence-to-sequence neural...

attr2vec: Jointly Learning Word and Contextual Attribute Embeddings with Factorization Machines

Fabio Petroni, Vassilis Plachouras, Timothy Nugent, and Jochen L. Leidner. attr2vec: Jointly Learning Word and Contextual Attribute Embeddings with Factorization Machines. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 453–462, 2018.
https://www.aclweb.org/anthology/N18-1042

The widespread use of word embeddings is associated with the recent successes of many natural language processing (NLP) systems. The key approach of popular models such as word2vec and GloVe is to learn dense vector representations from the context of words. More recently, other approaches have been proposed that incorporate different types of contextual information, including topics, dependency relations, n-grams, and sentiment. However, these models typically integrate only limited additional contextual information, and often in ad hoc ways. In this work, we introduce attr2vec, a novel framework for jointly learning embeddings for words and contextual attributes based on factorization machines. We perform experiments with different types of contextual information. Our experimental...

TipMaster: A Knowledge Base of Authoritative Local News Sources on Social Media

Xin Shuai, Xiaomo Liu, Armineh Nourbakhsh, Sameena Shah, and Tonya Custis. TipMaster: A Knowledge Base of Authoritative Local News Sources on Social Media. 13th Conference on Innovative Applications of Artificial Intelligence, IAAI-2018, 2018.

Introduction to the special issue on legal text analytics

Jack G. Conrad and Luther Karl Branting. Introduction to the special issue on legal text analytics. Artif. Intell. Law, 26, 99–102, 2018.
https://doi.org/10.1007/s10506-018-9227-z

The E2E NLG Challenge: A Tale of Two Systems

Charese Smiley, Elnaz Davoodi, Dezhao Song, and Frank Schilder. The E2E NLG Challenge: A Tale of Two Systems. Proceedings of the 11th International Conference on Natural Language Generation, 472–477, 2018.

An Extensible Event Extraction System With Cross-Media Event Resolution

Fabio Petroni, Natraj Raman, Tim Nugent, Armineh Nourbakhsh, Žarko Panić, Sameena Shah, and Jochen L. Leidner. An Extensible Event Extraction System With Cross-Media Event Resolution. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, 626–635, 2018.
http://doi.acm.org/10.1145/3219819.3219827

2017

Scenario analytics: analyzing jury verdicts to evaluate legal case outcomes

Jack G. Conrad and Khalid Al-Kofahi. Scenario analytics: analyzing jury verdicts to evaluate legal case outcomes. Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12–16, 2017, 29–37, 2017.
https://doi.org/10.1145/3086512.3086516

Say the right thing right: Ethics issues in natural language generation systems

Charese Smiley, Frank Schilder, Vassilis Plachouras, and Jochen L Leidner. Say the right thing right: Ethics issues in natural language generation systems. Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 103–108, 2017.

Building and querying an enterprise knowledge graph

Dezhao Song, Frank Schilder, Shai Hertz, Giuseppe Saltini, Charese Smiley, Phani Nivarthi, Oren Hazai, Dudi Landau, Mike Zaharkin, Tom Zielund, et al.  Building and querying an enterprise knowledge graph.  IEEE Transactions on Services Computing, 2017.

A sequence approach to case outcome detection

Tom Vacek and Frank Schilder. A sequence approach to case outcome detection. Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, 209–215, 2017.

A Multidimensional Investigation of the Effects of Publication Retraction on Scholarly Impact

Xin Shuai, Jason Rollins, Isabelle Moulinier, Tonya Custis, Mathilda Edmunds, and Frank Schilder. A Multidimensional Investigation of the Effects of Publication Retraction on Scholarly Impact. Journal of the Association for Information Science & Technology, 68, 2225–2236, 2017.

Hashtag Mining: Discovering Relationship Between Health Concepts and Hashtags

Quanzhi Li, Sameena Shah, Rui Fang, Armineh Nourbakhsh, and Xiaomo Liu (2017). Hashtag Mining: Discovering Relationship Between Health Concepts and Hashtags. In Public Health Intelligence and the Internet (pp. 75–85). Springer.

Reuters Tracer: Toward Automated News Production Using Large Scale Social Media Data

Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Sameena Shah, Robert Martin, and John Duprey.   Reuters Tracer: Toward Automated News Production Using Large Scale Social Media Data.  2017 IEEE International Conference on Big Data, 2017.

Mapping the echo-chamber: detecting and characterizing partisan networks on Twitter

Armineh Nourbakhsh, Xiaomo Liu, Quanzhi Li, and Sameena Shah.   Mapping the echo-chamber: detecting and characterizing partisan networks on Twitter.  International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2017.

"Breaking" Disasters: Predicting and Characterizing the Global News Value of Natural and Man-made Disasters

Armineh Nourbakhsh, Quanzhi Li, Xiaomo Liu, and Sameena Shah. "Breaking" Disasters: Predicting and Characterizing the Global News Value of Natural and Man-made Disasters. KDD Workshop on Data Science + Journalism, 2017.

funSentiment at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs Using Word Vectors Built from StockTwits and Twitter

Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Rui Fang, and Xiaomo Liu. funSentiment at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs Using Word Vectors Built from StockTwits and Twitter. Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3–4, 2017, 852–856, 2017.

funSentiment at SemEval-2017 Task 4: Topic-Based Message Sentiment Classification by Exploiting Word Embeddings, Text Features and Target Contexts

Quanzhi Li, Armineh Nourbakhsh, Xiaomo Liu, Rui Fang, and Sameena Shah. funSentiment at SemEval-2017 Task 4: Topic-Based Message Sentiment Classification by Exploiting Word Embeddings, Text Features and Target Contexts. Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3–4, 2017, 741–746, 2017.

Data Sets: Word Embeddings Learned from Tweets and General Data

Quanzhi Li, Sameena Shah, Xiaomo Liu, and Armineh Nourbakhsh.   Data Sets: Word Embeddings Learned from Tweets and General Data.  The 11th International Conference on Weblogs and Social Media (ICWSM), 2017.

Real-time novel event detection from social media

Quanzhi Li, Armineh Nourbakhsh, Sameena Shah, and Xiaomo Liu. Real-time novel event detection from social media. 2017 IEEE 33rd International Conference on Data Engineering (ICDE), 1129–1139, 2017.

2016

Fifteenth International Conference on Artificial Intelligence and Law (ICAIL 2015)

Katie Atkinson, Jack G. Conrad, Anne Gardner, and Ted Sichelman. Fifteenth International Conference on Artificial Intelligence and Law (ICAIL 2015). AI Magazine, 37, 107–108, 2016.
http://www.aaai.org/ojs/index.php/aimagazine/article/view/2633

Semi-Supervised Events Clustering in News Retrieval

Jack G. Conrad and Michael Bender. Semi-Supervised Events Clustering in News Retrieval. Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval (ECIR 2016), Padua, Italy, March 20, 2016, 21–26, 2016.
http://ceur-ws.org/Vol-1568/paper4.pdf

When to Plummet and When to Soar: Corpus Based Verb Selection for Natural Language Generation

Charese Smiley, Vassilis Plachouras, Frank Schilder, Hiroko Bretz, Jochen Leidner, and Dezhao Song. When to Plummet and When to Soar: Corpus Based Verb Selection for Natural Language Generation. Proceedings of the 9th International Natural Language Generation conference, 36–39, 2016.

Interacting with financial data using natural language

Vassilis Plachouras, Charese Smiley, Hiroko Bretz, Ola Taylor, Jochen L Leidner, Dezhao Song, and Frank Schilder. Interacting with financial data using natural language. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 1121–1124, 2016.

Witness identification in twitter

Rui Fang, Armineh Nourbakhsh, Xiaomo Liu, Sameena Shah, and Quanzhi Li. Witness identification in twitter. Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, 65–73, 2016.

Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter

Xiaomo Liu, Quanzhi Li, Armineh Nourbakhsh, Rui Fang, Merine Thomas, Kajsa Anderson, Russ Kociuba, Mark Vedder, Steven Pomerville, Ramdev Wudali, et al. Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 207–216, 2016.

Hashtag recommendation based on topic enhanced embedding, tweet entity data and learning to rank

Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Xiaomo Liu, and Rui Fang. Hashtag recommendation based on topic enhanced embedding, tweet entity data and learning to rank. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2085–2088, 2016.

Tweetsift: Tweet topic classification based on entity knowledge base and topic enhanced word embedding

Quanzhi Li, Sameena Shah, Xiaomo Liu, Armineh Nourbakhsh, and Rui Fang.   Tweetsift: Tweet topic classification based on entity knowledge base and topic enhanced word embedding.  Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2429--2432, 2016.

Tweet topic classification using distributed language representations

Quanzhi Li, Sameena Shah, Xiaomo Liu, Armineh Nourbakhsh, and Rui Fang.   Tweet topic classification using distributed language representations.  2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 81--88, 2016.

Tweet Sentiment Analysis by Incorporating Sentiment-Specific Word Embedding and Weighted Text Features

Quanzhi Li, Sameena Shah, Rui Fang, Armineh Nourbakhsh, and Xiaomo Liu.   Tweet Sentiment Analysis by Incorporating Sentiment-Specific Word Embedding and Weighted Text Features.  2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 568--571, 2016.

Sentiment Analysis of Political Figures across News and Social Media

Quanzhi Li, Armineh Nourbakhsh, Rui Fang, Xiaomo Liu, and Sameena Shah.   Sentiment Analysis of Political Figures across News and Social Media.  2016 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2016.

How Much Data Do You Need? Twitter Decahose Data Analysis

Quanzhi Li, Sameena Shah, Merine Thomas, Kajsa Anderson, Xiaomo Liu, Armineh Nourbakhsh, and Rui Fang.   How Much Data Do You Need? Twitter Decahose Data Analysis.  2016 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2016.

Discovering Relevant Hashtags for Health Concepts: A Case Study of Twitter

Quanzhi Li, Sameena Shah, Rui Fang, Armineh Nourbakhsh, and Xiaomo Liu.   Discovering Relevant Hashtags for Health Concepts: A Case Study of Twitter.  AAAI Workshop: WWW and Population Health Intelligence, 2016.

User Behaviors in Newsworthy Rumors: A Case Study of Twitter

Quanzhi Li, Xiaomo Liu, Rui Fang, Armineh Nourbakhsh, and Sameena Shah.   User Behaviors in Newsworthy Rumors: A Case Study of Twitter.  The 10th International Conference on Weblogs and Social Media (ICWSM), 627--630, 2016.

Georeferencing

Jochen L. Leidner (2016).  In Wiley International Encyclopedia of Geography, Georeferencing.  Oxford, England, UK: Wiley-Blackwell.

Newton: Building an authority-driven company tagging and resolution system

Merine Thomas, Hiroko Bretz, Thomas Vacek, Benjamin Hachey, Sudhanshu Singh, and Frank Schilder (2016).  In Working With Text: Tools, Techniques and Approaches for Text Mining, Tonkin, Emma and Taylor, Stephanie (Eds.), Newton: Building an authority-driven company tagging and resolution system.  (pp. 159--187). Chandos Publishing.
https://www.sciencedirect.com/science/article/pii/B9781843347...

2015

Quantifying socio-economic indicators in developing countries from mobile phone communication data: applications to Côte d'Ivoire

Huina Mao, Xin Shuai, Yong-Yeol Ahn, and Johan Bollen  Quantifying socio-economic indicators in developing countries from mobile phone communication data: applications to Côte d'Ivoire.  EPJ Data Science, 4, 2015.
http://dx.doi.org/10.1140/epjds/s13688-015-0053-1

Natural Language Question Answering and Analytics for Diverse and Interlinked Datasets

Dezhao Song, Frank Schilder, Charese Smiley, and Chris Brew.   Natural Language Question Answering and Analytics for Diverse and Interlinked Datasets.  Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, 101--105, 2015.
http://www.aclweb.org/anthology/N15-3021

Newsworthy rumor events: A case study of twitter

Armineh Nourbakhsh, Xiaomo Liu, Sameena Shah, Rui Fang, Mohammad Mahdi Ghassemi, and Quanzhi Li.   Newsworthy rumor events: A case study of twitter.  2015 IEEE International Conference on Data Mining Workshop (ICDMW), 27--32, 2015.

Real-time rumor debunking on twitter

Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui Fang, and Sameena Shah.   Real-time rumor debunking on twitter.  Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM), 1867--1870, 2015.

The Role of Evaluation in AI and Law: An Examination of Its Different Forms in the AI and Law Journal

Jack G. Conrad and John Zeleznikow.   The Role of Evaluation in AI and Law: An Examination of Its Different Forms in the AI and Law Journal.  Proceedings of the 15th International Conference on Artificial Intelligence and Law, ICAIL '15, 181--186, 2015.
http://doi.acm.org/10.1145/2746090.2746116

Newsworthy Rumor events: A Case Study of Twitter

Armineh Nourbakhsh, Xiaomo Liu, Sameena Shah, Rui Fang, Mohammad Ghassemi, and Quanzhi Li.   Newsworthy Rumor events: A Case Study of Twitter.  Proceedings of the ICDM workshop on Event Analytics using social media data, 2015.

Information Extraction of Regulatory Enforcement Action: From Anti-Money Laundering Compliance to Countering Terrorism Finance

Vassilis Plachouras and Jochen L. Leidner.   Information Extraction of Regulatory Enforcement Action: From Anti-Money Laundering Compliance to Countering Terrorism Finance.  International Symposium on Open Source Intelligence and Security Informatics, FOSINT-SI, 2015.

Multimodal Entity Coreference for Cervical Dysplasia Diagnosis

Dezhao Song, Edward Kim, Xiaolei Huang, Joseph Patruno, Héctor Muñoz-Avila, Jeff Heflin, L. Rodney Long, and Sameer Antani  Multimodal Entity Coreference for Cervical Dysplasia Diagnosis.  IEEE Transactions on Medical Imaging (IEEE TMI), 34, 229--245, 2015.

TR Discover: A Natural Language Interface for Querying and Analyzing Interlinked Datasets

Dezhao Song, Frank Schilder, Charese Smiley, Chris Brew, Tom Zielund, Hiroko Bretz, Robert Martin, Chris Dale, John Duprey, Tim Miller, and Johanna Harrison (2015).  In The Semantic Web - ISWC 2015, TR Discover: A Natural Language Interface for Querying and Analyzing Interlinked Datasets.  (pp. 21-37). Springer International Publishing.
http://dx.doi.org/10.1007/978-3-319-25010-6_2

Currently, the dominant technology for providing non-technical users with access to Linked Data is keyword-based search. This is problematic because keywords are often inadequate as a means for expressing user intent. In addition, while a structured query language can provide convenient access to the information needed by advanced analytics, unstructured keyword-based search cannot meet this extremely common need. This makes it harder than necessary for non-technical users to generate analytics. We address these difficulties by developing a natural language-based system that allows non-technical users to create well-formed questions. Our system, called TR Discover, maps from a fragment of English into an intermediate First Order Logic representation, which is in turn mapped into SPARQL...
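
The final mapping step described above (logic representation to SPARQL) can be illustrated with a minimal sketch. This is not the TR Discover implementation: the atom format, the `fol_to_sparql` helper, the `http://example.org/` namespace, and the example predicates are all assumptions made for illustration. The sketch assumes every subject is a variable and treats a conjunction of binary atoms as a list of triple patterns.

```python
def fol_to_sparql(atoms, answer_var):
    """Render a conjunction of binary atoms as a SPARQL SELECT query.

    atoms: list of (predicate, subject, object) tuples; variables start
    with '?'. The predicate name 'type' is treated as rdf:type
    (an assumption of this sketch, not of the paper).
    """
    patterns = []
    for predicate, subj, obj in atoms:
        if predicate == "type":
            patterns.append(f"{subj} rdf:type :{obj} .")
        else:
            # Objects that are not variables are treated as local names.
            o = obj if obj.startswith("?") else f":{obj}"
            patterns.append(f"{subj} :{predicate} {o} .")
    body = "\n  ".join(patterns)
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX : <http://example.org/>\n"  # hypothetical namespace
        f"SELECT DISTINCT {answer_var} WHERE {{\n  {body}\n}}"
    )


# Hypothetical logical form for a question such as
# "drugs developed by companies":
#   Drug(x) AND developedBy(x, c) AND Company(c)
query = fol_to_sparql(
    [("type", "?x", "Drug"),
     ("developedBy", "?x", "?c"),
     ("type", "?c", "Company")],
    answer_var="?x",
)
print(query)
```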

2014

Text Analytics at Thomson Reuters

Jochen L. Leidner  Text Analytics at Thomson Reuters.  Invited Talk, London Text Analytics Meetup, London, England, 2014-10-16, 2014.
http://www.meetup.com/textanalytics/events/207765012/

Thomson Reuters is an information company that develops and sells information products to professionals in verticals such as Finance, Risk/Compliance, News, Law, Tax, Accounting, Intellectual Property, and Science. In this talk, I will describe how making money from information differs from making money from advertising, and the role of state-of-the-art text analytics techniques in the process will be described using some case studies. In addition, I will compare and contrast our industry research work with academic research.

Research and Development in Information Access at Thomson Reuters Corporate R&D

Jochen L. Leidner  Research and Development in Information Access at Thomson Reuters Corporate R&D.  Invited Talk, Language and Computation Day (LAC), University of Essex, Colchester, England, 2014-10-06, 2014.
http://lac.essex.ac.uk/language-and-computation-day-2014

Thomson Reuters is a modern information company. In this talk, I characterise the nature of carrying out research, development and innovation activities as part of its Corporate R&D group that add value to end customers and translate into additional revenue. A couple of R&D projects in the area of natural language processing, information retrieval and applied machine learning will be described, covering the legal, scientific, financial and news areas. The talk will conclude with a cautious outlook of what the near future may hold. Additionally, I will attempt a comparison of doing research in a company with pursuing academic research at a university.

A practical SIM learning formulation with margin capacity control

Thomas Vacek.   A practical SIM learning formulation with margin capacity control.  Proceedings of 2014 International Joint Conference on Neural Networks (IJCNN), 4160-4167, 2014.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6889963

Given a finite i.i.d. dataset of the form (y_i, x_i), the Single Index Model (SIM) learning problem is to estimate a regression of the form u ∘ f(x_i) where u is some Lipschitz-continuous nondecreasing function and f is a linear function. This paper applies Vapnik's Structural Risk Minimization principle to SIM learning. I show that a risk structure for the space of model functions f gives a risk structure for the space of functions u ∘ f. Second, I provide a practical learning formulation for SIM using a risk structure defined by margin-based capacity control. The new learning formulation is compared with support vector regression.

Winning by Following the Winners: Mining the Behaviour of Stock Market Experts in Social Media

Wenhui Liao, Sameena Shah, and Masoud Makrehchi.   Winning by Following the Winners: Mining the Behaviour of Stock Market Experts in Social Media.  Proceedings of the International Social Computing, Behavioral-Cultural Modeling and Prediction Conference (SBP 2014), 2014.

A novel yet simple method is proposed to invest in the stock market by following successful stock market experts in social media. The problem of "how and where to invest" is translated into "who to follow in my investment". In other words, the search for a stock market investment strategy is converted into a stock market expert search. Fortunately, many stock market experts are active in social media and openly express their opinions about the market. By analyzing their behaviour, mining their opinions and suggested actions on Twitter, and virtually trading based on their suggestions, we are able to score each expert based on his/her performance. Using this scoring system, the experts with the most successful trades are recommended. The main objective in this research is to identify traders that...

Social Informatics: Revised Selected Papers from SocInfo 2013 International Workshops, QMC and HISTOINFORMATICS, Kyoto, Japan, November 25, 2013

http://www.springer.com/computer/database+management+%26+info...

This book constitutes the refereed post-proceedings of two workshops held at the 5th International Conference on Social Informatics, SocInfo 2013, in Kyoto, Japan, in November 2013: the First Workshop on Quality, Motivation and Coordination of Open Collaboration, QMC 2013, and the First International Workshop on Histoinformatics, HISTOINFORMATICS 2013. The 11 revised papers presented at the workshops were carefully reviewed and selected from numerous submissions. They cover specific areas of social informatics. The QMC 2013 workshop attracted papers on new algorithms and methods to improve the quality or to increase the motivation of open collaboration, to reduce the cost of financial motivation or to decrease the time needed to finish collaborative tasks. The papers presented at...

Exploring Linked Data with contextual tag clouds

Xingjian Zhang, Dezhao Song, Sambhawa Priya, Zachary Daniels, Kelly Reynolds, and Jeff Heflin  Exploring Linked Data with contextual tag clouds.  Web Semantics: Science, Services and Agents on the World Wide Web, 24, 33 - 39, 2014.
http://www.sciencedirect.com/science/article/pii/S1570826814000055

In this paper we present the contextual tag cloud system: a novel application that helps users explore a large scale RDF dataset. Unlike folksonomy tags used in most traditional tag clouds, the tags in our system are ontological terms (classes and properties), and a user can construct a context with a set of tags that defines a subset of instances. Then in the contextual tag cloud, the font size of each tag depends on the number of instances that are associated with that tag and all tags in the context. Each contextual tag cloud serves as a summary of the distribution of relevant data, and by changing the context, the user can quickly gain an understanding of patterns in the data. Furthermore, the user can choose to include RDFS taxonomic and/or domain/range entailment...
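
The counting rule in this abstract (a tag's size depends on how many instances carry that tag together with every tag in the context) can be sketched in a few lines. The data layout, the function names, and the linear font-scaling rule below are assumptions for illustration; the paper's system works over RDF data and may compute and scale counts differently.

```python
from collections import Counter


def contextual_tag_counts(instances, context):
    """Count, per tag, the instances that carry the tag and all context tags."""
    context = set(context)
    counts = Counter()
    for tags in instances.values():
        if context <= tags:              # instance satisfies the whole context
            counts.update(tags - context)
    return counts


def font_sizes(counts, min_pt=10, max_pt=32):
    """Scale counts linearly to font sizes (hypothetical scaling rule)."""
    if not counts:
        return {}
    top = max(counts.values())
    return {tag: min_pt + (max_pt - min_pt) * n / top
            for tag, n in counts.items()}


# Toy data: instance id -> set of ontological terms attached to it.
instances = {
    "i1": {"Person", "Author", "hasName"},
    "i2": {"Person", "hasName"},
    "i3": {"Person", "Author", "hasPublication"},
    "i4": {"Organization", "hasName"},
}
counts = contextual_tag_counts(instances, context={"Person"})
sizes = font_sizes(counts)
```

Changing the `context` argument recomputes the counts over a different subset of instances, which is the interaction the abstract describes.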

2013

A Statistical NLG Framework for Aggregated Planning and Realization

Ravi Kondadadi, Blake Howald, and Frank Schilder.   A Statistical NLG Framework for Aggregated Planning and Realization.  Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1406--1415, 2013.
http://www.aclweb.org/anthology/P13-1138

We present a hybrid natural language generation (NLG) system that consolidates macro and micro planning and surface realization tasks into one statistical learning process. Our novel approach is based on deriving a template bank automatically from a corpus of texts from a target domain. First, we identify domain specific entity tags and Discourse Representation Structures on a per sentence basis. Each sentence is then organized into semantically similar groups (representing a domain specific concept) by k-means clustering. After this semi-automatic processing (human review of cluster assignments), a number of corpus-level statistics are compiled and used as features by a ranking SVM to develop model weights from a training corpus. At generation time, a set of input data, the collection...

GenNext: A Consolidated Domain Adaptable NLG System

Frank Schilder, Blake Howald, and Ravi Kondadadi.   GenNext: A Consolidated Domain Adaptable NLG System.  Proceedings of the 14th European Workshop on Natural Language Generation, 178--182, 2013.
http://www.aclweb.org/anthology/W13-2124

We introduce GenNext, an NLG system designed specifically to adapt quickly and easily to different domains. Given a domain corpus of historical texts, GenNext allows the user to generate a template bank organized by semantic concept via derived discourse representation structures in conjunction with general and domain-specific entity tags. Based on various features collected from the training corpus, the system statistically learns template representations and document structure and produces well-formed texts (as evaluated by crowdsourced and expert evaluations). In addition to domain adaptation, the GenNext hybrid approach significantly reduces complexity as compared to traditional NLG systems by relying on templates (consolidating micro-planning and surface realization) and...

Domain Adaptable Semantic Clustering in Statistical NLG

Blake Howald, Ravikumar Kondadadi, and Frank Schilder.   Domain Adaptable Semantic Clustering in Statistical NLG.  Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) -- Long Papers, 143--154, 2013.
http://www.aclweb.org/anthology/W13-0113

We present a hybrid natural language generation system that utilizes Discourse Representation Structures (DRSs) for statistically learning syntactic templates from a given domain of discourse in sentence micro planning. In particular, given a training corpus of target texts, we extract semantic predicates and domain general tags from each sentence and then organize the sentences using supervised clustering to represent the conceptual meaning of the corpus. The sentences, additionally tagged with domain specific information (determined separately), are reduced to templates. We use a SVM ranking model trained on a subset of the corpus to determine the optimal template during generation. The combination of the conceptual unit, a set of ranked syntactic templates, and a given set of...

Next Generation Legal Search - It's Already Here

Qiang Lu and Jack G. Conrad  Next Generation Legal Search - It's Already Here.  VoxPopuLII blog, Legal Information Institute (LII), Cornell University, 2013.
http://blog.law.cornell.edu/voxpop/2013/03/28/next-generation

Editor's Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), "Bringing order to legal documents: An issue-based recommendation system via cluster association", and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances...

Evaluating Entity Linking with Wikipedia

Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran  Evaluating Entity Linking with Wikipedia.  Artificial Intelligence, 194, 130-150, 2013.
http://www.sciencedirect.com/science/article/pii/S0004370212000446

Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or NIL. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets. We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account...

The Significance of Evaluation in AI and Law: A Case Study Re-examining ICAIL Proceedings

Jack G. Conrad and John Zeleznikow.   The Significance of Evaluation in AI and Law: A Case Study Re-examining ICAIL Proceedings.  Proceedings of the 14th International Conference on Artificial Intelligence and Law (ICAIL), 186-191, 2013.

This paper examines the presence of performance evaluation in works published at ICAIL conferences since 2000. As such, it is a self-reflexive, meta-level study that investigates the proportion of works that include some form of performance assessment in their contribution. It also reports on the categories of evaluation present as well as their degree. In addition the paper compares current trends in performance measurement with those of earlier ICAILs, as reported in the Hall and Zeleznikow work on the same topic (ICAIL 2001). The paper also develops an argument for why evaluation in formal Artificial Intelligence and Law reports such as ICAIL proceedings is imperative. It underscores the importance of answering the question: how good is the system?, how reliable is the approach?,...

Ants find the shortest path: A mathematical Proof

Jayadeva, Sameena Shah, A. Bhaya, R. Kothari, and S. Chandra  Ants find the shortest path: A mathematical Proof.  Swarm Intelligence, 7, 43-62, 2013.

In the most basic application of Ant Colony Optimization (ACO), a set of artificial ants find the shortest path between a source and a destination. Ants deposit pheromone on paths they take, preferring paths that have more pheromone on them. Since shorter paths are traversed faster, more pheromone accumulates on them in a given time, attracting more ants and leading to reinforcement of the pheromone trail on shorter paths. This is a positive feedback process that can also cause trails to persist on longer paths, even when a shorter path becomes available. To counteract this persistence on a longer path, ACO algorithms employ remedial measures, such as using negative feedback in the form of uniform evaporation on all paths. Obtaining high performance in ACO algorithms typically requires...
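
The reinforcement loop described here (deposit, preference for stronger trails, uniform evaporation) can be sketched as a deterministic, expected-value update on two paths. This is only an illustrative toy and not the model analysed in the paper: the linear choice rule, the deposit of 1/length per unit of traffic, and all parameter values are assumptions.

```python
def simulate_two_paths(short_len=1.0, long_len=2.0, evaporation=0.1,
                       steps=200, initial=(1.0, 1.0)):
    """Mean-field pheromone dynamics on a short and a long path.

    Each step, the fraction of ants choosing a path equals its share of
    the total pheromone. Deposits are inversely proportional to path
    length (shorter paths are traversed faster), and a fixed fraction of
    pheromone evaporates uniformly on both paths.
    """
    tau_short, tau_long = initial
    for _ in range(steps):
        total = tau_short + tau_long
        p_short = tau_short / total
        p_long = tau_long / total
        tau_short = (1 - evaporation) * tau_short + p_short / short_len
        tau_long = (1 - evaporation) * tau_long + p_long / long_len
    return tau_short, tau_long


tau_short, tau_long = simulate_two_paths()
share_short = tau_short / (tau_short + tau_long)
print(f"share of pheromone on the short path: {share_short:.4f}")
```

In this simplified model the short path's share grows at every step, so it does not reproduce the persistence on longer paths that the abstract mentions; that effect depends on dynamics this sketch leaves out.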

Making Structured Data Searchable via Natural Language Generation with an Application to ESG Data

Jochen L. Leidner and Darya Kamkova.   Making Structured Data Searchable via Natural Language Generation with an Application to ESG Data.  Proceedings of the 10th International Conference Flexible Query Answering Systems (FQAS 2013), Granada, Spain, September 18-20, 2013, Lecture Notes in Computer Science, 8132, 495--506, 2013.

Relational Databases are used to store structured data, which is typically accessed using report builders based on SQL queries. To search, forms need to be understood and filled out, which demands a high cognitive load. Due to the success of Web search engines, users have become acquainted with the easier mechanism of natural language search for accessing unstructured data. However, such keyword-based search methods are not easily applicable to structured data, especially where structured records contain non-textual content such as numbers. We present a method to make structured data, including numeric data, searchable with a Web search engine-like keyword search access mechanism. Our method is based on the creation of surrogate text documents using Natural Language Generation (NLG)...

Stock Prediction Using Event-based Sentiment Analysis

M Makrehchi, Sameena Shah, and W. Liao.   Stock Prediction Using Event-based Sentiment Analysis.  Proceedings of IEEE/ACM International Conference on Web Intelligence, 2013.

We propose a novel approach to label social media text using significant stock market events (big losses or gains). Since stock events are easily quantifiable using returns from indices or individual stocks, they provide meaningful and automated labels. We extract significant stock movements and collect appropriate pre, post and contemporaneous text from social media sources (for example, tweets from twitter). Subsequently, we assign the respective label (positive or negative) for each tweet. We train a model on this collected set and make predictions for labels of future tweets. We aggregate the net sentiment per each day (amongst other metrics) and show that it holds significant predictive power for subsequent stock market movement. We create successful trading strategies based on...
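
The labelling and aggregation steps in this abstract can be sketched as follows. The 2% threshold, the same-day alignment of tweets with returns, and the function names are assumptions for illustration; the paper's own event definition, text windows (pre, post, contemporaneous), and classifier are not reproduced here.

```python
def label_tweets_by_events(daily_returns, tweets, threshold=0.02):
    """Label tweets posted on days with a significant market move.

    daily_returns: dict mapping day -> return (e.g. 0.03 for +3%).
    tweets: iterable of (day, text) pairs.
    threshold: hypothetical cut-off for a "significant" move.
    """
    labelled = []
    for day, text in tweets:
        r = daily_returns.get(day)
        if r is None or abs(r) < threshold:
            continue                      # no significant event that day
        labelled.append((text, "positive" if r > 0 else "negative"))
    return labelled


def net_sentiment_per_day(predictions):
    """Aggregate predicted labels into a net score per day."""
    net = {}
    for day, label in predictions:
        net[day] = net.get(day, 0) + (1 if label == "positive" else -1)
    return net


returns = {"2013-01-02": 0.025, "2013-01-03": -0.031, "2013-01-04": 0.004}
tweets = [("2013-01-02", "great rally today"),
          ("2013-01-03", "ugly sell-off"),
          ("2013-01-04", "quiet session")]
training_set = label_tweets_by_events(returns, tweets)

net = net_sentiment_per_day([("d1", "positive"), ("d1", "positive"),
                             ("d1", "negative"), ("d2", "negative")])
```

The labelled pairs would then serve as training data for a text classifier, whose daily aggregated predictions play the role of `net` above.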

Benchmarks for Enterprise Linking: Thomson Reuters R&D at TAC 2013

Thomas Vacek, Hiroko Bretz, Frank Schilder, and Ben Hachey.   Benchmarks for Enterprise Linking: Thomson Reuters R&D at TAC 2013.  Proceedings of the Text Analysis Conference (TAC), 2013.

2012

Event Linking: Grounding Event Reference in a News Archive

Joel Nothman, Matthew Honnibal, Ben Hachey, and James R. Curran.   Event Linking: Grounding Event Reference in a News Archive.  Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 228-232, 2012.
http://www.aclweb.org/anthology/P12-2045

Interpreting news requires identifying its constituent events. Events are complex linguistically and ontologically, so disambiguating their reference is challenging. We introduce event linking, which canonically labels an event reference with the article where it was first reported. This implicitly relaxes coreference to co-reporting, and will practically enable augmenting news archives with semantic hyperlinks. We annotate and analyse a corpus of 150 documents, extracting 501 links to a news archive with reasonable inter-annotator agreement.

Bringing Order to Legal Documents - An Issue-based Recommendation System Via Cluster Association

Qiang Lu and Jack G. Conrad.   Bringing Order to Legal Documents - An Issue-based Recommendation System Via Cluster Association.  KEOD, 76-88, 2012.

The task of recommending content to professionals (such as attorneys or brokers) differs greatly from the task of recommending news to casual readers. A casual reader may be satisfied with a couple of good recommendations, whereas an attorney will demand precise and comprehensive recommendations from various content sources when conducting legal research. Legal documents are intrinsically complex and multi-topical, contain carefully crafted, professional, domain specific language, and possess a broad and unevenly distributed coverage of issues. Consequently, a high quality content recommendation system for legal documents requires the ability to detect significant topics from a document and recommend high quality content accordingly. Moreover, a litigation attorney preparing for a case...

A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law

Trevor J.M. Bench-Capon, Michal Araszkiewicz, Kevin D. Ashley, Katie Atkinson, Floris Bex, Filipe Borges, Daniele Bourcier, Paul Bourgine, Jack G. Conrad, Enrico Francesconi, Thomas F. Gordon, Guido Governatori, Jochen L. Leidner, David D. Lewis, Ronald Prescott Loui, L. Thorne McCarty, Henry Prakken, Frank Schilder, Erich Schweighofer, Paul Thompson, Alex Tyrrell, Bart Verheij, Douglas N. Walton, and Adam Zachary Wyner  A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law.  Artif. Intell. Law, 20, 215-319, 2012.
http://dx.doi.org/10.1007/s10506-012-9131-x

Convergence of the Dynamic Load Balancing Problem to Nash Equilibrium using Distributed Local Interactions

Sameena Shah and R. Kothari  Convergence of the Dynamic Load Balancing Problem to Nash Equilibrium using Distributed Local Interactions.  Information Sciences, 221, 297-305, 2012.

Load balancers distribute workload across multiple nodes based on a variation of the round robin algorithm, or a more complex algorithm that optimizes a specified objective or allows for horizontal scalability and higher availability. In this paper, we investigate whether robust load balancing can be achieved using a local co-operative mechanism between the resources (nodes). The local aspect of the mechanism implies that each node interacts with a small subset of the nodes that define its neighborhood. The co-operative aspect of the mechanism implies that a node may offload some of its load to its neighbor nodes that have lesser load or accept jobs from neighbor nodes that have higher load. Each node is thus only aware of the state of its neighboring nodes and there is no central entity...
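
The local co-operative mechanism described above can be sketched with a toy simulation in which each node looks only at its own neighbors and hands one job to the least-loaded neighbor when the gap exceeds one job. The ring topology, the one-job transfer rule, and the stopping condition are assumptions for illustration, not the paper's protocol or its convergence analysis.

```python
def balance(loads, neighbors, max_rounds=10_000):
    """Repeatedly offload single jobs to less-loaded neighbors.

    loads: list of job counts per node.
    neighbors: neighbors[i] is a tuple of node indices adjacent to node i.
    Stops when no node can improve by moving a job to a neighbor.
    """
    loads = list(loads)
    for _ in range(max_rounds):
        moved = False
        for i, nbrs in enumerate(neighbors):
            j = min(nbrs, key=lambda k: loads[k])   # least-loaded neighbor
            if loads[i] - loads[j] > 1:             # move only if it helps
                loads[i] -= 1
                loads[j] += 1
                moved = True
        if not moved:
            break
    return loads


# Five nodes on a ring; all work starts on node 0.
ring = [((i - 1) % 5, (i + 1) % 5) for i in range(5)]
final = balance([20, 0, 0, 0, 0], ring)
print(final)
```

No node consults global state: the loop terminates when every node's load is within one job of each of its neighbors, and the total number of jobs is conserved throughout.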

2011

The Role of HLT in High-end Search and the Persistent Need for Advanced HLT Technologies

Jack G. Conrad  The Role of HLT in High-end Search and the Persistent Need for Advanced HLT Technologies.  Invited Talk, Workshop on Applying Human Language Technologies to Law (AHLTL 2011), held in conjunction with The Thirteenth International Conference on Artificial Intelligence and Law (ICAIL11), Pittsburgh, PA, 2011.

This talk will first address the multiple 'views' into legal materials that are harnessed by today's high-end legal search engines. These dimensions include the traditional document view (e.g., tf.idf scoring of a document's terms relative to a query), the taxonomic view (the classification of a candidate document using an expansive legal taxonomy such as the Key Number System), the citation network view (where legal documents are characterized by numerous citations, both in-bound and out-bound, some which remain based on solid decisions and some which may be weakened by subsequent judicial opinions), and the user view (records of thousands of user interactions with candidate documents including views, prints, cites, etc.). This is hardly a Saltonian search engine applied to legal...

Public Record Aggregation Using Semi-supervised Entity Resolution

Jack G. Conrad, Christopher Dozier, Hugo Molina-Salgado, Merine Thomas, and Sriharsha Veeramachaneni.   Public Record Aggregation Using Semi-supervised Entity Resolution.  Proceedings of the 13th International Conference on Artificial Intelligence and Law (ICAIL 2011), 239-248, 2011.
http://www.law.pitt.edu/events/2011/06/icail-2011-the-thirtee... 

This paper describes a highly scalable state of the art record aggregation system and the backbone infrastructure developed to support it. The system, called PeopleMap, allows legal professionals to effectively and efficiently explore a broad spectrum of public records databases by way of a single person-centric search. The backbone support system, called Concord, is a toolkit that allows developers to economically create record resolution solutions. The PeopleMap system is capable of linking billions of public records to a master data set consisting of hundreds of millions of person records. It was constructed using successive applications of Concord to link disparate public record data sets to a central person authority file. To our knowledge, the PeopleMap system is the largest of...

Review of: Handbook of Natural Language Processing (second edition) Nitin Indurkhya and Fred J. Damerau (editors) (University of New South Wales; IBM Thomas J. Watson Research Center) Boca Raton, FL: CRC Press, 2010, xxxiii+678 pp; hardbound, ISBN 978-1-4200-8592-1

Jochen L. Leidner  Review of: Handbook of Natural Language Processing (second edition) Nitin Indurkhya and Fred J. Damerau (editors) (University of New South Wales; IBM Thomas J. Watson Research Center) Boca Raton, FL: CRC Press, 2010, xxxiii+678 pp; hardbound, ISBN 978-1-4200-8592-1.  Computational Linguistics, 37, 395--397, 2011.

Detecting geographical references in the form of place names and associated spatial natural language

Jochen L. Leidner and Michael D. Lieberman  Detecting geographical references in the form of place names and associated spatial natural language.  SIGSPATIAL Special, 3, 5--11, 2011.

Legal Document Clustering With Built-in Topic Segmentation

Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, and William Keenan.   Legal Document Clustering With Built-in Topic Segmentation.  Proceedings of the 20th International Conference on Information and Knowledge Management (CIKM-11), 2011.

Clustering is a useful tool for helping users navigate, summarize, and organize large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with various degrees of success. Some unique characteristics of legal content as well as the nature of the legal domain present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession is still largely a fee-for-service field...

Summarize this! - Recipes for multi-lingual automatic summarization

Frank Schilder and Liang Zhou (2011).  In Multilingual Natural Language Applications: From Theory to Practice, Imed Zitouni and Daniel M. Bikel (Eds.), Summarize this! - Recipes for multi-lingual automatic summarization.  IBM Press.

2010

Using Cross-Lingual Projections to Generate Semantic Role Labeled Annotated Corpus for Urdu - A Resource Poor Language

Smruthi Mukund, Debanjan Ghosh, and Rohini Srihari.   Using Cross-Lingual Projections to Generate Semantic Role Labeled Annotated Corpus for Urdu - A Resource Poor Language.  Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 797--805, 2010.
http://www.aclweb.org/anthology/C10-1090

In this paper we explore the possibility of using cross lingual projections that help to automatically induce role-semantic annotations in the PropBank paradigm for Urdu, a resource poor language. This technique provides annotation projections based on word alignments. It is relatively inexpensive and has the potential to reduce human effort involved in creating semantic role resources. The projection model exploits lexical as well as syntactic information on an English-Urdu parallel corpus. We show that our method generates reasonably good annotations with an accuracy of 92% on short structured sentences. Using the automatically generated annotated corpus, we conduct preliminary experiments to create a semantic role labeler for Urdu...

Hunting for the Black Swan: Risk Mining from Text

Jochen Leidner and Frank Schilder.   Hunting for the Black Swan: Risk Mining from Text.  Proceedings of the ACL 2010 System Demonstrations, 54--59, 2010.
http://www.aclweb.org/anthology/P10-4010

In the business world, analyzing and dealing with risk permeates all decisions and actions. However, to date, risk identification, the first step in the risk management cycle, has always been a manual activity with little to no intelligent software tool support. In addition, although companies are required to list risks to their business in their annual SEC filings in the USA, these descriptions are often very high-level and vague. In this paper, we introduce Risk Mining, which is the task of identifying a set of risks pertaining to a business area or entity. We argue that by combining Web mining and Information Extraction (IE) techniques, risks can be detected automatically before they materialize, thus providing valuable business intelligence. We describe a system that induces a risk...

Brain connectivity analysis by reduction to pair classification

Emanuele Olivetti, Sriharsha Veeramachaneni, Susanne Greiner, and Paolo Avesani.   Brain connectivity analysis by reduction to pair classification.  Proceedings of 2nd International Workshop on Cognitive Information Processing (CIP), 275-280, 2010.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5604101

Brain connectivity studies aim at describing the connections within the brain. Diffusion and functional MRI techniques provide different kinds of information to understand brain connectivity non-invasively. Fiber tract segmentation is the task of identifying pathways of neuronal axons connecting different brain areas from MRI data. In this work we propose a method to investigate the role of both diffusion and functional MRI data for supervised tract segmentation based on learning the pairwise relationships between streamlines. Experiments on real data demonstrate the promise of the approach.

Concord - A Tool that Automates the Construction of Record Resolution Systems

Christopher Dozier, Hugo Molina-Salgado, Merine Thomas, and Sriharsha Veeramachaneni.   Concord - A Tool that Automates the Construction of Record Resolution Systems.  Proceedings of the Workshop on Named Entity Resolution at the Eighth International Conference on Language Resources and Evaluation (LREC 2010), 2010.

We describe an application we created called Concord that enables software engineers to build and execute Java based record resolution systems (RRS) quickly. Concord allows developers to interactively configure a RRS by specifying match feature functions, blocking functions, and unsupervised machine learning methods for a specific resolution problem. From the developer's defined configuration parameters, Concord creates a Java based RRS that generates training data, learns a matching model and resolves the records in the input files. As far as we know, Concord is unique among RRS generators in that it allows users to select feature functions which are customized for particular field types and in that it allows users to create matching models in a novel unsupervised way using a...

Book Review: Representation and Management of Narrative Information: Theoretical Principles and Implementation

Frank Schilder  Book Review: Representation and Management of Narrative Information: Theoretical Principles and Implementation.  Computational Linguistics, 36, 151-156, 2010.

Gian Piero Zarri's book summarizes more than a decade of his research on knowledge representation for narrative text. The centerpiece of Zarri's work is the Narrative Knowledge Representation Language (NKRL), which he describes and compares to other competing theories. In addition, he discusses how to model the meaning of narrative text by giving many real-world examples. NKRL provides three different components or capabilities: (a) a representation system, (b) inferencing, and (c) an implementation. It is implemented via a Java-based system that shows how a representational theory can be applied to narrative texts.

Building and Operating a Hadoop/MapReduce Cluster from Commodity Components: A Case Study

Jochen L. Leidner and Gary Berosik  Building and Operating a Hadoop/MapReduce Cluster from Commodity Components: A Case Study.  ;login:, 26--37, 2010.
http://www.usenix.org/publications/login/2010-02/openpdfs/leidner.pdf

This tutorial presents a recipe for the construction of a compute cluster for processing large volumes of data, using cheap, easily available personal computer hardware (Intel/AMD based PCs) and freely available open source software (Ubuntu Linux, Apache Hadoop).

E-Discovery Revisited: the Need for Artificial Intelligence beyond Information Retrieval

Jack G. Conrad  E-Discovery Revisited: the Need for Artificial Intelligence beyond Information Retrieval.  Artificial Intelligence and Law, 18, 1-25, 2010.
http://dx.doi.org/10.1007/s10506-010-9096-6

In this work, we provide a broad overview of the distinct stages of E-Discovery. We portray them as an interconnected, often complex workflow process, while relating them to the general Electronic Discovery Reference Model (EDRM). We start with the definition of E-Discovery. We then describe the very positive role that NIST's Text REtrieval Conference (TREC) has added to the science of E-Discovery, in terms of the tasks involved and the evaluation of the legal discovery work performed. Given the critical nature that data analysis plays at various stages of the process, we present a pyramid model, which complements the EDRM model: for gathering and hosting; indexing; searching and navigating; and finally consolidating and summarizing E-Discovery findings. Next we discuss where the...

Filter-based Data Partitioning for Training Multiple Classifier Systems

Rozita A. Dara, Masoud Makrehchi, and Mohamed S. Kamel  Filter-based Data Partitioning for Training Multiple Classifier Systems.  IEEE Transactions on Knowledge and Data Engineering, 22, 508-522, 2010.

Data partitioning methods such as bagging and boosting have been extensively used in multiple classifier systems. These methods have shown a great potential for improving classification accuracy. This study is concerned with the analysis of training data distribution and its impact on the performance of multiple classifier systems. In this study, several feature-based and class-based measures are proposed. These measures can be used to estimate statistical characteristics of the training partitions. To assess the effectiveness of different types of training partitions, we generated a large number of disjoint training partitions with distinctive distributions. Then, we empirically assessed these training partitions and their impact on the performance of the system by utilizing the...

Simultaneous measurement of RBC velocity, flux, hematocrit and shear rate in vascular networks

Walid S Kamoun, Sung-Suk Chae, Delphine A Lacorre, James A Tyrrell, Mariela Mitre, Marijn A Gillissen, Dai Fukumura, Rakesh K Jain, and Lance L Munn  Simultaneous measurement of RBC velocity, flux, hematocrit and shear rate in vascular networks.  Nature Methods, 7, 655-660, 2010.
http://www.nature.com/nmeth/journal/v7/n8/full/nmeth.1475.html

Not all tumor vessels are equal. Tumor-associated vasculature includes immature vessels, regressing vessels, transport vessels undergoing arteriogenesis and peritumor vessels influenced by tumor growth factors. Current techniques for analyzing tumor blood flow do not discriminate between vessel subtypes and only measure average changes from a population of dissimilar vessels. We developed methodologies for simultaneously quantifying blood flow (velocity, flux, hematocrit and shear rate) in extended networks at single-capillary resolution in vivo. Our approach relies on deconvolution of signals produced by labeled red blood cells as they move relative to the scanning laser of a confocal or multiphoton microscope and provides fully resolved three-dimensional flow profiles within vessel...

Unsupervised Learning for Reranking-based Patent Retrieval

Wenhui Liao and Sriharsha Veeramachaneni.   Unsupervised Learning for Reranking-based Patent Retrieval.  3rd International Workshop on Patent Information Retrieval, in 19th ACM Conference on Information and Knowledge Management (CIKM), 2010.

We present a reranking-based patent retrieval system where the query text is a patent claim, which may be from an existing patent. The novelty of our approach is the automatic generating of training data for learning the ranker. The ranking is based on several features of the candidate patent, such as the text similarity to the claim, international patent code overlap, and internal citation structure of the candidates. Our approach more than doubles the average number of relevant patents in the top 5 over a strong baseline retrieval system.

An Information Theoretic Approach to Generating Fuzzy Hypercubes for If-Then Classifiers

Masoud Makrehchi and M.S. Kamel  An Information Theoretic Approach to Generating Fuzzy Hypercubes for If-Then Classifiers.  Journal of Intelligent and Fuzzy Systems, 21, 2010.

In this paper, a framework for automatic generation of fuzzy membership functions and fuzzy rules from training data is proposed. The main focus of this paper is designing fuzzy if-then classifiers; however the proposed method can be employed in designing a wide range of fuzzy system applications. After the fuzzy membership functions are modeled by their supports, an optimization technique, based on a multi-objective real coded genetic algorithm with adaptive cross over and mutation probabilities, is implemented to find near optimal supports. Employing interpretability constraint in parameter representation and encoding, we ensure that the generated fuzzy membership function does have a semantic meaning. The fitness function of the genetic algorithm, which estimates the quality of the...

1999

Name Recognition and Retrieval Performance

Paul Thompson and Christopher Dozier (1999).  In Natural Language Information Retrieval, Tomek Strzalkowski (Ed.), Name Recognition and Retrieval Performance.  (pp. 261--272). Dordrecht: Kluwer Academic.
http://www.amazon.com/gp/product/0792356853

The main application of name searching has been name matching in a database of names. This paper discusses a different application: improving information retrieval through name recognition. It investigates name recognition accuracy, and the effect on retrieval performance of indexing and searching personal names differently from non-name terms in the context of ranked retrieval. The main conclusions are: that name recognition in text can be effective; that names occur frequently enough in a variety of domains, including those of legal documents and news databases, to make recognition worthwhile; and that retrieval performance can be improved using name searching.

Genetic Algorithms

Ken Williams and Brad Murray  Genetic Algorithms.  The Perl Journal, 4, 1999.
http://www.foo.be/docs/tpj/issues/vol4_3/tpj0403-0005.html 

Evolving algebraic expressions.

1998

The Structure of Judicial Opinions: Identifying Internal Components and their Relationships

Jack G. Conrad and Daniel P. Dabney.   The Structure of Judicial Opinions: Identifying Internal Components and their Relationships.  Proceedings of the 5th International ISKO Conference (ISKO-98), Structures and Relations in Knowledge Organization, 413 ff., 1998.

Empirical research on basic components of American judicial opinions has only scratched the surface. Lack of a coordinated pool of legal experts or adequate computational resources are but two reasons responsible for this deficiency. We have undertaken a three phase study to uncover fundamental components of judicial opinions found in American case law. The study was aided by a team of twelve expert attorney-editors with a combined total of 135 years of legal editing experience. The hypothesis underlying the experiment was that after years of working closely with thousands of judicial opinions, expert attorneys would develop a refined and internalized schema of the content and structure of legal cases. In this study participants were permitted to describe both concept-related and...

1997

Name Searching and Information Retrieval

Paul Thompson and Christopher Dozier.   Name Searching and Information Retrieval.  Proceedings of 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP 1997), 134--140, 1997.

The main application of name searching has been name matching in a database of names. This paper discusses a different application: improving information retrieval through name recognition. It investigates name recognition accuracy, and the effect on retrieval performance of indexing and searching personal names differently from non-name terms in the context of ranked retrieval. The main conclusions are: that name recognition in text can be effective; that names occur frequently enough in a variety of domains, including those of legal documents and news databases, to make recognition worthwhile; and that retrieval performance can be improved using name searching.

1996

Uncertainty in Information Retrieval Systems

Howard R. Turtle and W. Bruce Croft (1996).  In Uncertainty Management in Information Systems, Uncertainty in Information Retrieval Systems.  (pp. 189-224).

Any effective retrieval system includes three major components: the identification and representation of document content, the acquisition and representation of the information need, and the specification of a matching function that selects relevant documents based on these representations. Uncertainty must be dealt with in each of these components.

1995

Text Retrieval in the Legal World

Howard R. Turtle  Text Retrieval in the Legal World.  Artificial Intelligence and Law, 3, 5-54, 1995.

The ability to find relevant materials in large document collections is a fundamental component of legal research. The emergence of large machine-readable collections of legal materials has stimulated research aimed at improving the quality of the tools used to access these collections. Important research has been conducted within the traditional information retrieval, the artificial intelligence, and the legal communities with varying degrees of interaction between these groups. This article provides an introduction to text retrieval and surveys the main research related to the retrieval of legal materials.

Query Evaluation: Strategies and Optimizations

Howard R. Turtle and James Flood  Query Evaluation: Strategies and Optimizations.  Information Processing & Management, 31, 831-850, 1995.
http://dx.doi.org/10.1016/0306-4573(95)00020-H

This paper discusses the two major query evaluation strategies used in large text retrieval systems and analyzes the performance of these strategies. We then discuss several optimization techniques that can be used to reduce evaluation costs and present simulation results to compare the performance of these optimization techniques when evaluating natural language queries with a collection of full text legal materials.

1994

A System for Discovering Relationships by Feature Extraction from Text Databases

Jack G. Conrad and Mary Hunter Utt.   A System for Discovering Relationships by Feature Extraction from Text Databases.  Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), 260-270, 1994.

A method for accessing text-based information using domain-specific features rather than documents alone is presented. The basis of this approach is the ability to automatically extract features from large text databases, and identify statistically significant relationships or associations between those features. The techniques supporting this approach are discussed, and examples from an application using these techniques, named the Associations System, are illustrated using the Wall Street Journal database. In this particular application, the features extracted are company and person names. The series of tests run on the Associations System demonstrate that feature extraction can be quite accurate, and that the relationships generated are reliable. In addition to conventional measures...

TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System

Paul Thompson, Howard R. Turtle, Bokyung Yang, and James Flood.   TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System.  TREC, 1-7, 1994.

The WIN retrieval engine is West's implementation of the inference network retrieval model. The inference net model ranks documents based on the combination of different evidence, e.g., text representations, such as words, phrases, or paragraphs, in a consistent probabilistic framework. WIN is based on the same retrieval model as the INQUERY system that has been used in previous TREC competitions. The two retrieval engines have common roots but have evolved separately -- WIN has focused on the retrieval of legal materials from large (>50 gigabyte) collections in a commercial online environment that supports both Boolean and natural language retrieval. For TREC-3 we decided to run an essentially unmodified version of WIN to see how well a state-of-the-art commercial system compares to...

Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance

Howard R. Turtle.   Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance.  Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), 212-220, 1994.

The results of experiments comparing the relative performance of natural language and Boolean query formulations are presented. The experiments show that on average a current generation natural language system provides better retrieval performance than expert searchers using a Boolean retrieval system when searching full-text legal materials. Methodological issues are reviewed and the effect of database size on query formulation strategy is discussed.