Natural language processing
Language (spoken and written) is unique to humans and is at the center of our social and business interactions. Language enables us to communicate, collaborate, negotiate and socialize with each other. It allows us to record our own experiences, learn from others, share knowledge, and preserve and advance civilization. At Thomson Reuters, we operate in language- and text-rich industries. Laws, regulations, news, disputes and business transactions are all captured in text. The amount of text is growing exponentially, and the ability to process and act upon it is a competitive advantage for all of our customers.
The ability to process massive amounts of text, to mine it for insights and information nuggets, and to organize it, connect it, contrast it, understand it and answer questions over it is of utmost importance for our customers and for us. This is why natural language processing (NLP) has been one of our core research areas for the last 20 years.
But what exactly is NLP? NLP is a sub-area of Artificial Intelligence (AI) research that focuses on the processing of language, either in the form of text or speech. More generally, it can be defined as the modeling of how signs (sounds, characters, words) representing some meaning are used to fulfill a pre-defined task such as translation, summarization or question answering. In contrast to linguistic studies, NLP is an engineering discipline: it focuses on accomplishing a pre-defined task rather than on arriving at a deeper understanding of language.
Natural language processing is one of the most active research areas in AI and provides a rich target for machine learning research as well. There is a plethora of tools from academia and industry for various NLP tasks, including tokenizers, part-of-speech taggers, chunkers, parsers and classifiers, as well as tools to mine for concepts, entities and relationships. Some of the popular tools used by researchers include NLTK, AllenNLP, StanfordNLP and spaCy. These tools provide the basic building blocks for more complex NLP tasks such as question answering and summarization.
Some of these tools already operate at human-level performance, especially when restricted to well-researched domains such as news, but the performance of most tools drops significantly when they are applied to other domains. Domain adaptation is the process of extending a tool (e.g., a named entity extractor) that performs well on one domain (e.g., news) to another domain (e.g., the law). Domain adaptation of the various NLP tools is one of our active areas of research. Sometimes all we need is domain-specific training data; often the process is much more nuanced.
The objectives of our NLP research span our editorial processes as well as our customer-facing products. On the editorial front, the primary focus is on building tools for mining, enhancing and organizing content. Some of these tools are meant to augment and scale our editorial staff (e.g., classifying legal summaries to the Key Number taxonomy – a topical taxonomy of 100,000 topics), others are meant to run automatically (e.g., various named entity extraction and resolution processes). On the product front, we are typically focused on higher level objectives (e.g., question answering, document analysis) but these objectives often require lower level NLP building blocks (e.g., concept and relationship mining). With this context in mind, our NLP focus areas span a wide range of NLP problems including:
Morphological analysis (e.g., decompounding), tokenization, sentence boundary detection, named entity extraction and resolution, concept and terms-of-art extraction, relation extraction, record linkage, classification, question answering, single- and multi-document summarization, language generation, risk mining, document zoning, abnormality and deviation analysis, language models, syntactic and semantic similarity, etc.
Our primary focus is to solve problems and create capabilities in a scalable way. This often requires a mixture of tools and resources. As such, we are versed in both classical approaches (e.g., parsers and rule-based systems) and machine learning approaches, including the most recent deep learning methodologies.
The ability to classify text spans and documents to various topical taxonomies is critical for the findability problem and for tackling the information overload problem. For example, our legal business has dozens if not hundreds of topical taxonomies, with some (e.g., the Key Number system) containing more than 100,000 topics. Organizing legal content under these taxonomies is a task that has been done manually by attorney editors for decades, and in some cases since the late 1800s. Given the diversity of our classification tasks, we opted to develop a classification framework (CaRE) that deploys an ensemble of classifiers, each optimized for a specific content type and feature set. Classifier decisions are then combined using a set of meta-classifiers whose output is passed to a decision layer. The framework is highly configurable, and each set of classifiers can be trained independently. This framework is widely used across Thomson Reuters for content classification, either in fully automated mode or to machine-assist manual tasks.
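The three stages described above (base classifiers, a meta-classifier that combines their decisions, and a decision layer) can be illustrated with a minimal sketch. This is not the CaRE implementation: the keyword-based base classifiers, the weighted-average combiner, the thresholds and every name below are assumptions chosen only to show the shape of such a pipeline.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A base classifier maps a text to {topic: score in [0, 1]}.
Classifier = Callable[[str], Dict[str, float]]


def keyword_classifier(keywords: Dict[str, List[str]]) -> Classifier:
    """Toy base classifier (hypothetical): the score for a topic is the
    fraction of that topic's keywords present in the text."""
    def classify(text: str) -> Dict[str, float]:
        tokens = set(text.lower().split())
        return {topic: sum(w in tokens for w in words) / len(words)
                for topic, words in keywords.items()}
    return classify


def meta_combine(votes: List[Dict[str, float]],
                 weights: List[float]) -> Dict[str, float]:
    """Toy meta-classifier: weighted average of the base scores per topic.
    A real system would typically learn this combination from data."""
    combined: Dict[str, float] = defaultdict(float)
    total = sum(weights)
    for scores, weight in zip(votes, weights):
        for topic, score in scores.items():
            combined[topic] += weight * score / total
    return dict(combined)


def decide(scores: Dict[str, float], accept: float = 0.6,
           review: float = 0.3) -> List[Tuple[str, str]]:
    """Decision layer: assign automatically, route to an editor, or drop."""
    decisions = []
    for topic, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score >= accept:
            decisions.append((topic, "auto"))
        elif score >= review:
            decisions.append((topic, "review"))
    return decisions


# Two base classifiers, each looking at a different (assumed) feature set.
headnote_clf = keyword_classifier({
    "fraud": ["fraud", "misrepresentation"],
    "contracts": ["contract", "breach"],
})
statute_clf = keyword_classifier({
    "fraud": ["deceit", "fraud"],
    "contracts": ["agreement", "contract"],
})

text = "The plaintiff alleged fraud and misrepresentation in the contract"
votes = [headnote_clf(text), statute_clf(text)]
combined = meta_combine(votes, weights=[2.0, 1.0])
decisions = decide(combined)
print(decisions)
```

In this toy run the "fraud" topic clears the automatic threshold while "contracts" lands in the review band, which mirrors the two modes mentioned above: fully automated assignment and machine-assisted manual work. Because each base classifier is just a callable, one can be retrained or swapped without touching the others.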
Attorneys formulate their litigation strategies based on their own experiences. One of the primary objectives of the Litigation Analytics product is to help attorneys formulate legal strategies based on data: to help them estimate how long a matter in front of a particular judge will take, whether the opposing party is likely to settle or litigate to the end, whether or not to file a motion for summary judgment, and so on. The data to support these insights is readily available in US dockets, which contain a detailed ledger of activities in a lawsuit. But this data is represented in natural language and is spread across hundreds, if not thousands, of docket entries. Parsing this data and extracting who is asking for what (e.g., a motion) against whom, and what the judge decides, is a complex NLP task. We used an ensemble of rule-based approaches and recent deep learning methods (e.g., hierarchical RNNs) to transform unstructured, free-text document entries into a knowledge graph so that attorneys can find answers to questions such as how often a particular judge has granted a motion for summary judgment in an employment discrimination lawsuit or in other types of suits. Now imagine doing this at scale, with 8 million dockets, some of which contain thousands of entries, resulting in a knowledge graph with billions of nodes.
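To make the idea of turning docket entries into a queryable graph more concrete, here is a minimal sketch of the rule-based half only; the deep learning component is not shown. The entry wording, the regular expressions, the triple format and all names are hypothetical illustrations, not the production rules or real docket data.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical patterns for two common entry shapes.
MOTION = r"motion (?:for|to) (?P<motion>[a-z ]+?)"
FILED = re.compile(MOTION + r" by (?P<party>plaintiff|defendant)\b", re.I)
RULED = re.compile(
    r"order (?P<outcome>granting|denying) (?:\d+ )?" + MOTION
    + r"\. signed by judge (?P<judge>[a-z]+)",
    re.I,
)
OUTCOMES = {"granting": "granted", "denying": "denied"}


def extract_triples(entry: str) -> List[Triple]:
    """Map one free-text docket entry to zero or more graph edges."""
    ruled = RULED.search(entry)
    if ruled:
        return [(f"judge:{ruled['judge'].title()}",
                 OUTCOMES[ruled["outcome"].lower()],
                 f"motion:{ruled['motion'].lower()}")]
    filed = FILED.search(entry)
    if filed:
        return [(f"party:{filed['party'].lower()}",
                 "filed",
                 f"motion:{filed['motion'].lower()}")]
    return []  # entries the rules do not cover are skipped


# Invented entries, written only to exercise the patterns above.
entries = [
    "MOTION for Summary Judgment by Defendant Acme Corp.",
    "ORDER granting 12 Motion for Summary Judgment. Signed by Judge Smith.",
    "MOTION to Dismiss by Defendant Beta LLC.",
    "ORDER denying 7 Motion to Dismiss. Signed by Judge Smith.",
    "NOTICE of Appearance by Jane Doe.",
]

graph = [triple for entry in entries for triple in extract_triples(entry)]

# Query: how often has Judge Smith granted a motion for summary judgment?
grants = sum(
    1 for subj, pred, obj in graph
    if (subj, pred, obj) == ("judge:Smith", "granted", "motion:summary judgment")
)
print(graph)
print(grants)
```

Hand-written rules like these are precise on the entry shapes they anticipate but silently skip everything else, which is one plausible reason to pair them with learned models when entries vary as widely as they do across courts.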
In any information-intensive task, the ability to ask questions and receive answers is a critical component of the overall user information experience. Question answering can be a significant time saver, reducing hours to minutes or seconds. The nuanced nature of the legal domain makes this a particularly hard task. Our objective is not to answer simple factoid questions (e.g., who is the president of the United States) but to find answers to complex legal questions, whether they follow certain forms, e.g., “what are the elements of Fraud?” or are completely open ended, e.g., “Does an attorney-client relationship between an attorney and a corporation extend to shareholders?” Answering these questions requires a deeper semantic analysis of the questions, of the concepts and relationships in both the questions and the content, and a mapping between the two. We used supervised learning technologies to teach the algorithms (1) how to ‘read’ the law, (2) how to parse questions, and (3) how to answer questions. Currently, we are exploring the use of deep learning language models (e.g., BERT) on question answering tasks.