Machine learning (ML) is the study of algorithms that learn from data to perform specific tasks. Example tasks include classification, language translation, text mining (e.g., for terms of art), abnormality detection, recommendation, and ranking search results. From a practitioner’s point of view, machine learning is often seen as the scientist’s toolbox: a set of powerful tools that can be used to solve a wide range of problems across many domains.
At Thomson Reuters, we are in the business of building information-based solutions for professional knowledge workers. Our machine learning research focuses on solving business-relevant problems as opposed to inventing new machine learning algorithms. We are a text-heavy organization, and machine learning plays a critical role in what we do; it is embedded in our products and services. Our problem space spans a number of areas including, but not limited to, the following:
- Text Mining, including extracting terms of art, concepts, named entities, relations, risks and events.
- Concordance and Resolution, including resolving named entities to authority files, database concordance, and web-scale concordance creating knowledge graphs of billions of nodes and tens of billions of relationships.
- Content Organization, including document classification, extracting document structures (e.g., from PDF text images), clustering, and document deduplication (exact and fuzzy).
- Text Generation and Summarization, including single-document and multi-document summarization, and runtime snippet / seeded-summary generation.
- Research & Discovery, including search, vertical search, content- and behavior-based recommender systems, question answering over structured data and knowledge graphs, question answering over text and documents, dynamic related concepts, and dialog systems.
- Document Analysis, including analyzing such documents as contracts, legal briefs & motions and SEC filings. Tasks include extracting terms and conditions (e.g., of contracts), detecting abnormalities and risks (e.g., in clauses and SEC filings), updating old documents (e.g., legal briefs), and identifying weaknesses and gaps in documents (e.g., litigation) from opposing parties in adversarial environments.
- Analytics, including analyzing structured data and records for abnormalities, looking for trends, and understanding customer journeys and pain points, to name just a few.
In each of the above areas, and in others, our primary objective is not to invent new machine learning algorithms, but rather to solve business-relevant problems and to build new capabilities that simplify how knowledge work gets done. As such, we often focus on extending and adapting state-of-the-art machine learning algorithms to our data and domains, as well as on designing innovative solution architectures that solve complex problems at quality and scale.
Our choice of technology is not driven by recent trends or popularity. We strive to use the right tool and approach for the task, with an eye towards the operational cost of these technologies (e.g., in terms of the skills needed for the care and feeding of solutions). We have significant expertise in both supervised and unsupervised ML approaches. We use active learning to reduce the cost of annotation tasks, and transfer learning so that we are not starting from scratch every time. Historically, we relied heavily on classical ML approaches that require capturing the nuances of the domain in a form that an algorithm can compute, which is often referred to as feature engineering. More recently, we have been shifting towards deep learning (DL) approaches, which reduce, if not eliminate, the need for complex, hand-curated features that are engineered and tuned iteratively. So far, deep learning is living up to its promise, and we are getting state-of-the-art results on summarization (abstractive and extractive) as well as on question answering tasks.
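To make the active learning idea concrete, here is a minimal sketch of one common selection strategy, least-confidence sampling, in which the items a model is least sure about are sent to annotators first. It is an illustration only, not a description of our production systems: the function name `least_confident`, the document ids and the probability values are all hypothetical, and the probabilities are assumed to come from some already-trained classifier.

```python
from typing import Dict, List, Sequence


def least_confident(probs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k items whose top class probability is lowest."""
    # The confidence of a prediction is the probability of its most likely class;
    # sorting ascending puts the most uncertain items first.
    ranked = sorted(probs, key=lambda doc_id: max(probs[doc_id]))
    return ranked[:k]


# Hypothetical class probabilities for four unlabeled documents (assumed values).
predictions = {
    "doc_a": [0.98, 0.02],
    "doc_b": [0.55, 0.45],
    "doc_c": [0.10, 0.90],
    "doc_d": [0.51, 0.49],
}

# Ask annotators to label only the two documents the model is least sure about.
to_annotate = least_confident(predictions, k=2)
print(to_annotate)  # ['doc_d', 'doc_b']
```

In a typical loop, the newly labeled items are added to the training set, the model is retrained, and the selection step is repeated, so annotation effort is spent where it is most likely to change the model.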
Regardless of the particular choice of technology, we believe we are uniquely positioned to do ground-breaking machine learning research because, in most vertical domains, machine learning requires three ingredients: data, subject matter expertise and machine learning skills.
- Data is key because it is what is used to train the algorithms. For many tasks, one needs both quality and quantity. We have both. Our data is accurate, current, comprehensive and has been carefully curated through editorial and algorithmic processes.
- Subject matter experts are a key part of the puzzle because (1) they are often responsible for annotation tasks (creating training data), (2) they verify solution performance, and (3) together with scientists, they perform error analysis to understand algorithm behavior and make adjustments.
- Machine learning skills are not just about expertise in the various algorithms and tools but also in how to adapt them to our vertical domains.
Staying up to date with the latest machine learning technologies is critical if we are to remain relevant to our business and customers. This is why we attend, publish and present at scientific conferences and industry meetings. We also collaborate with universities and research institutions on domain-agnostic problems.