We have recently witnessed impressive successes of deep neural networks on a variety of challenging AI problems, particularly in natural language processing (NLP).
Machine reading comprehension (MRC) is a subfield of NLP that has particularly benefited from these advances. Its ambition is to endow machines with the capability to read, understand, reason about, and answer questions on unstructured natural language text, in a much more sophisticated way than the symbolic matching heuristics traditionally used. The combination of powerful neural network models, computational resources, and large human-annotated datasets has led to notable breakthroughs in this field. Many academic institutions (Stanford, Carnegie Mellon, …) and technology companies (Google, Facebook, Microsoft, …) are investing heavily in this type of research. The objectives of this research project are (1) to investigate state-of-the-art deep learning-based approaches to the MRC problem, (2) to extend the state of the art to our domains, and (3) to apply the developed methods to a variety of relevant problems, starting with Question Answering tasks.
Academic MRC overview
In its most general definition, MRC aims to answer a given natural language query based on a provided textual context (Liu et al., 2019). Historically, different definitions of answer and context have characterized specific subsets of MRC techniques. Some of the task types that MRC has been designed to solve are the following (Fig. 1): Cloze-style: find which text entity, extracted from the textual context, fills in the blank in a sentence with missing words so that the sentence is meaningfully completed; Multiple choice: given a textual premise and a list of predefined answers, select the correct answer from the list; Span extraction: extract a span of text from a given context that answers the question; Free answer: dynamically generate an answer as a sequence of words that may not appear in the text, but that is semantically coherent and appropriate to the question.
Our approach to MRC
Members of this research project already have significant expertise in developing question-answering solutions and played a key role in developing the question answering technology for Westlaw Edge and Checkpoint Edge. Both products are in production and have been very well received by our customers. Each of these complex projects required 15-18 months of intense research and development effort, and both have reached remarkable levels of performance. Since we know their capabilities, and to some extent also their limitations, we asked ourselves whether we could advance their performance even further with qualitatively different approaches. In this research project we’re investigating the most advanced MRC techniques in order to port them over and enhance the technological portfolio of our own Question Answering systems.
In our approach, similarly to what had already been developed for the existing QA systems, we pursued a more information retrieval-based paradigm (Nogueira & Cho, 2019): given a list of candidate documents for a given query, we semantically parse each candidate document, rank the candidates by relevance, and return the one that best answers the question.
The definition of a document changes with the domain of application. In the legal domain, it can be a short snippet of text about the length of a sentence, e.g., a legal headnote. In the tax and accounting domain, an answer can instead be a series of paragraphs that together relate to a single topic.
Each question-answer pair is scored by the model, and the answer with the highest score is proposed as the candidate answer for the question. The candidates, on the order of a few hundred, are generated by a candidate generation stage based on a traditional search engine, and then reranked by a more advanced semantic model trained on annotated data.
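The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the production system: the term-overlap retriever stands in for the traditional search engine, `jaccard_scorer` is a hypothetical placeholder for the trained semantic model, and all function names are assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Scorer = Callable[[str, str], float]


def term_overlap(query: str, document: str) -> int:
    """Toy first-stage score: number of distinct query terms found in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def generate_candidates(query: str, corpus: List[str], pool_size: int) -> List[str]:
    """Stage 1: keep the pool_size documents with the highest term overlap."""
    ranked = sorted(corpus, key=lambda doc: term_overlap(query, doc), reverse=True)
    return ranked[:pool_size]


def rerank(query: str, candidates: List[str], scorer: Scorer) -> List[Tuple[float, str]]:
    """Stage 2: score every question-candidate pair and sort by descending score."""
    scored = [(scorer(query, doc), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def jaccard_scorer(query: str, document: str) -> float:
    """Hypothetical placeholder for the semantic model (NOT a neural scorer)."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    union = query_terms | document_terms
    return len(query_terms & document_terms) / len(union) if union else 0.0


def answer(query: str, corpus: List[str], scorer: Scorer, pool_size: int = 200) -> Optional[str]:
    """Return the top-ranked candidate, or None when the pool is empty."""
    ranked = rerank(query, generate_candidates(query, corpus, pool_size), scorer)
    return ranked[0][1] if ranked else None


corpus = [
    "home office expenses may be deductible",
    "the court dismissed the claim",
    "office hours are posted online",
]
query = "are home office expenses deductible"
best = answer(query, corpus, jaccard_scorer, pool_size=2)
```

In a real deployment the scorer would be the fine-tuned model; the point of the sketch is only that the second stage never sees documents that the first stage failed to retrieve.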
MRC neural model vs traditional QA
The original QA systems internally deployed at TR follow an established machine learning paradigm: define a set of features relevant to the QA scoring task, then train a highly performant “shallow” model, e.g., gradient boosted trees, to learn the classification function. A solution often requires hundreds of features to capture the nuances and linguistic characteristics of the target domain.
A number of different neural models have recently been proposed. We have experimented with the now celebrated BERT (Devlin et al., 2018), a relatively simple feedforward architecture with self-attention that has shown a remarkable property: by pre-training the model to predict missing words in a large corpus of text, it learns, without any annotation, an implicit representation that captures a set of general-purpose linguistic features very similar to those developed manually by human experts (Clark et al., 2019).
Once the model is presented with the actual training data for the specific task at hand, i.e., in our case deciding whether an answer is relevant to a question, it is able to leverage the prior knowledge discovered in the pre-training phase and fulfill the task with excellent performance, matching or even outperforming classic non-neural baselines. Crucially, the model achieves high performance without any manual feature engineering.
One of the most important characteristics of BERT is self-attention, which is a set of weights and connections that the model uses to learn how to combine words in a text with all other words appearing in context from the same text. When plotted, these weights emphasize semantic aspects of text that human experts can easily associate with linguistic characteristics. Again, these associations are learnt without any explicit bias from human annotators.
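To make the idea of attention weights concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a deliberate simplification: the token vectors are used directly as queries, keys, and values, whereas BERT applies learned projections and runs several attention heads in parallel. The function names and the toy vectors are assumptions made for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(tokens: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Return (weights, outputs) for one simplified attention head.

    weights[i][j] says how strongly token i attends to token j;
    outputs[i] is the weighted average of all token vectors for token i.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[i] for w, value in zip(row, tokens)) for i in range(dim)]
        for row in weights
    ]
    return weights, outputs


# Three toy 2-dimensional "word" vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
```

Each row of `weights` is the kind of quantity that gets plotted when attention maps are visualized: one row per word, one column per word it attends to.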
Our trained model
We have taken the base variant of BERT, i.e., the smallest one. It has 12 layers of stacked transformer encoders, with hidden size of 768 dimensions, and 12 attention heads. The maximum supported sequence length is 512 tokens. Following the protocol proposed by several research papers, we have taken the pre-trained version released by Google, which has been pre-trained on a combination of Wikipedia (2.5 billion words) and Toronto Book Corpus (0.8 billion words) data, and we fine-tuned it on our training data.
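As a rough sanity check on the size of this architecture, the sketch below derives the per-head dimension and an approximate parameter count from the configuration. The layer count, hidden size, head count, and sequence length come from the text above; the vocabulary size (30,522), feed-forward size (3,072), and two segment types are assumptions taken from the publicly released uncased base checkpoint, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    num_layers: int = 12
    hidden_size: int = 768
    num_heads: int = 12
    max_positions: int = 512
    vocab_size: int = 30522        # assumption: released uncased vocabulary
    intermediate_size: int = 3072  # assumption: feed-forward width, 4x hidden
    segment_types: int = 2         # assumption: sentence A / sentence B


def dense(inputs: int, outputs: int) -> int:
    """Parameters of a fully connected layer: weight matrix plus bias."""
    return inputs * outputs + outputs


def layer_norm(size: int) -> int:
    """Parameters of a layer normalization: scale plus shift."""
    return 2 * size


def head_size(config: EncoderConfig) -> int:
    """Dimension handled by each attention head."""
    return config.hidden_size // config.num_heads


def count_parameters(config: EncoderConfig) -> int:
    """Approximate encoder parameter count, including the pooling layer."""
    h = config.hidden_size
    embeddings = (
        config.vocab_size + config.max_positions + config.segment_types
    ) * h + layer_norm(h)
    attention = 4 * dense(h, h) + layer_norm(h)  # query, key, value, output
    feed_forward = (
        dense(h, config.intermediate_size)
        + dense(config.intermediate_size, h)
        + layer_norm(h)
    )
    pooler = dense(h, h)
    return embeddings + config.num_layers * (attention + feed_forward) + pooler


config = EncoderConfig()
per_head = head_size(config)
total = count_parameters(config)
```

Under these assumptions the total lands at roughly 110 million parameters, most of them in the twelve encoder layers rather than in the embeddings.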
The Legal corpus is composed of 42K questions, for a total of 352K editorial QA pairs. Answers are headnotes, typically made of one or two short sentences.
The Tax corpus restricted to the Federal practice area contains 3K questions, for 28K QA pairs. Answers are documents made of several paragraphs.
We formulated the fine-tuning problem as a binary classification task. Each QA pair is graded by a pool of subject matter experts (SMEs), each assigning one of four grades, A, C, D, or F, where A is for perfect answers and F is for completely wrong answers. The grades of each QA pair are converted to numbers, averaged across graders, and binarized.
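The label construction just described can be sketched as follows. The numeric value assigned to each grade and the cut-off used for binarization are not stated in this article, so `GRADE_VALUES` and `POSITIVE_THRESHOLD` below are purely hypothetical choices made for illustration.

```python
from statistics import mean
from typing import List

# Hypothetical numeric scale: the article does not specify the actual mapping.
GRADE_VALUES = {"A": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# Hypothetical cut-off: an average at or above this counts as a relevant answer.
POSITIVE_THRESHOLD = 2.0


def binarize(grades: List[str], threshold: float = POSITIVE_THRESHOLD) -> int:
    """Convert the SME grades of one QA pair into a binary training label."""
    if not grades:
        raise ValueError("a QA pair needs at least one grade")
    average = mean(GRADE_VALUES[grade] for grade in grades)
    return 1 if average >= threshold else 0


mostly_good = binarize(["A", "A", "C"])   # average above the cut-off
mostly_bad = binarize(["D", "F", "C"])    # average below the cut-off
split_vote = binarize(["A", "F"])         # graders disagree strongly
```

Where the threshold sits determines how disagreement between graders is resolved, so it is worth treating as a tunable choice rather than a fixed constant.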
BERT is a fixed-length model, which means that in its standard form it can only parse sequences up to a predetermined number of word tokens. The model we used supports 512 tokens, which is sufficient for the Legal task but imposes a significant constraint on the Tax task because the documents are much longer. We are actively investigating ways to tackle this long document problem. One possible line of research is to use novel variants of BERT that introduce a form of recurrence and memory storage, which can in theory allow the model to scan through a much larger number of tokens than is currently possible. An example is the XLNet model (Yang et al., 2019).
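The effect of the 512-token limit can be illustrated with a small sketch of how a question-answer pair is packed into a single input sequence. The `[CLS] question [SEP] answer [SEP]` layout is BERT's standard sentence-pair format; truncating only the answer side, and the function name `pack_pair`, are assumptions made for this example rather than a description of our actual preprocessing.

```python
from typing import List, Tuple

MAX_LENGTH = 512      # sequence limit of the model described above
SPECIAL_TOKENS = 3    # one [CLS] and two [SEP] markers


def pack_pair(
    question_tokens: List[str],
    answer_tokens: List[str],
    max_length: int = MAX_LENGTH,
) -> Tuple[List[str], int]:
    """Build one model input and report how many answer tokens were cut off."""
    budget = max_length - len(question_tokens) - SPECIAL_TOKENS
    if budget < 0:
        raise ValueError("the question alone exceeds the sequence limit")
    kept = answer_tokens[:budget]
    dropped = len(answer_tokens) - len(kept)
    packed = ["[CLS]", *question_tokens, "[SEP]", *kept, "[SEP]"]
    return packed, dropped


question = ["q"] * 10        # stand-in for a tokenized question
headnote = ["h"] * 40        # short answer, as in the Legal task
tax_document = ["t"] * 2000  # multi-paragraph answer, as in the Tax task

legal_input, legal_dropped = pack_pair(question, headnote)
tax_input, tax_dropped = pack_pair(question, tax_document)
```

With these toy lengths the headnote fits entirely, while most of the long document never reaches the model, which is the constraint the paragraph above describes.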
Results so far
For both the Legal and Tax tasks, BERT produced results competitive with the solutions in production. We are optimistic that with further work (see next section) we will push the system’s performance higher.
Language model fine-tuning on our corpora
One research question we’re actively trying to answer, related to understanding the maximum capabilities and the limitations of BERT and its variants, is this: does re-running the language model pre-training on an unlimited amount of domain-specific, unannotated data improve performance? The answer has important consequences both for the potential performance improvement and for assessing the competitive advantage we derive from exclusive access to a large amount of SME-annotated data. We’re conducting preliminary tests on legal case and headnote data to verify how the model behaves when its language representation is fine-tuned again on in-domain corpora.
Full corpus re-ranking
We’re also looking at ways to overcome the problem of unanswerable questions, i.e., questions for which no good answer appears in the candidate pool produced by the first-stage search engine. We’re conducting large-scale experiments that apply the more advanced semantic model to the whole corpus, bypassing the need for a preliminary screening by a non-data-driven, term-frequency-based search engine.
There are a number of exciting future research directions tailored to the specific needs of our domains. Among the most prominent is the need not only to tackle the semantic understanding of long documents, but also to return a more focused, targeted snippet of text from the larger documents that can appear in the Tax domain. This is typically called passage extraction. An evolution of this method would be the abstractive generation of an answer that may not be present word-for-word in a document, rather than “simply” selecting an existing part of a text.
Other complementary directions will require a more robust way to perform reasoning and inference over a collection of multiple sources, and to integrate, distill, and synthesize that knowledge in a dynamic and possibly personalized way for each user.
For the long term, we envision a full multi-turn conversational system that fulfills a user’s information need through continuous interaction and feedback, keeps track of the state of the conversation, and refers back to previous questions asked and information exchanged.
Liu, S., Zhang, X., Zhang, S., Wang, H., & Zhang, W. (2019). Neural Machine Reading Comprehension: Methods and Trends. Applied Sciences, 9(18), 3698.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT's Attention. arXiv preprint arXiv:1906.04341.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085.