ARTIFICIAL INTELLIGENCE

Document analysis 

Professional knowledge work is often document-centric. This research project focuses on developing algorithms and tools to automate or machine-assist document review and analysis tasks.

Experts in the Knowledge Economy are required to create, review, and respond to long documents as a frequent and critical component of their work. Whether drafting a new brief for a trial motion, repurposing or updating an old brief, reviewing client tax documents, drafting a contract, or ensuring that client filings comply with relevant regulations, these tasks have some traits in common: most notably, they are all work-intensive and time-consuming.

Document Analysis is a research project that aims to simplify document review tasks and at the same time help our customers produce higher quality documents. In this article, we focus on the underlying research and AI capabilities that we have developed or are developing. We mention certain products to demonstrate how some of these capabilities can be used and to make the discussion more concrete.

Note that document review is different from document authoring, which focuses on helping knowledge workers write documents faster, especially when specific formatting rules apply, e.g., Drafting Assistant or Contract Express. Nevertheless, we think of document review and document authoring as complementary processes that even require some common capabilities.

Building document analysis products requires clearing a number of technical hurdles. Some of these are applications of well-known tasks in the Natural Language Processing (NLP) space, in many cases adapted for additional robustness. Other hurdles are entirely new and require a combination of algorithm development and user studies. In the paragraphs that follow, we’ll describe some of the technical problems that we have worked on for our document analysis products.

Chapter Two

Document structure

Documents are a complex web of references (explicit and implicit), expansions of ideas, and resolutions of conflicting ideas. Some lines of reasoning are grounded in precedent, while others are proposed or suggested. Humans give significant clues to the big ideas and their purposes through the structure and layout of a document. As a preliminary task, document analysis tools have to extract those structural clues accurately. This includes capturing such information as document titles, sections and subsections and their headings, and paragraph and sentence boundaries. It also includes elements of document zoning (determining the purpose of each section of a document), which is often domain dependent.

A starting principle is that a document analysis product must be agnostic to the source of a document and to the formatting choices its author makes. However, formatting conventions vary widely, and the link between structure and content cannot be ignored. For example, are the sections of a document compact and mostly independent, or are there a few long, hierarchical sections? Finding segmentation algorithms that are aware of both structure and content is an area of active research.
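As an illustration, consider a naive structure-only baseline that splits a document on hand-written heading patterns. A minimal sketch follows; the regular expressions and the sample text are illustrative assumptions, not patterns from our products.

```python
import re

# Illustrative patterns for two common heading conventions (an
# assumption; real documents vary far more widely than this).
HEADING_PATTERNS = [
    re.compile(r"^\s*(\d+(?:\.\d+)*)\s+(.+)$"),  # "2.1 Standard of Review"
    re.compile(r"^\s*([IVXLC]+)\.\s+(.+)$"),     # "IV. Argument"
]

def segment_by_headings(lines):
    """Split a document into (heading, body_lines) sections.

    A naive structure-only baseline: it ignores content entirely,
    which is exactly the limitation discussed above.
    """
    sections = []
    current_heading, current_body = None, []
    for line in lines:
        if any(p.match(line) for p in HEADING_PATTERNS):
            if current_heading is not None or current_body:
                sections.append((current_heading, current_body))
            current_heading, current_body = line.strip(), []
        else:
            current_body.append(line)
    sections.append((current_heading, current_body))
    return sections

text = """I. Introduction
Plaintiff seeks summary judgment.
II. Argument
The undisputed facts establish liability.
"""
for heading, body in segment_by_headings(text.splitlines()):
    print(heading, "->", " ".join(body))
```

A content-aware segmenter would additionally compare the topical coherence of adjacent blocks before committing to a boundary, rather than trusting the layout alone.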

Chapter Three

Entity & relation extraction

Entity and relation extraction is another well-known NLP task that is used extensively in document analysis products. While off-the-shelf approaches to entity extraction are generally adequate, relation extraction is highly specialized in the tax, legal, and regulatory domains. Relevant relations are generally more fine-grained than the classical NLP factoid relations (e.g., residence, employer, or date of birth), and as such are easily confused with similar relations, or are subject to variation in language if a high-level taxonomy is used. Relation extraction becomes critical if, for example, one wishes to extract the claims each party raises against the others in a multi-party legal action.
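To make the distinction concrete, here is a minimal sketch of the two tasks, assuming the spaCy library and its en_core_web_sm model are installed. The claim-trigger lemmas and the dependency rule are hypothetical placeholders; a production system would use a learned relation model trained on domain annotations.

```python
import spacy  # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")

# Hypothetical trigger lemmas for party-versus-party claims; a real
# system would learn relations rather than rely on a word list.
CLAIM_LEMMAS = {"sue", "allege", "counterclaim"}

def extract_claims(text):
    """Return (claimant, claim_verb, respondent) triples.

    A naive rule: for each trigger verb, pair its nominal subject
    with its direct object via the dependency parse.
    """
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.lemma_ in CLAIM_LEMMAS and token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            if subjects and objects:
                triples.append(
                    (subjects[0].text, token.lemma_, objects[0].text))
    return triples

sentence = "Acme sued Globex over the licensing fees."
doc = nlp(sentence)
print([(ent.text, ent.label_) for ent in doc.ents])  # entity extraction
print(extract_claims(sentence))                      # relation extraction
```

Even this toy example hints at why fine-grained relations are hard: a trigger verb inside a quotation or under negation would need handling that the rule above cannot provide.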

Chapter Four

Terms of art and concepts


While the traditional field of entity extraction attempts to find people, locations, organizations, and geopolitical entities in text, a machine may also have to detect and localize occurrences of named abstract ideas. For example, legal principles often come with tests to distinguish acceptable from unacceptable activity; one such principle is equal protection under the law, a constitutional right. The courts have found that a law will be invalid if it defines a class of individuals without any connection to the matter being regulated; a law prohibiting red-headed people from driving between the hours of 1 am and 3 am would be invalid under this principle. One might wish to detect whether a paragraph of a document discusses the legal principle of irrational classification.

A common approach to this problem would be to create a taxonomy, to collect and annotate instances of each node of the taxonomy, and finally to develop models. The first two steps are considerable investments of time and personnel, even before the beneficial task of model development can begin. Lowering the cost of creating and maintaining the taxonomy and of obtaining labeled examples is therefore an important research direction. Research areas around this problem include key-phrase extraction, active learning, distant supervision, and hybrid (ML/expert) systems.
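As a sketch of how active learning can cut annotation cost, the loop below fits a classifier on a handful of labeled paragraphs and asks the expert to label the pool item the model is least sure about. The texts, labels, and scikit-learn setup are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a few labeled paragraphs and a larger unlabeled
# pool; in practice labels come from domain experts.
labeled_texts = [
    "law singles out a class with no link to the regulated conduct",
    "the statute sets a uniform speed limit for all drivers",
]
labels = [1, 0]  # 1 = discusses irrational classification (illustrative)
unlabeled_pool = [
    "the ordinance bans red-headed drivers after 1 am",
    "the contract requires payment within thirty days",
    "the rule burdens a group unrelated to the statute's purpose",
]

vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_texts)
X_pool = vectorizer.transform(unlabeled_pool)

clf = LogisticRegression().fit(X_labeled, labels)

# Uncertainty sampling: query the pool item whose predicted
# probability is closest to 0.5, i.e., the most informative one.
probs = clf.predict_proba(X_pool)[:, 1]
query_idx = int(np.argmin(np.abs(probs - 0.5)))
print("Next paragraph to annotate:", unlabeled_pool[query_idx])
```

Each expert answer is added to the labeled set and the model is refit, so annotation effort concentrates where the model is weakest rather than on a bulk labeling campaign.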

Domain-specific concepts have many use cases. For example, they can be used as navigational markers (of the underlying semantic space) or as wayfinding tools in a search engine (e.g., see “Concept Markers” in Checkpoint Edge).

Chapter Five

Deviation analysis

One of the most common tasks in document review is to identify deviations from the norm. In contracts, for example, clauses often follow specific language constructs because those norms have withstood the tests of time and litigation. Material deviations from the norm, while permitted, are risky and must be made carefully. Similarly, in filings with the U.S. Securities and Exchange Commission, significant deviations from the norm in unstructured sections (e.g., Management Discussion & Analysis, or Notes to Financial Statements) can be used as a predictor of risk. Other types of deviation include changes in terms and conditions, for example in the context of contract review.

On the surface, deviation analysis can be viewed as constructing language models for the norms and then defining domain-specific functions to measure material deviation. These functions are often aided by a dictionary of concepts and terms of art. Such an approach can be effective for some use cases, but we don’t believe it is capable of capturing structural deviations that are expressed in natural language. This remains an open and active research problem.
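A minimal version of this surface approach, sketched below, scores a candidate clause by its distance to a small library of standard clauses. The clause library, the TF-IDF representation, and the 0.5 threshold are all illustrative assumptions, and such a bag-of-words view is exactly what fails to capture the structural deviations mentioned above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A hypothetical "norm": a tiny library of standard clause wordings.
# A production system would model norms at a far larger scale.
standard_clauses = [
    "This agreement shall be governed by the laws of the State of New York.",
    "Either party may terminate this agreement upon thirty days written notice.",
]

def deviation_score(clause, norms):
    """Return 1 minus the max cosine similarity to any standard clause.

    Higher scores mean the clause strays further from the norm; the
    0-to-1 scale is a property of cosine similarity on TF-IDF vectors.
    """
    vec = TfidfVectorizer().fit(norms + [clause])
    sims = cosine_similarity(vec.transform([clause]), vec.transform(norms))
    return 1.0 - float(sims.max())

candidate = ("Either party may terminate this agreement at any time, "
             "without notice, and without liability of any kind.")
score = deviation_score(candidate, standard_clauses)
if score > 0.5:  # illustrative threshold
    print(f"Flag for review (deviation score {score:.2f})")
```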

Chapter Six

Synthesis: Understanding customers’ needs

We have discussed several tasks that lie at the threshold of creating a document analysis product; however, most of these tasks are trivial for a human, at least at a small scale, and would not make a beneficial product by themselves. Instead, a product in this space needs to be designed according to a deep understanding of the best way to help a user by synthesizing these “atomic” inputs into something useful.

The Thomson Reuters Quick Check product grew from a study of the process of writing, reviewing, and updating trial and appellate briefs. Developers found that a common workflow is for experienced attorneys to review briefs written by less-experienced attorneys or to update an old brief. The objective is to ensure that the brief cites relevant caselaw authorities (ideally relevant to both the facts at hand and the legal issues) that also reach a similar conclusion. Another task involves reviewing briefs filed by the opposing party, looking for relevant authorities that do not support the opposing party’s conclusion.

These use cases gave rise to a host of technical challenges; for instance, creating accurate document segmentation to capture the various issues that are raised in the brief. Further, each issue consists of four elements: i) the facts; ii) the legal principles raised; iii) the authorities that are cited; and iv) the explanatory text. Given segmented documents, one must then find relevant authorities and determine the relative weights of the four elements above.
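One way to picture the segmentation output and the weighting problem is the sketch below. The BriefIssue container, the toy Jaccard similarity, the sample texts, and the weights are all hypothetical, intended only to show the shape of the computation.

```python
from dataclasses import dataclass

@dataclass
class BriefIssue:
    """One issue segmented out of a brief, with its four elements."""
    facts: str
    legal_principles: str
    cited_authorities: list
    explanatory_text: str

def jaccard(a, b):
    """A toy token-overlap similarity, for demonstration only."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def weighted_relevance(issue, authority_text, weights=(0.4, 0.4, 0.1, 0.1)):
    """Blend per-element similarities into one relevance score.

    The weights are illustrative placeholders; estimating their
    relative importance is itself part of the problem.
    """
    parts = (issue.facts, issue.legal_principles,
             " ".join(issue.cited_authorities), issue.explanatory_text)
    return sum(w * jaccard(p, authority_text)
               for w, p in zip(weights, parts))

issue = BriefIssue(
    facts="driver terminated after refusing an unsafe route",
    legal_principles="wrongful termination public policy exception",
    cited_authorities=["Smith v. Jones"],  # hypothetical citation
    explanatory_text="termination violated a clear mandate of public policy",
)
print(weighted_relevance(issue, "public policy exception to at-will termination"))
```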

Often, one can’t find an authority with the same set of facts and legal issues, which is why attorneys often think by analogy. While our algorithms are incapable of ‘thinking’ by analogy (e.g., a bus driver is similar to a machine-operator), we attempt to approximate it in a statistical sense.
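One statistical proxy for analogy is proximity in a learned embedding space, where related occupations land closer together than unrelated terms. A minimal sketch, assuming the sentence-transformers package and its public all-MiniLM-L6-v2 model:

```python
from sentence_transformers import SentenceTransformer, util

# Assumes the sentence-transformers package and this public model;
# any sentence-embedding model would illustrate the same point.
model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [("bus driver", "machine operator"),   # analogous roles
         ("bus driver", "tax deduction")]      # unrelated terms
for a, b in pairs:
    emb = model.encode([a, b])
    print(a, "~", b, ":", float(util.cos_sim(emb[0], emb[1])))
```

The first pair should score noticeably higher than the second, which is the statistical shadow of the analogy an attorney would draw explicitly.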

From a usability perspective, attorneys may choose not to cite relevant authorities for many reasons, and the last thing they want, especially in a time crunch, is to wade through a long list of algorithm-suggested authorities. So it is prudent that we recommend only highly relevant (and, if possible, controlling) authorities. To further reduce the burden of review, the product must also provide the user with an explanation of why an authority might be relevant.

The product vision gave rise to a host of secondary questions: How can relevance in this context be approximated so that we can find relevant documents and create explanations? As a constructive answer, we formed hypotheses built on atomic tasks like the ones described above. Of course, not all of these hypotheses held up, nor did we arrive at the best ones first. Thus, there is a nexus of interaction between the product vision and the atomic tasks needed to implement it. One can envision product development as a process of forming hypotheses and constructively validating them.