How the Thomson Reuters Center for AI & Cognitive Computing is using AI to help scientists find answers about the virus that is vexing us all.
The race to rid the world of COVID-19 has engaged the global scientific community in an unprecedented experiment in data-gathering and information-sharing, one that has produced hundreds of thousands of research papers since the pandemic began, with hundreds more added every day.
One problem with so much research being published so fast, however, is that no human being can possibly read it all, and the articles themselves are not coherently organized or easily searchable.
To address this challenge, a team of researchers in the Thomson Reuters Center for AI & Cognitive Computing (C3) has been competing with data scientists around the world to develop AI-enabled data and text-mining tools that can sort and classify publicly available articles related to COVID-19. The effort is part of a so-called “Kaggle competition” initiated in March by the White House Office of Science and Technology Policy, and involves partnerships with several organizations including the Allen Institute for Artificial Intelligence and the National Institutes of Health.
What is a Kaggle competition?
Kaggle competitions are a popular way for data scientists to push data research forward by competing to solve problems posed by other researchers. The online Kaggle community has more than 1 million members from 194 countries, and members share their work with one another to maintain maximum scientific transparency. A nominal cash prize is awarded to the winner of each Kaggle contest, but the competition’s true purpose is to spur innovative thinking, particularly in the areas of machine learning and artificial intelligence (AI).
The White House launched the competition by publicly releasing the COVID-19 Open Research Dataset (CORD-19), containing almost 140,000 publications and 60,000 full-text papers on SARS, COVID-19, and other coronaviruses, and asking the AI community to develop better tools for mining this growing body of COVID-19 literature. The Thomson Reuters C3 team accepted the challenge and, as part of a larger effort spearheaded by the Vector Institute, used its expertise in AI-enabled search to develop a superior COVID-19 information retrieval system.
“With these real-world challenges, we hope to help biomedical researchers explore and mine information from articles more efficiently,” says Luna Feng, a research scientist on the Thomson Reuters C3 team. “Hopefully, what we have built will help the medical community find answers to their high-priority questions.”
AI vs. COVID-19
Many experts believe that using AI to accelerate the dissemination of scientific knowledge may be an important key to eradicating COVID-19, and that methods of scientific crowdsourcing used during this pandemic may help shorten or prevent future outbreaks.
For Kaggle competitors working with the CORD-19 database, the most troubling issue is that much of the research has not gone through the normal academic peer-review process, so the credibility of individual articles can’t be confirmed. Indeed, the urgency of the pandemic has prompted many scientists to publish their results immediately as online “pre-prints” rather than submit their work for peer review, which can delay publication by six months to a year or more.
One major goal for a COVID-19 text-mining tool, then, is to somehow separate credible articles with reliable information from questionable, misleading, or incomplete articles that have not been vetted. In order for an AI algorithm to accomplish this task, however, it must be “taught” or “trained” to tell the difference.
Combining human expertise and AI
To address this part of the challenge, Thomson Reuters researchers devised an approach that combines the expert judgment of human subject-matter experts (SMEs) with a form of machine learning called active learning.
First, the team ran a list of common COVID-19 research questions through the CORD-19 database using an AI-enabled search algorithm. A group of qualified SMEs then rated the accuracy of the algorithm’s answers on a simple four-point scale, along with an assessment of each source’s reliability. Each answered question constituted a question/answer (QA) “pair,” and those pairs were used to “teach” the algorithm what the best answer to each question was, information the algorithm can then use to answer future questions.
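In code, the labeled output of this step might look something like the sketch below (Python; the `JudgedQAPair` record and the `search` and `sme_rate` functions are hypothetical stand-ins for the retrieval system and the human annotators, not names from the Thomson Reuters system):

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class JudgedQAPair:
    question: str          # a common COVID-19 research question
    answer_text: str       # passage returned by the search algorithm
    paper_id: str          # identifier of the CORD-19 source article
    accuracy_score: int    # SME rating on the four-point scale
    source_reliable: bool  # SME's judgment of the source's reliability

def collect_judgments(
    questions: Iterable[str],
    search: Callable,    # hypothetical: question -> ranked passages
    sme_rate: Callable,  # hypothetical: (question, passage) -> (score, reliable)
) -> List[JudgedQAPair]:
    """Run each question through the retrieval system and record the SME
    judgments, yielding labeled pairs the algorithm can be trained on."""
    pairs = []
    for question in questions:
        for passage in search(question):
            score, reliable = sme_rate(question, passage)
            pairs.append(
                JudgedQAPair(question, passage.text, passage.paper_id,
                             score, reliable)
            )
    return pairs
```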
Though this approach yielded promising results on a small scale, it involves significant logistical challenges on a larger scale. Finding enough qualified SMEs to do the QA assessments could be difficult, for example, and the work itself is both time-consuming and expensive. To make better use of each SME’s time, the Thomson Reuters team developed a way to select the QA pairs that, once judged by an SME, would most improve the performance of the overall system. This active learning approach maximizes the usefulness of the work performed by each SME.
“If we choose samples to be annotated in an intelligent way, we can significantly reduce the amount of required labeled data points to design an accurate classifier,” explains researcher Dawn Sepehr. Fewer labeled data points mean less demand on SMEs, which saves both time and money.
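The article doesn’t say which active learning strategy the team used, but least-confidence uncertainty sampling is a common choice and illustrates the idea: ask the SMEs to label only the candidates the current model is least sure about. A minimal sketch, assuming a scikit-learn-style classifier:

```python
import numpy as np

def select_for_annotation(model, candidate_features, batch_size=10):
    """Least-confidence uncertainty sampling: rank unlabeled QA pairs by
    how unsure the current classifier is about them, and send only the
    most uncertain batch to the SMEs.

    Assumptions: `model` exposes a scikit-learn-style predict_proba, and
    `candidate_features` is an (n_samples, n_features) matrix describing
    the not-yet-labeled QA pairs.
    """
    probs = model.predict_proba(candidate_features)  # (n_samples, n_classes)
    uncertainty = 1.0 - probs.max(axis=1)            # low top-class prob = unsure
    ranked = np.argsort(uncertainty)[::-1]           # most uncertain first
    return ranked[:batch_size]                       # indices to annotate next
```

After each batch is labeled, the classifier is retrained and the selection repeats, so every round of SME work targets the examples the model currently finds hardest.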
Separating the best from the rest
Even with the use of an active learning strategy that chooses the best QA pairs for annotation, it would take hundreds of SMEs thousands of hours to feed enough data into an algorithm to boost its accuracy sufficiently. So, to minimize the need for human SMEs, the Thomson Reuters team decided to teach the algorithm to recognize the features of a credible medical article by leveraging several well-established, publicly available medical databases, such as PubMedQA and BioASQ.
The idea here is for the algorithm to “learn” the characteristics of a high-quality medical article from these other databases, then apply that knowledge, refined with some basic statistical modeling, to deliver more accurate search results on the government’s CORD-19 data.
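One plausible way to implement this transfer, sketched below, is to start from a classifier fine-tuned on public biomedical QA corpora such as PubMedQA or BioASQ and use it to score CORD-19 passages. The checkpoint name is a placeholder, and nothing here is the team’s actual pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: a model already fine-tuned on public biomedical
# QA data (PubMedQA/BioASQ-style question-passage pairs); not a real model ID.
MODEL_NAME = "example-org/biomedical-qa-relevance"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def score_passage(question: str, passage: str) -> float:
    """Probability that `passage` answers `question`, per the transferred model."""
    inputs = tokenizer(question, passage, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # class 1 = "answers it"
```

CORD-19 passages can then be ranked by this score, with the statistical modeling layered on top to calibrate results for the new dataset.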
In search of answers
Work on the Thomson Reuters CORD-19 text-mining tool is ongoing; more SMEs are being recruited, for instance, and the application’s public interface is still in development. Ultimately, the goal of the C3 team is to provide COVID-19 researchers with a tool that can help them cut through the publication clutter and find the answers they’re looking for faster and more accurately than is currently possible. And that doesn’t just mean pointing scientists to articles that include the answer; it means identifying the exact sentence or paragraph that contains it, so that researchers don’t have to waste time wading through information they don’t need.
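Pinpointing the exact answer text is the job of extractive question answering, in which a model predicts the start and end of an answer span inside a passage. A minimal sketch using an off-the-shelf span-extraction model (illustrative only, not the model behind the C3 tool):

```python
from transformers import pipeline

# An off-the-shelf extractive QA model from the Hugging Face hub.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

# In practice this would be a passage drawn from a CORD-19 article.
passage = (
    "The median incubation period was estimated to be 5.1 days, and 97.5% "
    "of those who develop symptoms do so within 11.5 days of infection."
)

result = qa(question="What is the incubation period of the virus?",
            context=passage)
print(result["answer"], result["score"])  # the exact span plus a confidence score
```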
“Successfully managing COVID-19 requires making sense of the research,” says researcher Conner Cowling, another member of the C3 team. “The ability to transform questions into promising leads — and to make connections that might otherwise remain undiscovered — goes a long way in supporting that goal.” That capability may be out of reach at the moment, but as their COVID-19 search algorithm matures, the plan is to introduce SME scoring of text snippets to further increase the accuracy of the application.
In the end, AI may not cure COVID-19 — but with any luck, it can help speed the development of vaccines and therapeutics the world needs to resume life as we used to know it.