With each new model release, we hear the same bold claim: “This AI can reason.” But what does that actually mean—and why does it matter? At Thomson Reuters, we’ve spent the past year rigorously testing and evaluating the next generation of AI systems—not just for what they can generate, but for how they reach conclusions. For professionals working in legal, tax, and regulatory environments, traceable reasoning isn’t a luxury—it’s a requirement.
Not All AI Thinking Is Equal
Traditional Large Language Models (LLMs) excel at generating fluent, well-structured responses that directly answer a specific question (e.g., what is the capital of France?). But when a task demands multi-step logic, interpretation of legal nuance, or structured argumentation, those same models often fall short, because they cannot simply produce a memorized response. That's where Large Reasoning Models (LRMs) come in. These systems are trained to work through problems step by step, show their logic, and produce outputs that are transparent, reviewable, and aligned with how professionals make decisions. It's an exciting shift, but it also demands a different level of scrutiny.
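To make that distinction concrete, here is a minimal Python sketch of what a reviewable, step-by-step output looks like compared with a single opaque answer. The helper function, data structure, and legal example are hypothetical illustrations, not our production code or any particular vendor's API.

```python
# Minimal sketch: a reasoning model's output can be split into a reviewable
# trace plus a final answer, rather than a single opaque completion.
# `call_reasoning_model` and its contents are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class ReasonedAnswer:
    steps: list[str]   # intermediate reasoning steps, kept for review
    answer: str        # the final conclusion shown to the user

def call_reasoning_model(question: str) -> ReasonedAnswer:
    """Placeholder for a call to a reasoning-capable model.
    A real implementation would invoke a provider API and parse the
    exposed reasoning trace; here we return a canned example."""
    return ReasonedAnswer(
        steps=[
            "Identify the governing provision cited in the question.",
            "Check whether that provision applies to the facts given.",
            "Draw a conclusion and note the supporting authority.",
        ],
        answer="The notice requirement applies; see the cited provision.",
    )

result = call_reasoning_model("Does the notice requirement apply here?")
for i, step in enumerate(result.steps, 1):
    print(f"Step {i}: {step}")   # each step can be checked by a reviewer
print("Answer:", result.answer)
```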
What We’ve Learned So Far
At Thomson Reuters Labs, we’ve been testing reasoning-capable AI across a variety of high-stakes domains. Our work includes both proprietary evaluation frameworks and live deployments that put models to the test under real-world legal complexity.
We’ve found that:
- Models may return the right answer via flawed reasoning, or arrive at a wrong answer despite sound reasoning.
- Multi-step reasoning increases the risk of hard-to-detect hallucinations, particularly when the reasoning steps are not exposed to the user.
- As questions grow more complex, models may fail at a single step and miss the correct answer, or give up entirely.
That’s why we’ve built a robust testing and benchmarking process, including human-in-the-loop validation and domain-specific scoring. You can read more about that process here.
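As a rough illustration of why answers and reasoning need to be scored separately, here is a minimal Python sketch. The data structures, metric names, and example judgments are illustrative assumptions, not our actual evaluation framework.

```python
# Minimal sketch: answer correctness and reasoning quality are judged
# separately, so "right for the wrong reasons" cases become visible.
# The structures and metrics below are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Evaluation:
    answer_correct: bool    # did the model reach the right conclusion?
    reasoning_valid: bool   # did a domain expert accept the reasoning chain?

def score(evaluations: list[Evaluation]) -> dict[str, float]:
    """Aggregate per-question human judgments into headline metrics."""
    n = len(evaluations)
    return {
        "answer_accuracy": sum(e.answer_correct for e in evaluations) / n,
        "reasoning_accuracy": sum(e.reasoning_valid for e in evaluations) / n,
        # Right answer reached through flawed reasoning: the hard-to-detect case.
        "right_for_wrong_reasons": sum(
            e.answer_correct and not e.reasoning_valid for e in evaluations
        ) / n,
    }

# Example: three reviewed responses
reviews = [
    Evaluation(answer_correct=True,  reasoning_valid=True),
    Evaluation(answer_correct=True,  reasoning_valid=False),
    Evaluation(answer_correct=False, reasoning_valid=True),
]
print(score(reviews))
```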
Putting New Models to the Test
Most recently, we tested OpenAI’s new Deep Research model—evaluating its performance on legal queries that demand not just accuracy, but verifiability. As J.P. Mohler, Senior Machine Learning and Applied Research Scientist at Thomson Reuters, put it: “OpenAI’s deep research model helps us synthesize legal briefs, case records, and case law into analyses for appellate judges. Its ability to autonomously gather, assess, and clearly cite information from a broad range of public and private sources—paired with its depth of analysis—fills a critical need for reliable, verifiable research. The model empowers us to scale advanced research capabilities and support complex, data-driven knowledge work.” This type of evaluation gives us insight into how models reason in the wild—and how they perform under the pressures of real legal analysis.
Why Model Strategy Matters
No single model excels at everything. That’s why we take a multi-model approach at Thomson Reuters—working with partners while continually refining our own proprietary models. We select the right model for the right task, based on accuracy, explainability, and trustworthiness. This orchestration-first approach ensures we deliver results professionals can actually use—not just impressive demos.
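For readers who like to see the idea in code, here is a minimal sketch of what task-based model routing can look like. The task categories, model names, and routing rule are placeholders, not the orchestration logic we run in production.

```python
# Minimal sketch of an orchestration-style router: each task type is mapped to
# whichever model performed best on it in evaluation. Names are placeholders.
ROUTING_TABLE = {
    "summarization": "general-llm",                   # fluent generation, shallow reasoning
    "multi_step_legal_analysis": "reasoning-model",   # step-by-step, reviewable logic
    "citation_heavy_research": "deep-research-model", # broad retrieval with citations
}

def select_model(task_type: str) -> str:
    """Pick a model for a task; fall back to a default for unknown task types."""
    return ROUTING_TABLE.get(task_type, "general-llm")

print(select_model("multi_step_legal_analysis"))  # -> reasoning-model
```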
Want the Deeper Dive?
If you’re curious about how reasoning models are built, how they differ from traditional LLMs, and where they succeed (and struggle), I’ve written a more technical breakdown: Are reasoning models introducing the age of reason for AI? It explores why reasoning remains one of the most challenging frontiers in AI—and why it’s essential to get it right.
About the author:
This post was authored by Frank Schilder, Senior Director, Research at Thomson Reuters Labs, where he focuses on knowledge representation and reasoning, explainability, and applied AI research in legal and regulatory domains.