May 04, 2026

Why Legal AI Needs a New Standard: Inside Thomson Reuters CoCoBench

By Tyler Alexander, Director of CoCounsel AI Reliability, Thomson Reuters

A lawyer submits a filing supported by a citation that doesn’t exist. The system produced a polished answer. It just wasn’t grounded in reality.

This is the gap facing legal AI today. Not whether systems can generate sophisticated answers, but whether those answers are actually good enough for real legal work.

In practice, there is a consistent and measurable gap between how systems perform on traditional benchmarks and how they perform on real legal work.

Most evaluations still rely on benchmarks that were never designed for how legal work actually happens. Bar exam questions, clause extraction, single-turn prompts. These tests evaluate discrete components of the work. But they fail to capture how a system performs across the iterative sequence of tasks that makes up real legal work.

As a result, systems are often optimized to perform well on benchmarks that do not reflect how legal work is actually done.

And critically, systems fail in ways those benchmarks are not designed to catch. As agentic systems proliferate, those small errors cascade into failures that are more frequent and harder to identify.

Starting with the work

When we set out to build the next generation of CoCounsel Legal, we didn’t start with models or features. We started with the work itself: what does legal work actually look like in practice?

“This isn’t build first, ask later. It’s ask first, build second,” our teams often reiterate.

CoCounsel Legal has been in the market since August, already supporting legal professionals in research, drafting, and review. But as we looked ahead to the next generation, now in beta, a clear shift emerged. The focus is moving from point-in-time assistance to systems capable of handling longer unaided task horizons and more end-to-end workflows. That shift required us to rethink not only how we build CoCounsel, but how we evaluate it.

From single tasks to completed work

Through research with hundreds of legal professionals and over 100 Practical Law attorney editors, a consistent pattern emerged. The challenge was not any single task being too difficult. It was the number of steps required and the effort of keeping them coherent.

Legal work doesn’t happen in isolated prompts. It moves across research, drafting, review, and revision. Context builds, decisions compound, and small errors early can affect everything that follows. That is not what traditional benchmarks are designed to measure.

A different kind of system

The next generation of CoCounsel Legal reflects that shift. A single instruction can now trigger a complete workflow.

Ask it to draft a motion to dismiss. It plans the work, reviews the relevant documents, conducts legal research, pulls secondary sources, validates the citations supporting its conclusions throughout, and returns a final draft grounded in that authority.

That’s not a task. It’s a complete workflow. And it’s exactly where traditional benchmarks break down.

And it raises a different question. How do you comprehensively evaluate something like that?

Building CoCoBench

We needed a way to measure performance at the level of real legal work. That’s why we built CoCoBench, an evaluation framework designed for exactly that standard, and one we are now making more visible externally.

CoCoBench measures whether an AI system can complete real legal tasks to a fiduciary-grade standard. It is built around hundreds of attorney-authored benchmark tasks, with a fixed core dataset used to track performance over time. More than 100 legal subject matter experts have contributed to the legal dataset, alongside research and engineering teams at Thomson Reuters Labs who developed the evaluation infrastructure, representing over 15,000 hours of practitioner and engineering work.

Each test reflects real practice: a query written the way a practitioner would ask it, supporting materials drawn from representative contracts, pleadings, or correspondence, and a gold-standard response drafted and reviewed by attorneys. This approach is grounded in what we internally refer to as ideal-response evaluation, defining what correct, complete legal work actually looks like and measuring system output against that standard.
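To make that structure concrete, here is a minimal, hypothetical sketch of what such a benchmark task record could look like. The `BenchmarkTask` type and its field names are illustrative assumptions, not the actual CoCoBench schema.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    """Hypothetical shape of an attorney-authored benchmark task (illustrative only)."""
    query: str                       # the question phrased the way a practitioner would ask it
    supporting_materials: list[str]  # representative contracts, pleadings, or correspondence
    gold_response: str               # ideal response drafted and reviewed by attorneys
    practice_area: str = "general"   # used to group tasks across categories of legal work

# An illustrative (invented) task instance
task = BenchmarkTask(
    query="Can our client terminate the lease early under the force majeure clause?",
    supporting_materials=["lease_agreement.pdf", "notice_letter.pdf"],
    gold_response="Under Section 12.3, early termination requires ...",
    practice_area="real estate",
)
```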

The goal is not to measure whether a system can produce a response. It is to measure whether that response (and the sequence of work that produced it) constitutes complete, accurate legal work.

Evaluating how the work gets done

Legal workflows are multi-step, which means evaluation cannot stop at the final output. A system can produce a coherent answer even while relying on flawed reasoning, a failure mode traditional benchmarks often miss.

In agentic systems, an error in one step carries forward. A result may appear coherent while being built on an error upstream. CoCoBench addresses this by evaluating the final deliverable alongside the citation record the system produced along the way: what it cited, where it sourced it, and whether the source actually supports the claim.
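As a rough illustration of that idea (not the actual CoCoBench implementation), a citation-record audit could pair each claim with its cited source and check support. The `CitationEntry` type, the `source_lookup` mapping, and the `supports` check below are all assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class CitationEntry:
    claim: str        # the assertion made in the deliverable
    source_id: str    # where the system says it found support
    quoted_text: str  # the passage the system relied on

def audit_citation_record(entries, source_lookup, supports):
    """Flag citations whose source is missing or does not support the claim.

    `source_lookup` maps source_id -> full source text; `supports` is any
    claim-vs-passage check (e.g., attorney review or an NLI-style model).
    Both are illustrative assumptions, not CoCoBench internals.
    """
    failures = []
    for entry in entries:
        source_text = source_lookup.get(entry.source_id)
        if source_text is None:
            failures.append((entry, "source not found"))      # fabricated citation
        elif entry.quoted_text not in source_text:
            failures.append((entry, "quote not in source"))   # misattributed passage
        elif not supports(entry.claim, entry.quoted_text):
            failures.append((entry, "source does not support claim"))
    return failures
```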

These evaluations span core categories of legal work, including research, drafting, review, and multi-step reasoning across workflows.

A higher standard

Every output is evaluated against what a practicing attorney would consider acceptable. That includes correct application of the law, completeness of analysis, accurate use of sources, and work product that meets fiduciary-grade standards and is usable in practice.

No capability is considered ready until it demonstrates improvement against that standard. Progress is measured through real-world performance, evaluated by the attorneys best positioned to judge it.

What we’re seeing so far

In practice, we are seeing a consistent gap between how systems perform on traditional benchmarks and how they perform on real legal tasks. Systems optimized for general-purpose benchmarks often struggle when evaluated against real workflows, revealing gaps in completeness, source fidelity, and multi-step reasoning that are not visible in standard benchmark results.

When evaluation shifts from task-level performance to the workflow level, the bar changes. What counts as good changes, and so does which systems actually meet that bar.

More detailed findings will be shared as CoCoBench continues to evolve. The direction is clear. Evaluating AI at the task level changes not only how performance is measured, but what needs to be built.

In the next post in this series, we’ll share what happens when you apply this standard in practice, and how different approaches to legal AI perform when evaluated against real legal work. 

Building what comes next

The next generation of CoCounsel Legal, currently in beta, is being built on this foundation. The focus is not on isolated capabilities. It is on helping attorneys complete their work reliably, efficiently, and to a fiduciary-grade standard.

As AI systems take on more of that work, how they are evaluated becomes as important as what they can do, because without the right standard, progress can be overstated.

Because in legal work, almost right is not good enough. 
