Best Practices in Courts & Administration

Generative AI in legal: A risk-based framework for courts

Natalie Runyon  Director / Sustainability content / Thomson Reuters Institute

· 6 minute read


Courts and legal practitioners should adopt a risk-based, principle-driven framework for using generative AI (GenAI) that balances innovation with accountability and requires meaningful human oversight.

Key highlights:

      • Risk varies by workflow and context — Practitioners should apply risk ratings based on workflow and context, such as minimal to moderate for productivity, moderate for research, moderate to high for drafting and public-facing tools, and high for decision support.

      • Courts need their own benchmarks — Courts should develop and regularly review their own independent benchmarks and evaluation datasets instead of relying solely on vendor claims, because vendors may optimize systems for known tests.

      • Need for benchmarking to detect drift, degradation, and bias — Continuous, rigorous benchmarking of AI models is essential for courts and legal professionals to maintain confidence in these systems, since both the law and AI models change over time.


AI is not a monolithic technology, and using it calls for a risk-based assessment process. Indeed, courts and legal professionals must scale their scrutiny to match the level of risk.

This approach — which balances innovation with accountability, along with other essential best practices — is detailed in a recent publication, Key Considerations for the Use of Generative AI Tools in Legal Practice and Courts, created as part of the National Center for State Courts and Thomson Reuters Institute AI Policy Consortium.

In a recent webinar, Dean Megan Carpenter, one of the document’s co-authors, explained its purpose: “The central aim of what we were thinking about in these best practices is to give courts and legal professionals a principle-based architecture when you’re thinking about the adoption of GenAI tools.”

Risk and human judgment serve as central elements

What is unique about this framework is that it categorizes risk based on key workflow actions of lawyering, for example:

      • Productivity tools carry minimal to moderate risk
      • Research tools are assigned moderate risk
      • Drafting tools range from moderate to high risk
      • Public-facing tools carry moderate to high risk
      • Decision-support tools pose high risk

The framework holds that risk is dynamic rather than static, and risk levels can shift based on use cases. For example, a scheduling tool typically poses minimal risk; however, the same tool becomes high risk when used for urgent national security cases. And translation tools can shift from lower-risk research support to high-risk decision support depending on their use.
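For illustration only, here is a minimal Python sketch of how such context-sensitive risk ratings could be encoded. The workflow categories mirror the framework above, but the escalation triggers, names, and logic are assumptions, not rules taken from the Key Considerations document.

```python
from enum import Enum

class Risk(Enum):
    MINIMAL = 1
    MODERATE = 2
    HIGH = 3

# Baseline ratings follow the framework's workflow categories; comments note
# where the framework states a range rather than a single level.
BASELINE_RISK = {
    "productivity": Risk.MINIMAL,       # minimal to moderate
    "research": Risk.MODERATE,
    "drafting": Risk.MODERATE,          # moderate to high
    "public_facing": Risk.MODERATE,     # moderate to high
    "decision_support": Risk.HIGH,
}

def rate(workflow: str, context: set) -> Risk:
    """Return a rating that can escalate with context (triggers are illustrative)."""
    risk = BASELINE_RISK[workflow]
    # Dynamic risk: e.g., a scheduling (productivity) tool used in an urgent
    # national-security matter shifts to high risk.
    if context & {"national_security", "fundamental_rights", "final_decision"}:
        risk = Risk.HIGH
    return risk

print(rate("productivity", set()))                   # Risk.MINIMAL
print(rate("productivity", {"national_security"}))   # Risk.HIGH
```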

Similarly, when tools range from moderate to high risk, users need to be especially discerning about the underlying risks and whether the task should be delegated to AI at all.

“You can’t just rely on categories,” explains Judge Bowon Kwon from the IP High Court of Korea. “You need to understand the underlying risks and ask yourself: Would I delegate this task to another person? Am I comfortable delegating it publicly? If the answer is no, then you probably shouldn’t be delegating it to an AI either.”

In addition, judicial use has clear red lines: situations in which AI should never be used because the risk is unacceptable. “I believe the clear red line is automated final decisions or AI systems that assess a person’s credibility or determine fundamental rights involving incarceration, housing, family,” says Judge Kwon, adding that fundamental rights require human judgment.




The extent of human judgment also has layers. Hank Greenberg, Shareholder at Greenberg Traurig, says he believes that AI for any legal use currently requires human oversight. “The human supervision piece… is utterly critical in the real world of practicing lawyers and law firms,” Greenberg says. “You have to supervise the lawyers in the firm that are using the technology, including young lawyers.”

To help distinguish which type of human oversight is appropriate, the framework in the Key Considerations document defines two forms of such oversight: i) human in the loop, which means active human involvement in decisions; and ii) human on the loop, which means monitoring automated processes and intervening when needed.

The difference between these two concepts in a court setting might look like this: a human in the loop is, for example, a law clerk using AI to research relevant case law and checking that the references are legally sound, while a human on the loop is a clerk monitoring an established robotic process that extracts data for the case management system and spot-checking it for accuracy.

Practical guidance for courts

Beyond weighing the risk level of AI tools, Judge Kwon, Greenberg, and Carpenter noted that technical AI competence is part of lawyers’ and judges’ ethical duty, especially around verification, transparency, and independent benchmarks as elements of accountability, and that understandable documentation is needed to maintain public trust. To reinforce the latter point, Grace Cheng, Director in Government Practice for Thomson Reuters Practical Law, states: “It’s very vital, especially as we usher in the age of AI, that the public be informed as much as they can be about how that decision-making process is taking place.”

Judge Kwon, Greenberg, and Carpenter also highlighted guidance on the criticality of benchmarking, including:

      • Court-developed benchmarks prevent overreliance on vendor data — Courts should develop their own benchmarks and independent evaluation datasets rather than relying entirely on vendor claims, and they should review evaluation scenarios regularly. Vendors may optimize their systems for known tests, which leads to overfitting, in which a model learns patterns specific to its training data so well that it performs poorly on new, unseen data, giving a misleading impression of reliability.
      • Ongoing rigorous benchmarking to detect model drift and degradation — To build confidence in AI models, courts and legal professionals must approach evaluation with rigor and ongoing vigilance. Continuous benchmarking is essential and cannot be a one-time process, because the law evolves constantly and precedents shift. In addition, AI models themselves update regularly, so courts need to monitor performance over time to detect degradation or bias drift, as shown in the sketch after this list.
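As a rough illustration of what that ongoing monitoring could look like, the sketch below scores a model on a court-maintained evaluation set and flags drops between runs. The evaluation items, scoring rule, threshold, and names are placeholder assumptions that a court would replace with its own legally reviewed benchmark.

```python
from datetime import date

# Hypothetical court-maintained evaluation set: (prompt, key fact the answer must contain)
EVAL_SET = [
    ("Is Smith v. Jones still good law?", "overruled"),
    ("What is the deadline to answer a complaint after service?", "30 days"),
]

def score(model_answer: str, expected: str) -> float:
    """Toy scoring rule: 1.0 if the expected key fact appears in the answer.
    A real benchmark would use legally reviewed rubrics and expert grading."""
    return 1.0 if expected.lower() in model_answer.lower() else 0.0

def run_benchmark(ask_model, history, drift_threshold=0.05):
    """Score the model on the court's own eval set and flag degradation relative
    to the previous run; re-run whenever the model updates or the law changes."""
    avg = sum(score(ask_model(q), fact) for q, fact in EVAL_SET) / len(EVAL_SET)
    if history and avg < history[-1][1] - drift_threshold:
        print(f"WARNING: score fell from {history[-1][1]:.2f} to {avg:.2f}; review before further use")
    history.append((date.today(), avg))
    return avg

# Usage with a stand-in model; in practice ask_model would query the AI system under review.
history = []
run_benchmark(lambda q: "Smith v. Jones was overruled in 2023.", history)
```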

Adopting a thoughtful, risk-informed approach to GenAI in legal practice and courts will help realize its benefits for efficiency and access to justice while protecting ethical obligations, due process, and public trust in the legal system.

