
The State of Legal Industry AI Benchmarks in 2025: What Lawyers Should Know Before Choosing AI Tools, Part Two


Issue 46 

This is the conclusion of a two-part series on AI benchmarks for the legal industry. Part One explained what lawyers need to know about AI benchmarks and began breaking down what benchmarks and evaluations have revealed about AI tools for lawyers. Part Two, below, completes that picture and explains what lawyers should keep in mind when interpreting the results. 

What AI benchmarks and evals exist for the legal industry?  

Part One summarized three independent benchmarks and evals that have been released for the legal industry. Below are summaries of three additional independent studies: 

Law Student Study 

The University of Minnesota published a study in March 2025 called AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice.[i] In the study, law students tested a retrieval augmented generation ("RAG")-tuned AI tool, Vincent AI, and an AI reasoning model, OpenAI's o1-preview, on six legal tasks. Both AI tools significantly enhanced the quality of the students' legal work on four of the six tasks, compared to work performed without AI.[ii] Both tools also significantly boosted productivity on five of the six tasks, with particular power in tasks like analyzing complaints and drafting persuasive letters.[iii] The study found that AI may improve the speed at which lawyers can complete some tasks, and raised the possibility that integrating RAG with a reasoning model could yield additive or possibly even multiplicative benefits for users.[iv] For a more detailed recap of this study, see this article.  
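To make the "RAG plus reasoning model" idea concrete, here is a minimal, illustrative Python sketch. It is not how Vincent AI, o1-preview, or any other real product works: the in-memory LIBRARY, the term-overlap retrieve function, and the call_reasoning_model placeholder are all hypothetical stand-ins invented for this example.

```python
# Illustrative sketch of a RAG pipeline feeding a reasoning model.
# Every component below is a hypothetical placeholder, not the
# internals of any real legal AI product.

# A toy "library" of source passages; a real tool would index
# actual cases, statutes, and practice materials.
LIBRARY = {
    "smith_v_jones.txt": "A complaint must allege facts supporting each element of the claim.",
    "rule_12b6_notes.txt": "A motion to dismiss tests the legal sufficiency of the complaint.",
    "demand_letter_guide.txt": "A persuasive demand letter states the claim, the harm, and the ask.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank passages by crude term overlap with the query (stand-in for real retrieval)."""
    terms = set(query.lower().split())
    scored = sorted(
        LIBRARY.items(),
        key=lambda item: len(terms & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def call_reasoning_model(prompt: str) -> str:
    """Placeholder for a reasoning-model API call; real tools would call a model here."""
    return f"[model answer grounded in the provided sources]\nPrompt began: {prompt[:60]}..."

def answer(question: str) -> str:
    # The RAG step: ground the prompt in retrieved sources, then let
    # the reasoning model work only from those sources.
    sources = retrieve(question)
    prompt = "Answer using only these sources:\n" + "\n".join(sources) + f"\n\nQuestion: {question}"
    return call_reasoning_model(prompt)

print(answer("What must a complaint allege to survive a motion to dismiss?"))
```

The design point the study raises is visible even in this toy version: retrieval constrains what the model sees, while the reasoning step works within those constraints, which is why combining the two could compound the benefits of each.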

Vals LegalBench Contributions 

As noted in Part One, in 2023, researchers created a benchmark called LegalBench, which included 162 legal reasoning tasks evaluated across 20 large language models ("LLMs").[v] Vals, which released the VLAIR summarized in Part One, has contributed to the LegalBench benchmark, with a September 2025 update finding that GPT-5 is the highest-performing AI model of the 87 models evaluated, with an accuracy rate of 84.6%.[vi]  

Hallucination Study 

Stanford RegLab published a preprint study in May 2024 called Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.[vii] The study tested three legal industry AI tools tuned with RAG: Westlaw's AI-Assisted Research, Ask Practical Law AI (both Thomson Reuters products), and Lexis+ AI, and concluded that all three tools hallucinate.[viii] Westlaw's AI-Assisted Research hallucinated one third of the time, while Ask Practical Law AI and Lexis+ AI each produced hallucinations in more than one of every six responses.[ix] The study concluded that the hallucination rates of the RAG-tuned AI tools it tested were reduced compared to GPT-4 (which it found hallucinated 43% of the time), yet remained substantial.[x] LexisNexis and Thomson Reuters both responded that their internal testing and customer feedback demonstrate higher rates of accuracy than the study results, with Thomson Reuters asserting an accuracy rate of approximately 90% for its AI-Assisted Research tool.[xi] The Stanford study identified the most important takeaway of its results as the need for thorough and transparent benchmarks and evaluations of AI tools for the legal industry.[xii]  

What can we learn from the legal industry benchmarks and evals that have been released?  

The benchmarks and evals published to date for the legal industry provide valuable information for lawyers who are considering how to incorporate AI into their practices. However, these studies cover only the tip of the iceberg: to date, they have assessed just a handful of the hundreds of legal industry AI tools on the market.  

Nonetheless, we can draw some conclusions from the information available. For example, lawyers should not summarily dismiss AI tools as hype, because using an AI tool on certain tasks may elevate a lawyer's work. Further, the accuracy of OpenAI's GPT models appears to have improved significantly over the past year and a half. Some of the studies concluded that it is currently a toss-up whether a general-purpose AI tool or a legal industry AI tool will produce better output, which may indicate that the accuracy of some legal industry AI tools has improved as well. Some legal industry AI tools are distinguishing themselves from general-purpose competitors by offering better workflow integration and support, and some may also offer data privacy and security advantages.   

What should lawyers keep in mind as they consider benchmarking and evals results?  

Benchmarking and evals are important because they give lawyers some data about AI tools at a given point in time, and over time they can be used to measure how AI tools are evolving. Some benchmarking and evals data, even if imperfect and incomplete, is more useful to lawyers than no data. Lawyers searching for AI tool solutions beyond the tools that have recently been evaluated should be prepared to do their own testing to determine whether an AI tool is a good match for their organization.  
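For lawyers (or their technologists) who do run their own tests, the arithmetic behind the headline numbers in the studies above is simple. The following Python sketch assumes you have already collected test prompts and had a knowledgeable reviewer grade each tool response; the graded_responses data here is invented purely for illustration.

```python
# Minimal sketch of scoring a hand-graded AI tool evaluation.
# The grades below are invented placeholders; in a real test, a
# reviewer would grade each response against known-good answers.

graded_responses = [
    {"prompt": "Summarize the holding of Case A.", "grade": "correct"},
    {"prompt": "Does the statute of limitations bar Claim B?", "grade": "incorrect"},
    {"prompt": "Cite authority for Proposition C.", "grade": "hallucinated"},
    {"prompt": "Draft an issue statement for Appeal D.", "grade": "correct"},
]

total = len(graded_responses)
correct = sum(1 for r in graded_responses if r["grade"] == "correct")
hallucinated = sum(1 for r in graded_responses if r["grade"] == "hallucinated")

# These are the two figures the studies above report: overall accuracy
# and the share of responses containing a hallucination.
print(f"Accuracy:           {correct / total:.0%}")
print(f"Hallucination rate: {hallucinated / total:.0%}")
```

The hard part of any such test is not this arithmetic but the grading itself: choosing representative prompts and having someone qualified judge each answer, which is exactly why the source and methodology of published benchmark numbers matter.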

Additionally, lawyers need to distinguish between independent benchmarking and evaluation efforts, such as the ones discussed above, and internal benchmarking and evaluation efforts by the companies making AI tools for lawyers. Some benchmarks are conducted by AI companies themselves and publicized for marketing purposes. While an AI tool company's own benchmarks and evaluations may provide useful data, it's important to be mindful of the source of any data you rely on when making decisions.  

A quick reminder: I'm currently preparing to record a CLE called "How to Pick the Best AI Tool for Your Law Practice". Once I release the CLE, I'll provide my newsletter subscribers with an exclusive discount code. If you already subscribe to my newsletter, thank you! If you know someone who might like access to this discount code for my newsletter subscribers, please share this issue of the newsletter with them, and encourage them to sign up for my newsletter here before the CLE is released. Additionally, if you would like me to prioritize applying for CLE accreditation in your state, please send me an email at [email protected].

Thanks for being here.  

Jennifer Ballard
Good Journey Consulting 

 

[i] Schwarcz, Daniel and Manning, Sam and Barry, Patrick James and Cleveland, David R. and Prescott, J.J. and Rich, Beverly, AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice (March 02, 2025). Minnesota Legal Studies Research Paper No. 25-16, Available at SSRN: https://ssrn.com/abstract=5162111 or http://dx.doi.org/10.2139/ssrn.5162111. 

[ii] Id. at 1-2, 7. 

[iii] Id. at 2. 

[iv] Id. at 36, 45. 

[v] Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, Zehua Li, LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, arXiv:2308.11462v1, 1 (2023), https://arxiv.org/pdf/2308.11462.

[vi] LegalBench Benchmark, Vals.ai, Sept. 29, 2025, https://www.vals.ai/benchmarks/legal_bench-09-29-2025.  

[vii] Magesh, Varun; Surani, Faiz; Dahl, Matthew; Suzgun, Mirac; Manning, Christopher D; Ho, Daniel E (2024): Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, at 21, Stanford RegLab. Preprint. https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf.

[viii] Id. at 22. 

[ix] Id.  

[x] Id. at 4. 

[xi] Jeremy Kahn, What a study of AI copilots for lawyers says about the future of AI for everyone, Fortune (Jun. 4, 2024, 13:31), https://fortune.com/2024/06/04/stanford-hai-legal-ai-copilot-study-rag-llms-future-of-ai/. 

[xii] Magesh et al., supra note vii at 24. 
