Newsletter

The State of Legal Industry AI Benchmarks in 2025: What Lawyers Should Know Before Choosing AI Tools, Part One


Issue 45 

One year ago, most lawyers had never heard of AI benchmarks. Today, AI benchmarks have emerged as a must-know topic for lawyers because they are shaping how legal industry AI tools are evaluated, marketed, and trusted. This issue is the first of a two-part series on benchmarks. Part One explains what lawyers need to know about AI benchmarks and begins breaking down what benchmarks and evaluations have revealed about AI tools for lawyers. Come back for the next issue of the newsletter, which will complete the picture of what legal industry AI benchmarks and evaluations have revealed about AI tools for lawyers, and will also cover what lawyers should keep in mind when interpreting the results. 

What are benchmarks?   

Benchmarks are standardized datasets and tasks used to measure the capabilities of AI models across an industry.[i] For example, in 2023, researchers created a benchmark called LegalBench, which included 162 legal reasoning tasks evaluated across 20 large language models (“LLMs”).[ii] In the context of AI model assessments, benchmarks can be distinguished from “evals” (short for evaluations), which are intended to measure the real-world performance of an AI tool at a deeper level, and from tests, which are intended to validate whether a specific tool performs as anticipated.[iii] It is important to recognize that the terms benchmark, eval, and test may not be used with perfect consistency across the legal industry in relation to AI tools for lawyers.

What are some of the challenges with benchmarking and evaluating legal industry AI tools?  

Researchers have noted that applications built using LLMs are challenging to evaluate because of LLMs’ open-ended response capabilities and unlimited output space.[iv] Experts have also noted the difficulty of assessing AI tools against correct answers when the nature of law is to argue about which answer is correct.[v] The speed at which LLMs are evolving compounds the problem, as benchmarking information can quickly become obsolete.[vi] The lack of transparency in how LLMs operate makes benchmarking more difficult still.[vii] Moreover, the sheer number of different use cases and AI tools built for lawyers in a short period of time makes creating comprehensive benchmarks and evaluations significantly harder. Finally, there is the risk of test leakage, which can occur when the data used for testing ends up in an LLM’s training data.[viii]

What AI benchmarks and evals exist for the legal industry?  

Below are summaries of three independent benchmarks and evals that have been released for the legal industry: 

Vals Legal AI Report (“VLAIR”) 

The VLAIR, released in February 2025, is a first-of-its-kind evaluation of four legal industry AI tools (CoCounsel, Vincent AI, Harvey Assistant, and Oliver) across up to seven legal tasks commonly performed by lawyers, benchmarking their results against those of a lawyer control group.[ix] Of the seven legal tasks evaluated, one or more AI tools beat the lawyer control group on four (document extraction, document question-answering, document summarization, and transcript analysis), while the lawyer control group surpassed the AI tools on two (redlining and EDGAR research) and matched the highest-performing tool on one (chronology generation).[x] Harvey Assistant, which participated in six of the seven tasks, had the strongest performance, receiving the top score on five tasks and the second-place score on one, and beating or matching the lawyer control group on five tasks.[xi] CoCounsel also received one top score and ranked among the best-performing tools on four of the tasks.[xii] For a more detailed recap of this study, see this article.

Contract Drafting Study 

This September 2025 study from Legalbenchmarks.ai evaluated 13 AI tools (7 legal industry tools and 6 general-purpose tools) against a human baseline of in-house commercial lawyers with an average of 10 years of working experience.[xiii] The legal industry tools were August, Brackets, GC AI, InstaSpace, SimpleDocs, Wordsmith, and an anonymous tool; the general-purpose tools were ChatGPT, Claude, CoPilot, Gemini, Le Chat, and Qwen.[xiv] The study found that some AI tools outperformed the human baseline in producing reliable first drafts of contracts, and that the legal industry AI tools raised risk warnings more often than the general-purpose AI tools, flagging risks that the human lawyers missed entirely.[xv] The study did not find a meaningful difference in output reliability or usefulness between the general-purpose and legal industry AI tools.[xvi] The top-performing tools for output were Gemini, ChatGPT, GC AI, Brackets, August, and SimpleDocs.[xvii] The study concluded that while the legal industry AI tools were not outperforming general-purpose tools on output, they were beginning to differentiate themselves through the functionality they have built for lawyers, such as integrating with Microsoft Word and offering clause libraries and templates.[xviii] The most meaningful differentiator among the legal industry AI tools was whether the tool integrated with existing workflows and technology.[xix] For workflow integration and support, the top performers were Brackets, GC AI, and SimpleDocs.[xx] 

Information Extraction Study 

This second study from Legalbenchmarks.ai, released in April 2025, focused on information extraction tasks for in-house lawyers.[xxi] The study evaluated six AI tools: two legal industry AI tools (GC AI and Vecflow’s Oliver), two general-purpose AI assistant tools (Google’s Notebook LM and Microsoft CoPilot), and two general-purpose LLM chatbots (DeepSeek and ChatGPT).[xxii] The AI tools were scored on both accuracy and usefulness.[xxiii] The study found that the two legal industry tools, GC AI and Oliver, received the highest combined scores, concluding that while general-purpose tools could match legal industry tools in accuracy, the legal industry tools delivered more value in usability and workflow integration.[xxiv] 

Sign up by October 23, 2025 to receive Part Two delivered to your inbox. Part Two will complete the picture of what legal industry AI benchmarks and evaluations have revealed about AI tools for lawyers, and will also cover what lawyers should keep in mind when interpreting the results. 

A quick reminder: I’m currently preparing to record a CLE called “How to Pick the Best AI Tool for Your Law Practice”. Once I release the CLE, I’ll provide my newsletter subscribers with an exclusive discount code. If you already subscribe to my newsletter, thank you! If you know someone who might like access to this discount code for my newsletter subscribers, please share this issue of the newsletter with them, and encourage them to sign up for my newsletter here before the CLE is released. Additionally, if you would like me to prioritize applying for CLE accreditation in your state, please send me an email at [email protected]

Thanks for being here.  

Jennifer Ballard
Good Journey Consulting 


[i] Shayan Mohanty, John Singleton & Parag Mahajani, LLM Benchmarks, Evals and Tests, Thoughtworks (Oct. 31, 2024), https://www.thoughtworks.com/en-us/insights/blog/generative-ai/LLM-benchmarks,-evals,-and-tests.

[ii] Neel Guha et al., LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, arXiv:2308.11462v1, at 1 (2023), https://arxiv.org/pdf/2308.11462.

[iii] Mohanty et al., supra note i.  

[iv] Id.

[v] Bob Ambrogi, At Law Librarians’ Annual Meeting, Panel Tackles the Challenge of Benchmarking AI Research Tools, LawSites (Jul. 22, 2025), https://www.lawnext.com/2025/07/at-law-librarians-annual-meeting-panel-tackles-the-challenge-of-benchmarking-ai-research-tools.html.

[vi] Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning & Daniel E. Ho, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, at 21, Stanford RegLab (preprint 2024), https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf.

[vii] Sean Harrington, Evaluating Generative AI for Legal Research: A Benchmarking Project, AI Law Librarians (May 24, 2024), https://www.ailawlibrarians.com/2024/05/24/new-project-evaluating-genai/.

[viii] Magesh et al., supra note vi at 21.  

[ix] Executive summary, Vals Legal AI Report, https://www.vals.ai/vlair (last visited Sept. 18, 2025). 

[x] Id. 

[xi] Id. 

[xii] Id. 

[xiii] Anna Guo, Arthur Souza Rodrigues, Mohammed Al Mamari, Sakshi Udeshi & Marc Astury, Benchmarking Humans & AI in Contract Drafting: Preliminary Findings (Sept. 2025), https://www.legalbenchmarks.ai/research/phase-2-research.

[xiv] Id. 

[xv] Id. 

[xvi] Id. 

[xvii] Id. 

[xviii] Id. 

[xix] Id. 

[xx] Id. 

[xxi] Anna Guo & Arthur Souza Rodrigues, Putting AI to the Test in Real-World Legal Work (Apr. 2025), https://www.legalbenchmarks.ai/research/phase-1-research.

[xxii] Id. 

[xxiii] Id. 

[xxiv] Id. 

Stay connected with news and updates!

Join our mailing list to receive the latest legal industry AI news and updates.
Don't worry, your information will not be shared.

We will not sell your information.