{"id":109448,"date":"2025-02-28T07:02:21","date_gmt":"2025-02-28T15:02:21","guid":{"rendered":"https:\/\/xira.com\/p\/2025\/02\/28\/legal-ai-tools-show-promise-in-first-of-its-kind-benchmark-study-with-harvey-and-cocounsel-leading-the-pack\/"},"modified":"2025-02-28T07:02:21","modified_gmt":"2025-02-28T15:02:21","slug":"legal-ai-tools-show-promise-in-first-of-its-kind-benchmark-study-with-harvey-and-cocounsel-leading-the-pack","status":"publish","type":"post","link":"https:\/\/xira.com\/p\/2025\/02\/28\/legal-ai-tools-show-promise-in-first-of-its-kind-benchmark-study-with-harvey-and-cocounsel-leading-the-pack\/","title":{"rendered":"Legal AI Tools Show Promise in First-of-its-Kind Benchmark Study, with Harvey and CoCounsel Leading the Pack"},"content":{"rendered":"<p>Are you still on the fence about whether generative artificial intelligence can do the work of human lawyers? If so, I urge you to read this new study.<\/p>\n<p>Published yesterday, this first-of-its-kind study evaluated the performance of four legal AI tools across seven core legal tasks. 
In many cases, it found, AI tools can perform at or above the level of human lawyers, while offering significantly faster response times.<\/p>\n<p>The <a href=\"https:\/\/www.vals.ai\/vlair\" rel=\"nofollow noopener\" target=\"_blank\">Vals Legal AI Report<\/a> (VLAIR) represents the first systematic attempt to independently benchmark legal AI tools against a lawyer control group, using real-world tasks derived from Am Law 100 firms.<\/p>\n<p>It evaluated AI tools from four vendors \u2014 Harvey, Thomson Reuters (CoCounsel), vLex (Vincent AI), and Vecflow (Oliver) \u2014 on tasks including document extraction, document Q&amp;A, summarization, redlining, transcript analysis, chronology generation, and EDGAR research.<\/p>\n<p>LexisNexis originally participated in the benchmarking but, after the report was written, chose to withdraw from every task except legal research. The results of the legal research benchmarking will be published in a separate report.<\/p>\n<h3>Key Findings<\/h3>\n<p>Harvey Assistant emerged as the standout performer, achieving the highest scores in five of the six tasks it participated in, including an impressive 94.8% accuracy rate for document Q&amp;A. Harvey exceeded lawyer performance in four tasks and matched the baseline in chronology generation.<\/p>\n<p>(Each vendor could choose which of the evaluated skills it wished to opt into.)<\/p>\n<p>\u201cHarvey\u2019s platform leverages models to provide high-quality, reliable assistance for legal professionals,\u201d the report said. 
\u201cHarvey draws upon multiple LLMs and other models, including custom fine-tuned models trained on legal processes and data in partnership with OpenAI, with each query of the system involving between 30 and 1,500 model calls.\u201d<\/p>\n<p>CoCounsel from Thomson Reuters was the only other vendor whose AI tool received a top score \u2014 77.2% for document summarization \u2014 and consistently ranked among top-performing tools across all four tasks it participated in, with scores ranging from 73.2% to 89.6%.<\/p>\n<p>The Lawyer Baseline (the results produced by a lawyer control group) outperformed the AI tools on two tasks \u2014 EDGAR research (70.1%) and redlining (79.7%) \u2014 suggesting these areas may remain, for now at least, better suited to humans. AI tools collectively surpassed the Lawyer Baseline on document analysis, information retrieval, and data extraction tasks.<\/p>\n<p>Perhaps not surprisingly, the study found a dramatic difference in response times between AI and humans. 
The report found that AI tools were \u201csix times faster than the lawyers at the lowest end, and 80 times faster at the highest end,\u201d making a strong case for AI tools as efficiency drivers in legal workflows.<\/p>\n<p>\u201cThe generative AI-based systems provide answers so quickly that they can be useful starting points for lawyers to begin their work more efficiently,\u201d the report concluded.<\/p>\n<div id=\"attachment_49075\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/www.lawnext.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-28-085636.png?ssl=1\" rel=\"nofollow noopener\" target=\"_blank\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-49075\" class=\"size-full wp-image-49075\" src=\"https:\/\/i0.wp.com\/www.lawnext.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-28-085636.png?resize=657%2C412&#038;ssl=1\" alt=\"\" width=\"657\" height=\"412\" title=\"\"><\/a>\n<p id=\"caption-attachment-49075\" class=\"wp-caption-text\"><strong>Document Q&amp;A produced the highest scores out of any task in the study, leading the report to conclude that it is a task for which lawyers should find value in using generative AI.<\/strong><\/p>\n<\/div>\n<p>The report found that Harvey Assistant was consistently the fastest, with CoCounsel also being \u201cextraordinarily quick,\u201d both providing responses in less than a minute.<\/p>\n<p>But it also said that Vincent AI \u201cgave responses exceptionally quickly as generally one of the fastest products we evaluated.\u201d<\/p>\n<p>Oliver was found to be the slowest, often taking five minutes or more per query. 
The report said this is likely due to Oliver\u2019s agentic workflow, which breaks tasks into multiple steps.<\/p>\n<h3>Vendor-Specific Performance<\/h3>\n<p>Harvey, the fastest-growing legal technology startup in the space (having raised over $200 million and achieved unicorn status since its founding in 2022), opted into more tasks than any other vendor and received the highest scores in document Q&amp;A, document extraction, redlining, transcript analysis, and chronology generation.<\/p>\n<p>\u201cHarvey Assistant either matched or outperformed the Lawyer Baseline in five tasks and it outperformed the other AI tools in four tasks evaluated,\u201d the report said. \u201cHarvey Assistant also received two of the three highest scores across all tasks evaluated in the study, for Document Q&amp;A (94.8%) and Chronology Generation (80.2% \u2014 matching the Lawyer Baseline).\u201d<\/p>\n<p>CoCounsel 2.0 from Thomson Reuters was submitted for four of the tasks and consistently performed well, the study found, achieving an average score of 79.5% across its four evaluated tasks \u2014 the highest average score in the study. 
It particularly excelled at document Q&amp;A (89.6%) and document summarization (77.2%).<\/p>\n<p>\u201cCoCounsel surpassed the Lawyer Baseline in those four tasks alone by more than 10 points,\u201d the study said.<\/p>\n<div id=\"attachment_49076\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/www.lawnext.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-28-090014.png?ssl=1\" rel=\"nofollow noopener\" target=\"_blank\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-49076\" class=\"size-full wp-image-49076\" src=\"https:\/\/i0.wp.com\/www.lawnext.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-28-090014.png?resize=653%2C409&#038;ssl=1\" alt=\"\" width=\"653\" height=\"409\" title=\"\"><\/a>\n<p id=\"caption-attachment-49076\" class=\"wp-caption-text\">For document summarization, all the gen AI tools performed better than the Lawyer Baseline.<\/p>\n<\/div>\n<p>Vincent AI from vLex participated in six tasks \u2014 second only to Harvey in number of tasks \u2014 with scores ranging from 53.6% to 72.7%, outperforming the Lawyer Baseline on document Q&amp;A, document summarization, and transcript analysis.<\/p>\n<p>The report said that Vincent AI\u2019s design is particularly noteworthy for its ability to infer the appropriate subskill to execute based on the user\u2019s question, and that the answers it provided were \u201cimpressively thorough.\u201d<\/p>\n<p>Oddly (I thought), the report praised Vincent AI for refusing to answer questions when it did not have sufficient data, rather than giving a hallucinated answer. But the report said those refusals to answer also negatively affected its scores.<\/p>\n<p>Oliver, released last September by the startup Vecflow, was described in the report as \u201cthe best-performing AI tool\u201d on the challenging EDGAR research task. That would seem a given, since it was the only AI tool to participate in that task. 
It scored 55.2% against the Lawyer Baseline\u2019s 70.1%.<\/p>\n<p>The report highlighted Oliver\u2019s \u201cagentic workflow\u201d approach as potentially valuable for complex research tasks requiring multiple steps and iterative decision-making, and said it excels at explaining its reasoning and actions as it works.<\/p>\n<p>\u201cOliver bested at least one other product for every task it opted into,\u201d the report said. \u201cOliver also outperformed the Lawyer Baseline for Document Q&amp;A and Document Summarization.\u201d<\/p>\n<h3>Methodology<\/h3>\n<p>The study was developed in partnership with Legaltech Hub and a consortium of law firms including Reed Smith, Fisher Phillips, McDermott Will &amp; Emery, and Ogletree Deakins, along with four anonymous firms. The consortium created a dataset of over 500 samples reflecting real-world legal tasks.<\/p>\n<p>Vals AI developed an automated evaluation framework to provide consistent assessment across tasks. The study notes that the lawyer control group was \u201cblind\u201d \u2014 participating lawyers were unaware they were part of a benchmarking study and received assignments formatted as typical client requests.<\/p>\n<p>Tara Waters was Vals AI\u2019s project lead for the study.<\/p>\n<h3>Future Directions<\/h3>\n<p>The report indicates this benchmark is the first iteration of what it says will be a regular evaluation of legal industry AI tools, with plans to repeat this study annually and add others. Future iterations may expand to include more vendors, additional tasks, and coverage of international jurisdictions beyond the current U.S. 
focus.<\/p>\n<p>\u201cThere is growing momentum across the legal industry for standardized methodologies, benchmarking, and a shared language for evaluating AI tools,\u201d the report notes.<\/p>\n<p>Nicola Shaver and Jeroen Plink of Legaltech Hub were credited for their \u201cpartnership in conceptualizing and designing the study and bringing together a high-quality cohort of vendors and law firms.\u201d<\/p>\n<p>\u201cOverall, this study\u2019s results support the conclusion that these legal AI tools have value for lawyers and law firms,\u201d the study concludes, \u201calthough there remains room for improvement in both how we evaluate these tools and their performance.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Are you still on the fence about whether generative artificial intelligence can do the work of human lawyers? If so, I urge you to read this new study. Published yesterday, this first-of-its-kind study evaluated the performance of four legal AI tools across seven core legal tasks. 
In many cases, it found, AI tools can perform [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":109449,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[24],"tags":[],"class_list":["post-109448","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-lawsite"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/xira.com\/p\/wp-content\/uploads\/2025\/02\/Vals-Study-Legal-AI-task-results-1024x576-GJPaK1.png?fit=1024%2C576&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/posts\/109448","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/comments?post=109448"}],"version-history":[{"count":0,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/posts\/109448\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/media\/109449"}],"wp:attachment":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/media?parent=109448"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/categories?post=109448"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/tags?post=109448"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}