{"id":108186,"date":"2025-02-11T17:02:45","date_gmt":"2025-02-12T01:02:45","guid":{"rendered":"https:\/\/xira.com\/p\/2025\/02\/11\/beauty-is-in-the-ai-of-the-beholder\/"},"modified":"2025-02-11T17:02:45","modified_gmt":"2025-02-12T01:02:45","slug":"beauty-is-in-the-ai-of-the-beholder","status":"publish","type":"post","link":"https:\/\/xira.com\/p\/2025\/02\/11\/beauty-is-in-the-ai-of-the-beholder\/","title":{"rendered":"Beauty Is In The AI Of The Beholder"},"content":{"rendered":"<figure class=\"wp-block-image alignright is-resized\"><img data-recalc-dims=\"1\" height=\"347\" width=\"620\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/abovethelaw.com\/wp-content\/uploads\/sites\/4\/2024\/10\/ai-generated-8540921_1280-1-620x347.jpg?resize=620%2C347&#038;ssl=1\" alt=\"\" title=\"\"><figcaption><\/figcaption><\/figure>\n<p><strong><em>How speed and accuracy benchmarks misrepresent the real value of legal AI<\/em><\/strong><\/p>\n<p>Welcome to the era of the AI superlative. While the first two years of generative artificial intelligence\u00a0 (GenAI) development were an all-out sprint to create new models, establish proof-of-concept\u00a0solutions, and define optimal use cases, the next phase to deliver increased efficiency and better\u00a0work product to clients in the AI lifecycle will be dominated by marketing as well.\u00a0\u00a0<\/p>\n<p>Product claims of the fastest, most accurate large language model (LLM) or \u201challucination-free\u201d\u00a0 results have entered the marketplace. As more companies develop AI solutions and start-ups seek\u00a0capital investment in an increasingly crowded field, customers will seek benchmarks to evaluate\u00a0the efficacy of these tools. 
For benchmarks to be valuable, they must test real-world problems that legal professionals face and measure what customers care about.<\/p>\n<p>The challenge is that one-dimensional metrics do not offer a reliable representation of the real value of GenAI in the legal research process. No LLM-based legal research product on the market today provides answers with 100% accuracy, so users must engage in a two-step process of 1) getting the answer and 2) checking the answer for accuracy.<\/p>\n<p>It\u2019s the end result of this two-step process that matters. Benchmarking just part of this process does not provide useful information \u2014 unless there is a part of the process that is completely broken.<\/p>\n<p>In drag racing, cars need to accelerate as fast as they can and then brake quickly. For braking, they typically deploy a parachute behind the car to increase drag, supplementing traditional braking methods. What drag racers care about is how quickly and safely the car brakes. If we wanted to benchmark different braking systems, we\u2019d test them from the time of deployment to the time the car stopped, measuring both time and distance. Instead, imagine benchmarking braking systems by measuring only how fast the parachutes deployed.<\/p>\n<p>Similarly, with a research product where all answers must be checked, what matters most is how quickly and accurately researchers can get to the end of that process. For instance, which legal research system would you prefer? 
One where:<\/p>\n<p>Option A: LLM-generated answers are accurate 95% of the time, and researchers, on average, can verify accuracy within 25 minutes and get to an accurate answer 97% of the time, or<\/p>\n<p>Option B: LLM-generated answers are accurate 85% of the time, and researchers, on average, can verify accuracy within 15 minutes and get to an accurate answer 100% of the time.<\/p>\n<p>Since all researchers need to engage in this two-step process 100% of the time, it\u2019s clear that Option B would be better. So why would we benchmark just the first part of the process?<\/p>\n<p>Technology companies care deeply about benchmarking. However, benchmarks must measure products the way they\u2019re designed to be used and should focus on results customers care about.<\/p>\n<p>It makes sense that the legal field would become an early test bed for this type of analysis. From the earliest days of mainstream GenAI development, when ChatGPT aced the LSAT, legal use cases have been prime examples of both the power and the risks associated with AI. The legal field is no stranger to AI; we have been using it for decades in our legal research platform, and lawyers have been benefitting from it.<\/p>\n<p><strong>Measuring the Full Scope<\/strong><\/p>\n<p>Working with our customers to continually improve legal research, we understand it is a multiphase process with many inputs and factors \u2014 with GenAI capabilities being just one part of it. 
We are not bringing in AI and saying: \u2018Go do all the research and write a brief,\u2019 and then replacing all of our junior associates with automated results,\u201d said Meredith Williams-Range, chief legal operations officer, Gibson, Dunn &amp; Crutcher LLP. \u201cWe\u2019re using AI-enabled tools that are integrated directly into the research and drafting tools we were using already, and, as a result, we\u2019re getting deeper, more nuanced, and more comprehensive insights faster. We have highly trained professionals doing sophisticated information analysis and reporting, augmented by technology.\u201d\u00a0<\/p>\n<p><strong>Looking Beyond the Basics of AI Evaluation\u00a0<\/strong><\/p>\n<p>To state the obvious, benchmark testing should evaluate solutions in accordance with their\u00a0intended use. In legal research, GenAI has demonstrated significant benefits; however, it is meant to be integrated into a comprehensive workflow that includes reviewing primary law, verifying citations, and utilizing statute annotations to ensure a thorough understanding of the law.\u00a0\u00a0<\/p>\n<p>\u201cAt Husch Blackwell, we have focused on end-to-end project efficiency in building and deploying\u00a0our in-house AI tools,\u201d said Blake Rooney, the firm\u2019s chief information officer. \u201cWhile performance metrics that focus on task efficiency can be helpful, project-level performance metrics for efforts such as contract drafting or discovery in litigation do a much better job at underscoring the efficiencies that resonate with both our lawyers and our clients because they provide a clearer picture of overall value and time savings. 
Time is a finite resource that we always wish we could have more of, and our lawyers understand that \u2014 when used properly and responsibly \u2014 AI tools enable them to finish projects faster (and oftentimes better) than they could without AI, thereby delivering true value to our clients and ultimately enabling our lawyers to do more work (or spend more time with family) with the time that they have.\u201d<\/p>\n<p>For legal research, accuracy, consistency, and speed do matter \u2014 but none of them offers a single indicator of success. When it comes to evaluating the performance of professional-grade solutions in specialized fields like law, it is critical not to let isolated snapshots of a single performance metric distort our perspective.<\/p>\n<p>The value of legal AI \u2014 of any technological innovation for that matter \u2014 is in how it gets used in the real world and how well all the different components come together to help lawyers do their jobs more effectively.<\/p>\n<p><strong>About the author<\/strong><\/p>\n<p><em>Raghu Ramanathan is president of Legal Professionals at Thomson Reuters.<\/em><\/p>\n<p>The post <a href=\"https:\/\/abovethelaw.com\/2025\/02\/beauty-is-in-the-ai-of-the-beholder\/\" rel=\"nofollow noopener\" target=\"_blank\">Beauty Is In The AI Of The Beholder<\/a> appeared first on <a href=\"https:\/\/abovethelaw.com\/\" rel=\"nofollow noopener\" target=\"_blank\">Above the Law<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How speed and accuracy benchmarks misrepresent the real value of legal AI Welcome to the era of the AI superlative. 
While the first two years of generative artificial intelligence\u00a0 (GenAI) development were an all-out sprint to create new models, establish proof-of-concept\u00a0solutions, and define optimal use cases, the next phase to deliver increased efficiency and better\u00a0work [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":108187,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[16],"tags":[],"class_list":["post-108186","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-above_the_law"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/xira.com\/p\/wp-content\/uploads\/2025\/02\/ai-generated-8540921_1280-1-620x347-z0B552.jpeg?fit=620%2C347&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/posts\/108186","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/comments?post=108186"}],"version-history":[{"count":0,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/posts\/108186\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/media\/108187"}],"wp:attachment":[{"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/media?parent=108186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/categories?post=108186"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/xira.com\/p\/wp-json\/wp\/v2\/tags?post=108186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated"
:true}]}}