Beauty Is In The AI Of The Beholder

How speed and accuracy benchmarks misrepresent the real value of legal AI

Welcome to the era of the AI superlative. While the first two years of generative artificial intelligence (GenAI) development were an all-out sprint to create new models, establish proof-of-concept solutions, and define optimal use cases, the next phase of the AI lifecycle, focused on delivering increased efficiency and better work product to clients, will be dominated by marketing as well.

Product claims of the fastest, most accurate large language model (LLM) or “hallucination-free” results have entered the marketplace. As more companies develop AI solutions and start-ups seek capital investment in an increasingly crowded field, customers will seek benchmarks to evaluate the efficacy of these tools. For benchmarks to be valuable, they must test real-world problems that legal professionals face and measure what customers care about.

The challenge is that one-dimensional metrics do not offer a reliable representation of the real value of GenAI in the legal research process. No LLM-based legal research product on the market today provides answers with 100% accuracy, so users must engage in a two-step process: 1) getting the answer and 2) checking the answer for accuracy.

It’s the end result of this two-step process that matters. Benchmarking just part of this process does not provide useful information — unless there is a part of the process that is completely broken. 

In drag racing, cars need to accelerate as fast as they can and then brake quickly. For braking, they typically deploy a parachute behind the car to increase drag, in addition to using traditional braking methods. What drag racers care about is how quickly and safely the car stops. If we wanted to benchmark different braking systems, we would test them from the moment of deployment to the moment the car stopped, measuring time and distance. Instead, imagine benchmarking braking systems by measuring only how fast the parachutes deployed.

Similarly, with a research product where all answers must be checked, what matters most is how quickly and accurately researchers can get to the end of that process. For instance, which legal research system would you prefer? One where: 

a) LLM-generated answers are accurate 95% of the time, and researchers, on average, can verify accuracy within 25 minutes and get to an accurate answer 97% of the time, or 

b) LLM-generated answers are accurate 85% of the time, and researchers, on average, can verify accuracy within 15 minutes and get to an accurate answer 100% of the time. 

Since all researchers need to engage in this two-step process 100% of the time, it’s clear that Option B would be better. So why would we just benchmark the first part of the process? 
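The Option A/B tradeoff can be sketched as a toy calculation. The numbers are taken directly from the example above; the per-question framing and variable names are assumptions made purely for illustration:

```python
# Toy comparison of two hypothetical legal research systems, using the
# article's Option A / Option B numbers. Because every answer must be
# verified, the end-to-end cost is the verification time, and what the
# researcher ultimately gets is the final accuracy -- not the raw LLM
# accuracy measured in isolation.
options = {
    "A": {"llm_accuracy": 0.95, "verify_minutes": 25, "final_accuracy": 0.97},
    "B": {"llm_accuracy": 0.85, "verify_minutes": 15, "final_accuracy": 1.00},
}

for name, o in options.items():
    print(
        f"Option {name}: {o['verify_minutes']} min to verify, "
        f"{o['final_accuracy']:.0%} accurate end-to-end "
        f"(raw LLM accuracy: {o['llm_accuracy']:.0%})"
    )
```

Despite its lower raw LLM accuracy, Option B wins on both of the measures a researcher actually experiences: time to a verified answer and end-to-end accuracy.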

Technology companies care deeply about benchmarking. However, benchmarks must measure products the way they’re designed to be used and should focus on results customers care about.

It makes sense that the legal field would become an early test bed for this type of analysis. From the earliest days of mainstream GenAI development, when ChatGPT aced the LSAT, legal use cases have been prime examples of both the power and the risks associated with AI. The legal field is no stranger to AI; we have been using it in our legal research platform for decades, and lawyers have been benefiting from it just as long.

Measuring the Full Scope 

Working with our customers to continually improve legal research, we understand it is a multiphase process with many inputs and factors, and GenAI capabilities are just one part of it. The entire legal research process is detailed and complex: lawyers must check sources and validate material, in essence following sound, holistic research practices to ensure their research is comprehensive and accurate. Benchmarking one part of this process cannot measure the full scope or true value of legal research.

“There is a widespread misperception around how law firms are using AI and how we conduct legal research. We are not bringing in AI and saying: ‘Go do all the research and write a brief,’ and then replacing all of our junior associates with automated results,” said Meredith Williams-Range, chief legal operations officer, Gibson, Dunn & Crutcher LLP. “We’re using AI-enabled tools that are integrated directly into the research and drafting tools we were using already, and, as a result, we’re getting deeper, more nuanced, and more comprehensive insights faster. We have highly trained professionals doing sophisticated information analysis and reporting, augmented by technology.” 

Looking Beyond the Basics of AI Evaluation 

To state the obvious, benchmark testing should evaluate solutions in accordance with their intended use. In legal research, GenAI has demonstrated significant benefits; however, it is meant to be integrated into a comprehensive workflow that includes reviewing primary law, verifying citations, and utilizing statute annotations to ensure a thorough understanding of the law.  

“At Husch Blackwell, we have focused on end-to-end project efficiency in building and deploying our in-house AI tools,” said Blake Rooney, the firm’s chief information officer. “While performance metrics that focus on task efficiency can be helpful, project-level performance metrics for efforts such as contract drafting or discovery in litigation do a much better job at underscoring the efficiencies that resonate with both our lawyers and our clients because they provide a clearer picture of overall value and time savings. Time is a finite resource that we always wish we could have more of, and our lawyers understand that — when used properly and  responsibly — AI tools enable them to finish projects faster (and oftentimes better) than they could without AI, thereby delivering true value to our clients and ultimately enabling our lawyers to do more work (or spend more time with family) with the time that they have.” 

For legal research, accuracy, consistency, and speed all matter, but no single one of them is a reliable indicator of success. When evaluating the performance of professional-grade solutions in specialized fields like law, it is critical not to let isolated snapshots of a single performance metric distort our perspective.

The value of legal AI — of any technological innovation for that matter — is in how it gets used in the real world and how well all the different components come together to help lawyers do their jobs more effectively.  

About the author 

Raghu Ramanathan is president of Legal Professionals at Thomson Reuters.

The post Beauty Is In The AI Of The Beholder appeared first on Above the Law.
