
AI on Trial: Legal Models Hallucinate in 1 out of 6 Queries

Stanford University Human-Centered Artificial Intelligence – A new study reveals the need for benchmarking and public evaluations of AI tools in law. “Artificial intelligence (AI) tools are rapidly transforming the practice of law. Nearly three quarters of lawyers plan on using generative AI for their work, from sifting through mountains of case law to drafting contracts to reviewing documents to writing legal memoranda. But are these tools reliable enough for real-world use? Large language models have a documented tendency to “hallucinate,” or make up false information. In one highly-publicized case, a New York lawyer faced sanctions for citing ChatGPT-invented fictional cases in a legal brief; many similar cases have since been reported. And our previous study of general-purpose chatbots found that they hallucinated between 58% and 82% of the time on legal queries, highlighting the risks of incorporating AI into legal practice. In his 2023 annual report on the judiciary, Chief Justice Roberts took note and warned lawyers of hallucinations.

Across all areas of industry, retrieval-augmented generation (RAG) is seen and promoted as the solution for reducing hallucinations in domain-specific contexts. Relying on RAG, leading legal research services have released AI-powered legal research products that they claim “avoid” hallucinations and guarantee “hallucination-free” legal citations. RAG systems promise to deliver more accurate and trustworthy legal information by integrating a language model with a database of legal documents. Yet providers have not provided hard evidence for such claims or even precisely defined “hallucination,” making it difficult to assess their real-world reliability.

In a new preprint study by Stanford RegLab and HAI researchers, we put the claims of two providers, LexisNexis and Thomson Reuters (the parent company of Westlaw), to the test. We show that their tools do reduce errors compared to general-purpose AI models like GPT-4. That is a substantial improvement and we document instances where these tools can spot mistaken premises. But even these bespoke legal AI tools still hallucinate an alarming amount of the time: these systems produced incorrect information more than 17% of the time—one in every six queries…”
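
The RAG design described in the excerpt is straightforward to sketch in outline: retrieve passages from a trusted corpus of legal documents, then condition the model's answer on those passages rather than on the model's memory alone. The minimal Python sketch below illustrates that pattern; the tiny corpus, the keyword-overlap retriever, and the call_llm placeholder are illustrative assumptions, not a description of LexisNexis's or Westlaw's actual systems.

```python
# Minimal retrieval-augmented generation (RAG) sketch for a legal query.
# Illustrative only: the corpus, the keyword-overlap retriever, and the
# call_llm() placeholder are assumptions, not any vendor's real pipeline.

import re
from collections import Counter

# Stand-in "database of legal documents" (in practice: cases, statutes, etc.)
CORPUS = {
    "smith_v_jones_2019": "Summary judgment is appropriate when there is no "
                          "genuine dispute as to any material fact.",
    "frcp_rule_56": "Rule 56 governs motions for summary judgment in federal "
                    "civil procedure.",
    "doe_v_roe_2021": "A party opposing summary judgment must cite particular "
                      "parts of the record showing a genuine dispute.",
}

def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude proxy for a real retrieval index."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by keyword overlap with the query; return the top k."""
    q = tokenize(query)
    scored = []
    for doc_id, text in CORPUS.items():
        overlap = sum((q & tokenize(text)).values())
        scored.append((overlap, doc_id, text))
    scored.sort(reverse=True)
    return [(doc_id, text) for _, doc_id, text in scored[:k]]

def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Ground the answer in retrieved passages and require citations."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer the legal question using ONLY the sources below. "
        "Cite the bracketed source IDs; say 'not found' if the sources "
        "do not answer the question.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for the language-model call (hypothetical endpoint)."""
    raise NotImplementedError("Send `prompt` to an LLM of your choice here.")

if __name__ == "__main__":
    question = "When is summary judgment appropriate?"
    prompt = build_prompt(question, retrieve(question))
    print(prompt)  # Inspect the grounded prompt; call_llm(prompt) would answer.
```

Even with this grounding step, the study's finding is that a model can still misstate or over-extend what the retrieved sources say, which is why the authors call for public benchmarking of these tools rather than reliance on vendor claims.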

See also – Statement from Thomson Reuters: “Thomson Reuters is aware of the recent paper published by Stanford. We are committed to research and fostering relationships with industry partners that furthers the development of safe and trusted AI. Thomson Reuters believes that any research which includes its solutions should be completed using the product for its intended purpose, and in addition that any benchmarks and definitions are established in partnership with those working in the industry. In this study, Stanford used Practical Law’s Ask Practical Law AI for primary law legal research, which is not its intended use, and would understandably not perform well in this environment. Westlaw’s AI-Assisted Research is the right tool for this work. To help the team at Stanford develop the next phase of its research, we have now made this product available to them.”
