Workshop Overview

Presentations

The workshop began with presentations from the AdSoLve project. Prof. Maria Liakata (QMUL) introduced the project (video). She outlined its scope and highlighted recent progress on topics such as evaluation metrics, cross-modal alignment, modules for temporal reasoning, and a survey on the limitations of LLMs in law. Maria discussed the potential of generative AI in the legal sector, emphasising its efficiency in repetitive tasks such as drafting documents, reviewing contracts, and legal research. She also highlighted the risks associated with the widespread use of generative AI without proper understanding, underscoring the need for an evaluation framework that addresses concerns and risks specific to the sector to ensure LLM-based applications are fit for purpose.

Prof. Rob Procter (Warwick) gave a talk on current AI benchmarking principles and practices (video), emphasising the need to capture both information accuracy and alignment with professional standards to ensure reliable, trustworthy assessments. He discussed the existing disconnect between academic evaluation methods and real-world applications, stressing the importance of involving legal professionals and other stakeholders to develop evaluation tasks that reflect real-world practice and requirements. In the Q&A that followed, he remarked that the adoption of AI calls for new skills on the part of legal professionals if they are to use it safely.

Mahmud Akhter (QMUL), Guneet Kohli (QMUL), and Dr. Xingwei Tan (Sheffield) presented on the challenges in legal reasoning with LLMs (video 1, 2), introducing reasoning types, processes, and failure cases with examples from claim verification and legal tasks. After outlining the impact of domain and task complexity on performance, they discussed work to improve deductive reasoning through symbolic logic, converting verbose outputs into symbolic structures to guide LLMs in reasoning more effectively. Guneet presented ongoing work on improving reference-free evaluation of LLM output by checking the validity of the inference between steps in reasoning chains.

Next, the workshop featured talks from the legal and third sectors. John Craske (CMS) spoke about "Unlocking the Potential of AI in Legal Practice", beginning with the current landscape and adoption trends. He described the legal AI market as large and changing. He gave some real-world examples of time savings: Harvey (30 mins), Copilot (5 mins), RelativityAI (25 times quicker for eDiscovery). He mapped out time spent on different legal tasks: research, reporting, negotiating, advising, review, and drafting. He stressed that AI makes different mistakes to people. Through an analysis of model performance against actual time spent, he showcased an approach to prioritising AI implementation and outlined a vision for collaborative workflows in the future, where AI handles routine tasks while humans focus on judgment, strategy, and client relationships. However, successful adoption requires addressing key challenges, including hallucinations (which have improved but are not solved), data privacy and security, IP and copyright, regulatory compliance and accountability, bias and fairness, strategic adaptation of business models, and the need for cultural shifts toward "AI-first" thinking and improved technical skills. The presentation concluded with an introduction to the LITIG AI Benchmarking Initiative, which aims to develop common standards and benchmarks for legal AI systems to promote transparency, accountability, and responsible adoption across the profession.

Tara Waters (TLW Consulting) presented on “Testing task accuracy in legal AI products: Vals Legal AI Report” (video). The Vals study aimed to address the need for independent, specialised benchmarking in the legal market. She first introduced the benchmark creation process, which involved working closely with firms to identify seven core legal tasks, develop test data, and establish evaluation criteria. In terms of methodology, 10 American law firms created the dataset, tech vendors opted in to participate in individual task evaluations, the ALSP firm Cognia Law was invited to establish lawyer baselines, and model responses were scored using LLM-as-a-judge with rubrics, with secondary human checks on zero-scoring edge cases. The results indicated that AI delivered speed advantages and surpassed human baselines in document analysis, summarisation, and Q&A tasks, but that lawyers still performed better on complex tasks requiring nuanced judgment and iterative decision-making, such as redlining and EDGAR research. In her opinion, AI is ready for some legal tasks. The presentation concluded with next steps to expand and scale up evaluation, including follow-up benchmarking efforts that focus on legal research using human evaluation, and plans to repeat the studies annually to promote transparency and responsible AI adoption as the technology evolves.

Martha de la Roche (The Access to Justice Foundation) discussed “AI and access to justice for marginalised communities” (video). She highlighted AI's potential to improve service delivery efficiency and address immediate needs of marginalised communities, citing examples of current tools used for transcribing client meeting notes and linking advisors with supervisors. She also saw opportunities to widen access to legal support through LLM-based systems, noting that 100,000 more people could be supported through better case management tools. However, she also emphasised risks, particularly that AI, if applied without proper strategic frameworks, could exacerbate existing inequalities, especially in “low resource environments” that deal with “high octane” legal problems (housing, immigration, and family law). She identified major implementation challenges facing frontline organisations, including resource constraints, limited capacity and funding, skill and specialism gaps, and basic infrastructure issues that affect their ability to interact with IT systems and manage change effectively. To address these challenges, Martha proposed establishing a Justice Tech Fund to support organisations, form partnerships, and implement learning from successful projects. While this fund is currently hypothetical, there are ongoing fundraising efforts through partnerships with organisations like the Ministry of Justice and the National Lottery Community Fund, with the aim of starting grant-making by the end of the year.

The AdSoLve team then presented their evaluation platform (video). Sebastian Löbbers (QMUL) and Jenny Chim (QMUL) highlighted the project’s focus on needs-driven evaluation, translating real-life practitioner goals and requirements into concrete multi-aspect evaluations on NLP tasks, operationalised through appropriate datasets, metrics, and assessment strategies. Sebastian described the platform architecture and design of modular bundles, allowing users to either follow existing evaluation setups or build their own based on their specific needs in a secure environment. After a live demo, the feedback session included discussions on potential educational functions of the platform and how organisation-specific decision-making and procurement processes can affect both AI adoption and evaluation.

Panels

The workshop concluded with two panels. In the first panel, “Challenges of Legal AI in Practice”, Jo Owen (Lights-On), Richard Tromans (Tromans Consulting), and Stephen Ingle (Fieldfisher) explored key challenges for AI adoption, identifying trust, time to learn, and too many tools as primary barriers. Additional challenges included IT infrastructure cost, risks from reliance on cloud APIs, and the fundamental problem that the billable hour model creates perverse incentives against efficiency technologies that reduce billable time. For evaluation and benchmarking, panelists distinguished between applications in the business of law (operations, HR, marketing) and legal practice itself, noting that professional indemnity insurance requires human oversight of all legal work and that firms do not encourage AI use for tasks involving citations due to hallucination risks. Some panelists advocated for a COMPASS approach, which focuses on transparency, accountability, and keeping vendors honest rather than pursuing absolute scores. Panelists stressed that effective benchmarking should show the whole process and ensure tools are usable and consistent. They noted the prevalence of “AI washing”, which can arise when marketing, rather than genuine utility, drives interest in innovation. Finally, panelists discussed the need for standardised, interpretable, and scalable tools that address real-world requirements, ensuring AI-driven systems are fit for purpose.

The second panel, “Regulatory and Ethical Considerations of Legal AI”, featured James MacGregor (Ethical eDiscovery), Natalie Leesakul (University of Nottingham, School of Law), and Akber Datoo (D2 Legal Technology, Law Society) discussing barriers to ethical AI adoption in the legal sector. The panelists identified key challenges including digital illiteracy (e.g., limited understanding of AI benefits, limitations, and best practices), risk aversion inherent in legal practice, and the danger of creating a "two-tier justice system" between those with and without access to high-performing technological solutions. They emphasised the importance of human-machine collaboration in the legal sector, arguing that AI should complement human expertise rather than replace it. They discussed the need for a balanced approach to ensure both efficiency and quality, and argued that the emphasis should be not just on tools but on the process. Furthermore, panelists noted that while professional regulations theoretically cover AI use requirements, specific guidance has not caught up with technological developments. The discussion highlighted critical gaps in legal education and professional development, with calls for law schools and professional bodies to integrate AI literacy training and upskilling, establish clearer disclosure duties regarding AI use with clients, and address fundamental infrastructure issues around data quality that underpin effective AI implementation.