Overview

Our launch event took place on December 5th 2024. We started with four talks from the project team. Prof Maria Liakata gave an overview of the project, presenting its goals, progress so far in terms of publications, and next steps. Prof Greg Slabaugh then gave a talk on the project's first use case, multi-modal medical diagnostics and monitoring, focussing particularly on the goals and technical challenges of multi-modal cancer diagnostics research. Next, Prof Domenico Giacco spoke about another major use case, AI support for mental health, focussing on requirements and challenges for summarisation, monitoring and dialogue self-management. Finally, Dr Jiahong Chen presented our third use case, AI legal support, highlighting findings from our recent literature review on applications of LLMs in the legal domain. All three use case presentations ended with a diagram showing how requirements for different LLM-based tasks could be translated into evaluation criteria, metrics and subtasks for assessing system performance and appropriateness.

Following the project team's presentations in the morning, we had talks from project partners. Jonathan Pearson from NHS England shared thoughts on the evaluation of LLM-based applications in the context of the NHS. Hannah Richardson from Microsoft Futures talked about steps towards understanding risk management for generative AI as a medical device. Zachary Goldberg from Trilateral Research discussed how responsible innovation had been incorporated into pre-LLM solutions in ways that could inform current approaches. Henry Sturm from LegalGeek talked about the state of legal tech in the UK and the role of AI.

In the afternoon we had a presentation of an evaluation platform for assessing tasks performed by LLM systems, followed by a themed discussion inspired by the diagrams of requirements, tasks and metrics presented in the morning's use case talks. After the presentation we split into three groups, one per use case, to discuss requirements, challenges and evaluation needs for each:

For use case 1 on multi-modal diagnostics and monitoring, the group focussed on multi-modal diagnostics for pancreatic cancer. The group discussed needs and challenges for this use case, including: defining user profiles (personas) with different needs and requirements; capturing ethnic diversity in the patient population; the need for continuous auditing of the system over time; and technical requirements for quicker and more accurate detection, such as collecting multiple multi-modal samples, incorporating information from previous engagements, temporal modelling and confidence in predictions. Finally, emphasis was placed on ensuring that LLM-based solutions ease workloads, through integration with existing practices.

For use case 2 on AI support for mental health, the group discussed ways to define requirements and challenges, particularly regarding evaluation. There is a need to distinguish between evaluating an AI system per se and evaluating patient outcomes, and to set goals at the individual user level as well as the system level. Whether, and at what point, an AI system should be classified as a medical device was also discussed. In terms of system functionality, the group discussed the need to help clinicians with decision making (e.g. by identifying trends or suggesting treatment pathways) and to support both controlled and open interactions (e.g. structured vs open-ended conversation). The system should also be designed so that it can be embedded in current routine clinical workflows. AI system acceptability would likely increase if the system were positioned as a general support tool rather than targeting a particular clinical role.

For use case 3 on AI for legal support, the group discussed project-specific recommendations, requirements and challenges for evaluation, and potential directions for addressing such challenges. Specifically, it is crucial to understand legal workflows and how various tasks (e.g. summarisation) would fit within them. In terms of evaluation, it is important to consider what attaining professional standards (e.g. passing the bar exam) means in terms of overall professional competence. Also, not all errors are equal, so there needs to be a weighting of, for example, the effect of hallucinations across different tasks and contexts. The development of systems for AI legal support should consider the trade-off between risks and mitigation efforts, as some errors are not legally permissible. As in the medical diagnostics use case, the group discussed considering different types of user profiles (personas) and ensuring that the system adds value (e.g. that the workload of professional lawyers is alleviated rather than increased by the need to perform additional checks). Technical directions discussed for evaluating the performance of LLMs in legal contexts include comparing reasoning pathways and post-editing of generated outputs. The group also discussed transparency and the duty to disclose the risks of using LLMs, as well as the broader implications of using AI, which could result in valuable training opportunities for young lawyers disappearing.