Press Release: Study Shows AI Outperforms Doctors in Summarizing Health Records

Posted on March 05, 2024 by Admin

An international team of scientists identified the best-performing large language models and adaptation methods for clinical summarization of large volumes of electronic health record data and compared these models' performance with that of medical experts.

Study

In the present study, the researchers evaluated eight large language models across four clinical summarization tasks, namely, summarizing patient questions, radiology reports, doctor-patient dialogue, and progress notes.

They first used quantitative natural language processing metrics to determine which model and adaptation method performed the best across the four summarization tasks. Ten physicians then conducted a clinical reader study where they compared the best summaries from the large language models with those from medical experts along parameters such as conciseness, correctness, and completeness.
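The press release does not name the specific quantitative metrics the researchers used, but metrics of this kind typically score word overlap between a model's summary and an expert-written reference. As an illustrative sketch only, here is a minimal ROUGE-1-style unigram F1 score in pure Python:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: unigram overlap between a candidate summary
    and a reference summary (illustrative, not the study's exact metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matching-word count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Higher score = closer lexical match to the expert-written reference.
score = rouge1_f1(
    "patient reports chest pain and shortness of breath",
    "the patient has chest pain with shortness of breath",
)
```

Scores like this give a fast, automatic ranking of models and adaptation methods, which is why such metrics are commonly used as a first filter before a human reader study.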

Finally, the researchers assessed safety, examining challenges such as fabricated information and the potential for medical harm in the clinical summaries produced by both medical experts and large language models.

The eight large language models spanned two broad language-generation approaches: sequence-to-sequence (seq2seq) and autoregressive models. Training seq2seq models requires paired input-output datasets, as they use an encoder-decoder architecture that maps an input sequence to an output sequence. These models perform well on tasks such as summarization and machine translation.

On the other hand, autoregressive models do not require paired datasets, and these models are suitable for tasks such as dialogue and question-answer interactions and text generation. The study evaluated open-sourced autoregressive and seq2seq large language models, as well as some proprietary autoregressive models and two techniques for adapting the general-purpose, pre-trained large language models to perform domain-specific tasks.
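The autoregressive setup described above can be illustrated with a toy next-token model: it is trained on raw, unpaired text simply by counting which token follows which, then generates by repeatedly predicting the next token. This is a deliberately simplified sketch of the training objective, not any model from the study:

```python
from collections import defaultdict, Counter

def train_bigram(text: str):
    """Count next-token frequencies from raw (unpaired) text --
    the essence of the autoregressive objective, at toy scale."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return counts

def generate(counts, start: str, max_tokens: int = 5) -> str:
    """Greedily emit the most likely next token at each step."""
    out = [start]
    for _ in range(max_tokens):
        nxt = counts.get(out[-1])
        if not nxt:  # token never seen as a predecessor: stop
            break
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

model = train_bigram("the patient reports pain . the patient denies fever .")
completion = generate(model, "the")
```

Note that, unlike the seq2seq case, no paired (input, summary) examples were needed: the model learns from the text stream alone, which is what makes autoregressive models flexible for dialogue and free-form generation.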

The four summarization tasks used to evaluate the large language models were: condensing detailed radiology reports covering analyses and results, summarizing patient questions into short queries, producing a list of medical problems and diagnoses from progress notes, and summarizing doctor-patient interactions into an assessment-and-plan paragraph.

Findings

The results showed that 45% of the summaries from the best-adapted large language models were judged equivalent to, and 36% superior to, those from medical experts. In the clinical reader study, the large language model summaries also scored higher than the medical expert summaries on all three parameters: conciseness, correctness, and completeness.

The scientists also found that 'prompt engineering', the process of tuning or modifying the input prompts, greatly improved model performance. This was especially apparent for conciseness: prompts instructing the model to summarize patient questions into queries of a specific word count helped condense the information meaningfully.
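The word-count instruction described above is a simple prompt-engineering pattern. The release does not quote the study's actual prompts, so the template below is a hypothetical sketch of the idea: an explicit word budget written into the instruction nudges the model toward a more concise query.

```python
def build_summarization_prompt(patient_question: str, max_words: int = 15) -> str:
    """Hypothetical prompt template (illustrative, not the study's wording):
    an explicit word limit steers the model toward conciseness."""
    return (
        f"Summarize the following patient question into a single query "
        f"of at most {max_words} words.\n\n"
        f"Patient question: {patient_question}\n"
        f"Query:"
    )

prompt = build_summarization_prompt(
    "I've had a dull headache for three days and ibuprofen isn't helping. "
    "Should I be worried and do I need to see a doctor?",
    max_words=12,
)
```

Tweaking only this instruction text, with no retraining, is what makes prompt engineering such a cheap lever for adapting general-purpose models to a clinical task.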

Radiology reports were the one task where the large language model summaries were less concise than those of medical experts, and the scientists attributed this to the vagueness of the input prompt, since the prompts for summarizing radiology reports did not specify a word limit. They also believe that incorporating checks from other large language models or model ensembles, as well as from human reviewers, could greatly improve the accuracy of this process.

Conclusion

Overall, the study found that large language model summaries of patient health record data were as good as or better than summaries written by medical experts. Most of these models scored higher than the human experts on the natural language processing metrics, summarizing the data concisely, correctly, and completely. With further modifications and improvements, this approach could be implemented to help clinicians save valuable time and improve patient care.

Source:

https://www.news-medical.net/news/20240228/AI-outperforms-doctors-in-summarizing-health-records-study-shows.aspx