Continued need to improve AI tools in healthcare: Study

Author

First Check Team

National Institutes of Health researchers find AI tools significantly less accurate when analysing summaries written by patients about their own health.

While artificial intelligence (AI) tools can make accurate diagnoses from textbook-like descriptions of genetic diseases, the tools are significantly less accurate when analysing summaries written by patients about their own health, according to researchers at the National Institutes of Health (NIH) in the US. The findings, published recently in the American Journal of Human Genetics, demonstrate the need to improve AI tools before they can help make diagnoses and answer patient questions.

The researchers tested 10 large language models, including two recent versions of ChatGPT. Drawing from medical textbooks and other reference materials, the researchers designed questions about 63 different genetic conditions. These included some well-known conditions, such as sickle cell anemia, cystic fibrosis and Marfan syndrome, as well as many rare genetic conditions.

These conditions can show up in a variety of ways among different patients, and the researchers aimed to capture some of the most common possible symptoms. They selected three to five symptoms for each condition and generated questions phrased in a standard format, “I have X, Y and Z symptoms. What’s the most likely genetic condition?”
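The standard question format can be sketched in a few lines of Python. This is only an illustration of the template quoted above, not the researchers' code: the function name `build_question` and the placeholder symptoms are assumptions.

```python
def build_question(symptoms):
    """Format three to five symptoms into the standard question.

    Hypothetical helper: only the sentence template comes from the article.
    """
    if not 3 <= len(symptoms) <= 5:
        raise ValueError("expected three to five symptoms")
    # Join as "X, Y and Z" to match the quoted format.
    listed = ", ".join(symptoms[:-1]) + " and " + symptoms[-1]
    return f"I have {listed} symptoms. What's the most likely genetic condition?"


# Placeholder symptoms, as in the article's own example.
print(build_question(["X", "Y", "Z"]))
```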

When presented with these questions, the large language models ranged widely in their ability to point to the correct genetic diagnosis, with initial accuracies between 21 per cent and 90 per cent. The best-performing model was GPT-4, one of the latest versions of ChatGPT.

The success of the models generally corresponded with their size, measured in parameters: the smallest models have several billion, while the largest have over a trillion. For many of the lower-performing models, the researchers were able to improve the accuracy over subsequent experiments. Overall, the models still delivered more accurate responses than non-AI technologies, including a standard Google search.

The researchers optimised and tested the models in various ways, including replacing medical terms with more common language. For example, instead of saying a child has “macrocephaly,” the question would say the child has “a big head,” more closely reflecting how patients or caregivers might describe a symptom to a doctor. Overall, the models’ accuracy decreased when medical descriptions were removed. However, seven out of ten models were still more accurate than Google searches when using common language.
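One way to picture the term replacement is a simple lookup that swaps medical terms for lay phrases. The sketch below is an assumption about how such a step could work, not the study's method: `LAY_TERMS` and `to_common_language` are hypothetical names, and only the "macrocephaly" entry comes from the article.

```python
import re

# Hypothetical mapping with lowercase keys; only this pair is from the article.
LAY_TERMS = {"macrocephaly": "a big head"}


def to_common_language(text, mapping=LAY_TERMS):
    """Replace whole-word medical terms with common-language phrases."""
    if not mapping:
        return text
    # Try longer terms first so overlapping terms resolve predictably.
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


print(to_common_language("The child has macrocephaly."))
```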

The researchers also asked patients from the NIH Clinical Center to provide short write-ups about their own genetic conditions and symptoms. These descriptions ranged from a sentence to a few paragraphs and were more variable in style and content than the textbook-like questions. When presented with these descriptions from real patients, the best-performing model made accurate diagnoses only 21 per cent of the time. Many models performed much worse, with accuracy as low as one per cent.

However, the accuracies improved when the researchers wrote standardised questions about the same genetic conditions. This indicates that the variable phrasing and format of patient write-ups are difficult for the models to interpret, perhaps because the models are trained on textbooks and other reference materials that tend to be more concise and standardised.

“For these models to be clinically useful in the future, we need more data, and those data need to reflect the diversity of patients,” said Ben Solomon, M.D., senior author of the study and clinical director at the NIH’s National Human Genome Research Institute (NHGRI). “Not only do we need to represent all known medical conditions, but also variation in age, race, gender, cultural background and so on, so that the data capture the diversity of patient experiences. Then these models can learn how different people may talk about their conditions.”

The study highlights the current limitations of large language models and the continued need for human oversight when AI is applied in healthcare. “These technologies are already rolling out in clinical settings,” Solomon noted. “The biggest questions are no longer about whether clinicians will use AI, but where and how clinicians should use AI, and where should we not use AI to take the best possible care of our patients.”
