A new study led by James Cook University (JCU) has found that human coders outperformed artificial intelligence (AI) in classifying complex medical case documents, though researchers believe AI could still play a crucial role in improving the accuracy and efficiency of disease coding.
The study compared the performance of five human clinical document coders with that of a ChatGPT-based large language model on 100 randomly selected clinical patient summaries covering five major disease categories. The AI achieved only 22% accuracy, while the top-performing human coder reached 47%.
“We observed that several human coders consistently outperformed the AI tool across most cases,” said Akram Mustafa, a JCU PhD candidate and the lead author of the study. “Although some coders performed worse, when all five disease categories were considered, human coders outperformed AI overall.”
Human coders are responsible for translating health records into standardized alphanumeric codes, which are vital for state and federal data reporting, health service planning, and hospital funding models. While AI has shown promise in automating certain aspects of this process, the study highlights that human expertise still plays a critical role in handling complex and incomplete clinical data.
Mustafa emphasized that while previous studies have compared human coders with AI in classifying medical documents, this research focused on more challenging cases. “Some clinical cases are relatively easy to classify, where traditional machine learning models perform well. However, we wanted to explore how AI would handle cases where information is missing or unclear,” he said.
In addition to comparing human coders with AI, the study examined the performance of two versions of ChatGPT: 3.5 and 4. Co-author Mostafa Rahimi Azghadi, a professor of Electronics and Computer Engineering at JCU, noted that ChatGPT 4 demonstrated significantly improved consistency in disease classification. “ChatGPT 4 was much more stable, providing identical disease predictions 86% to 89% of the time when given the same clinical document,” said Prof Azghadi.
He likened this process to obtaining a diagnosis from a doctor on two separate occasions. “The variation in results reflects how human doctors might interpret the same information differently on different days,” he explained.
Despite AI’s lower performance, the researchers emphasized that large language models could still be valuable tools in the disease coding process. “Currently, human coders review extensive clinical records, including patient assessments, treatments, and medication histories,” Prof Azghadi said. “AI could complement human coders by flagging difficult cases and streamlining the process, enhancing both accuracy and efficiency.”
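As a rough illustration of how such flagging might work (a hypothetical sketch, not the study’s actual pipeline), a system could classify the same document several times and route it to a human coder whenever the model’s answers disagree, echoing the consistency measure Prof Azghadi described. The classify_document stub below stands in for any LLM-backed classifier:

    import random
    from collections import Counter

    # Hypothetical sketch, not the study's code: flag a clinical document for
    # human review when repeated LLM disease-category predictions disagree.

    def classify_document(document: str) -> str:
        # Stand-in for a real LLM call; here we simulate a noisy classifier.
        return random.choice(["cardiac", "cardiac", "cardiac", "respiratory"])

    def flag_for_review(document: str, runs: int = 5, threshold: float = 0.8):
        # Classify the document several times; agreement is the share of runs
        # that produced the most common prediction.
        predictions = [classify_document(document) for _ in range(runs)]
        label, count = Counter(predictions).most_common(1)[0]
        agreement = count / runs
        return label, agreement, agreement < threshold

    label, agreement, needs_human = flag_for_review("72-year-old, chest pain, ...")
    print(f"{label} (agreement {agreement:.0%}) -> needs human review: {needs_human}")

In practice, the agreement threshold would be tuned so that consistently coded documents pass straight through while ambiguous ones are escalated for human review.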
Looking ahead, the research team plans to improve the explainability of AI models, enabling them to offer detailed justifications for their disease classifications. This could help increase trust and transparency in AI-assisted medical coding.
Dr. Usman Naseem, a researcher from Macquarie University’s School of Computing, also contributed to the study.
While human expertise remains essential, the study suggests that AI has the potential to transform medical coding practices, offering a more efficient, accurate, and scalable solution for the future.