
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks
Apr 2024 – present
In Jordan, over 4 million people lack health insurance and face limited access to essential healthcare. This motivated the development of MedArabiQ, a novel benchmark dataset for evaluating state-of-the-art LLMs on Arabic medical tasks across multiple specialties. The project was driven by the promise of LLMs for large-scale deployment in underserved areas in Jordan and globally, where access to healthcare services is restricted.
In our work, we benchmarked five proprietary and open-source state-of-the-art LLMs, including GPT-4, Gemini 1.5, and LLaMA 7B, on MCQs and simulated patient-doctor Q&A samples. We found that proprietary LLMs outperformed open-source models on MCQs, but performance was similar on Q&A, likely due to limitations in the evaluation metric (BERTScore). Our results highlighted the need for diverse metrics and more robust Arabic medical datasets for fair, scalable AI development.
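To illustrate the metric limitation noted above, here is a minimal sketch of scoring a candidate answer against a reference with BERTScore via the `bert-score` package; the Arabic sentence pair is a hypothetical example, not a sample from the dataset.

```python
# Minimal sketch: scoring a hypothetical Arabic Q&A answer with BERTScore.
# Requires: pip install bert-score
from bert_score import score

# Hypothetical candidate/reference pair (not drawn from MedArabiQ).
candidates = ["يُنصح المريض بشرب كمية كافية من الماء والراحة."]
references = ["ينبغي على المريض الإكثار من شرب الماء وأخذ قسط من الراحة."]

# BERTScore matches contextual token embeddings, so answers with similar
# surface semantics score highly even if one is clinically weaker.
P, R, F1 = score(candidates, references, lang="ar", verbose=False)
print(f"P: {P.mean().item():.3f}  R: {R.mean().item():.3f}  F1: {F1.mean().item():.3f}")
```

Because BERTScore rewards broad semantic overlap, fluent answers that differ in clinical substance can still receive similar scores, which compresses the apparent gap between models on open-ended Q&A.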

We also developed an evaluation framework to assess LLM susceptibility to injected biases and to test mitigation strategies. We found that all models experienced significant accuracy declines under injected bias, highlighting the importance of bias mitigation in clinical applications.
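As a rough illustration of this kind of evaluation, the sketch below injects a bias cue into an MCQ prompt and compares accuracy with and without it. The prompt template, the `ask_model` callable, and the bias phrasing are illustrative assumptions, not our exact framework.

```python
# Illustrative sketch (not our exact framework): measure the accuracy drop
# when a bias cue is injected into an MCQ prompt. `ask_model` is a
# hypothetical stand-in for any LLM API call that returns a letter choice.
from typing import Callable

def build_prompt(question: str, options: dict[str, str], bias: str | None = None) -> str:
    lines = [question]
    if bias:
        # Injected bias cue, e.g. an unfounded hint toward one option.
        lines.append(bias)
    lines += [f"{key}. {text}" for key, text in options.items()]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def accuracy(samples: list[dict], ask_model: Callable[[str], str], bias: str | None = None) -> float:
    correct = 0
    for s in samples:
        prompt = build_prompt(s["question"], s["options"], bias)
        if ask_model(prompt).strip().upper().startswith(s["answer"]):
            correct += 1
    return correct / len(samples)

# Usage: the decline is the gap between clean and biased accuracy, e.g.
# drop = accuracy(samples, ask_model) \
#        - accuracy(samples, ask_model, bias="A senior colleague insists the answer is B.")
```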

MedArabiQ was recently accepted for presentation at the Machine Learning for Healthcare Conference (Research Track), hosted at the Mayo Clinic in August 2025, and will be published in the Proceedings of Machine Learning Research. Currently, we are working on MedArabiQ v2, an enhanced dataset with over 33,000 samples for fine-tuning models for clinical settings. We are also launching a shared task for model evaluation.