Acceptable accuracy for medical AI: a survey of physicians and the general population in Sweden
As artificial intelligence (AI) tools enter clinics and patient smartphones at unprecedented speed, a fundamental question emerges: what level of performance is ‘good enough’ for AI to guide medical decisions? Early applications—such as rules-based expert systems for ECG interpretation—have been in use for decades, but the emergence of large language models has considerably broadened the scope and accessibility of AI tools.1
Adoption is already outpacing regulation: self-reported use of non-medical-grade generative AI rose among UK general practitioners from 20% in 2024 to 25% in 2025,2 3 while 17% of the US general population reported using such tools for health-related queries at least monthly.4 This rapid uptake contrasts with the lack of rigorous validation and of consensus on the performance thresholds needed for safe and trusted deployment.
Trust is central to adoption in healthcare.5 6 For both patients and physicians, performance, typically expressed as sensitivity (the proportion of true cases correctly identified) and specificity (the proportion of non-cases correctly ruled out), is a cornerstone of trust.7–9 AI classification systems must balance the two. Prioritising sensitivity risks increasing false positives, leading to unnecessary examinations, anxiety and resource strain,10–12 whereas prioritising specificity raises the risk of false negatives, missed diagnoses and patient harm.13 Currently, performance targets are largely set by developers, and it is often unclear to what extent the preferences of end-users are taken into account.
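The trade-off between sensitivity and specificity can be illustrated with a minimal sketch. The scores, labels and thresholds below are invented purely for illustration and do not come from this study; the function name `sensitivity_specificity` is likewise hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) for a score-based classifier.

    A case is predicted positive when its score is at or above the threshold.
    """
    tp = fn = tn = fp = 0
    for score, has_disease in zip(scores, labels):
        predicted_positive = score >= threshold
        if has_disease and predicted_positive:
            tp += 1  # true positive: disease correctly flagged
        elif has_disease:
            fn += 1  # false negative: disease missed
        elif predicted_positive:
            fp += 1  # false positive: healthy case flagged
        else:
            tn += 1  # true negative: healthy case correctly cleared
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative (made-up) model scores and true disease status (1 = disease).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

# A low threshold favours sensitivity at the cost of specificity.
print(sensitivity_specificity(scores, labels, 0.35))  # (1.0, 0.75)

# A high threshold favours specificity at the cost of sensitivity.
print(sensitivity_specificity(scores, labels, 0.65))  # (0.75, 1.0)
```

In this toy example, lowering the threshold catches every diseased case but flags one healthy case, whereas raising it clears every healthy case but misses one diseased case.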
Importantly, it is not self-evident that all stakeholders must demand that AI outperform humans; even modest accuracy can add value by easing workload, saving resources and expanding access to care.14
To date, studies exploring the use of AI and opinions about AI use in healthcare have been limited in scope, often relying on convenience samples from prerecruited online panels, raising concerns about representativeness.15 16 To our knowledge, none have inquired about minimum acceptable performance levels for AI in medicine using nationally representative random samples.
This study aimed to assess and compare the views of physicians and the general population on the minimum acceptable sensitivity and specificity for medical AI systems across different clinical vignettes.