Influence of believed AI involvement on the perception of digital medical advice
M. Reis, F. Reis, et al.
Artificial intelligence (AI) is increasingly used in healthcare, from medical image analysis to detecting drug interactions, and recent advances in large language models (LLMs) have made AI-based medical advice more accessible to the public. Prior work suggests that LLMs achieve high diagnostic accuracy and that, when authorship is undisclosed, LLM-generated responses are even rated higher in quality and empathy. However, when AI authorship is known, people often exhibit algorithm aversion. Much existing research in this area uses small samples or non-experimental designs, or focuses on physicians’ perspectives rather than the public’s. Because patient perceptions influence health behaviors and outcomes, this study examined how the public evaluates identical medical advice when it is labeled as authored by a human physician, an AI, or a human physician collaborating with AI. The core research question is whether labeling medical advice as AI-involved reduces perceived reliability and empathy, and lowers willingness to follow the advice, despite identical content.
The paper situates its contribution within evidence that LLMs can perform competitively in medical tasks (e.g., diagnostic accuracy comparable to physicians; physicians blind to authorship rating LLM answers as higher quality and more empathetic). Yet, across domains, algorithm aversion emerges when AI authorship is made explicit. Reviews of clinical AI highlight persistent public reservations and emphasize that perceptions of the provider (trust, empathy) shape adherence, satisfaction, and outcomes. Additional literature suggests that framing, assurances that humans remain in decision loops, and attention to patient trust can influence acceptance. The authors note that previous studies often lacked large, representative samples or experimental manipulation of AI authorship labels, motivating their preregistered experiments with robust sample sizes and public-focused measures (reliability, comprehensibility, empathy, willingness to follow).
Ethics and preregistration: Both studies followed the Declaration of Helsinki with ethics approval from the University of Würzburg (GZEK 2023-66). Study 1 was preregistered at https://osf.io/6trux; Study 2 at https://osf.io/wn6mj. Data and code are available via OSF (https://osf.io/cxb7s/). Analyses were conducted in R 4.1.1 with Holm-Bonferroni correction for multiple testing.
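To illustrate the Holm-Bonferroni correction mentioned above, here is a minimal Python sketch. It is not the authors' code (their analyses were run in R, where `p.adjust` with `method = "holm"` offers this correction); the function name `holm_bonferroni` and the example p-values are hypothetical and chosen only for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return Holm-adjusted p-values and reject decisions, in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is multiplied by (m - rank), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Hypothetical raw p-values from three pairwise comparisons (not from the paper).
raw = [0.010, 0.040, 0.030]
adjusted, reject = holm_bonferroni(raw)
print(adjusted, reject)
```

Compared with a plain Bonferroni correction, which multiplies every p-value by the number of tests, the step-down procedure is less conservative while still controlling the family-wise error rate.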
Study 1 (between-subjects):
- Participants: 1,090 Prolific recruits; 40 did not finish and were excluded, yielding n = 1,050 (350 per label: human, AI, human+AI). Mean age 33.0 years (SD = 11.5); 555 male, 489 female, 5 non-binary, 1 prefer not to say. Education skewed toward university degrees; ~60 nationalities (most frequent: South Africa, UK, Poland). Power analysis indicated 95% power for d ≥ 0.273 at α = 0.05.
- Materials: Four case reports (smoking cessation, colonoscopy, agoraphobia, reflux disease), each a brief layperson inquiry and a response. Responses were generated by prompting ChatGPT 3.5 with the inquiries and then edited and validated by a certified physician for medical accuracy and clarity. Thus, all final stimuli were effectively human–AI collaborations, regardless of the label shown to participants.
- Scales: Participants rated perceived reliability, comprehensibility, and empathy on 7-point Likert scales (extremely unreliable to extremely reliable; extremely difficult to understand to extremely easy to understand; extremely unempathic to extremely empathic). Scales were designed per best practices with clear, symmetric, nonoverlapping categories.
- Design and procedure: Unifactorial between-subject design manipulating the supposed author label (human, AI, human+AI). In Study 1, labels were conveyed via text and symbols. Participants read all four scenarios (random order) and then reported AI usage frequency, perceived impact of AI in healthcare, perceived risks vs opportunities, and demographics.
- Analyses: For each dimension (reliability, comprehensibility, empathy), one-way ANOVAs tested the main effect of author label; significant effects were followed by two-tailed two-sample t-tests (Cohen’s d via schoRsch). Non-preregistered mixed-effects regressions were also computed with author label as fixed effect and scenario and participant as random intercepts (human as reference), with p-values via Satterthwaite’s method.
Study 2 (between-subjects, UK-representative):
- Participants: Recruited 1,456; 89 did not finish and 137 failed an attention check, yielding n = 1,230 (410 per label). UK-only, representative by age, gender, ethnicity. Mean age 47.3 years (SD = 15.6); 595 male, 619 female, 10 non-binary, 6 prefer not to say. Majority with university degrees. Power: 90% for d ≥ 0.270 at α = 0.01.
- Materials and procedure: Same case reports; labels manipulated via text only. Ratings collected for reliability, comprehensibility, empathy, and additionally willingness to follow the advice. Scales were 5-point Likert (very unreliable to very reliable; very difficult to understand to very easy to understand; very unempathic to very empathic; very unwilling to very willing). At the end, participants could click to save a fictitious link to the platform/tool generating the responses, framed according to condition (physician vs AI-supported chatbot with physician review). Participants also reported AI attitudes, demographics, patient status, and healthcare profession/training. An attention check asked them to identify the stated source of the advice.
- Analyses: Mixed-effects regressions for each rating dimension; post hoc two-sample t-tests and Cohen’s d as in Study 1. Logistic regression modeled the decision to save the link (yes/no) with author label as fixed effect and participant as random intercept (human as reference). Holm-Bonferroni corrections applied. Exploratory correlations examined associations of AI attitudes and individual characteristics with ratings and link-saving in the AI and human+AI conditions.
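The sensitivity figures reported for both studies can be approximated with a short Python sketch. It assumes the power analyses targeted a two-tailed comparison of two equally sized groups and uses a normal approximation rather than the exact noncentral t distribution; the summary does not state which procedure the authors used, and the function name `min_detectable_d` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n_per_group, alpha, power):
    """Smallest Cohen's d detectable in a two-tailed, two-sample comparison.

    Normal approximation: d = (z_{1 - alpha/2} + z_{power}) * sqrt(2 / n).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


# Group sizes, alpha levels, and power targets as reported for the two studies.
study1 = min_detectable_d(350, alpha=0.05, power=0.95)
study2 = min_detectable_d(410, alpha=0.01, power=0.90)
print(round(study1, 3), round(study2, 3))
```

Under these assumptions both values come out at about 0.27, close to the reported thresholds of 0.273 and 0.270. The approximation runs marginally below an exact t-based calculation, which may account for the small differences in the third decimal.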
Study 1 (n = 1,050; 350/group):
- Empathy: Significant main effect of author label, F(2, 1,047) = 7.98, P < 0.001, η² = 0.02. Human-labeled advice rated more empathic than AI-labeled, t(698) = 3.58, P < 0.001, d = 0.27, 95% CI [0.12, 0.42], and more than human+AI, t(698) = 3.44, P = 0.001, d = 0.26, 95% CI [0.11, 0.41]. No difference between AI and human+AI (t < 1).
- Reliability: Significant main effect, F(2, 1,047) = 9.68, P < 0.001, η² = 0.02. Human > AI, t(698) = 3.72, P < 0.001, d = 0.28, 95% CI [0.13, 0.43]; Human > human+AI, t(698) = 3.90, P < 0.001, d = 0.29, 95% CI [0.15, 0.44]. AI vs human+AI: no difference (t < 1).
- Comprehensibility: No effect of author label (F < 1).
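The effect sizes in the list above can be cross-checked from the reported test statistics alone. The Python sketch below uses standard textbook conversions (Cohen's d from a two-sample t, a large-sample confidence interval for d, and η² from a one-way F); it is not the authors' code (they computed d with the R package schoRsch), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def d_from_t(t, n1, n2):
    """Cohen's d implied by a two-sample t statistic."""
    return t * sqrt(1 / n1 + 1 / n2)


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate confidence interval for d using its large-sample standard error."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


def eta_squared_from_f(f, df_effect, df_error):
    """Eta squared implied by a one-way ANOVA F statistic."""
    return f * df_effect / (f * df_effect + df_error)


# Empathy, human vs AI label: t(698) = 3.58 with 350 participants per group.
d_empathy = d_from_t(3.58, 350, 350)
ci_low, ci_high = d_confidence_interval(d_empathy, 350, 350)
# Reliability main effect: F(2, 1047) = 9.68.
eta_reliability = eta_squared_from_f(9.68, 2, 1047)
print(round(d_empathy, 2), round(ci_low, 2), round(ci_high, 2), round(eta_reliability, 2))
```

With these inputs the sketch gives d ≈ 0.27 with an interval of roughly [0.12, 0.42] and η² ≈ 0.02, in line with the values reported above.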
Study 2 (n = 1,230; 410/group):
- Replicated Study 1 patterns: Human-labeled advice rated more empathic and more reliable than AI and human+AI (ts ≥ 3.01, Ps ≤ 0.003, ds ≥ 0.21); no differences in comprehensibility (ts < 1).
- Willingness to follow: Lower when AI was believed involved (AI or human+AI) compared with human (ts ≥ 4.46, Ps ≤ 0.001, ds ≥ 0.31).
- Interest in platform (link saving): Proportions did not differ significantly by label: human 19.3%, AI 18.5%, human+AI 22.9%; logistic regression vs human: AI b = 0.05, z = 0.27, P = 0.789; human+AI b = 0.22, z = 1.28, P = 0.200.
- Validity checks: In the AI-labeled groups, ratings correlated positively with attitudes toward AI (perceived opportunities vs risks, perceived impact on healthcare), Ps ≤ 0.022. Across conditions, higher ratings related positively to link-saving decisions (Ps ≤ 0.012).
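As a rough plausibility check on the link-saving result, the reported percentages can be converted into unadjusted log odds ratios relative to the human label. This Python sketch is only an approximation under stated assumptions: it works from the rounded percentages, ignores the random intercept in the reported model, and the function name `log_odds_ratio` is hypothetical.

```python
from math import log


def log_odds_ratio(p_condition, p_reference):
    """Log odds ratio of a condition's proportion against a reference proportion."""
    odds_condition = p_condition / (1 - p_condition)
    odds_reference = p_reference / (1 - p_reference)
    return log(odds_condition / odds_reference)


# Proportions of participants who saved the link, as reported (rounded).
human, ai, human_ai = 0.193, 0.185, 0.229

b_ai = log_odds_ratio(ai, human)              # AI label vs human label
b_human_ai = log_odds_ratio(human_ai, human)  # human+AI label vs human label
print(round(b_ai, 2), round(b_human_ai, 2))
```

The magnitudes, about 0.05 and 0.22, agree with the reported coefficients. With saving coded as 1 and human as the reference, the AI value computed this way is negative because 18.5% lies below 19.3%, so the sign of the reported b presumably reflects how the outcome or contrast was coded.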
Overall, identical medical advice labeled as AI-involved (AI or human+AI) was perceived as less reliable and less empathic than human-labeled advice, with no systematic differences in perceived comprehensibility. In Study 2, AI involvement also reduced willingness to follow the advice.
The experiments show an anti-AI bias in public perceptions of digital medical advice: the same content is judged less reliable and less empathic when labeled as AI-generated or human+AI, and people express lower willingness to follow it. This bias may stem from associations of the human-physician label with care and respect, crucial to patient–physician interactions, and from feelings that AI is dehumanizing. Uniqueness neglect may also contribute, with users fearing AI will not account for individual characteristics; clarifying that AI-generated advice incorporates personal information might mitigate this concern. The absence of label effects on comprehensibility suggests that perceptions of readability and clarity are less susceptible to authorship framing, and counters a halo-effect explanation in which one negative judgment would globally depress all ratings. Despite lower willingness to follow AI-labeled advice, the lack of differences in link-saving behavior indicates continued public interest in exploring such tools. Framing may be pivotal: prior work finds that assuring humans remain in the decision loop can boost trust in human–AI collaboration. Engaging multiple stakeholders (public, clinicians, insurers) and carefully framing AI involvement could enhance acceptance of AI-assisted medical advice.
This study provides causal evidence, across two large preregistered experiments, of an anti-AI bias in public evaluations of digital medical advice: AI-involved authorship labeling reduces perceived reliability and empathy and lowers willingness to follow advice, despite identical content. At the same time, comprehensibility appears robust to authorship framing, and public interest in exploring AI tools persists. The findings underscore the need for thoughtful framing of human–AI collaboration and clear communication that personal information is considered in AI-generated advice. Future research should investigate more interactive, multi-turn settings that mirror real consultations, test messaging and design strategies (e.g., assurances that humans remain decision-makers) to counteract bias, and examine long-term adoption and outcomes across diverse populations.
Participants evaluated brief, single-turn scenarios and adopted the perspective of other individuals rather than posing their own questions, limiting ecological validity relative to real, interactive physician–patient consultations. Labels were manipulated via text (and symbols in Study 1), which may not capture the richness of real-world disclosure and user interfaces. Online, English-language recruitment via Prolific may underrepresent individuals with lower socioeconomic status or limited internet access, and may overrepresent those interested in research participation, affecting generalizability. Although stimuli were standardized and content-identical across conditions, all case responses were created via an AI–physician collaboration before labeling, which could influence perceptions in ways not fully captured. Finally, outcomes were self-reported perceptions and intentions (including willingness to follow), not observed health behaviors.