Talking about diseases; developing a model of patient and public-prioritised disease phenotypes

K. Slater, P. N. Schofield, et al.

This research develops a novel phenotype model representing the public's perspective on disease using social media data, revealing 24,618 new phenotype associations. Conducted by Karin Slater, Paul N. Schofield, and their team, it highlights the importance of integrating public views to improve clinical awareness and understanding across healthcare stakeholders.

Introduction
The study addresses the gap between academic, clinician-centric disease knowledge and the perspectives and priorities of patients and the public. Existing phenotype resources are predominantly derived from literature and experimental data, which can reflect biases in scientific attention and healthcare practice, potentially overlooking symptoms that affect quality of life or those prevalent in underrepresented groups. The research question is whether a social media-derived disease–phenotype model (representing public and patient perspectives) differs significantly from a biomedical database and literature-derived phenotype model (BDLP), and whether such public-oriented associations are clinically plausible and relevant. The purpose is to construct and evaluate a Social Media Phenotype (SMP) model across a broad range of diseases, compare it with BDLP, and assess clinical validity through expert review to inform more patient-centred knowledge resources.
Literature Review
Foundational work in disease phenomics has relied on curated and text-mined literature resources to link diseases and phenotypes, enabling applications such as differential diagnosis and variant prioritisation (e.g., Hoehndorf et al.; Kafkas et al.; Philavanh/Pleihav et al.). Digital phenotyping approaches have begun to use non-traditional data sources (wearables, social media) primarily at the patient level, but have not substantially contributed to disease-level phenotype knowledge bases. Prior social media–based studies focused on single diseases or methodological issues (e.g., Lenz et al. for diabetes forums; Micallef et al. on linking social data to epidemiology) and lacked systematic comparison to literature-derived resources or broad, open databases for secondary use. Additionally, documented mismatches between patient and clinician perspectives (e.g., differing interpretations of “stable” cancer; uveitis forum analyses) motivate integrating patient-reported priorities into phenotype models. The present study extends this body of work by constructing a multi-disease SMP, aligning it to ontology frameworks, systematically contrasting it with BDLP, and performing clinical evaluation.
Methodology
Data sources and collection: Social media posts mentioning disease keywords were acquired via Twitter (X) and Socialgist (including Reddit and online forums) for the period 2019-11-01 to 2021-11-01, under agreements coordinated by the White Swan charity. Disease keyword lists were based on ontology labels and refined by manual curation. No de-duplication by author was performed, to mirror literature-mining practices.
Text mining and concept mapping: Keyword-based matching identified mentions of Human Phenotype Ontology (HPO) and Disease Ontology (DO) classes in each post. Ontology reasoning was used to propagate mentions along subclass relations to support inference (e.g., mapping “low back pain” within the pain hierarchy).
Association scoring and significance: Disease–phenotype co-occurrence was quantified using normalized pointwise mutual information (NPMI). Statistical significance was assessed with q-values controlling the false discovery rate at 0.05; associations meeting this threshold were considered significant. Certain implausible or linguistically ambiguous phenotype labels (e.g., those with unlikely public usage) were excluded following review.
Biomedical Database and Literature Phenotype (BDLP): Literature-derived associations were consolidated from prior text-mined resources (e.g., Kafkas et al.; Pleihav/Philavanh et al.). Diseases from ICD-10 and SNOMED were mapped to DO using MONDO/DO cross-references and manual curation where required. The unified set constitutes the BDLP.
Novel association identification: For each disease, SMP associations were deemed novel if no equivalent or more specific association existed in BDLP. The novel subset captured significant, maximally specific SMP associations distinct from BDLP.
Comparative analyses: High-level categorisations of diseases and phenotypes used DO and HPO hierarchies to assess thematic distributions (e.g., constitutional, digestive, nervous system).
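The concept-mapping and association-scoring steps described above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical single-parent hierarchy (`PARENT`) in place of the real HPO, and posts that have already been reduced to sets of matched disease and phenotype labels; the function names are illustrative and not taken from the authors' code. NPMI is computed as PMI(d, p) / -log P(d, p), which ranges from -1 to 1.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy subclass hierarchy (child -> parent); not real HPO content.
PARENT = {"low back pain": "back pain", "back pain": "pain"}

def with_ancestors(term):
    """Return the term together with all of its superclasses in PARENT."""
    out = {term}
    while term in PARENT:
        term = PARENT[term]
        out.add(term)
    return out

def npmi_scores(posts):
    """posts: list of (set_of_diseases, set_of_phenotypes), one pair per post."""
    n = len(posts)
    d_count, p_count, dp_count = Counter(), Counter(), Counter()
    for diseases, phenotypes in posts:
        # Propagate each phenotype mention up the subclass hierarchy.
        expanded = set()
        for p in phenotypes:
            expanded |= with_ancestors(p)
        d_count.update(diseases)
        p_count.update(expanded)
        dp_count.update(product(diseases, expanded))
    scores = {}
    for (d, p), c in dp_count.items():
        p_xy = c / n
        if p_xy == 1.0:
            # Both terms occur in every post; NPMI is taken as 1 by convention.
            scores[(d, p)] = 1.0
            continue
        pmi = math.log(p_xy / ((d_count[d] / n) * (p_count[p] / n)))
        scores[(d, p)] = pmi / -math.log(p_xy)
    return scores

posts = [
    ({"fibromyalgia"}, {"low back pain"}),
    ({"fibromyalgia"}, {"fatigue"}),
    ({"asthma"}, {"cough"}),
    ({"asthma"}, {"cough"}),
]
scores = npmi_scores(posts)
```

Because of the propagation step, a single post mentioning "low back pain" also contributes to the counts for "back pain" and "pain", which is one reason general phenotypes accumulate more evidence than specific ones.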
Semantic similarity between SMP and BDLP profiles across matched diseases (n≈304) was computed with the Resnik measure using the Semantic Measures Library, and ROC/AUC summarised how well SMP recapitulated BDLP.
Clinical review: Twelve UK specialist consultants, blinded to data source, evaluated disease–phenotype associations for selected diseases. The review asked three questions: Q1 (whether the association is established or feasible), Q2 (association type: symptom/sign, comorbidity, complication/sequela, treatment side-effect, other/unknown), and Q3 (frequency in clinical practice, on a 1–5 Likert scale). Diseases reviewed included bronchiectasis, cerebrovascular accident, chronic obstructive pulmonary disease, and fibromyalgia; a non-clinical literature-based review was performed for neurofibromatosis 1. Responses were analysed with chi-squared tests and Wilcoxon rank-sum tests with appropriate multiple-testing corrections.
Ethics, code, and data: Ethical approval was granted by the University of Birmingham STEM committee (ERN_2022-0241; amendment ERN_0241-Jun2023). Code and derived association data are available via the GitHub and data repositories provided by the authors; raw social media posts are not publicly released due to ownership restrictions.
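One way to read the ROC/AUC summary in the comparative analysis is as a ranking question: does an SMP profile score higher against the BDLP profile of the same disease than against BDLP profiles of other diseases? The sketch below assumes that setup, which the summary does not spell out, and takes a precomputed similarity matrix as input rather than implementing the Resnik measure. The function name and the toy matrix are illustrative.

```python
def auc_from_similarity(sim):
    """Rank-based AUC for a square similarity matrix.

    sim[i][j] is the similarity of SMP profile i to BDLP profile j.
    Diagonal entries are matched diseases (positives); all other
    entries are mismatched pairs (negatives). Ties count as half a win.
    """
    n = len(sim)
    pos = [sim[i][i] for i in range(n)]
    neg = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy 3-disease example with made-up similarity values.
toy = [
    [0.90, 0.20, 0.10],
    [0.30, 0.80, 0.40],
    [0.50, 0.60, 0.55],
]
toy_auc = auc_from_similarity(toy)
```

An AUC of 1.0 would mean every matched pair outranks every mismatched pair, and 0.5 would mean the SMP profiles carry no information about which BDLP profile they correspond to. The pairwise comparison is quadratic in the number of negatives, which is acceptable for a few hundred diseases but would need a sort-based implementation at larger scale.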
Key Findings
- Scale and novelty: From social media, 52,108 possible disease–phenotype associations were identified across 311 diseases. Across 304 matched diseases, 24,618 SMP associations were novel (not in BDLP), with 14,531 reaching statistical significance (FDR q<0.05).
- Recapitulation: SMP phenotype profiles substantially recapitulated BDLP across diseases with AUC ≈ 0.872 (95% CI approximately 0.855–0.889) based on semantic similarity.
- Thematic over-representation: SMP over-represented phenotypes affecting quality of life and categories such as constitutional symptoms, digestive, endocrine, and reproductive phenotypes relative to BDLP. Pain was the most frequent novel phenotype group, accounting for about 6% of all novel associations.
- Constitutional symptoms: BDLP contained 3,693 constitutional symptom associations; SMP had 2,289, with 1,264 of these being novel to SMP. Within constitutional symptoms, pain dominated (BDLP 68%, SMP 74%), with SMP showing notable increases in back pain (+10%), lower limb pain (+4%), and sciatica (+4%).
- Clinical review outcomes: Expert ratings for validity (Q1) and association type (Q2) were strongly correlated between SMP and BDLP (Q1: χ²=47.504, p=1.198e−09; Q2: χ²=76.345, p=4.67e−15). ‘Other associated phenotype’ was more common in BDLP, whereas ‘Comorbidity’ and ‘Unknown’ were more frequent for SMP. SMP associations were, on average, reported less frequently in clinic than BDLP associations (Q3; one-tailed Wilcoxon rank-sum p=3.868e−05), though both sets were skewed toward infrequent observations (more than half ‘never’ or ‘rarely’). Shortlisted hypotheses included 23 SMP and 79 BDLP associations that were rated ‘Not established but feasible’ and seen at least ‘sometimes.’
- Distribution across disease categories: Novel SMP associations were relatively more concentrated in digestive diseases and less so in infectious and mental health categories, while maintaining overall correlation with disease set composition.
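The novelty counts above rest on the rule that an SMP association is novel when BDLP holds no equivalent or more specific association for the same disease, with only maximally specific novel terms retained. A minimal sketch of that rule follows, assuming a hypothetical single-parent toy hierarchy (the real HPO is a DAG in which a term can have several parents) and plain dictionaries of disease to phenotype sets. Names and data are illustrative, not the authors' implementation.

```python
# Hypothetical toy subclass hierarchy (child -> parent); not real HPO content.
PARENT = {"low back pain": "back pain", "back pain": "pain", "sciatica": "pain"}

def ancestors(term):
    """Return all superclasses of a term (excluding the term itself)."""
    out = set()
    while term in PARENT:
        term = PARENT[term]
        out.add(term)
    return out

def novel_associations(smp, bdlp):
    """smp, bdlp: dict mapping disease -> set of phenotype terms."""
    novel = {}
    for disease, phenos in smp.items():
        known = bdlp.get(disease, set())
        # A term is covered if BDLP has it, or has one of its subclasses,
        # i.e. the term is a BDLP term or an ancestor of a BDLP term.
        covered = set(known)
        for k in known:
            covered |= ancestors(k)
        candidates = phenos - covered
        # Keep only maximally specific candidates: drop any candidate
        # that is an ancestor of another candidate.
        redundant = set()
        for c in candidates:
            redundant |= ancestors(c)
        novel[disease] = candidates - redundant
    return novel

smp = {"fibromyalgia": {"pain", "back pain", "sciatica", "fatigue"}}
bdlp = {"fibromyalgia": {"back pain", "fatigue"}}
result = novel_associations(smp, bdlp)
```

In this toy case "pain" is not novel because BDLP already holds the more specific "back pain", while "sciatica" is novel because BDLP holds neither it nor anything beneath it.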
Discussion
The findings support the hypothesis that social media-derived phenotypes provide a perspective on disease that is distinct from literature and database-derived knowledge, while still substantially overlapping in core phenotype structure. The SMP highlighted phenotypes with strong impact on daily living—particularly pain and other constitutional symptoms—suggesting that public discourse emphasises general, patient-salient manifestations over technical subtypes. Clinical review indicated similar perceived validity between SMP and BDLP associations despite differences in frequency of clinical recognition, implying that many SMP-derived links are plausible and potentially underrepresented in academic resources. The divergence between SMP and BDLP underscores the value of integrating patient and public perspectives into phenotype knowledge bases to inform clinical awareness, hypothesis generation, and more patient-centred research. Ontology-driven analyses further revealed systematic differences in phenotype emphasis, pointing to areas (e.g., digestive, endocrine, reproductive, and constitutional) where public-expressed experiences may enrich existing resources.
Conclusion
This study presents a multi-disease social media phenotype (SMP) model that complements and differs from a biomedical database and literature-derived phenotype (BDLP) model, revealing a large set of novel, statistically supported disease–phenotype associations. SMP tendencies toward quality-of-life and constitutional symptoms highlight patient-salient disease aspects often underrepresented in literature-derived datasets. Clinical expert evaluation suggests comparable perceived validity between SMP and BDLP associations, supporting SMP as a resource for hypothesis generation and clinical awareness. Future work should integrate advanced NLP/NER methods to disambiguate overlapping disease/phenotype terms, address demographic and platform biases, harmonise methodologies for fair multi-context comparisons, and expand clinical validation across more diseases and settings. The publicly available code and resources enable further secondary analyses and development of patient-centred phenotype models.
Limitations
- Data source bias: Social media user demographics (e.g., age, education, geography, race/ethnicity) are not representative of the general population, likely skewing disease and phenotype representation (e.g., fewer age-related or deprivation-associated diseases).
- Text-mining constraints: Keyword-based matching and co-occurrence (NPMI) capture mentions rather than incidence or causality and are sensitive to ambiguity, polysemy, and lay versus clinical terminology (e.g., ‘anorexia’ as phenotype vs disease). Limited layperson synonym coverage in HPO may reduce recall for certain phenotype groups (e.g., voice phenotypes).
- Significance threshold effects: A relatively stringent FDR threshold likely excludes specific, lower-frequency but valid associations, biasing SMP toward more general phenotypes.
- Mapping and integration issues: Harmonising diseases across ICD-10/SNOMED to DO and consolidating literature-derived resources can leave unmapped concepts and introduce inconsistencies.
- Author/post-level independence: No de-duplication by author may allow prolific users to disproportionately influence associations; analogous issues may exist in literature mining (author/journal effects).
- Clinical review scope: Limited number of diseases and UK specialist reviewers; frequency estimates reflect clinician recognition rather than population-level prevalence.
- Data availability: Raw social media posts cannot be shared due to ownership restrictions, limiting replication of the initial extraction.
- Potential extraction errors: Inherent noise and evolving language on social media, plus differences in ontology sizes and label specificity, may affect the accuracy and comparability of associations.
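The significance-threshold limitation can be made concrete with a small sketch of false discovery rate adjustment. The summary states only that q-values were controlled at 0.05 and does not name the procedure, so the Benjamini-Hochberg adjustment used here is an assumption chosen for illustration, and the p-values are made up.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for four candidate associations.
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
significant = [a < 0.05 for a in adj]
```

With these inputs, three associations have raw p-values below 0.05 but only the first survives adjustment, which shows how rarer, more specific associations with weaker evidence can be filtered out when many pairs are tested at once.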