Employing a systematic approach to biobanking and analyzing clinical and genetic data for advancing COVID-19 research

S. Daga, C. Fallerini, et al.

The GEN-COVID Multicenter Study has uniquely linked biospecimens from over 1000 individuals infected with SARS-CoV-2 to a wealth of clinical data. Discover the five distinct clinical categories identified in this groundbreaking research by Sergio Daga and colleagues, aiming to unravel the complexities of COVID-19 severity and genetic susceptibility.

Introduction
The GEN-COVID Multicenter Study was launched to systematically collect and integrate high-quality biospecimens and granular clinical data from COVID-19 patients across Italy to enable global research access and interoperability. Motivated by the heterogeneous clinical presentation of SARS-CoV-2 infection—from asymptomatic cases to severe respiratory distress and multi-organ failure—the study aims to create a robust biobank and patient registry adhering to FAIR data principles. Italy’s early and severe epidemic underscored the urgency of rapidly collecting, processing, and sharing standardized biological materials and clinical data. The project links biobanked samples to detailed phenotypes and derived genotypes (WES and GWAS) to investigate host genetic factors influencing susceptibility, severity, and multi-organ involvement, with the ultimate goal of improving diagnostics, prognostics, and personalized therapies for COVID-19.
Methodology
Study design: The GEN-COVID Multicenter Study established an interoperable biobank (GEN-COVID Biobank, GCB), a patient registry (GEN-COVID Patient Registry, GCPR), and a genetic data repository (GCGDR) to integrate biospecimens, clinical phenotypes, and genomic data. Objectives included performing WES and GWAS on 2000 patients, associating host genetics with severity/prognosis, and sharing data via the Network for Italian Genomes (NIG) and consortium platforms.
Sites and ethics: A network of 22 Italian hospitals (13 North, 5 Central, 4 South), local healthcare units, and preventive medicine departments participated. Activities began March 16, 2020, following Ethical Review Board approval (University of Siena, Protocol no. 16929, May 6, 2020). GDPR compliance and FAIR principles guided all processes.
Participants: Inclusion criteria were PCR-confirmed SARS-CoV-2 infection, age ≥18 years, and informed consent. As of July 16, 2020, 1033 individuals (1021 unrelated, 12 with family relations) covering the full spectrum of disease severity were enrolled; recruitment was ongoing toward 2000 participants.
Data collection: A comprehensive clinical questionnaire captured socio-demographics (sex, age, ethnicity), family history, comorbidities, and COVID-19-related symptoms. Over 150 clinical variables were consolidated into a single database, updated as knowledge evolved.
Biospecimens: Peripheral blood was collected in EDTA tubes. Genomic DNA was isolated centrally (MagCore Genomic DNA Whole Blood Kit). Plasma and serum aliquots were stored. Where possible, leukocytes were isolated via density gradient centrifugation and cryopreserved in DMSO; nasopharyngeal swabs were stored at reference hospitals.
Clinical measures and organ involvement definitions: Respiratory involvement quantified by PaO2/FiO2 (P/F) with severity categories (≤100, 101–200, 201–300, >300; unavailable for non-hospitalized).
Cardiac involvement: elevated cTnT (>15 ng/L), NT-proBNP (>88 pg/mL males; >153 pg/mL females), or myocardial injury with concomitant elevated GPT/GOT (GPT ref <41 U/L males; <31 U/L females; GOT ref <37 U/L males; <31 U/L females).
Pancreatic involvement: pancreatic amylase (13–53 U/L) and lipase (13–60 U/L) out of reference range.
Kidney involvement: creatinine above sex-specific reference (0.7–1.20 mg/dL males; 0.5–1.10 mg/dL females).
Lymphoid involvement: NK cells <90 cells/µL and/or CD4+ T cells <400 cells/µL.
Olfactory/gustatory dysfunction graded via ENT-administered questionnaire.
Coagulation involvement: D-dimer >10× reference, with/without low fibrinogen.
Pro-inflammatory involvement: IL-6, LDH (135–225 U/L males; 135–214 U/L females), and CRP (>0.5 mg/dL) above reference.
Whole-exome sequencing (WES): Libraries prepared using Nextera Flex for Enrichment; bead-based transposome fragmentation and adapter tagging; limited-cycle PCR; hybrid capture with biotinylated Illumina CEX probes; streptavidin bead capture; elution and amplification. Sequencing on Illumina NovaSeq 6000 with ≥97% bases covered at 20×.
Genotyping (GWAS): Illumina Global Screening Array (~700,000 markers) on GRCh38. QC performed with GenomeStudio and R (SNP calling quality, cluster separation, Mendelian/replication controls). PLINK v1.90 used for genotype processing and statistics.
Data management and sharing: Data stored and analyzed through the Network for Italian Genomes (NIG) using CINECA HPC resources. Data conformed to FAIR principles for interoperability with other omics and reference databases.
Statistical analysis: Descriptive statistics by sex, age, and ethnicity. Chi-square tests assessed associations between clinical severity (from no hospitalization to intubation) and categorical variables (gender, ethnicity, blood group, respiratory severity, taste/smell involvement, organ system involvement, cytokine activation, D-dimer, comorbidity count). Linear regression evaluated association between age and COVID-19 severity.
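To make the threshold-based definitions in the Methodology concrete, the following is a minimal Python sketch of how a few of them could be turned into flags. It is not the study's own code. The function and constant names are hypothetical, the numeric cut-offs are copied from the text above, and the handling of missing values and of non-integer P/F values is an assumption. Only the P/F bands, kidney, lymphoid, and cardiac biomarker rules are shown; the GPT/GOT clause of the cardiac definition is not modeled.

```python
from typing import Optional

# Cut-offs copied from the Methodology text; all names are illustrative.
CREATININE_UPPER_MG_DL = {"M": 1.20, "F": 1.10}
NT_PROBNP_UPPER_PG_ML = {"M": 88.0, "F": 153.0}
CTNT_UPPER_NG_L = 15.0
NK_LOWER_CELLS_UL = 90.0
CD4_LOWER_CELLS_UL = 400.0


def pf_category(pf_ratio: Optional[float]) -> str:
    """Map a PaO2/FiO2 ratio to the four severity bands in the text.

    None stands for a missing measurement (e.g. non-hospitalized subjects).
    Assumption: a non-integer value such as 100.5 goes to the next band up.
    """
    if pf_ratio is None:
        return "unavailable"
    if pf_ratio <= 100:
        return "<=100"
    if pf_ratio <= 200:
        return "101-200"
    if pf_ratio <= 300:
        return "201-300"
    return ">300"


def kidney_involved(creatinine_mg_dl: float, sex: str) -> bool:
    """Creatinine above the sex-specific upper reference limit."""
    return creatinine_mg_dl > CREATININE_UPPER_MG_DL[sex]


def lymphoid_involved(nk_cells_ul: Optional[float] = None,
                      cd4_cells_ul: Optional[float] = None) -> bool:
    """NK cells < 90 cells/uL and/or CD4+ T cells < 400 cells/uL.

    Assumption: a missing measurement never triggers the flag.
    """
    nk_low = nk_cells_ul is not None and nk_cells_ul < NK_LOWER_CELLS_UL
    cd4_low = cd4_cells_ul is not None and cd4_cells_ul < CD4_LOWER_CELLS_UL
    return nk_low or cd4_low


def cardiac_marker_elevated(sex: str,
                            ctnt_ng_l: Optional[float] = None,
                            nt_probnp_pg_ml: Optional[float] = None) -> bool:
    """Elevated cTnT (> 15 ng/L) or sex-specific elevated NT-proBNP."""
    ctnt_high = ctnt_ng_l is not None and ctnt_ng_l > CTNT_UPPER_NG_L
    bnp_high = (nt_probnp_pg_ml is not None
                and nt_probnp_pg_ml > NT_PROBNP_UPPER_PG_ML[sex])
    return ctnt_high or bnp_high


if __name__ == "__main__":
    # Made-up values for illustration only; not patient data.
    print(pf_category(150))
    print(kidney_involved(1.15, "F"))
    print(lymphoid_involved(nk_cells_ul=80, cd4_cells_ul=500))
    print(cardiac_marker_elevated("M", nt_probnp_pg_ml=100))
```

Keeping each organ system in its own small function mirrors the way the definitions are listed, so a rule can be revised without touching the others if a reference range changes.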
Key Findings
- Established the GEN-COVID Biobank (GCB) and Patient Registry (GCPR), collecting biospecimens and detailed clinical phenotypes from >1000 PCR-positive individuals as of mid-July 2020.
- Cohort composition: 74.25% hospitalized and 25.75% non-hospitalized (pauci-/asymptomatic). Respiratory support among hospitalized patients, expressed as shares of the whole cohort (the four values sum to 74.25%): 9.5% intubated, 18.4% CPAP/BiPAP, 31.55% O2 supplementation, 14.8% without respiratory support.
- Collected >150 patient-level clinical variables capturing multi-organ involvement (heart, liver, pancreas, kidney, lymphoid, coagulation, inflammatory systems) with standardized laboratory thresholds.
- Clustering analysis delineated five clinical categories of COVID-19 disease: (1) severe multisystemic failure (thromboembolic or pancreatic variants); (2) cytokine storm type (with liver involvement or moderate); (3) moderate heart failure (with or without liver damage); (4) moderate multisystemic involvement (with or without liver damage); (5) mild (with or without hypoxia).
- Generated genome-wide data: WES with ≥97% of bases covered at 20× on NovaSeq 6000 and GWAS genotypes (~700k SNPs) on Illumina Global Screening Array; data integrated within the GCGDR and shared via NIG/CINECA.
- Implemented GDPR-compliant, FAIR-conformant pipelines enabling rapid statistical analysis and interoperability.
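The cohort percentages reported above can be checked with a few lines of arithmetic. The sketch below is illustrative rather than part of the study. It assumes the percentages refer to the 1033 individuals enrolled as of July 16, 2020, and the head counts it derives are approximations implied by the percentages, not figures reported in the source. All variable names are hypothetical.

```python
COHORT_SIZE = 1033  # enrolled as of July 16, 2020 (from the Methodology)

HOSPITALIZED_PCT = 74.25
NON_HOSPITALIZED_PCT = 25.75

# Respiratory support categories, as percentages taken from the text.
support_pct = {
    "intubated": 9.5,
    "CPAP/BiPAP": 18.4,
    "O2 supplementation": 31.55,
    "no respiratory support": 14.8,
}


def approx_count(percent: float, total: int = COHORT_SIZE) -> int:
    """Nearest whole number of individuals implied by a percentage."""
    return round(percent * total / 100)


# The four support categories add up to the hospitalized share, so they
# are fractions of the whole cohort rather than of hospitalized patients.
support_total_pct = sum(support_pct.values())

# Approximate head counts (derived here, not reported in the source).
support_counts = {name: approx_count(p) for name, p in support_pct.items()}
hospitalized_count = approx_count(HOSPITALIZED_PCT)
non_hospitalized_count = approx_count(NON_HOSPITALIZED_PCT)

if __name__ == "__main__":
    print(f"support categories sum to {support_total_pct:.2f}%")
    for name, count in support_counts.items():
        print(f"{name}: ~{count}")
    print(f"hospitalized: ~{hospitalized_count}")
    print(f"non-hospitalized: ~{non_hospitalized_count}")
```

Under that assumption, the rounded category counts add up to the rounded hospitalized count, and the hospitalized and non-hospitalized counts add up to the full cohort. This supports reading the four percentages as shares of all enrolled individuals.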
Discussion
By building a GDPR-compliant, FAIR-enabled biobank and patient registry linked to high-throughput genomic data, the study directly addresses the need to understand heterogeneous COVID-19 presentations and their genetic underpinnings. The standardized multi-organ clinical variables and derived clusters provide a framework to phenotype patients beyond respiratory metrics, enabling association analyses between host genetics and specific clinical trajectories (e.g., thromboembolic or cytokine-dominant phenotypes). The availability of WES and GWAS data for the same individuals allows integrative analyses to identify variants influencing susceptibility, severity, and organ-specific complications. Data sharing via NIG/CINECA encourages collaboration and replication, accelerating discovery and potential identification of biomarkers for risk stratification and therapeutic targeting. Overall, the findings support the feasibility and value of a systematic, interoperable approach to capture COVID-19’s clinical complexity and facilitate personalized medicine strategies.
Conclusion
The GEN-COVID Multicenter Study successfully established an interoperable infrastructure—GCB, GCPR, and GCGDR—linking high-quality biospecimens, rich clinical phenotypes, and genomic data for COVID-19. From a cohort exceeding 1000 individuals, the study defined standardized measures of multi-organ involvement and identified five major clinical categories capturing disease heterogeneity. The resource is openly shareable via NIG, supporting national and international research efforts. Future work will expand the cohort toward 2000 participants, deepen genotype–phenotype association analyses to define genetic determinants of susceptibility and severity, and leverage findings to inform patient stratification and repurposing of therapeutics for personalized treatment of COVID-19.
Limitations
- P/F ratio (PaO2/FiO2) was unavailable for non-hospitalized subjects, limiting direct comparison of respiratory severity across the entire cohort.
- Data reflect an interim cohort (1033 individuals as of July 16, 2020) with recruitment ongoing, so findings represent early analyses.
- All participating sites were in Italy, which may limit generalizability until the findings are expanded or replicated in other populations.