Substantial underestimation of SARS-CoV-2 infection in the United States

Medicine and Health

S. L. Wu, A. N. Mertens, et al.

Explore how a team of researchers, including Sean L. Wu and Andrew N. Mertens, estimated roughly 6.4 million cumulative SARS-CoV-2 infections in the U.S. by mid-April 2020, far exceeding the confirmed case count. Their semi-Bayesian approach reveals how much of the infection burden was missed because of incomplete testing and imperfect test accuracy.

Introduction
The study addresses the question: to what extent do incomplete testing and imperfect diagnostic accuracy cause SARS-CoV-2 infections in the United States to be underestimated? In early 2020, U.S. testing capacity was limited and prioritized for hospitalized or higher-risk symptomatic individuals, despite evidence that many infections are mild or asymptomatic yet capable of transmission. RT-PCR tests may also yield false negatives depending on specimen sampling and timing. Consequently, confirmed case counts likely underestimate true infections, weakening situational awareness and estimates of transmissibility. Rather than forecasting, the authors aim to correct empirical case counts for bias arising from testing practices and test performance, estimating the true infection burden between late February and mid-April 2020 and thereby informing appropriate public health responses and the evaluation of mitigation efforts.
Literature Review
The authors note that most prior efforts to estimate infection burden used mathematical transmission models (compartmental or agent-based), which can be sensitive to unverified assumptions about population structure and contact patterns, especially for a novel pathogen. Empirical evidence indicates that 30–70% of test-positive individuals may have mild or no symptoms, and that asymptomatic and pre-symptomatic transmission occurs, implying many infections are missed under restricted testing. Early reports also suggested that RT-PCR sensitivity varies by specimen type and timing, leading to false negatives. The paper situates its approach alongside seroprevalence studies (e.g., reports ranging from ~0.4% to 4.4% in select U.S. locations within the same timeframe) but cautions that some serological assays had poor specificity and that antibody kinetics were not well understood, complicating comparisons. The findings are broadly consistent with other infection burden estimates, with state rankings derived from alternative data sources (e.g., influenza-like illness comparisons), and with international analyses in settings with better hospitalization data.
Methodology
Design: Semi-Bayesian probabilistic bias analysis applied to empirical confirmed case data to correct for two biases: selection bias from incomplete testing (which preferentially reached moderate-to-severe symptomatic individuals) and misclassification from imperfect RT-PCR sensitivity and specificity.

Scope: U.S. states, February 28 to April 18, 2020.

Data: 2019 projected state populations from the 2010 U.S. Census; daily counts of tests, positive tests, and deaths from the COVID Tracking Project (February 23 to April 18, 2020). COVID Tracking Project data quality grades varied by state, but most states met high reporting standards. Diagnostic tests were assumed to be RT-PCR during the study period.

Bias parameters and priors: The authors defined prior distributions for seven parameters capturing: the probability of being moderate-to-severe symptomatic among tested individuals, P(S|tested); the analogous probability among untested individuals, P(S|untested); state-specific ratios relating test positivity among symptomatic and asymptomatic groups (α, β); and RT-PCR sensitivity and specificity. Priors were informed by published evidence, state testing guidelines, and empirical state-level test positivity, with ranges reflecting substantial uncertainty (e.g., sensitivity roughly 65–100% with a mean below 100%; specificity very high, about 99.8–100%).

Estimation approach: Using these priors, the authors performed Monte Carlo probabilistic bias correction to generate a distribution of total infections by state. Steps: (1) sample bias parameters from the priors, with constraints to maintain plausible relations among parameters; (2) correct for incomplete testing by estimating infections among untested individuals, stratified by symptom category, using the sampled probabilities and observed testing volumes and positivity; (3) correct for imperfect test accuracy by adjusting observed positives for sensitivity and specificity; (4) repeat the process 10,000 times to obtain distributions of cumulative infections by state and nationally. Reported estimates are medians with simulation intervals (2.5th–97.5th percentiles). National percentages attributable to incomplete testing versus imperfect test accuracy were derived by aggregating state-specific estimates weighted by population.

Sensitivity analyses: Seven scenarios varied priors for the parameters with the least empirical support (e.g., positivity among untested individuals with mild or no symptoms, symptom probabilities among tested/untested individuals, and dependence between α and β modeled with Gaussian copulas). For each scenario, 10,000 Monte Carlo samples were drawn to reassess the robustness of the infection estimates.
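To make the estimation loop concrete, the sketch below is a minimal, single-state toy version of the Monte Carlo bias correction. It assumes simplified uniform priors and a Rogan-Gladen-style misclassification adjustment; the function name, the input values, and the prior ranges for α, β, and P(S|untested) are illustrative assumptions, not the authors' published code or their full seven-parameter specification, and the misclassification step is applied before the selection-bias step because the corrected positivity feeds into it.

```python
# Minimal single-state sketch of Monte Carlo probabilistic bias correction.
# Priors, parameter names, and correction formulas are illustrative
# simplifications, not the authors' exact specification.
import numpy as np

rng = np.random.default_rng(2020)

def estimate_infections(n_tests, n_pos, population, n_iter=10_000):
    """Return Monte Carlo samples of estimated cumulative infections."""
    n_untested = population - n_tests
    p_obs = n_pos / n_tests                      # observed test positivity
    estimates = np.empty(n_iter)
    for i in range(n_iter):
        # Sample bias parameters from assumed prior distributions.
        sens = rng.uniform(0.65, 1.00)           # RT-PCR sensitivity
        spec = rng.uniform(0.998, 1.00)          # RT-PCR specificity
        p_s_untested = rng.uniform(0.00, 0.10)   # P(moderate/severe symptoms | untested)
        alpha = rng.uniform(0.05, 1.00)          # positivity ratio: untested symptomatic vs. tested
        beta = rng.uniform(0.002, 0.20)          # positivity ratio: mild/asymptomatic vs. symptomatic

        # Misclassification correction (Rogan-Gladen style) among the tested.
        p_true_tested = np.clip((p_obs + spec - 1) / (sens + spec - 1), 0, 1)
        inf_tested = p_true_tested * n_tests

        # Selection-bias correction: infections among the untested,
        # stratified by symptom category.
        p_inf_untested_symp = np.clip(alpha * p_true_tested, 0, 1)
        p_inf_untested_asymp = np.clip(beta * p_inf_untested_symp, 0, 1)
        inf_untested = n_untested * (p_s_untested * p_inf_untested_symp
                                     + (1 - p_s_untested) * p_inf_untested_asymp)

        estimates[i] = inf_tested + inf_untested
    return estimates

# Summarize 10,000 draws as a median with a 2.5th-97.5th percentile
# simulation interval (hypothetical state-level inputs).
samples = estimate_infections(n_tests=150_000, n_pos=30_000, population=8_000_000)
print(np.percentile(samples, [2.5, 50, 97.5]))
```

In the full analysis this procedure is run per state with observed testing volumes and positivity, and the state-level draws are then aggregated to national totals.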
Key Findings
- As of April 18, 2020, the U.S. had an estimated 6,454,951 cumulative SARS-CoV-2 infections versus 721,245 confirmed cases, corresponding to approximately 1.9% versus 0.2% of the population. Infections were therefore about 3 to 20 times higher than confirmed cases (median ~9x), implying that roughly 89% of infections were undocumented (a quick arithmetic check follows this list).
- Attribution of underestimation: the majority of the discrepancy was due to incomplete testing, estimated at 86% (simulation interval: 64–99%), with the remainder due to imperfect test accuracy at 14% (0.3–36%). A related national estimate within the paper put these at approximately 84% and 16%, respectively.
- State variation: confirmed case counts ranged from 0.4 to 12.2 per 1,000 population, while estimated infections ranged from 3.0 to 121.5 per 1,000. Ratios of estimated infections to confirmed cases ranged from 5 to 33 across states. Underestimation was particularly pronounced in parts of the South and other regions with low testing rates or higher test positivity.
- Testing rates varied widely by state and increased over time, with notable geographic heterogeneity across regions.
- Sensitivity analyses showed estimates were generally robust to alternative plausible priors. The scenario that increased the upper bound and mean for positivity among untested individuals with mild or no symptoms yielded higher infection estimates, particularly in states with higher observed test positivity.
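As a sanity check, the fold difference and undocumented share follow directly from the reported totals. The population denominator below is an approximate national 2019 figure assumed for illustration, so the printed percentages differ slightly from the state-aggregated values reported in the paper.

```python
# Quick arithmetic check of the headline figures quoted above.
confirmed = 721_245
estimated = 6_454_951
us_pop = 328_200_000  # approximate 2019 U.S. population (assumption for illustration)

print(f"fold difference: {estimated / confirmed:.1f}x")            # ~8.9x
print(f"undocumented share: {1 - confirmed / estimated:.1%}")       # ~88.8%
print(f"confirmed share of population: {confirmed / us_pop:.2%}")   # ~0.22%
print(f"estimated share of population: {estimated / us_pop:.2%}")   # ~1.97%
```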
Discussion
Adjusting for biased testing practices and imperfect diagnostic accuracy substantially increases estimates of SARS-CoV-2 infection burden, directly addressing the research question. The findings demonstrate that most underestimation stemmed from incomplete testing, not test misclassification, emphasizing the critical role of broad, accessible testing to inform policy. The corrected estimates align with other independent lines of evidence (e.g., undocumented infections facilitating spread; serological and syndromic comparisons) and reinforce that, by mid-April 2020, only a small share of the U.S. population had been infected—far from herd immunity. The approach offers a transparent, data-anchored alternative to complex mechanistic models for assessing burden at a point in time, enabling better evaluation of mitigation efforts and resource allocation. Policymakers should interpret confirmed case counts with caution, especially in jurisdictions with low testing rates or high positivity, where reliance on confirmed cases alone risks underestimating transmission and under-responding.
Conclusion
The study presents a semi-Bayesian probabilistic bias analysis that corrects empirical COVID-19 case counts for incomplete testing and imperfect diagnostic accuracy, producing substantially higher and likely more realistic estimates of infection burden across U.S. states through April 18, 2020. The methodology quantifies the relative contributions of under-testing versus test misclassification and is broadly applicable to other locations and time periods. The results underscore the urgency of expanding testing access and improving data quality to support effective pandemic response. Future work should incorporate improved, time-varying priors as evidence accumulates; leverage age-stratified and setting-specific data (e.g., deaths, hospitalizations) to refine estimates (including IFR); and explore integration with mechanistic models to link burden estimates with projections while maintaining transparency about assumptions.
Limitations
- Priors for key parameters (e.g., symptom probabilities among tested/untested individuals, test sensitivity over time) were based on limited early clinical evidence and state testing guidelines; true values may differ and vary over time and place.
- Testing capacity and policies changed over time and across states; state-level guidance and supply constraints may not be fully captured by static priors.
- Reliance on state-reported testing and positivity data from the COVID Tracking Project introduces heterogeneity in data quality; some states had lower reliability or incomplete reporting.
- Empirical test positivity among tested individuals in low-testing states may poorly reflect population incidence, affecting the corrections.
- The approach does not include a mechanistic transmission model and therefore cannot forecast future dynamics.
- Accurate IFR estimation would require age-stratified death counts and additional corrections not available or parameterizable in this study.