The CoRisk-Index: a data-mining approach to identify industry-specific risk perceptions related to Covid-19

F. Stephany, L. Neuhäuser, et al.

The CoRisk-Index, developed by Fabian Stephany, Leonie Neuhäuser, Niklas Stoehr, Philipp Darius, Ole Teutloff, and Fabian Braesemann, offers a near-real-time view of corporate risk perceptions during the Covid-19 pandemic and correlates with unemployment and market volatility. The research uses data mining and natural language processing to provide insights that bridge stock market data and macroeconomic trends.

Introduction
The Covid-19 pandemic has caused the largest global economic disruption of the twenty-first century. Travel bans, supply chain failures, and store closures pose significant risks to entire industries, and repeated lockdowns left the economy in unprecedented uncertainty. While governments mobilised substantial financial support, targeted and continuous countermeasures require reliable, up-to-date information on economic trends. Existing approaches face a gap: macroeconomic indicators and surveys arrive with sizeable lags and offer only aggregated pictures, while high-frequency signals such as stock markets, news, and search queries are noisy, prone to herd behaviour, and often detached from firms’ operational realities or lack sector granularity. To bridge this gap, the paper proposes analysing institutional language in legally mandated 10-K SEC filings, which include forward-looking risk assessments and have proven informative for risk analysis. The authors introduce the CoRisk-Index, built from two components—the count of ‘corona’ keywords and the negativity of sentences mentioning corona—to measure industry-specific risk perceptions in near real time across sectors representing more than one-third of the US workforce. They argue the index is both effective—anticipating unemployment developments and correlating with established risk indicators like VIX—and efficient compared to survey-based measures. The work focuses on industry-level dynamics and complements prior firm-level studies by leveraging standardized, reproducible 10-K reports for automated text mining.
Literature Review
Two main strands of prior work have examined Covid-19’s economic effects: (1) traditional macroeconomic modelling and simulations based on past shocks, and (2) computational social science using alternative data sources such as stock market data, news, website content, search queries, and trade/transport statistics. Each has limitations for fast-moving crises: macro indicators are lagged and aggregated, while high-frequency signals can be noisy, susceptible to herding, and insufficiently reflective of firms’ practical situations. Sector-level granularity is often lacking in market-based indicators. The paper positions SEC 10-K filings as a forward-looking, risk-sensitive, and standardized textual source that complements these approaches. It contrasts its industry-level focus with Hassan et al. (2020), who analyse firm-level pandemic risk from earnings call transcripts; 10-Ks offer a dedicated risk section, legal accountability, public accessibility, and standardized format, making them suitable for scalable, reproducible text mining. Prior research supports the forward-looking nature and analytical value of SEC filings and text-based risk measures.
Methodology
Data source and scraping: The study automatically collects 10-K reports filed with the U.S. SEC starting 30 January 2020 (first observed coronavirus mention). Using the SEC’s crawler.idx index, the pipeline constructs each report’s index.htm page, resolves the cryptic htm filename of the latest 10-K, scrapes the text, and extracts sentences that mention Covid-19-related terms. Ready-made EDGAR packages or APIs were deemed impractical due to format noise and volume constraints.
Corona keyword specification and sentence filtering: Reports are lowercased and scanned via regular expressions for the relatively unambiguous terms 'corona' and 'covid' (including 'coronavirus' and 'covid-19'), avoiding double-counting. All sentences from risk sections containing these keywords are retained for analysis.
Sentiment (text negativity): Negativity is computed as the share of negative words per corona-mentioning sentence, using the Loughran and McDonald (2011) financial sentiment dictionary. For each industry-day, the average share of negative words per corona sentence forms the daily negativity measure.
Keyword count: For each report, the total number of corona keywords is counted. Industry-day aggregates reflect the average number of corona keywords per report.
Index construction: The CoRisk-Index for industry i at time t is the geometric mean of (a) the average corona keyword count per report and (b) the average negativity of corona sentences. A 14-day right-aligned moving average is applied for smoothing, and values are scaled by 100 for presentation. Alternative specifications (e.g., share of reports mentioning corona) did not materially change results, so the two-factor index was preferred for simplicity and interpretability.
Industry mapping and representativeness: SEC industry classifications (SIC, 1987) are mapped to NAICS (2017) using U.S. Census concordances to enable compatibility with current U.S. statistics. Where SICs map to multiple NAICS categories, the most frequent section is retained to avoid duplicates. Representativeness is assessed via COMPUSTAT employee counts (4700 firms of 6400 with data; 44 million employees across 13 sectors ≈ one-third of U.S. workforce). To avoid extrapolation, the index focuses on eight sectors where SEC-reporting firms cover at least 22% of sector employment.
Seasonality: Filing seasonality (2014–2018) is examined; Finance reports cluster in Q1, but overall the CoRisk-Index does not appear biased by seasonal reporting patterns across industries.
Topic detection: An exploratory Latent Dirichlet Allocation (LDA) topic model is trained on corona-related paragraphs from risk sections (after stopword removal, pruning very frequent terms, and retaining words appearing in at least two documents). The final LDA model uses four topics (selected via semantic coherence). To ensure stable, interpretable, and time-consistent topic tracking, a dictionary-based approach is derived from frequent bi-/tri-grams and expert economic knowledge, defining eight topical domains: Production, Supply, Travel, Demand, Finance, Aid, Remote, Vaccine. Topic prevalence is computed as the share of topic keywords relative to sentence length, aggregated by industry and time.
Granger causality testing: To assess effectiveness, vector autoregression (VAR)-based Granger causality tests relate the CoRisk-Index to monthly unemployment (all economy and sector-specific), VIX, and the S&P 1200 Global Index. All series are normalized by first differences and tested stationary via KPSS (α=5%). Lag lengths are selected, and significance is assessed via F-tests (null: no added explanatory power from CoRisk lags). Limitations include potential confounding variables, interpretation as predictive signal rather than causal mechanism, and VAR’s inability to capture non-linearities.
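The logic of the Granger test can be illustrated with a simplified, single-equation F-test. This is a minimal sketch and not the authors' VAR pipeline: the function names, the synthetic data, and the fixed lag are assumptions for illustration, the KPSS stationarity check and lag selection are omitted, and no p-value is computed (the statistic would be compared against an F critical value).

```python
import random


def first_difference(series):
    """Normalize a series by taking first differences."""
    return [b - a for a, b in zip(series, series[1:])]


def ols_rss(rows, targets):
    """Residual sum of squares of an OLS fit (an intercept is added)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y - sum(b * v for b, v in zip(beta, x))) ** 2
               for x, y in zip(X, targets))


def granger_f(cause, effect, lag=1):
    """F-statistic for H0: lags of `cause` add no explanatory power for `effect`."""
    n = len(effect)
    targets = effect[lag:]
    own = [[effect[t - j] for j in range(1, lag + 1)] for t in range(lag, n)]
    both = [row + [cause[t - j] for j in range(1, lag + 1)]
            for row, t in zip(own, range(lag, n))]
    rss_restricted = ols_rss(own, targets)    # own lags only
    rss_unrestricted = ols_rss(both, targets)  # own lags plus lags of cause
    dof = len(targets) - (2 * lag + 1)
    return ((rss_restricted - rss_unrestricted) / lag) / (rss_unrestricted / dof)


# Synthetic example (not real data): y follows x with a one-period delay.
rng = random.Random(0)
x = [rng.random() for _ in range(60)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.random() for t in range(1, 60)]
dx, dy = first_difference(x), first_difference(y)
f_xy = granger_f(dx, dy, lag=1)  # does x help predict y?
f_yx = granger_f(dy, dx, lag=1)  # does y help predict x?
print(round(f_xy, 1), round(f_yx, 1))
```

A large statistic in one direction and a much smaller one in the other is the pattern the paper reports for the CoRisk-Index relative to unemployment. As the methodology notes, it indicates a predictive signal rather than a causal mechanism.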
Scope decisions: The analysis uses 10-K (annual) filings only, excluding 10-Qs due to reporting disruptions and incentives to discuss Covid-19 in early 2020 that may bias quarterly filings. 10-Ks offer longer-term outlooks and more stable standards.
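The keyword count, negativity, and index construction steps of the methodology can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the negative word set is a tiny hypothetical stand-in for the Loughran and McDonald dictionary, the sentence splitter is naive, and the order of operations (geometric mean per day, then smoothing) is assumed because the summary does not specify it.

```python
import math
import re

# Hypothetical stand-in for the Loughran-McDonald negative word list;
# the real dictionary is much larger and must be obtained separately.
NEGATIVE_WORDS = {"adverse", "adversely", "disruption", "disruptions",
                  "losses", "decline"}

# One alternation, so "coronavirus" or "covid-19" is counted once, not twice.
KEYWORD_RE = re.compile(r"corona|covid")


def count_keywords(report_text):
    """Number of corona keywords in one lowercased report."""
    return len(KEYWORD_RE.findall(report_text.lower()))


def corona_sentences(report_text):
    """Sentences mentioning a corona keyword (naive punctuation-based split)."""
    sentences = re.split(r"(?<=[.!?])\s+", report_text.lower())
    return [s for s in sentences if KEYWORD_RE.search(s)]


def negativity(sentence):
    """Share of negative words among all words of a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if not words:
        return 0.0
    return sum(w in NEGATIVE_WORDS for w in words) / len(words)


def corisk(avg_keyword_count, avg_negativity):
    """Geometric mean of the two components, scaled by 100."""
    return 100 * math.sqrt(avg_keyword_count * avg_negativity)


def trailing_mean(values, window=14):
    """Right-aligned moving average; early days use the values available so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Made-up report text for illustration only.
report = ("The coronavirus outbreak may cause adverse disruption. "
          "Covid-19 could reduce demand. Our offices are in Ohio.")
k = count_keywords(report)
sents = corona_sentences(report)
n = sum(negativity(s) for s in sents) / len(sents)
print(k, len(sents), round(corisk(k, n), 1))
```

In practice `corisk` would be applied to industry-day averages over many reports, and `trailing_mean` to the resulting daily series. With an average of 20 keywords per report and 5% negativity, the index equals 100, which matches the order of magnitude of the index points reported in the findings.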
Key Findings
Early warning and dynamics:
- Corona keyword mentions in 10-Ks rose steadily before the first U.S. Covid-19 wave (March 2020), then oscillated with notable peaks in June, September, and December 2020, reflecting waves of economic concern.
- Text negativity (share of negative words in corona sentences) spiked in early February 2020, before the February 20 stock market drop, and peaked near 7% just before the March 9 'Black Monday' losses. Negativity decreased through mid- to late-2020 as financial support stabilized markets.
Industry-specific CoRisk patterns (composite index):
- Manufacturing, Wholesale & Retail, and Professional & Business Services showed steep, early rises in Q1–Q2 2020, stabilizing around 100–150 index points.
- Leisure & Hospitality, Finance, and Transportation & Utilities rose later but steadily throughout 2020.
- Information and Mining peaked at relatively lower levels, generally between 50–100 index points.
- The aggregate 'All Industries' panel exhibits three peaks in early summer, autumn, and winter 2020.
Topic-specific insights (dictionary-based):
- Travel restrictions were most salient for Transportation & Utilities soon after the January 31, 2020 U.S. travel measures; less so for Professional & Business Services.
- Supply/production disruptions were primarily a Manufacturing concern following East Asian restrictions; less pronounced in service-oriented sectors.
- Demand shock from the first lockdown was especially acute in Wholesale & Retail.
- Financial impact concerns rose in Finance early in 2020 and again in January 2021; Mining’s financial concerns increased from autumn 2020 during the second wave.
- Mentions of governmental aid (e.g., CARES Act) increased in Wholesale & Retail during infection peaks and in Leisure & Hospitality through H2 2020.
- Remote working topics surged in Professional & Business Services during stay-at-home periods; far less relevant in Transportation & Utilities.
- Vaccine-related mentions trended upward across all industries into 2021.
Effectiveness relative to macro/market indicators:
- The CoRisk-Index exhibits significant Granger causality with VIX (lag 1 month), S&P 1200 (lag 2 months), and unemployment (overall and sector-specific), leading unemployment developments by approximately 1–4 months (e.g., 1 month for All Industries, 2 months Manufacturing, 4 months Finance). KPSS tests confirm stationarity after normalization; multiple F-tests are significant at α=5%.
- The index correlates with unemployment rates across industries and with market volatility and preempted the February 2020 stock market losses.
Generalisability:
- Applying the methodology to 2018 'US–China trade war' filings, negativity in 'china' sentences ranged approximately 1.7%–3.4%, with tariff announcements corresponding to spikes and easing events to declines, illustrating broader applicability beyond pandemics.
Coverage and representativeness:
- Firms analysed represent about 44 million U.S. employees across 13 sectors (~one-third of the U.S. workforce). Eight sectors with at least 22% employment coverage are reported in the index.
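The dictionary-based topic shares behind the topic-specific findings could be computed roughly as sketched here. The keyword lists are hypothetical placeholders (the paper derives its own from frequent bi-/tri-grams and expert knowledge, and only five of the eight domains are shown), and counting each phrase hit once against the word count of the sentence is an assumed reading of "share of topic keywords relative to sentence length".

```python
import re

# Hypothetical keyword lists for illustration; not the paper's dictionaries.
TOPIC_TERMS = {
    "Supply": ["supply chain", "suppliers"],
    "Travel": ["travel restrictions", "travel ban"],
    "Aid": ["cares act"],
    "Remote": ["remote work", "work from home"],
    "Vaccine": ["vaccine"],
}


def topic_prevalence(sentence):
    """Topic keyword hits divided by sentence length in words, per topic."""
    text = sentence.lower()
    n_words = len(re.findall(r"[a-z0-9'-]+", text))
    if n_words == 0:
        return {topic: 0.0 for topic in TOPIC_TERMS}
    result = {}
    for topic, terms in TOPIC_TERMS.items():
        # Leading word boundary only, so plural forms also match.
        hits = sum(len(re.findall(r"\b" + re.escape(term), text))
                   for term in terms)
        result[topic] = hits / n_words
    return result


# Made-up sentence for illustration only.
shares = topic_prevalence(
    "Travel restrictions disrupted our supply chain and suppliers.")
print(shares)
```

Averaging these per-sentence shares by industry and time period would give the kind of topic series described in the findings. A fixed dictionary keeps the topic definitions stable over time, which is the reason the methodology gives for preferring it over the exploratory LDA model.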
Discussion
The CoRisk-Index provides timely, industry-specific insights that bridge the gap between fast-moving but noisy financial signals and lagged, aggregated macroeconomic statistics. It reveals how industries differ in timing and magnitude of perceived risks and in the specific challenges they face. These granular, topic-tagged signals can inform more targeted policy responses, such as early identification of supply-chain vulnerabilities in Manufacturing, timing and allocation of aid across sectors (e.g., Wholesale & Retail early in the lockdown, Mining and Leisure & Hospitality later in 2020), and support for remote work infrastructure in office-based sectors. By tracking evolving concerns—travel restrictions, supply and production disruptions, demand shocks, financing conditions, aid needs, remote work, and vaccination—the index helps policymakers and researchers monitor sectoral stress and adapt interventions during different phases of the crisis.
Conclusion
The paper introduces the CoRisk-Index, an efficient, forward-looking, and reproducible measure of industry-specific business risk perceptions related to Covid-19 derived from standardized 10-K SEC filings. Combining corona keyword intensity with sentiment negativity yields a reactive indicator that anticipates unemployment developments, correlates with VIX and the S&P 1200, and captures sectoral heterogeneity and topical dimensions of risk. The methodology generalises to other contexts (e.g., trade disputes) and offers open access via an interactive dashboard and public code/data. The authors continue to update the index weekly to provide timely evidence for researchers and policymakers. Future work is facilitated by the open data and can extend to finer industry or firm levels and to additional external events where comparable standardized disclosures are available.
Limitations
The analysis is limited to U.S. 10-K filings; comparable standardized international data are not publicly available, and 10-Qs were excluded due to disrupted reporting and potential short-term bias in early 2020. The approach relies on company self-reporting and risk disclosure, which may be influenced by disclosure incentives and could affect stock price volatility. Mapping from 1987 SIC to 2017 NAICS entails some information loss; not all sectors are equally represented, and eight sectors are emphasised to avoid over-extrapolation. Although seasonality was assessed and appears limited outside Finance, filing patterns could still affect timing. Granger causality identifies predictive relationships, not structural causation, and may be confounded by omitted variables; VAR models used cannot capture non-linear dynamics. Finally, firms cannot report risks unforeseeable at the time of filing, and long-term crisis implications remain uncertain.