Identifying interdisciplinary emergence in the science of science: combination of network analysis and BERTopic

K. Kim, D. F. Kogler, et al.

This study by Keungoui Kim, Dieter F. Kogler, and Sira Maliphol proposes an approach to identifying emerging interdisciplinary sciences by combining network analysis with BERTopic. The authors build a global map of scientific co-occurrence networks that reveals evolving influence patterns and the rise of new global scientific domains.

Introduction
The study addresses how to identify and anticipate emergent scientific topics as disciplinary boundaries increasingly blur due to knowledge recombination. Rising research output coincides with declining productivity and innovation rates, influenced by specialization, larger team sizes, and rising R&D costs. Canonical bias in citations further obscures detection of emerging topics. Understanding emergence is crucial for science, technology, and innovation policy, as impactful innovations often arise from interdisciplinary convergence (e.g., biotechnology, AI, sensors, nanotechnology). The research aims to analyze the complexity, self-organization, and evolution of scientific knowledge by combining network analysis and embedded topic modeling to detect emerging cross-domain topics at a global scale. The main objective is to contribute a novel approach integrating co-occurrence network analysis with BERTopic to identify emergent interdisciplinary topics, leveraging Eigenvector centrality to capture influence beyond frequency and using topic embeddings to retain contextual information.
Literature Review
Science mapping has evolved from citation analyses to network and topic modeling approaches to study topic identification, growth, and emergence. Prior work uses co-citation/citation networks to reveal transformative contributions and stages of scientific development, overlay maps for field categorization, and bibliometric indicators with text analysis to detect emerging technologies. Emergence is linked to novelty via merging previously separate research streams and fast-growing interdisciplinarity. Interdisciplinarity measurement often relies on frequency-based keyword co-occurrence within predefined local maps, with diversity measures (e.g., Rao–Stirling, relative variety) and clustering. However, these approaches are limited by canonical bias, constrained topic scopes, and loss of context. Global science maps can yield more accurate partitions and higher textual coherence by preserving broader context. Long-distance interdisciplinarity tends to yield higher impact, and technology-convergent science often has greater influence. Existing studies frequently predefine domains, rely on frequency and diversity within narrow scopes, and model interdisciplinarity primarily through keyword evolution. The paper responds by using a global co-occurrence network of WoS categories and embedded topic modeling to capture influential, context-rich emergent interdisciplinarity without restricting to local maps.
Methodology
Overall design: A two-stage pipeline: (1) construct science category–subject co-occurrence networks to identify influential (dominant) and fast-growing (emerging/growing) interdisciplinary subjects; (2) conduct embedded topic modeling (BERTopic) on publications filtered to those associated with growing subjects, followed by qualitative validation.
Data collection: Web of Science Core Collection metadata (titles, year, journal, authors, institutions/addresses, categories/subheadings, subjects, funding, citations, document type, publication type). Restricted to journal articles (doc_type='Article', pub_type='Journal') with abstracts, focusing on STEM-related subheadings: Life Sciences & Biomedicine (LSB), Technology (TE), and Physical Sciences (PS). Periods split into consecutive three-year windows (2012–2014; 2015–2017) to stabilize rankings. Initial corpus: 7,453,987 publications (10,138 journals, 226 subjects) across 2012–2017; interdisciplinary subset (≥2 subheadings among LSB, TE, PS): 1,194,332 publications (1,137 journals, 172 subjects). Descriptives (Table 1) by period: LSB-TE: 68,768→79,112 publications; LSB-PS: 115,499→120,161; PS-TE: 345,520→414,010; LSB-PS-TE: 25,447→25,805.
Stage 1: Science category–subject co-occurrence network
- Construction: For each publication, list WoS subjects mapped to their broad subheading(s). Nodes are science category–subjects; edges connect subjects co-listed within a publication. Networks are undirected and weighted by co-occurrence counts aggregated across publications per period and per interdisciplinary pairing (LSB-TE, LSB-PS, PS-TE, LSB-PS-TE). Pair generation includes all unique cross-category pairs per publication.
- Centrality: Compute Eigenvector centrality (EIG) on each period’s co-occurrence network. EIG captures influence by weighting connections to already influential nodes (appropriate for undirected networks), complementing degree and PageRank.
- Dominant vs. growing: Define dominant subjects as top 10% by EIG in the current period; define growing (emerging) subjects as top 10% by growth rate of EIG between periods (EIG.GR). Subjects in both sets are both dominant and emerging. This conservative threshold addresses skewed degree distributions and supports consistent cross-period comparisons.
- Selection for Stage 2: Focus subsequent topic modeling on publications tied to growing subjects to emphasize emergent interdisciplinarity and reduce computation.
Stage 2: Embedded topic modeling (BERTopic)
- Pipeline: Document embeddings (Sentence-BERT, all-MiniLM-L6-v2, 384-dim) → dimensionality reduction (UMAP) → clustering (HDBSCAN) → topic representation (class-based TF-IDF, c-TF-IDF). UMAP mitigates the curse of dimensionality while preserving neighborhood structure; HDBSCAN identifies dense clusters and handles noise; c-TF-IDF extracts representative terms per topic.
- Hyperparameter search: Random search (≤100 iterations) across n-gram ranges (unigram, bigram, trigram), initial topic counts (5–1000), and minimum topic size set proportionally to corpus size (≈0.5–3% of documents per case; e.g., LSB-TE: 130–780; LSB-PS: 50–300; PS-TE: 240–1440; LSB-PS-TE: 5–50). Best model chosen via lowest information entropy of topic word distributions, favoring clearly distinguishable, semantically coherent topics.
Qualitative validation: Inspect topic keywords and top representative articles and journals per topic to assess interpretability, coherence, and fit to interdisciplinary definitions. Evaluate whether topics reflect sensible, cross-domain themes and whether journal venues align with inferred topics.
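Stage 1 can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the toy publications are invented, each subject is assumed to map to a single subheading, EIG.GR is assumed to be the relative change (current − previous) / previous, and the function names (cooccurrence_weights, eigenvector_centrality, growth_rates, top_share) are hypothetical. Eigenvector centrality is computed by power iteration on A + I, which has the same leading eigenvector as A and avoids oscillation on bipartite networks.

```python
from collections import Counter
from itertools import combinations
from math import ceil, sqrt


def cooccurrence_weights(publications):
    """Count cross-category subject pairs co-listed on the same publication."""
    weights = Counter()
    for pub in publications:
        for (s1, c1), (s2, c2) in combinations(sorted(set(pub)), 2):
            if s1 != s2 and c1 != c2:  # keep cross-category pairs only
                weights[(s1, s2)] += 1
    return weights


def eigenvector_centrality(weights, iterations=500, tol=1e-10):
    """Power iteration on (A + I) for an undirected weighted network."""
    adj = {}
    for (a, b), w in weights.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    scores = {node: 1.0 for node in adj}
    for _ in range(iterations):
        new = {n: scores[n] + sum(w * scores[m] for m, w in adj[n].items())
               for n in adj}
        norm = sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - scores[n]) for n in adj) < tol
        scores = new
        if converged:
            break
    return scores


def growth_rates(previous, current):
    """Assumed EIG.GR: relative change, for subjects present in both periods."""
    return {s: (current[s] - previous[s]) / previous[s]
            for s in current if previous.get(s, 0) > 0}


def top_share(scores, share=0.10):
    """Return the subjects in the top `share` of a score dictionary."""
    k = max(1, ceil(len(scores) * share))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Invented toy data: each publication is a list of (subject, subheading).
ENERGY = ("Energy & Fuels", "TE")
ECOLOGY = ("Ecology", "LSB")
FORESTRY = ("Forestry", "LSB")
PHARMA = ("Pharmacology & Pharmacy", "LSB")

period_1 = [[ECOLOGY, ENERGY]] * 3 + [[FORESTRY, PHARMA, ENERGY]]
period_2 = ([[ECOLOGY, ENERGY]] * 2 + [[FORESTRY, ENERGY]] * 3
            + [[PHARMA, ENERGY]])

weights_1 = cooccurrence_weights(period_1)
eig_1 = eigenvector_centrality(weights_1)
eig_2 = eigenvector_centrality(cooccurrence_weights(period_2))

dominant = top_share(eig_2)                       # top 10% by current EIG
growing = top_share(growth_rates(eig_1, eig_2))   # top 10% by EIG growth
print("dominant:", dominant)
print("growing:", growing)
```

In this toy network the hub subject is dominant while a peripheral subject whose ties to the hub strengthen is the one flagged as growing, which mirrors the distinction the method draws between incumbents and fast-rising subjects.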
Key Findings
Dataset and interdisciplinarity scope:
- From 7,453,987 WoS publications (2012–2017), 1,194,332 are interdisciplinary across STEM subheadings (LSB, TE, PS). Interdisciplinary descriptives by 3-year periods (Table 1):
• LSB-TE: 68,768→79,112 publications; subjects 80→81; journals 162→175.
• LSB-PS: 115,499→120,161; subjects 67→67; journals 228→248.
• PS-TE: 345,520→414,010; subjects 85→86; journals 584→637.
• LSB-PS-TE: 25,447→25,805; subjects 43→43; journals 40→43.
Network centrality and emergence:
- Growing subjects (top 10% by EIG growth rate) show substantially higher EIG in the subsequent period than other subjects: mean next-period EIG 0.348 vs. 0.093 (Fig. 6), supporting predictive validity of the growth-based filter.
- Dominant vs. growing subjects differ markedly across all interdisciplinary fields (Fig. 5; Table 2), indicating that influential incumbents (dominant) are not necessarily the fastest-rising (growing) subjects.
Examples from Table 2:
- LSB-TE dominant: Environmental Sciences; Engineering, Environmental; Green & Sustainable Science & Technology; Energy & Fuels; Engineering, Chemical; Ecology. Growing: Forestry; Materials Science, Textiles; Instruments & Instrumentation; Pharmacology & Pharmacy; Green & Sustainable Science & Technology; Medicine, Research & Experimental; Engineering, Environmental; Ecology.
- LSB-PS dominant: Public, Environmental & Occupational Health; Radiology, Nuclear Medicine & Medical Imaging; Chemistry, Applied; Biochemistry & Molecular Biology; Food Science & Technology; Chemistry, Analytical; Biochemical Research Methods. Growing: Neurosciences; Health Care Sciences & Services; Immunology; Polymer Science; Paleontology; Microbiology.
- PS-TE dominant: Chemistry, Multidisciplinary; Chemistry, Medicinal; Materials Science, Multidisciplinary; Physics, Applied; Nanoscience & Nanotechnology; Chemistry, Physical. Growing: Fisheries; Engineering, Aerospace; Green & Sustainable Science & Technology; Engineering, Marine; Geography, Physical.
- LSB-PS-TE dominant: Physics, Condensed Matter; Chemistry, Multidisciplinary; Engineering, Electrical & Electronic; Energy & Fuels; Materials Science, Coatings & Films; Environmental Sciences; Water Resources; Engineering, Environmental; Computer Science, Interdisciplinary Applications; Statistics & Probability. Growing: Water Resources; Engineering, Mechanical; Acoustics; Engineering, Ocean; Automation & Control Systems; Remote Sensing; Imaging Science & Photographic Technology; Geosciences, Multidisciplinary; Crystallography.
Topic modeling results (BERTopic):
- Emergent cross-domain themes are coherent and interpretable, frequently highlighting green/sustainability and health-related topics.
• LSB-TE: Sustainable Environmental Technologies and Resource Management (≈22.3k docs); Cancer Biomarker Expression in Clinical Patient Groups (≈1.3k docs); Mechanical Properties and Composition of Natural Fibrous Materials (outlier cluster, ≈272 docs).
• LSB-PS: Marine Biodiversity and Climate Impact Studies; Clinical Trial Modeling and Simulation Techniques; Chitosan Bioactivity and Drug Delivery Applications (topic sizes in thousands for the two largest topics).
• PS-TE: Adsorption and Membrane Processes for Water Treatment (≈10.9k docs); Heat Transfer Modeling and Analysis in Fluid Systems (≈38.2k docs).
• LSB-PS-TE: Flood Risk Assessment and Spatial Modeling (≈334 docs); Protein–Molecule Docking and Interaction Dynamics (≈570 docs).
- Representative articles per topic (Table 5) exhibit high topical fit (e.g., natural fibrous materials in wood/forestry journals; adsorption/membranes for water treatment; heat transfer and fluid systems; flood risk and spatial modeling; protein docking and interaction dynamics).
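The topic models behind these results were chosen by lowest information entropy of topic word distributions. The summary does not say how that entropy was computed or aggregated, so the following is only a sketch under assumptions: each topic's term weights (for example, c-TF-IDF scores) are normalized into a probability distribution, Shannon entropy is taken in bits, and candidates are compared by the mean over topics. The candidate names and weights are invented for illustration.

```python
from math import log2


def topic_entropy(term_weights):
    """Shannon entropy (bits) of one topic's normalized term weights."""
    total = sum(term_weights.values())
    probs = [w / total for w in term_weights.values() if w > 0]
    return -sum(p * log2(p) for p in probs)


def mean_entropy(topics):
    """Average entropy across all topics of one candidate model (assumed)."""
    return sum(topic_entropy(t) for t in topics.values()) / len(topics)


# Invented candidates from a hypothetical random search: name -> topics.
candidates = {
    "bigram_min240": {
        0: {"adsorption": 0.7, "membrane": 0.2, "water": 0.1},
        1: {"heat": 0.6, "transfer": 0.3, "fluid": 0.1},
    },
    "unigram_min1440": {
        0: {"water": 0.25, "heat": 0.25, "model": 0.25, "flow": 0.25},
        1: {"study": 0.5, "results": 0.5},
    },
}

# Lower entropy means term weight is concentrated on a few distinctive words.
best = min(candidates, key=lambda name: mean_entropy(candidates[name]))
print(best, round(mean_entropy(candidates[best]), 3))
```

Under this reading, a model whose topics spread weight evenly over generic words scores a higher entropy and loses to one with sharply peaked, distinguishable topics.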
Journal concentration (Table 6):
- Publications with emergent interdisciplinary topics are concentrated within a relatively small share of journals. Half of all emergent-topic publications appear within top journal percentiles: 14th (LSB-TE), 13th (LSB-PS), 10th (PS-TE), 18th (LSB-PS-TE).
- Top venues align with topic themes (examples):
• LSB-TE: Journal of Cleaner Production, Environmental Science & Technology, Journal of Hazardous Materials.
• LSB-PS: International Journal of Biological Macromolecules, Palaeogeography Palaeoclimatology Palaeoecology.
• PS-TE: Desalination and Water Treatment, Applied Thermal Engineering, International Journal of Heat and Mass Transfer.
• LSB-PS-TE: IEEE JSTARS, Combustion and Flame, Ultrasonics Sonochemistry, Journal of Molecular Graphics & Modelling, Geocarto International, Natural Hazards Review.
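One reading of the concentration statistic is the smallest share of journals, ranked by publication count, that together hold half of the emergent-topic publications. The sketch below computes that share under this assumed definition; the function name journal_share_for_coverage and the journal counts are hypothetical and do not come from Table 6.

```python
from collections import Counter


def journal_share_for_coverage(journal_per_publication, coverage=0.5):
    """Share of journals (largest first) needed to reach `coverage` of papers."""
    counts = Counter(journal_per_publication)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n_pubs) in enumerate(counts.most_common(), start=1):
        covered += n_pubs
        if covered >= coverage * total:
            return rank / len(counts)
    return 1.0


# Invented example: one large venue and nine small ones (95 papers in total).
toy_publications = ["Journal A"] * 50
for i in range(9):
    toy_publications += [f"Journal {i}"] * 5

share = journal_share_for_coverage(toy_publications)
print(f"{share:.0%} of journals hold half of the publications")
```

A small share, as in the reported 10–18% range, indicates that a handful of venues carries most emergent-topic output.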
Discussion
The findings demonstrate that measuring interdisciplinarity via Eigenvector centrality growth uncovers emerging science category–subjects that are distinct from dominant, frequency-driven incumbents. By focusing on co-occurrence influence rather than volume, the approach detects areas with increasing cross-domain connectivity—prospective loci of knowledge recombination. Embedded topic modeling on these filtered corpora yields coherent emergent themes, frequently tied to sustainability (e.g., water treatment, environmental resource management) and health/biomedicine (e.g., biomarkers, chitosan-based drug delivery), consistent with high-impact convergence domains. The observed elevation of next-period EIG for growing subjects supports the method’s anticipatory value. The integration of global science maps with embedding-based topic modeling preserves context, mitigates canonical bias, and broadens coverage beyond pre-defined local domains. These results are relevant for STI policy and strategic research funding, offering a data-driven means to track evolving interdisciplinary boundaries, identify early-stage convergence, and inform portfolio diversification toward societally salient areas.
Conclusion
This study introduces a two-stage framework combining category–subject co-occurrence network analysis with BERTopic to identify emergent interdisciplinary topics at a global scale. Contributions: (i) broaden interdisciplinarity operationalization to global, cross-domain science categories; (ii) employ Eigenvector centrality and its growth as influence-based indicators of emergence distinct from frequency-based dominance; (iii) demonstrate embedded topic modeling’s capacity to retain contextual semantics and extract interpretable emergent themes from large bibliometric corpora. Empirically, growing subjects exhibit greater subsequent influence (higher next-period EIG), and emergent topics prominently feature green/sustainability and health-related areas. Future research directions include refining topic granularity (e.g., recursive clustering), exploring additional embeddings or multilingual corpora, extending beyond journal articles to other scholarly outputs, and integrating policy-relevant impact metrics to better align detection of emergence with societal innovation needs.
Limitations
- Topic scope: Conservative hyperparameter choices yielded a relatively small number of topics; additional emergent themes likely remain undetected. Future work could apply recursive clustering and calibrate granularity while balancing diversity and similarity.
- Computational resources: Embedding-based NLP requires substantial computation, limiting routine policy applications without adequate infrastructure.
- Data coverage: Restricted to WoS journal articles with abstracts in STEM subheadings; disciplines emphasizing non-journal outputs or social innovations may be underrepresented. Broader inclusion of books, software, datasets, patents, and gray literature could improve coverage.
- Bias and classification: Reliance on WoS categorization and automated topic modeling entails potential misclassification; qualitative validation mitigates but does not eliminate such risks.
- Impact measures: The study focuses on influence within co-occurrence networks; incorporating broader impact metrics (per DORA) could better capture societal value.