Interdisciplinary Studies
Complex networks reveal emergent interdisciplinary knowledge in Wikipedia
G. A. Schwartz
Specialisation has fragmented knowledge into loosely connected disciplines, restricting knowledge flow across and even within fields. Prior approaches (e.g., co-occurrence, semantic models, bibliometric systems) mitigate this, but interdisciplinary research still lacks efficient quantitative tools for connecting domains such as science, art, and literature. Wikipedia’s vast network of editor-curated internal links encodes implicit relationships among people, ideas, and works. The research question is whether unsupervised network methods can mine these links to reveal emergent, quantifiable interdisciplinary knowledge. As a proof of concept, the study examines relationships among Albert Einstein (science), Pablo Picasso (art), and James Joyce (literature) in the early 20th century to assess whether their domains share structural connections, common influences, and knowledge flow.
The study builds on work visualizing knowledge domains and identifying emerging topics (e.g., Börner et al., Small et al.), literature-based discovery (Henry and McInnes), and Wikipedia-based semantic relatedness and NLP (Gabrilovich and Markovitch). Bibliometric and citation-based systems (e.g., Gates et al.) map scientific influence but are less suited for cross-domain links spanning art and literature. Prior Wikipedia/knowledge-base resources (WikiLinkGraphs, YAGO2, ConceptNet/DBpedia) demonstrate that internal link networks capture semantic associations. This work extends these ideas by using Wikipedia’s internal links to quantify interdisciplinary relatedness across art, science, and literature beyond citation networks.
Data: The study used the March 1, 2018 snapshot of the WikiLinkGraphs dataset (Consonni et al., 2019), which contains the internal links intentionally added in the article text of English Wikipedia. The original table lists 163,380,007 internal links. Page titles (from/to) were imported into MATLAB as a directed, unweighted graph with 13,680,532 nodes and 163,380,007 edges. Iterative removal of poorly linked pages (zero in/outdegree, redirects, disambiguations) yielded the graph wikiLinksClean, with 7,879,531 nodes and 150,995,780 edges.
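The iterative part of this cleaning step can be illustrated with a minimal Python sketch (the study itself used MATLAB). It covers only degree-based pruning and assumes that a page counts as poorly linked when it has no incoming or no outgoing links; the function name `prune_poorly_linked` is hypothetical, and redirect and disambiguation handling is left out.

```python
def prune_poorly_linked(edges):
    """Iteratively drop nodes whose in-degree or out-degree is zero.

    `edges` is an iterable of (source, target) page-title pairs.
    Assumption: "poorly linked" means no incoming OR no outgoing links.
    Redirects and disambiguation pages are not handled in this sketch.
    """
    edges = set(edges)
    while True:
        sources = {a for a, _ in edges}
        targets = {b for _, b in edges}
        keep = sources & targets  # nodes with at least one in- and one out-link
        pruned = {(a, b) for a, b in edges if a in keep and b in keep}
        if pruned == edges:  # nothing left to remove: fixed point reached
            return pruned
        edges = pruned


toy = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "A")]
cleaned = prune_poorly_linked(toy)  # D (no out-links) and E (no in-links) drop out
```

The loop has to repeat because removing one node can leave a neighbour with zero in- or out-degree, which is why a single pass is not enough.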
Subgraph (universe) from seeds: Seeds were “Pablo Picasso,” “Albert Einstein,” and “James Joyce.” Nodes within graph distance d≤2 from each seed were collected to form the universe. Because the network is dense (average shortest path of about 4.1), restricting the search to d≤2 limits the inclusion of irrelevant links. The initial universe contained 79,454 nodes and 3,166,325 edges. Unresolved redirects were handled by treating nodes with outdegree=1 as redirects: each was redirected to its single successor and then removed. Nodes with zero in/outdegree were also removed. Final universe: 78,444 nodes and 3,159,866 edges.
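Collecting the nodes within distance 2 of each seed amounts to a depth-limited breadth-first search. The sketch below is illustrative only: it assumes distance is measured along outgoing links (the summary does not state the direction), and the names `nodes_within` and `build_universe` are hypothetical.

```python
from collections import deque


def nodes_within(succ, seed, max_dist=2):
    """Return all nodes reachable from `seed` in at most `max_dist` hops.

    `succ` maps a page title to an iterable of pages it links to.
    Assumption: distance follows out-links only.
    """
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the distance limit
        for nxt in succ.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


def build_universe(succ, seeds, max_dist=2):
    """Union of the depth-limited neighbourhoods of all seeds."""
    universe = set()
    for seed in seeds:
        universe |= nodes_within(succ, seed, max_dist)
    return universe


toy_links = {"S": ["A"], "A": ["B"], "B": ["C"], "T": ["C"]}
universe = build_universe(toy_links, ["S", "T"])
```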
Relatedness metric and near universe: Relatedness between articles a and b was measured via the normalized Google distance, computed separately on their sets of in-neighbors (d_in) and out-neighbors (d_out). The total distance d(a,b) was the harmonic mean of d_in and d_out, and relatedness was defined as r(a,b)=exp(−d(a,b)), which lies in [0,1]. For each seed s, the N_s most related nodes were selected, with N_s equal to the outdegree of s in wikiLinksClean. A relatedness matrix R on this near-universe subset was built and converted into an undirected weighted graph g with edge weights R(i,j). In this case g had 856 nodes and 143,307 edges.
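A minimal sketch of this relatedness measure is shown below. It uses the standard link-based form of the normalized Google distance, d = (log max(|A|,|B|) − log |A∩B|) / (log N − log min(|A|,|B|)), where A and B are neighbor sets and N is the number of pages. The treatment of edge cases (no shared neighbors gives an infinite distance; a zero distance on either side gives a harmonic mean of zero) is an assumption of this sketch, not something the summary specifies, and the function names are hypothetical.

```python
import math


def ngd(a_links, b_links, n_total):
    """Normalized Google distance between two neighbor sets.

    Assumes n_total is larger than either set. Returns infinity when the
    sets share no element (an assumption of this sketch).
    """
    a, b = set(a_links), set(b_links)
    common = len(a & b)
    if common == 0:
        return math.inf
    big, small = max(len(a), len(b)), min(len(a), len(b))
    return (math.log(big) - math.log(common)) / (math.log(n_total) - math.log(small))


def harmonic_mean(x, y):
    """Harmonic mean of two non-negative distances, tolerating 0 and infinity."""
    if x == 0 or y == 0:
        return 0.0
    inv = 1.0 / x + 1.0 / y  # 1/inf evaluates to 0.0
    return math.inf if inv == 0 else 2.0 / inv


def relatedness(in_a, out_a, in_b, out_b, n_total):
    """r(a, b) = exp(-d), with d the harmonic mean of d_in and d_out."""
    d_in = ngd(in_a, in_b, n_total)
    d_out = ngd(out_a, out_b, n_total)
    return math.exp(-harmonic_mean(d_in, d_out))


r_partial = relatedness({1, 2, 3, 4}, {1, 2, 3, 4}, {3, 4, 5, 6}, {3, 4, 5, 6}, 100)
```

Under these assumptions, identical neighbor sets give r = 1 and fully disjoint sets give r = 0, which matches the stated range [0,1].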
Clustering and visualization: Nodes were assigned to clusters by the seed they linked to in wikiLinksClean; nodes linked to multiple seeds were assigned to the seed with highest relatedness. The graph was plotted using a force-directed layout (Fruchterman-Reingold), revealing three clusters corresponding to art (Picasso), science (Einstein), and literature (Joyce).
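The cluster assignment rule can be sketched as follows. This is an illustration under stated assumptions: `linked` and `relatedness` are hypothetical lookup tables standing in for the link data and the matrix R, each seed is placed in its own cluster, and nodes linked to no seed are simply left unassigned.

```python
def assign_clusters(nodes, seeds, linked, relatedness):
    """Assign each node to a seed's cluster.

    linked[node]              -> set of seeds the node is linked with
    relatedness[(node, seed)] -> relatedness r in [0, 1]
    A node linked to several seeds goes to the most related one.
    """
    clusters = {}
    for node in nodes:
        if node in seeds:
            clusters[node] = node  # assumption: a seed anchors its own cluster
            continue
        candidates = [s for s in seeds if s in linked.get(node, ())]
        if not candidates:
            continue  # assumption: unlinked nodes stay unassigned
        clusters[node] = max(candidates, key=lambda s: relatedness.get((node, s), 0.0))
    return clusters


seeds = ["Pablo Picasso", "Albert Einstein", "James Joyce"]
linked = {
    "Cubism": {"Pablo Picasso"},
    "Modernism": {"Pablo Picasso", "James Joyce"},
}
# Toy relatedness values, invented purely for illustration.
toy_r = {("Modernism", "Pablo Picasso"): 0.4, ("Modernism", "James Joyce"): 0.6}
result = assign_clusters(["Cubism", "Modernism", "Albert Einstein"], seeds, linked, toy_r)
```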
Network metrics and definitions: Assortativity was computed from the weighted assortativity matrix A and its normalized form AN, yielding an assortativity coefficient r (0 for random mixing, 1 for perfect assortativity). Modularity Q for weighted networks was computed with reference to its non-trivial maximum. Community strength (weak/strong) was determined by comparing internal versus external degrees. Openness for a cluster is the percentage of nodes with external degree greater than internal. Average degree per cluster was computed from the diagonal elements of AN divided by cluster size. The Stirling diversity coefficient for nodes measured variety, balance, and disparity across links to clusters.
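To make the assortativity and modularity definitions concrete, the sketch below uses Newman's standard formulation on a normalized mixing matrix e, where e[i][j] is the fraction of total edge weight joining clusters i and j and a_i is the i-th row sum: r = (Σ e_ii − Σ a_i²) / (1 − Σ a_i²) and Q = Σ (e_ii − a_i²). This is an assumed formulation; the paper's exact normalization of A and AN may differ, so the code is not expected to reproduce the reported values. Function names are hypothetical.

```python
def mixing_matrix(edges, cluster_of, labels):
    """Normalized cluster mixing matrix for an undirected weighted graph.

    edges: iterable of (u, v, weight); cluster_of: node -> cluster label.
    Each edge contributes at both of its ends, so the matrix is symmetric
    and its entries sum to 1.
    """
    idx = {c: i for i, c in enumerate(labels)}
    k = len(labels)
    e = [[0.0] * k for _ in range(k)]
    total = 0.0
    for u, v, w in edges:
        i, j = idx[cluster_of[u]], idx[cluster_of[v]]
        e[i][j] += w
        e[j][i] += w
        total += 2 * w
    return [[x / total for x in row] for row in e]


def assortativity_and_modularity(e):
    """Return (r, Q) in Newman's formulation from a normalized mixing matrix."""
    trace = sum(e[i][i] for i in range(len(e)))
    sum_a2 = sum(sum(row) ** 2 for row in e)
    q = trace - sum_a2
    r = q / (1.0 - sum_a2)
    return r, q


cluster_of = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
separated = mixing_matrix([("a", "b", 1), ("c", "d", 1)], cluster_of, ["X", "Y"])
mixed = mixing_matrix([("a", "b", 1), ("c", "d", 1), ("b", "c", 2)], cluster_of, ["X", "Y"])
```

In the toy data, the graph with only intra-cluster edges is perfectly assortative, while adding a cross-cluster edge that carries half the total weight brings both r and Q down to zero.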
- The force-directed knowledge map shows three well-defined clusters aligned with the seeds’ domains: art (Picasso), science (Einstein), and literature (Joyce). Artistic and literary clusters are closer and more interconnected than either is with science; science is more connected to art than to literature.
- Qualitative knowledge dealers bridging art and science include Ernst Mach, Henri Poincaré, and Poincaré’s Science and Hypothesis; on the art side, Jean Metzinger and Du 'Cubisme' connect to scientific concepts. Additional bridging elements include Les Demoiselles d'Avignon, Georges Braque, Maurice Princet, electromagnetism, modernism, and Niels Bohr. More than 75 elements coincide with those identified by Miller (2002), indicating strong agreement with historical scholarship.
- Seeds themselves (Picasso, Einstein, Joyce) are not the main knowledge dealers; intermediaries play the bridging roles.
- Quantitative global metrics: assortativity coefficient r = 0.82, indicating strong but not complete homophily; modularity Q = 0.78 (maximum possible ≈0.89), reflecting a strong community structure with meaningful inter-cluster links.
- Inter-cluster link fractions from the normalized assortativity matrix: approximately 13% of links connect art and literature; about 2% connect science and art; about 1% connect science and literature.
- Cluster-level structural and relational properties (from Table 1): cluster sizes are Picasso 232 nodes, Einstein 431, and Joyce 193; all three are weak communities. Openness: Picasso 1.29%, Einstein 0.46%, Joyce 12.44%. Average degree (as defined): Picasso 1.36, Einstein 0.66, Joyce 1.25. Reported internal densities and path statistics indicate Einstein’s cluster is denser than the other two.
- Stirling diversity rankings identify top interdisciplinary connectors within each cluster, e.g., Picasso cluster: Jean Metzinger, La Femme au Cheval, Du 'Cubisme', Proto-Cubism; Einstein cluster: Salvador Dalí, Wolfgang Paalen, J. Robert Oppenheimer, Ernst Mach, Niels Bohr, Ludwig Boltzmann, Science and Hypothesis, Henri Poincaré; Joyce cluster: Brian Coffey, Jim Norton, Thomas Pynchon, Literary modernism, Finnegans Wake, Modernism, T. S. Eliot, etc.
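The openness percentages and Stirling rankings reported above rest on two node-level computations, sketched here under explicit assumptions. Degrees are taken as weighted degrees, the Stirling index uses the general form Δ = Σ_{i≠j} d_ij^α (p_i p_j)^β summed over ordered cluster pairs with α = β = 1, and the cluster disparity d_ij is passed in as a function because the summary does not say how the paper defines it. All names are hypothetical.

```python
def openness(nodes, neighbors, cluster_of):
    """Percentage of `nodes` whose external degree exceeds their internal degree.

    neighbors[n] maps each neighbor of n to the edge weight.
    Assumption: degrees are weighted (sums of edge weights).
    """
    open_count = 0
    for n in nodes:
        internal = sum(w for m, w in neighbors[n].items() if cluster_of[m] == cluster_of[n])
        external = sum(w for m, w in neighbors[n].items() if cluster_of[m] != cluster_of[n])
        if external > internal:
            open_count += 1
    return 100.0 * open_count / len(nodes)


def stirling_diversity(node, neighbors, cluster_of, disparity, alpha=1.0, beta=1.0):
    """Stirling diversity of a node's links across clusters.

    p_i is the share of the node's link weight going to cluster i;
    disparity(ci, cj) is a caller-supplied distance between clusters.
    """
    totals = {}
    for m, w in neighbors[node].items():
        c = cluster_of[m]
        totals[c] = totals.get(c, 0.0) + w
    weight = sum(totals.values())
    if weight == 0:
        return 0.0
    p = {c: t / weight for c, t in totals.items()}
    return sum(
        disparity(ci, cj) ** alpha * (p[ci] * p[cj]) ** beta
        for ci in p for cj in p if ci != cj
    )


cluster_of = {"a": "X", "b": "X", "c": "Y"}
neighbors = {"a": {"b": 1.0, "c": 3.0}, "b": {"a": 1.0}, "c": {"a": 3.0}}
uniform = lambda ci, cj: 1.0  # toy disparity: all clusters equally distant
```

A node whose links all fall in one cluster scores zero, and the score grows as the link weight is spread more evenly over more distant clusters, which is what makes the index usable for ranking candidate connectors.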
The findings address the central question by demonstrating that implicit interdisciplinary relationships can be quantitatively extracted from Wikipedia’s internal link network. The three-cluster structure reflects disciplinary separation, while significant inter-cluster links—especially between art and literature, and to a lesser extent between science and art—map plausible pathways of knowledge flow around 1900–1915. The identification of specific knowledge dealers (e.g., Mach, Poincaré, Metzinger) aligns with established historical analyses (Miller, 2002), supporting the method’s validity. Quantitative metrics (high modularity and assortativity with non-negligible off-diagonal link fractions) show that while disciplines are cohesive, meaningful bridges exist that can carry concepts across domains. The Stirling rankings further highlight nodes whose connectivity patterns span multiple clusters, indicating roles in cross-domain diffusion. Together, these results show that an unsupervised, network-based approach can surface and measure emergent interdisciplinary knowledge beyond explicit cross-references or citations.
The paper introduces an unsupervised complex-network method to mine Wikipedia’s editor-curated internal links, revealing and quantifying interdisciplinary relationships across art, science, and literature. Applied to the Einstein–Picasso–Joyce case, the approach reconstructs known cross-domain connections, identifies key intermediaries, and provides quantitative evidence of community structure and inter-cluster coupling. The strong agreement with historical scholarship suggests the method can augment and accelerate interdisciplinary research. Potential future directions include applying the framework to additional seeds and domains, expanding to other language editions or updated snapshots of Wikipedia, integrating temporal analyses, and combining with external knowledge bases to enrich and validate discovered links.
- Data source and snapshot: Analysis is limited to the English Wikipedia internal links from the March 1, 2018 snapshot; subsequent edits and other language editions are not captured.
- Redirect handling: Resolving unresolved redirects by removing nodes with outdegree=1 may also remove a small fraction of weakly connected nodes, potentially affecting marginal cases (though the authors report that the influence is negligible).
- Subgraph construction: Using distance d≤2 around seeds and selecting only the N most related nodes per seed emphasizes local neighborhoods and can diminish inter-cluster connectivity, possibly overlooking longer-range relationships.
- Seed dependence: Results depend on the chosen seeds (Picasso, Einstein, Joyce) and the time/context implied; other seeds could yield different universes and bridges.
- Reliance on editor-added internal links: Implicit knowledge inferred reflects editorial choices and coverage biases within Wikipedia, not necessarily exhaustive domain relationships.