Complex networks reveal emergent interdisciplinary knowledge in Wikipedia

Interdisciplinary Studies

G. A. Schwartz

Explore a groundbreaking method that mines Wikipedia to reveal hidden interdisciplinary knowledge, showcasing unexpected connections across art, science, and literature. Conducted by Gustavo A. Schwartz, this study employs complex network analysis to uncover deep relationships within and between disciplines, promising exciting insights into the fabric of knowledge.

Introduction
The increasing specialization of knowledge has led to fragmented disciplines, hindering the flow of information between them. While existing approaches like co-occurrence models and bibliometric systems have made progress, efficient tools for quantitative interdisciplinary analysis remain scarce, particularly given the vast amount of available knowledge. This paper proposes a method to overcome this limitation by leveraging the implicit knowledge embedded within Wikipedia's internal link network. Wikipedia, a massive collaborative creation, contains a wealth of information beyond its explicit content. The internal links, representing connections between people, ideas, and works, form a large conceptual network. This network, a directed graph with millions of connections, offers a unique opportunity to explore relationships between seemingly disparate fields. The study focuses on the relationship between Albert Einstein and Pablo Picasso in the early 20th century as a proof-of-concept, examining whether their simultaneous breakthroughs in their respective fields (relativity and Cubism) were coincidental or interconnected. The research hypothesizes that network science tools can unveil this implicit knowledge, quantifying the flow of knowledge and individual/collective behavior.
Literature Review
Existing approaches to bridging the gap between disciplines include co-occurrence or semantic models and bibliometric systems based on citation analysis. While these have advanced the field, they offer few efficient tools for quantifying connections between disciplines as diverse as science, art, and literature. The sheer volume of available knowledge further complicates this challenge. Public knowledge databases such as Wikipedia, DBpedia, and ConceptNet offer a way to overcome these limitations and uncover unexpected connections. Previous research has used Wikipedia's internal link network for tasks such as computing semantic relatedness and natural language processing; this paper builds on that work by applying network science to reveal emergent knowledge.
Methodology
The study used the WikiLinkGraphs dataset (Consonni et al., 2019), which contains Wikipedia's internal link network as of March 1, 2018. The network was imported into MATLAB as a directed, unweighted graph, and poorly linked pages were removed iteratively. The resulting 'wikiLinksClean' graph contained 7,879,531 nodes and 150,995,780 edges. To focus the analysis, a subgraph ('universe') was generated from three 'seeds' ('Pablo Picasso', 'Albert Einstein', and 'James Joyce') and all nodes within distance d ≤ 2 of them. This subgraph was cleaned by resolving redirects and removing nodes with zero in- or out-degree, leaving 78,444 nodes and 3,159,866 edges.
Relatedness between nodes was measured with the normalized Google distance (NGD; Cilibrasi and Vitanyi, 2007). NGD was computed separately for incoming and outgoing links, giving two distances (d_in and d_out), which were combined into a total distance d by their harmonic mean. Relatedness was defined as r(a,b) = exp(-d(a,b)). For each seed, the N closest nodes were selected, where N is the out-degree of the seed. From these nodes, a relatedness matrix (R) and an undirected weighted graph (g) were generated, with edge weights given by the corresponding elements of R. The graph contained 856 nodes and 143,307 edges. Each node was assigned to the cluster of the seed it was most related to, and the graph was visualized using a force-directed layout.
Network analysis included assortative mixing (Newman, 2003), using an assortativity matrix (A) and its normalized form (AN) to quantify the preference of nodes of one type to link to others of the same type. Modularity (Q) was used to measure the quality of the network partition (Newman, 2003). Further analysis covered the identification of weak and strong communities, openness, average degree, and the Stirling coefficient (Stirling, 2007), which measures diversity within clusters.
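The relatedness measure described above (NGD over incoming and outgoing links, combined by a harmonic mean, then mapped through exp(-d)) can be sketched in a few lines of Python. This is a minimal illustration, not the study's MATLAB code. The function names, the dictionary-of-sets link representation, the toy pages, and the choice to treat pages with no shared links as infinitely distant are all assumptions made for this example.

```python
import math


def ngd(links_a, links_b, n_pages):
    """Normalized Google distance between two pages, given their link sets.

    links_a, links_b: sets of page ids linked with page a / page b.
    n_pages: total number of pages in the graph.
    """
    f_a, f_b = len(links_a), len(links_b)
    f_ab = len(links_a & links_b)
    if f_ab == 0:
        # Assumption: no shared links is treated as infinite distance.
        return math.inf
    log_a, log_b = math.log(f_a), math.log(f_b)
    return (max(log_a, log_b) - math.log(f_ab)) / (
        math.log(n_pages) - min(log_a, log_b)
    )


def harmonic_mean(x, y):
    """Harmonic mean of two non-negative distances (either may be infinite)."""
    if x == 0 or y == 0:
        return 0.0
    inverse_sum = 1 / x + 1 / y  # 1 / inf evaluates to 0.0
    return math.inf if inverse_sum == 0 else 2 / inverse_sum


def relatedness(a, b, in_links, out_links, n_pages):
    """r(a, b) = exp(-d(a, b)), with d the harmonic mean of d_in and d_out."""
    d_in = ngd(in_links[a], in_links[b], n_pages)
    d_out = ngd(out_links[a], out_links[b], n_pages)
    return math.exp(-harmonic_mean(d_in, d_out))


# Hypothetical toy data: two pages "A" and "B" in a graph of 1000 pages.
out_links = {"A": {"x", "y", "z"}, "B": {"y", "z", "w"}}
in_links = {"A": {"p", "q"}, "B": {"q", "r"}}
r_ab = relatedness("A", "B", in_links, out_links, n_pages=1000)
print(f"relatedness(A, B) = {r_ab:.4f}")
```

Because r = exp(-d), a distance of zero maps to a relatedness of 1 and an infinite distance maps to 0, which keeps the edge weights of the resulting graph in the range [0, 1].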
Key Findings
The visualization (Fig. 1) revealed three distinct clusters corresponding to the seeds: art (Picasso), science (Einstein), and literature (Joyce). The clusters varied in size, density, and interconnectivity. Art and literature were closely connected, while science was more strongly connected with art than with literature. The study found a high assortativity coefficient of 0.82, indicating a strong preference for links within clusters despite the inter-cluster connections. The modularity (Q) was 0.78 (with a maximum possible value of 0.89), further supporting the distinct cluster structure. Table 1 details structural, relational, and global properties of the network, including cluster size, diameter, average path length, community strength, openness, and average degree. Table 2 shows the normalized assortativity matrix, which gives the distribution of links between clusters. Table 3 lists the top ten elements in each cluster according to the Stirling coefficient, indicating diversity within clusters. The findings reveal knowledge dealers (elements that bridge clusters), including figures and concepts mentioned in Arthur I. Miller’s book ‘Einstein, Picasso’, in qualitative agreement with historical research. Picasso, Einstein, and Joyce themselves did not act as primary knowledge dealers; the network identified other influential individuals and works as the key connectors between disciplines.
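For readers unfamiliar with the assortativity coefficient and modularity reported above, the following sketch computes both from a cluster-to-cluster mixing matrix using Newman's (2003) standard unweighted definitions. The counts are invented for illustration and are not the values in Table 2. The function name and the reading of "maximum possible value" as 1 minus the expected within-cluster fraction are assumptions. In this unweighted form the coefficient equals Q divided by its maximum; the study may have used a weighted variant, so the sketch is not expected to reproduce the reported numbers.

```python
def mixing_stats(counts):
    """Assortativity coefficient, modularity, and maximum modularity.

    counts[i][j] is the number of edge ends joining cluster i to cluster j
    (symmetric for an undirected graph).
    """
    k = len(counts)
    total = sum(sum(row) for row in counts)
    # e[i][j]: fraction of edge ends that join cluster i to cluster j.
    e = [[c / total for c in row] for row in counts]
    a = [sum(e[i]) for i in range(k)]                        # row sums
    b = [sum(e[i][j] for i in range(k)) for j in range(k)]   # column sums
    within = sum(e[i][i] for i in range(k))                  # observed within-cluster fraction
    expected = sum(x * y for x, y in zip(a, b))              # expected under random mixing
    q = within - expected
    q_max = 1.0 - expected
    return q / q_max, q, q_max


# Hypothetical edge-end counts for three clusters (art, science, literature).
counts = [
    [80, 10, 5],
    [10, 60, 5],
    [5, 5, 20],
]
r, q, q_max = mixing_stats(counts)
print(f"assortativity = {r:.3f}, Q = {q:.3f}, Q_max = {q_max:.3f}")
```

A value of r close to 1 means almost all links stay inside their own cluster, while r near 0 means links are distributed as they would be under random mixing.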
Discussion
The results demonstrate the effectiveness of the proposed method in uncovering implicit interdisciplinary knowledge. The strong qualitative agreement between the generated knowledge map and established historical analysis validates the approach. The quantitative analysis using network metrics further enhances our understanding of the relationships between disciplines. The high assortativity and modularity indicate well-defined clusters, while the inter-cluster links reveal pathways of knowledge exchange. The identification of knowledge dealers highlights the role of specific individuals and works in fostering connections between seemingly unrelated fields. This research opens avenues for quantitative interdisciplinary studies and reveals limitations of relying solely on explicit citations in mapping knowledge flows.
Conclusion
This study successfully applied network science to Wikipedia's internal link structure to reveal emergent interdisciplinary knowledge. The results demonstrate the potential of this approach to quantify and visualize relationships between disciplines, going beyond explicit connections. Future research could explore larger datasets, different languages, and temporal dynamics to further enhance the understanding of knowledge evolution and cross-disciplinary connections. The method could be applied to other knowledge bases and domains to uncover hidden interconnections and advance interdisciplinary research.
Limitations
The study's findings are limited by the specific snapshot of Wikipedia used (March 1, 2018) and the selection of seeds. The method's reliance on Wikipedia's internal link structure might not fully capture all relevant connections and relationships, especially those not explicitly represented by links. Further research should address the bias present in Wikipedia itself.