Methods and visualization tools for the analysis of medical, political and scientific concepts in Genealogies of Knowledge

S. Luz and S. Sheehan

Discover an innovative approach to establishing requirements and developing visualization tools for scholarly work, combining in-depth observation, software prototyping, and user engagement. This research by Saturnino Luz and Shane Sheehan explores the co-design methodology and presents case studies from the Genealogies of Knowledge project that illuminate essential concepts in medical, scientific, and political contexts.

Introduction
The paper addresses how to design and develop visualization tools that support corpus-based scholarly investigations in the Genealogies of Knowledge (GoK) project, which studies how translation and mediation shape the historical evolution of scientific and political concepts. Building on the Firthian tradition in corpus linguistics and the widespread use of keyword-in-context (KWIC) concordances, the authors argue that visualization can enable a top-down, iterative process moving between overview and detail, revealing patterns otherwise missed. The research question centers on documenting and theorizing an iterative co-design methodology for tool creation in interdisciplinary contexts and demonstrating how such tools (integrated with KWIC workflows) facilitate analyses of positional frequencies, collocations, and metadata-driven patterns relevant to medical, scientific, and political concepts.
Literature Review
The paper situates its contribution within digital humanities, corpus linguistics, and translation studies. It notes the trajectory from early concordance tools (e.g., Index Thomisticus) to distance reading (Moretti) and emphasizes the strong tradition of computational support in lexicography and corpus linguistics, including KWIC concordancing. Corpus linguistics blends quantitative frequency/statistics with qualitative interpretation (Biber et al.), countering misconceptions that corpus methods are purely quantitative. Translation studies has broadly adopted corpus-based methods (Baker; Zanettin). The review highlights popular concordance tools (WordSmith Tools, SketchEngine, AntConc) and their typical features. A substantial comparative analysis of related visualizations maps the KWIC conceptual attributes to visual variables (per Mackinlay; Cleveland & McGill), critiquing designs such as Word Tree, Double Tree, interHist, Corpus Clouds, Structured Parallel Coordinates, TagSpheres, Fingerprint Matrices, TextArc, Phrase Nets, TagPies, and visualizations in Sketch Engine and Voyant. The authors argue that many existing systems either privilege readability of concordance lines at the expense of quantitative positional statistics, or vice versa, and often use suboptimal encodings (e.g., area via font size) for quantitative data. This motivates designs that prioritize positional frequency and collocation strength while preserving pathways for close reading.
Methodology
The authors adopt an iterative co-design process comprising seven steps:
(1) Analysis of published methodology: Using Sinclair’s (2003) Reading Concordances, they conducted a hierarchical task analysis of 18 tasks, tagging actions/sub-actions (e.g., estimate frequency, read context, frequent patterns, frequency, word position, POS, filter, sense, group, significant collocate, usage, phrase). Tag distributions show that estimating frequency (16 tasks; 34 actions) and reading context (16; 31) are pervasive, as are frequent pattern identification and positional statistics. A task/action hierarchy highlights high-level goals (investigate meaning, identify phrases, investigate usage) and low-level mechanics (sort word position, count word frequencies, filter concordance, read contexts).
(2) Conceptual KWIC data model: They formalize the entities and relationships needed to support these tasks: Concordance Lines (CLs) with ordered Word Objects (WO), and Position Objects (PO) that house Word Token Objects (WTO) aggregating all instances of a string (and metadata such as POS) at a given position. WTOs include quantitative attributes (positional counts/frequencies/statistics) and mappings back to CLs, enabling both positional analysis and linkage to readable context.
(3) Analysis of existing visualizations: They map conceptual attributes (PO position, WTO frequency, ordering) to visual variables, arguing for encodings using higher-ranked quantitative variables (position, length) and cautioning against reliance on area/font size.
(4) Requirements elicitation: Two GoK researchers (Henry Jones, Jan Buts) each provided 20 prioritized questions across categories: keywords, collocational patterns, temporal spread (HJ), and keyword, text, author, corpus (JB). These emphasize needs for positional collocate frequencies, collocation strength estimation, normalized frequencies, named entity recognition, dispersion plots, subcorpus comparisons (by time, author, source language), and metadata-driven exploration. Challenges with existing concordancers (manual counting, reliance on spreadsheets, difficulty comparing frequency lists across different-sized subcorpora) shaped requirements.
(5) Software prototyping: Low-fidelity sketches and mock-ups progressed to high-fidelity prototypes.
(6) Implementation: The tools were built as plugins within the modnlp concordancer framework (which supports indexing, metadata, KWIC sorting/filtering): Mosaic (concordance mosaic), Concordance Tree, Metafacet (faceted metadata summaries and filtering), and a Frequency Comparison tool.
(7) Observational research and case studies: Think-aloud observation and interviews captured real workflows. A democracy case study (Jan Buts) illustrated iterative sampling, pattern description, and hypothesis formation using Mosaic (e.g., inspecting left/right +1 positions, suffix search “-acy,” assessing semantic prosody). A “people” case study (Henry Jones) across eight Thucydides translations showed frequent reliance on frequency and collocates, with spreadsheets for per-translation counts. A statesmanship case study found, via metadata-driven frequency comparisons, that about 90% of the occurrences of “statesman” are in translations from Classical Greek. These observations motivated the Metafacet tool and further enhancements (documentation, additional collocation statistics).
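The conceptual KWIC data model in step (2) can be illustrated with a minimal Python sketch. This is not the modnlp implementation: the class name `WordToken`, the helper functions, and the toy sentence are all assumptions made for illustration. The sketch only shows the idea that each string at each position carries a count and keeps links back to the concordance lines it came from.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class WordToken:
    """Hypothetical stand-in for a Word Token Object: one string at one position."""
    text: str
    position: int  # offset from the keyword, e.g. -1 (left) or +1 (right)
    line_ids: list = field(default_factory=list)  # links back to concordance lines

    @property
    def count(self):
        # Positional frequency is the number of lines containing this token here.
        return len(self.line_ids)


def concordance(tokens, keyword, window=2):
    """Return KWIC lines as lists of (offset, word) pairs around each keyword hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            lines.append([(j - i, tokens[j]) for j in range(start, stop)])
    return lines


def positional_tokens(lines):
    """Group words by offset, giving {offset: {word: WordToken}}."""
    positions = defaultdict(dict)
    for line_id, line in enumerate(lines):
        for offset, word in line:
            if offset == 0:
                continue  # the keyword itself is not a collocate
            token = positions[offset].setdefault(word, WordToken(word, offset))
            token.line_ids.append(line_id)
    return positions


# Toy text invented for this example; it is not drawn from the GoK corpus.
text = ("the refugee crisis deepened as the refugee crisis spread "
        "and a refugee camp opened").split()
lines = concordance(text, "refugee")
positions = positional_tokens(lines)
for word, token in sorted(positions[1].items()):
    print(word, token.count, token.line_ids)
```

Because every token keeps its `line_ids`, a count shown in an overview can always be traced back to the readable lines behind it, which is the linkage between positional analysis and close reading that the model is meant to support.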
Key Findings
- Task analysis confirms core needs: reading context and positional frequency/statistics underpin concordance analysis (e.g., estimate frequency in 16/18 tasks with 34 action appearances; read context in 16/18 with 31 actions; frequent patterns in 15/18 with 21 actions). Tools must support both close reading and quantitative positional analysis.
- Conceptual KWIC model (CL, WO, PO, WTO) effectively structures both readability and positional frequency operations, enabling mapping between quantitative aggregates and original contexts.
- Mosaic provides a compact, columnar, position-aware summary of concordances. In the refugee example from the GoK Internet corpus (25,120,689 tokens; 396 lines displayed), local collocation strength visualization highlights anti at K–1 and crisis at K+1 as salient collocates. Hover tooltips provide counts/frequencies, removing the need for manual tallies. Multiple collocation statistics (e.g., mutual information variants, z-score) are supported.
- Concordance Tree preserves sentence structure via a prefix tree for left or right contexts, aiding path-wise reading when needed (with pruning of low-frequency branches to manage space).
- Metafacet enables interactive, faceted exploration of keyword distributions across metadata (e.g., Internet outlet), supports filtering/sorting, and integrates with KWIC and Mosaic for coordinated exploration.
- Frequency Comparison tool visually compares ranked frequency lists across named subcorpora with log-scaled axes and line mappings, allowing comparison across different corpus sizes. Example: “capitalist” ranks ~100 (0.096%) in radical-left outlets (ROAR Magazine, Salvage Zone) versus ~4000 (0.003%) in less radical outlets (The Nation, Open Democracy), revealing ideological lexical differences.
- Case studies demonstrate practical utility: (1) Democracy and “-acy” exploration suggests democracy is dominant and positively framed, while others (aristocracy, bureaucracy, meritocracy) are often negatively framed; (2) “Statesman” appears almost exclusively (~90%) in translations from Classical Greek versus related terms (governor, leader, ruler, citizen) that are broadly distributed, a pattern found via metadata frequency analysis; (3) Analyses of “the people” across translations leveraged Mosaic to expedite collocate assessment and identified needs for automated, metadata-aware frequency extraction.
- User studies highlighted additional requirements: integration of metadata analytics, normalized frequencies, well-known collocation measures, and thorough domain-oriented documentation; visualization assists both discovery and communication of findings.
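The idea behind comparing frequency lists across subcorpora of different sizes can be sketched as follows. This is a minimal illustration, not the Frequency Comparison tool itself: the function names and the two tiny subcorpora are invented for the example, and the sketch omits the log-scaled plotting. It only shows why rank and relative frequency (a word's share of all tokens) are comparable across corpora when raw counts are not.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to (rank, share of all tokens) within one subcorpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: (rank, n / total)
            for rank, (word, n) in enumerate(ranked, start=1)}


def compare(word, subcorpora):
    """Return {subcorpus name: (rank, relative frequency)}; None where absent."""
    return {name: relative_frequencies(tokens).get(word)
            for name, tokens in subcorpora.items()}


# Toy subcorpora invented for this example; they are not GoK data.
subcorpora = {
    "A": "capitalist state and capitalist class".split(),
    "B": ("the state and the market and the capitalist press "
          "and the public").split(),
}
result = compare("capitalist", subcorpora)
for name, (rank, share) in result.items():
    print(f"{name}: rank {rank}, {share:.1%} of tokens")
```

In this toy data the word is the top-ranked item in the smaller subcorpus and only third in the larger one, a difference that normalizing by subcorpus size makes directly comparable.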
Discussion
The co-design approach shows that visualization can sensitize researchers to global patterns that are difficult to see through sequential reading alone while remaining tightly integrated with qualitative interpretation. The tools address the identified duality of tasks: they support quantitative positional statistics (e.g., positional frequencies, collocation strength) and preserve pathways for close reading (via KWIC linkage and Concordance Tree). Differences between Sinclair’s foundational tasks and observed GoK practices (notably subcorpus comparisons and metadata-driven analyses) underscore the need for tools enabling frequency comparisons and faceted exploration. Visualization also aids communication: simple, position-aware summaries (e.g., Mosaic) help present usage patterns more clearly than tables or raw concordances. The authors emphasize awareness of bias: corpora and tools should facilitate identifying corpus biases and guard against researcher preconceptions. Ultimately, the tools guide exploration, but interpretive, qualitative analysis remains central.
Conclusion
The paper contributes an interdisciplinary, iterative co-design methodology for developing visualization tools tailored to corpus-based scholarship and presents a suite of tools—Mosaic, Concordance Tree, Metafacet, and a Frequency Comparison tool—integrated into a concordancer platform. Case studies illustrate how these tools accelerate discovery of positional collocations, facilitate metadata-driven analysis, and support hypothesis formation and communication. The authors argue for sustained, visible collaboration between developers and humanities scholars to advance digital humanities methods. Future directions include expanding collocation and confidence measures, deeper integration of metadata analytics and normalization, further evaluation of visualization for communicating scholarly arguments, and comprehensive, domain-oriented documentation.
Limitations
- Mosaic sacrifices explicit sentence-path structure for compact overviews; users must click-through or use the Concordance Tree to confirm co-occurrence paths.
- Several surveyed visualization techniques (including font-size/area encodings) have perceptual drawbacks for quantitative comparison; while addressed in design, some trade-offs remain.
- Early prototypes initially offered limited collocation statistics and sparse documentation, affecting interpretability for publication; ongoing updates mitigate this.
- Metadata analyses previously required external tools; Metafacet addresses this but relies on available, accurate metadata and may still entail manual curation.
- Generalizability: GoK corpora are not designed to be representative samples of whole languages or the Internet; findings and workflows may need adaptation in other contexts. Analytical conclusions remain interpretive and depend on expert reading of concordance lines.
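The first limitation, that positional summaries lose path information which a prefix tree retains, can be illustrated with a small sketch. The dictionary-based tree, the function names, and the toy right contexts below are assumptions made for illustration and do not reproduce the Concordance Tree plugin.

```python
def build_tree(contexts):
    """Build a prefix tree of contexts; each node stores a count and its children."""
    root = {"count": 0, "children": {}}
    for words in contexts:
        root["count"] += 1
        node = root
        for word in words:
            node = node["children"].setdefault(word, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def prune(node, min_count):
    """Drop branches seen fewer than min_count times (modifies the tree in place)."""
    node["children"] = {word: child for word, child in node["children"].items()
                        if child["count"] >= min_count}
    for child in node["children"].values():
        prune(child, min_count)
    return node


def path_count(node, words):
    """Return how many contexts start with this exact word sequence (0 if none)."""
    for word in words:
        node = node["children"].get(word)
        if node is None:
            return 0
    return node["count"]


# Toy right contexts of a keyword, invented for this example.
right_contexts = [
    ["crisis", "in", "europe"],
    ["crisis", "in", "syria"],
    ["camp", "in", "europe"],
]
tree = build_tree(right_contexts)
print(path_count(tree, ["crisis", "in", "europe"]))
```

A purely positional view of these three lines would report "crisis" twice at K+1 and "europe" twice at K+3, which might suggest that "crisis in europe" occurs twice. The tree shows that this exact path occurs once, which is the kind of check the Concordance Tree is described as supporting.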