Evol project: a comprehensive online platform for quantitative analysis of ancient literature

Humanities

J. Wang, S. Duan, et al.

Evol is an online platform created by Jun Wang, Siyu Duan, Binghao Fu, Liangcai Gao, and Qi Su for the quantitative analysis of ancient literature. It quantifies literary documents and, through case studies, reveals patterns of cultural evolution in ancient Chinese history.
Introduction

The study introduces Evol, an interactive online platform designed to enable quantitative analysis of ancient literature without requiring advanced technical skills. Motivated by the growing role of computational methods in humanities and social sciences and the limitations of existing tools (e.g., Google N-gram Viewer’s focus on modern phrase-level analysis), the authors aim to provide a comprehensive system that supports multiple semantic units (word, phrase, sentence, document) and a long historical timeline. The platform integrates a large-scale corpus of ancient Chinese texts and delivers analysis modules for text reuse, word co-occurrence, diachronic n-grams, frequency counting, browsing, and retrieval. The paper outlines the technical framework and demonstrates Evol’s utility through three case studies on government attitudes toward nomadic groups, the formulation and propagation of a classical allusion about the Battle of Muye, and the influence patterns of the Book of Changes across domains.

Literature Review

The paper situates Evol within prior quantitative cultural and linguistic research that applies statistical and deep learning methods to text analysis. It notes the prominence of Google N-gram Viewer for diachronic frequency analysis and its application to culturomics, psychology, and conceptual history, while highlighting its limitations for ancient corpora and for semantic units beyond the phrase. Existing digital platforms (Daizhige, Erudition, Jihe) mainly provide search and reading; others offer value-added functions such as named entity annotation (Shidianguji), character relationship discovery (CSAB), and text reuse services (Ctext; Tesserae). Prior work has demonstrated the feasibility of text reuse detection across multiple languages and traditions. Evol is positioned as filling a gap by providing an integrated, multi-level, large-scale quantitative analysis platform tailored to ancient literature, with interactive visual analytics.

Methodology

The Evol platform comprises corpus building (data collection, labelling, and pre-processing) and function design.

  • Corpus building: Data collection assembled a large ancient Chinese corpus spanning over 2000 years, including 133 categories of classics, Twenty-Four Histories, Zizhi Tongjian and continuations, additional histories, and large anthologies (e.g., Quan shang gu san dai Qin Han San guo Liu chao wen). Data labelling produced three datasets: (1) document data with a hierarchical structure (book → chapter/article → paragraph → sentence → clause) stored as JSON with unique IDs; (2) index data in XLSX for time (dynasty-level identifiers), people (unique IDs for historical figures/authors/editors), and a custom hierarchical catalog based on book topics; (3) metadata (English title, author, editor, publication time, recording time, catalog) stored uniformly in XLSX and linked to index tables.
  • Data pre-processing: (1) Variant mapping and simplification using OpenCC to normalize variants and provide both simplified/traditional query support; (2) Word segmentation with Jiayan and pre-count of word frequencies at the chapter level for co-occurrence and word-count modules; (3) N-gram slicing (1–4 grams) with pre-counts at the chapter level for n-gram frequency; (4) Text reuse detection precomputed using deep learning–based sentence embeddings and contrastive learning (leveraging Transformer models like BERT/RoBERTa) to identify over 14 million reuse pairs, stored as index IDs aligned with document data.
  • Function design: Multi-perspective analytical modules include: (1) Hierarchical text reuse analysis with book-level intertextual networks (nodes=books, edges=reused sentences), chapter-level treemaps of reuse distribution, and sentence-level ranking with diachronic frequency plots; (2) Word co-occurrence analysis where users query a term, contexts are retrieved at paragraph/chapter/book levels, frequencies computed, and visualized as adjustable word clouds with XLSX export; (3) Diachronic n-gram analysis using dynasty-level timelines, with both publication time and recording time stamps, customizable scope (by titles, categories, timespan), and frequency defined as occurrences per total characters per dynasty, supporting variant aggregation; (4) Frequency count for words and n-grams with options for dictionary filtering, stopword editing, and precomputed counts at chapter level; (5) Enhanced browsing that highlights reused sentences in red and links to similar sentences platform-wide to explore intertextuality in context; (6) Enhanced retrieval supporting customizable scope, target fields (title, chapter title, author, full text), fuzzy search via edit distance, secondary search within results, and temporal/category visualizations with sort options. The interface supports Chinese and English.
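The n-gram slicing step and the diachronic frequency definition above (occurrences per total characters per dynasty, with variant aggregation) can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names, the `dynasty` and `text` keys, and the toy chapters are all assumptions made for the example.

```python
from collections import Counter


def slice_ngrams(text, max_n=4):
    """Count every character n-gram of length 1..max_n in one chapter."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def precount_chapters(chapters):
    """Pre-count n-grams per chapter.

    `chapters` is a list of dicts with the hypothetical keys
    'dynasty' and 'text'; the real corpus schema may differ.
    """
    return [
        {
            "dynasty": ch["dynasty"],
            "chars": len(ch["text"]),
            "ngrams": slice_ngrams(ch["text"]),
        }
        for ch in chapters
    ]


def dynasty_frequency(precounted, variants):
    """Relative frequency per dynasty.

    Occurrences of all listed variants are summed (variant aggregation)
    and divided by the total number of characters in that dynasty.
    """
    hits = Counter()
    totals = Counter()
    for ch in precounted:
        totals[ch["dynasty"]] += ch["chars"]
        # A Counter returns 0 for n-grams that never occurred.
        hits[ch["dynasty"]] += sum(ch["ngrams"][v] for v in variants)
    return {d: hits[d] / totals[d] for d in totals if totals[d] > 0}


# Toy chapters, invented for illustration only.
toy_chapters = [
    {"dynasty": "Warring States", "text": "武王伐纣武王克殷"},
    {"dynasty": "Han", "text": "武王伐纣"},
]
toy_index = precount_chapters(toy_chapters)
single = dynasty_frequency(toy_index, ["武王伐纣"])
merged = dynasty_frequency(toy_index, ["武王伐纣", "武王克殷"])
```

Because the counts are stored per chapter, a query restricted to particular titles, categories, or time spans only needs to sum the pre-counted chapters in scope. The sketch handles queries of at most four characters, matching the 1–4 gram slicing described above.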
Key Findings
  • Platform/corpus metrics: The system pre-detected over 14 million text reuse pairs across a large corpus spanning more than 2000 years of ancient Chinese literature. Preprocessing supports rapid online interaction for multiple modules.
  • Word-level case (attitudes toward nomadic ethnic groups): Using co-occurrence analysis in historical texts, two illustrative word clouds show negative, war-related associations with the Xiongnu during the Han, versus more administrative and political co-occurrences for the Mongols during the Yuan. For seven nomadic groups across ~1500 years, the authors compiled the top 300 co-occurring words per case (excluding group names) and scored them with a classical Chinese sentiment classifier (five-level scale). Frequency-weighted averages of the extremely negative sentiment probability show: (1) records produced under a regime established by the ethnic group itself are less negative toward that group; (2) extremely negative sentiment declines overall over time, with the exception of heightened hostility toward the Xianbei in the 2nd–4th centuries; (3) in the Tang, the Turkic group elicited the most negative sentiment, followed by the Uighur and Tibet; Uighur1 (回纥, pre-788 AD) shows stronger negativity than Uighur2 (回鹘, post-788 AD); (4) hostility toward the Khitan decreases from the Five Dynasties to the Song, and the Song and Jin show similar hostility toward the Khitan during the Liao.
  • Phrase-level case (formulation/propagation of the Battle of Muye allusion): Using enhanced browsing of text reuse, the team collected 281 texts with 48 main variants of the allusion (e.g., 武王伐纣, 武王克殷, 武王克商) and analyzed subject/object/predicate variants. Diachronic n-gram analysis from Spring and Autumn through Northern and Southern Dynasties shows ‘武王伐纣 (King Wu attacked Zhou)’ was absent in Spring and Autumn texts, emerged in the Warring States, and then became dominant for about a millennium. Predicate variants reveal ‘伐 (fa)’ was initially much more frequent than ‘克 (ke)’ and ‘诛 (zhu)’ in early periods, later declining toward parity, yet the fixed allusion with ‘伐’ persisted as the mainstream phrasing.
  • Document-level case (influence of the Book of Changes across domains): Chapter-level treemaps of text reuse show that Xi Ci (系辞) chapters have the most reused sentences across Yi-ology literature, Twenty-Four Histories, and pre-Qin/Han literature, indicating broad prominence. Within-discipline (Yi-ology) reuse is more evenly distributed across chapters, whereas outside disciplines (histories; pre-Qin/Han literature) influence concentrates on popular chapters, revealing domain-specific patterns of textual impact.
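The frequency-weighted average used in the word-level case can be sketched as below. This is an illustrative reading of the description, not the authors' code: it assumes the sentiment classifier has already produced an "extremely negative" probability for each co-occurring word, and the function name, tuple layout, and toy values are hypothetical.

```python
def weighted_negativity(cooccurrences, top_k=300, exclude=frozenset()):
    """Frequency-weighted mean of the 'extremely negative' probability.

    `cooccurrences` is a list of (word, frequency, p_extremely_negative)
    tuples; the probabilities are assumed to come from an external
    five-level sentiment classifier that is not modelled here.
    """
    # Drop excluded words (for example, the group names themselves).
    kept = [c for c in cooccurrences if c[0] not in exclude]
    # Keep only the top_k most frequent co-occurring words.
    kept.sort(key=lambda c: c[1], reverse=True)
    kept = kept[:top_k]
    total = sum(freq for _, freq, _ in kept)
    if total == 0:
        raise ValueError("no co-occurring words left to average")
    return sum(freq * p for _, freq, p in kept) / total


# Toy values, invented for illustration only.
toy = [("战", 30, 0.6), ("和", 10, 0.2), ("匈奴", 50, 0.9)]
score = weighted_negativity(toy, exclude={"匈奴"})
```

Weighting by frequency means a word that co-occurs often with a group name contributes more to the score than a rare one, so a single strongly negative but infrequent word does not dominate the average.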
Discussion

The findings demonstrate that Evol’s integrated, preprocessed corpus and multi-level analytics can surface meaningful cultural evolution patterns across words, phrases, and documents in ancient literature. Co-occurrence coupled with sentiment scoring quantifies shifts in governmental attitudes toward nomadic groups; enhanced text reuse browsing and hierarchical analysis reveal how canonical allusions form and stabilize; intertextual mappings expose domain-specific influence patterns of foundational texts like the Book of Changes. These results validate the platform’s capacity to support quantitative inquiries that complement traditional humanities methods. The authors emphasize practical challenges in delivering responsive online services over large corpora and highlight that quantitative outputs are best interpreted within established historical scholarship, serving as evidence and exploratory tools rather than standalone proofs for broad societal-level conclusions.

Conclusion

Evol provides a comprehensive, beginner-friendly online environment for quantitative analysis of ancient literature, integrating large-scale preprocessed corpora with modules for text reuse, co-occurrence, diachronic n-grams, frequency counting, enhanced browsing, and retrieval. The showcased case studies illustrate the platform’s utility for culturomics, history, and philology. Future work includes expanding corpus coverage, incorporating multilingual data to broaden applicability, enhancing validation through expert feedback mechanisms, and further improving performance and user experience for scalable, interactive analyses.

Limitations
  • Corpus coverage is incomplete, potentially biasing or limiting some analyses.
  • Diachronic n-gram analysis in ancient Chinese is sensitive to polysemy and character-level ambiguity, with sparse timelines causing non-smooth trends.
  • Quantitative outputs may be insufficient for macroscopic cultural questions (community/society-level) without traditional humanities interpretation; best suited for exploration and supplementary evidence.
  • Current text reuse pairs are model-generated; expert validation and user feedback mechanisms are intended for future versions to improve precision/recall.
  • Online service constraints necessitate trade-offs among computation, latency, data transmission, and UX.