Introduction
Existing research on word usage in books focuses primarily on aggregate trends and on the overall shape of narratives, irrespective of book length. This study instead examines how word usage changes over the course of a single book as a function of cumulative word-time. Indexing by cumulative word-time allows sections of comparable length to be compared across books of varying sizes, whereas normalizing by text length captures only overall narrative progression. The authors use ousiometrics, a framework that reduces meaning to two dimensions, 'power' and 'danger,' derived from the valence-arousal-dominance (VAD) model. The choice of ousiometrics is motivated by its ability to capture changes in essential meaning over time, offering a richer analysis than sentiment-based approaches alone. Previous studies using sentiment analysis or word embeddings have revealed narrative arcs or paths in semantic space. However, these methods often overlook detailed fluctuations within the text, which are the primary focus of this work. The authors hypothesize that these fluctuations, which are particularly apparent in longer books, reflect a fundamental structural unit within narratives and other textual forms, potentially linked to human information processing capacities. The study employs empirical mode decomposition (EMD) to analyze the time series of power and danger scores, separating each signal into oscillating components and a trend so that the underlying fluctuations can be identified and characterized.
Literature Review
Prior research on word usage in texts has largely concentrated on aggregate statistics, such as word frequencies in large corpora and co-occurrence patterns. Recent studies have explored changes in word usage within individual texts, focusing primarily on narrative structure. Gao et al. (2016) demonstrated long-range correlations in sentiment across novels. Reagan et al. (2016) identified six basic emotional arcs in fiction based on happiness scores, inspired by Vonnegut's work. Boyd et al. (2020) showed that different stages of Freytag's dramatic arc correlate with distinct word usage patterns. Other research, such as Schmidt (2015) and Toubia et al. (2021), has viewed narratives as paths through multidimensional topic spaces or word embedding spaces, analyzing properties like speed and volume to predict success. Although these studies analyze how word usage changes over the length of a text, they focus on the overall arc and often normalize by book length. This normalization prevents sections with equal word counts from being compared across different books. The paper addresses this limitation by indexing text with cumulative word-time, filling a gap in research on how word usage patterns change within a text.
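The contrast between the two indexings can be illustrated with a small sketch. The function names and the two book lengths below are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def normalized_position(word_index, book_length):
    """Fraction of the book read so far, in [0, 1)."""
    return word_index / book_length


def words_per_percent(book_length):
    """Number of words spanned by one percent of the book."""
    return book_length / 100


# Hypothetical book lengths in words (illustrative only).
short_book = 20_000
long_book = 200_000

# Under length normalization, the same 1% slice covers very different
# amounts of text in the two books.
print(words_per_percent(short_book))  # 200.0
print(words_per_percent(long_book))   # 2000.0

# Under cumulative word-time, word 10,000 marks the same amount of text
# in both books, although its normalized position differs.
print(normalized_position(10_000, short_book))  # 0.5
print(normalized_position(10_000, long_book))   # 0.05
```

A 1% slice of the longer book here spans ten times as many words as a 1% slice of the shorter one, which is why normalized arcs cannot be used to compare equal-length sections.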
Methodology
The study drew on more than 45,000 books downloaded from Project Gutenberg, which were preprocessed to remove headers, standardize contractions, and tokenize the text. A power-danger lexicon, expanded from the NRC-VAD lexicon to include conjugated forms and noun plurals (resulting in 32,721 words), was used to assign power and danger scores to the words in each book. Only books whose unique words were at least 60% covered by the lexicon were included. The dataset was further refined to remove duplicates and to keep only books for which EMD could be computed successfully on the time series. To create the power and danger time series, each text was segmented into non-overlapping windows of 50 words. For each window, an average power score and an average danger score were calculated, with each word weighted by its frequency of occurrence. The resulting time series served as the input for the analysis. The authors used ensemble empirical mode decomposition (EEMD), a robust variant of EMD, to decompose each time series into intrinsic mode functions (IMFs), which represent oscillations at various scales, and a non-oscillatory trend. To distinguish genuine fluctuations from noise, the authors compared the variances of the IMFs from the original text to those obtained from 100 shuffled versions of the text. The lowest IMF order exhibiting significantly higher variance than the shuffled versions was identified as the cutoff, indicating the start of meaningful fluctuations. The period of this cutoff IMF was calculated using the Hilbert-Huang transform (HHT). The analysis was performed for both danger and power scores, and books were categorized by Library of Congress Classification (LCC) codes so that results could be compared across genres and topics.
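The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the toy lexicon and its scores, and the choices to skip words missing from the lexicon and to drop a trailing partial window are all assumptions.

```python
from collections import Counter

WINDOW_SIZE = 50  # words per non-overlapping window, as stated in the text


def window_scores(tokens, lexicon, window_size=WINDOW_SIZE):
    """Return one frequency-weighted average score per full window.

    Tokens missing from the lexicon are ignored (assumption). A window
    with no scored tokens yields None. A trailing partial window is
    dropped (assumption).
    """
    series = []
    for start in range(0, len(tokens) - window_size + 1, window_size):
        counts = Counter(tokens[start:start + window_size])
        scored = {w: n for w, n in counts.items() if w in lexicon}
        total_weight = sum(scored.values())
        if total_weight == 0:
            series.append(None)
            continue
        weighted_sum = sum(lexicon[w] * n for w, n in scored.items())
        series.append(weighted_sum / total_weight)
    return series


# Toy lexicon with made-up danger scores (not values from NRC-VAD).
toy_danger = {"storm": 0.5, "calm": -0.5, "garden": -0.25}

# Two windows of 4 words each, to keep the example readable.
tokens = ["storm", "storm", "calm", "the", "garden", "calm", "the", "the"]
print(window_scores(tokens, toy_danger, window_size=4))
```

In the first toy window, "storm" appears twice and therefore counts twice in the average, which is the frequency weighting the text describes.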
Key Findings
The analysis of over 30,000 books revealed a clear relationship between book length and the presence of significant fluctuations in ousiometric time series. Shorter books primarily exhibited only a general trend, while longer books showed relevant fluctuations in addition to the general trend. The period of these fluctuations was consistently around a few thousand words, regardless of overall book length or LCC classification. This suggests a fundamental structural unit of several thousand words, possibly representing a chapter or other meaningful segment, which serves as a building block for longer texts. While the number of IMFs increased with book length, the period and variance of the cutoff IMF did not depend strongly on book length. The characteristics of the fluctuations differed little across the major LCC categories. However, finer filters based on specific keywords in the book title did reveal differences. For instance, books with "poem" in the title exhibited shorter periods and higher variances than other books, while books with "manual" in the title showed lower variances, reflecting the structural and stylistic differences between these types of texts. The findings for power scores mirrored those for danger scores.
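Because each point in the time series covers 50 words, a period measured in windows converts to words by multiplying by 50. The study computes the period with the Hilbert-Huang transform; the sketch below swaps in a simpler zero-crossing estimate purely to illustrate the unit conversion. The function name and the synthetic signal are assumptions.

```python
import math

WINDOW_SIZE = 50  # words per window, as in the methodology


def mean_period_in_windows(series):
    """Estimate the mean period from upward zero crossings.

    The series is mean-removed first. Returns None when fewer than two
    upward crossings are found. This is a stand-in for the HHT-based
    estimate, not an implementation of it.
    """
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    crossings = []
    for i in range(1, len(centered)):
        prev, curr = centered[i - 1], centered[i]
        if prev < 0 <= curr:
            # Linearly interpolate the crossing time between i-1 and i.
            crossings.append((i - 1) + (-prev) / (curr - prev))
    if len(crossings) < 2:
        return None
    return (crossings[-1] - crossings[0]) / (len(crossings) - 1)


# Synthetic oscillation with a period of 40 windows (not real data).
component = [math.sin(2 * math.pi * (i + 0.5) / 40) for i in range(400)]
period_windows = mean_period_in_windows(component)
print(round(period_windows * WINDOW_SIZE))  # period in words
```

For this synthetic component, a period of 40 windows corresponds to 2,000 words, which is the order of magnitude reported for the cutoff IMF.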
Discussion
The findings strongly support the hypothesis that longer books are not simply scaled-up versions of shorter books but are instead composed of recurring structural units of roughly a few thousand words, consistent with chapters or other segmentations found in longer works. The consistency of the fluctuation period across genres suggests a fundamental aspect of textual organization related to human information processing. While LCC classifications did not strongly differentiate these fluctuation patterns, keyword filters on titles confirmed that text type and structure influence the observed fluctuations. The data-driven denoising approach using EMD offers a significant improvement over traditional methods that rely on arbitrarily large window sizes, which may remove relevant information. The results offer insights into the fundamental structure of texts beyond narrative, encompassing diverse textual forms, and confirm that the approach is valid for analyzing and comparing books regardless of length.
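The denoising idea can be sketched as follows, taking the IMFs and trend as already computed. Two points are assumptions of this sketch rather than details given in the text: "significantly higher variance" is implemented as an empirical p-value below a threshold `alpha`, and the denoised series is taken to be the trend plus all IMFs at or above the cutoff order. All names and toy values are hypothetical.

```python
def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def find_cutoff(original_vars, shuffled_vars, alpha=0.05):
    """Return the lowest IMF index whose variance beats the shuffled baseline.

    original_vars[k] is the variance of IMF k for the original text.
    shuffled_vars[k] lists the variances of IMF k across shuffled texts.
    The empirical p-value test and alpha are assumptions of this sketch.
    """
    for k, (orig, shuffled) in enumerate(zip(original_vars, shuffled_vars)):
        p_value = sum(s >= orig for s in shuffled) / len(shuffled)
        if p_value < alpha:
            return k
    return None  # no IMF stands out from the shuffled baseline


def denoise(imfs, trend, cutoff):
    """Add the IMFs from the cutoff order upward to the trend."""
    kept = imfs[cutoff:] if cutoff is not None else []
    return [t + sum(imf[i] for imf in kept) for i, t in enumerate(trend)]


# Toy inputs (made up): three IMFs and a trend over four windows.
imfs = [
    [0.1, -0.1, 0.1, -0.1],
    [0.2, 0.0, -0.2, 0.0],
    [0.5, 0.5, -0.5, -0.5],
]
trend = [1.0, 1.0, 1.0, 1.0]
original_vars = [variance(imf) for imf in imfs]

# Hypothetical shuffled-text variances per IMF order (20 shuffles each).
shuffled_vars = [[0.02] * 20, [0.03] * 20, [0.01] * 20]

cutoff = find_cutoff(original_vars, shuffled_vars)
print(cutoff)
print(denoise(imfs, trend, cutoff))
```

Unlike a fixed large smoothing window, the amount of smoothing here is set per book by where its IMF variances first exceed what shuffled text produces.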
Conclusion
This study provides a novel approach to quantifying text structure using ousiometrics and EMD, revealing a consistent structural unit of approximately a few thousand words in longer texts. This aligns with editorial practices and suggests a potential link to human information processing. The methodology also offers a data-driven denoising method for text-derived time series. Future research could explore the hierarchical organization of these structural units, investigate the interplay of the power and danger dimensions, extend the analysis to other textual forms such as screenplays, and compare different versions of the same book.
Limitations
The study relies on the Project Gutenberg corpus, which may not fully represent the diversity of all published works. The EMD method, while robust, is susceptible to mode mixing, potentially affecting the accuracy of IMF decomposition. The comparison of original and shuffled texts focuses solely on variance differences in IMFs, potentially overlooking other differences that could reveal additional insights. The interpretation of the cutoff IMF period is based on the assumption that it represents a basic unit of textual structure; further research is needed to directly test this assumption.