Interdisciplinary Studies
A decomposition of book structure through ousiometric fluctuations in cumulative word-time
M. I. Fudolig, T. Alshaabi, et al.
The paper addresses how word usage changes over the course of a single text when measured in cumulative word-time (number of words seen), rather than normalized by text length. Prior quantitative work has focused on aggregate word frequencies and co-occurrence or on normalized narrative arcs across entire texts. While normalization allows comparing texts of different lengths, it obscures comparisons between sections of texts of similar absolute length. The authors propose using cumulative word-time to enable such comparisons and investigate whether meaningful fluctuations in essential meaning occur within sections of books. They adopt ousiometrics, a two-dimensional power–danger reinterpretation of the valence–arousal–dominance (VAD) framework, and argue that these two dimensions are suitable markers for quantifying structure in text, with higher lexical coverage than happiness scores. Visual examples show that larger window sizes smooth time series, disproportionately diminishing fluctuations in shorter books while longer books retain oscillations; plotted in raw word-time, shorter books appear comparable to subsections of longer ones. The research questions include: Can we detect meaningful changes in word usage at scales shorter than entire books? How do we characterize such changes across diverse texts? The study compares original texts to shuffled versions (same words, randomized order) to isolate the effect of word order, and employs empirical mode decomposition (EMD) and its ensemble variant (EEMD) to extract oscillatory modes and trends. This framework aims to quantify structural fluctuations without assuming noise models and to relate fluctuation scales to properties like length, content, and classification.
The introduction reviews computational and qualitative narrative analyses: Gao et al. (2016) found long-range sentiment correlations in novels; Reagan et al. (2016) identified six basic emotional arcs using smoothed happiness scores across normalized book length; Boyd et al. (2020) linked Freytag’s dramatic arc stages to distinct word-usage patterns and showed similar plot progression patterns in nonfiction with differences in staging and cognitive processes. Schmidt (2015) and Toubia et al. (2021) modeled narratives as paths in topic or embedding spaces with properties like speed, volume, and circuitousness predicting success metrics. The VAD framework (Osgood et al., 1957) and large VAD lexicons (Warriner et al., 2013; Mohammad, 2018) reveal linear correlations among VAD dimensions; Dodds et al. (2021) reinterpreted VAD into two independent dimensions aligning with power and danger, with a safety (low-danger) bias and high correlation of valence with happiness, motivating the use of power–danger scores with broader lexical coverage. The literature mostly treats normalized arcs over entire texts, leaving a gap in analyzing word-usage change in absolute word-time across different texts.
Data preprocessing: The authors downloaded over 45,000 public-domain books from Project Gutenberg and removed Gutenberg headers using the Standardized Project Gutenberg Corpus tooling. Contractions were expanded when unambiguous and deleted when ambiguous; text was lowercased and tokenized on whitespace, excluding tokens containing digits, punctuation, or other non-word characters. They expanded the NRC VAD-derived power–danger lexicon to 32,721 words by adding plurals and conjugations (using base-form scores). Books were required to have at least 60% unique-word coverage by the lexicon, at least one lexicon word per analysis window, no duplicate title, and successful EEMD up to the trend (mean of summed EEMD results within 10% of the raw mean), yielding 31,690 books (median unique-word coverage 73% among analyzed books).
Time series construction: For each book, they built danger and power time series using non-overlapping windows of size Nw = 50 words (skip size = 50), each window’s score being the token-frequency-weighted average of lexicon scores for words in the window. The same windowing was applied to original and shuffled text, so both series have equal length. Reference series were generated by shuffling the tokenized text, re-windowing, and recomputing window means; 100 shuffled realizations served as references.
EMD/EEMD and variance-based cutoff: For original text series, ensemble empirical mode decomposition (EEMD) was applied with ensemble size 100 and added white noise of 0.2σ per Wu and Huang (2009). For each shuffled realization, basic EMD was computed. Each decomposition yields IMFs (intrinsic mode functions: mean-zero oscillatory modes) and a non-oscillatory trend. IMF variances Var = (1/N) Σ_i x_i^2 were computed and rescaled so that the first IMF of the original matched a representative of the shuffled first-IMF distribution (three options: median, 1st percentile, or no rescaling), following Flandrin et al. and Wu & Huang.
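The windowing and shuffling steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy lexicon, the dropping of a final partial window, and the `None` placeholder for windows without lexicon words are all assumptions made here.

```python
import random

def window_scores(tokens, lexicon, window=50):
    """Mean lexicon score for each non-overlapping window of `window` tokens."""
    scores = []
    # Assumption: a trailing partial window is dropped so all windows have equal size.
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        # Averaging over token occurrences weights each word by its frequency in the window.
        matched = [lexicon[t] for t in chunk if t in lexicon]
        # Assumption: windows with no lexicon words are marked None rather than scored.
        scores.append(sum(matched) / len(matched) if matched else None)
    return scores

def shuffled_references(tokens, lexicon, window=50, n_shuffles=100, seed=0):
    """Window scores for `n_shuffles` random permutations of the same tokens."""
    rng = random.Random(seed)
    references = []
    for _ in range(n_shuffles):
        permuted = tokens[:]      # copy so the original order is untouched
        rng.shuffle(permuted)     # same words, randomized order
        references.append(window_scores(permuted, lexicon, window))
    return references
```

Because the shuffled text is windowed with the same size and skip as the original, every reference series has the same number of points as the original series, which is what makes order-by-order comparison of the decompositions possible.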
The first IMF is assumed to be noise given the small window size (50 words), an assumption supported by the similar periods of the original and shuffled first IMFs. The cutoff IMF order is the lowest order at which the original’s rescaled variance exceeds the 99th percentile of shuffled variances; if the trend is the first such level, the book is trend-only; otherwise, fluctuations above the trend are deemed relevant. Without rescaling, the first IMF was often flagged; given its short period (~100 words) and distinct behavior, the authors disregard the first IMF when determining the cutoff in the no-rescaling case.
Period estimation: For the cutoff IMF, a characteristic period (in words) was obtained via the Hilbert–Huang transform using logarithmically spaced frequency bins from 10^-6 to 1 word^-1 (period 1 to 10^6 words), with sampling rate set by the windowing (Nw = 50). The period is taken from the center of the highest-energy frequency bin. This period estimation is independent of the rescaling choice for detecting the cutoff.
Parameter notes: They also examined robustness to different rescaling choices (median, 1st percentile, none) and verified comparable IMF periods between original and shuffled series up to the cutoff IMF.
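The variance-based cutoff rule can be illustrated with a short sketch that operates on precomputed per-order variances. It is written under stated assumptions rather than taken from the paper: the helper names, the linear-interpolation percentile, and the handling of shuffled realizations with fewer components are choices made here, and the special treatment of the trend level is left out.

```python
def variance(imf):
    """Variance of a mean-zero IMF: the mean of its squared values."""
    return sum(x * x for x in imf) / len(imf)

def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100] (an assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def cutoff_order(orig_vars, shuffled_vars, rescale_q=50.0, threshold_q=99.0):
    """
    orig_vars: variances of the original series, one per order (index 0 = first IMF).
    shuffled_vars: one such list per shuffled realization.
    rescale_q: percentile of the shuffled first-IMF variances that the original
               first IMF is rescaled to match (50 = median, 1 = 1st percentile,
               None = no rescaling).
    Returns the 1-based order of the first component above the threshold, or None.
    """
    if rescale_q is None:
        scale = 1.0
    else:
        first = [r[0] for r in shuffled_vars]
        scale = percentile(first, rescale_q) / orig_vars[0]
    # Start at the second order: the first IMF is treated as noise. After rescaling it
    # cannot exceed the threshold anyway; without rescaling it is explicitly disregarded.
    for k in range(1, len(orig_vars)):
        # Assumption: realizations that produced fewer components are skipped at order k.
        reference = [r[k] for r in shuffled_vars if k < len(r)]
        if reference and orig_vars[k] * scale > percentile(reference, threshold_q):
            return k + 1
    return None
```

In this sketch a single scale factor is applied to all of the original's variances, so changing `rescale_q` shifts the whole variance profile up or down relative to the shuffled distribution. That is why the detected cutoff order can differ between the median, 1st-percentile, and no-rescaling options.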
- Corpus and coverage: Of >45,000 downloaded books, 31,690 English books met inclusion criteria (≥60% unique-word lexicon coverage, etc.). About 60% carried LCC class P (Language and Literature); World History (D) was ≈8%.
- Short vs. long texts: Shorter books (notably <3,000 words) predominantly exhibited only a non-oscillatory trend (trend-only). Longer books (notably >100,000 words) more often showed relevant fluctuations above the trend.
- Fluctuation scales: When relevant fluctuations exist, their characteristic periods are typically on the order of a few thousand words and almost always shorter than the book length. There is no clear dependence of cutoff period on total book length.
- Percentile ranges for cutoff periods (danger series):
  - Rescale to 50th percentile (median) of first IMF: 25th–75th percentile ≈ 1,000–3,200 words.
  - Rescale to 1st percentile of first IMF: 25th–75th percentile ≈ 1,200–6,400 words.
  - No rescaling (excluding IMF1): 25th–75th percentile ≈ 500–1,400 words.
- Variance and IMF counts: Raw variances of cutoff IMFs show no relation to book length and are wider in range than for IMF1, indicating divergence from noise at higher orders. The number of IMFs increases with book length, but the cutoff IMF order does not correlate with length.
- Comparison with shuffled text: Original texts have higher IMF variances than shuffled texts at and above the cutoff order; shuffled IMF variances generally decrease with IMF order as in fractional Gaussian noise.
- Category and content effects: LCC class/subclass labels (top groups: PS, PR, PZ, AP, PQ; class P, D, B, A, E) show overlapping distributions of cutoff periods/variances, indicating limited differentiation at this taxonomy level. Title-keyword filtering reveals clearer differences: books with “poem” in the title have shorter cutoff periods and markedly higher variances; “manual” titles have lower median cutoff variances; “play” titles show higher median cutoff periods and lower variances. Some keywords (“collection”, “short stor”, “report”, “essay”) are more sensitive to rescaling choice.
- Power vs. danger: Analyses on power time series mirror the danger-series findings (similar orders of magnitude for cutoff periods and analogous trends).
- Denoising: Partial reconstruction using low-frequency IMFs reproduces smoothed series akin to using large windows (e.g., Nw=5000 with overlap), providing a data-driven denoising approach that preserves relevant fluctuations across text lengths.
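The partial reconstruction mentioned in the denoising item can be sketched as a sum over the retained components. This is an illustrative assumption about the procedure, not the authors' implementation: the function name is hypothetical, the IMFs and trend are taken as already computed, and including the trend in the sum is an assumption made here.

```python
def partial_reconstruction(imfs, trend, cutoff):
    """
    Sum the IMFs of order >= cutoff (1-based) together with the trend.
    imfs: list of IMFs ordered from highest to lowest frequency.
    trend: non-oscillatory residual, same length as each IMF.
    """
    smoothed = list(trend)            # start from the trend (assumed to be included)
    for imf in imfs[cutoff - 1:]:     # skip the high-frequency orders below the cutoff
        for i, value in enumerate(imf):
            smoothed[i] += value
    return smoothed
```

With `cutoff=1` the function returns the full series (all IMFs plus trend), and with a cutoff beyond the last IMF it returns the trend alone, which matches the trend-only case for short books.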
The study directly addresses whether meaningful changes in word usage occur at absolute word-time scales and how to characterize them. By contrasting original with shuffled texts and decomposing time series via EEMD/EMD, the authors show that long texts exhibit relevant oscillatory fluctuations with characteristic periods of a few thousand words, independent of total length. This supports the view that, in ousiometric terms, longer books resemble concatenations of shorter segments rather than scaled-up versions of shorter texts. The fluctuation scales align with editorial practice and reader experience (e.g., chapter lengths), suggesting a structural segmentation across genres, including non-literary texts. The lack of strong differentiation by broad LCC categories indicates that structural scales are more influenced by specific content and organization (e.g., poems vs. manuals) than by high-level topical classifications. Methodologically, the decomposition and variance comparison against shuffled controls provide a principled, data-adaptive means to separate signal from noise and to denoise text-derived series without arbitrarily large windows, preserving relevant sub-plot or section-level oscillations. Altogether, the findings argue for a basic unit length in textual structure measurable in word-time, opening avenues for finer-grained computational literary analysis.
The paper contributes: (1) a word-time-based framework using ousiometric power and danger to quantify within-text structural fluctuations, demonstrating that longer books are composed of segments exhibiting oscillations with periods of a few thousand words, and (2) a data-adaptive denoising method using EEMD/HHT and shuffled-text baselines that works across text lengths without over-smoothing shorter texts. Broad LCC categories are not strong differentiators of fluctuation scales, whereas specific content signals (e.g., poems, manuals, plays) show distinctive patterns. Future research directions include: probing hierarchical segmentation (multiple IMF orders contributing to structure); analyzing joint trajectories in power–danger space (spatial paths) akin to embedding-space narrative paths; expanding to other media (screenplays, movies) and versions (drafts vs. final); and investigating cognitive constraints underlying characteristic segment lengths.
- Dependence on EMD/EEMD: Potential mode mixing due to signal intermittency; EEMD mitigates but does not eliminate it in all cases.
- Variance-only criterion: Cutoff detection uses IMF variance comparisons; other differences between original and shuffled IMFs (beyond variance) are not captured. Alternative PDF-based comparisons underperformed on synthetic tests.
- Parameter sensitivity: Results for individual books can vary with rescaling choice (median vs. 1st percentile vs. none) and other settings; sensitivity analysis is advised for book-specific conclusions.
- Assumption about IMF1: Treating the first IMF (with Nw=50) as noise is an assumption justified by period similarity with shuffled text and short scale; nonetheless, very fine-grained structure could be missed at this scale.
- Windowing constraints: Non-overlapping windows (Nw=50) fix temporal resolution; while partial reconstruction captures larger scales, extremely short or highly irregular structures may be under-resolved.

