Computational thematics: comparing algorithms for clustering the genres of literary fiction

O. Sobchuk and A. Šeļa

This study by Oleg Sobchuk and Artjoms Šeļa examines how unsupervised learning algorithms can be used to cluster literary genres automatically. By comparing text preprocessing, feature extraction, and distance measurement methods on a corpus of genre fiction, the authors identify the combinations that recover known genre groupings most effectively.

Introduction

The paper addresses how to best detect thematic similarity among literary texts to enable unsupervised clustering of genres. Building on the success of computational stylometry (focused on authorial style and authorship attribution), the authors note a lack of systematic comparisons for methods aimed at thematic similarity, which they term computational thematics. They motivate the importance of reliable thematic similarity measures for genre detection, historical literary research, and scalable analysis across large digital archives. They compare unsupervised approaches across multiple preprocessing, feature extraction, and distance metrics, using four well-defined fiction genres (detective, fantasy, romance, science fiction) as a proxy ground truth. The central research question is which combinations of pre-processing (thematic foregrounding), features, and distance measures most effectively recover known genre groupings. They also explore generalizability by applying methods to a large HathiTrust-based corpus.

Literature Review

The paper situates its work relative to computational stylometry, where extensive benchmarking has identified effective procedures for authorship attribution using most frequent words and various distance measures. In contrast, computational thematics lacks such systematic comparisons and often relies on subjective judgment. The authors argue that, while imperfect, genre categories can serve as a useful proxy ground truth for thematic similarity. Prior work includes manual genre tagging and supervised learning from tagged datasets, both with limitations (bias, labor, inability to discover new genres). Unsupervised clustering has shown promise for genre discovery and historical analysis. They reference topic modeling (LDA), network-based topic/group discovery (WGCNA), and distributional approaches (doc2vec), noting varied parameterizations in the literature and the absence of thorough, comparative evaluation specific to thematic similarity detection in long-form fiction.

Methodology

Data and corpus design: Four genre corpora (50 novels each; detective, fantasy, romance, science fiction) were assembled to be clear thematic representatives, emphasizing canonical works, prize-winners, and highly rated Goodreads titles. Controls: comparable abstraction level across genres; publication dates limited to 1950–1999 to reduce diachronic language drift; a similar number of authors per genre (29–31), with 1–3 texts per author. The book list was preregistered on OSF (https://osf.io/rce2w).

Overall workflow (Fig. 1): two nested loops. The outer loop iterates over every combination of preprocessing, feature type, and distance metric (Step 1). For each combination, an inner loop draws a stratified random sample (Step 2), clusters it (Step 3), and validates the clusters (Step 4). Each combination is evaluated on 100 random samples, yielding a distribution of Adjusted Rand Index (ARI) scores.
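The two-loop workflow can be sketched roughly as below. This is a minimal illustration, not the authors' code: `run_benchmark`, `summarize`, and the `evaluate_once` callable are hypothetical names, and the sketch takes a full cross product of options, whereas the paper's 291 combinations exclude pairings that do not apply (JSD, for example, is only used with LDA and bag-of-words).

```python
import itertools
import statistics

def run_benchmark(foregroundings, features, distances, evaluate_once, n_iter=100):
    """Outer loop over combinations, inner loop over random samples.

    evaluate_once(fg, feat, dist, seed) is a hypothetical callable that draws
    one stratified sample, clusters it, and returns an ARI score.
    """
    results = {}
    for combo in itertools.product(foregroundings, features, distances):
        # Inner loop: one score per random sample, seeded for reproducibility.
        results[combo] = [evaluate_once(*combo, seed) for seed in range(n_iter)]
    return results

def summarize(results):
    """Return (combination, median score) pairs, best combination first."""
    medians = [(combo, statistics.median(scores)) for combo, scores in results.items()]
    return sorted(medians, key=lambda pair: pair[1], reverse=True)
```

Passing the per-sample evaluation in as a callable keeps the skeleton independent of any particular feature extractor or clustering routine.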

Step 1a: Thematic foregrounding (pre-processing levels)

Weak: lemmatization; remove 100 most frequent words (approximate function words)
Medium: lemmatization; remove named entities (via spaCy); POS-filter to nouns, verbs, adjectives, adverbs
Strong: all medium steps plus lexical simplification that replaces all words outside the top 1000 MFWs with more common semantic neighbors (among 10 nearest), using a pre-trained FastText model (2M-word, English Wikipedia)
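As a rough illustration of two of these steps, the sketch below drops the most frequent words (the weak level) and replaces rare words with a more common neighbor (the lexical simplification added at the strong level). It assumes texts are already tokenized and lemmatized. The `neighbors` mapping is a hypothetical stand-in for the nearest neighbors returned by a FastText model, and leaving a word unchanged when no common neighbor exists is an assumption of this sketch, not something the source specifies.

```python
from collections import Counter

def most_frequent_words(docs, n):
    """Return the n most frequent tokens across a list of token lists."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [word for word, _ in counts.most_common(n)]

def weak_foregrounding(docs, n_drop=100):
    """Drop the n_drop most frequent tokens (a rough proxy for function words)."""
    stop = set(most_frequent_words(docs, n_drop))
    return [[tok for tok in doc if tok not in stop] for doc in docs]

def simplify_lexicon(docs, neighbors, n_keep=1000):
    """Replace tokens outside the n_keep most frequent with a common neighbor.

    neighbors maps a token to candidate replacements, nearest first
    (hypothetical stand-in for the 10 nearest FastText neighbors).
    """
    common = set(most_frequent_words(docs, n_keep))

    def swap(tok):
        if tok in common:
            return tok
        for candidate in neighbors.get(tok, []):
            if candidate in common:
                return candidate
        return tok  # assumption: keep the token if no common neighbor exists

    return [[swap(tok) for tok in doc] for doc in docs]
```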

Step 1b: Feature extraction

Bag-of-words (MFWs): top 1000, 5000, or 10,000 words
LDA topic probabilities: k ∈ {20, 50, 100}; MFWs ∈ {1000, 5000, 10,000}; novels chunked into 1000-word segments
WGCNA modules: with 1000 or 5000 MFWs; with and without chunking (1000-word chunks); default WGCNA parameters
Doc2vec embeddings: 300-dimensional document vectors with transfer learning from FastText
Result of Step 1b is a document-term (or document-feature) matrix.
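The simplest of these feature types, the bag-of-words matrix over the most frequent words, can be sketched as follows, together with a helper for cutting a novel into 1000-word chunks. The function names are hypothetical. Normalizing counts by document length and keeping a shorter final chunk are assumptions of this sketch, since the source does not state how either is handled.

```python
from collections import Counter

def chunk(tokens, size=1000):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def bow_matrix(docs, n_mfw=1000):
    """Build a document-term matrix of relative frequencies.

    Columns are the n_mfw most frequent words in the whole collection;
    each row holds one document's counts divided by its length.
    """
    totals = Counter(tok for doc in docs for tok in doc)
    vocab = [word for word, _ in totals.most_common(n_mfw)]
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        rows.append([counts[word] / length for word in vocab])
    return vocab, rows
```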

Step 1c: Distance metrics

Euclidean, Manhattan, Delta, Cosine, Cosine Delta; Jensen–Shannon divergence (JSD) for probability distributions (applied to LDA and bag-of-words)
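Three of these measures are sketched below using their standard textbook definitions, which may differ in detail from the implementation the authors used. Delta is taken here as Burrows's Delta (mean absolute difference of z-scored frequencies), and Cosine Delta as cosine distance on the same z-scores. Using the population standard deviation and base-2 logarithms for JSD are assumptions of this sketch. JSD expects each input to be a probability distribution that sums to 1.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def zscores(rows):
    """Standardize each column (feature) across all documents."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; constant columns get 1.0 to avoid dividing by zero.
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def delta(z1, z2):
    """Burrows's Delta: mean absolute difference of two z-scored rows."""
    return sum(abs(a - b) for a, b in zip(z1, z2)) / len(z1)

def cosine_distance(u, v):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)
```

Cosine Delta between two documents is then `cosine_distance(z[i], z[j])` for rows of `z = zscores(rows)`.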

Combinatorics: 291 distinct combinations of foregrounding × features × distance measures were tested.

Step 2: Sampling

Robustness is assessed by cross-validation: for each combination, 100 iterations each sample 120 novels (30 per genre). Models that require fitting (LDA, WGCNA, doc2vec) are trained anew in every iteration.
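One way to draw such a stratified sample is sketched below. The function name and the seeding scheme are hypothetical. The sketch only assumes that each novel has a genre label and that every genre has at least `per_class` novels.

```python
import random
from collections import defaultdict

def stratified_sample(labels, per_class=30, seed=None):
    """Return sorted indices of a random sample with per_class items per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    picked = []
    for label in sorted(by_label):
        # Sample without replacement inside each genre.
        picked.extend(rng.sample(by_label[label], per_class))
    return sorted(picked)
```

With 50 novels in each of four genres and `per_class=30`, every call returns 120 indices, matching the sample size described above.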

Step 3: Clustering

Hierarchical clustering with Ward’s linkage; tree is cut into 4 clusters (assumed number of genres). Although Ward is defined for Euclidean distance, prior work shows it often performs best in text clustering; it is used here with all distance matrices.
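A minimal sketch of this step with SciPy is shown below. Using SciPy at all is an assumption, since the source does not say which software the authors used, and `ward_clusters` is a hypothetical helper name. As the text notes, feeding Ward's method a non-Euclidean distance matrix is a pragmatic choice rather than a mathematically exact one.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def ward_clusters(dist_matrix, n_clusters=4):
    """Cluster a square distance matrix with Ward's linkage and cut the tree.

    Returns one integer cluster label (1..n_clusters) per document.
    """
    # linkage expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(np.asarray(dist_matrix, dtype=float), checks=False)
    tree = linkage(condensed, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```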

Step 4: Validation

Adjusted Rand Index (ARI) compares cluster assignments to ground-truth genre labels. Across all combinations and samples, this yields 29,100 ARI observations.
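For reference, the ARI can be computed from pair counts as in the sketch below, which follows the standard Hubert and Arabie formulation. It is an illustrative reimplementation, not the authors' code. A score of 1 means the clustering reproduces the genre labels exactly (up to renaming of clusters), and values near 0 indicate chance-level agreement.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same items."""
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree
    # Pairs placed together in both labelings, in truth only, in pred only.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = together_truth * together_pred / total_pairs
    max_index = (together_truth + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (together_both - expected) / (max_index - expected)
```

Collecting one such score per sample, 291 combinations times 100 samples gives the 29,100 observations mentioned above.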

Large-corpus illustration

To test generalizability, they embed four target genres identified within a random sample of 5000 HathiTrust/NovelTM fiction works. They apply a high-performing pipeline (medium foregrounding; LDA with k=100 on 1000 MFWs; Delta distance), and visualize clusters via UMAP. Accuracy comparisons versus a worst-performing pipeline are provided in the supplement (Section 5.2).

Key Findings

Unsupervised clustering of fiction genres is feasible: best combinations achieve average ARI around 0.66–0.70, despite noise and complexity of literary texts.
Thematic foregrounding is crucial: weak foregrounding consistently harms performance across feature types; medium vs strong shows little difference, and lexical simplification (the strong-level addition) offers no consistent gains.
Feature types: doc2vec has the best average ARI at strong foregrounding; LDA is a strong second with stable performance across its parameter settings; bag-of-words performs surprisingly well given its simplicity and is close to LDA; WGCNA performs worst on average.
LDA parameters: performance is not meaningfully sensitive to k (20/50/100) or to the number of MFWs (1k/5k/10k), once adequate thematic foregrounding is applied; the foregrounding choice dominates.
Bag-of-words: performance improves markedly from 1000 to 5000 MFWs and from weak to medium foregrounding; moving to 10,000 MFWs and strong foregrounding yields only marginal gains beyond that.
Distance metrics: Jensen–Shannon divergence is best for LDA and bag-of-words; Delta and Manhattan also perform well. Euclidean distance is consistently worst for LDA, bag-of-words, and WGCNA; cosine is not optimal for LDA though acceptable in some contexts. For doc2vec, the distance choice matters less; cosine with doc2vec (300d) is among top performers.
Top combinations (five examples from the top 10 in Fig. 2):
1. Strong foregrounding + doc2vec (300d) + cosine: median ARI ≈ 0.703
2. Strong foregrounding + LDA (k=50, 5000 MFWs) + JSD: ≈ 0.677
3. Strong foregrounding + LDA (k=100, 1000 MFWs) + JSD: ≈ 0.670
4. Strong foregrounding + BoW (10,000 MFWs) + JSD: ≈ 0.665
5. Strong foregrounding + BoW (5000 MFWs) + JSD: ≈ 0.657
Most top performers use medium or strong foregrounding; LDA and BoW with JSD are frequent; doc2vec + cosine also excels.
Larger corpus test (HathiTrust/NovelTM 5000): a high-performing pipeline (medium foregrounding; LDA k=100 on 1k MFWs; Delta) produces clear genre clusters in UMAP; supplementary materials compare against a worst-performing pipeline.

Discussion

The study demonstrates that different algorithmic choices substantially affect the recovery of thematic similarity structures (genres) in fiction. By systematically benchmarking preprocessing, feature types, and distance metrics against a proxy ground truth, the authors show that robust thematic detection is attainable and that best practices differ from those in computational stylometry. Specifically, heavy reliance on function words and cosine distance (common in stylometry) underperforms for thematic tasks, whereas thematic foregrounding, distribution-based distances (JSD), and topic/embedding or sufficiently large BoW features improve results. These findings support the feasibility of unsupervised clustering for practical tasks such as genre discovery in large digital libraries, historical analyses of literary evolution, and content-based recommendation. The large-corpus illustration suggests that the identified best practices generalize beyond the curated dataset, although further validation is warranted.

Conclusion

The paper provides a preregistered, large-scale benchmark of 291 algorithmic combinations across preprocessing (thematic foregrounding), feature extraction (BoW, LDA, WGCNA, doc2vec), and distance metrics for unsupervised genre clustering in fiction. It offers practical guidance: apply thematic foregrounding (at least medium), prefer JSD for probability-based features (LDA/BoW), avoid Euclidean distance for thematic clustering, and consider doc2vec or LDA (and even BoW with ~5000 MFWs) as strong baselines. The approach scales and shows promise on larger, noisier corpora. Future work should: expand testing of vector/topic methods (e.g., BERTopic, Top2Vec), explore other network/community detection approaches, refine text simplification pipelines, replicate across more genres and languages, and integrate advanced Bayesian clustering to quantify uncertainty and discover latent genres.

Limitations

Clustering is a simplification of relationships in a distance matrix, itself built from imperfect proxies; generalization from specific parameterizations is limited.
Limited control over textual representation; uncertainty in the boundary between thematic and formal (narrative) features; current methods may conflate them.
Genre labels are an imperfect ground truth; different genre axes (plot, setting, audience, affect) may vary in how well they reflect thematic similarity.
The study does not exhaust all feature extraction or embedding methods; doc2vec settings were not extensively tuned; other modern topic/embedding approaches were not benchmarked.
Overfitting concerns are mitigated by resampling but not eliminated; broader validation across corpora, genres, and languages is needed.
