Computational thematics: comparing algorithms for clustering the genres of literary fiction

O. Sobchuk and A. Šeļa

This study by Oleg Sobchuk and Artjoms Šeļa examines how unsupervised learning algorithms can be used to cluster literary genres automatically. By systematically comparing text preprocessing, feature extraction, and distance measurement methods on a corpus of novels from four genres, the authors identify which combinations detect genre most effectively.

Introduction
Computational literary studies have seen a rise in computational stylometry, focusing on algorithmic detection of stylistic similarities. This contrasts with the less-developed area of "computational thematics," which aims to identify thematic similarities between texts. Understanding thematic similarities is crucial for genre studies, where genres are viewed as evolving populations of texts united by shared thematic characteristics. Digital archives offer opportunities for large-scale genre analysis, but require reliable algorithms for detecting thematic signals. Existing approaches include manual tagging (prone to bias), supervised machine learning (limited to known genres), and unsupervised clustering (scalable and capable of discovering novel genres). This paper focuses on unsupervised clustering, comparing various combinations of preprocessing techniques ("thematic foregrounding"), feature extraction methods, and distance metrics to identify the most effective approaches for detecting thematic similarities in literary texts.
Literature Review
While computational stylometry benefits from clear "ground truth" data (authorship), computational thematics lacks such a widely accepted proxy. Genre categories serve as an imperfect but useful proxy, though the suitability of different genre categorizations (based on plot, setting, evoked emotions, etc.) varies. Existing quantitative genre analysis often relies on manual tagging or supervised learning, both with limitations. Unsupervised clustering offers a more scalable and exploratory approach, capable of identifying novel genre populations.
Methodology
The study uses a controlled corpus of 200 novels, 50 from each of four genres: detective, fantasy, romance, and science fiction. The selection aimed for canonical, uncontroversial representatives published between 1950 and 1999, which minimizes language change and genre ambiguity. The workflow has four steps:
1. Select a combination of thematic foregrounding (weak, medium, or strong, with increasing degrees of preprocessing such as lemmatization, removal of frequent words and entities, and lexical simplification), feature type (bag-of-words, LDA topics, WGCNA modules, or doc2vec dimensions), and distance metric (Euclidean, Manhattan, Delta, Cosine, Cosine Delta, or Jensen-Shannon divergence).
2. Draw a random sample of 30 books from each genre (120 books per sample); 100 such samples are drawn in total for cross-validation.
3. Cluster each sample using Ward's algorithm.
4. Validate the clusters against the genre labels using the Adjusted Rand Index (ARI).
This process is repeated for 291 combinations of the chosen parameters. To further evaluate generalizability, the best- and worst-performing methods are applied to a larger sample of 5000 novels from the HathiTrust corpus.
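Steps 2 to 4 of the workflow can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names (`ward_clusters`, `adjusted_rand_index`, `sample_per_genre`, `evaluate`) and the toy data are assumptions made for this example, and the input is assumed to be a precomputed matrix of pairwise distances between books. Ward's criterion is applied through the Lance-Williams update on squared distances, which is exact for Euclidean distances and only a heuristic for other metrics.

```python
import random
from collections import Counter
from math import comb


def ward_clusters(dist, k):
    """Merge items with Ward's criterion until k clusters remain.

    dist is a symmetric list-of-lists matrix of pairwise distances.
    """
    n = len(dist)
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}
    next_id = n
    while len(members) > k:
        (a, b), dab = min(d2.items(), key=lambda kv: kv[1])
        na, nb = len(members[a]), len(members[b])
        merged = members.pop(a) + members.pop(b)
        for c in members:
            nc = len(members[c])
            dac = d2.pop((min(a, c), max(a, c)))
            dbc = d2.pop((min(b, c), max(b, c)))
            # Lance-Williams update for Ward's method (squared distances).
            d2[(c, next_id)] = (
                (na + nc) * dac + (nb + nc) * dbc - nc * dab
            ) / (na + nb + nc)
        del d2[(a, b)]
        members[next_id] = merged
        next_id += 1
    labels = [0] * n
    for label, items in enumerate(members.values()):
        for i in items:
            labels[i] = label
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same items."""
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(len(truth), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def sample_per_genre(genres, per_genre, rng):
    """Return indices of per_genre randomly chosen books from each genre."""
    by_genre = {}
    for idx, genre in enumerate(genres):
        by_genre.setdefault(genre, []).append(idx)
    chosen = []
    for genre in sorted(by_genre):
        chosen.extend(rng.sample(by_genre[genre], per_genre))
    return chosen


def evaluate(dist, genres, per_genre=30, n_samples=100, seed=0):
    """Mean ARI over repeated stratified samples (steps 2 to 4)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        idx = sample_per_genre(genres, per_genre, rng)
        sub = [[dist[i][j] for j in idx] for i in idx]
        pred = ward_clusters(sub, k=len(set(genres)))
        scores.append(adjusted_rand_index([genres[i] for i in idx], pred))
    return sum(scores) / len(scores)


# Toy usage: 4 hypothetical genres with 5 "books" each, placed on a line so
# that books of the same genre sit close together.
genres = [g for g in ("detective", "fantasy", "romance", "scifi") for _ in range(5)]
positions = [10 * (i // 5) + 0.1 * (i % 5) for i in range(20)]
dist = [[abs(a - b) for b in positions] for a in positions]
print(evaluate(dist, genres, per_genre=3, n_samples=5))
```

In a real run, `dist` would come from whichever feature type and distance metric were chosen in step 1, and the defaults `per_genre=30` and `n_samples=100` mirror the sampling described above.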
Key Findings
The average ARI of the best-performing algorithms ranges from 0.66 to 0.7, indicating that unsupervised genre clustering is feasible. Key findings from Bayesian linear regression models include:
1. Stronger thematic foregrounding generally improves genre clustering, though the difference between medium and strong foregrounding is marginal.
2. Doc2vec, LDA, and bag-of-words perform similarly, while WGCNA shows the lowest average ARI. LDA's performance is relatively insensitive to the number of topics and the number of most frequent words (MFWs) used.
3. The bag-of-words approach requires a balance between thematic foregrounding and the number of MFWs.
4. Jensen-Shannon divergence performs best as a distance metric for LDA and bag-of-words, while Euclidean distance performs worst.
The analysis of the larger HathiTrust dataset qualitatively supports these findings: the four seed genres separate clearly under the best-performing algorithm combination and much less clearly under the worst-performing one.
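To make the contrast between the two distance metrics concrete, the sketch below computes Jensen-Shannon divergence (base-2 logarithm, so values lie between 0 and 1) and Euclidean distance for made-up probability vectors, such as LDA topic proportions. The function names and the example vectors are assumptions for illustration, not data from the study. The two pairs of vectors are equally far apart in Euclidean terms, but Jensen-Shannon divergence rates the pair in which one topic drops to zero as more different. This is one possible intuition for why the metrics can rank text pairs differently, not an explanation given in the source.

```python
from math import log2, sqrt


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two probability distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def euclidean(p, q):
    """Ordinary Euclidean distance between two vectors."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Hypothetical topic proportions for two pairs of texts.
pair_balanced = ([0.5, 0.5], [0.4, 0.6])
pair_skewed = ([0.1, 0.9], [0.0, 1.0])

for name, (p, q) in (("balanced", pair_balanced), ("skewed", pair_skewed)):
    print(name, euclidean(p, q), jensen_shannon_divergence(p, q))
```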
Discussion
The findings highlight the effectiveness of unsupervised learning for detecting thematic similarities in literary fiction and the importance of selecting appropriate algorithms. The best-performing combinations generally involve stronger thematic foregrounding, LDA or doc2vec features, and Jensen-Shannon divergence as the distance metric. Euclidean distance, frequently used in other text analysis, is shown to be suboptimal for thematic analysis. The comparison with the larger HathiTrust corpus suggests that these results generalize, although a more systematic evaluation on this larger dataset is needed. The ability to identify latent genres and to model literary macroevolution highlights the potential of computational thematics.
Conclusion
This study provides a systematic comparison of algorithms for computational thematic analysis, demonstrating the effectiveness of unsupervised learning for identifying thematic similarities in fiction and offering recommendations for optimal algorithm selection. Future research should focus on broader testing of vector models, more sophisticated text simplification techniques, and application to different corpora and languages to further validate these findings and improve the robustness of computational thematics.
Limitations
The study's reliance on genre tags as a "ground truth" proxy is a limitation, as genre definitions are inherently fluid and subjective. Future work could utilize large-scale user-generated tags for improved validation. The study's inability to explicitly distinguish between thematic and formal elements of texts is another limitation, though the chosen methods may capture some formal variation implicitly. The scope of algorithms and parameters investigated is not exhaustive, and more extensive investigation is required.