Humanities

Computational thematics: comparing algorithms for clustering the genres of literary fiction

O. Sobchuk and A. Šeļa

This study by Oleg Sobchuk and Artjoms Šeļa examines how unsupervised learning algorithms can be used to cluster literary genres automatically. By comparing combinations of text pre-processing, feature extraction, and distance measurement methods on a corpus of fiction in four pre-tagged genres, the authors identify which combinations work best and worst for genre clustering.

Abstract

What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call "computational thematics". These algorithms belong to three steps of analysis: text pre-processing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options. Every combination of algorithms is given a task to cluster a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the "ground truth" genre labels. Such comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
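The evaluation loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the toy corpus, the stopword list, the two options per step, and the average-linkage clustering are all assumptions made for the example. The abstract also does not name the validation measure, so the adjusted Rand index is assumed here as one common way to compare a clustering with ground-truth labels.

```python
from collections import Counter
from itertools import combinations, product
from math import comb, sqrt

# Hypothetical toy corpus of (text, genre) pairs; not data from the paper.
CORPUS = [
    ("The detective examined the murder weapon and the suspect", "detective"),
    ("A detective questioned the suspect about the murder", "detective"),
    ("The starship crew landed on the alien planet", "scifi"),
    ("An alien starship orbited the distant planet", "scifi"),
]
STOPWORDS = {"the", "a", "an", "and", "on", "about"}  # assumed list

# Step 1: text pre-processing options.
def keep_all(text):
    return text.lower().split()

def drop_stopwords(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Step 2: feature extraction options.
def counts(tokens):
    return dict(Counter(tokens))

def binary(tokens):
    return {w: 1 for w in set(tokens)}

# Step 3: distance options between two sparse feature dicts.
def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return 1 - dot / norm

def manhattan(u, v):
    return sum(abs(u.get(w, 0) - v.get(w, 0)) for w in set(u) | set(v))

def cluster(dist, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(pair):
        a, b = pair
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def adjusted_rand_index(truth, pred):
    """Chance-corrected agreement between two labelings."""
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(len(truth), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def evaluate_all(corpus, k):
    """Score every combination of pre-processing, features, and distance."""
    truth = [genre for _, genre in corpus]
    results = {}
    for prep, feat, dist_fn in product(
        (keep_all, drop_stopwords), (counts, binary), (cosine, manhattan)
    ):
        vectors = [feat(prep(text)) for text, _ in corpus]
        dist = [[dist_fn(u, v) for v in vectors] for u in vectors]
        key = (prep.__name__, feat.__name__, dist_fn.__name__)
        results[key] = adjusted_rand_index(truth, cluster(dist, k))
    return results

results = evaluate_all(CORPUS, k=2)
for combo, score in sorted(results.items(), key=lambda item: -item[1]):
    print(combo, round(score, 3))
```

Sorting the scores ranks the combinations from best to worst, which mirrors the comparison the abstract describes. On this toy corpus, stopword removal with raw counts and cosine distance recovers both genres exactly; real corpora would need far richer options at each step.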

Publisher

Humanities and Social Sciences Communications

Published On

Mar 20, 2024

Authors

Oleg Sobchuk, Artjoms Šeļa

DOI

https://doi.org/10.1057/s41599-024-02933-6

