Computational thematics: comparing algorithms for clustering the genres of literary fiction

O. Sobchuk and A. Šeļa

This study by Oleg Sobchuk and Artjoms Šeļa examines how unsupervised learning algorithms can be used to cluster literary genres automatically. The authors compare combinations of text preprocessing, feature extraction, and distance measurement methods on a corpus of books from four pre-tagged genres, and identify which combinations recover the known genre labels best and worst.

Abstract
This paper compares various algorithms for unsupervised learning of thematic similarities between literary texts, focusing on their application in automatically clustering book genres. The algorithms are categorized into three steps: text preprocessing, feature extraction, and distance measurement. The study tests numerous combinations of these steps using a corpus of books from four pre-tagged genres (detective fiction, science fiction, fantasy, and romance). Clustering performance is validated against the known genre labels, identifying the best and worst combinations. The difference between the best and worst methods is illustrated by clustering 5000 random novels from the HathiTrust corpus.
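The three steps named in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' setup: the four toy sentences, the lowercase tokenizer, the most-frequent-word features, cosine distance, and the nearest-neighbour agreement score, which stands in for the paper's validation against genre labels.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Step 1 (assumed variant): lowercase and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_features(token_lists, vocab_size=50):
    """Step 2 (assumed variant): relative frequencies of the
    corpus-wide most frequent words."""
    totals = Counter(tok for toks in token_lists for tok in toks)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        n = len(toks) or 1  # avoid division by zero for empty texts
        vectors.append([counts[word] / n for word in vocab])
    return vectors


def cosine_distance(u, v):
    """Step 3 (assumed variant): 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def nearest_neighbour_agreement(vectors, labels, distance):
    """Simplified validation: share of texts whose nearest neighbour
    carries the same genre label."""
    hits = 0
    for i, u in enumerate(vectors):
        others = (k for k in range(len(vectors)) if k != i)
        j = min(others, key=lambda k: distance(u, vectors[k]))
        hits += labels[i] == labels[j]
    return hits / len(vectors)


# Hypothetical toy corpus; real experiments would use full books.
texts = [
    "The detective examined the body and questioned the suspect about the murder.",
    "A murder at the manor: the inspector questioned every suspect.",
    "The starship left orbit and the crew scanned the alien planet.",
    "On the alien planet the crew repaired the starship engine.",
]
labels = ["detective", "detective", "scifi", "scifi"]

vectors = extract_features([preprocess(t) for t in texts])
score = nearest_neighbour_agreement(vectors, labels, cosine_distance)
print(f"nearest-neighbour agreement: {score:.2f}")
```

Swapping any one function for an alternative (a different tokenizer, feature set, or distance) and re-scoring is the kind of combination-by-combination comparison the abstract describes, although on a far smaller scale.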
Publisher
Humanities and Social Sciences Communications
Published On
Mar 20, 2024
Authors
Oleg Sobchuk, Artjoms Šeļa
Tags
unsupervised learning
clustering
literary texts
genre classification
distance measurement
feature extraction
text preprocessing