Introduction
Many languages use pitch variation, in the form of tones, to distinguish word meanings. Tone languages such as Chokri, a Tibeto-Burman language spoken in Nagaland, India, rely heavily on these contrasts. Traditional tone analysis typically examines acoustic features (fundamental frequency (f0), duration, intensity) and applies statistical models. Machine learning algorithms (MLAs) offer a potentially more powerful alternative: they can identify complex patterns in large datasets, which makes them suitable for analyzing intricate tonal systems. This study evaluates how effectively various MLAs, both traditional and neural network-based, identify Chokri's five-way tonal contrasts. It also explores f0 directionality, alongside f0 height, as a classification feature. Chokri is under-documented and endangered, which adds urgency to this research and may lay the groundwork for a larger corpus and for language preservation efforts. The study follows a systematic methodology, from data collection and preprocessing to MLA implementation, evaluation, and comparison, so that the algorithms' abilities to classify Chokri tones can be compared transparently.
Literature Review
Previous research on tone recognition has used various MLAs, such as Support Vector Machines (SVMs) and Deep Neural Networks (DNNs), to analyze tonal systems in languages like Mizo. Studies have explored different acoustic features (f0, duration, intensity) and compared the effectiveness of different algorithms across languages, including Mandarin, Cantonese, English, and Yukuna. However, a systematic comparison of multiple MLAs applied to a complex five-way tonal system like Chokri's is lacking. This study addresses that gap by comparing traditional MLAs (linear, non-linear, ensemble-based) against an artificial neural network (ANN), assessing their strengths and weaknesses on Chokri's tonal system.
Methodology
A production experiment was conducted with seven native Chokri speakers (five females, two males) who produced eight monosyllabic toneme pairs with five-way meaning contrasts. Speech data were recorded using a high-quality recorder and microphone, with a focus on minimizing noise interference. The target words were manually annotated in Praat, and acoustic measurements (f0, duration, intensity) were extracted using VoiceSauce. Duration and intensity were found to be non-significant, so only f0 data were used in further analysis. F0 values were converted to z-scores to account for inter- and intra-speaker variability. The dataset was split into training (70%) and testing (30%) sets using a GroupShuffleSplit method to keep the two sets independent. Seven MLAs were used: K-Nearest Neighbors (KNN), Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Artificial Neural Network (ANN). KNN used Euclidean distance with the five nearest neighbors. NB was a Gaussian Naive Bayes model. DT used entropy as its splitting criterion, as did RF, which used 100 trees. LR was a multinomial logistic regression, and SVM used a linear kernel. The ANN consisted of three convolutional layers, max pooling layers, a dropout layer, a flattening layer, and a fully connected output layer. Model performance was evaluated using confusion matrices, ROC curves, AUC values, accuracy, precision, recall, and F1-scores (micro- and macro-averaged).
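The two preprocessing steps described above, z-score normalization and a group-aware 70/30 split, can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' code. The summary does not say how the z-scores were grouped or what the split groups were, so normalizing per speaker and grouping by a `token` identifier are both assumptions, as are the function names and the toy f0 values. In practice a library splitter such as scikit-learn's `GroupShuffleSplit` would normally be used instead of the hand-written `group_split`.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_speaker(rows):
    """Add an 'f0_z' field: f0 standardized within each speaker (assumed grouping)."""
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["speaker"]].append(row["f0"])
    stats = {spk: (mean(vals), pstdev(vals)) for spk, vals in by_speaker.items()}

    normalized = []
    for row in rows:
        mu, sd = stats[row["speaker"]]
        z = (row["f0"] - mu) / sd if sd > 0 else 0.0
        normalized.append({**row, "f0_z": z})
    return normalized


def group_split(rows, group_key, test_fraction=0.3, seed=0):
    """Split rows so that no group appears in both the training and test sets."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Toy data (invented values): two speakers, five tokens each, two f0 samples per token.
rows = [
    {"speaker": spk, "token": f"{spk}-{t}", "f0": base + 5.0 * t + k}
    for spk, base in (("F1", 210.0), ("M1", 120.0))
    for t in range(5)
    for k in range(2)
]

normalized = zscore_by_speaker(rows)
train, test = group_split(normalized, group_key="token")
print(len(train), len(test))
```

Splitting by group rather than by individual sample means that measurements taken from the same token never end up on both sides of the split, which is one way to keep the test set independent of the training set.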
Key Findings
The Random Forest (RF) algorithm consistently outperformed the other MLAs in classifying Chokri's five tones, scoring highest across the reported metrics. In particular, RF achieved significantly higher F1-scores and AUC values than the other algorithms, especially for male speakers. Average accuracy ranged from 84% to 87% for female speakers and from 95% to 97% for male speakers. Combining f0 height and directionality as features improved classification accuracy for female speakers, while f0 directionality alone was sufficient for male speakers. This suggests a gender-based difference in how these acoustic features contribute to tonal distinctions. The ANN model performed well but did not significantly surpass the accuracy of RF. The confusion matrices showed which tones were consistently well classified and which were more challenging. Overall, all seven MLAs classified the Chokri tones with a high degree of accuracy, particularly for male speakers. The ROC curves confirmed the superior performance of RF in most scenarios, with AUC values very close to 1.
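To make the distinction between the two feature types concrete, the sketch below builds "height" and "directionality" features from a single f0 contour. The summary does not define how directionality was computed, so this is only one plausible reading: height is taken as the z-scored f0 values at equally spaced time points, and directionality as the differences between successive points. The function names and the toy contours are hypothetical.

```python
def height_features(contour):
    """f0 height: the (z-scored) f0 values themselves."""
    return list(contour)


def direction_features(contour):
    """f0 directionality (assumed definition): change between successive points."""
    return [b - a for a, b in zip(contour, contour[1:])]


def combined_features(contour):
    """Height and directionality concatenated into one feature vector."""
    return height_features(contour) + direction_features(contour)


# Toy z-scored contours sampled at five points (invented values).
rising = [-1.0, -0.5, 0.0, 0.5, 1.0]
high_level = [1.0, 1.0, 1.0, 1.0, 1.0]
low_level = [-1.0, -1.0, -1.0, -1.0, -1.0]

print(direction_features(rising))
print(direction_features(high_level) == direction_features(low_level))
print(combined_features(high_level) == combined_features(low_level))
```

Under this toy definition, two perfectly level contours at different heights have identical directionality features and differ only in height. That is a property of the illustration, not a claim about the Chokri data, but it shows why the two feature sets can carry different information for a classifier.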
Discussion
The findings challenge the assumption that ANNs always outperform traditional MLAs. Here RF performed best, which highlights the importance of considering the specific characteristics of the dataset and the complexity of the classification task. The gender-based differences in feature importance indicate that such factors should be considered when designing and implementing MLAs for tone recognition. The high accuracy rates demonstrate the feasibility of using MLAs to analyze complex tonal systems in under-resourced languages. The methodology can be applied to other tonal languages, contributing to language documentation and preservation efforts. Future research could explore additional acoustic features, test different MLA architectures, and extend the approach to other under-resourced languages.
Conclusion
This study employed several MLAs to analyze the complex five-way tonal system of Chokri. Random Forest emerged as the most effective algorithm, achieving high accuracy rates, particularly for male speakers. The finding that f0 directionality alone sufficed for male speakers, while combining f0 height and directionality improved performance for female speakers, highlights gender-specific nuances in acoustic feature importance for tone recognition. The research challenges the prevalent assumption of neural network superiority in tonal classification. The successful application of machine learning to an under-documented language demonstrates its utility in linguistic research and language preservation efforts. Future research can explore more sophisticated models and additional acoustic features to further refine the analysis.
Limitations
The relatively small number of speakers (seven) may limit the generalizability of the findings. Although efforts were made to control for noise, variations in recording conditions could have affected the results. The study focused primarily on monosyllabic words; further research is needed to examine how MLAs perform on polysyllabic words and in connected speech. The models use only f0 values as input features; adding further features (e.g., duration, intensity, formant transitions) might enhance performance.