Multi-class identification of tonal contrasts in Chokri using supervised machine learning algorithms

Linguistics and Languages


A. Gope, A. Pal, et al.

This study by Amalesh Gope, Anusuya Pal, Sekholu Tetseo, Tulika Gogoi, Joanna J, and Dinkur Borah applies supervised machine learning algorithms to identify tonal contrasts in the endangered Chokri language. With accuracy reaching 95–97% for male speakers, the research demonstrates the potential of these algorithms for analyzing tonal languages and points toward broader applications.

Introduction
The paper investigates how supervised machine learning can identify a complex five-way tonal contrast in Chokri, an under-documented and endangered Tibeto-Burman language from Nagaland, India. The linguistic context centers on tones—pitch patterns that distinguish word meanings—operationalized through fundamental frequency (f0) properties. Traditional phonetic analyses rely on acoustic measures (f0 height, direction, slope) with statistical testing to establish tonal contrasts. Prior work on Chokri suggests four level tones (extra high, high, mid, low) and one contour tone (mid-rising), with duration and intensity being non-significant. The study’s purpose is to compare several supervised MLAs (LR, DT, RF, SVM, KNN, NB) and a neural network (ANN) to classify these tones from acoustic features, evaluate model performance comprehensively, and assess which features (f0 height and/or directionality) matter, with attention to speaker gender. The broader importance includes advancing tools for analyzing complex tonal systems and contributing to documentation and preservation of Chokri.
Literature Review
The paper situates its work within a growing use of MLAs in tone research and broader linguistic tasks. Prior studies have applied SVM, DNN, 1D-CNN, boosted methods, RF, KNN, and regression to tone recognition and related prosodic classification in languages such as Mizo, Dharmashala Tibetan, Mandarin, Cantonese, English, and Yukuna. These works show that MLAs can outperform traditional statistical modeling for complex, high-dimensional acoustic data, though performance varies by algorithm and feature set. The paper also reviews general MLA categories (linear, non-linear, ensemble, and deep learning) and emphasizes that dataset characteristics and feature representation guide algorithm choice. It highlights gaps in consolidated methodological guidance for tonal classification, motivating a systematic comparison across multiple MLAs and features for Chokri.
Methodology
Data and recording: Eight monosyllabic toneme sets with five-way lexical contrasts (40 lexical items in total) were elicited. Speakers produced a priming sentence containing the target, then a fixed carrier sentence (“Repeat X again”) with the target in sentence-medial position, and finally the target in isolation. Recordings were made with a Tascam DR-100MKII recorder and a Shure SM10A-CN headset microphone at 44.1 kHz, 32-bit, with a mouth-to-mic distance of ~25 mm.

Participants: Seven native Chokri speakers (5 female, 2 male), aged 19–39, from Thipüzu village, Phek district, Nagaland. All speak Chokri as their L1 and are fluent in English and Nagamese, with no reported speech or hearing issues. Informed consent was obtained and an honorarium provided.

Tokens: 8 toneme sets × 5 meanings × 7 subjects × 5 repetitions × 11 time points = 15,400 tokens analyzed.

Annotation and measures: Target words were manually annotated in Praat with tiers for f0, duration, and intensity; acoustic measures were extracted with VoiceSauce. Duration and intensity were excluded from the MLA analyses because they were non-significant for the realization of tonal contrasts.

Preprocessing and normalization: f0 trajectories were sampled at 11 equidistant points (0%–100%). Raw f0 values were converted to z-scores following Adank et al. (2004) to reduce inter- and intra-speaker variability: z = (f0_i − mean(f0_all)) / sd(f0_all). Outliers were removed, and features were scaled by z-normalization.

Feature sets: Two configurations were tested: (1) f0 directionality alone (the 11-point z-score trajectory), and (2) f0 height combined with f0 directionality.

Data handling and split: Data manipulation was done in Python with Pandas and NumPy, and supervised learning with scikit-learn. A 70:30 train/test split used GroupShuffleSplit so that tokens from the same subject never appear in both the training and test sets (random_state = 0 for reproducibility). Cross-validation was applied during model selection.

Algorithms and implementations:
- KNN: KNeighborsClassifier(n_neighbors=5, weights='distance', metric='minkowski', p=2).
- Naive Bayes: GaussianNB(); BernoulliNB was tested but excluded due to low accuracy (~56%).
- Decision Tree: DecisionTreeClassifier(criterion='entropy', max_depth=None).
- Random Forest: RandomForestClassifier(n_estimators=100, criterion='entropy'); n_estimators values of 10, 50, and 100 were tested, with the best performance at 100.
- Logistic Regression: LogisticRegression(multi_class='multinomial', solver='newton-cg').
- SVM: SVC tuned with GridSearchCV over kernels (linear, poly of degree 3, RBF) and parameters gamma ∈ {1, 0.001, 0.0001} and C ∈ {1, 10, 100, 1000}; the optimum was a linear kernel (gamma='scale').
- ANN: a Keras model with three Conv2D layers (32 filters, 3×3, ReLU), each followed by MaxPooling2D(2×2); Dropout(rate=0.2); Flatten; and a Dense softmax output over 5 classes. Compiled with the Adam optimizer, categorical_crossentropy loss, and the accuracy metric; validation split of 10%; epochs tuned until convergence without over- or underfitting.

Evaluation metrics: Confusion matrices (per-class TP, FP, FN, TN); ROC curves and AUC under a One-vs-Rest strategy; and accuracy, precision, recall, and F1-score with micro/macro averaging via scikit-learn. Gender-specific analyses (female vs. male) were performed because of f0 range differences.

Visualization: Normalized mean f0 directionality curves were plotted per speaker and averaged across speakers to validate the tonal patterns (EH, H, M, L, MR).
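The z-score normalization described above (after Adank et al. 2004) can be sketched as follows; the function and variable names are illustrative, not from the study's own code, and the data here are synthetic:

```python
import numpy as np

def z_normalize_f0(f0_track, speaker_f0_all):
    """Convert raw f0 values (Hz) to speaker-specific z-scores.

    Implements z_i = (f0_i - mean(f0_all)) / sd(f0_all), where f0_all
    pools every f0 sample produced by that speaker, reducing inter- and
    intra-speaker variability before classification.
    """
    mu = np.mean(speaker_f0_all)
    sigma = np.std(speaker_f0_all)
    return (np.asarray(f0_track, dtype=float) - mu) / sigma

# Example: an 11-point f0 trajectory (0%-100% of the rhyme) for one token.
track = np.linspace(180.0, 220.0, 11)  # a rising contour, in Hz (made-up values)
pooled = np.random.default_rng(0).normal(200.0, 30.0, 5000)  # this speaker's pooled f0
z_track = z_normalize_f0(track, pooled)
```

By construction, a speaker's pooled f0 samples have mean 0 and standard deviation 1 after this transformation, so trajectories from low-pitched and high-pitched speakers become directly comparable.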
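The leakage-free split and the Random Forest configuration reported above can be sketched with scikit-learn. The data below are random stand-ins with the paper's shapes (11-point trajectories, 5 tone classes, 7 speakers); only the GroupShuffleSplit and RandomForestClassifier settings come from the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the study's data: values are random noise.
rng = np.random.default_rng(0)
n_tokens = 700
X = rng.normal(size=(n_tokens, 11))         # 11 equidistant z-scored f0 points per token
y = rng.integers(0, 5, size=n_tokens)       # tone labels EH, H, M, L, MR -> 0..4
groups = rng.integers(0, 7, size=n_tokens)  # speaker IDs, used to prevent leakage

# 70:30 split that keeps each speaker's tokens entirely in train OR test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# Random Forest with the settings reported in the paper.
clf = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
clf.fit(X[train_idx], y[train_idx])
macro_f1 = f1_score(y[test_idx], clf.predict(X[test_idx]), average='macro')
```

Because the features here are pure noise, macro_f1 will hover near chance; the point of the sketch is the grouped split, which guarantees the test speakers are unseen during training.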
Key Findings
- Tone system confirmation: Visualized normalized f0 directionality supports a 5-way contrast in Chokri: four level tones (EH, H, M, L) and one contour tone (MR). Male f0 range ~90–200 Hz; female ~140–300 Hz.
- Traditional MLAs (female data; confusion matrices): Consistently high accuracy for EH (~91%) and L (≥94–97%), and good accuracy for MR (~83–94%, except NB at ~75%). H and M are harder (H as low as 42% with DT; M ~64–85% depending on the MLA).
- Traditional MLAs (male data; confusion matrices): Overall higher accuracies; several MLAs reach 100% for H and L; MR is typically ~93% (except NB at 80%); M is commonly ~87–93%; EH often ≥86–100%.
- ROC/AUC: For females, RF and SVM show high AUCs across tones (e.g., RF: EH 0.99, H 0.94, M 0.96, L 1.00, MR 0.99). DT and LR are weaker for H (AUC ~0.84 and ~0.73) and M (~0.75 and ~0.69). For males, RF and KNN achieve AUC = 1.0 for all tones; DT, NB, and SVM remain strong (AUC > 0.92); LR is weaker for M (AUC ~0.76).
- ANN performance: Females: L 97% (3% confused with M), EH 86%, H 78%, M 73%, MR 89%. Males: EH and M 100%; L and MR 94%, with small confusions (e.g., 6% L→M, 6% MR→EH). Training/validation curves indicate a good fit without under- or overfitting.
- F1-scores (Table 1 averages):
  - Females (f0 height + direction): DT 0.8308; KNN 0.8417; LR 0.8558; NB 0.8194; RF 0.8768; SVM 0.8620; ANN 0.8447.
  - Males (f0 height + direction): DT 0.9203; KNN 0.9465; LR 0.9459; NB 0.9325; RF 0.9733; SVM 0.9459; ANN 0.9577.
  - Females (direction only): DT 0.7987; KNN 0.8235; LR 0.8334; NB 0.8146; RF 0.8719; SVM 0.8564; ANN 0.8476.
  - Males (direction only): Identical to the combined-feature scores for most MLAs (e.g., RF 0.9733, KNN 0.9465, LR 0.9459); DT improves by ~2% when height is added.
- Best overall performer: Random Forest consistently outperforms the others (F1 ~0.877 for females and ~0.973 for males), challenging the assumption of neural network supremacy.
- Feature importance by gender: For females, combining f0 height with directionality improves performance across MLAs; for males, f0 directionality alone suffices, with no notable gain from adding height.
- Overall accuracy range: MLAs achieve 84–87% for females and 95–97% for males in classifying the five tones.
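The One-vs-Rest AUC and micro/macro F1 metrics reported above can be computed with scikit-learn as follows. The labels and probability scores here are simulated for illustration (they are not the paper's data), but the metric calls mirror the study's evaluation setup:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Simulated 5-class predictions: probability mass is boosted on the true
# class so the classifier looks competent, then rows are renormalized.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 5, size=300)
proba = rng.random((300, 5))
proba[np.arange(300), y_true] += 1.0
proba /= proba.sum(axis=1, keepdims=True)

# One-vs-Rest AUC, as used in the paper's ROC analysis.
auc_ovr = roc_auc_score(y_true, proba, multi_class='ovr')

# Micro vs. macro averaged F1 on the hard predictions.
y_pred = proba.argmax(axis=1)
f1_micro = f1_score(y_true, y_pred, average='micro')
f1_macro = f1_score(y_true, y_pred, average='macro')
```

Macro averaging weights every tone class equally, which matters when some tones (e.g., H and M in the female data) are much harder than others; micro averaging instead reflects overall token-level accuracy.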
Discussion
The study addresses whether supervised MLAs can robustly identify a complex five-way tonal system in an under-documented language and which algorithms/features are most effective. Results show that ensemble methods (RF) deliver the strongest and most consistent performance, with SVM, KNN, and ANN close behind depending on gender and tone. While ANN performs well, it does not universally outperform traditional MLAs given the dataset size, feature representation, and class structure—highlighting that model selection should be data- and task-dependent. Confusion matrices and ROC/AUC analyses clarify tone-specific challenges: H and M are harder to separate, especially in female data, while L and EH are reliably classified. The gender-specific f0 range differences appear to impact model separability; for males, directionality alone captures sufficient tonal information, whereas female data benefit from the addition of f0 height. Comprehensive metric evaluation (beyond average accuracy) supports RF as the preferred traditional MLA, with ANN providing comparable performance, especially for male data. These findings validate the feasibility of machine learning for fine-grained tonal classification and contribute to the phonetic-phonological understanding of Chokri while informing best practices for MLA selection and feature engineering in tonal research.
Conclusion
This work provides the first comprehensive MLA-based production study establishing five-way tonal contrasts in Chokri and systematically compares six traditional MLAs with an ANN using f0 directionality and height features. Random Forest emerges as the most effective algorithm overall, achieving F1-scores of ~0.88 (females) and ~0.97 (males), while ANN performs comparably but does not universally surpass traditional approaches. The study reveals a gender-specific feature effect: combining f0 height and directionality enhances performance for females, whereas directionality alone is sufficient for males. The methodology and insights generalize to other multi-class classification tasks in linguistics (e.g., phoneme classification) and beyond (image processing, medical diagnosis, computer vision, social network analysis). Future work could broaden participant pools, include additional lexical contexts and syllable types, explore more prosodic and spectral features, conduct cross-speaker/domain adaptation, and test advanced neural architectures and calibration techniques to further improve tone discrimination and interpretability.
Limitations
- Limited participant pool (7 speakers: 5 female, 2 male) may constrain generalizability.
- Only eight monosyllabic toneme sets and a fixed elicitation paradigm were used; broader lexical and prosodic contexts were not examined.
- Duration and intensity were excluded after initial tests indicated non-significance; other potentially informative features were not explored.
- The study cannot conclusively determine whether male speakers' lower f0 ranges advantage MLAs; the observed gender effects in feature utility (height vs. directionality) warrant further investigation.
- Although GroupShuffleSplit mitigated subject leakage, real-world generalization to unseen speakers and recording conditions remains to be tested more extensively.