Computer Science
Emotion classification for short texts: an improved multi-label method
X. Liu, T. Shi, et al.
Xuan Liu and colleagues propose an improved multi-label emotion classification method for short texts such as tweets. Their adjusted MLkNN classifier combines in-sentence, adjacent-sentence, and full-text features, and its iterative correction mechanism improves both accuracy and recall.
~3 min • Beginner • English
Introduction
The fast-growing volume of electronic documents in the digital environment offers a new source for better understanding of, and services to, online users. More attention has been paid to extracting users’ opinions towards various events from the content of texts. The process of computationally identifying and categorizing opinions expressed in a piece of text has been highlighted as the first step of data mining. Scholars first tried to extract the positive, negative, or neutral attitude toward a particular topic or product from the text (Feng et al., 2021) and recently started to further label texts with multi-dimensional emotional tendencies, such as joy, fear, and rage (Hu and Flaxman, 2018; Tasmin, 2018). Various methods have been applied to classify the emotions of texts, but methods based on machine learning have attracted the most attention (Chen and Zhang, 2018). Although earlier methods based on dictionaries of emotions allow the segmentation and classification of words for the analysis of complicated emotions, preparing a dictionary of emotions is labor-intensive and time-consuming and can hardly keep up with the fast emergence of new words (Ai et al., 2018). Machine-learning algorithms, by contrast, recognize emotional words in texts automatically and thus achieve classification more quickly. However, the sequential process that machine-learning algorithms follow inevitably prevents them from labeling multiple emotions and lets earlier steps heavily affect later ones (Ullah et al., 2022). Problems such as the decline of classifier performance as emotion categories are refined, the neglected relationship between individual sentences and the whole text, and the difficulty of recognizing complex human emotions also push scholars to keep adjusting these algorithms to enhance their performance.
One of the key directions for improving machine-learning algorithms for emotion classification is computational multi-label classification. Multi-label classification means that an instance can be assigned to multiple categories at the same time; that is, it can be marked by multiple labels. In practical applications, the semantics of real objects or real texts are often not unique, which creates the need for multi-label learning. Some pioneering research, mainly based on newly proposed emotional dictionaries, has made remarkable attempts in the field of multi-label classification (Yang et al., 2014; Liu and Chen, 2015). However, classifying emotions with multiple labels remains very difficult for most algorithms. Compared with SVM and Bayesian algorithms, K-Nearest Neighbors (KNN) algorithms perform best as multi-label classifiers and are easier to construct (Keshtkar and Inkpen, 2012). The problem remains that iterative corrections cannot be achieved for emotion classification, even with kNN algorithms.
In response to the above knowledge gaps, this study adjusts the Multi-label K-Nearest Neighbors (MLkNN) classifier to consider not only individual in-sentence features but also features in adjacent sentences and the full text of the tweet. The study also considers the interaction between labels and iteratively updates the overall classification results. These adjustments allow iterative corrections of the multi-label emotion classification and can improve the accuracy of emotion classification for short texts. Tweets are chosen in this study as a representative source of short texts. Among all text classification tasks, short text classification is a special subdomain of increasing importance. Since people now more frequently use short sentences to express opinions or share ideas with others, short text classification has become essential in author recognition, spam filtering, sentiment analysis, Twitter personalization, customer reviews, and other applications related to social networks (Liang et al., 2020). There is therefore an expanding need for sentiment analysis of short texts on the internet.
The rest of the study is organized as follows: existing emotion classification methods are summarized in the “Literature Review” section. The adjustments to the MLkNN emotion analysis method are described in the “Methodology” section, followed by the experiments and results in “Key Findings”. Lastly, the discussion and conclusions are presented in the “Discussion” and “Conclusion” sections, respectively.
Literature Review
Human emotion has been a research hotspot for scholars since ancient times. In the information era, emotional signals and sentiment tendencies attract additional attention as scholars extract textual features from the content of online texts to support a better understanding of, and better services for, users. The specific task of sentiment classification is to identify the subjective views expressed in a given text and judge its emotional tendencies (Rajabi et al., 2020; Li et al., 2016; Fei et al., 2020). As the understanding of texts has accumulated, two types of emotion classification have been highlighted. One classifies emotions according to emotional polarity, that is, positive, negative, or neutral attitudes. The other classifies emotions according to emotional tendencies, generally following the emotion wheel proposed by the American psychologist Plutchik (Hu and Flaxman, 2018; Tasmin, 2018). The introduction of emotional tendencies increases the granularity of emotion classification and motivates a review of emotion classification methods.
Among emotion classification methods, the most widely used and best-performing ones are based either on dictionaries of emotions or on machine learning (Chen and Zhang, 2018). Emotion classification by a dictionary of emotions is a classical method with both theoretical and practical achievements (Ai et al., 2018). Its main implementation steps are segmenting the words in the text to be classified and then performing keyword matching and other operations on these words to realize emotion classification. Ma et al. (2005) first applied the dictionary-based method to an instant messaging system. On this basis, Aman and Szpakowicz (2007) proposed a classification method that adds an emotion intensity knowledge base to the original dictionary and achieved an accuracy of more than 66% in the emotion classification task on a blog corpus. Paltoglou and Thelwall (2012) used the dictionary-of-emotions method to account for negative words, capital letters, emotional polarity, and their strength changes in the linguistic field. The accuracy of this method reached 86.5% when applied to short texts on platforms such as Twitter and MySpace. Taboada et al. (2011) further expanded the dictionary with emotional features and topic-related features of the text. The accuracy of the improved method on a Twitter corpus reached 85.6%.
Due to the long training time of many networks, some researchers have defined several dictionaries related to emotional words, such as attitude, negation, degree, and connective dictionaries. More complex methods have also emerged in recent years, such as rule-based emotion classification (Yan et al., 2018), classification based on mutual information (Liu et al., 2021), emotion classification based on physiological signals (Shu et al., 2018), and upgraded neural networks (Tang et al., 2021). Classification based on a dictionary of emotions can reflect the unstructured features of the text and makes high use of emotional words, but its problems are also obvious: corpus resources are scarce, emotional words are updated infrequently, and new or deformed words cannot be captured. Ideal classification requires high coverage and labeling accuracy of emotional words in the dictionary. Moreover, dictionaries are highly dependent on domain, time, language, and other conditions, and are thus difficult to expand.
In recent years, the rapid development of machine-learning methods has offered a new way to classify emotions in text. Two types of schemes have been applied: supervised and semi-supervised. The features commonly used for emotion classification in existing supervised learning schemes include word-level, sentence-level, and chapter-level features (Dogan and Uysal, 2020). Keshtkar and Inkpen (2012) adopted multi-level analysis to examine bloggers’ moods and achieved stratified emotion analysis for more than 100 mood labels. Semi-supervised learning schemes differ from supervised learning in that they do not require a large number of labeled samples. Semi-supervised learning can utilize large numbers of unlabeled samples, which improves classifier performance and reduces dependence on sample sets. Existing semi-supervised learning methods in the field of emotion classification include schemes based on multinomial Bayes, discrete binary semi-supervised learning, the Emoji space model, and dual-view label propagation (Sintsova et al., 2014). The main advantages of this approach are that it does not depend on a large number of labeled samples and can easily obtain many newly labeled data as training samples through learning; it performs well when labeled datasets are scarce. However, its disadvantages are also obvious: it is very sensitive to the results of the first round of classification, and samples that are misclassified in the first round greatly reduce the accuracy of the second round.
New neural networks have also been applied as feature extractors. Liao et al. (2021) proposed a novel two-stage fine-grained text-level sentiment analysis model based on syntactic rule matching and deep semantics. Combining the multi-head attention mechanism of the Transformer, Lou et al. (2020) proposed a fusion model of a convolutional neural network and a hierarchical attention coding network to avoid the sequential processing of Recurrent Neural Networks (RNNs), which were widely used as feature extractors for fine-grained sentiment analysis. The self-attention-based Bidirectional Long Short-Term Memory (BiLSTM) model with aspect-item information for fine-grained sentiment classification of short texts introduced by Xie et al. (2019) allowed effective use of contextual information and semantic features. A recent study of a bidirectional convolutional RNN adopted bidirectional feature extraction to group features and enhanced the important features in each group while weakening the less important ones to improve classification accuracy (Onan, 2022). Jiang et al. (2022) combined Bidirectional Encoder Representations from Transformers (BERT), BiLSTM, and a Text Convolutional Neural Network (TextCNN) into a new model, capturing local correlations in context while achieving high accuracy and stability. However, machine-learning methods still face the following defects: they often rely excessively on manually labeled corpora and cannot achieve good results when the sample set is small, and unsupervised learning remains scarce in the field of sentiment analysis.
Recently, scholars have begun to realize that an individual’s real emotions often cannot be accurately restored and analyzed without considering the multiple emotions contained in a text (Siriwardhana et al., 2020; Sadr et al., 2019; Liang et al., 2019). Multi-label learning originated from the investigation of text classification problems, where each document may belong to several predefined topics simultaneously. In multi-label learning, the training set is composed of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances by analyzing training instances with known label sets (Zhang and Zhou, 2007). Yang et al. (2014) proposed a small dictionary that considers not only text but also graphic emoticons and punctuation marks, and classified the multi-label sentiment of a Weibo corpus. This classification achieved a relatively high accuracy rate and played an active role in the analysis of public opinion on the Malaysia Airlines crash. Liu and Chen (2015) combined three emotional dictionaries to extract emotional features and the original word-segmentation features from a microblog corpus and completed a multi-label emotion classification method whose best experimental results had an average accuracy of 65.5%. In response to the imbalanced distribution of emotion categories in corpora, Li et al. (2016) adopted a multi-label maximum entropy model to analyze the relationship between words and emotions. The problem remains that iterative corrections cannot be achieved for emotion classification, so these methods can hardly keep up with the rapid changes of emotional words in the real world.
Methodology
An improved MLkNN for emotion classification of short texts. The study focuses on modifying the MLkNN algorithm by incorporating sentence-level context and tweet-level context, and by modeling label correlations to enable iterative corrections.
Workflow and multi-label setup: In multi-label classification, each instance can be assigned multiple non-exclusive labels. Let X denote the example space, L the set of all labels, and Y the label space. The task is to learn f: X → Y from training data. The paper adopts an adapted-algorithm approach based on MLkNN.
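A compact formalization of this setup in standard multi-label notation (the training-set symbols D, x_i, Y_i below are illustrative, not taken from the paper):

```latex
% Standard multi-label learning setup; D, x_i, Y_i are illustrative symbols.
\[
\mathcal{Y} = 2^{\mathcal{L}}, \qquad
\mathcal{D} = \{(x_i, Y_i)\}_{i=1}^{n}, \quad x_i \in \mathcal{X},\; Y_i \subseteq \mathcal{L}
\]
\[
f : \mathcal{X} \to 2^{\mathcal{L}} \ \text{is learned from } \mathcal{D}.
\]
```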
Overall pipeline: Tweets are segmented into sentences. Using the training set, the method learns (a) emotion transfer relationships between adjacent sentences and (b) emotion transfer relationships between sentence and overall tweet emotion. An MLkNN classifier serves as the base classifier to obtain initial sentence-level emotion predictions. Then, using learned transition probabilities, it adjusts sentence predictions based on adjacent sentences and on the overall tweet emotion. Average Precision (AVP) is evaluated to decide whether to iterate the adjustments (including a label-correlation adjustment) until convergence.
Base classifier (MLkNN): For each label l ∈ L, MLkNN estimates the prior P(Y^l = e) and the conditional P(H^l | Y^l = e), where H^l counts how many of the K nearest neighbors carry label l and e ∈ {0, 1} indicates label absence or presence. The initial prediction per label is ŷ(l) = arg max_{e∈{0,1}} P(Y^l = e)·P(H^l | Y^l = e). Priors and conditionals are computed from the training set with Laplace smoothing.
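A minimal runnable sketch of this base classifier, assuming instances are already encoded as feature vectors (e.g., unigram counts) and labels as a binary matrix; the function names and the Euclidean neighbor metric are illustrative choices, not the paper's implementation:

```python
import numpy as np

def fit_mlknn(X, Y, k=5, s=1.0):
    """X: (n, d) feature matrix; Y: (n, q) binary {0,1} label matrix."""
    n, q = Y.shape
    # Laplace-smoothed label priors P(Y^l = 1).
    prior1 = (s + Y.sum(axis=0)) / (2 * s + n)

    # For each training instance, count how many of its k nearest
    # neighbors (excluding itself) carry each label.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]          # (n, k) neighbor indices
    counts = Y[nn].sum(axis=1)                    # (n, q) neighbor label counts

    # Likelihoods P(H^l = c | Y^l = e) for c = 0..k, Laplace-smoothed.
    like1 = np.zeros((q, k + 1))                  # e = 1 (label present)
    like0 = np.zeros((q, k + 1))                  # e = 0 (label absent)
    for l in range(q):
        pos = Y[:, l] == 1
        for c in range(k + 1):
            like1[l, c] = (s + np.sum((counts[:, l] == c) & pos)) / (s * (k + 1) + pos.sum())
            like0[l, c] = (s + np.sum((counts[:, l] == c) & ~pos)) / (s * (k + 1) + (~pos).sum())
    return {"X": X, "Y": Y, "k": k, "prior1": prior1, "like1": like1, "like0": like0}

def predict_mlknn(model, x):
    """Posterior P(Y^l = 1 | H^l) for each label l of a new instance x."""
    dist = np.linalg.norm(model["X"] - x, axis=1)
    nn = np.argsort(dist)[: model["k"]]
    c = model["Y"][nn].sum(axis=0).astype(int)    # neighbor count per label
    q = model["Y"].shape[1]
    p1 = model["prior1"] * model["like1"][np.arange(q), c]
    p0 = (1 - model["prior1"]) * model["like0"][np.arange(q), c]
    return p1 / (p1 + p0)                         # argmax over e <=> threshold 0.5
```

In use, `predict_mlknn(fit_mlknn(Xtr, Ytr, k=8), x)` returns one posterior per emotion, which can be thresholded at 0.5 or ranked for multi-label output.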
Adjustment using emotional transfer between adjacent sentences: After initial sentence classification, adjust predictions using learned transition probabilities between emotions of adjacent sentences (previous and next). The method estimates P(Prev emotion ε → current emotion l) and P(Next emotion ε → current emotion l) from counts in the training set. Assuming independence of transfers, the updated score for label l multiplies MLkNN posteriors by these transition probabilities for the previous and next sentences of the current sentence.
Adjustment using emotional transfer between sentence and full tweet: Similarly, estimate transition probabilities between the overall tweet emotion and sentence-level emotions. Assuming independence, incorporate P(Tweet emotion ε → sentence emotion l) into the sentence-level updated scores.
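A sketch covering both transfer adjustments, under stated assumptions: transition probabilities are estimated by Laplace-smoothed counting of adjacent sentence-label pairs in the training set, and scores are updated multiplicatively per the independence assumption. A single matrix T is reused here for brevity, although the paper estimates separate previous-sentence, next-sentence, and tweet-level relationships; marginalizing over the neighbor's label distribution (the `@ T` products) is also an illustrative choice, since the method may condition on hard labels instead.

```python
import numpy as np

def estimate_transitions(label_seqs, q, s=1.0):
    """label_seqs: per-tweet lists of sentence emotion indices (0..q-1).
    Returns T with T[e, l] ~ P(current emotion = l | neighbor emotion = e),
    estimated by Laplace-smoothed counting of adjacent sentence pairs."""
    counts = np.full((q, q), s)
    for seq in label_seqs:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def adjust_with_context(scores, T, tweet_probs=None):
    """scores: (m, q) MLkNN posteriors for the m sentences of one tweet.
    Multiplies each sentence's scores by previous-/next-sentence transfer
    and, if given, tweet-level transfer, then renormalizes."""
    m, q = scores.shape
    out = scores.copy()
    for i in range(m):
        if i > 0:                       # transfer from the previous sentence
            out[i] *= scores[i - 1] @ T
        if i < m - 1:                   # transfer from the next sentence
            out[i] *= scores[i + 1] @ T
        if tweet_probs is not None:     # transfer from overall tweet emotion
            out[i] *= tweet_probs @ T
        out[i] /= out[i].sum()
    return out

# The tweet-level emotion can then be re-aggregated, e.g.
# tweet_probs = adjusted.mean(axis=0), feeding the next iteration.
```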
Label-correlation adjustment (L-MLkNN): MLkNN assumes label independence. To model label correlations, the study adopts a second-order strategy, augmenting the label space with pairwise combinations and computing co-occurrence statistics. Using a co-occurrence matrix M of labels across instances, it computes symmetric co-occurrence scores S_{i,j} and related measures, then derives adjusted posteriors YY1(i) and YY0(i) from MLkNN priors/likelihoods weighted by the co-occurrence information. A mixing parameter α (0 ≤ α ≤ 1) combines MLkNN posteriors and correlation-adjusted posteriors: y1(i) = α·Y1(i) + (1−α)·YY1(i); y0(i) = α·Y0(i) + (1−α)·YY0(i). The tweet-level prediction is obtained by aggregating sentence-level predictions (averaging across sentences in a tweet) and iterating adjustments until AVP convergence.
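A sketch of the mixing step. The exact construction of YY1/YY0 in the paper is not reproduced, so the co-occurrence propagation below (a normalized co-occurrence matrix applied to the MLkNN posteriors) is an illustrative stand-in; α = 0.7 matches the paper's best-performing setting.

```python
import numpy as np

def cooccurrence_scores(Y):
    """Y: (n, q) binary label matrix. Returns a symmetric matrix S with
    S[i, j] measuring how often labels i and j co-occur, normalized by
    each label's own frequency (cosine-style normalization)."""
    M = Y.T.astype(float) @ Y                      # M[i, j] = co-occurrence count
    freq = np.diag(M).copy()
    S = M / (np.sqrt(np.outer(freq, freq)) + 1e-12)
    np.fill_diagonal(S, 0.0)                       # keep only cross-label correlation
    return S

def mix_posteriors(y_mlknn, S, alpha=0.7):
    """Blend per-label MLkNN posteriors y_mlknn (shape (q,)) with
    correlation-propagated scores: y(i) = alpha*Y(i) + (1-alpha)*YY(i)."""
    yy = (S @ y_mlknn) / (S.sum(axis=1) + 1e-12)   # support from correlated labels
    return alpha * y_mlknn + (1 - alpha) * yy
```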
Illustrative example: The paper shows a tweet with three short sentences: “WTF!”, “This song is groovy!!”, “I love the song soooo much!”. Although “WTF!” can be negative in isolation, considering adjacent sentences and the tweet-level emotion (all joy) corrects its label to joy.
Experimental setup: Dataset is Sentiment140. From 1.6M tweets, 8000 were selected and annotated; after filtering, 6500 tweets (11,338 sentences) remained. Each tweet has up to two labels; each sentence up to one label. Train/test split is 70/30: training 4500 tweets (7779 sentences), test 2000 tweets (3559 sentences). Evaluation is sample-based with metrics: Subset Accuracy (SA), Hamming Loss (HL), One-Error (OE), Ranking Loss (RL), Average Precision (AVP), Accuracy (AC), Precision (PR), Recall (RE), and F-score. Three experiment groups: (1) Basic MLkNN using unigram features, with/without adjacent and tweet-level features (K=5 and K=8). (2) S-MLkNN using unigram+bigram features with contextual features (K=5, 8). (3) L-MLkNN: start from unigram+bigram MLkNN, integrate adjacent and tweet-level features, then apply label-correlation adjustment with varying K and α to learn transition matrices.
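For reference, sketches of several of the sample-based metrics named above, following their standard multi-label definitions (illustrative implementations, not the paper's evaluation code):

```python
import numpy as np

def subset_accuracy(Y_true, Y_pred):
    """SA: fraction of samples whose predicted label set matches exactly."""
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))

def hamming_loss(Y_true, Y_pred):
    """HL: fraction of individual label slots predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))

def one_error(Y_true, scores):
    """OE: fraction of samples whose top-ranked label is not relevant."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(Y_true[np.arange(len(top)), top] == 0))

def average_precision(Y_true, scores):
    """AVP: mean over samples of precision averaged at each relevant
    label's rank position (higher is better)."""
    vals = []
    for y, s in zip(Y_true, scores):
        if y.sum() == 0:
            continue                                # skip samples with no labels
        order = np.argsort(-s)                      # labels ranked by score
        rel = y[order]
        prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        vals.append(np.sum(prec_at_k * rel) / rel.sum())
    return float(np.mean(vals))
```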
Key Findings
- Incorporating contextual features improves basic MLkNN substantially. With K=8 and using unigram + adjacent sentences + tweet-level features, SA increased to 0.7179 and AVP to 0.7205 (from ~0.42 with unigram only). For K=5, SA reached 0.6707 and AVP 0.6725.
- Using richer lexical features (S-MLkNN, unigram+bigram) further boosts performance when combined with adjacent and tweet-level context. Best S-MLkNN (K=8, unigram+bigram+adjacent+tweet) achieved SA=0.8103, HL=0.1926, OE=0.5459, RL=0.1553, AVP=0.8056.
- Modeling label correlations (L-MLkNN) yields the best overall results. With α=0.7 and K=8, L-MLkNN attained SA=0.8205, HL=0.0917, OE=0.5256, RL=0.1598, AVP=0.8101.
- Comparative algorithm metrics (at K=8, α=0.7): L-MLkNN achieved PR=0.8101, RE=0.8019, and F-score=0.8060 (AC=0.5854), outperforming MLkNN (PR=0.6559, RE=0.7104, F-score=0.6655) and S-MLkNN (PR=0.8056, RE=0.7324, F-score=0.7673) on overall performance, with recall notably highest at 0.8019.
- The choice of K influences results: K=8 generally outperformed K=5 but with higher training cost.
- Length-wise analysis indicates all methods perform similarly on very short texts; the improved methods (S-MLKNN, L-MLKNN) show greater gains on longer texts.
Discussion
This study compares multiple algorithms on a Twitter dataset, restricting predicted labels to a maximum of two per tweet (taking the top two predicted labels when more are returned). With K=8 and α=0.7, the improved L-MLkNN algorithm outperforms the others in overall performance, with especially strong recall. Across experiments, incorporating contextual emotion transfer (adjacent sentences and tweet-level features) and label correlations significantly improves multi-label emotion classification accuracy. However, many Twitter samples in the corpus are single-labeled; the relatively small proportion of multi-label samples limits the size of the co-occurrence probability matrix, tempering gains at the corpus level. Performance across text length shows small differences for very short texts, while the improved methods deliver markedly better F1 on longer texts. The approach remains supervised and depends on quality initial sentence-level classification; low initial accuracy can degrade final results. Future exploration of semi-/unsupervised settings, and achieving high performance with less labeled data and lower training cost, is encouraged.
Conclusion
By integrating in-sentence features with adjacent-sentence and tweet-level context, and by modeling label correlations, the paper presents an adjusted MLkNN approach that enables iterative correction for multi-label emotion classification of short texts. Experiments on Twitter show the method improves both accuracy and speed, with best results when K=8 and α=0.7; the L-MLkNN variant achieves the strongest overall performance (notably recall). The approach also works well on longer texts. Future work should focus on improving semi-supervised and unsupervised learning, and balancing efficiency with smaller training sample sizes against model completeness to handle diverse scenarios.
Limitations
- Dependence on labeled training data (supervised learning): performance is sensitive to the quality of initial sentence-level classification; poor initial predictions can cascade and harm final results.
- Limited multi-label prevalence in the dataset: the small proportion of multi-label samples leads to a limited co-occurrence matrix, constraining the benefit from label-correlation modeling.
- Small sample size relative to the full Sentiment140 corpus may not be representative, risking limited generalizability and potential sampling bias; robustness to outliers/unexpected data on larger datasets may be reduced.
- Higher K improves performance but increases training time and cost, indicating a trade-off between accuracy and efficiency.