
Computer Science
Fake news detection based on a hybrid BERT and LightGBM models
E. Essa, K. Omar, et al.
This research introduces a hybrid fake news detection system that fuses BERT and LightGBM and outperforms traditional methods across diverse datasets. Ehab Essa, Karima Omar, and Ali Alqahtani demonstrate the approach's capabilities through rigorous comparisons with machine-learning and deep-learning baselines.
Introduction
Social media platforms have achieved extraordinary reach, opened unforeseen opportunities, and changed the way news is disseminated, produced, and consumed, making them indispensable for a variety of applications. The rapid development of social media features across platforms has made the move to social media highly attractive, even for reputable and well-known news organizations and agencies.
The use of social media has been accompanied by a dramatic increase in the threat of fake news and online misinformation. Fake news consists of fabricated stories that share the characteristics of news media content but differ in organizational process or purpose, in an effort to deceive readers [1]. Fake news spreads continuously through social media, online blogs, magazines, forums, and newspapers, making it difficult to find trustworthy news sources. Because information can be created and distributed so easily, social media has become an ideal platform for anyone to manufacture, distort, and propagate fake news. For instance, according to Facebook [2], malicious entities contributed less than one-tenth of 1% of civic content published on the network.
In recent years, fake news has been blamed for deepening political division and partisan strife, and it also has a significant effect on topics such as vaccination, nutrition, and stock values. According to a study by Ohio State University researchers [3], fake news most likely contributed to the decline in Hillary Clinton's popularity on election day; the study suggests that roughly 4% of Barack Obama's 2012 supporters were discouraged from voting for Clinton in 2016 by their belief in fake news stories. In another example [4], fake rumors about Tesla buying a lithium mining company caused its shares to increase by nearly 250%. According to [5], 24.8% of tweets about the COVID-19 pandemic contained inaccurate information and 17.4% contained unverified information. The falsity of news thus has a considerable influence, positive or negative, on readers, and developing effective analytical tools for online content is crucial to preventing negative impacts on social, economic, and political life.
Manual fact-checking requires regular updates by crowdsourced contributors or a small set of experts and cannot perform automated learning [6]. Machine-learning and deep-learning approaches have demonstrated accurate predictions and insights on a variety of complicated problems [7][8][9]. Developing automatic, trustworthy, and accurate solutions for detecting fake news is therefore an active research area. Fake news detection is a challenging natural language processing (NLP) problem concerned with classifying text as fake or real. NLP has advanced significantly over the past few years, and transformer-based pre-trained language models are now the state-of-the-art approach for many NLP problems [10][11][12]. However, work on fake news detection using transformer-based models is still limited.
In this paper, we propose a hybrid model that combines a transformer-based architecture with a light gradient boosting machine (LightGBM) for fake news detection. Bidirectional encoder representations from transformers (BERT) are used to process the news articles and extract text representations. BERT is one of the most effective language representation models, producing excellent results across a wide range of NLP applications. We add a LightGBM classifier on top of the BERT model to produce a hybrid classifier that predicts whether news content is real or fake. LightGBM is a high-efficiency gradient boosting framework that uses tree-based learning techniques; it supports parallel, distributed, and GPU learning and provides fast training. The proposed method is evaluated on three fake news datasets. We compare its performance with multinomial Naive Bayes (MNB), logistic regression (LR), linear support vector machines (LSVM), and long short-term memory (LSTM) using different word embedding techniques. The proposed method achieves superior performance compared to the state-of-the-art. The main contributions of this work can be summarized as:
• We propose an automated fake news detection method for both the title and the full text of news articles, based on a hybrid of BERT and LightGBM models.
• The BERT model is proposed to extract a deep representation of the input texts.
• The LightGBM model is proposed to classify the BERT-based word embedding as real or fake content.
• The proposed method is evaluated on three fake news datasets and compared to traditional machine-learning and deep-learning approaches.
The rest of the paper is organized as follows. In the section "Related work", the related works on fake news detection are discussed. The section "Proposed method" presents the details of the proposed method. The experimental results, as well as a comparison to other methods, are discussed in the section "Experimental results and discussion". Finally, the conclusion is drawn in the section "Conclusion".
Literature Review
Various machine-learning-based techniques have been developed to detect fake news. These methods can be categorized into traditional approaches and deep-learning approaches.
Traditional machine-learning approaches include MNB, LR, LSVM, Decision Tree (DT), and extreme gradient boosting (XGBoost). In [13], Ahmed et al. used n-gram analysis with TF-IDF to extract features for detecting fake news, comparing six machine-learning techniques; LSVM achieved 92% accuracy on the ISOT dataset. In [14], five machine-learning models with TF-IDF showed LSVM and XGBoost as best. In [15], Ozbay and Alatas used TF weighting and a document-term matrix to extract features and evaluated 23 supervised models; DT performed best. In [16], the same authors used salp swarm optimization (SSO) and grey wolf optimizer (GWO) algorithms for fake news detection. In [17], Kansal exploited part-of-speech (POS) tags and combined XGBoost with an MLP using TF-IDF and Word2Vec. These methods require substantial labeled data and may struggle to adapt to evolving fake news types.
Deep-learning approaches such as CNN and LSTM are increasingly popular. In [18], a hybrid CNN+LSTM with GloVe embeddings outperformed seven traditional methods. In [19], FNDNet used GloVe with stacked convolutional-pooling and dense layers. In [20], back-translation data augmentation was applied, and CNN, BiLSTM, and ResNet were tested with Word2Vec, GloVe, and fastText. In [21], OPCNN-FAKE optimized a CNN pipeline and outperformed RNN, LSTM, and six classical methods. In [22], TI-CNN fused textual and image features; [23] proposed a multimodal coupled CNN for text+image. These methods may not capture long-range context, and context-free embeddings limit context specificity.
Transformer-based models have advanced NLP. In [24], five transformer models (XLNet, BERT, RoBERTa, DistilBERT, ALBERT) showed comparable performance for fake news detection. In [25], FakeBERT used BERT embeddings with parallel CNN layers, outperforming classical ML. In [26], a BART and RoBERTa hybrid fed into LSTM/CNN branches distinguished news types. In [27], an attention-based transformer outperformed a multimodal CNN. However, many transformer-plus-deep-learning methods are computationally expensive and data-hungry. In contrast, the proposed method uses LightGBM for efficient classification on top of BERT embeddings.
Methodology
Proposed method: A hybrid architecture combining fine-tuned BERT for contextual text representation with LightGBM for efficient, high-performance classification. The pipeline applies pre-processing, BERT tokenization and fine-tuning, sentence representation extraction via [CLS] tokens from the last three layers, and LightGBM training for binary classification.
Pre-processing: Clean input text by filtering non-alphabet characters, tags, and URLs; remove numbers, stop words, and punctuation; apply lowercasing; exclude records with fewer than ten words in full text.
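A minimal sketch of this cleaning step in Python, assuming NLTK's English stop-word list and regular-expression filtering (the paper does not name the exact libraries used):

```python
import re
from nltk.corpus import stopwords  # assumes the NLTK stopwords corpus is downloaded

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text, min_words=10):
    """Clean one record following the pre-processing steps described above."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # drop URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # keep alphabetic characters only
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    if len(tokens) < min_words:                     # the <10-word filter applies to full-text records
        return None
    return " ".join(tokens)
```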
BERT: Utilize the transformer-based BERT model [10], a bidirectional contextual language model trained with masked language modeling (MLM) and next sentence prediction on BooksCorpus and English Wikipedia. Inputs are tokenized with WordPiece (30k vocabulary), with [CLS] and [SEP] special tokens. BERT uses only the encoder stack of the Transformer with multi-head self-attention and feedforward layers, residual connections, and layer normalization.
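For illustration, WordPiece tokenization with the bert-base-uncased checkpoint (named later in the implementation details); the 128-token cap here is an assumption, as the maximum sequence length is not stated:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # ~30k WordPiece vocabulary
enc = tokenizer("Breaking news about the election results",
                truncation=True, max_length=128, return_tensors="pt")
# The sequence is wrapped in the special tokens: [CLS] ... [SEP]
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
```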
Model configuration: The adopted BERT-base architecture has L = 12 encoder layers, hidden size H = 768, and h = 12 self-attention heads, totaling roughly 110M parameters. The output of layer l is H^l = TransformerBlock(H^{l−1}); multi-head attention is computed as MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), followed by a feed-forward network, residual connections, and layer normalization.
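For completeness, the scaled dot-product attention used inside each head is the standard Transformer formulation (not restated above), where d_k is the per-head key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```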
BERT fine-tuning: Add a fully connected layer with softmax over the [CLS] last hidden state H^L_CLS to classify texts as real or fake. The softmax computes p(C | H^L_CLS) = softmax(W H^L_CLS). All BERT parameters and the classifier weights W are fine-tuned by minimizing negative log-likelihood.
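A minimal fine-tuning sketch in PyTorch with Hugging Face Transformers; the 5e-5 learning rate and one-epoch 1-cycle schedule follow the implementation details given later, while train_loader is an assumed DataLoader of tokenized batches with labels:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = OneCycleLR(optimizer, max_lr=5e-5, total_steps=len(train_loader))  # one epoch

for batch in train_loader:  # assumed DataLoader yielding input_ids, attention_mask, labels
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])  # cross-entropy (negative log-likelihood) over real/fake
    outputs.loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```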
Sentence representations: After fine-tuning, extract fixed-size sentence embeddings by concatenating the [CLS] embeddings from the last three layers: E = Concat(H^L_CLS, H^{L−1}_CLS, H^{L−2}_CLS). This "CLS3" representation serves as input features to the downstream classifier.
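A sketch of the CLS3 extraction from the fine-tuned model; hidden_states holds the embedding output plus all 12 encoder layers, so the last three entries correspond to layers L, L−1, and L−2, giving a 3 × 768 = 2304-dimensional feature vector per article:

```python
import torch

@torch.no_grad()
def cls3_features(model, input_ids, attention_mask):
    """Concatenate the [CLS] hidden states of the last three BERT encoder layers."""
    out = model.bert(input_ids=input_ids,
                     attention_mask=attention_mask,
                     output_hidden_states=True)
    h = out.hidden_states                              # tuple of 13 tensors (embeddings + 12 layers)
    cls_last3 = [h[-i][:, 0, :] for i in (1, 2, 3)]    # [CLS] sits at token position 0
    return torch.cat(cls_last3, dim=-1)                # shape: (batch_size, 3 * 768)
```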
LightGBM: Use LightGBM, an efficient gradient boosting decision tree (GBDT) implementation that accelerates training via histogram-based splits, leaf-wise growth, Gradient-based One-Side Sampling (GOSS), and Exclusive Feature Bundling (EFB). LightGBM trains trees sequentially to fit residuals, optimizing a differentiable loss via gradient descent, with improved memory and computational efficiency compared to traditional GBDT and level-wise methods like XGBoost.
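A minimal LightGBM classification sketch on the extracted features; X_train, y_train, etc. are assumed to come from the CLS3 step, and the hyperparameter values shown are placeholders, since the paper tunes them with Optuna:

```python
import lightgbm as lgb

# X_train/X_valid/X_test: CLS3 feature matrices; y_*: binary labels (real vs. fake)
clf = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=500,      # placeholder values; the paper selects them with Optuna
    learning_rate=0.05,
    num_leaves=31,
)
clf.fit(X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        callbacks=[lgb.early_stopping(stopping_rounds=50)])
y_pred = clf.predict(X_test)
```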
Algorithm 1 (Steps):
1) Load labeled dataset and apply pre-processing.
2) Load pre-trained BERT and tokenizer.
3) Tokenize input data.
4) Fine-tune BERT on training data.
5) Generate contextualized embeddings via CLS3 concatenation.
6) Train LightGBM classifier on training features.
7) Predict labels (real/fake) on test data.
Experimental setup (implementation details): Each dataset is split 60%/10%/30% for train/validation/test. Experiments run on Intel i9-10850K CPU and Nvidia TITAN Xp GPU. Classical ML uses scikit-learn; LSTM is implemented in Keras with one bidirectional LSTM (100 hidden units), global max pooling, and output layer; same LSTM architecture is used with GloVe and BERT-based embeddings. BERT model is "bert-base-uncased" (Hugging Face Transformers). Fine-tuning is for one epoch with 1-cycle policy, batch size 6, learning rate 5e-5. LightGBM hyperparameters are tuned with Optuna on the validation set.
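A hedged sketch of the Optuna tuning described above; the search space and trial budget are illustrative, as the exact ranges are not given:

```python
import optuna
import lightgbm as lgb
from sklearn.metrics import accuracy_score

def objective(trial):
    params = {
        "objective": "binary",
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train)          # CLS3 features from the BERT step
    preds = model.predict(X_valid)       # 10% validation split
    return accuracy_score(y_valid, preds)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)   # trial budget is an assumption
```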
Datasets: ISOT (≈45,000 articles; balanced fake/real). TI-CNN (20,015 articles; 8074 real, 11,941 fake). Fake News Corpus (FNC): curated multi-million corpus; here, a balanced subset is used by sampling 500,000 fake and 500,000 credible articles (1,000,000 total). Each dataset includes titles and full text.
Evaluation metrics: Accuracy, Precision, Recall, F1-score, and ROC-AUC, as defined in the paper.
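These metrics correspond to the standard scikit-learn implementations (scikit-learn is already used for the classical baselines); ROC-AUC requires predicted probabilities rather than hard labels:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# y_test, y_pred, clf, and X_test carry over from the LightGBM sketch above
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive (fake) class
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
```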
Key Findings
- CLS3 representation effectiveness: Concatenating the [CLS] embeddings from the last three BERT layers (CLS3) yields the best average accuracy across datasets (91.31%) versus CLS (90.95%), MeanPooling (90.63%), and MaxPooling (90.48%).
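For context, a sketch of the pooled alternatives that CLS3 is compared against, under the assumption that MeanPooling and MaxPooling aggregate the last-layer token embeddings over non-padding positions:

```python
import torch

def mean_pooling(last_hidden, attention_mask):
    """Average last-layer token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()          # (batch, seq_len, 1)
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

def max_pooling(last_hidden, attention_mask):
    """Element-wise maximum over non-padding positions."""
    masked = last_hidden.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
    return masked.max(dim=1).values
```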
ISOT dataset (Title vs. Text):
- TF-IDF + LSVM: Title Acc 95.15%, F1 95.26%; Text Acc 99.42%, F1 99.43%.
- GloVe + LSTM: Title Acc 97.94%, F1 97.98%; Text Acc 98.45%, F1 98.48%.
- BERT + LSTM: Title Acc 98.52%, F1 98.55%; Text Acc 99.85%, F1 99.85%.
- Proposed (BERT + LightGBM): Title Acc 98.91%, F1 98.94%, Prec 99.03%, Rec 98.84%; Text Acc 99.88%, F1 99.88%, Prec 99.79%, Rec 99.97%. Best overall on both title and text.
TI-CNN dataset (Title vs. Text):
- TF-IDF + LSVM: Title Acc 87.20%, F1 89.09%; Text Acc 94.51%, F1 95.26%.
- GloVe + LSTM: Title Acc 86.29%, F1 88.88%; Text Acc 91.81%, F1 93.16%.
- BERT + LSTM: Title Acc 87.52%, F1 89.71%; Text Acc 96.31%, F1 96.89%.
- Proposed (BERT + LightGBM): Title Acc 88.65%, F1 90.54%, Prec 89.29%, Rec 91.82%; Text Acc 96.94%, F1 97.42%, Prec 97.32%, Rec 97.51%. Best on both title and text.
FNC dataset (Title vs. Text):
- TF-IDF + LSVM: Title Acc 83.56%, F1 83.70%; Text Acc 97.84%, F1 97.83%.
- GloVe + LSTM: Title Acc 81.63%, F1 81.65%; Text Acc 96.12%, F1 96.11%.
- BERT + LSTM: Title Acc 86.27%, F1 86.29%; Text Acc 81.69%, F1 81.88%.
- Proposed (BERT + LightGBM): Title Acc 86.38%, F1 86.33%, Prec 86.36%, Rec 86.31%; Text Acc 99.06%, F1 99.05%, Prec 99.07%, Rec 99.04%. Best overall; especially strong on text.
ROC-AUC (Title classification): Proposed method achieves AUCs of 0.9985 (ISOT), 0.9527 (TI-CNN), and 0.9430 (FNC), outperforming TF-IDF + LSVM (0.9872, 0.9436, 0.9183) and GloVe + LSTM (0.9971, 0.9363, 0.9059).
State-of-the-art comparison:
- ISOT: Proposed 99.88% Acc, 99.88% F1 vs. TF + GWO 99.50% Acc (99.70% F1), GloVe + CNNLSTM 99.0%, TF-IDF + DT 96.80%, POS + XGBoost 92.32%, TF-IDF + SVM 92.00%.
- TI-CNN: Proposed 96.94% Acc, 97.42% F1 vs. GloVe + CNN 96.26% Acc (95.89% F1) and Word2Vec + LSTM F1 87.58%.
- FNC: Proposed 99.06% Acc, 99.05% F1 vs. TF-IDF + SVM 97.00%, TF-IDF + XGBoost 96.00%, BART+RoBERTa + LSTM-CNN 92.50%.
Overall, the hybrid BERT + LightGBM model consistently outperforms classical ML, deep learning baselines, and recent SOTA across three datasets, for both titles and full text.
Discussion
The study addresses the challenge of accurately and efficiently detecting fake news by combining BERT’s contextual language understanding with LightGBM’s efficient gradient-boosted decision trees. Results show that BERT-derived embeddings, especially the concatenation of [CLS] vectors from the last three layers (CLS3), provide superior sentence representations for downstream classification compared to single-layer [CLS] and pooling strategies. Across three diverse datasets (ISOT, TI-CNN, FNC), the hybrid approach achieves top performance for both headlines and full articles, with especially strong gains on full-text classification and high ROC-AUC values.
The findings demonstrate that leveraging BERT as a feature extractor, followed by a LightGBM classifier, captures rich semantic and syntactic patterns while maintaining computational efficiency. Compared to end-to-end deep transformer hybrids (e.g., BERT with CNN/LSTM stacks), training the lightweight LightGBM classifier on fixed embeddings reduces computational overhead and memory usage while preserving or improving accuracy. The approach generalizes well across datasets differing in size, source, and label distribution, indicating robustness to domain variation and evolving fake news styles. These results validate the proposed hybrid methodology as an effective and scalable solution for fake news detection.
Conclusion
A novel hybrid model that integrates BERT and LightGBM is presented for automated fake news detection from titles and full texts. BERT is fine-tuned and used to generate contextualized embeddings, from which concatenated [CLS] features (last three layers) are fed to a LightGBM classifier. Extensive experiments on ISOT, TI-CNN, and FNC datasets show that the proposed method achieves superior performance over traditional machine-learning models, deep-learning baselines, and recent state-of-the-art methods, with accuracies of 99.88%, 96.94%, and 99.06%, respectively. The hybrid design effectively balances representational power and computational efficiency. Future research directions may include extending the approach to multimodal inputs (e.g., images, metadata), exploring cross-domain adaptation, and investigating larger or domain-adapted transformer backbones for further gains.
Limitations