AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry

A. Asgarian, R. Saha, et al.

Detecting fraudulent claims is vital in auto insurance. This research by Azin Asgarian, Rohit Saha, Daniel Jakubovitz, and Julia Peyre presents AutoFraudNet, a framework that combines images, text, and tabular data to enhance fraud detection, achieving over a 3% improvement in identifying fraudulent claims.
Introduction

The auto-insurance industry faces substantial fraud, with estimated global losses of at least $29 billion annually. Around 30% of submitted auto-insurance claims contain fraudulent elements, yet fewer than 3% are prosecuted. Traditional detection relies on spotting inconsistencies in evidence, but manual inspection is costly, time-consuming, and error-prone given the volume and multimodal nature of claim data (images, text, metadata).

Machine learning has been applied to fraud detection across insurance domains, often focusing on tabular data, with some works using textual or visual data alone. Single-modality approaches, however, fail to leverage complementary information across modalities, risking inaccurate assessments. Multimodal reasoning is a natural fit, but real-world adoption is hindered by two challenges: modalities overfit and generalize at different rates, complicating joint training; and high-capacity models are prone to overfitting, especially under data scarcity, class imbalance, and noise.

Contributions: The paper introduces AutoFraudNet, a multimodal reasoning framework for detecting fraudulent auto-insurance claims from images, text, and tabular data. It employs cascaded slow fusion with state-of-the-art fusion blocks to improve cross-modal training, and a lightweight design with auxiliary losses to mitigate overfitting. Experiments on a real-world dataset demonstrate the effectiveness of multimodal approaches over uni- and bimodal baselines and the superior performance of AutoFraudNet.

Literature Review

Prior fraud detection research spans data mining, unsupervised and supervised ML (e.g., Random Forests, SVMs, Naïve Bayes), with class imbalance addressed via sampling methods like SMOTE and ADASYN. Beyond tabular data, some studies use textual features (e.g., LDA-based features with deep networks) or visual features (e.g., YOLO for damage detection), but these typically remain unimodal. Recent advances in multimodal learning (e.g., CLIP, VisualBERT, SimVLM) show strong visual-language capabilities, often leveraging transformer architectures and large paired datasets. Yet training high-capacity multimodal models end-to-end is challenging with limited data/compute. Consequently, some works use pre-extracted features and focus on fusion mechanisms. Fusion strategies vary by stage (early, intermediate, late) and operation (e.g., bilinear/factorized pooling: BLOCK, BLOCK Tucker, MLB, MFH, MFB; simple concatenation; linear sum). This work builds on that literature by using pre-trained feature extractors and a slow fusion paradigm to combine image, text, and tabular signals for auto-insurance fraud detection.
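To make the bilinear-pooling family concrete, here is a minimal NumPy sketch of MFB-style factorized fusion. All dimensions, variable names, and the random weights are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def mfb_fuse(x, y, U, V, k, o):
    """MFB-style fusion of two modality vectors x (d1,) and y (d2,).

    U: (d1, k*o) and V: (d2, k*o) are learnable projections; each of the
    o output units sum-pools k factors of the elementwise joint product.
    """
    joint = (x @ U) * (y @ V)              # elementwise product in the joint space, (k*o,)
    z = joint.reshape(o, k).sum(axis=1)    # sum-pool over k factors per output unit, (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))    # signed square-root normalization
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z     # L2 normalization

# Hypothetical sizes: a 720-d visual vector fused with a 126-d tabular vector.
rng = np.random.default_rng(0)
d1, d2, k, o = 720, 126, 5, 50
x, y = rng.normal(size=d1), rng.normal(size=d2)
U, V = rng.normal(size=(d1, k * o)), rng.normal(size=(d2, k * o))
fused = mfb_fuse(x, y, U, V, k, o)
```

In practice U and V are trained end-to-end with the classifier; the low-rank factorization is what keeps bilinear interactions affordable compared with a full outer product.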

Methodology

Problem formulation: Binary classification of auto-insurance claims as fraudulent vs. not fraudulent using visual (multiple claim images), textual (claim descriptions), and tabular (metadata) modalities.

Dataset: Japanese auto-insurance claims, each with a unique ID used to align modalities. Claims can involve multiple vehicle parts (21 main parts). Claims missing any modality are dropped, yielding 1,000,000 claims, of which 30,000 (3%) are fraudulent. Data are split 80/10/10 (train/val/test) with stratified sampling.

Visual features: For each claim's images {I1…In}, two in-house pre-trained CNNs (UD: undamaged/damaged severity; CDS: crack/dent/scratch type) produce 720-d embeddings per image, giving E_cds ∈ R^{n×720} and E_ud ∈ R^{n×720}. Each set is passed through an MLP encoder (two fully connected layers) to learn image-level latent representations, which are average-pooled into claim-level features A_CDS ∈ R^{50} and A_UD ∈ R^{50}.

Textual features: Japanese claim descriptions are encoded with an English-pretrained BERT model fine-tuned on the Japanese claim corpus via masked language modeling (MLM), producing a 768-d claim embedding A_Text.

Tabular features: (a) Structural metadata features are represented as one-hot encodings, forming A_Struct ∈ R^{87}. (b) Visibility scores come from the in-house UD and Part-Visibility CNNs (21 vehicle parts): post-softmax 21-d scores per image are aggregated across images with max, min, and mean, giving 3×21 scores per network; concatenating the two yields A_SPUD ∈ R^{126}.

Unimodal baseline: Each feature is processed by a two-layer MLP (500 units per layer, ReLU activations, dropout p=0.5) feeding a softmax classifier.

Bimodal experiments: Eight cross-modal feature pairs (e.g., CDS+Text, CDS+SPUD, UD+Struct) are fused using seven fusion strategies: Concat MLP, Linear Sum, BLOCK, BLOCK Tucker, MLB, MFH, and MFB; the fused output feeds a softmax classifier.
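The claim-level aggregation described above can be sketched as follows. The dimensions (720-d image embeddings pooled to 50-d claim features; 21 part-visibility scores aggregated to a 126-d vector) follow the text, but the MLP weights here are random placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
n_images = 4  # images attached to one hypothetical claim

def relu(a):
    return np.maximum(a, 0.0)

# --- Visual branch: per-image CNN embeddings -> MLP encoder -> average pool ---
E_cds = rng.normal(size=(n_images, 720))          # per-image CDS embeddings
W1 = rng.normal(size=(720, 128))                  # placeholder MLP weights
W2 = rng.normal(size=(128, 50))
H = relu(relu(E_cds @ W1) @ W2)                   # two-layer MLP, applied per image
A_CDS = H.mean(axis=0)                            # average pool to one 50-d claim feature

# --- Tabular branch: per-image visibility scores aggregated across images ---
def aggregate_scores(S):
    """S: (n_images, 21) post-softmax scores -> (63,) max/min/mean statistics."""
    return np.concatenate([S.max(axis=0), S.min(axis=0), S.mean(axis=0)])

S_ud = rng.uniform(size=(n_images, 21))           # UD network scores per image
S_pv = rng.uniform(size=(n_images, 21))           # Part-Visibility network scores
A_SPUD = np.concatenate([aggregate_scores(S_ud), aggregate_scores(S_pv)])  # (126,)
```

Pooling over a variable number of images is what lets every claim, regardless of how many photos it contains, map to fixed-size feature vectors.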
Multimodal setups: (1) Concat MLP-All concatenates [A_CDS, A_UD, A_SPUD, A_Struct, A_Text] and applies the unimodal MLP; a variant excludes text (Concat MLP-w/o Text) given the limited Japanese text. (2) Slow Fusion (SF): Two bimodal fusion modules are trained jointly; their intermediate activations (A_F1, A_F2) are fused by a second fusion layer. The first fusion layer is fixed to the best bimodal configuration; the second layer is tested with MFB, MLB, BLOCK, and BLOCK Tucker.

Proposed framework: AutoFraudNet follows the SF design but replaces the second fusion layer with a single fully connected layer over the concatenated [A_F1, A_F2], keeping the model compact; classification is via softmax. AutoFraudNet+Heads adds two auxiliary classification heads on A_F1 and A_F2 to provide more granular supervision; the total loss is L = L_F1 + L_F2 + L_C.

Training settings: Cross-entropy loss, Adam optimizer (lr = 1e−3), early stopping (patience = 3), class-balanced mini-batches (50% per class), and five runs per model with random initialization; mean and standard deviation are reported.

Evaluation metrics: Precision, recall, and F1 for both classes, with thresholds tuned to guarantee at least 80% recall on the Fraudulent class; threshold-independent PR AUC is the primary metric, and Balanced Accuracy (mean recall across classes) is reported for robustness to class imbalance.
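The AutoFraudNet+Heads objective L = L_F1 + L_F2 + L_C can be sketched as three cross-entropy terms, one per head. The logit values below are placeholders for a single example; during training they come from the auxiliary heads on A_F1 and A_F2 and from the main classifier.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example; logits: (2,), label: 0 or 1."""
    shifted = logits - logits.max()                    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

label = 1                              # ground truth: fraudulent
logits_f1 = np.array([0.2, 1.1])       # auxiliary head on A_F1 (placeholder values)
logits_f2 = np.array([-0.3, 0.7])      # auxiliary head on A_F2
logits_c = np.array([0.1, 1.5])        # main classification head

# Total loss is the unweighted sum of the three head losses.
total_loss = (cross_entropy(logits_f1, label)
              + cross_entropy(logits_f2, label)
              + cross_entropy(logits_c, label))
```

The auxiliary heads are used only during training; at inference, predictions come from the main head, so the extra supervision adds regularization without inference-time cost.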

Key Findings
  • Dataset: 1,000,000 claims with 3% fraudulent; stratified 80/10/10 split.
  • Unimodal: Visual embeddings (CDS, UD) achieve the highest PR AUC, outperforming textual and structural features. SPUD (a structured proxy for visual visibility/damage) outperforms raw Struct and Text features.
  • Bimodal: Visual+tabular pairs are strongest; (CDS, SPUD) and (UD, Struct) are the best-performing feature pairs. Text underperforms when fused with other modalities. Among fusion strategies, BLOCK Tucker achieves the highest average PR AUC across pairs, followed by BLOCK, MLB, and MFB; several strategies outperform naïve Concat MLP.
  • Multimodal baselines: Concat MLP-w/o Text outperforms Concat MLP-All, indicating limited benefit from the available textual features. In SF, BLOCK Tucker in the second fusion layer performs best, followed by BLOCK; SF-MLB and SF-MFB do not surpass Concat MLP-w/o Text.
  • Proposed models: AutoFraudNet surpasses all other multimodal configurations across metrics, improving PR AUC by 2.0% over Concat MLP-w/o Text and 0.9% over SF-BLOCK Tucker, while remaining compact at 21.6M parameters (41.4% fewer than SF-BLOCK Tucker). AutoFraudNet+Heads further improves PR AUC by 2.1% over AutoFraudNet and raises Balanced Accuracy by roughly 15% relative to SF-BLOCK Tucker and Concat MLP-w/o Text.
  • Overall PR AUC trend (average over runs): Unimodal Best 0.195; Bimodal Best 0.214; SF-BLOCK Tucker 0.204; AutoFraudNet 0.212; AutoFraudNet+Heads 0.233, demonstrating gains from incorporating multiple modalities with appropriately designed fusion.
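The evaluation protocol behind these numbers, tuning the decision threshold so the Fraudulent class keeps at least 80% recall and then reporting Balanced Accuracy, can be sketched with toy scores and labels (all values below are invented for illustration):

```python
import numpy as np

def tune_threshold(scores, labels, min_recall=0.8):
    """Highest threshold whose fraud-class (label 1) recall is >= min_recall."""
    candidates = np.sort(np.unique(scores))[::-1]      # try strictest thresholds first
    n_fraud = (labels == 1).sum()
    for t in candidates:
        preds = (scores >= t).astype(int)
        recall = (preds[labels == 1] == 1).sum() / n_fraud
        if recall >= min_recall:
            return t
    return candidates[-1]                              # fall back to the loosest threshold

def balanced_accuracy(preds, labels):
    """Mean of per-class recalls, robust to class imbalance."""
    rec_fraud = (preds[labels == 1] == 1).mean()
    rec_legit = (preds[labels == 0] == 0).mean()
    return 0.5 * (rec_fraud + rec_legit)

# Toy example: 5 fraudulent (label 1) and 5 legitimate (label 0) claims.
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.2, 0.5, 0.4, 0.3, 0.1, 0.05])
t = tune_threshold(scores, labels)                     # -> 0.6 (recall exactly 0.8)
preds = (scores >= t).astype(int)
ba = balanced_accuracy(preds, labels)                  # -> 0.9
```

PR AUC remains the primary metric because it integrates over all thresholds, whereas the tuned threshold matters only for the reported precision/recall/F1 operating point.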
Discussion

The study demonstrates that fraud detection in auto-insurance benefits markedly from multimodal reasoning that leverages complementary signals across images and tabular metadata. Visual features (CDS, UD) capture detailed evidence of damage type and severity, while tabular features (Struct, SPUD) add contextual and visibility cues; their interplay is especially potent, as shown by the best-performing pairs (CDS, SPUD) and (UD, Struct). However, not all fusion choices are beneficial: poorly chosen fusion strategies or early naive concatenation can hinder training and reduce performance. Carefully designed slow fusion pipelines, coupled with compact second-stage fusion and auxiliary supervision (AutoFraudNet+Heads), mitigate overfitting and training instability, yielding superior PR AUC and substantially improved Balanced Accuracy under class imbalance. These findings validate the hypothesis that multimodal architectures, when appropriately fused and regularized, address the unique challenges of real-world, imbalanced, multimodal fraud detection tasks and can surpass both unimodal and simplistic multimodal baselines.

Conclusion

AutoFraudNet is a compact, effective multimodal framework for detecting fraudulent auto-insurance claims. By employing cascaded slow fusion with strong bimodal blocks and a lightweight second-stage fusion, along with auxiliary supervision, it achieves state-of-the-art performance on a large real-world dataset. The work evidences clear benefits from leveraging images and tabular metadata together and highlights the importance of judicious fusion design to avoid overfitting. Future directions include improving textual modeling with Japanese-pretrained language models, handling missing modalities through learned gating mechanisms, and enhancing explainability (e.g., attention maps) to support user trust and error analysis.

Limitations
  • Text modality: Fine-tuning an English-pretrained BERT on a small Japanese corpus brought limited gains; the lack of large, relevant Japanese-pretrained models and limited labeled text constrained performance.
  • Modality availability: The framework requires all modalities; claims with missing features were dropped. A learned gating mechanism could enable robustness to missing modalities.
  • Explainability: The current model lacks interpretability; adding attention maps or similar techniques could increase transparency and user trust.
  • Overfitting risks: High-capacity fusion blocks are parameter-heavy and can overfit; although mitigated by the compact design and auxiliary heads, sensitivity to model complexity remains.
  • Data challenges: Class imbalance (3% fraud) and potential noise necessitate careful sampling and evaluation; generalization to other regions or insurers may require adaptation.