GEMTELLIGENCE: Accelerating gemstone classification with deep learning

Engineering and Technology

T. Bendinelli, L. Biggio, et al.

GEMTELLIGENCE is a deep learning method for automated gemstone origin determination and treatment detection, developed by Tommaso Bendinelli, Luca Biggio, Daniel Nyfeler, Abhigyan Ghosh, Peter Tollan, Moritz Alexander Kirschmann, and Olga Fink. It combines inexpensive measurement techniques with multimodal neural networks and delivers predictive accuracy that rivals costly traditional methods.

Introduction

Gemstones command high value, and their price depends strongly on geographic origin and post-mining treatments. Current practice relies on expert visual microscopy and several analytical instruments (UV-Vis-NIR, FTIR, XRF, ICP-MS). These workflows are costly, time-consuming, and demand significant expertise (especially ICP-MS). They can also yield inconsistent, hard-to-standardize decisions, which threatens trust and asset valuation. The research question is whether a multimodal deep learning system can accurately and consistently determine country of origin (origin determination, OD) and detect heat treatment (treatment detection, TD) for gemstones, using affordable measurements where possible. Such a system would increase automation, reduce cost and turnaround time, and match expert-level and ICP-MS-level performance. The study targets blue sapphires, a challenging and economically important class, and evaluates the approach in real-world laboratory settings.

Literature Review

Machine and deep learning have transformed materials science, geoscience, and computational chemistry, yet applications in gemology remain nascent. Prior work typically applies conventional ML with hand-crafted features to single modalities (images, spectra, or compositions) for tasks like geotagging or grading, still heavily dependent on expert input. Analytical best practice in gemology often combines techniques (microscopy, ICP-MS, UV-Vis-NIR, FTIR, XRF), with ICP-MS considered highly reliable but expensive and operator-intensive. This work builds on advances in deep learning for spectral analysis and transformer-based tabular modeling to integrate heterogeneous laboratory data, addressing the gap in automated, multimodal gemstone origin determination and treatment detection.

Methodology

Data: Over 5500 blue sapphire records from the Gübelin Gem Lab (2013–2020). Approximately half of the stones have comprehensive measurements (UV-Vis-NIR, FTIR, XRF, ICP-MS); the others follow reduced protocols. Five-fold cross-validation is used. For OD and TD ground truth, the study keeps only stones with rigorous consensus: two independent gemologists agree by microscopy, the OD conclusion is consistent with ICP-MS, and all relevant measurements are present.
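
The consensus filter described above can be sketched as a simple predicate over stone records. This is a minimal illustration, not the authors' code: the record schema, the field names, and the set of modalities treated as "complete" are all assumptions, since the summary does not specify them.

```python
from dataclasses import dataclass

# Hypothetical record schema; the source does not describe one.
@dataclass
class StoneRecord:
    stone_id: str
    origin_expert_a: str   # microscopy call by the first gemologist
    origin_expert_b: str   # microscopy call by the second gemologist
    origin_icpms: str      # origin conclusion supported by ICP-MS
    modalities: frozenset  # measurements available for this stone

# Assumption: "complete relevant measurements" means all four instruments for OD.
REQUIRED_FOR_OD = frozenset({"uv", "ftir", "xrf", "icpms"})

def has_consensus_label(rec, required=REQUIRED_FOR_OD):
    """Keep a stone only if both experts agree, ICP-MS is consistent, and data are complete."""
    experts_agree = rec.origin_expert_a == rec.origin_expert_b
    icpms_consistent = rec.origin_expert_a == rec.origin_icpms
    complete = required <= rec.modalities
    return experts_agree and icpms_consistent and complete

ALL = frozenset({"uv", "ftir", "xrf", "icpms"})
records = [
    StoneRecord("S1", "Kashmir", "Kashmir", "Kashmir", ALL),
    StoneRecord("S2", "Kashmir", "Sri Lanka", "Kashmir", ALL),        # experts disagree
    StoneRecord("S3", "Madagascar", "Madagascar", "Sri Lanka", ALL),  # ICP-MS inconsistent
    StoneRecord("S4", "Myanmar", "Myanmar", "Myanmar", frozenset({"uv", "ftir", "xrf"})),  # incomplete
]
curated = [r for r in records if has_consensus_label(r)]
print([r.stone_id for r in curated])
```

Only the first stone passes all three conditions; the others are dropped for the reason noted beside each record.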

Tasks: (1) Origin determination (OD) as multiclass classification among major metamorphic sources (e.g., Kashmir, Myanmar/Burma, Sri Lanka, Madagascar; minor/non-metamorphic origins analyzed in supplements). (2) Treatment detection (TD) as binary classification (treated vs non-treated). For TD, UV and FTIR are used; elemental analyses (XRF, ICP-MS) are excluded to avoid spurious correlations as they do not capture physical changes from heat treatment.

Instruments and preprocessing:

  • ICP-MS (LA-ICP-MS): Laser ablation with 50 μm spots at ~15 Hz and 6 J/cm², carrier He/Ar gases; Agilent 8800 mass spectrometer. Calibration with GLITTER using NIST 612 as the primary standard and BHVO-2 and ATHO-G as secondary standards; 99 wt% Al2O3 is assumed for corundum. Focused analytes include 26Mg, 27Al, 28Si, 31P, 37Cl, 40K, 42Ca, 48Ti/54Ti, 56Fe, 58Ni, 90Zr/96Zr, 146Ce, 178Hf/185Hf, 192Pt (as listed). Data are converted from counts per second (cps) to concentrations.
  • FTIR: Varian 640 with KBr beamsplitter and DTGS detector, DRIFT/transmission configurations, 64 scans at 1 cm⁻¹ resolution over ~2000–700 cm⁻¹. Spectra homogenized by cubic interpolation; padded where needed; outliers (< -5 or > 10 a.u.) removed (<1% of data). Final spectra: 861 points per measurement.
  • XRF (ED-XRF): Thermo Fisher QUANT’X with SDD; 15–50 kV; filters to reduce interferences. For sapphires, track Al (major), Ti, Cr, V, Fe, Ga; exclude Co, Pb, W for routine sapphire runs. Outlier rejection rules applied (e.g., Fe2O3 > 40,000 ppm, Al2O3 < 85,000 ppm, Cr2O3 > 10,000 ppm, TiO2 > 6,000 ppm).
  • UV-Vis-NIR: Cary 5000 with deuterium and tungsten halogen sources and InGaAs detector; 280–880 nm, 0.5 nm step; typically two perpendicular polarizations. Resulting data per sample: 2 × 121 points (per provided summary), standardized across instruments with consistent protocols and references.
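
As an illustration of the XRF outlier rejection rules listed above, the following sketch encodes the four example thresholds as data and checks a measurement against them. The thresholds are copied as listed; the dictionary layout, the function name, and the handling of missing oxides are assumptions made for this example.

```python
# Rejection rules as listed in the text, in ppm: oxide -> (comparison, limit).
XRF_REJECTION_RULES = {
    "Fe2O3": (">", 40_000),
    "Al2O3": ("<", 85_000),
    "Cr2O3": (">", 10_000),
    "TiO2": (">", 6_000),
}

def is_xrf_outlier(measurement):
    """Return True if any listed rejection rule fires for this measurement (values in ppm)."""
    for oxide, (op, limit) in XRF_REJECTION_RULES.items():
        value = measurement.get(oxide)
        if value is None:
            continue  # assumption: a missing oxide does not trigger rejection
        if (op == ">" and value > limit) or (op == "<" and value < limit):
            return True
    return False

# Hypothetical measurements, for illustration only.
typical = {"Fe2O3": 1_200, "Al2O3": 990_000, "Cr2O3": 15, "TiO2": 120}
iron_rich = {"Fe2O3": 55_000, "Al2O3": 940_000, "Cr2O3": 15, "TiO2": 120}
print(is_xrf_outlier(typical), is_xrf_outlier(iron_rich))
```

Keeping the rules in a table rather than in code makes it easy to add the remaining rules that the text only hints at with "e.g.".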

Model (GEMTELLIGENCE): A multimodal neural network with three encoders and a fusion head.

  • UV encoder: 1D CNN with residual/skip connections; initial conv kernel size 59; six residual conv blocks (hidden dim 128, kernel size 17, stride 2); produces a 1D embedding (length ~190).
  • FTIR encoder: Same CNN backbone as UV (single-channel input); produces a 1D embedding (length ~213).
  • Elemental analysis encoder: Processes concatenated XRF and (optionally) ICP-MS tabular features via SAINT (transformer-based), which applies both intra-sample self-attention across features and inter-sample (row-wise) attention across the stones in a batch. Because inter-sample attention needs other samples to attend to, a set of reference stones is appended during both training and testing, which stabilizes inference at batch size one. Output embedding length 32.
  • Fusion head: Concatenate embeddings from available modalities; batch normalization; linear readout with softmax for class probabilities. Missing modalities are masked.
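
The fusion step can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: missing modalities are masked by zero-filling their slot (the summary only says they are "masked"), batch normalization is omitted, the readout weights are random rather than trained, and the four output classes and all function names are hypothetical. The embedding lengths come from the encoder descriptions above (the UV and FTIR lengths are approximate).

```python
import math
import random

# Embedding lengths taken from the encoder descriptions (UV and FTIR are approximate).
EMBED_DIMS = {"uv": 190, "ftir": 213, "elemental": 32}

def fuse(embeddings):
    """Concatenate embeddings in a fixed order, zero-filling any missing modality."""
    fused = []
    for name, dim in EMBED_DIMS.items():
        vec = embeddings.get(name)
        if vec is None:
            fused.extend([0.0] * dim)  # assumption: masking is done by zero-filling
        elif len(vec) != dim:
            raise ValueError(f"{name} embedding must have length {dim}")
        else:
            fused.extend(vec)
    return fused

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_readout(fused, weights, bias):
    """Linear layer followed by softmax; returns one probability per class."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weights, bias)]
    return softmax(logits)

rng = random.Random(0)
n_classes = 4  # hypothetical: four origin classes
total_dim = sum(EMBED_DIMS.values())
weights = [[rng.gauss(0, 0.05) for _ in range(total_dim)] for _ in range(n_classes)]
bias = [0.0] * n_classes

# A stone with UV and elemental data but no FTIR measurement.
stone = {
    "uv": [rng.gauss(0, 1) for _ in range(EMBED_DIMS["uv"])],
    "elemental": [rng.gauss(0, 1) for _ in range(EMBED_DIMS["elemental"])],
}
fused = fuse(stone)
probs = linear_readout(fused, weights, bias)
print(len(fused), [round(p, 3) for p in probs])
```

Using a fixed slot per modality keeps the readout layer's input size constant regardless of which measurements are available for a given stone.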

Training: For each fold, split training data 80/20 into train/validation; 250 epochs with checkpointing every 5 epochs; batch size 16; learning rate 1e-4; early stopping after 30 epochs without improvement. Hyperparameters selected via preliminary grid search. Experiments run on an NVIDIA GeForce RTX 2080 Ti machine. Cross-validation aggregation follows a concatenate-then-evaluate protocol for most results; for accuracy vs coverage curves (Fig. 3), per-fold curves are averaged.
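
The stopping and checkpointing schedule described above can be illustrated by replaying a sequence of validation scores. This sketch assumes that the monitored quantity is a validation loss (lower is better), which the summary does not state; the function name and the synthetic loss curve are likewise hypothetical.

```python
def replay_training(val_losses, max_epochs=250, patience=30, checkpoint_every=5):
    """Replay per-epoch validation losses and report where training would stop.

    Returns (best_epoch, stop_epoch, checkpoint_epochs), with epochs counted from 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    checkpoints = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stop_epoch = epoch
        if epoch % checkpoint_every == 0:
            checkpoints.append(epoch)  # periodic checkpoint every 5 epochs
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, stop_epoch, checkpoints

# Synthetic curve: the loss improves for 40 epochs, then plateaus at a worse value.
val_losses = [1.0 - 0.01 * e for e in range(40)] + [0.7] * 100
best_epoch, stop_epoch, checkpoints = replay_training(val_losses)
print(best_epoch, stop_epoch, len(checkpoints))
```

With this curve the best epoch is 40 and training halts at epoch 70, 30 epochs after the last improvement and well short of the 250-epoch budget.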

Confidence-thresholding: Define confidence as the maximum softmax probability. Using training data, sort samples by confidence and find the minimum threshold such that accuracy on samples above threshold reaches a target (calibration) level. Three operating modes: None (threshold 0; full automation), Mode 1 (moderate threshold balancing automation and accuracy), Mode 2 (higher threshold prioritizing accuracy). At inference, accept predictions only if confidence ≥ calibrated threshold; otherwise defer to experts.
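
A minimal sketch of the threshold calibration described above: walk through the samples from most to least confident and record the lowest confidence at which the accepted set still meets the target accuracy. The function names and the toy data are assumptions; the summary gives only the verbal procedure.

```python
def calibrate_threshold(confidences, correct, target_accuracy):
    """Return the smallest threshold t such that accuracy over samples with
    confidence >= t is at least target_accuracy, or None if no threshold works."""
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    threshold = None
    n_correct = 0
    for i, (conf, ok) in enumerate(pairs, start=1):
        n_correct += int(ok)
        # Evaluate only after the last sample of a run of tied confidences.
        if i < len(pairs) and pairs[i][0] == conf:
            continue
        if n_correct / i >= target_accuracy:
            threshold = conf
    return threshold

def accept(confidence, threshold):
    """Accept an automated prediction only if it is confident enough; otherwise defer."""
    return threshold is not None and confidence >= threshold

# Toy calibration data: maximum softmax probability and whether the prediction was right.
confidences = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]
correct = [True, True, True, False, True, False]
print(calibrate_threshold(confidences, correct, 0.8))
print(calibrate_threshold(confidences, correct, 1.0))
```

Because accuracy above a threshold is not monotonic in the threshold, the loop scans every candidate rather than stopping at the first failure. In the toy data a target of 0.8 is missed at confidence 0.80 (3 of 4 correct) but met again at 0.70 (4 of 5 correct).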

Evaluation: Compare GEMTELLIGENCE to human gemologists under controlled access to data sources (e.g., OD comparisons exclude ICP-MS since labels are curated to match ICP-MS; TD uses UV+FTIR jointly, as per expert practice). Analyze accuracy vs fraction of confidently classified stones for different modality subsets, and assess temporal consistency on stones measured multiple times.
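
An accuracy versus coverage curve of the kind used in this evaluation can be computed as sketched below. The function name and the toy inputs are hypothetical, and ties in confidence are ignored for brevity.

```python
def accuracy_coverage_curve(confidences, correct):
    """Return (coverage, accuracy) pairs as the acceptance threshold is lowered
    one sample at a time, starting from the most confident prediction."""
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    curve = []
    n_correct = 0
    for i, (_, ok) in enumerate(pairs, start=1):
        n_correct += int(ok)
        curve.append((i / n, n_correct / i))  # fraction accepted, accuracy on accepted
    return curve

# Toy predictions for one modality subset.
curve = accuracy_coverage_curve([0.9, 0.8, 0.7, 0.6], [True, True, False, True])
for coverage, accuracy in curve:
    print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Computing one such curve per modality subset (for example UV alone, UV+XRF, or ICP-MS) allows the subsets to be compared at equal coverage.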

Key Findings
  • Overall performance and automation trade-offs (Table 1):
    • Origin Determination (OD):
      • None (threshold=0): 100% stones auto-classified; test accuracy 90.69%.
      • Mode 1: 74.2% stones above threshold; calibration accuracy 98.6%; test accuracy 96.8%.
      • Mode 2: 38.5% stones above threshold; calibration accuracy 99%; test accuracy 99.1%.
    • Heat Treatment Detection (TD):
      • None (threshold=0): 100% stones auto-classified; test accuracy 98.03%.
      • Mode 1: 97.4% stones above threshold; calibration accuracy 98%; test accuracy 98.7%.
      • Mode 2: 95.5% stones above threshold; calibration accuracy 99%; test accuracy 98.9%.
  • Modality contributions (accuracy vs coverage, Fig. 3):
    • For OD, ICP-MS yields ≈10% higher accuracy than the next best single source (UV) across coverage levels. However, combining UV+XRF achieves performance comparable to ICP-MS, despite lower cost and complexity.
    • For TD, the best accuracy comes from UV+FTIR, but GEMTELLIGENCE attains similar accuracy from FTIR alone, even though experts typically require both modalities. This indicates robustness when only one modality is available.
  • Human comparison (Fig. 2): GEMTELLIGENCE matches or exceeds expert accuracy at similar coverage levels across modality combinations, with negligible inference time (< 1 s), substantially reducing expert workload by auto-classifying a large fraction of stones while deferring low-confidence cases.
  • Consistency over time (Fig. 4): Predictions for stones measured multiple times across years and devices remain consistent, with higher thresholds (Mode 1/2) further stabilizing conclusions.
  • Practical impact: Enables high accuracy using inexpensive modalities (UV/XRF/FTIR), reducing reliance on ICP-MS and accelerating laboratory workflows.
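
To make the automation trade-off concrete, the OD figures reported above can be converted into workload per 1000 stones. This is a back-of-the-envelope sketch: it assumes the reported test accuracy applies to the auto-classified stones in each mode, and the function and variable names are hypothetical.

```python
# OD figures from the findings above: mode -> (fraction auto-classified, test accuracy).
TABLE_1_OD = {
    "None": (1.000, 0.9069),
    "Mode 1": (0.742, 0.968),
    "Mode 2": (0.385, 0.991),
}

def workload(coverage, accuracy, n_stones=1000):
    """Split n_stones into auto-classified and deferred, with expected automated errors."""
    auto = coverage * n_stones
    return {
        "auto": auto,
        "deferred": n_stones - auto,
        "expected_errors": auto * (1.0 - accuracy),
    }

summary = {mode: workload(cov, acc) for mode, (cov, acc) in TABLE_1_OD.items()}
for mode, w in summary.items():
    print(f"{mode}: auto={w['auto']:.0f} deferred={w['deferred']:.0f} "
          f"expected errors={w['expected_errors']:.1f}")
```

Under these assumptions, Mode 1 auto-classifies about 742 of 1000 stones with roughly 24 expected errors, while Mode 2 auto-classifies about 385 with roughly 3 to 4, leaving the remainder to experts.
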
Discussion

GEMTELLIGENCE addresses the need for accurate, consistent, and scalable gemstone analysis by integrating heterogeneous laboratory data within a single end-to-end model. The results show that with confidence-thresholding, laboratories can tune the balance between automation and risk: Mode 1 offers substantial workload reduction with accuracy comparable to or better than human experts, while Mode 2 offers near-perfect accuracy on a sizable subset. The modality analysis confirms accepted domain knowledge (ICP-MS is highly informative for OD) but also demonstrates that combining lower-cost UV and XRF can rival ICP-MS performance, and that FTIR alone can suffice for robust TD in many cases. Temporal consistency analyses support reliability for re-evaluations, which is critical given the financial and legal implications of gemstone certification. Collectively, these findings suggest that GEMTELLIGENCE can standardize decision-making, reduce ambiguity, and serve as a decision-support system that reserves expert time for complex, low-confidence cases, thereby improving throughput and trust in gemstone markets.

Conclusion

This work introduces GEMTELLIGENCE, a multimodal deep learning framework combining CNNs for spectral data and a transformer-based encoder for tabular elemental data to automate origin determination and heat treatment detection of gemstones. It achieves high accuracy with confidence calibration, enabling automated processing of most samples while deferring uncertain cases. Notably, it can match ICP-MS-level OD performance using inexpensive modality combinations (UV+XRF) and achieves strong TD accuracy even from FTIR alone. The approach promises substantial cost and time savings, improved standardization, and increased market trust. Future directions include expanding to non-metamorphic stones, improving ground-truth quality across labs, enlarging public datasets, and exploring applications of the framework to broader materials and spectroscopic domains beyond gemology.

Limitations
  • Training scope: The model is trained only on metamorphic blue sapphires; it does not cover all sapphire origins or non-metamorphic stones. A simple pre-classifier can separate metamorphic from non-metamorphic to route suitable cases to the model, but ideally the model would handle all types without preprocessing.
  • Ground-truth noise and bias: Labels are based on Gübelin Gem Lab determinations (expert microscopy plus spectroscopy/chemistry), which, while rigorous, may not perfectly reflect true origins for all stones. Reducing label noise and cross-lab biases could further improve performance and reliability.
  • Data availability imbalance: Comprehensive multi-instrument measurements exist for only about half the stones; although the model handles missing modalities, broader balanced datasets would strengthen generalization. These limitations can be mitigated by collecting more diverse, labeled data across origins and treatment states and by harmonizing labeling protocols.