A deep learning approach for morphological feature extraction based on variational auto-encoder: an application to mandible shape

Masato Tsutsumi, Nen Saito, Daisuke Koyabu, and Chikara Furusawa

Morpho-VAE is a deep-learning framework for landmark-free shape analysis of image data. Applied to primate mandible images, it extracts low-dimensional morphological features that distinguish predefined classes while retaining the ability to reconstruct shapes, pointing to its potential for biological discovery.

Introduction
The study addresses the challenge of objective, automated quantification of biological shape without relying on expert-annotated landmarks. Traditional geometric morphometrics requires homologous landmarks and struggles across phylogenetically distant taxa or disparate developmental stages. Too few landmarks can lose shape information, while too many can introduce errors, and landmarking can be device- and operator-dependent. Landmark-free alternatives such as elliptic Fourier analysis are often paired with linear dimensionality reduction (e.g., PCA), which may miss complex nonlinear morphological features. Deep neural networks excel at image representation but are often criticized as black boxes and are rarely applied to morphometrics. The research question is whether a hybrid deep generative-discriminative model can perform landmark-free feature extraction from images to separate predefined biological groups (families) and reconstruct shapes, including from incomplete data. The purpose is to create an interpretable, low-dimensional latent space that emphasizes morphology relevant to class distinctions, enabling visualization, classification, and reconstruction in a single framework. This matters for developmental and evolutionary biology, where morphology encodes functional, developmental, and evolutionary signals.
Literature Review
The paper reviews landmark-based geometric morphometrics as the dominant approach across taxa, noting limitations in defining homologous landmarks, information loss, and measurement/observer error. Landmark-free outline methods such as elliptic Fourier analysis (EFA) have been used for cells, mollusks, fish, and plant organs. Dimensionality reduction is commonly performed with PCA or LDA; however, linear methods may not capture complex nonlinear shape variation. Deep neural networks dominate image-recognition and medical-imaging tasks but are underused for morphological feature extraction. Hybrid semi-supervised VAE-classifier architectures have been explored in other domains (e.g., dementia drawing analysis, multimodal anomaly detection, omics classification, supervised autoencoders enforcing evenly spaced latent class centroids). Interpretability tools (e.g., Grad-CAM/Score-CAM) aim to open the black box. Building on this work, the authors propose the first application of a VAE-classifier hybrid to morphometrics.
Methodology
Data: 147 mandible samples from seven families: six primate families (Cercopithecidae, Cebidae, Lemuridae, Atelidae, Hylobatidae, Hominidae; 141 samples in total) plus Phocidae (Carnivora; 6) as an outgroup. CT and photographic datasets were sourced from KUPRI, MorphoSource.org, and MCPA2. Only adults without visible abnormalities were used.

Preprocessing: 3D mandible volumes/images were aligned (teeth pointing downward; mandible base parallel to the xy plane; axes aligned using the line from the condylar heads to the mandibular tip). Owing to left-right symmetry, each mandible was split at the midline to create two mirrored samples, with no overlap between train and test sets for the same specimen. Each specimen was projected into three 2D binary images (128×128 pixels): an x-projection (yz plane), a y-projection (xz plane), and a z-projection (xy plane). Size normalization was applied so that the length from the angular process to the tip was constant across samples.

Model: Morpho-VAE combines a VAE module with a classifier acting on the latent code ζ. The encoder (a 5-layer CNN) maps inputs to the mean μ and variance σ of a multivariate Gaussian; ζ is sampled from N(μ, σ). The decoder (a 5-layer deconvolutional network) reconstructs the three-projection image set. A single-layer softmax classifier maps ζ to the seven family labels.

Losses: E_rec is the binary cross-entropy reconstruction loss; E_reg is the KL divergence between q(ζ|X) and N(0,1); E_VAE = E_rec + E_reg; E_C is the classification cross-entropy. The total loss is E_total = (1 − α) E_VAE + α E_C, with α = 0.1 selected via cross-validation to retain reconstruction quality while improving separability.
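The weighted training objective just described can be summarized in a short sketch. This is a minimal illustration assuming a PyTorch-style implementation; the function and tensor names are hypothetical and the block is not the authors' code.

```python
# A minimal sketch of the Morpho-VAE objective, assuming a PyTorch-style
# implementation; names and shapes are illustrative, not the authors' code.
import torch
import torch.nn.functional as F

ALPHA = 0.1  # mixing weight reported in the paper, chosen via cross-validation

def sample_latent(mu, log_var):
    # Reparameterization trick: zeta = mu + sigma * eps, with eps ~ N(0, I).
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def morpho_vae_loss(x, x_recon, mu, log_var, class_logits, labels, alpha=ALPHA):
    """E_total = (1 - alpha) * (E_rec + E_reg) + alpha * E_C."""
    batch = x.size(0)

    # E_rec: binary cross-entropy between the three-projection binary input
    # and its sigmoid reconstruction, summed over pixels, averaged over batch.
    e_rec = F.binary_cross_entropy(x_recon, x, reduction="sum") / batch

    # E_reg: closed-form KL divergence between q(zeta|X) = N(mu, sigma^2)
    # and the standard normal prior N(0, 1).
    e_reg = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / batch

    # E_C: cross-entropy of the single-layer classifier on zeta
    # (F.cross_entropy applies the softmax internally, so logits are passed).
    e_c = F.cross_entropy(class_logits, labels)

    return (1.0 - alpha) * (e_rec + e_reg) + alpha * e_c
```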
Training and tuning: Hyperparameters were tuned with Optuna over 500 trials of 100 epochs each. The best architecture used encoder filters [128, 128, 32, 32, 64] with ReLU activations, decoder filters [64, 32, 32, 128, 128] with a sigmoid output, and the RMSprop optimizer for both modules. Latent dimensions from 2 to 10 were explored; d = 3 was chosen as a compact representation with low E_total and a substantial improvement over d = 2. Training used double cross-validation: one-third of the data was held out as a test set for generalization; of the remainder, 75% was used for training and 25% for validation. Proportional extraction maintained label ratios, and samples derived from the same original mandible were never split across train and test sets. Training typically ran for 100 epochs.

Evaluations:
- Latent-space visualization and cluster separation were compared across Morpho-VAE, PCA (top three components from the 128×128 pixel vectors), and a plain VAE (α = 0).
- A Cluster Separation Index (CSI; Davies–Bouldin-based) quantified cluster separation; lower is better.
- An SVM (RBF kernel; C = 1.0; tol = 0.001; gamma = 1/latent_dim) was trained on the latent features, and its classification accuracy served as a measure of separability.
- Leave-one-family-out evaluation: the model was trained on six families, and the maximum CSI between the held-out family and each trained family was computed for each method.
- Reconstruction performance: reconstruction-loss trajectories; classification accuracy on reconstructed images versus originals using the Morpho-VAE classifier; a Mann–Whitney test for the difference.
- Generative visualization: a uniformly sampled latent grid in a principal plane of ζ (PC1–PC2 at PC3 = 0) was decoded to images, and classifier predictions were mapped over the latent plane.
- Interpretability: Score-CAM on the last encoder convolutional layer produced saliency maps per family and per projection (x, y, z).
- Cropped-input reconstruction: artificial vertical and horizontal croppings (y and z directions) with varying crop rates; reconstruction loss (BCE) versus the original and classification accuracy of reconstructed images were measured as a function of crop rate, identifying thresholds where quality deteriorates.
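The separability measures in the list above can be approximated with standard library routines. The sketch below assumes scikit-learn, uses its Davies–Bouldin index as a stand-in for the paper's CSI, and treats the 5-fold cross-validation split as an assumption; `z` and `y` are hypothetical arrays of latent codes and family labels.

```python
# Sketch of the two separability measures, under the assumptions stated above.
from sklearn.metrics import davies_bouldin_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cluster_separation_index(z, y):
    # Davies-Bouldin index as a stand-in for the paper's CSI;
    # lower values indicate better-separated, more compact clusters.
    return davies_bouldin_score(z, y)

def svm_latent_accuracy(z, y, latent_dim=3):
    # RBF-kernel SVM with the reported settings:
    # C = 1.0, tol = 0.001, gamma = 1 / latent_dim.
    clf = SVC(kernel="rbf", C=1.0, tol=0.001, gamma=1.0 / latent_dim)
    # Mean cross-validated accuracy on the latent features (cv=5 is assumed).
    return cross_val_score(clf, z, y, cv=5).mean()
```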
Key Findings
- Morpho-VAE produced well-separated latent clusters by family, outperforming PCA and VAE in cluster separation (lower CSI) and downstream SVM classification accuracy.
- Statistical significance: Steel tests showed higher test accuracy for Morpho-VAE versus PCA (p ≈ 3.07 × 10^-4) and versus VAE (p ≈ 3.70 × 10^-3).
- Family classification by Morpho-VAE achieved high validation accuracy (median ≈ 90% after 100 epochs).
- Leave-one-family-out analysis: Morpho-VAE yielded a lower maximum CSI (better separation) for the majority of held-out families compared with PCA and VAE, indicating a latent space that generalizes to unseen families.
- There was no significant drop in classification accuracy when classifying reconstructed images versus original inputs (Mann–Whitney U-test, p = 0.160), confirming faithful reconstruction; a minimal sketch of such a comparison follows this list.
- Generative exploration of the latent space showed smooth morphological transitions; PC1 correlated with an increasing angle between the condylar and coronoid processes, and PC2 with enlargement of the angular process.
- Score-CAM saliency maps indicated that the x-projection is the most informative, while the z-projection contributed negligibly. Saliency consistently highlighted the angular, condylar, and coronoid processes, functional sites relevant to mastication.
- Cropped-image reconstruction was robust up to roughly 60% vertical crop and 25% horizontal crop, with only slight changes in loss and accuracy; beyond these thresholds, performance deteriorated. Loss of the coronoid/condylar processes led to reconstruction and classification failure.
- No correlation was detected between inter-family latent-space distances and phylogenetic distances derived from a family-level tree (VertLife), suggesting that factors beyond phylogeny (e.g., diet, function) may dominate mandible morphology.
- At a finer taxonomic scale, within Cercopithecidae, Morpho-VAE separated four genera better than PCA and VAE (improved CSI and SVM accuracy), demonstrating applicability to subtler morphological differences.
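As referenced in the reconstruction finding above, a comparison of accuracy distributions with a Mann–Whitney U-test can be set up as in the following hedged illustration; the accuracy values are placeholders, not results from the paper.

```python
# Illustrative only: comparing reconstructed-versus-original classification
# accuracies with a Mann-Whitney U-test in SciPy.
from scipy.stats import mannwhitneyu

acc_original = [0.91, 0.89, 0.93, 0.90, 0.88]       # hypothetical runs
acc_reconstructed = [0.90, 0.88, 0.92, 0.89, 0.87]  # hypothetical runs

stat, p_value = mannwhitneyu(acc_original, acc_reconstructed,
                             alternative="two-sided")
# A p-value above the significance threshold (the paper reports p = 0.160)
# indicates no detectable drop in accuracy on reconstructed images.
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```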
Discussion
By integrating supervised classification with unsupervised variational encoding, Morpho-VAE constructs a compact latent space that emphasizes discriminative morphological traits, akin to a nonlinear LDA while maintaining VAE reconstruction fidelity. This hybrid approach yielded superior cluster separation compared to unsupervised PCA and VAE, even when evaluated on unseen families, indicating that the learned features capture informative shape characteristics that generalize beyond training labels. The anatomical interpretability of the latent axes and Score-CAM maps aligns with known functional anatomy: the angular, condylar, and coronoid processes, key to mastication, drive inter-family distinctions. Reconstruction experiments show the model can infer missing regions from partial data, particularly when key processes remain. The absence of correlation between latent-space distances and phylogenetic distances likely reflects the influence of homoplasy and functional adaptation (e.g., diet) on mandible morphology and the broad phylogenetic scope with limited sample sizes. Sex differences were not evident in this supervised, size-normalized, multi-family setting, suggesting that dedicated analyses without size normalization and at narrower taxonomic scales may be required. Overall, Morpho-VAE provides an interpretable, low-dimensional morphological representation that supports visualization, classification, and shape completion.
Conclusion
The study introduces Morpho-VAE, a landmark-free, image-based deep learning framework that jointly performs dimensionality reduction, classification, and reconstruction for biological shape analysis. Applied to primate and pinniped mandibles, it created a 3D latent space with strong inter-family separation, interpretable anatomical gradients, and robust reconstruction, including from partially missing data. It outperformed PCA and VAE in cluster separation metrics and downstream classification, and highlighted functionally meaningful regions via Score-CAM. Future directions include: systematic comparison with landmark-based methods; expanding datasets and focusing on closely related taxa; exploring metric-learning losses (e.g., triplet loss) to better encode phylogenetic signals; removing or adapting size normalization to study sex differences; and extending the architecture to 3D volumetric inputs for richer reconstruction and analysis.
Limitations
- The dataset is modest in size and imbalanced across families; sex and species ratios could not be split equally across train/validation/test sets, which may limit generalizability.
- The method uses 2D projections rather than full 3D volumes, potentially losing spatial information; 3D models demand more compute and data.
- Size normalization may obscure sexually dimorphic features.
- Supervised training focuses the latent space on label-separating traits, which may reduce sensitivity to other biological signals (e.g., phylogeny).
- The z-projection contributed little information in this setup, suggesting view-dependency.
- Reconstruction from heavily cropped images fails when critical processes (condylar/coronoid) are absent.
- Although cross-validation mitigates overfitting and bgPCA-like spurious separation, small-sample effects can persist.