Multi-animal DeepLabCut: markerless pose estimation and robust animal tracking

Biology

J. Lauer, M. Alam, et al.

This research presents a multi-animal pose estimation and tracking system that extends DeepLabCut. Developed by J. Lauer and colleagues, the system uses multi-task convolutional neural networks for precise animal pose estimation and identity-aware tracking, enabling detailed analysis of complex social interactions, for example among marmosets.
Introduction

The study addresses the challenge of accurate multi-animal 2D pose estimation and identity-aware tracking in complex, crowded, and occlusion-prone scenes. Existing human-centric methods often struggle to generalize to diverse animal morphologies and behaviors, while prior animal tracking tools may not combine precise pose estimation with robust multi-individual tracking. The authors introduce a unified framework that extends DeepLabCut to multiple animals: new convolutional neural network architectures and inference algorithms localize keypoints, assemble them into individual animals without hand-crafted skeletons, predict identity, and track individuals through time. The goal is state-of-the-art accuracy and speed with minimal user input across varied species and experimental setups.

Literature Review

The approach builds on and contrasts with prior multi-person pose estimation and grouping methods such as OpenPose using Part Affinity Fields (PAFs), Associative Embedding (ResNet-AE, HRNet-AE), HigherHRNet, and DeeperCut. It references broader behavior quantification tools (e.g., MARS) and recent multi-animal pose trackers (SLEAP, AlphaTracker). The work also benchmarks against idtracker.ai, which focuses on identity tracking without pose estimation. For benchmarks, the COCO multi-human pose estimation literature and metrics (mAP over OKS thresholds) are adopted. The study notes limitations of top-down pipelines and emphasizes bottom-up reasoning over whole images for multi-animal scenes.

Methodology

Datasets: Four open benchmark datasets were curated: (1) Tri-mouse: three male C57BL/6J mice following odor trails (640x480, 30 Hz), 12 keypoints, 161 labeled frames; (2) Parenting: an adult female mouse interacting with two pups (720p or 704x480 at 30 Hz), adult 12 keypoints, pup 5 keypoints (intermediate points interpolated and adjusted), 542 labeled frames; (3) Marmosets: pairs recorded with Kinect V2 (1080p, 30 Hz), 7,600 labeled frames from 40 animals across colonies, 15 keypoints per animal; (4) Fish (Menidia beryllina): dorsal view in a flow tank at 60–125 fps, 5 keypoints, 100 labeled frames. Additional densely annotated videos used for tracking evaluation, together with total durations, are listed in Table 1. Data split: 70% train, 30% test. Augmentations include rotation, occlusions (random boxes), motion blur, and keypoint-aware cropping.
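To make the augmentation step concrete, the following is a minimal sketch using the imgaug library; the operations mirror the ones named above (rotation, occluding boxes, motion blur), but every probability and parameter value is an illustrative assumption rather than the paper's setting.

```python
# Illustrative augmentation pipeline; values are assumptions, not the paper's settings.
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Sometimes(0.5, iaa.Affine(rotate=(-25, 25))),               # random rotation
    iaa.Sometimes(0.3, iaa.CoarseDropout(0.02, size_percent=0.3)),  # random occluding boxes
    iaa.Sometimes(0.3, iaa.MotionBlur(k=7)),                        # simulated motion blur
])

# images: array of HxWxC uint8 frames; keypoints can be passed alongside so
# that geometric transforms are applied consistently to the labels.
# augmented_images = augmenter(images=images)
```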
Architecture: Multi-task CNNs with ImageNet-pretrained backbones (ResNets, EfficientNets) and a custom multi-scale architecture DLCRNet_ms5. Outputs per keypoint: score maps (probability), location refinement fields (offsets), and PAFs (vector fields encoding limb orientations). An identity (ID) head predicts per-keypoint identity probabilities for marked animals. The DLCRNet_ms5 uses a multi-fusion module combining high-resolution (conv2/conv3) with low-resolution features (conv5) via downsampling (3x3, 1x1 conv) and upsampling (3x3 deconvs, stride 2), followed by a multi-stage decoder predicting score maps and PAFs with shortcut connections between score-map stages. Overall stride tunable (2–8). Losses: cross-entropy for score maps, Huber for location refinement, L1 for PAFs; trained with Adam and a staged learning rate schedule for 60k–200k iterations, batch size 8.
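As an illustration of how the three loss terms described above could be combined, the PyTorch-style sketch below uses hypothetical tensor names (pred_scmap, pred_locref, pred_paf) and illustrative weights; the actual DeepLabCut implementation differs in framework and details.

```python
# Sketch of a multi-task loss: cross-entropy on score maps, Huber (smooth L1)
# on location-refinement offsets, L1 on part affinity fields. Weights are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def multitask_loss(pred_scmap, pred_locref, pred_paf,
                   gt_scmap, gt_locref, gt_paf,
                   locref_weight=0.05, paf_weight=0.1):
    # Per-pixel cross-entropy between predicted and target score maps.
    scmap_loss = F.binary_cross_entropy_with_logits(pred_scmap, gt_scmap)
    # Huber loss on sub-pixel offset fields.
    locref_loss = F.smooth_l1_loss(pred_locref, gt_locref)
    # L1 loss on part affinity fields.
    paf_loss = F.l1_loss(pred_paf, gt_paf)
    return scmap_loss + locref_weight * locref_loss + paf_weight * paf_loss
```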
Inference and assembly: Keypoints are decoded from smoothed score maps with subpixel refinement using location fields and non-maximum suppression. PAF integration along candidate connections yields affinity costs. A data-driven skeleton selection is performed: models are trained on a fully connected graph; edges are ranked by discriminability (auROC between within-animal vs between-animal PAF costs) and pruned to a minimal spanning tree, then progressively extended; the selected graph maximizes purity and connectivity. Assembly proceeds by selecting strong connections, finding connected components, and greedily adding remaining connections, optionally calibrated by a pose prior (Mahalanobis distance) and with temporal coherence costs when analyzing videos. The method supports arbitrary graphs and is parallelized.
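The PAF-based affinity cost can be illustrated by integrating the predicted vector field along the segment joining two candidate keypoints. The sketch below is a simplified version under assumed array layouts (paf_x, paf_y as HxW component maps) and an assumed sample count; it is not the package's exact routine.

```python
# Score a candidate connection by averaging the dot product between the PAF
# and the unit limb direction along the segment between two detections.
import numpy as np

def paf_affinity(paf_x, paf_y, p1, p2, n_samples=10):
    """paf_x, paf_y: HxW vector-field components for this limb;
    p1, p2: (x, y) coordinates of the candidate keypoints."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    vec = p2 - p1
    norm = np.linalg.norm(vec)
    if norm < 1e-6:
        return 0.0
    unit = vec / norm
    # Sample the field at evenly spaced points along the segment.
    ts = np.linspace(0.0, 1.0, n_samples)
    xs = np.clip(np.round(p1[0] + ts * vec[0]).astype(int), 0, paf_x.shape[1] - 1)
    ys = np.clip(np.round(p1[1] + ts * vec[1]).astype(int), 0, paf_x.shape[0] - 1)
    field = np.stack([paf_x[ys, xs], paf_y[ys, xs]], axis=1)
    return float(np.mean(field @ unit))
```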
Tracking: A tracking-by-detection pipeline links assembled animals across frames into tracklets using online local trackers: box tracker (IoU-based Kalman filter) and ellipse tracker (error ellipse for robustness to outliers). Tracklets are globally stitched via a min-cost flow optimization on a directed acyclic graph with multiple affinity costs: motion (bidirectional centroid prediction error), spatial proximity, shape similarity (undirected Hausdorff distance of keypoints), and dynamics (Hankelets/time-delay embeddings to assess dynamical similarity). An ID head (supervised) can reduce switch costs, and a GUI supports manual refinement.
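To convey the idea of tracklet stitching, the sketch below matches tracklet ends to tracklet starts with a combined affinity cost and a bipartite assignment. This is a simplified stand-in for the paper's min-cost-flow formulation; the tracklet helpers (predict_forward, last_centroid, first_centroid), cost terms, and weights are hypothetical.

```python
# Simplified tracklet stitching: motion + proximity costs, solved as an assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def stitch_costs(ending_tracklets, starting_tracklets, w_motion=1.0, w_space=1.0):
    """Each tracklet is assumed to expose .last_centroid / .first_centroid and
    .predict_forward(), all returning (x, y) arrays; purely hypothetical helpers."""
    cost = np.zeros((len(ending_tracklets), len(starting_tracklets)))
    for i, a in enumerate(ending_tracklets):
        for j, b in enumerate(starting_tracklets):
            motion = np.linalg.norm(a.predict_forward() - b.first_centroid)
            space = np.linalg.norm(a.last_centroid - b.first_centroid)
            cost[i, j] = w_motion * motion + w_space * space
    return cost

def stitch(ending_tracklets, starting_tracklets):
    cost = stitch_costs(ending_tracklets, starting_tracklets)
    rows, cols = linear_sum_assignment(cost)  # minimum-cost matching
    return list(zip(rows, cols))
```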
Unsupervised re-identification: A transformer-based metric learning module (ReIDTransformer) ingests 2,048-d keypoint-centric features extracted from the trained DLC backbone along tracklets. With 4 heads and 4 blocks (dim 768), followed by an MLP to 128-d embeddings, trained with triplet loss on sampled triplets from ground truth and local tracklets, it provides cosine-similarity costs for stitching.
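A minimal sketch of this metric-learning setup is shown below: a small transformer encoder maps per-tracklet keypoint features to an embedding trained with a triplet loss. The input and output dimensions follow the text (2,048-d features, 128-d embeddings, 4 heads, 4 blocks, model dim 768); the projection, pooling, and MLP details are illustrative assumptions, not the released implementation.

```python
# Transformer-based embedding trained with a triplet loss (illustrative sketch).
import torch
import torch.nn as nn

class ReIDEncoder(nn.Module):
    def __init__(self, in_dim=2048, model_dim=768, out_dim=128,
                 n_heads=4, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.Linear(model_dim, model_dim), nn.ReLU(),
                                  nn.Linear(model_dim, out_dim))

    def forward(self, x):                # x: (batch, seq_len, 2048)
        h = self.encoder(self.proj(x))   # contextualize features along the tracklet
        return self.head(h.mean(dim=1))  # pool over time -> (batch, 128)

triplet_loss = nn.TripletMarginLoss(margin=0.2)
# loss = triplet_loss(emb_anchor, emb_positive, emb_negative)
```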
Evaluation: Detection errors (RMSE) and PCK (33% of ear distance; fish uses tip–gill distance), assembly quality via mAP over OKS thresholds (0.50–0.95), unconnected fraction and purity. Tracking via MOTA and related CLEAR MOT metrics (FP, FN, ID switches), IDF1/IDP/IDR, recall, precision. Comparisons made to HRNet-AE and ResNet-AE (MMPose implementations) under matched training regimes; top-down variants with and without PAFs also evaluated.
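For reference, object keypoint similarity (OKS), the basis of the mAP metric used here, can be computed as in the sketch below; it follows the COCO convention, with per-keypoint tolerance constants and the object-scale definition taken as assumptions.

```python
# OKS in the COCO convention: exp(-d_i^2 / (2 * s^2 * k_i^2)), averaged over
# visible keypoints, where s^2 is the object area and k_i a per-keypoint constant.
import numpy as np

def oks(pred, gt, visible, area, kappas):
    """pred, gt: (K, 2) keypoint coordinates; visible: (K,) boolean mask;
    area: object scale s^2 (e.g., box area); kappas: (K,) per-keypoint constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    per_kpt = np.exp(-d2 / (2.0 * area * kappas ** 2 + np.finfo(float).eps))
    return float(per_kpt[visible].mean()) if visible.any() else 0.0
```

mAP then averages precision over OKS thresholds from 0.50 to 0.95, matching the range quoted above.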

Key Findings
  • Keypoint detection: DLCRNet_ms5 achieved median test errors of approximately 2.65, 5.25, 4.59, and 2.72 pixels (tri-mouse, parenting, marmoset, fish). Across datasets, 93.6±6.9% of predictions were within PCK thresholds (33% ear or tip–gill distances).
  • PAF discriminability: Predicted limbs strongly distinguish within-animal from cross-animal keypoint pairs; mean±s.d. auROC ≈ 0.99±0.02.
  • Data-driven skeleton: Automatic edge selection reduced unconnected keypoints and increased assembly purity compared to naive skeletons, with gains up to 3.0 (tri-mouse), 2.0 (marmosets), and 2.4 percentage points (fish) in purity; supported arbitrary graphs; assembly speeds reached ~400 fps with 14 animals and up to ~2,000 fps for small skeletons with 2–3 animals.
  • Benchmarking vs human-pose SOTA: DLCRNet-based models significantly outperformed HRNet-AE and ResNet-AE on all four animal datasets in mAP (one-way ANOVA P-values: tri-mouse 8.8×10⁻¹¹, pups 6.5×10⁻¹⁰ to 10⁻¹², marmosets 3.8×10⁻¹¹, fish 4.0×10⁻¹²). Generalization to new marmoset cages incurred ~0.25 mAP drop, mitigable by adding new data.
  • Top-down ablation: Including PAFs in top-down variants significantly improved mAP over without PAFs (ANOVA P: tri-mouse 4.656×10⁻¹¹; pups 3.62×10⁻¹²; marmosets 1.33×10⁻² to 10⁻²⁸; fish 1.645×10⁻⁶).
  • Identity head (supervised): On marked marmosets, per-keypoint ID accuracy peaked at 99.2% near head regions and was ~95.1% on distal parts. Incorporating appearance cues reduced switches by 26% in marmosets.
  • Tracking performance: Ellipse tracker outperformed box tracker (MOTA 0.97 vs 0.78) with ~92% fewer false negatives and similar switch rates. Global stitching further reduced switches by an average of 63% vs local tracking, improving full track reconstruction.
  • Unsupervised ReID: ReIDTransformer improved triplet accuracy and delivered up to ~10% MOTA gain on challenging fish sequences; it also improved multiple ID metrics on the most crowded frames for fish and marmosets.
  • Comparison to idtracker.ai: On tri-mouse and marmoset datasets, idtracker.ai performed significantly worse (MOTA; one-sided, one-sample t-tests: tri-mouse t=-11.03, P=0.0008; marmosets t=-8.43, P=0.0018).

Discussion

The findings show that combining multi-scale CNNs with PAF-based grouping, a data-driven skeleton selection, and identity-aware tracking yields robust multi-animal pose estimation and tracking across varied species and challenging conditions. PAFs strongly aid correct within-animal grouping, while automatic skeleton pruning removes the need for hand-crafted, species-specific limb graphs and improves assembly purity and connectivity. The bottom-up pipeline, which reasons over the entire image, generally outperforms top-down approaches and scales efficiently, evidenced by high assembly framerates. Identity cues, both supervised (appearance of marked animals) and unsupervised (transformer-based metric learning), further reduce identity switches and improve tracking continuity, especially when temporal cues alone are insufficient (e.g., entry/exit, occlusions). Benchmark comparisons indicate that methods optimized for human pose (e.g., HRNet-AE, ResNet-AE) underperform on these animal datasets relative to the proposed approach, highlighting the importance of animal-specific multi-task design and training. The framework’s generalization is strong but can drop with substantial domain shifts; adding data from new environments improves performance, aligning with expectations from prior literature.

Conclusion

This work extends DeepLabCut to multi-animal scenarios by introducing DLCRNet architectures, data-driven skeleton selection, fast assembly, and a hierarchical tracking system augmented by supervised ID heads and an unsupervised transformer-based metric learning module. The unified framework delivers state-of-the-art pose estimation and tracking performance across diverse animal datasets and is efficient, flexible, and requires minimal user input. The authors release four open benchmark datasets and code to spur community progress. Future directions include further improving performance in highly occluded and visually similar collectives (e.g., dense fish schools), exploring enhanced domain generalization strategies, extending 3D multi-animal capabilities, and refining identity modeling for unmarked animals.

Limitations
  • Generalization to new environments or colonies can degrade (e.g., ~0.25 mAP drop on held-out marmoset cages) without additional training data.
  • Highly crowded, occluded, and visually similar groups (e.g., schooling fish) remain challenging; while improved, the problem is not fully solved.
  • Performance depends on sufficient image resolution to localize keypoints; very small animals or low-resolution video may limit accuracy.
  • Supervised ID requires consistent visible markings or features; unsupervised reID improves tracking but may not fully resolve identity in the hardest cases.
  • Top-down pipelines may be suboptimal in crowded scenes; bottom-up is recommended but still subject to occlusion-induced errors.