SLEAP: A deep learning system for multi-animal pose tracking
T. D. Pereira, N. Tabris, et al.
SLEAP is an open-source machine learning system for multi-animal pose tracking developed by researchers at Princeton University and collaborating institutions. It pairs accurate pose estimation with identity tracking, supports real-time applications, and enables quantitative study of social interactions across a range of species.
Introduction
The paper addresses the challenge of reliable multi-animal pose estimation and tracking, which goes beyond single-animal landmark detection by requiring robust part grouping within frames and consistent identity tracking across frames. Existing tools often handle either pose estimation or identity tracking, and typically implement only one of two strategies, bottom-up (detect parts, then group them into animals) or top-down (detect individuals, then locate their parts), leaving it unclear which is optimal for animal datasets with diverse imaging and behavioral conditions. The authors present SLEAP, a unified, flexible system that implements both approaches along with identity tracking via temporal and appearance cues. The goal is a general, efficient, accurate, and user-accessible framework that supports the full workflow from annotation to real-time inference, thereby facilitating studies of social behavior and enabling experiments that depend on low-latency feedback from pose estimates.
Literature Review
Prior work adapted human pose-estimation techniques to animals for single individuals (for example, DeepLabCut, DeepPoseKit, LEAP), but multi-animal settings introduce the additional complexities of part grouping and identity maintenance. Multi-human pose methods use bottom-up approaches with part affinity fields or top-down pipelines with instance detection followed by part localization; however, their relative suitability for animal data has not been established. Tools exist for multi-animal identity tracking (for example, idtracker.ai, TRex) and multi-animal pose estimation (including evolving multi-animal DeepLabCut and other pipelines such as AlphaTracker), yet comprehensive frameworks that compare bottom-up and top-down approaches within the same system and support end-to-end workflows remain scarce. SLEAP builds on these developments by integrating both strategies, a standardized data model, and identity tracking into one system, and by benchmarking across diverse animal datasets.
Methodology
- System design and workflow: SLEAP provides an integrated pipeline: data I/O (videos, arrays, import from DLC/DPK/COCO), interactive labeling GUI, configuration-driven model training (JSON-configured hyperparameters, data preprocessing, augmentation, architectures, optimization), evaluation, inference via CLI/APIs, proofreading, and standardized, portable outputs with metadata.
- Data model: A standardized, format-agnostic schema encapsulates videos, labeled frames, instances, skeletons, tracks, and predictions, saved in a single HDF5-based SLP file that can optionally embed images. This supports reproducibility and interoperability (see the loading sketch after this list).
- Multi-instance approaches:
• Bottom-up: A single network predicts multi-part confidence maps and part affinity fields (PAFs). Local peaks yield part coordinates; PAF-based connection scores guide optimal matching into full instances (see the peak-finding and PAF-scoring sketch after this list). GPU-accelerated operations implement peak detection, scoring, and matching. Skeletons are directed trees.
• Top-down: Stage 1 detects per-animal anchors (e.g., centroids) via confidence maps and local peak finding; Stage 2 processes anchor-centered crops with a centered-instance network that predicts parts for the centered individual via global peak finding. Animals are modeled implicitly by spatial priors in crops.
- Neural architectures: SLEAP supports any fully convolutional backbone within an encoder-decoder framework. A modular UNet is the primary architecture, configurable by the number of downsampling/upsampling blocks to control receptive field (RF) and computational cost (a configurable-depth UNet sketch follows this list). It also supports transfer-learning backbones (ResNet, MobileNet, EfficientNet, VGG, etc.) with skip connections. Integral-regression-based subpixel refinement improves localization beyond the resolution of the output stride.
- Identity tracking:
• Temporal (flow-shift): Optical-flow-based displacement estimates propagate poses across frames for identity association without additional training or labels (sketched after this list).
• Appearance-based ID: Bottom-up ID replaces PAFs with multi-class segmentation-like maps to assign IDs per landmark; top-down ID adds a classification head on centered crops to score identity per instance. Optimal assignment yields per-frame identity without temporal propagation.
- Evaluation metrics and benchmarking: Accuracy is measured by mAP using OKS (adapted from PoseTrack), with the per-keypoint uncertainty factor set to the human-eye level (0.025) as a conservative lower bound (an OKS sketch follows this list); tracking is quantified primarily by identity switches via MOT metrics. Speed is measured as end-to-end inference throughput/latency on preloaded 1,280-frame clips, including GPU warm-up, using a Titan RTX GPU on Ubuntu.
- Datasets: Seven datasets totaling 7,636 labeled images and 15,441 instances (single fly; flies; bees; mice open field; mice home cage; gerbils; and a flies dataset with extended landmarks for closed-loop). Fixed train/val/test splits used for reproducibility.
- Real-time and closed-loop: End-to-end latency quantified via DAQ loopback alignment of online vs offline pose-derived features. A closed-loop system triggers optogenetic activation in Drosophila females contingent on real-time detection of a male approach pose computed from tracked landmarks.
- Training details: Single-GPU training pipelines built on tf.data; Adam with AMSGrad; initial learning rate 1e-4; up to 200 epochs with early stopping; batch size 4; rotation augmentation; logging via TensorBoard/CSV and the GUI training monitor (a configuration sketch follows this list).
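To make the data model concrete, here is a minimal loading sketch. It assumes the sleap Python package is installed and that a project file named labels.v001.slp exists (a hypothetical filename); attribute names follow recent SLEAP releases and may vary by version.

```python
import sleap

# Load a SLEAP project (.slp is a single HDF5 file, optionally with embedded images).
labels = sleap.load_file("labels.v001.slp")

print(labels.videos)     # source videos referenced by the project
print(labels.skeletons)  # skeleton definitions (nodes and directed edges)

# Each labeled frame holds one instance per animal, with optional track assignments.
for lf in labels.labeled_frames[:5]:
    for instance in lf.instances:
        print(lf.frame_idx, instance.numpy())  # (n_parts, 2) array of x, y coordinates
```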
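The bottom-up grouping step can be illustrated with a short NumPy sketch. This is not SLEAP's GPU-accelerated implementation; it shows the two core operations in their simplest form: local peak detection in a confidence map, and scoring a candidate source-to-destination connection by integrating the PAF along the line between two peaks. Function names and default values are illustrative.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_local_peaks(cm, threshold=0.2):
    """Return (row, col) coordinates of local maxima above threshold in a confidence map."""
    is_peak = (maximum_filter(cm, size=3) == cm) & (cm > threshold)
    return np.argwhere(is_peak)

def paf_connection_score(paf_r, paf_c, src, dst, n_samples=10):
    """Score a candidate src->dst connection by line-integrating the PAF.

    paf_r, paf_c: row/col components of the part affinity field for this edge.
    src, dst: (row, col) peak coordinates of the two candidate parts.
    """
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    vec = dst - src
    length = np.linalg.norm(vec)
    if length == 0:
        return 0.0
    unit = vec / length
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = np.round(src + ts[:, None] * vec).astype(int)  # sample points along the segment
    field = np.stack([paf_r[pts[:, 0], pts[:, 1]],
                      paf_c[pts[:, 0], pts[:, 1]]], axis=1)
    return float(np.mean(field @ unit))  # mean alignment of the field with the limb direction
```

Given scores for all candidate part pairs of an edge, an optimal one-to-one matching (for example, scipy.optimize.linear_sum_assignment on negated scores) assembles parts into instances by walking the directed skeleton tree.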
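The receptive-field analysis is easy to reason about because each stride-2 downsampling block roughly doubles the span of the input visible to the deepest features. Below is a minimal Keras sketch of an encoder-decoder whose depth is set by a single parameter; the layer widths and full-resolution output head are illustrative choices, not SLEAP's exact UNet module (SLEAP typically decodes to an output stride of 2 or 4).

```python
import tensorflow as tf
from tensorflow.keras import layers

def mini_unet(input_shape=(512, 512, 1), n_down=4, filters=16, n_parts=13):
    """Tiny configurable UNet: n_down sets depth, hence receptive field and cost."""
    x = inputs = tf.keras.Input(shape=input_shape)
    skips = []
    for i in range(n_down):  # encoder: convolve, save skip, downsample
        x = layers.Conv2D(filters * 2**i, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(filters * 2**n_down, 3, padding="same", activation="relu")(x)
    for i in reversed(range(n_down)):  # decoder: upsample and fuse skip connections
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skips[i]])
        x = layers.Conv2D(filters * 2**i, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(n_parts, 1)(x)  # one confidence map per body part
    return tf.keras.Model(inputs, outputs)

model = mini_unet(n_down=5)  # deeper encoder -> larger receptive field, more compute
```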
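A hedged sketch of the flow-shift idea, using OpenCV's pyramidal Lucas-Kanade flow rather than SLEAP's exact implementation: shift the previous frame's landmarks along the estimated flow, then match the shifted poses to the current frame's detections by minimal mean landmark distance.

```python
import numpy as np
import cv2
from scipy.optimize import linear_sum_assignment

def flow_shift_match(prev_gray, curr_gray, prev_poses, curr_poses):
    """Assign identities by shifting last frame's poses along optical flow.

    prev_gray, curr_gray: uint8 grayscale frames.
    prev_poses, curr_poses: lists of (n_parts, 2) float32 arrays in (x, y).
    Returns a list of (prev_index, curr_index) identity assignments.
    """
    # Propagate every landmark of every previous pose into the current frame.
    pts = np.concatenate(prev_poses).astype(np.float32).reshape(-1, 1, 2)
    shifted, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    shifted = shifted.reshape(-1, 2)
    shifted[status.ravel() == 0] = np.nan  # discard landmarks where flow failed
    shifted = shifted.reshape(len(prev_poses), -1, 2)

    # Cost matrix: mean landmark distance between shifted poses and detections.
    cost = np.array([[np.nanmean(np.linalg.norm(s - c, axis=1))
                      for c in curr_poses] for s in shifted])
    cost = np.nan_to_num(cost, nan=1e6)  # poses with no surviving landmarks
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return list(zip(rows, cols))
```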
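The OKS score behind the mAP numbers has a closed form: for each keypoint i, the similarity exp(-d_i^2 / (2 s^2 k_i^2)) is averaged over visible keypoints, where d_i is the prediction error, s the instance scale (square root of bounding-box area), and k_i the per-keypoint uncertainty, set uniformly here to the conservative 0.025 'eye' value used in the paper. A minimal implementation:

```python
import numpy as np

def oks(gt, pred, scale, k=0.025, visible=None):
    """Object Keypoint Similarity between one ground-truth and one predicted pose.

    gt, pred: (n_parts, 2) arrays; scale: sqrt of instance bounding-box area;
    k: per-keypoint uncertainty (0.025 = human 'eye' value, per the paper).
    """
    d2 = np.sum((gt - pred) ** 2, axis=1)
    sim = np.exp(-d2 / (2 * scale**2 * k**2))
    if visible is None:
        visible = ~np.isnan(d2)  # treat missing ground-truth points as invisible
    return float(np.mean(sim[visible]))
```

mAP then averages precision over OKS match thresholds from 0.50 to 0.95 in steps of 0.05, following the COCO/PoseTrack protocol.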
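The stated training hyperparameters map directly onto standard Keras components. A sketch, assuming a model and prebuilt tf.data datasets already exist (model, train_ds, and val_ds are placeholders, and the early-stopping patience is an illustrative value, not taken from the paper):

```python
import tensorflow as tf

# Adam with the AMSGrad variant and the paper's initial learning rate.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, amsgrad=True)
model.compile(optimizer=optimizer, loss="mse")  # MSE on confidence maps

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,  # patience assumed
                                     restore_best_weights=True),
    tf.keras.callbacks.CSVLogger("training_log.csv"),
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

model.fit(train_ds.batch(4), validation_data=val_ds.batch(4),
          epochs=200, callbacks=callbacks)
```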
Key Findings
- Single-animal benchmark: SLEAP achieves accuracy comparable to DeepLabCut and DeepPoseKit (mAP 0.927 vs 0.928 for DLC) while running substantially faster (2,194 FPS vs 458 FPS for DLC).
- Multi-animal throughput: Peak inference speeds of 762 FPS (flies) and 358 FPS (mice, open field). Top-down ID models reached up to 804 FPS on flies at 1,024×1,024 resolution with 13 landmarks and identity assignment.
- Sample efficiency: Models reach 50% of peak accuracy with as few as 20 labeled frames and 90% of peak accuracy with roughly 200 labeled frames on the flies and mice datasets.
- Localization accuracy: 95% of landmark errors within 0.084 mm (3.2% body size) for flies and 3.04 mm (3.7% body size) for mice. Overall multi-animal mAPs of 0.821 (flies) and 0.774 (mice), comparable to top multi-person benchmarks.
- Approach trade-offs: Top-down generally more accurate and faster with few animals; bottom-up scales better as animal count increases in frame. Bottom-up shows near-constant speed vs animal number; top-down scales roughly linearly with number of animals.
- Architecture insights: Increasing the receptive field improves accuracy up to a dataset-specific saturation point, enabling small, fast specialist models with adequate RF to match or exceed transfer-learned backbones. Transfer learning often improves over random initialization for a fixed backbone but does not outperform an optimally configured, randomly initialized UNet; pretrained encoders incur 3–4× longer training and 7–11× slower inference than a lightweight UNet at similar accuracy.
- Identity tracking performance: Temporal flow-shift tracking yields low switch rates: 0.91 and 22.7 ID switches per 100,000 frames for flies and mice, respectively (11.7M fly frames, 367k mouse frames). Appearance-based ID achieves very high accuracy: flies 99.7% (bottom-up) to 100% (top-down); gerbils 82.2% (bottom-up) and 93.1% (top-down) despite challenging conditions.
- Real-time capability: End-to-end system latency ~70 ms (mean 71.0 ms, s.d. 17.0 ms; n=50,000 1-s segments) from frame capture to output; model inference latency ~3.2–3.45 ms per 1,024×1,024 image (mean 3.45 ms, s.d. 0.16 ms; n=1,280 images).
- Closed-loop experiment: Real-time detection of male approach triggers optogenetic activation of DNp13 neurons in virgin female flies, reliably evoking ovipositor extrusion (OE). Total latency to behavioral response ~326 ± 150 ms, decomposed into ~77 ± 11 ms system latency and ~249 ± 148 ms biological latency.
Discussion
SLEAP addresses the core multi-animal challenges of part grouping and persistent identity by offering both bottom-up and top-down pose pipelines and two complementary identity strategies (temporal and appearance). The system’s speed and low latency enable real-time, closed-loop behavioral experiments previously impractical in multi-animal settings. Empirical comparisons show that top-down models are advantageous for scenes with few animals due to speed and accuracy, whereas bottom-up scales better when many animals fill the field of view. Architecture studies demonstrate that appropriately tuned receptive fields in lightweight UNet variants can match or exceed the accuracy of heavier, pretrained encoders while dramatically improving training and inference efficiency. The low ID switch rates with temporal tracking and high single-frame ID accuracy with appearance models illustrate that SLEAP can maintain identities across diverse conditions, including challenging, long-duration home-cage recordings where proofreading is infeasible. Overall, SLEAP’s modular, reproducible design and standardized data model make it suitable for widespread adoption and integration into downstream behavioral analysis pipelines and experimental control systems.
Conclusion
The paper introduces SLEAP, a flexible, open-source, end-to-end system for multi-animal pose estimation and tracking that unifies bottom-up and top-down approaches, supports temporal and appearance-based identity assignment, and delivers high accuracy with exceptional speed and low latency suitable for real-time closed-loop experiments. Comprehensive benchmarks across seven datasets establish strong accuracy, data efficiency, and performance advantages over prior tools. Future directions include improved temporal modeling for more consistent tracking, multi-view alignment for 3D pose estimation, and self-supervised learning to enhance sample efficiency and generalization. The released datasets, models, configurations, and documentation provide a robust foundation for further research and tool development.
Limitations
- Appearance-based ID requires animals with sufficiently distinct visual features and manual ID labeling; it may not generalize to visually indistinguishable individuals.
- Temporal (optical-flow) tracking can propagate identity errors over time, which is problematic for very long videos or real-time applications where proofreading is not possible.
- Performance trade-offs between top-down and bottom-up approaches depend on the number of animals and scene composition; selecting and configuring the optimal approach may require dataset-specific tuning (e.g., receptive field size, skeleton design).
- OKS-based accuracy uses a conservative per-keypoint uncertainty factor borrowed from human annotation of the eye landmark, potentially underestimating true accuracy for some animal landmarks.
- Generalist models across diverse domains may require larger compute; SLEAP emphasizes specialist models that perform best within narrower domains.