Introduction
Quantitative analysis of animal movement is crucial for understanding behavior across neuroscience and ecology. Pose estimation, the prediction of body part locations in images, has become a standard approach to behavioral quantification, and deep learning adaptations of human pose estimation methods have enabled accurate single-animal pose estimation. However, reliably tracking multiple interacting animals remains challenging, hindering the study of social behaviors. Extending pose estimation to multiple animals requires solving two assignment problems: grouping body part detections into individuals within an image, and linking individuals across frames. While tools exist for tracking multiple animal identities, a unified approach that combines pose estimation and identity tracking is needed. Existing multi-human pose estimation methods follow either a bottom-up strategy (detect all parts, then group them into individuals) or a top-down strategy (find individuals first, then detect their parts), but the optimal strategy for animals is unclear. Existing animal pose estimation tools implement only one of these approaches and lack the flexibility to compare both. This paper introduces SLEAP, a system that addresses these limitations.
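The bottom-up grouping problem described above can be illustrated with a deliberately simplified sketch: given part detections pooled across an image, assign each one to an individual. Here the grouping criterion is nearest animal centroid; this is a toy stand-in for the learned pairwise scores (e.g. part affinity fields) that real bottom-up methods use, and all names and coordinates are hypothetical.

```python
import math

def group_parts_by_centroid(centroids, detections):
    """Toy bottom-up grouping: assign each detected body part to the
    nearest animal centroid. Real bottom-up methods (e.g. part affinity
    fields) use learned pairwise scores instead of raw distance."""
    instances = {i: {} for i in range(len(centroids))}
    for part_name, (x, y) in detections:
        # Pick the centroid closest to this detection.
        best = min(range(len(centroids)),
                   key=lambda i: math.dist(centroids[i], (x, y)))
        instances[best][part_name] = (x, y)
    return instances

# Two animals; four part detections mixed together in one image.
centroids = [(10.0, 10.0), (50.0, 50.0)]
detections = [("head", (9.0, 12.0)), ("tail", (13.0, 8.0)),
              ("head", (48.0, 51.0)), ("tail", (53.0, 47.0))]
print(group_parts_by_centroid(centroids, detections))
```

A top-down pipeline would instead crop a sub-image around each centroid and run single-animal part detection within the crop, sidestepping the grouping step entirely.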
Literature Review
Existing methods for animal pose estimation and tracking include DeepLabCut, DeepPoseKit, and LEAP for single-animal pose estimation, and idtracker.ai for multi-animal identity tracking. However, none of these provides a unified approach combining pose estimation with identity tracking. More recent tools such as TRex and AlphaTracker address parts of the problem but lack a comprehensive framework for handling the entire workflow, including data labeling and systematic comparison across different pose estimation approaches.
Methodology
SLEAP is a comprehensive framework encompassing the entire multi-animal pose-tracking workflow: interactive labeling, training, inference, and proofreading. It implements both top-down and bottom-up approaches, performs animal identity tracking via motion or appearance models, and offers over 30 neural network architectures. SLEAP accepts data in various formats and imports annotations from other software. An interactive graphical user interface (GUI) facilitates labeling, exporting training packages, launching and monitoring training, proofreading predictions, and exporting data. A configuration system ensures reproducibility by providing standardized hyperparameters for model creation and training. Trained SLEAP models perform pose prediction using efficient GPU-accelerated code accessible via a command-line interface (CLI) or high-level APIs, which also allow integration into other applications. A standardized data model captures the aspects specific to multi-animal tracking and enables data sharing. SLEAP follows industry-standard software engineering practices for version control, continuous integration, packaging, and documentation, and is built in Python on numerous open-source libraries.
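The standardized data model can be pictured as a skeleton graph (body part nodes connected by edges) plus per-animal instances mapping nodes to predicted coordinates. The sketch below is a hypothetical, minimal mirror of that idea; the class names, fields, and example skeleton are illustrative and do not reproduce SLEAP's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Skeleton:
    """Body plan shared by all animals in a project."""
    nodes: list  # body part names
    edges: list  # (src, dst) index pairs forming the skeleton graph

@dataclass
class Instance:
    """One animal's pose in one frame."""
    skeleton: Skeleton
    points: dict = field(default_factory=dict)  # node name -> (x, y)

    def is_complete(self):
        # True when every body part has a predicted location.
        return all(n in self.points for n in self.skeleton.nodes)

# Hypothetical fly skeleton with three landmarks.
fly = Skeleton(nodes=["head", "thorax", "abdomen"],
               edges=[(0, 1), (1, 2)])
inst = Instance(skeleton=fly, points={"head": (12.0, 30.5),
                                      "thorax": (15.2, 33.1)})
print(inst.is_complete())  # abdomen not yet predicted
```

Representing poses this way makes partial annotations, missing detections, and cross-frame identity assignment explicit rather than implicit in array layouts.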
Key Findings
SLEAP achieves high accuracy and speed. Compared with DeepLabCut, DeepPoseKit, and LEAP on single-animal data, SLEAP achieves comparable or better accuracy with significantly faster prediction speeds (2,194 vs 458 FPS). On multi-animal datasets (flies and mice), SLEAP reaches peak inference speeds of 762 and 358 FPS, respectively. It reaches 50% of peak accuracy with as few as 20 labeled frames and 90% with 200 labeled frames. Landmark localization errors are low: 95% of estimates fall within 0.084 mm for flies and 3.04 mm for mice. SLEAP's mAP scores (0.821 for flies, 0.774 for mice) are comparable to the top scores on multi-person pose estimation benchmarks.

SLEAP implements both bottom-up and top-down approaches to multi-instance pose estimation. Bottom-up models use confidence maps and part affinity fields (PAFs) to detect and group body parts; top-down models first detect animals via anchor points, then locate body parts within centered sub-images. Top-down models are generally more accurate and faster with few animals, while bottom-up models scale more efficiently as the number of animals grows. SLEAP's modular UNet architecture can be configured to control the receptive field (RF) size, improving accuracy while managing computational cost; transfer learning with pretrained encoders does not always outperform well-configured, randomly initialized UNets.

For identity tracking, SLEAP offers flow-shift tracking (which exploits temporal context) and appearance-based models. Flow-shift tracking is highly accurate over short timescales but prone to error accumulation across long recordings, whereas appearance-based models maintain high accuracy even in challenging conditions such as visually similar gerbils. SLEAP also achieves real-time performance, with model inference latency as low as 3.2 ms, demonstrated by controlling a fly's behavior based on real-time detection of social interactions; total end-to-end closed-loop latency is around 70 ms, of which roughly 3 ms is model inference.
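The receptive field tuning mentioned above follows standard convolutional arithmetic: each layer grows the RF by (kernel − 1) × jump, and each stride multiplies the jump. The sketch below computes the RF of a hypothetical UNet-style encoder (two 3×3 convolutions per 2×2 max-pool step); the exact layer counts are illustrative assumptions, not SLEAP's configuration.

```python
def receptive_field(layers):
    """Compute the receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    Uses the standard recurrence rf += (kernel - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Hypothetical encoder: 4 downsampling blocks, each with two 3x3
# convolutions (stride 1) followed by a 2x2 max pool (stride 2).
encoder = []
for _ in range(4):
    encoder += [(3, 1), (3, 1)]
    encoder.append((2, 2))
print(receptive_field(encoder))  # -> 76 pixels
```

Adding or removing downsampling blocks is the main lever on RF size here, which is why a modular backbone lets the RF be matched to the apparent size of the animals in a given dataset.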
Discussion
SLEAP advances the state of the art in single-animal and multi-animal pose estimation and provides a flexible, performant open-source framework. Its modular design makes it easy to troubleshoot and adapt. The modular UNet architecture allows specialized models optimized for specific datasets, in contrast to a 'generalist' approach that might trade accuracy for broader applicability. SLEAP's high throughput and low latency open up new possibilities for real-time closed-loop behavioral experiments. The release of the datasets and models associated with this paper will contribute to further improvement and development of pose-tracking techniques.
Conclusion
SLEAP is a powerful, versatile, and open-source deep learning system for multi-animal pose tracking. Its speed, accuracy, and modularity make it a valuable tool for studying animal behavior. Future work could focus on improving temporal information integration for consistent tracking, multi-camera view alignment for 3D pose tracking, and self-supervised learning to enhance sample efficiency and generalizability.
Limitations
While SLEAP demonstrates high performance, certain limitations exist. The accuracy of identity tracking can be affected by the visual distinctiveness of the animals. The real-time performance depends on hardware and software optimizations. Further work is needed to enhance temporal information integration and achieve robust tracking in highly cluttered environments.