Introduction
Precise and efficient tracking of multiple animals is crucial for understanding behavior in contexts ranging from neuroscience to ecology, yet existing methods often struggle with occlusions, visually similar animals, and complex interactions. This work addresses these limitations by extending DeepLabCut, a widely used markerless pose-estimation tool, to track multiple animals simultaneously. The goal is a robust, versatile system that remains accurate even when animals look alike or interact frequently. Such a tool enables the analysis of complex social interactions, group dynamics, and the influence of environmental factors on behavior, and supports deeper insight into the underlying neural mechanisms. This paper thus presents a significant advance in computational ethology, giving researchers a powerful tool for analyzing complex animal behavior.
Literature Review
The authors review existing methods for animal pose estimation and tracking, highlighting their limitations in handling multiple animals, particularly under occlusion, similar appearance, and complex interactions. They cite deep-learning pose-estimation systems such as OpenPose and DeepCut, which are designed for single individuals or for humans and lack the robustness needed for multi-animal tracking. They also note that existing animal-tracking software such as idtracker.ai, while efficient, does not perform pose estimation. The review thus sets the stage for the novel contributions of the current work, emphasizing the need for a system that combines accurate pose estimation with robust multi-animal tracking.
Methodology
The methodology involves several key components:
1. **Dataset Creation:** Four benchmark datasets were created, each presenting unique challenges: tri-mouse (frequent contact and occlusion), parenting mice (similar-looking pups and adult), marmosets (occlusion, non-stationary behavior, motion blur), and schooling fish (cluttered scenes, frequent occlusion). Each dataset includes manually annotated keypoints on animals across numerous frames and videos.
2. **Multi-task CNN Architecture (DLCRNet):** A novel multi-scale convolutional neural network (CNN) architecture, DLCRNet, was developed for pose estimation. This architecture employs a multi-fusion module and a multi-stage decoder to improve accuracy and robustness. Different CNN backbones (ResNets, EfficientNets) were tested.
3. **Data-driven Skeleton Selection:** A novel data-driven method was developed to automatically determine the optimal skeleton (connections between keypoints) for each animal, eliminating the need for manual user input and accommodating diverse body plans. This method ranks connections based on their discriminability power in separating within-animal and between-animal pairs.
4. **Animal Assembly:** An efficient animal assembly algorithm was developed to group predicted keypoints into individual animals. It combines limb-based grouping using Part Affinity Fields (PAFs) with a data-driven graph selection process.
5. **Tracking:** A two-stage tracking approach was employed: (a) online, local tracking to generate tracklets (fragments of trajectories), using box and ellipse trackers; and (b) global tracklet stitching to combine tracklets into complete trajectories, formulated as a network flow optimization problem to handle occlusions and discontinuities.
6. **Animal Identity Prediction:** Two methods for animal identification were developed: (a) supervised, leveraging visually distinct markers on animals; and (b) unsupervised, using a transformer-based metric learning approach (ReIDTransformer) to learn animal identities directly from image features.
7. **Evaluation Metrics:** The system's performance was evaluated using various metrics, including root mean squared error (RMSE) for keypoint localization, percentage of correct keypoints (PCK), mean average precision (mAP), and multiple object tracking accuracy (MOTA).
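Step 3's data-driven skeleton selection can be illustrated with a minimal numpy sketch: candidate keypoint connections are ranked by how well their length distribution separates within-animal pairs from between-animal pairs. The separability score and the edge names below are illustrative, not the paper's exact criterion.

```python
import numpy as np

def edge_discriminability(within, between):
    """Score how well an edge's length distribution separates
    within-animal pairs from between-animal pairs (higher = better).
    Simple two-sample separation score; DeepLabCut's actual
    criterion differs in detail."""
    within = np.asarray(within, dtype=float)
    between = np.asarray(between, dtype=float)
    pooled = np.sqrt((within.var() + between.var()) / 2.0) + 1e-9
    return abs(within.mean() - between.mean()) / pooled

rng = np.random.default_rng(0)
# Hypothetical annotated pixel distances for two candidate edges.
edges = {
    ("snout", "tailbase"): (rng.normal(50, 2, 200), rng.normal(90, 30, 200)),
    ("earL", "earR"): (rng.normal(10, 1, 200), rng.normal(60, 25, 200)),
}
# Rank edges from most to least discriminative.
ranked = sorted(edges, key=lambda e: edge_discriminability(*edges[e]), reverse=True)
```

Here the tight ear-to-ear distance ranks first, because its within-animal lengths are nearly constant while between-animal lengths vary widely.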
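Step 4's limb-based grouping relies on Part Affinity Fields: a candidate limb between two detected keypoints is scored by how well the field aligns with the segment connecting them. The sketch below implements the standard PAF line-integral score; array shapes and sampling details are illustrative.

```python
import numpy as np

def paf_limb_score(paf, p1, p2, n_samples=10):
    """Average dot product between the Part Affinity Field and the unit
    vector from p1 to p2, sampled along the connecting segment.
    paf has shape (H, W, 2); points are (x, y) in pixels."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    d = p2 - p1
    length = np.linalg.norm(d)
    if length < 1e-9:
        return 0.0
    u = d / length
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = np.round(p1[None, :] + ts[:, None] * d[None, :]).astype(int)
    vecs = paf[pts[:, 1], pts[:, 0]]  # index rows by y, columns by x
    return float((vecs @ u).mean())

# A toy field that everywhere points in +x supports horizontal limbs only.
paf = np.zeros((20, 20, 2))
paf[..., 0] = 1.0
score_aligned = paf_limb_score(paf, (2, 5), (15, 5))  # parallel to field
score_perp = paf_limb_score(paf, (5, 2), (5, 15))     # perpendicular
```

A limb aligned with the field scores 1, a perpendicular one 0, so keypoints are joined along plausible body segments rather than across animals.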
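Step 5's tracklet stitching can be sketched in simplified form: link each tracklet's end to a temporally later tracklet's start when the gap and spatial jump are small, preferring low-cost links. The paper solves this globally as a network-flow optimization; the greedy version and the dictionary layout below are a stand-in for illustration.

```python
import math

def stitch_tracklets(tracklets, max_gap=10, max_dist=50.0):
    """Greedily link tracklet ends to later tracklet starts by cost
    (spatial distance plus frame gap). Each tracklet is a dict with
    'start'/'end' frame indices and 'first'/'last' (x, y) positions."""
    links = []
    for i, a in enumerate(tracklets):
        for j, b in enumerate(tracklets):
            gap = b["start"] - a["end"]
            dist = math.dist(a["last"], b["first"])
            if 0 < gap <= max_gap and dist <= max_dist:
                links.append((dist + gap, i, j))
    nxt, used_src, used_dst = {}, set(), set()
    for _, i, j in sorted(links):  # cheapest links first
        if i not in used_src and j not in used_dst:
            nxt[i] = j
            used_src.add(i)
            used_dst.add(j)
    chains = []
    for h in range(len(tracklets)):
        if h in used_dst:
            continue  # not a chain head
        chain = [h]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        chains.append(chain)
    return chains

tracklets = [
    {"start": 0, "end": 10, "first": (0.0, 0.0), "last": (5.0, 5.0)},
    {"start": 12, "end": 20, "first": (6.0, 6.0), "last": (9.0, 9.0)},
    {"start": 0, "end": 20, "first": (50.0, 50.0), "last": (50.0, 50.0)},
]
chains = stitch_tracklets(tracklets)
```

The first two tracklets, separated by a two-frame occlusion gap, are stitched into one trajectory; the distant third remains its own track.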
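Step 6's unsupervised identification rests on metric learning: each detection is embedded so that images of the same animal lie close together, and identities are assigned by nearest-neighbor matching. The toy sketch below assumes the embeddings are already given (in the paper they come from the ReIDTransformer) and matches by cosine similarity.

```python
import numpy as np

def match_identities(gallery, queries):
    """Assign each query embedding the identity (row index) of the most
    cosine-similar gallery embedding. Embeddings are rows of 2-D arrays."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    return np.argmax(q @ g.T, axis=1)

# Hypothetical 2-D embeddings: one gallery vector per known animal.
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.2, 0.8]])
ids = match_identities(gallery, queries)
```

In practice this identity signal is used to correct tracklet assignments after occlusions, where purely motion-based tracking is most error-prone.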
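The metrics in step 7 have simple closed forms, sketched below for 2-D keypoints. The MOTA helper takes pre-counted error totals; how misses, false positives, and identity switches are counted follows the standard multi-object-tracking definition, not any code from the paper.

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared keypoint localization error, in pixels."""
    err = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=-1))))

def pck(pred, gt, threshold):
    """Percentage of correct keypoints: fraction of predictions within
    `threshold` pixels of the ground truth."""
    dists = np.linalg.norm(
        np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float), axis=-1
    )
    return float(np.mean(dists <= threshold))

def mota(misses, false_positives, id_switches, num_gt):
    """Multiple object tracking accuracy:
    1 - (misses + false positives + identity switches) / ground-truth count."""
    return 1.0 - (misses + false_positives + id_switches) / num_gt

pred = [[10.0, 10.0], [20.0, 22.0]]
gt = [[10.0, 10.0], [20.0, 20.0]]
```

With this toy data, one keypoint is exact and one is off by two pixels, so the RMSE is sqrt(2) and PCK at a one-pixel threshold is 0.5; MOTA approaches 1 as tracking errors vanish.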
Key Findings
The key findings demonstrate significant improvements in multi-animal pose estimation and tracking:
1. **DLCRNet Performance:** The DLCRNet architecture achieved state-of-the-art keypoint detection accuracy across all four benchmark datasets, with median test errors ranging from 2.65 to 5.25 pixels.
2. **Data-driven Skeleton Selection:** The data-driven skeleton selection method significantly improved animal assembly purity and reduced the number of unconnected keypoints compared to baseline methods, with gains of up to 3 percentage points.
3. **Tracking Performance:** The two-stage tracking method, with the ellipse tracker and tracklet stitching, achieved near-perfect MOTA (0.97) and significantly outperformed idtracker.ai.
4. **Identity Prediction:** The supervised animal identity prediction method achieved high accuracy (>0.99) for keypoints near the head, decreasing to 0.95 for distal keypoints. The unsupervised ReIDTransformer approach provided a 10% boost in MOTA performance in the challenging fish dataset.
5. **Social Behavior Analysis:** Application to a long-term (9 h) marmoset study revealed insights into complex social interactions, demonstrating the system's potential for studying behavioral dynamics over extended timescales. The system uncovered correlations between posture and spatial relations: for example, the marmosets tended to face each other when farther apart.
Statistical significance was demonstrated using various tests, including ANOVA and t-tests, across datasets and performance metrics. Comparisons with other state-of-the-art pose-estimation models (HRNet, ResNet-AE) showed improved performance across all datasets, highlighting the advantages of the data-driven skeleton selection, the multi-stage tracking approach, and the incorporation of animal identity information.
Discussion
This research makes several significant contributions to computational ethology. The improved DeepLabCut substantially advances the analysis of complex behavior in scenarios involving multiple animals. The data-driven skeleton selection is particularly important because it removes a significant barrier for users, the need to craft a skeleton by hand, making the tool more accessible and applicable to a wider range of species and body plans. Integrated identity prediction further enhances robustness and allows more comprehensive analysis of social interactions. The open-source code and datasets permit validation and improvement, while the benchmark datasets enable future comparison and competition, driving progress in the field. The improved speed also makes the analysis of very long videos practical; the marmoset study exemplifies this, revealing nuanced insights into social dynamics. The system's limitations, discussed below, highlight areas for future development.
Conclusion
This paper introduces a substantial upgrade to DeepLabCut, making it a highly efficient and accurate tool for multi-animal pose estimation and tracking. The data-driven skeleton selection, improved tracking algorithm, and integrated identity prediction significantly improve accuracy and robustness. The open-source datasets and code will further advance the field, enabling researchers to investigate complex animal behaviors with greater precision and efficiency. Future work could focus on further refining the unsupervised animal identification method and on handling even more challenging scenarios, such as extremely dense animal aggregations.
Limitations
While the system exhibits high performance across various datasets, certain limitations remain. The accuracy of the unsupervised identity prediction may be affected by variations in animal appearance and lighting conditions. Extremely dense animal aggregations may still pose challenges for accurate tracking due to severe occlusion. The computational cost, although significantly reduced, can still be substantial for very large datasets or long videos, and performance may degrade on low-resolution or low-contrast images. Finally, several hyperparameters remain that users may wish to tune for their specific experiments.