
Veterinary Science

Three-dimensional surface motion capture of multiple freely moving pigs using MAMMAL

L. An, J. Ren, et al.

Explore the fascinating world of animal interactions with MAMMAL, a groundbreaking system developed by Liang An, Jilong Ren, Tao Yu, Tang Hai, Yichang Jia, and Yebin Liu. It captures 3D surface motions of pigs and other large mammals, even amidst occlusions, paving the way for advanced analysis of their locomotion and social behaviors.

Introduction
The study addresses the need for accurate, noninvasive 3D motion capture of large mammals, particularly pigs, to quantify locomotion, postures, animal–scene interactions, and social behaviors important for animal welfare, agriculture, and neuroscience. Pigs’ behaviors (e.g., tail motion, limb dynamics) inform emotional states and health, and pigs are increasingly used as translational models for neurological and cognitive disorders. Existing video-based behavior recognition and 2D keypoint methods (e.g., SLEAP, DeepLabCut) struggle under occlusions and cannot capture dense surface geometry critical for understanding social interactions. Multi-view triangulation approaches also fail with multiple animals due to ambiguous multi-view association and frequent occlusions. Regression-based 3D methods require large 3D-labeled datasets. The research question is whether an articulated mesh-based, multi-animal alignment system can robustly reconstruct full 3D surface motion from multi-view videos, resolve occlusions, and enable rich behavioral analyses without large-scale 3D training data. The purpose is to develop and validate MAMMAL, a mesh-based pipeline for multi-animal 3D surface motion capture and analysis in naturalistic settings.
Literature Review
Prior work in animal pose estimation largely focuses on sparse 2D keypoints (SLEAP; DeepLabCut), which are efficient but prone to identity swaps and failure under occlusion. Multi-view triangulation pipelines (e.g., Anipose; OpenMonkeyStudio; DeepLabCut-3D) combine 2D keypoints to 3D but face challenges associating unordered multi-view detections and reconstructing poses when keypoints are occluded. Regression/voxel-based methods (e.g., DANNCE, BEV, VoxelPose for humans) can mitigate occlusions but require large, annotated 3D datasets and are sensitive to camera configurations. Articulated mesh models have proven powerful for humans (SMPL) and various animals (SMAL and subsequent species-specific meshes), enabling dense surface representation and contact reasoning, but have been less explored for multi-animal, large-mammal social capture. This work builds on these strands by combining multi-view 2D detection with articulated mesh fitting and cross-view matching to recover dense 3D surfaces for multiple animals with fewer training demands.
Methodology
System overview: MAMMAL reconstructs the 3D surface motion of multiple freely moving animals from synchronized multi-view RGB videos in three stages: (1) Detection, (2) Detection Matching, and (3) Mesh Fitting. It uses an articulated, species-specific mesh (the PIG model) and is demonstrated on pigs, mice, and Beagle dogs.

Hardware and data: For pigs, ten HIKVISION cameras around a 2.0 m × 1.8 m pen recorded at 1920×1080 and 25 FPS. Two datasets were created: BamaPig2D (3,340 images, 11,504 pig instances with boxes, silhouettes, and 19 keypoints) for training the 2D detectors, and BamaPig3D (a 70 s clip with 280 instances labeled with 5,320 3D keypoints, 2,437 2D keypoints, and 2,437 silhouettes) for evaluation. Additional sequences with varying camera numbers were used for qualitative behavior analysis. Calibration used precomputed intrinsics (OpenCV) and extrinsics estimated via PnP with manually labeled scene points.

Articulated mesh models: The PIG model has 11,239 vertices and 62 joints (3-DOF rotations), with 24 critical joints for efficient body reconstruction and 19 defined 3D keypoints regressed from mesh vertices via a sparse regressor. Global scale and 6-DOF pose (rotation, translation) are also optimized. Analogous meshes were prepared for the mouse (140 joints, 14,522 vertices, 22 keypoints) and the Beagle dog (39 joints, 4,653 vertices, 29 keypoints).

Stage 1 – Detection: A top-down pipeline. PointRend segments pig instances and provides bounding boxes; cropped instance images are resized to 384×384, and HRNet estimates 2D keypoints with visibility/confidence scores. Trained on BamaPig2D, PointRend achieved AP 0.869 (boxes) and 0.868 (silhouettes), and HRNet achieved AP 0.633. Resolution normalization improves robustness across viewpoints.

Stage 2 – Detection Matching: At T=0, multi-view 2D instances are nodes in a multipartite graph with edges only across different views; edge weights are average epipolar distances. A modified maximal-clique enumeration partitions the cues into N* groups (one per animal identity) plus a noise group, thereby associating cross-view detections; complexity is polynomial in the number of views for a fixed animal count. For T>0, the 3D keypoints from the previous frame are projected into each view, and per-view association uses the Kuhn–Munkres (Hungarian) algorithm on keypoint distances, yielding temporally consistent identity tracking and per-view matched 2D cues.

Stage 3 – Mesh Fitting: Initialization uses an anchor pose retrieved from a library of plausible poses by minimizing a combined keypoint and silhouette energy; at subsequent frames, the previous pose initializes the optimization. The full objective combines E2D (keypoint reprojection), Esil (differentiable silhouette alignment using signed distance fields and per-vertex visibility with multi-animal rendering), Etemp (temporal smoothness against the T−1 keypoints), Ereg (rotation regularization), Eanchor (anchor prior at T=0), and Efloor (an above-floor constraint). Typical weights: w2D=1, wsil=5e−5, wreg=0.01, wfloor=100; wtemp=0 at T=0 then 1; wanchor=0.01 at T=0 then 0. Levenberg–Marquardt optimization uses fewer than 60 iterations for initialization and roughly 5–15 iterations for tracking; the silhouette term is disabled during the first few iterations for stability.
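The extrinsic-calibration step can be illustrated with a minimal sketch using OpenCV's solvePnP. The scene points, pixel coordinates, and intrinsics below are hypothetical placeholders; the actual calibration targets and values are not given in the summary above.

```python
import numpy as np
import cv2

# Hypothetical 3D scene points (meters) and their manually labeled 2D pixel
# locations in one camera view; real values would come from the pen geometry.
object_points = np.array([[0.0, 0.0, 0.0],
                          [2.0, 0.0, 0.0],
                          [2.0, 1.8, 0.0],
                          [0.0, 1.8, 0.0],
                          [0.0, 0.0, 0.5],
                          [2.0, 1.8, 0.5]], dtype=np.float64)
image_points = np.array([[312.0, 845.0],
                         [1610.0, 823.0],
                         [1532.0, 214.0],
                         [402.0, 231.0],
                         [318.0, 610.0],
                         [1529.0, 120.0]], dtype=np.float64)

# Precomputed intrinsics (placeholder focal lengths and principal point) and
# distortion coefficients.
K = np.array([[1500.0, 0.0, 960.0],
              [0.0, 1500.0, 540.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# Recover the camera's extrinsic pose (rotation + translation) from 2D-3D pairs.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
print("rotation:\n", R, "\ntranslation:\n", tvec.ravel())
```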
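The mapping from mesh vertices to the 19 defined keypoints amounts to a sparse linear regression over the 11,239 PIG-model vertices. The sketch below uses randomly generated placeholder weights and vertices; the real regressor weights belong to the PIG model and are not reproduced here.

```python
import numpy as np
from scipy import sparse

N_VERTS, N_KPTS = 11239, 19  # PIG model sizes from the paper

# Hypothetical sparse regressor: each keypoint is a convex combination of a few
# nearby mesh vertices (each row sums to 1).
rows, cols, vals = [], [], []
rng = np.random.default_rng(0)
for k in range(N_KPTS):
    support = rng.choice(N_VERTS, size=4, replace=False)  # 4 supporting vertices
    w = rng.random(4)
    w /= w.sum()
    rows.extend([k] * 4)
    cols.extend(support)
    vals.extend(w)
J_regressor = sparse.csr_matrix((vals, (rows, cols)), shape=(N_KPTS, N_VERTS))

# Posed mesh vertices (random placeholders standing in for the fitted mesh).
vertices = rng.random((N_VERTS, 3))

# 3D keypoints are a sparse linear map of the vertices.
keypoints_3d = J_regressor @ vertices  # shape (19, 3)
print(keypoints_3d.shape)
```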
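For the T>0 tracking step, a minimal sketch of per-view association is shown below, assuming a simple pinhole projection and SciPy's linear_sum_assignment as the Kuhn–Munkres solver. The cost (mean keypoint pixel distance) follows the description above, while the helper names and array layouts are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def project(points_3d, K, R, t):
    """Pinhole projection of Nx3 world points into one camera view."""
    cam = points_3d @ R.T + t        # world -> camera coordinates
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]    # perspective divide -> Nx2 pixels

def associate_view(prev_keypoints_3d, det_keypoints_2d, K, R, t):
    """Match previous-frame animals to current 2D detections in one view.

    prev_keypoints_3d: (A, J, 3) last-frame 3D keypoints for A animals
    det_keypoints_2d:  (D, J, 2) current-frame 2D keypoints for D detections
    Returns a list of (animal_index, detection_index) pairs.
    """
    A, D = len(prev_keypoints_3d), len(det_keypoints_2d)
    cost = np.zeros((A, D))
    for a in range(A):
        proj = project(prev_keypoints_3d[a], K, R, t)  # (J, 2)
        for d in range(D):
            # mean pixel distance between projected and detected keypoints
            cost[a, d] = np.linalg.norm(proj - det_keypoints_2d[d], axis=1).mean()
    rows, cols = linear_sum_assignment(cost)           # Kuhn–Munkres assignment
    return list(zip(rows, cols))
```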
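The weighting scheme of the mesh-fitting objective can be summarized schematically as follows. The individual energy functions are placeholders; only the weights and their switching at T=0 follow the description above (the actual optimization runs Levenberg–Marquardt over the corresponding residuals).

```python
# Schematic weighted objective for the mesh-fitting stage. The energy terms are
# placeholder callables; only the weighting scheme mirrors the reported values.
def total_energy(params, frame_idx,
                 e_2d, e_sil, e_temp, e_reg, e_anchor, e_floor):
    is_first = (frame_idx == 0)
    w_2d     = 1.0
    w_sil    = 5e-5
    w_reg    = 0.01
    w_floor  = 100.0
    w_temp   = 0.0 if is_first else 1.0    # no previous frame at T=0
    w_anchor = 0.01 if is_first else 0.0   # anchor-pose prior only at T=0

    return (w_2d     * e_2d(params)
          + w_sil    * e_sil(params)
          + w_temp   * e_temp(params)
          + w_reg    * e_reg(params)
          + w_anchor * e_anchor(params)
          + w_floor  * e_floor(params))
```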
Behavior analyses enabled: (i) Animal–scene interaction recognition via 3D distances to scene priors (e.g., drinking if the nose is within 0.12 m of the water tap; eating if the nose lies inside a predefined feeding volume). (ii) Posture discovery: 20,819 poses from 44 clips (4 pigs) are normalized and represented by 178-D features (positions, velocities, tail/center heights, pitch/roll, joint rotations); PCA reduces these to 16 dimensions (97.3% of variance), followed by t-SNE embedding, density estimation, and watershed segmentation to identify 8 posture clusters. (iii) Social behavior recognition: dyadic events (approach, leave; head–head/body/limb/tail contact; mount) are defined using surface-to-surface minimum distances from the active animal's head to the partner's mesh (downsampled vertices), top-view convex-hull IoU, and body-pitch thresholds over a temporal window.

Baselines and comparisons: An MLE-based triangulation baseline was implemented for multi-view 3D keypoint reconstruction; SLEAP was trained for 2D performance comparison; DANNCE-T (temporal) was used for the mouse and VoxelPose for the dogs. Experiments vary the camera count (10/6/5/4/3) while using identical 2D inputs and matching where applicable.

Tail motion module: Adds TailMid and TailTip keypoints, extends HRNet training, uses the full 62-joint model (including the tail DOFs) and an additional E3D term aligning to triangulated keypoints; the tail angle and its power spectral density (PSD) are computed for behavior classification.

Performance: Detection ~50 ms/frame (RTX 2080Ti), matching ~0.15 ms/frame (CPU), mesh fitting ~1,200 ms/frame with GPU (~2,000 ms on CPU).
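The animal–scene interaction rules reduce to simple 3D distance and containment tests. In the sketch below, the 0.12 m drinking threshold is taken from the text, while the eating rule is approximated as an axis-aligned box (the paper only states a predefined feeding volume) and all coordinates are hypothetical.

```python
import numpy as np

DRINK_DIST_M = 0.12  # nose-to-tap distance threshold from the paper

def is_drinking(nose_xyz, tap_xyz, threshold=DRINK_DIST_M):
    """Flag drinking when the reconstructed nose is within `threshold` of the tap."""
    return np.linalg.norm(np.asarray(nose_xyz) - np.asarray(tap_xyz)) < threshold

def is_eating(nose_xyz, trough_min, trough_max):
    """Flag eating when the nose lies inside an axis-aligned feeding volume
    (a simplifying assumption; only a predefined feeding zone is described)."""
    nose = np.asarray(nose_xyz)
    return bool(np.all(nose >= trough_min) and np.all(nose <= trough_max))

# Example: a nose about 9 cm from the tap counts as drinking.
print(is_drinking([0.50, 0.30, 0.25], [0.55, 0.30, 0.33]))  # -> True
```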
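A minimal sketch of the posture-discovery pipeline (PCA to 16 dimensions, t-SNE embedding, density estimation, watershed) is given below using scikit-learn, SciPy, and scikit-image. The grid size, perplexity, and peak-detection window are assumptions; only the overall pipeline and the 178-D/16-D dimensionalities follow the description above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.stats import gaussian_kde
from scipy import ndimage
from skimage.segmentation import watershed

def discover_postures(features, n_pcs=16, grid=200, perplexity=30, seed=0):
    """Cluster per-frame pose features into posture groups.

    features: (N, 178) pose descriptors (dimensionality as reported).
    Pipeline: PCA -> t-SNE -> kernel density on a 2D grid -> watershed on the
    inverted density, so that each density peak becomes one posture cluster.
    """
    pcs = PCA(n_components=n_pcs).fit_transform(features)
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=seed).fit_transform(pcs)

    # Estimate point density over a regular grid covering the embedding.
    kde = gaussian_kde(emb.T)
    xs = np.linspace(emb[:, 0].min(), emb[:, 0].max(), grid)
    ys = np.linspace(emb[:, 1].min(), emb[:, 1].max(), grid)
    gx, gy = np.meshgrid(xs, ys)
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(grid, grid)

    # Label local density maxima and flood the inverted density from them.
    peaks, _ = ndimage.label(density == ndimage.maximum_filter(density, size=15))
    labels_grid = watershed(-density, markers=peaks)

    # Map every embedded pose back to the watershed region it falls in.
    ix = np.clip(np.searchsorted(xs, emb[:, 0]), 0, grid - 1)
    iy = np.clip(np.searchsorted(ys, emb[:, 1]), 0, grid - 1)
    return labels_grid[iy, ix], emb, density
```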
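The tail-motion classification rests on the power spectral density of a per-frame tail-angle trace. Below is a small sketch using SciPy's Welch estimator at the 25 FPS frame rate; the synthetic signal and the nperseg choice are illustrative only.

```python
import numpy as np
from scipy.signal import welch

FS = 25.0  # video frame rate (Hz), matching the 25 FPS recordings

def tail_psd(tail_angle_deg, fs=FS):
    """PSD of a tail-angle trace: a loosely wagging tail shows a clear peak at a
    few Hz, whereas a passively hanging tail concentrates power near 0 Hz."""
    freqs, psd = welch(tail_angle_deg, fs=fs,
                       nperseg=min(256, len(tail_angle_deg)))
    peak = freqs[np.argmax(psd[1:]) + 1]  # skip the DC bin when locating the peak
    return freqs, psd, peak

# Example: a synthetic 1.6 Hz wag sampled at 25 FPS for 10 s.
t = np.arange(0, 10, 1 / FS)
angle = 30 * np.sin(2 * np.pi * 1.6 * t) \
        + np.random.default_rng(0).normal(0, 1, t.size)
_, _, peak_hz = tail_psd(angle)
print(f"dominant tail frequency ≈ {peak_hz:.2f} Hz")
```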
Key Findings
- Accurate 3D reconstruction under occlusions: On BamaPig3D, mean per-joint position error (MPJPE) was 3.44 ± 3.99 cm (mean ± SD), with average per-keypoint error below 5.2 cm; detailed parts (eyes, ears, knees, elbows) below 3 cm; terminal parts (nose, tail root, paws/feet) below 5 cm.
- Invisible keypoints: Errors were comparable for invisible and visible keypoints for most parts; mean error was below 7 cm (~10% of body length) for invisible keypoints, except for some distal terminals.
- Surface accuracy: IoU between rendered meshes and labeled silhouettes reached 0.80 (0.77 with keypoints only; adding silhouettes improves it to 0.80).
- Generalization across pig sizes: Error remained low across body sizes, e.g., 2.30 ± 1.89 cm (training pig), 3.28 ± 2.15 cm (moderate), 4.27 ± 3.43 cm (very fat), and 3.78 ± 2.71 cm (juvenile).
- Robustness with fewer cameras vs triangulation: With 10 views, MAMMAL achieved 3.44 ± 3.99 cm vs 14.17 ± 32.02 cm for triangulation; with 5 views, 4.08 ± 4.45 cm vs 24.19 ± 39.73 cm; with 3 views, 5.19 ± 6.10 cm vs 41.81 ± 43.23 cm. MAMMAL with 3 views outperformed triangulation with 10 views, producing fewer missing/false poses, especially for occluded legs.
- Behavior analytics: Automated detection of drinking/eating via 3D distances; discovery of 8 posture clusters from 20,819 poses; dyadic social behaviors recognized and part-level contacts distinguished (head–head/body/limb/tail; approach/leave; mount) via surface distance fields.
- Tail behavior vs social hierarchy: Dominant pigs fed longer (p=0.0004, n=4) and displayed a higher proportion of loosely wagging tails than subordinates (p=0.0367, n=4). Tail motion reconstruction error was ~6.05 cm; PSD-based classification differentiated loosely wagging (peak ~1.625 Hz, PSD 10.24 V^2 Hz^-1) from passive hanging (peak ~0.125 Hz, PSD 0.45 V^2 Hz^-1).
- Mouse extremities (single animal): With 6 views, MAMMAL achieved 2.43 ± 1.69 mm vs 4.78 ± 7.15 mm for DANNCE-T on 8 extremities; excluding the tail, 2.20 ± 1.25 mm vs 2.71 ± 3.59 mm. With 3 views, DANNCE-T degraded more than MAMMAL.
- Beagle dog social capture: With 10 views, MAMMAL achieved 5.02 ± 3.22 cm vs 5.93 ± 7.82 cm for VoxelPose. With 6/4 views, MAMMAL achieved 5.44 ± 3.67 cm and 5.84 ± 4.06 cm, while VoxelPose increased to 7.17 ± 7.02 cm and 14.58 ± 19.79 cm, showing MAMMAL's stability with sparse views.
- 2D detection: MAMMAL's detection module outperformed SLEAP (both top-down and bottom-up), especially on side views, owing to resolution normalization.
Discussion
The findings show that articulated mesh alignment enables robust multi-animal 3D motion capture in naturalistic, occlusion-prone settings, addressing the central challenges of multi-view association and reconstructing occluded keypoints. By leveraging silhouettes and dense surface priors from a species-specific mesh, MAMMAL reconstructs fine-grained body and terminal motions and even invisible keypoints, outperforming triangulation especially with fewer cameras. This dense surface representation further enables analyses beyond sparse keypoints, including part-level contact reasoning, posture discovery, animal–scene interaction detection, and social behavior classification. The method generalizes across different pig sizes and to other species (mouse, dog) with limited additional labeling and no need for large-scale 3D datasets, suggesting broad applicability in agriculture (animal welfare, health monitoring), neuroscience (behavioral phenotyping, disease models), and other life-science domains. Compared to voxel/regression methods, MAMMAL achieves competitive or superior accuracy while being less sensitive to camera count, making it practical for varied laboratory or farm configurations.
Conclusion
This work introduces MAMMAL, a three-stage mesh-based pipeline for multi-animal 3D surface motion capture that resolves severe occlusions and recovers invisible keypoints. It achieves accurate pose and surface reconstruction with fewer cameras than traditional triangulation, supports rich behavior analytics (postures, social interactions, animal–scene interactions, tail dynamics), and generalizes to mice and Beagle dogs. The tool is open-sourced with datasets and documentation, enabling adoption by the community. Future work includes integrating 2D pose, segmentation, identification, and behavior classification into an end-to-end system; accelerating optimization for real-time use; and developing a parametric linear blend shape model for pigs analogous to SMPL/SMAL to capture inter-individual shape variation and further improve robustness.
Limitations
Current limitations include non–real-time performance due to iterative optimization; reliance on species-specific mesh preparation and some manual annotations (e.g., for initial datasets and pose library); and the need for multi-view video (though MAMMAL remains robust with fewer cameras). The approach would benefit from an integrated, end-to-end pipeline and acceleration techniques to broaden real-time applications.