
B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors
A. I. Hsu and E. A. Yttri
Alexander I. Hsu and Eric A. Yttri introduce B-SOiD, an unsupervised algorithm that identifies behaviors from spatiotemporal pose patterns and rapidly predicts them in new data. The approach removes human annotation bias, greatly accelerates processing, and enables the study of pain, OCD, and movement disorders across models.
Introduction
The study addresses how to objectively identify and quantify naturalistic animal behaviors from pose data without human bias. Traditional top-down approaches and supervised classifiers depend on human annotations that are biased, variable, and low in temporal resolution, and they often fail to generalize across contexts. Existing unsupervised methods (e.g., MotionMapper) work well for invertebrates with orthogonal limb movements and controlled backgrounds but have limited applicability to vertebrates and general lab settings. Depth-imaging systems like MoSeq provide unsupervised segmentation but require specialized hardware and do not jointly capture both action identity and kinematics at high temporal resolution. The authors hypothesize that extracting spatiotemporal relationships among tracked body parts and coupling unsupervised clustering with a supervised classifier will yield unbiased, generalizable behavioral segmentation and fast prediction with millisecond-scale temporal precision suitable for neurophysiology.
Literature Review
- Top-down, human-defined behavioral scoring and supervised classifiers reach human-level accuracy but inherit biases, show inter-rater variability, low temporal resolution, and limited flexibility.
- Unsupervised approaches like MotionMapper use spectral representations and nonlinear embedding to find stereotyped behaviors, successful largely in flies and soft-bodied invertebrates with orthogonal limb movement and uniform backgrounds, limiting vertebrate applications.
- MoSeq introduced unsupervised segmentation in rodents using depth cameras and hierarchical clustering but raises challenges in capturing kinematics, temporal resolution, and generalization across sessions/labs.
- Advances in pose estimation (DeepLabCut, SLEAP, OpenPose) provide accurate body-part locations, but pose alone has low behavioral interpretability and user-defined rubrics may not generalize across animals/cameras.
- The field needs methods that (i) identify behaviors and kinematics, (ii) achieve high temporal resolution for alignment with neural data, and (iii) generalize across datasets and labs.
Methodology
Overview: B-SOiD is an open-source pipeline that transforms pose-estimation data into behavior labels and kinematic readouts. It extracts spatiotemporal pose relationships, reduces their dimensionality, clusters the embedding without supervision, and trains a supervised classifier for fast prediction. A frameshift paradigm recovers behavior timing at the native camera frame rate.
Data acquisition:
- Subjects: Six adult C57BL/6 mice for open-field (3 females). Additional demonstrations include rat reach-to-grasp and human data (OpenPose) and Drosophila (SLEAP).
- Recording: Open-field sessions of 1 hour in a clear arena (15×12 in). Bottom-up camera 1280×720 at 60 Hz (clustering/ML) or 200 Hz (frameshift tests); simultaneous top-down view in a subset for comparisons.
- Electrophysiology: One session from layer 5 forelimb motor cortex (35 units recorded with 64-channel silicon probe). Spikes sampled at 30 kHz, sorted with Kilosort2. Activity aligned to behavior onsets; z-scored in peri-event windows.
Pose estimation:
- Six keypoints for mouse (snout, forepaws, hindpaws, tail-base) tracked using DeepLabCut. DLC model trained on 7,881 frames (21 animals, 69 sessions), 1.03M iterations, loss 0.002. Weights are open-sourced.
Feature extraction (Algorithm 1):
- To improve SNR, all inputs are downsampled to 10 fps using non-overlapping 100 ms windows.
- Compute per-frame spatiotemporal features from pose: displacement (D, speed) of each of the six points, angular change (θ) between each of the 15 point pairs, and distance (L) between each of the 15 point pairs (36 features in total). Displacement and angular change are summed over each 100 ms window; distances are averaged.
- Features are smoothed with a ~60 ms sliding window (±30 ms) to mitigate pose jitter.
- Low-confidence pose points (per-frame likelihoods) are thresholded per session at the elbow of the bimodal distribution; values below threshold are replaced with last high-confidence position (yielding zero displacement during occlusions/low confidence).
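The feature computation and 100 ms binning can be summarized in a short Python sketch. The array layout, function name, and constants are illustrative, and the ±30 ms smoothing, angle-wraparound handling, and low-confidence imputation are omitted for brevity, so this is a minimal sketch rather than the released implementation.

```python
# Minimal sketch of B-SOiD-style feature extraction from 2D pose data.
# Assumes `pose` has shape (n_frames, n_points, 2) at the native camera
# frame rate `fps`; names are illustrative, not the authors' code.
import numpy as np
from itertools import combinations

def extract_features(pose, fps, win_ms=100):
    n_frames, n_points, _ = pose.shape
    pairs = list(combinations(range(n_points), 2))

    # Per-frame features: displacement of each point, plus length and angular
    # change of each point pair (6 + 15 + 15 = 36 features for six points).
    disp = np.linalg.norm(np.diff(pose, axis=0), axis=2)            # (n_frames-1, n_points)
    vec = np.stack([pose[:, j] - pose[:, i] for i, j in pairs], 1)  # (n_frames, n_pairs, 2)
    length = np.linalg.norm(vec, axis=2)                            # (n_frames, n_pairs)
    ang = np.degrees(np.arctan2(vec[..., 1], vec[..., 0]))
    ang_change = np.abs(np.diff(ang, axis=0))                       # (n_frames-1, n_pairs)

    # Bin into non-overlapping ~100 ms windows: sum displacement and angular
    # change, average the inter-point distances (improves SNR before clustering).
    step = max(int(round(fps * win_ms / 1000)), 1)
    n_bins = (n_frames - 1) // step

    def binned(x, reduce):
        return np.array([reduce(x[k * step:(k + 1) * step], axis=0) for k in range(n_bins)])

    feats = np.hstack([
        binned(disp, np.sum),
        binned(ang_change, np.sum),
        binned(length[1:], np.mean),
    ])
    return feats  # (n_bins, 36) for six tracked points
```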
Dimensionality reduction:
- UMAP (umap-learn v0.4.x) with parameters: n_neighbors=60, min_dist=0.0, Euclidean metric. PCA (scikit-learn v0.23.x) is used to set n_components to the number of dimensions needed to explain ≥70% of the high-dimensional feature variance; in the six-mouse dataset this resulted in 11 UMAP dimensions.
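Continuing from the previous sketch, the embedding dimensionality can be chosen from the PCA variance curve and passed to UMAP with the reported parameters. The z-scoring step and the exact 0.70 threshold in code are assumptions following the description above.

```python
# Sketch: pick n_components from PCA explained variance, then embed with UMAP.
# `feats` is the binned feature matrix from the previous sketch.
import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

feats_z = StandardScaler().fit_transform(feats)   # z-scoring is an assumption

# Number of principal components needed to explain >= 70% of variance
pca = PCA().fit(feats_z)
n_dims = int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.70) + 1)

embedding = umap.UMAP(
    n_neighbors=60,       # reported parameter
    min_dist=0.0,         # reported parameter
    n_components=n_dims,
    metric="euclidean",
).fit_transform(feats_z)
```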
Unsupervised clustering:
- HDBSCAN (v0.8.x) applied to UMAP embeddings to identify dense clusters separated by sparse regions. min_cluster_size is user-adjustable. This produced 11 behavior classes in the six-mouse dataset.
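A minimal clustering sketch on the embedding from the previous step; the min_cluster_size heuristic shown here is illustrative only, since the paper leaves this parameter user-adjustable.

```python
# Sketch: density-based clustering of the UMAP embedding with HDBSCAN.
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=max(int(0.01 * embedding.shape[0]), 10),  # illustrative heuristic
)
assignments = clusterer.fit_predict(embedding)   # -1 marks sparse "noise" points
```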
Supervised classifier:
- A RandomForestClassifier (scikit-learn v0.23.x, default parameters) is trained to map the original high-dimensional features (D, θ, L) to the cluster labels (multi-class). This enables fast, generalizable prediction without re-embedding or reclustering.
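Continuing from the previous sketches, training the classifier on the original features against the HDBSCAN labels and checking held-out agreement might look like the following; the 80/20 split mirrors the evaluation described in the findings, and dropping unassigned (noise) points is an assumption.

```python
# Sketch: map high-dimensional features to cluster labels with a default random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

keep = assignments >= 0                       # drop points HDBSCAN left unassigned
X_train, X_test, y_train, y_test = train_test_split(
    feats_z[keep], assignments[keep], test_size=0.2, random_state=0)

clf = RandomForestClassifier()                # default parameters, as reported
clf.fit(X_train, y_train)
print("held-out agreement with HDBSCAN labels:", clf.score(X_test, y_test))
```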
Frameshift prediction paradigm (Algorithm 2):
- To achieve native-frame temporal resolution while keeping the SNR of 100 ms windows, B-SOiD predicts on F = native_fps/10 copies of the 10 fps downsampled stream, each offset by one original frame. Predictions from the F offsets are interleaved to reconstruct a high-resolution label sequence at the original sampling rate (e.g., 200 fps reconstructed from 20 shifted 10 fps passes), as sketched after this list.
- Very brief bouts are culled to enforce continuity (e.g., <3 samples at high fps, <50 ms).
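A minimal sketch of the interleaving logic, assuming a hypothetical helper `predict_bins(pose, fps)` that runs feature extraction plus the trained classifier on one offset stream and returns one label per 10 fps bin; the bout-culling step is omitted.

```python
# Sketch of the frameshift paradigm: predict on F one-frame-offset copies of
# the downsampled stream, then interleave back to the native frame rate.
import numpy as np

def frameshift_labels(pose, fps, predict_bins):
    F = fps // 10                                  # number of one-frame offsets
    per_offset = [predict_bins(pose[s:], fps) for s in range(F)]
    n = min(len(p) for p in per_offset)            # truncate to a common length
    labels = np.empty(n * F, dtype=int)
    for s, p in enumerate(per_offset):
        labels[s::F] = p[:n]                       # interleave to native rate
    return labels
```

At 60 Hz this interleaves six offset streams; at 200 Hz, twenty.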
Top-down vs bottom-up comparison:
- Built separate models using six points per view (for the top-down view: top of the snout, approximate shoulders, approximate hips, and tail-base). Compared segmentation overlap and ethograms across simultaneously recorded views.
Benchmarking against MotionMapper:
- Using identical pose input (DLC_2_MotionMapper), extracted up to 20 bouts/group (300–600 ms). Constructed motion energy (ME) images (absolute frame differences after animal-centric alignment). Computed pairwise mean squared error (MSE) within- and across-groups to quantify in-group consistency and out-group separability. Compared cumulative distributions and processing speed.
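A sketch of the motion-energy comparison under stated assumptions: `bouts` is a hypothetical dict mapping each behavior group to its extracted video clips, and animal-centric alignment is assumed to have been done upstream.

```python
# Sketch: motion-energy (ME) images per bout and pairwise MSE within/across groups.
import numpy as np
from itertools import combinations

def motion_energy(clip):
    """clip: (n_frames, H, W) aligned grayscale frames -> mean absolute frame difference."""
    return np.abs(np.diff(clip.astype(float), axis=0)).mean(axis=0)

def pairwise_mse(images_a, images_b=None):
    if images_b is None:                        # within-group: all unordered pairs
        pairs = combinations(images_a, 2)
    else:                                       # across-group: all ordered pairs
        pairs = ((a, b) for a in images_a for b in images_b)
    return [np.mean((a - b) ** 2) for a, b in pairs]

def compare_groups(bouts):
    """bouts: dict mapping behavior-group label -> list of aligned clips."""
    me = {g: [motion_energy(c) for c in clips] for g, clips in bouts.items()}
    within = {g: pairwise_mse(imgs) for g, imgs in me.items()}
    across = {(g, h): pairwise_mse(me[g], me[h]) for g, h in combinations(me, 2)}
    return within, across
```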
Kinematic analyses:
- Grooming and stride kinematics: Identified individual strokes via MATLAB findpeaks() on right forepaw speed; troughs defined stroke boundaries. Stroke distance computed as Euclidean displacement; speeds quantified per stroke. Applied to basal ganglia lesion experiment (A2A-cre with/without AAV2-flex-taCasp3-TEVP; N=4/group).
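A Python analogue of the stroke segmentation described above (the paper uses MATLAB findpeaks); the function name, boundary handling, and output fields are illustrative.

```python
# Sketch: segment grooming strokes from forepaw speed and measure their kinematics.
import numpy as np
from scipy.signal import find_peaks

def stroke_kinematics(paw_xy, fps):
    """paw_xy: (n_frames, 2) right-forepaw position within grooming bouts."""
    step = np.diff(paw_xy, axis=0)
    speed = np.linalg.norm(step, axis=1) * fps        # per-frame speed
    peaks, _ = find_peaks(speed)                       # individual strokes
    troughs, _ = find_peaks(-speed)                    # candidate stroke boundaries
    strokes = []
    for p in peaks:
        lo = troughs[troughs < p][-1] if np.any(troughs < p) else 0
        hi = troughs[troughs > p][0] if np.any(troughs > p) else len(speed) - 1
        strokes.append({
            "distance": np.linalg.norm(paw_xy[hi] - paw_xy[lo]),  # Euclidean displacement
            "peak_speed": speed[p],
        })
    return strokes
```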
Statistics and visualization:
- Non-parametric two-tailed Kolmogorov–Smirnov tests for behavioral distributions. For neural data, z-scoring within peri-event windows; differences assessed by two-tailed t-tests across behaviors. Boxplots use standard definitions. Coherence metrics quantify frameshift consistency across downsampled resolutions relative to the 200 fps baseline.
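A minimal sketch of the two comparisons named above, with placeholder data; in practice the KS test would compare, e.g., stroke-distance distributions between lesioned and control animals, and peri-event z-scoring would typically be applied per unit.

```python
# Sketch: two-tailed KS test and peri-event z-scoring (placeholder data, illustrative only).
import numpy as np
from scipy.stats import ks_2samp

lesion_strokes = np.array([1.2, 1.5, 1.9, 2.4, 2.8])    # hypothetical stroke distances
control_strokes = np.array([0.9, 1.1, 1.3, 1.6, 1.8])
stat, p = ks_2samp(lesion_strokes, control_strokes)      # two-sided by default

def peri_event_zscore(rate, onsets, pre=20, post=20):
    """rate: (n_bins,) binned firing rate; onsets: bin indices of behavior onsets."""
    windows = np.stack([rate[t - pre:t + post] for t in onsets
                        if t - pre >= 0 and t + post <= len(rate)])
    return (windows - windows.mean()) / windows.std()    # z-scored peri-event matrix
```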
Software and availability:
- Open-source code and trained DLC model weights are available on GitHub/Zenodo, with a GUI for step-by-step usage. Processing speed: ~100,000 frames/min on a typical laptop; one hour of 60 fps, six-point data is processed in under five minutes on a CPU-only machine with 128 GB RAM.
Key Findings
- Unsupervised discovery of behaviors: From six mice in open field, B-SOiD identified 11 behavior classes in an 11-D UMAP space, mapping well to ethological categories (e.g., locomotion, rearing, grooming subtypes including paw/face groom, head groom, body lick, and hindleg itch). Clusters were consistent across animals of different sizes and colors.
- Classifier accuracy and speed: A random forest trained on high-dimensional features reproduced HDBSCAN labels with >90% accuracy on 20% held-out data; 10-fold cross-validation showed high accuracy across groups. The classifier enables rapid prediction (one hour of 60 fps, six-point data in <5 minutes on CPU; ~100,000 frames/min on a typical laptop) and generalization across sessions/subjects.
- Frameshift temporal precision: Frameshifted predictions aligned to native frame rates improved transition timing. Using a 200 fps baseline, behavior-label coherence remained high even at 10 fps (median ~84% without frameshift), and benefits plateaued above 50 fps. Frameshift yielded visibly improved transition-aligned limb trajectories.
- Neural validation: In simultaneous motor cortex recordings (35 units), aligning to high-resolution frameshift onsets increased the magnitude of behavior-locked neural modulation around action initiation compared to low-resolution (10 fps) alignment. Across behaviors, high–low resolution signal difference was significantly positive around onset (p<0.01, two-tailed t-test), particularly for forelimb-involving actions. A Poisson spiking model reproduced the observed biphasic differences and demonstrated reduced temporal displacement at higher alignment resolution.
- Viewpoint generalization: Segmentation from top-down and bottom-up cameras showed substantial overlap; mapped categories were conserved across views, with overlap several-fold above baseline distributions. Some divergence occurred for paw-dependent grooming behaviors in the top-down view due to occlusions.
- Benchmark vs MotionMapper: Using identical pose inputs, B-SOiD produced more distinct motion energy (ME) images per group and greater out-group vs in-group MSE separation (cumulative distribution right-shift). Statistical separation: MotionMapper out-group shift p<3e-12; B-SOiD p<7e-111; shuffled p=0.60. B-SOiD processing was ~100× faster with six points and avoids memory limitations reported for MotionMapper.
- Lesion study kinematics: In A2A-cre mice with striatal indirect-pathway lesions (AAV2-flex-taCasp3-TEVP; N=4) vs controls (N=4), face-groom strokes exhibited significant rightward shifts in distance and speed distributions (Kolmogorov–Smirnov tests: p<0.05 to p<0.0001), especially for smaller movements; head-groom did not show these changes. Itching speed increased; locomotor stride length increased while stride speed did not, suggesting a kinematic basis for hyperactivity following indirect-pathway lesion.
Discussion
B-SOiD demonstrates that extracting and clustering spatiotemporal pose relationships, followed by training a supervised classifier, yields robust, unbiased behavior segmentation that generalizes across subjects, camera views, and labs. The approach identifies not only action categories but also kinematics at millisecond scales through the frameshift paradigm. Neural data aligned to high-resolution action onsets exhibit stronger and temporally sharper modulation, validating the behavioral clusters as neurally meaningful and highlighting the importance of precise onset timing for linking behavior and neural activity. Comparisons across views confirm external consistency, and benchmarking against MotionMapper shows improved within-group consistency and between-group separability with dramatic gains in processing speed. Real-world application to basal ganglia lesion models reveals behavior-specific kinematic changes that are undetectable with traditional methods, underscoring the utility for studying movement disorders, OCD, pain, and itch. Together, these results address the initial challenges of bias, temporal resolution, and generalizability in behavioral segmentation from pose data.
Conclusion
The paper introduces B-SOiD, an open-source, unsupervised-to-supervised pipeline that discovers and rapidly predicts behavior categories and kinematics from pose data with high temporal precision. It generalizes across datasets and camera views, aligns well with neural population dynamics at action initiation, and outperforms a leading unsupervised alternative on separability and speed. As a practical tool with a user-friendly GUI and open resources, B-SOiD enables detailed, unbiased analysis of naturalistic behaviors and their kinematics in rodents and other species. Future directions include: real-time closed-loop applications using the fast classifier; integration of additional modalities (e.g., audio, stimuli, multi-animal interactions); expansion to 3D pose and social behavior; and broader application to disease models to dissect action sequencing and kinematic signatures.
Limitations
- Dependence on pose-estimation quality: Occlusions or low-confidence keypoints require imputation (holding last good position), which can zero out displacement and potentially affect fine kinematic estimates, though smoothing and multi-feature input mitigate this.
- View-specific constraints: Top-down cameras underperform for paw-centric behaviors due to occlusions, leading to some misclassification toward head-movement–based categories. Bottom-up transparent floors may alter behavior in some paradigms.
- Temporal resolution trade-offs: Frameshift relies on initial downsampling to 10 fps for SNR; benefits plateau above ~50 fps. Extremely brief bouts (<50 ms) are filtered to enforce continuity, which may omit ultrafast micro-actions unless parameters are adjusted.
- Validation scope: Core benchmarks are from six mice in open field; neural demonstration is from a single session/animal (35 units), limiting generalizability of neural findings pending larger-scale validations.
- Fixed feature set and parameters: Current implementation uses six points in mice and specific UMAP/HDBSCAN parameters; performance may vary with different species, point sets, or environments and may require user tuning.
- No absolute ground truth: Behavioral classes are derived from unsupervised clustering; while neurally validated and ethologically consistent, they are not based on exhaustive human labels or external gold standards.