Localization and recognition of human action in 3D using transformers

Computer Science


J. Sun, L. Huang, et al.

Discover the groundbreaking BABEL-TAL dataset and the LocATe model that sets new benchmarks in 3D action localization! This innovative research, conducted by Jiankai Sun and colleagues, pushes the boundaries of 3D human behavior analysis with applications in HCI and healthcare.

Introduction
Understanding human behavior from 3D motion sequences is crucial in computer vision. 3D Temporal Action Localization (3D-TAL) aims to identify actions and their precise start and end times within a sequence, with applications in animation, HCI, and AR/VR. Although 3D sensors are increasingly affordable, progress in 3D-TAL has stagnated because existing benchmark datasets often lack diversity and are recorded in constrained environments, leading to low intra-class variance. This paper addresses these limitations by introducing BABEL-TAL (BT), a challenging new benchmark featuring diverse actions, long sequences, and significant intra-class variance, designed to more accurately reflect real-world scenarios. The dataset is organized into subsets (BT-20, BT-60, and BT-ALL) by action granularity, enabling comprehensive evaluation. The paper also proposes LocATe, a model that uses a transformer architecture with deformable attention to jointly localize and recognize actions. LocATe outperforms prior methods on the BT and PKU-MMD datasets, particularly when training data is limited, highlighting its potential for practical applications.
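To make the task concrete, the following is a minimal Python sketch of the input and output format typical of 3D-TAL: the input is a sequence of 3D joint positions, and the output is a set of labeled temporal segments. The array shape, frame counts, and action labels are illustrative assumptions, not the exact schema of BABEL-TAL.

```python
import numpy as np

# Hypothetical sequence: 300 frames of a 25-joint skeleton in 3D.
T, J = 300, 25
sequence = np.zeros((T, J, 3))      # (frames, joints, xyz)

# A 3D-TAL prediction (or ground-truth annotation) is a set of action
# instances, each given as (start_frame, end_frame, label).
segments = [
    (0, 90, "walk"),
    (80, 140, "turn"),              # action instances may overlap in time
    (140, 300, "sit"),
]
```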
Literature Review
Existing 3D action localization datasets, such as G3D, CAD-120, Comp. Act., Watch-N-Patch, OAD, and PKU-MMD, suffer from limitations including constrained recording environments, limited action diversity, and low intra-class variance. While some datasets like SBU Kinect Interaction focus on two-person interactions, others like Wei et al.'s dataset incorporate RGB-D videos alongside skeletal sequences. The recently introduced BABEL dataset provides a more diverse set of actions but lacks the temporal annotations needed for 3D-TAL. Regarding transformer-based approaches to 3D action recognition, several methods have been proposed, including those employing decoupled spatial-temporal attention, stacked relation networks (SRN), sequential correlation networks (SCN), and two-stream transformer networks. However, these methods predominantly target action recognition rather than localization. This paper builds on these advances to introduce a new dataset and an approach specifically tailored to 3D-TAL.
Methodology
The paper introduces the BABEL-TAL (BT) dataset, derived from the AMASS dataset and annotated using a modified version of the VIA annotation software. The dataset is divided into three subsets: BT-20 (20 action classes), BT-60 (60 action classes), and BT-ALL (102 action classes). BT addresses the limitations of previous datasets by featuring diverse and complex movements, long sequences, and a long-tailed distribution of actions, mirroring real-world scenarios. The authors propose LocATe, a single-stage transformer-based model for 3D-TAL. LocATe uses a skeleton-based sampling strategy to convert the 3D skeleton sequence into a fixed set of joint snippets. Positional encoding is added to the 3D pose features before they are input to the transformer encoder. The encoder uses deformable attention (DA) to capture global context across all temporal positions, producing features that encode action-span information. A decoder, also employing DA, transforms action queries into representations that the prediction heads (regression and recognition networks) use to output temporal localizations. The model is trained with bipartite matching to associate predictions with ground-truth segments, and a class-balanced focal loss addresses the dataset's class imbalance. The paper evaluates LocATe against several baselines, including Beyond-Joints, SRN, ASFD, TSP, G-TAD, and AGT, on both BT-20 and PKU-MMD, and additionally uses human evaluation to assess the quality of the localization predictions.
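Below is a minimal, hedged sketch of a LocATe-style single-stage detector in PyTorch. It uses a standard nn.Transformer as a stand-in for the deformable-attention encoder and decoder described above; the layer sizes, query count, snippet budget, and head designs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleTAL(nn.Module):
    """Illustrative single-stage 3D temporal action detector (not the authors' code)."""
    def __init__(self, num_joints=25, num_classes=20, d_model=256,
                 num_queries=32, num_layers=4, max_snippets=100):
        super().__init__()
        # Project each temporal snippet of flattened 3D joints to d_model dims.
        self.input_proj = nn.Linear(num_joints * 3, d_model)
        # Learned temporal positional encoding over a fixed number of snippets.
        self.pos_embed = nn.Parameter(torch.randn(1, max_snippets, d_model))
        # Standard attention used here in place of deformable attention.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=num_layers,
            num_decoder_layers=num_layers, batch_first=True)
        # Learnable action queries: each query proposes one candidate segment.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        # Prediction heads: class scores (+1 for "no action") and (center, width).
        self.cls_head = nn.Linear(d_model, num_classes + 1)
        self.reg_head = nn.Linear(d_model, 2)

    def forward(self, joints):                    # joints: (B, T, J, 3), T <= max_snippets
        b, t, _, _ = joints.shape
        x = self.input_proj(joints.flatten(2))    # (B, T, d_model)
        x = x + self.pos_embed[:, :t]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        h = self.transformer(x, q)                # (B, num_queries, d_model)
        # Segments as normalized (center, width) in [0, 1]; classes include background.
        return self.cls_head(h), self.reg_head(h).sigmoid()

# Example: a batch of two sequences, 100 snippets of 25 joints each.
model = SimpleTAL()
logits, segments = model(torch.randn(2, 100, 25, 3))
print(logits.shape, segments.shape)               # (2, 32, 21) (2, 32, 2)
```

During training, a bipartite (Hungarian) matching between predicted segments and ground-truth instances, for example via scipy.optimize.linear_sum_assignment, would decide which query is responsible for which action instance before the class-balanced focal and regression losses are applied.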
Key Findings
LocATe outperforms existing baselines on the BABEL-TAL (BT) dataset, particularly on the challenging BT-20 subset, with the largest gains at lower temporal IoU thresholds, suggesting higher recall. On PKU-MMD, LocATe achieves state-of-the-art performance, surpassing previous methods by almost 10% in the cross-subject evaluation, and it maintains high mAP (91.9%) even when trained on only 10% of the labeled training data. Analysis of the confusion matrix reveals that certain actions, like "stand," are often confused with others, highlighting the dataset's complexity. Average precision (AP) varies significantly across actions: simpler actions like "run" and "walk" achieve higher AP than more complex ones like "touch body part." Human evaluation corroborates the automatic results, confirming LocATe's superior performance. Experiments using 2D features extracted from rendered videos perform significantly worse than those using 3D joint features, underscoring the importance of 3D representations for this task. Ablation studies show that using action recognition features does not improve performance over raw joint positions.
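For readers unfamiliar with the metric behind these numbers, the sketch below shows how temporal IoU and per-class average precision at a fixed IoU threshold can be computed. The greedy matching rule and function names are assumptions for illustration; the benchmark's official evaluation code may differ in details such as precision interpolation.

```python
from typing import List, Tuple

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds: List[Tuple[float, float, float]],
                      gts: List[Tuple[float, float]],
                      iou_thresh: float = 0.5) -> float:
    """preds: (start, end, score) for one class; gts: (start, end) instances."""
    preds = sorted(preds, key=lambda p: -p[2])      # rank detections by confidence
    matched = [False] * len(gts)
    tp, fp, ap = 0, 0, 0.0
    for start, end, _ in preds:
        # Match each detection to its best unmatched ground-truth segment.
        best_iou, best_idx = 0.0, -1
        for i, gt in enumerate(gts):
            if not matched[i]:
                iou = temporal_iou((start, end), gt)
                if iou > best_iou:
                    best_iou, best_idx = iou, i
        if best_iou >= iou_thresh:
            matched[best_idx] = True
            tp += 1
            ap += tp / (tp + fp)                    # precision at this recall point
        else:
            fp += 1
    return ap / len(gts) if gts else 0.0

# Example: one predicted "walk" segment overlapping a ground-truth one (IoU = 0.8).
print(average_precision([(0.0, 9.0, 0.9)], [(1.0, 10.0)], iou_thresh=0.5))  # 1.0
```

Mean average precision (mAP) then averages this per-class value over all action classes, often at several IoU thresholds.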
Discussion
The superior performance of LocATe on both the newly introduced BABEL-TAL and the established PKU-MMD datasets demonstrates the effectiveness of the proposed transformer architecture with deformable attention for 3D-TAL. The model's ability to achieve state-of-the-art results with significantly less training data highlights its scalability and efficiency. The analysis of the confusion matrix and per-action AP scores reveals areas for future improvements, such as addressing the challenges posed by actions with high intra-class variance or subtle movements. The significant performance difference between using 3D and 2D features emphasizes the importance of leveraging the inherent advantages of 3D data for more accurate action understanding. The concordance between automatic and human evaluation results validates the robustness of the evaluation methodology. The research findings have implications for various applications requiring precise and efficient 3D human action understanding, such as healthcare, HCI, and animation.
Conclusion
This paper introduces BABEL-TAL, a challenging benchmark dataset for 3D action localization, and LocATe, a novel transformer-based model that achieves state-of-the-art performance. LocATe's efficiency in utilizing limited training data is a key advantage. Future work could explore the integration of RGB and motion-capture data to further improve action understanding. The improved accuracy and scalability of this approach have important implications for healthcare and assistive technologies.
Limitations
One limitation of the study is the focus on skeletal data; incorporating RGB information could potentially improve performance. The human evaluation, while valuable, involved a limited number of annotators. The model's performance on highly nuanced actions could be further improved. Finally, the generalization of the model to unseen environments and actions warrants further investigation.