
Computer Science
Intuitive physics learning in a deep-learning model inspired by developmental psychology
L. S. Piloto, A. Weinstein, et al.
Explore how AI systems can close the gap with humans in common-sense understanding of intuitive physics, a skill even young children master. This research by Luis S. Piloto, Ari Weinstein, Peter Battaglia, and Matthew Botvinick introduces a new machine-learning dataset and a deep-learning system, PLATO, which learns intuitive physics from visual data.
~3 min • Beginner • English
Introduction
The paper addresses the gap between modern AI systems and human-like common-sense reasoning in the domain of intuitive physics. The authors ask whether and how a learning system can acquire core physical concepts (e.g., continuity, object persistence, solidity, unchangeableness, and directional inertia) from visual experience. Drawing on developmental psychology, they adopt two principles: (1) intuitive physics consists of discrete, separately probeable concepts, and (2) possession of a concept entails expectations about future events that can be measured via violations of expectation (VoE). The research questions are: Can a deep-learning model learn such concepts directly from videos? Is object-centric representation crucial for this acquisition? The work introduces a new dataset and a model (PLATO) designed to test these hypotheses, using VoE-based evaluation that quantifies concept knowledge by comparing surprise at physically possible versus physically impossible events.
Literature Review
The study builds on extensive developmental psychology literature showing early acquisition of physical concepts in infants and the use of the VoE paradigm to measure concept possession (e.g., object permanence, solidity, and continuity). Classic findings include longer infant gaze to events violating continuity or solidity. In AI, prior approaches to intuitive physics often used prediction accuracy, question answering, or RL rewards as proxies, without explicitly probing discrete concepts. Two closely related VoE datasets were developed by Riochet et al. (IntPhys) and by Smith et al. (ADEPT). These datasets provided concept-oriented evaluations but either had restricted event diversity or incorporated hand-engineered physics rather than learning from scratch. Object-centric and relational inductive biases have been proposed and explored in AI (e.g., Interaction Networks, graph networks, relational modules) and shown to aid generalization and data efficiency. The present work extends this literature by combining object-centric representation, relational dynamics, and a concept-specific VoE evaluation with stronger separation between training and test events, and by systematically comparing to non-object-centric baselines.
Methodology
Datasets: The Physical Concepts dataset includes two parts: (1) a training corpus of 300,000 procedurally generated videos (15 frames at 64×64 resolution; MuJoCo engine) featuring composable physical interactions (rolling, collisions, occlusions via a curtain, stacking, covering, containment, ramps) with a drifting camera. Primitive shapes are spheres and rectangular prisms; object sizes, masses, and colors vary within controlled ranges. Each frame includes a corresponding segmentation mask with unique IDs per object. An additional 5,000 videos form a held-out test set for hyperparameter selection. (2) A VoE test suite with 5,000 probe tuples per concept (continuity, object persistence, solidity, unchangeableness, directional inertia). Each tuple comprises two possible and two impossible videos tightly matched at the image and pairwise-frame levels via a splicing procedure that swaps start/end segments around a shared frame to create aphysical sequences while keeping all individual frames and adjacent pairs physically plausible. A stationary camera is used in probes to control occlusion.
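A minimal sketch of that splicing idea, assuming videos are stored as NumPy frame arrays (function and variable names are illustrative, not the released generation pipeline):

```python
import numpy as np

def splice_probe_tuple(video_a, video_b, shared_idx):
    """Sketch of the splicing procedure described above.

    video_a, video_b: (T, H, W, C) frame arrays of two physically possible
    videos that contain an identical frame at index `shared_idx` (e.g., a
    moment when the scene is occluded). Swapping the segments around that
    frame yields two aphysical videos in which every individual frame, and
    every adjacent frame pair, remains physically plausible.
    """
    impossible_ab = np.concatenate([video_a[:shared_idx], video_b[shared_idx:]])
    impossible_ba = np.concatenate([video_b[:shared_idx], video_a[shared_idx:]])
    possible = (video_a, video_b)                 # the two possible probes
    impossible = (impossible_ab, impossible_ba)   # the two impossible probes
    return possible, impossible
```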
Model (PLATO): An object-centric architecture with two components. (a) Perception module (ComponentVAE): Inputs are a 64×64 RGB frame and an 8-channel segmentation mask (max 8 objects including ground). It outputs K=8 object codes, each a 16-D Gaussian posterior for a masked object image; a spatial broadcast decoder reconstructs masked object images and masks. The module is trained offline via a variational objective (β-like setting with φ=0.1, γ=10) using 4.5M images, then frozen. (b) Dynamics predictor (InteractionLSTM): A slotted, object-wise LSTM (2,056 hidden units, shared weights) with per-object memory that takes the history of object codes (object buffer) and viewpoint info to predict next-step object codes with teacher forcing. An Interaction Network computes pairwise interactions among LSTM cell states and inputs using MLPs (3 layers, 512 units, GELU), aggregated via sum and max and concatenated back into the object’s LSTM slot. Residual connections from current object code to prediction are used. Camera position at current and next timestep is included to account for viewpoint drift present in training data.
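The dynamics step lends itself to a compact sketch. Below is a minimal PyTorch-style rendering of one slotted prediction step; names such as InteractionLSTMStep and pair_mlp are illustrative, layer sizes are shrunk for readability (the text above gives the actual sizes), and the object-history buffer and camera inputs are omitted:

```python
import torch
import torch.nn as nn

class InteractionLSTMStep(nn.Module):
    """One object-wise prediction step over K slots (illustrative sizes)."""

    def __init__(self, code_dim=16, hidden=256, inter=128):
        super().__init__()
        # Pairwise interaction MLP over concatenated (receiver, sender) features.
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * (code_dim + hidden), inter), nn.GELU(),
            nn.Linear(inter, inter), nn.GELU(),
            nn.Linear(inter, inter), nn.GELU(),
        )
        # One LSTM cell with shared weights, applied slot-wise (per-object memory).
        self.cell = nn.LSTMCell(code_dim + 2 * inter, hidden)
        self.readout = nn.Linear(hidden, code_dim)

    def forward(self, codes, state):
        # codes: (K, code_dim) current object codes; state: (h, c), each (K, hidden).
        h, c = state
        feats = torch.cat([codes, c], dim=-1)        # inputs + cell states per slot
        K = feats.shape[0]
        recv = feats.unsqueeze(1).expand(K, K, -1)   # receiver i in row i
        send = feats.unsqueeze(0).expand(K, K, -1)   # sender j in column j
        effects = self.pair_mlp(torch.cat([recv, send], dim=-1))  # (K, K, inter)
        # Aggregate incoming effects by sum and max (self-pairs kept for brevity).
        agg = torch.cat([effects.sum(dim=1), effects.max(dim=1).values], dim=-1)
        h, c = self.cell(torch.cat([codes, agg], dim=-1), (h, c))
        return codes + self.readout(h), (h, c)       # residual prediction of next codes
```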
Baselines: Two flat (object-agnostic) models replace the set of object codes with a single scene embedding: (1) Flat Equal Parameters (FEP): 16-D embedding to match PLATO’s dynamics parameter count but with reduced perceptual capacity. (2) Flat Equal Capacity (FEC): 128-D embedding to match PLATO’s total representational capacity, resulting in ≈4M more parameters in the dynamics predictor than PLATO.
Training: Perception module trained with RMSProp (lr=1e-4), 1,000,000 steps, batch size 64 images. Dynamics predictor trained for 1,300,000 steps, batch size 128 videos, lr 1e-4 decayed to 4e-5 after 300,000 steps, using teacher forcing over 15-frame sequences. For main comparisons, five random seeds are used. For training-set-size ablations, three seeds and varying numbers of training videos, with the perception module kept frozen.
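Using the step module sketched above, a teacher-forced update could look like the following hedged sketch; squared error stands in for the actual objective on object codes, which the summary does not specify:

```python
import torch

def init_state(num_slots, hidden=256):
    """Zero-initialized (h, c) state, one row per object slot."""
    return torch.zeros(num_slots, hidden), torch.zeros(num_slots, hidden)

def train_step(model, codes, optimizer):
    """One teacher-forced step on a single video.

    codes: (T, K, code_dim) frozen per-frame object codes from the
    perception module. At every timestep the model receives the true
    codes (teacher forcing) and is penalized on its next-step prediction.
    """
    state = init_state(codes.shape[1])
    loss = codes.new_zeros(())
    for t in range(codes.shape[0] - 1):
        pred, state = model(codes[t], state)
        loss = loss + ((pred - codes[t + 1]) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```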
Evaluation (VoE paradigm): For each video, surprise is the sum over frames of the pixel-space reconstruction error between the predicted and decoded images (predicted object codes are decoded and composed into a frame). For each concept's 5,000-tuple set, the physically possible surprise (summed over the two possible probes) and the physically impossible surprise (summed over the two impossible probes) are computed. Accuracy is the fraction of tuples in which the impossible surprise exceeds the possible surprise. Relative surprise is the normalized difference (impossible − possible) / (impossible + possible). Means are averaged over seeds; significance is assessed with one-tailed, one-sample t-tests against chance (0 for relative surprise; 0.5 for accuracy).
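In code, the per-tuple decision rule and the seed-level test could be sketched as follows (a minimal sketch assuming per-probe surprises have already been computed; array names are illustrative):

```python
import numpy as np
from scipy import stats

def voe_metrics(surprise_possible, surprise_impossible):
    """surprise_*: (N, 2) arrays of per-probe surprises (summed frame errors)
    for the two possible and two impossible videos of each of N tuples."""
    s_pos = surprise_possible.sum(axis=1)    # physically possible surprise
    s_imp = surprise_impossible.sum(axis=1)  # physically impossible surprise
    accuracy = float(np.mean(s_imp > s_pos))
    relative = float(np.mean((s_imp - s_pos) / (s_imp + s_pos)))
    return accuracy, relative

def one_tailed_ttest(per_seed_values, chance):
    """One-tailed, one-sample t-test against chance (0 or 0.5 in the text)."""
    t, p_two_sided = stats.ttest_1samp(per_seed_values, chance)
    return t, p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
```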
Generalization: Evaluate PLATO without retraining on three probe types from the independently developed ADEPT dataset: ‘block’ (tests solidity/continuity) and ‘overturn short’/‘overturn long’ (object permanence with a rotating drawbridge). ADEPT videos are centrally cropped, downsampled to 64×64, and temporally downsampled to 15 frames. Because PLATO requires consistent mask channel ordering, manual alignment is applied where needed to mine three compatible probe types. Metrics mirror the main evaluation but over possible/impossible pairs (not quadruplets).
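The ADEPT preprocessing could be approximated as below (a rough sketch: strided indexing stands in for proper image resampling, and the manual mask re-alignment step is omitted):

```python
import numpy as np

def preprocess_adept(video):
    """Center-crop to a square, downsample to 64x64, subsample to 15 frames.

    video: (T, H, W, C) array with min(H, W) >= 64.
    """
    T, H, W, _ = video.shape
    side = min(H, W)
    y0, x0 = (H - side) // 2, (W - side) // 2
    video = video[:, y0:y0 + side, x0:x0 + side]           # central crop
    t_idx = np.linspace(0, T - 1, 15).round().astype(int)  # 15 evenly spaced frames
    s = side // 64
    return video[t_idx, ::s, ::s][:, :64, :64]             # crude spatial downsample
```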
Training set size analysis and visual experience: The 300,000 videos correspond to ~6.95 days of continuous visual experience (or ~20.9 days at 8h/day), and 50,000 videos to ~28 hours (~3.5 days at 8h/day).
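These figures are mutually consistent with roughly two seconds of footage per 15-frame video (an implied rate; the summary does not state it directly):

```latex
\begin{aligned}
300{,}000 \times 2\,\mathrm{s} &= 600{,}000\,\mathrm{s} \approx 166.7\,\mathrm{h}
  \approx 6.95\ \text{days continuous} \approx 20.9\ \text{days at } 8\,\mathrm{h/day},\\
50{,}000 \times 2\,\mathrm{s} &= 100{,}000\,\mathrm{s} \approx 27.8\,\mathrm{h}
  \approx 3.5\ \text{days at } 8\,\mathrm{h/day}.
\end{aligned}
```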
Key Findings
- PLATO (object-centric) shows robust VoE effects across all five concepts (means over five seeds, 5,000 probe quadruplets per concept):
• Relative surprise (mean above 0 with one-tailed t-tests): continuity M=0.044, s.d.=0.006, t(4)=15.9, P=4.6×10^-5; directional inertia M=0.017, s.d.=7×10^-4, t(4)=47.8, P=5.7×10^-7; object persistence M=0.034, s.d.=0.008, t(4)=8.7, P=4.8×10^-4; solidity M=0.009, s.d.=0.003, t(4)=6.4, P=0.002; unchangeableness M=0.007, s.d.=2.2×10^-4, t(4)=60.57, P=2.2×10^-7.
• Accuracy (mean above 0.5 with one-tailed t-tests): continuity M=0.891, s.d.=0.028, t(4)=27.7, P=5×10^-6; directional inertia M=0.727, s.d.=0.017, t(4)=26.9, P=5.6×10^-6; object persistence M=0.678, s.d.=0.043, t(4)=8.2, P=5.9×10^-4; solidity M=0.719, s.d.=0.064, t(4)=6.8, P=0.001; unchangeableness M=0.656, s.d.=0.021, t(4)=14.7, P=6.2×10^-5.
• Frame-wise analyses show relative surprise rises at the onset of the impossible events, with concept-specific temporal profiles.
- Flat (object-agnostic) baselines show diminished/absent VoE effects, even when given greater capacity (FEC):
• FEC relative surprise significantly >0 for only three concepts: continuity M=0.009, s.d.=0.008, t(4)=2.4, P=0.038; directional inertia M=0.012, s.d.=0.01, t(4)=2.4, P=0.036; unchangeableness M=1.4×10^-4, s.d.=1.3×10^-4, t(4)=2.15, P=0.049; not for object persistence or solidity.
• FEC accuracy reliably exceeds 0.5 only for continuity: M=0.71, s.d.=0.15, t(4)=2.8, P=0.024; directional inertia is marginal: M=0.69, s.d.=0.2, t(4)=1.9, P=0.065; accuracy is at chance for object persistence (0.51), solidity (0.50), and unchangeableness (0.493).
- Data efficiency: With only 50,000 training examples (~28 hours of visual experience), PLATO shows robust aggregate VoE effects across concepts: grand mean relative surprise M=0.02, s.d.=0.003, t(2)=9.246, P=0.006; accuracy M=0.75, s.d.=0.015, t(2)=23.3, P=9.2×10^-4. An untrained dynamics predictor (training-set size of zero) shows no VoE effects, indicating that learning the dynamics is critical.
- Generalization to novel objects/dynamics (ADEPT, no retraining): Significant VoE effects for all three probe types:
• Relative surprise >0: block M=0.007, s.d.=0.002, t(4)=7.5, P=8.4×10^-4; overturn long M=0.069, s.d.=0.011, t(4)=12.671, P=1.1×10^-4; overturn short M=0.022, s.d.=0.016, t(4)=2.8, P=0.024.
• Accuracy >0.5: block M=0.765, s.d.=0.049, t(4)=10.9, P=2×10^-4; overturn long M=0.97, s.d.=0.037, t(4)=25.3, P=7.2×10^-6; overturn short M=0.79, s.d.=0.16, t(4)=3.56, P=0.012.
- Overall, object-centric representation and relational processing are critical for learning and expressing intuitive physics concepts as measured by VoE.
Discussion
The findings show that a deep-learning model equipped with object-centric perception, tracking, and relational dynamics can acquire core intuitive physics concepts from visual data, as evidenced by strong VoE effects across five concept categories. The comparison to flat, object-agnostic baselines establishes the importance of object-level representations and interactions for both detection of violations and generalization. The model’s surprise signals align temporally with the onset of impossibilities, suggesting it learns expectations congruent with the probed concepts. Data efficiency analyses indicate that a relatively small amount of visual experience (tens of hours) suffices to learn such expectations, contrasting with the common view that deep models require vast data. Generalization to ADEPT demonstrates robustness to novel appearances and motion patterns without retraining. The authors propose that object-centric processing acts as a regularizer, guiding learning toward compositional, relational structures that mirror physical reality, thus improving transfer and reducing overfitting to training specifics. The work extends prior VoE datasets and methods by offering richer dynamics, stricter train-test separation, and a learned model (as opposed to hand-engineered physics) that succeeds broadly across concepts. The results support developmental psychology insights that object individuation, tracking, and relational reasoning underpin intuitive physics, and suggest cross-fertilization between AI inductive biases and cognitive theories.
Conclusion
The paper introduces (1) the Physical Concepts dataset for VoE-based evaluation of five core intuitive physics concepts, and (2) PLATO, an object-centric model that learns per-object representations and relational dynamics from video. PLATO exhibits robust VoE effects across concepts, requires relatively modest visual experience to do so, and generalizes to an external dataset with novel objects and events. The study highlights the critical role of object-level representation and interactions in acquiring intuitive physics and suggests that such inductive biases serve as effective regularizers for generalization. Future directions include removing privileged inputs by integrating unsupervised object discovery and tracking, expanding to more naturalistic and diverse scenes and object types, modeling developmental trajectories and the order of concept acquisition, leveraging neurophysiological VoE measures, and training on large-scale infant-perspective datasets to further connect AI models with human developmental processes.
Limitations
- Privileged information: The model relies on ground-truth segmentation masks and consistent object indices (tracking) during training and evaluation; segmentation and tracking are not learned end-to-end from raw video.
- Domain scope: Training and probe events, while diverse procedurally, remain limited relative to real-world variability (object types, textures, lighting, complex dynamics).
- Probe construction: VoE probes use splicing to ensure tight control over image-level statistics; certain designs (e.g., unchangeableness with a single occluder) may admit alternative interpretations (e.g., self-propelled motion) in principle.
- Not a direct developmental model: The system is inspired by developmental principles but does not simulate developmental timelines or the acquisition order of concepts.
- Evaluation assumptions: Statistical tests assume normality across seeds without formal testing; pixel-space error can overweight larger objects despite within-tuple matching.
- Code availability: Implementation is not publicly released in a directly usable form; reproducibility may require substantial re-implementation.
- Camera/viewpoint handling: Training used drifting cameras; probes used stationary cameras; the reliance on provided viewpoint metadata may limit applicability to settings without such information.