Computer Science
Probing for consciousness in machines
M. Immertreu, A. Schilling, et al.
The paper addresses whether and how artificial agents might develop components necessary for consciousness, focusing on Damasio’s theory in which core consciousness arises from integrating a self model (grounded in feelings/emotions) with a world model. The authors argue that behavioral indistinguishability (e.g., passing Turing-like tests) does not suffice to infer consciousness, referencing Searle’s Chinese Room and concerns about attributing internal understanding from external behavior alone. They propose a structural approach: probe internal representations of RL agents trained in virtual environments to see if they develop world and self models as byproducts of task learning. The research questions include: (1) Can a machine become conscious? (2) Can a machine possess core consciousness as defined by Damasio? (3) Can a machine develop models of the world, itself, and their relations? (4) Can an agent in a computer game develop such models? The authors motivate Damasio’s framework (protoself, core consciousness, extended consciousness) as particularly suitable for embodied, goal-directed RL agents, where emotions can be operationalized via reward/homeostatic signals, and emphasize the advantage of full access to internal activations in these systems.
The paper situates its inquiry within debates on assessing machine consciousness. It contrasts behavior-based tests (Turing Test; recent claims about LLMs’ indistinguishability) with critiques like Searle’s Chinese Room, highlighting the gap between output and internal understanding. Among theories of consciousness, the authors review Integrated Information Theory (IIT), Global Workspace Theory (GWT), and Damasio’s model. IIT’s Φ-based causal modeling is noted as hard to scale; GWT requires architectural features (global broadcasting, attention) not typical in standard RL agents. Damasio’s theory, emphasizing embodied, affect-modulated self/world integration (protoself → core → extended consciousness), is presented as mechanistically detailed and operationalizable in AI. They also reference work showing emergent world models in trained networks (e.g., a transformer trained on Othello moves recovering board state; Li et al., 2022), motivating the use of probes to detect internal representations. The discussion integrates broader views (e.g., Seth & Bayne, Kuhn’s taxonomy) and affective neuroscience perspectives on emotions and feelings, while noting conceptual distinctions between intelligence and consciousness.
The study employs reinforcement learning (RL) agents trained with Proximal Policy Optimization (PPO) in the NetHack Learning Environment via MiniHack scenarios.
- Environments: 15×15 grid maps: MiniHack-Room-Random-15x15-v0 (random), -Monster-15x15-v0 (monster; 3 monsters), -Trap-15x15-v0 (trap; 15 teleportation traps), and -Ultimate-15x15-v0 (ultimate; monsters + traps; unlit, 3×3 visible window). Entities are randomly placed; teleportation traps are initially invisible and relocate the agent to a random cell.
- Action space: initially all 8 directions; later restricted to the 4 cardinal moves.
- Observations: the entire map plus a centered crop (9×9 initially; later 5×5 or 3×3), represented as glyph IDs.
- Reward: +1 on reaching the goal (the down staircase), −0.001 per step; maximum episode length 300.
- Training: each agent was trained on a single map instance (fixed seed per agent) using RLlib’s PPO with standard MiniHack-like hyperparameters until convergence.
- Agent architecture: embeddings (dim 64) → 5 Conv2D layers (16 filters of size 3; the last with 8 filters) → concatenation → 2 linear layers (hidden dim 256) → policy and value heads. From experiment 2 onward, an LSTM (hidden/cell size 512) is inserted between the linear layers and the heads.
- Probing approach: after training, a dataset is collected by running the agent and recording per-timestep internal activations together with the agent’s (x, y) position. Probes are small feedforward classifiers trained on layer activations (or LSTM hidden/cell states) to predict the agent’s coordinates on the 15×15 grid (argmax over scores per axis). In the first experiment, linear probes are trained on convolutional and linear layer activations; in subsequent experiments, both linear (single-layer) and non-linear (3-layer ReLU) probes are trained on LSTM hidden and cell states. Probe data: 230,000 samples per trained agent (200,000 train; 30,000 test).
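The described agent architecture can be sketched in PyTorch. This is a minimal illustration of the stated dimensions (64-dim embeddings, 5 conv layers, 256-dim linear layers, optional 512-dim LSTM); the per-stream wiring, padding, activation functions, and glyph-vocabulary size are assumptions not specified in the summary.

```python
import torch
import torch.nn as nn

class MiniHackAgent(nn.Module):
    """Sketch of the summarized actor-critic network; wiring details are assumed."""
    def __init__(self, n_glyphs=6000, n_actions=8, use_lstm=False):
        super().__init__()
        self.embed = nn.Embedding(n_glyphs, 64)  # glyph IDs -> 64-dim embeddings

        def conv_stack():
            # 5 Conv2D layers: 16 filters of size 3, last layer with 8 filters.
            layers, c_in = [], 64
            for i in range(5):
                c_out = 8 if i == 4 else 16
                layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ELU()]
                c_in = c_out
            return nn.Sequential(*layers)

        self.map_conv, self.crop_conv = conv_stack(), conv_stack()
        feat = 8 * 15 * 15 + 8 * 9 * 9  # flattened full-map + 9x9-crop features
        self.mlp = nn.Sequential(nn.Linear(feat, 256), nn.ELU(),
                                 nn.Linear(256, 256), nn.ELU())
        # From experiment 2 onward an LSTM sits between the MLP and the heads.
        self.lstm = nn.LSTM(256, 512, batch_first=True) if use_lstm else None
        head_in = 512 if use_lstm else 256
        self.policy = nn.Linear(head_in, n_actions)
        self.value = nn.Linear(head_in, 1)

    def forward(self, glyph_map, glyph_crop, state=None):
        def feats(conv, g):
            x = self.embed(g).permute(0, 3, 1, 2)  # (B,H,W,64) -> (B,64,H,W)
            return conv(x).flatten(1)
        h = self.mlp(torch.cat([feats(self.map_conv, glyph_map),
                                feats(self.crop_conv, glyph_crop)], dim=1))
        if self.lstm is not None:
            h, state = self.lstm(h.unsqueeze(1), state)
            h = h.squeeze(1)
        return self.policy(h), self.value(h), state
```

Probing then amounts to recording `h` (or the LSTM hidden/cell state) at each timestep alongside the agent's position.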
In some experiments, map-edge positions are excluded to prevent position inference from boundary observations (thus altering chance levels). Optimization used Adam; learning rates: 0.00005 for linear probes (exp. 1), 0.001 for linear and 0.0001 for non-linear probes (exp. 2 and 3); 10 runs per setting with different seeds to estimate mean and standard deviation.
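The probe-fitting step can be illustrated with a self-contained sketch. Since the real recorded LSTM activations are not available here, synthetic prototype vectors stand in for them, and the linear probe is fit in closed form by ridge regression to one-hot position labels (the paper uses gradient descent with Adam; the readout by per-axis argmax is the same).

```python
import numpy as np

rng = np.random.default_rng(0)
GRID, DIM, N = 15, 256, 4000  # 15x15 map; DIM stands in for an activation size

# Synthetic stand-in for the probing dataset: one noisy prototype activation
# vector per (x, y) cell, mimicking recorded per-timestep network states.
proto = rng.normal(size=(GRID, GRID, DIM))
xs = rng.integers(0, GRID, size=N)
ys = rng.integers(0, GRID, size=N)
acts = proto[xs, ys] + 0.1 * rng.normal(size=(N, DIM))

def fit_linear_probe(acts, labels, n_classes, lam=1e-2):
    """Ridge-regress activations onto one-hot labels; argmax decodes the class."""
    onehot = np.eye(n_classes)[labels]
    return np.linalg.solve(acts.T @ acts + lam * np.eye(acts.shape[1]),
                           acts.T @ onehot)

# One probe per coordinate, as in the paper's argmax-per-axis readout.
Wx = fit_linear_probe(acts, xs, GRID)
Wy = fit_linear_probe(acts, ys, GRID)
acc_x = np.mean((acts @ Wx).argmax(axis=1) == xs)
acc_y = np.mean((acts @ Wy).argmax(axis=1) == ys)
print(f"x accuracy {acc_x:.2f}, y accuracy {acc_y:.2f}, chance {1/GRID:.1%}")
```

In the actual experiments the activations come from a trained agent and the probe is evaluated on a held-out test split, so above-chance accuracy indicates positional information in the network rather than in the probe itself.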
- Probing reveals positional information in network activations beyond chance, indicating internal representations consistent with a world model.
- Dataset and evaluation: 230k samples/agent (200k train, 30k test); predictions evaluated by selecting the highest-scoring x and y coordinates; chance levels depend on excluded edges (e.g., 6.7%, 9.1%, 7.7%).
- Experiment 1 (no LSTM; inputs: full map + 9×9 crop; linear probes on layer activations; maps: ultimate, trap; chance 6.7%):
  - Ultimate: Conv layers 33.4–35.1% (sd ~0.17–0.92); Linear layers 25.2–28.1%.
  - Trap: ~8.0–8.8% (slightly above chance).
  - Interpretation: Positional info present, but may derive from direct observation; motivates recurrent architectures/restricted observations.
- Experiment 2 (LSTM 512; input: only 5×5 crop; cardinal actions; edges excluded; linear and 3-layer ReLU probes on LSTM hidden/cell; chance 9.1%):
  - Hidden (linear): Random 25.6%, Monster 58.5%, Trap 40.6%, Ultimate 25.1%.
  - Hidden (non-linear): Random 31.3%, Monster 64.0%, Trap 44.5%, Ultimate 27.7%.
  - Cell (linear): Random 29.9%, Monster 62.2%, Trap 42.6%, Ultimate 26.5%.
  - Cell (non-linear): Random 37.4%, Monster 67.4%, Trap 47.3%, Ultimate 30.4%.
  - Findings: Strongly above chance; memory-dependent positional encoding; non-linear probes outperform linear; cell states slightly more informative.
- Experiment 3 (LSTM 512; input: 3×3 crop; edges excluded; chance 7.7%):
  - Hidden (linear): Random 54.8%, Monster 49.5%, Trap 34.0%, Ultimate 27.6%.
  - Hidden (non-linear): Random 58.1%, Monster 53.3%, Trap 35.4%, Ultimate 28.9%.
  - Cell (linear): Random 57.5%, Monster 50.3%, Trap 34.2%, Ultimate 28.3%.
  - Cell (non-linear): Random 59.7%, Monster 54.7%, Trap 36.3%, Ultimate 30.4%.
  - Findings: Above chance across maps; highest on simpler random map; smaller crop can reduce accuracy on complex maps (monster/trap), implying increased difficulty and reliance on memory.
- Overall: Results suggest the agent forms internal world representations encoding its position; complexity (monsters/traps), observation scope, and probe nonlinearity affect probe accuracy. Accuracies across configurations ranged roughly from ~17% to ~67% in preliminary variations, consistently exceeding chance.
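The quoted chance levels are consistent with per-axis uniform guessing over the coordinates that remain after edge exclusion (my reading of the setup: with a centered crop of size c, positions within c // 2 cells of the border reveal the map edge and are excluded):

```python
GRID = 15  # 15x15 map

def chance(crop):
    # Positions within crop // 2 cells of the border are excluded, since the
    # visible map edge would give the position away; chance is then uniform
    # over the remaining coordinates per axis.
    margin = crop // 2
    return 1 / (GRID - 2 * margin)

print(f"full grid: {1 / GRID:.1%}")   # 6.7% (experiment 1, no exclusion)
print(f"5x5 crop:  {chance(5):.1%}")  # 9.1% (experiment 2)
print(f"3x3 crop:  {chance(3):.1%}")  # 7.7% (experiment 3)
```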
The findings address the core question by showing that RL agents can develop internal representations predictive of self-localization, a constituent of a world (and potentially self) model per Damasio’s framework. Restricting observations (5×5 or 3×3 crops) and excluding edge positions reduces trivial cues, demonstrating that positional information persists in recurrent states, implicating memory-based internal models. Teleportation traps disrupt spatial continuity, making localization harder and decreasing probe accuracy, yet still above chance—consistent with non-trivial world modeling. Non-linear probes extract more information than linear ones, and LSTM cell states appear slightly more informative than hidden states. While these results support the existence of world-model-like encodings, distinguishing self versus world models remains unresolved; further causal interventions (e.g., perturbing units encoding position or internal variables) are proposed to test functional usage of internal models. The authors argue that more complex architectures (advanced RNNs/transformers) and richer environments may foster more expressive internal models and enable stronger tests aligned with Damasio’s criteria for core consciousness.
The study presents evidence that RL agents trained in MiniHack develop internal representations encoding positional information, detectable via probes and exceeding chance across multiple settings. This suggests that even “model-free” approaches can yield implicit internal (world) models. The authors connect these results to broader theory, noting parallels between discounting in RL and successor representations shaping cognitive maps, and propose Damasio’s layered framework as a useful inductive bias for designing RL agents with integrated self/world representations. Future directions include: (1) scaling to richer, dynamic environments and tasks (delayed rewards, stochastic dynamics, multi-agent settings); (2) adopting more expressive architectures (transformers, advanced RNNs) to capture long-range dependencies; (3) explicit modeling of self versus world signals (including internal state variables and multi-dimensional affect-like signals); (4) causal probing via activation interventions to test whether internal models are functionally used in decision-making; and (5) extending evaluations to real-world domains (robotics, social simulations) to assess generalizability and robustness.
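The proposed causal probing (point 4) can be illustrated in miniature: edit a hidden state along a trained probe's direction for a target coordinate and check whether the decoded position follows. The weights and state below are random stand-ins; in a real experiment the edited state would be fed back into the agent, and its behavior, not just the probe readout, would be observed.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, GRID = 512, 15

# Random stand-ins for a trained linear probe's weight matrix (one score
# column per x-coordinate) and one recorded LSTM hidden state.
W = rng.normal(size=(DIM, GRID)) / np.sqrt(DIM)  # columns roughly unit norm
h = rng.normal(size=DIM)

def decode_x(h):
    """Probe readout: argmax over per-coordinate scores."""
    return int((h @ W).argmax())

# Intervention: push the state along the probe direction for a target
# coordinate, then check whether the decoded position follows.
target = 3
h_edit = h + 10.0 * W[:, target]
print(decode_x(h), "->", decode_x(h_edit))
```

If the agent's subsequent actions changed as though it believed itself at the target position, that would be evidence the representation is functionally used, not merely decodable.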
- The work does not claim or detect subjective experience; it probes structural/functional precursors (self/world models) inspired by Damasio’s theory.
- Environments are simplified with constrained spatial and task complexity; results may not generalize to richer settings without additional validation.
- Emotions/feelings are approximated via scalar rewards, which do not capture the multidimensional, biologically grounded affective states and intrinsic significance present in organisms.
- Differentiating self vs. world models is not fully addressed; positional encoding alone is insufficient to demonstrate a true self model.
- Probe success indicates information presence/correlation but not necessarily functional use; causal evidence (e.g., interventions) is needed.
- Access to full internal states is specific to compact RL agents; insights may be harder to obtain in large black-box systems (e.g., LLMs).