The Arts
The virtual drum circle: polyrhythmic music interactions in extended reality
B. V. Kerrebroeck, K. Crombé, et al.
The study examines how extended reality (XR) can support realistic, low-latency, expressive joint musical performance, and how XR-mediated visual coupling and auditory context influence interpersonal coordination. Grounded in action-oriented and embodied-cognition perspectives on joint action, the authors note that music provides a rich testbed for anticipatory and adaptive sensorimotor processes and for prosocial outcomes (e.g., shared agency, self–other merging). Prior work shows that both auditory and visual channels shape coordination, yet many studies rely on simplified tapping tasks. XR offers high experimental control while improving ecological validity, enabling investigation of complex, expressive interactions. The research question asks to what extent visual coupling (not seeing the partner, seeing the partner as an avatar, seeing the partner as real) and auditory context (metronome vs polyrhythmic music) affect performance precision, embodied coregulation, and experiential/prosocial outcomes during a dyadic 2:3 polyrhythmic drumming task.
The paper situates XR as a methodological tool affording control and realism in social and cognitive research, enabling novel manipulations (e.g., embodiment, perspective taking) and efficient data collection. Prior music and joint-action literature shows that: (1) visual coupling enhances synchrony and reduces variability in diverse contexts (singers, pianists, dancers, listeners); (2) auditory and visual channels can each facilitate coordination, with context-dependent dominance and compensatory mechanisms; (3) music entrains movement and supports interpersonal motor coupling across hierarchical metrical levels; (4) successful coupling correlates with perceived performance quality and can induce prosocial effects (shared agency, self–other merging); and (5) most joint-action work uses simplified tapping paradigms, motivating more ecologically valid yet controlled studies. XR’s challenges include ensuring human-like motion for avatars and dealing with latency, raising questions about whether XR can sustain fine-grained musical intentions and coregulation.
Participants: 16 dyads (32 participants; 13 women, 19 men; age mean 30.2 years, SD 8.4, range 18–48) with varied musical expertise (Musical Sophistication Index training scores mean 4.1, SD 1.9, range approximately 1–6.57). Fifteen participants with MSI musical training >3 reported percussion/piano (8 percussionists, 7 pianists). Dyad partners were acquainted to increase comfort with avatar interactions. Ethics approval by Ghent University; COVID-19 precautions applied.
Materials and apparatus: Two separate rooms (~7 m apart) in Ghent University’s ASIL. Full-body motion capture with Qualisys (42 markers per participant; independent systems per room). Motion data were streamed in real time from Qualisys into a Unity server for avatar rendering and on to two HoloLens 2 headsets over 802.11ac Wi-Fi. A shared virtual drum circle with rotating spheres was displayed to both participants. Note onsets from custom drum pads (17 cm diameter, waist height; pressure sensor read by a Teensy 3.2) triggered percussive sounds in Ableton and were logged as MIDI. Networking used a local HP5406 switch; Dante audio-over-IP (48 kHz, 128-sample buffer, 1 ms Dante latency); OSC over UDP for MIDI, visual control, and skeleton data; Unity Photon for synchronized 3D rendering; and a Focusrite Rednet Nanosync providing a 120 Hz clock for cross-system synchronization.
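To make the OSC leg of this pipeline concrete, the following minimal sketch (not the authors' code) shows how a drum-pad onset could be forwarded to the Unity server as an OSC message over UDP using python-osc; the address, host, port, and payload layout are illustrative assumptions.

```python
# Minimal sketch: forwarding one drum-pad onset to the Unity server as an OSC
# message over UDP. The address "/drum/onset", host, port, and payload layout
# are illustrative assumptions, not the study's actual configuration.
import time
from pythonosc.udp_client import SimpleUDPClient

UNITY_HOST = "192.168.1.10"   # hypothetical address of the Unity server
UNITY_PORT = 9000             # hypothetical OSC listening port

client = SimpleUDPClient(UNITY_HOST, UNITY_PORT)

def send_onset(player_id: int, velocity: int) -> None:
    """Send one note onset: player id, MIDI-style velocity, local timestamp."""
    client.send_message("/drum/onset", [player_id, velocity, time.time()])

send_onset(player_id=1, velocity=96)
```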
Measured latencies: Drum pad to speaker: 17 ± 2 ms; motion-capture marker to HoloLens visualization: 58 ± 4 ms; drum pad to Max for Live: <5 ms; Max for Live to Unity via OSC: 1.5 ± 0.5 ms.
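A quick back-of-envelope check (ours, using the figures above) shows why the visual channel lags the auditory channel by roughly 40 ms, a point the limitations discussion returns to.

```python
# Audiovisual offset implied by the reported path latencies (values from the
# text; the subtraction is ours).
audio_path_ms = 17    # drum pad -> speaker
visual_path_ms = 58   # mocap marker -> HoloLens visualization
print(f"Visual feedback lags audio by ~{visual_path_ms - audio_path_ms} ms")  # ~41 ms
```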
Questionnaires: Pre-experiment: percussion training, dyad relationship length/intensity, Inclusion of Other in the Self (SOI), MSI factors. Post-trial: perceived performance quality, agency, shared agency, SOI, Flow/absorption (Likert scales).
Task: Dyads jointly performed a 2:3 polyrhythmic drumming pattern at fixed tempo. One participant executed the binary rhythm (IOI 1059 ms), the other the ternary rhythm (IOI 706 ms); tactus 353 ms; common pulse 2118 ms. A metronome (bell at pulse) or a polyrhythmic backing track (standard bell pattern with additional percussive tactus-level cues) provided auditory context. Each condition lasted 242 s (up to 114 cycles). An augmented-reality drum circle displayed rotating spheres as timing instructions; participants could improvise by skipping taps. When both participants hit within ±62.5 ms of their instructed onset, the visual stimulus became more transparent in five steps; errors restored visibility, reinforcing successful coordination.
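As a sketch of the timing structure just described (our variable and helper names, assuming onsets are measured from the start of a trial), the binary and ternary grids and the ±62.5 ms hit window can be expressed as follows.

```python
# Sketch of the 2:3 polyrhythm timing grid and the +/-62.5 ms hit window.
# Helper names and the zero-based time origin are illustrative assumptions.
PULSE_MS = 2118.0                # common pulse (one full 2:3 cycle)
BINARY_IOI_MS = PULSE_MS / 2     # 1059 ms
TERNARY_IOI_MS = PULSE_MS / 3    # 706 ms
WINDOW_MS = 62.5
TRIAL_MS = 242_000               # 242 s condition length

def instructed_onsets(ioi_ms: float, duration_ms: float = TRIAL_MS) -> list[float]:
    """Instructed onset times at a fixed inter-onset interval."""
    return [i * ioi_ms for i in range(int(duration_ms // ioi_ms) + 1)]

def within_window(tap_ms: float, onsets: list[float], window_ms: float = WINDOW_MS) -> bool:
    """True if a tap lands within +/- window_ms of its nearest instructed onset."""
    return min(abs(tap_ms - o) for o in onsets) <= window_ms

binary_grid = instructed_onsets(BINARY_IOI_MS)
ternary_grid = instructed_onsets(TERNARY_IOI_MS)
print(within_window(1100.0, binary_grid))   # True: 41 ms from the 1059 ms onset
```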
Design: 3 (partner realism: not seeing vs seeing-as-avatar vs seeing-as-real) × 2 (musical background: metronome vs polyrhythmic backing track) within-dyad design, six conditions total. Conditions were randomized with constraints (paired execution to reduce setup overhead). Outcome measures spanned three layers: performative (timing prediction error), embodied (movement energy, inter-player coherence), and experiential (agency, shared agency, SOI, flow).
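One plausible reading of the "paired execution" constraint is that the two musical-background conditions sharing a visual-coupling level were run back to back; the sketch below illustrates a randomization under that reading, which is our assumption rather than a detail stated in the paper.

```python
# Illustrative randomization of the 3 x 2 conditions under the assumption that
# the two musical backgrounds sharing a visual-coupling level are run as a pair.
import random

VISUAL = ["not_seeing", "avatar", "real"]
BACKGROUND = ["metronome", "polyrhythmic_track"]

def condition_order(seed=None):
    rng = random.Random(seed)
    order = []
    for visual in rng.sample(VISUAL, k=len(VISUAL)):                   # randomize coupling levels
        for background in rng.sample(BACKGROUND, k=len(BACKGROUND)):   # randomize within the pair
            order.append((visual, background))
    return order

print(condition_order(seed=1))   # six (visual, background) conditions
```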
Procedure: ~2-hour session. After consent, donning of mocap suits, and ~45 minutes of calibration and skeleton building, each participant practiced their part individually (2 min) with the metronome and AR instructions. Dyads then met virtually as human-controlled avatars for familiarization and performed together for 2 min with the polyrhythmic backing track. After a brief in-person exchange, they completed the six experimental conditions.
Analysis framework and statistics: Layer 1 (Performance) used the BListener algorithm (Bayesian multivariate IOI tracking) to compute prediction errors for individual (per player and instruction) and joint performance. Layer 2 (Embodied coregulation) extracted postural position (center-back marker projected on the left–right foot axis), Quantity of Motion (sum of first differences), and wavelet coherence between players’ postural sway; coherence bandpower summed in 0.2 Hz bands around the common pulse (0.472 Hz), binary (0.944 Hz), and ternary (1.416 Hz) frequencies; only transparent-stimulus (good interaction) segments were analyzed. Layer 3 (Subjective) analyzed post-trial agency, shared agency, SOI, and flow. Linear mixed-effects (lme4) and cumulative link mixed models (ordinal) evaluated effects with dyad or individual as random factors where appropriate; model building by likelihood improvement; outliers removed by IQR and residual diagnostics. Data processing across Python, MATLAB, and R; figures and tables provided.
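A simplified sketch of the embodied-layer features follows; the tooling choice is ours (the original analysis used wavelet coherence, with processing across Python, MATLAB, and R), so scipy's magnitude-squared spectral coherence stands in for wavelet coherence, and absolute first differences are assumed for Quantity of Motion.

```python
# Simplified embodied-layer features: Quantity of Motion from the projected
# postural signal, and coherence between the two players' sway summed in
# 0.2 Hz bands around the common-pulse, binary, and ternary frequencies.
# scipy's spectral coherence stands in for the paper's wavelet coherence.
import numpy as np
from scipy.signal import coherence

FS = 120.0                                                   # shared clock rate (Hz)
BANDS = {"pulse": 0.472, "binary": 0.944, "ternary": 1.416}  # target frequencies (Hz)
HALF_BAND = 0.1                                              # 0.2 Hz-wide bands

def quantity_of_motion(posture: np.ndarray) -> float:
    """Sum of absolute first differences of the 1-D postural position."""
    return float(np.sum(np.abs(np.diff(posture))))

def coherence_bandpower(p1: np.ndarray, p2: np.ndarray) -> dict:
    """Coherence between two players' postural signals, summed per target band."""
    freqs, coh = coherence(p1, p2, fs=FS, nperseg=4096)
    return {
        name: float(coh[(freqs >= f - HALF_BAND) & (freqs <= f + HALF_BAND)].sum())
        for name, f in BANDS.items()
    }
```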
Performance output: Individual performance showed main effects of task and musical background. Binary-task players had larger prediction errors than ternary-task players. The polyrhythmic backing track reduced prediction error significantly for the binary task (t(146)=4.172, p<.001) and showed a trend for the ternary task (p=.075). Prediction errors decreased over trials (e.g., linear term β≈0.799, CI [0.700, 0.912], p=.001), indicating learning. Joint performance: the polyrhythmic backing track reduced joint prediction error (β=0.801, CI [0.674, 0.951], t(74)=-2.57, p=.012); higher musical engagement also reduced error (β=0.769, CI [0.612, 0.965], p=.024). Trial count reduced joint error (χ2(5)=16.3, p=.006).
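The reported estimates come from lme4 in R; for orientation, a Python analogue of the kind of model behind them might look like the sketch below (the column names, file name, and log transform are illustrative assumptions).

```python
# Illustrative analogue of a linear mixed model of prediction error with
# musical background and trial as fixed effects and a random intercept per
# dyad. Not the authors' R/lme4 code; column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("joint_prediction_errors.csv")   # hypothetical tidy data file

model = smf.mixedlm(
    "log_pred_error ~ background + trial",        # fixed effects (illustrative)
    data=df,
    groups=df["dyad"],                            # random intercept per dyad
)
print(model.fit().summary())
```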
Embodied coregulation: Quantity of Motion increased with polyrhythmic backing track (β=0.194, CI [0.133, 0.254], t(165)=6.34, p<.001) and with visual coupling (significant differences between not seeing and seeing-as-avatar/real). Longer relationship length reduced QoM slightly (β≈-0.003, p=.002). Wavelet coherence analysis retained only the ternary band (bandpower higher than common pulse and binary, p<.001). Musical training increased coherence (β=1.05, CI [1.01, 1.09], p=.016). Seeing-as-real slightly increased coherence (β=1.12, CI [1.00, 1.24], p=.042), whereas the polyrhythmic backing track tended to decrease coherence (β=0.924, CI [0.853, 1.01], p≈.052). Overall, coherence at the common pulse and binary bands was low, suggesting difficulty synchronizing at the slowest pulse and challenges in the binary rhythm.
Subjective experience: Agency increased with the polyrhythmic backing track (OR=2.75, CI [1.54, 4.92], p=.001) and over trials (a strong linear increase, with a higher-order fifth-degree term capturing a later decrease). Shared agency increased with visual coupling: seeing-as-avatar (OR=3.12, CI [1.55, 6.27], p=.001) and seeing-as-real (OR=4.06, CI [1.90, 8.71], p<.001) versus not seeing; percussion training also increased shared agency (OR=2.77, CI [1.26, 6.07], p=.011); longer relationship length slightly decreased it. SOI increased with visual coupling (both avatar and real > not seeing) and, as a trend, with higher musical training. Flow increased with the polyrhythmic backing track (β=0.035, CI [0.011, 0.059], p=.004) and with longer relationship length (β≈0.001, p=.015); prior SOI (quadratic and cubic terms) predicted flow.
Overall: A richer musical background improved timing accuracy (individual and joint), increased movement energy, and enhanced agency and flow. Visual coupling particularly improved prosocial experiential outcomes (shared agency, SOI) and slightly enhanced movement coherence when seeing-as-real. Learning effects were evident across trials.
The study demonstrates that dyadic musical interaction in XR can sustain meaningful coregulatory dynamics. The polyrhythmic backing track, by providing information at multiple metrical levels, improved prediction-based timing, reduced asynchronies, and fostered greater movement energy and agency, consistent with entrainment and groove effects. Visual coupling primarily enhanced prosocial and experiential qualities, elevating shared agency and self–other merging; increased partner realism (avatar to real) amplified these effects, likely by improving prediction of partner actions and social presence. Movement coherence results suggest participants embodied their assigned rhythms individually over time, with limited synchrony at the shared pulse and binary rhythm, reflecting task difficulty and possibly compensatory strategies in response to XR novelty and sensory constraints. Musical expertise bolstered coherence and prosocial experiences, indicating differential reliance on visual coupling by expertise level. Collectively, findings support XR’s viability for ecologically valid yet controlled study of embodied coordination and agency, showing that performance and experience need not deteriorate in XR and that auditory context and visual realism play distinct, complementary roles in coregulation.
The work introduces and validates a networked XR platform enabling real-time, full-body avatar-mediated dyadic music interaction with high experimental control. It shows that an information-rich musical background improves sensorimotor timing and agency, while visual coupling—especially higher partner realism—enhances prosocial experiential qualities (shared agency, SOI). Embodied coordination metrics reveal that auditory context and visual coupling modulate movement energy and synchrony, with expertise contributing positively. The multi-layered analytical framework (performance, embodied, experiential) provides a comprehensive approach to studying joint action in XR. Future research directions include: systematically probing network latencies and multisensory integration windows in musical contexts; longitudinal training studies to track learning and adaptation; analyses of improvisation, expressivity, and shared intentions; expert-only cohorts to reduce variability; alternative tasks (e.g., synchronization/continuation) linking movement coherence more directly to performance; and expansion to multi-agent (human and computer-controlled) ensembles to investigate group dynamics.
Replication is challenged by technical complexity and multimodal data integration demands. Although audio latencies met natural interaction ranges (approximately 8–25 ms), visual stimuli lagged audio by about 40 ms, a pattern not typical in real settings and potentially exceeding drumming synchrony thresholds reported elsewhere. While within multisensory temporal binding windows for many contexts, this asynchrony could influence coordination and experience. The sample size was modest and mixed in expertise (novices and musicians), increasing variability. Improvisation was permitted to enhance playfulness but not analyzed, limiting insight into expressive strategies. The study was cross-sectional; longitudinal designs could better capture learning and adaptation. Generalizability to expert ensembles and larger groups remains to be tested.