A Survey on Multi-player Bandits
E. Boursier and V. Perchet
Research question and purpose: The paper surveys the multi-player multi-armed bandits (MMAB) literature, asking how decentralized learners can coordinate to minimize regret under collisions and limited feedback, and what algorithmic and structural assumptions enable near-centralized performance.

Context and importance: Classical single-player MAB is well studied; the multi-player variant is motivated by cognitive radio/OSA, IoT, and Wi-Fi, where decentralized devices contend for shared channels and collisions lower reward. The survey formalizes MMAB models (homogeneous vs. heterogeneous rewards; sensing, no-sensing, and collision feedback), presents baseline centralized results and lower bounds, and synthesizes coordination/communication routines and practical considerations. Its importance lies in bridging theory and practice: despite strong theoretical progress (e.g., centralized-optimal regret via collision-based communication), many algorithms rely on assumptions (synchronization, pre-agreement, parameter knowledge) that impede deployment in real networks.
The survey organizes MMAB research into six threads:
(1) Baseline models and centralized reductions: multiple plays and combinatorial bandits (CUCB, ESCB, CTS), and lower bounds (Anantharam et al. 1987; combinatorial lower bounds).
(2) Decentralized coordination with and without communication: orthogonalization routines (RAND ORTHOGONALISATION, MUSICAL CHAIRS) combined with ETC/UCB/SAR; estimating the number of players M; behaviors such as SELFISH UCB and its pitfalls.
(3) Communication via collisions: Markov-chain Game of Thrones (GoT) dynamics; collision-as-bits protocols (SIC-MMAB, leader/follower) that achieve the centralized lower bound in the homogeneous case (DPE1) and extend to the heterogeneous one (SIC-MMAB-het, BEACON).
(4) No-communication and no-collision-information regimes: algorithms with pre-agreement achieving optimal asymptotic regret when collision information is absent, and minimax constructions that avoid collisions via geometry/colorings.
(5) Toward realistic models: non-stochastic rewards (Markovian/restless, abrupt changes, adversarial); different collision models (non-zero collision rewards; competing bandits with arm preferences); non-collaborative players (jammers, fairness, selfish robustness); time models (asynchronous activation; dynamic entry/exit).
(6) Related multi-agent problems: cooperative bandits with graph communication; matching-market bandits; queueing systems with strategic agents.
Tables summarize upper bounds across settings.
As a survey, the methodology is conceptual rather than empirical:
- Formalize decentralized MMAB: M players, K arms, i.i.d. rewards per player-arm pair X^m_k(t) with mean μ^m_k, collision indicator η_k(t), utility U(π), regret R(T), and gaps Δ. Distinguish the feedback regimes full sensing, statistic sensing, collision sensing, and no sensing, and the homogeneous (μ^m_k = μ_k) vs. heterogeneous settings.
- Establish baselines: reduce the centralized homogeneous problem to bandits with multiple plays and the heterogeneous one to combinatorial bandits; report known lower bounds and optimal centralized algorithms.
- Systematically categorize decentralized algorithms by how they use collisions: coordination (orthogonalization), communication (Markov chains; collision-as-bits), and no communication. Describe initialization, exploration, communication, and exploitation phases, and key subroutines (e.g., SAR, MUSICAL CHAIRS, GoT, leader/follower).
- Enumerate practical extensions: non-stationarity (Markovian/restless, abrupt changes), adversarial rewards, non-zero collision rewards, arm preferences, strategic/jamming players, and asynchronous/dynamic arrivals, detailing algorithmic adaptations and regret guarantees.
- Summarize theoretical performance with instance-dependent and minimax bounds, highlighting dependencies on T, M, K, Δ, and other parameters; compile summary tables.
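The model formalized above can be made concrete with a toy simulator. The following is a minimal sketch of the homogeneous zero-reward-collision setting (the class and parameter names are illustrative, not taken from the survey):

```python
import random

class MMABEnv:
    """Minimal homogeneous multi-player bandit environment (illustrative sketch).

    M players each pull one of K arms per round; rewards are Bernoulli with
    means mu_k, and a player receives zero reward whenever another player
    pulls the same arm (the standard zero-reward collision model)."""

    def __init__(self, mu, n_players, seed=0):
        self.mu = list(mu)          # arm means mu_k (homogeneous setting)
        self.M = n_players
        self.rng = random.Random(seed)

    def step(self, pulls):
        """pulls: list of arm indices, one per player.
        Returns a (reward, collision_indicator) pair for each player."""
        counts = {}
        for k in pulls:
            counts[k] = counts.get(k, 0) + 1
        out = []
        for k in pulls:
            collided = counts[k] > 1          # eta_k(t) = 1 on collision
            x = 1.0 if self.rng.random() < self.mu[k] else 0.0
            out.append((0.0 if collided else x, collided))
        return out
```

Under collision sensing a player observes both entries of its pair; under no-sensing it would only see the (possibly zeroed) reward.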
- Centralized lower bounds extend: in homogeneous decentralized MMAB with collision sensing, algorithms can achieve centralized-optimal regret. Example: DPE1 achieves limsup_{T→∞} R(T)/log T ≤ ∑_{k>M} (μ_(M) − μ_(k))/kl(μ_(k), μ_(M)).
- Coordination routines effectively reduce collisions: RAND ORTHOGONALISATION and MUSICAL CHAIRS rapidly orthogonalize players; combined with ETC/UCB/SAR, they yield sublinear regret without heavy communication, and M can be estimated from collision probabilities.
- Communication via collisions enables near-centralized performance: SIC-MMAB and leader/follower protocols treat collisions as bits to share empirical means; Wang et al. (DPE1) further reduce exploration and communication to match Anantharam's lower bound in homogeneous settings; BEACON adapts CUCB to decentralized heterogeneous settings using batched play and adaptive differential communication, achieving O(M^2 K log T / Δ) regret (up to log K factors).
- No-sensing advances: communication without collision sensing is possible but costly (e.g., Z-channel coding); recent work reduces parameter-knowledge requirements (Huang et al.), though the communication cost remains large; other approaches achieve minimax √(T log T) regret without collision information but suffer large dependencies on M and K.
- Heterogeneous settings: Pareto/stable matchings are easier to reach than optimal welfare; collision-as-bits and CUCB-style batching lead to centralized-order regret; with multiple optimal matchings, some algorithms achieve polylog(T) regret via extended exploration.
- Realistic models complicate learning: Markovian/restless processes raise the exploration complexity; abrupt changes require sliding-window UCB, with T log T regret when there are O(T^α) breakpoints; adversarial settings need block-based EXP3 and leader/follower designs, yielding rates between T^{2/3} and √T depending on sensing and the number of players.
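The orthogonalization idea behind MUSICAL CHAIRS can be sketched in a few lines: each unseated player samples arms uniformly at random and "sits" on the first arm it pulls without a collision, keeping it thereafter. This is a toy version for intuition, not the paper's exact routine (which, e.g., fixes the phase length in advance); all names are mine:

```python
import random

def musical_chairs(n_players, n_arms, horizon, seed=0):
    """Illustrative sketch of MUSICAL CHAIRS orthogonalization.

    Unseated players sample an arm uniformly at random each round; a player
    that experiences no collision sits on that arm permanently. Returns the
    round at which all players are orthogonalized, or None if the horizon
    runs out first (requires n_arms >= n_players)."""
    rng = random.Random(seed)
    seats = [None] * n_players
    for t in range(horizon):
        # Seated players repeat their seat; unseated ones sample uniformly.
        pulls = [s if s is not None else rng.randrange(n_arms) for s in seats]
        counts = {}
        for k in pulls:
            counts[k] = counts.get(k, 0) + 1
        for m, k in enumerate(pulls):
            if seats[m] is None and counts[k] == 1:
                seats[m] = k        # collision-free pull: sit here
        if all(s is not None for s in seats):
            return t + 1
    return None
```

In the survey's setting, this routine is typically run on the empirically best M arms after an exploration phase, so that players exploit orthogonally.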
- Alternative collision models: non-zero collision rewards demand estimating μ^j_k(m) and coordinated exploration; special structures (μ ∝ 1/m) enable log^{3+δ}(T) bounds via GoT. Competing bandits (arm preferences) shift the baseline to stable matchings; decentralized ETC/UCB with collision avoidance and structured preferences achieve log(T) optimal/pessimal regret.
- Time and strategic considerations: asynchronous/dynamic arrivals require epoch-based or cautious greedy strategies, with constant or logarithmic regret in centralized regimes and O(T^{2/3}) for decentralized ETC; strategic/jamming behaviors necessitate fairness- and Nash-equilibrium-oriented designs along with robust initializations and trigger strategies.
- Data points and bounds: examples include the centralized lower-bound constant ∑_{k>M} (μ_(M) − μ_(k))/kl(μ_(k), μ_(M)); CUCB at O(M^2 K log T / Δ); single-player UCB at O(∑_k (log T)/Δ_k) and minimax O(√(KT log T)); an impossibility result (linear regret) for adversarial multi-player bandits without collision information against adaptive adversaries; and geometric no-communication designs achieving √(T log T) regret with an M K^{11/2} dependence.
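The centralized lower-bound constant quoted above is straightforward to evaluate on a toy instance. The sketch below computes ∑_{k>M} (μ_(M) − μ_(k))/kl(μ_(k), μ_(M)) for Bernoulli arms; the function names are mine, and the means are assumed to lie strictly inside (0, 1):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q).
    Assumes p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lower_bound_constant(mu, n_players):
    """Evaluate the constant of the centralized asymptotic lower bound,
    sum over arms k > M of (mu_(M) - mu_(k)) / kl(mu_(k), mu_(M)),
    where mu_(1) >= ... >= mu_(K) are the sorted means (toy illustration)."""
    mu_sorted = sorted(mu, reverse=True)
    mu_M = mu_sorted[n_players - 1]       # M-th best mean
    return sum((mu_M - mu_k) / kl_bernoulli(mu_k, mu_M)
               for mu_k in mu_sorted[n_players:])
```

For instance, with means (0.9, 0.8, 0.5, 0.3) and M = 2 players, only the two arms below the second-best mean contribute to the sum; with K = M the sum is empty and the constant is zero.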
The survey demonstrates that the key to decentralized MMAB with collisions is how collisions are leveraged: coordination routines avoid collisions quickly; richer communication via collisions (or explicit messages) allows players to share estimates and compute matchings, yielding regret near centralized lower bounds, even in heterogeneous cases. However, these advances rely on synchronization, pre-agreements, and parameter knowledge, which limit practicality. By examining non-stationary rewards, alternative collision models, strategic agents, and asynchronous/dynamic timelines, the survey reframes MMAB in more realistic settings, identifying where current techniques break or need adaptation. This positions future work to balance theoretical optimality with implementability, highlighting significance for cognitive radio, IoT, and matching markets.
The paper synthesizes a decade of progress in multi-player bandits: decentralized algorithms now reach centralized optimal regret in classical models by treating collisions as coordination or communication signals. It highlights gaps between theory and deployment—chiefly synchronization, communication cost, parameter knowledge, and scalability in M,K—and surveys extensions toward realistic environments (non-stationary rewards, alternative collision models, strategic agents, asynchronous/dynamic operation). Key future directions include: designing algorithms that work in asynchronous and dynamic settings without heavy synchronization; reducing communication costs and parameter dependencies in no-sensing regimes; achieving robust performance under non-stationarity and heterogeneity; and implementing MMAB algorithms in real networks to validate theory and guide model refinement.
- Survey nature: no empirical benchmarking across algorithms; constant factors in the bounds are often large and not comparable across works.
- Practical gap: many optimal algorithms assume synchronization, pre-agreements (ranks), shared clocks, or prior knowledge (M, Δ, μ_min), limiting generalizability to real networks.
- Communication costs: collision-based communication can scale poorly with M, K, and 1/Δ; regret improvements may be offset by significant overhead in realistic deployments.
- Model simplifications: classical MMAB assumes i.i.d. rewards, zero-reward collisions, and synchronous players; results may not transfer directly to Markovian/restless/adversarial rewards, non-zero collision models, or dynamic/asynchronous settings.
- Heterogeneous complexity: optimal welfare with multiple optimal matchings remains challenging; some guarantees rely on uniqueness or on stable/Pareto baselines rather than global optimality.
- Open dependencies: some algorithms (e.g., GoT) have regret bounds with unspecified parameter dependencies, complicating interpretation and comparison.

