Safe Cooperative Multi-Agent Reinforcement Learning with Function Approximation
Cooperative multi-agent reinforcement learning (MARL) has demonstrated significant promise in dynamic control environments, where effective communication and tailored exploration strategies facilitate collaboration. However, ensuring safe exploration remains challenging, as even a single unsafe action by any agent may have catastrophic consequences. To mitigate this risk, we introduce Scoop-LSVI, a UCB-based cooperative parallel RL framework that achieves low cumulative regret with minimal communication overhead while adhering to safety constraints. Scoop-LSVI enables multiple agents to solve isolated Markov Decision Processes (MDPs) in parallel and to share information to improve collective learning efficiency. We establish a regret bound of Õ(κ d^{3/2} H^2 √(MK)), where d is the feature dimension, H is the horizon length, M is the number of agents, K is the number of episodes per agent, and κ captures the effect of the safety constraints. Our result matches state-of-the-art bounds for cooperative MARL without safety constraints and recovers the regret bound of UCB-based safe single-agent RL algorithms when M = 1, highlighting the potential of Scoop-LSVI to support safe and efficient learning in cooperative MARL applications.
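To make the setting concrete, below is a minimal, illustrative sketch (not the Scoop-LSVI algorithm itself) of how parallel agents with linear function approximation might pool ridge-regression statistics through infrequent communication rounds and restrict themselves to an estimated safe action set. All names here (Agent, synchronize, safe_actions, the threshold tau) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical sketch of cooperative, UCB-style learning with linear features:
# each agent keeps local ridge-regression statistics, agents pool them in
# occasional communication rounds, and candidate actions are filtered through
# an estimated safety constraint. This illustrates the setting only; it is
# not the Scoop-LSVI algorithm from the paper.

class Agent:
    def __init__(self, d, lam=1.0):
        self.d = d
        self.Lambda = lam * np.eye(d)     # pooled Gram matrix (after last sync)
        self.b = np.zeros(d)              # pooled regression targets
        self.dLambda = np.zeros((d, d))   # local data collected since last sync
        self.db = np.zeros(d)

    def observe(self, phi, target):
        """Accumulate one (feature, regression target) pair locally."""
        self.dLambda += np.outer(phi, phi)
        self.db += phi * target

    def weights(self):
        """Ridge-regression estimate from pooled plus local data."""
        return np.linalg.solve(self.Lambda + self.dLambda, self.b + self.db)

    def ucb_value(self, phi, beta):
        """Optimistic value: linear prediction plus an elliptical bonus."""
        Lam = self.Lambda + self.dLambda
        bonus = beta * np.sqrt(phi @ np.linalg.solve(Lam, phi))
        return phi @ self.weights() + bonus


def synchronize(agents):
    """One communication round: every agent receives everyone's new data."""
    total_dL = sum(a.dLambda for a in agents)
    total_db = sum(a.db for a in agents)
    for a in agents:
        a.Lambda += total_dL
        a.b += total_db
        a.dLambda = np.zeros((a.d, a.d))
        a.db = np.zeros(a.d)


def safe_actions(action_features, w_cost, tau):
    """Keep only actions whose estimated constraint cost is below tau."""
    return [i for i, phi in enumerate(action_features) if phi @ w_cost <= tau]


# Toy usage: M agents, d-dimensional features, K episodes, sparse syncs.
rng = np.random.default_rng(0)
d, M, K = 4, 3, 10
agents = [Agent(d) for _ in range(M)]
w_cost = rng.normal(size=d)               # stand-in constraint-cost estimate
for k in range(K):
    for agent in agents:
        candidates = [rng.normal(size=d) for _ in range(5)]
        allowed = safe_actions(candidates, w_cost, tau=0.5) or [0]
        best = max(allowed, key=lambda i: agent.ucb_value(candidates[i], beta=1.0))
        agent.observe(candidates[best], target=rng.normal())
    if (k + 1) % 5 == 0:                  # infrequent communication rounds
        synchronize(agents)
print("example weight vector:", agents[0].weights())
```

The sketch only conveys the two structural ingredients named in the abstract, parallel learning with shared sufficient statistics and a safety filter on exploration; the paper's actual construction of the bonus, the safe action set, and the communication schedule is what yields the stated regret guarantee.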