Federated Reinforcement Learning across Heterogeneous Environments
Abstract
Federated Reinforcement Learning (FRL) provides a powerful paradigm for enabling multiple agents to collaboratively learn decision-making policies while preserving data privacy. Despite its growing potential, FRL faces fundamental challenges arising from heterogeneous environments, unbalanced datasets, and diverse computational resources. This dissertation advances the theoretical and algorithmic foundations of FRL by developing a series of federated algorithms that jointly address efficiency, stability, and heterogeneity across both online and offline settings.

We begin by investigating the impact of heterogeneous local computation on the convergence behavior of policy gradient methods. To this end, we propose a federated policy gradient algorithm that allows agents to operate under different computational configurations, such as varying batch sizes and numbers of local gradient updates. Through rigorous theoretical analysis, we derive explicit performance bounds that characterize the learning accuracy as a function of these heterogeneous configurations. Our results reveal that the proposed method achieves sample complexity comparable to that of centralized reinforcement learning, and we identify an optimal stability point that balances local computation and global aggregation to ensure stable convergence across diverse clients.

Building on this foundation, we develop a federated temporal-difference (Fed-TD) algorithm for policy evaluation across heterogeneous environments. The algorithm enables each agent to estimate its local value function using off-policy data while collaboratively refining a global model through federated averaging. We prove that the aggregated value function asymptotically converges to the optimal model under bounded heterogeneity, providing the first theoretical guarantee for stable policy evaluation in FRL. This result establishes a bridge between distributed value estimation and classical TD learning, demonstrating that cooperative value alignment is achievable even when agents differ in their transition dynamics.

To further improve sample efficiency in policy optimization, we design both double-loop and single-loop federated actor-critic algorithms that exploit structural similarities across agents. The double-loop variant alternates between local policy evaluation and global policy improvement, achieving tight convergence rates under mild smoothness assumptions, while the single-loop formulation reduces communication overhead by integrating these processes into one unified update. Our theoretical analyses establish near-optimal convergence guarantees and demonstrate that carefully coordinated local updates and global aggregation effectively mitigate the divergence induced by heterogeneous environment dynamics.

Finally, we extend FRL to the offline learning regime and propose the federated conservative offline reinforcement learning (FCORL) framework with effective sample size (ESS) weighting. FCORL enables agents to collaboratively perform policy evaluation and optimization from static datasets without additional environment interactions. By introducing an ESS-based adaptive weighting mechanism, the framework quantifies the representativeness of each client's dataset relative to the target policy and assigns aggregation weights accordingly. This principled design mitigates the effects of behavioral distribution shift and data imbalance, leading to improved stability and generalization in federated offline settings.
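To make the ESS-based weighting concrete, the following is a minimal sketch of how such aggregation weights could be computed. The Kish-style ESS formula, the function names, and the normalization are illustrative assumptions, not the dissertation's exact construction; each client is assumed to supply importance ratios between the target policy and its own behavior policy.

```python
"""Sketch of effective-sample-size (ESS) based aggregation weights.

Assumption: client k holds importance ratios rho = pi_target(a|s) / pi_behavior_k(a|s)
evaluated on its offline dataset, and the server weights client models by normalized ESS.
"""
import numpy as np


def effective_sample_size(rho: np.ndarray) -> float:
    """Kish's ESS of a set of importance ratios: (sum rho)^2 / sum rho^2."""
    rho = np.asarray(rho, dtype=float)
    return float(rho.sum() ** 2 / (np.square(rho).sum() + 1e-12))


def ess_aggregation_weights(client_ratios: list[np.ndarray]) -> np.ndarray:
    """Aggregation weight of each client, proportional to its ESS."""
    ess = np.array([effective_sample_size(r) for r in client_ratios])
    return ess / ess.sum()


def aggregate(client_params: list[np.ndarray], weights: np.ndarray) -> np.ndarray:
    """ESS-weighted federated average of client Q-function parameters."""
    return np.average(np.stack(client_params), axis=0, weights=weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Client 0: behavior close to the target policy (ratios near 1, high ESS).
    # Client 1: strong distribution shift (heavy-tailed ratios, low ESS).
    ratios = [rng.uniform(0.8, 1.2, size=500), rng.pareto(1.5, size=500) + 0.1]
    params = [rng.normal(size=8) for _ in ratios]
    w = ess_aggregation_weights(ratios)
    print("aggregation weights:", w)  # client 0 receives the larger weight
    print("aggregated params:", aggregate(params, w))
```

In this sketch, a client whose behavior data poorly covers the target policy contributes less to the global model, which is the intuition behind mitigating behavioral distribution shift and data imbalance during aggregation.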
We theoretically establish convergence of FCORL toward the optimal Q-function and empirically validate the algorithm on continuous-control benchmarks, where it consistently outperforms existing federated and offline baselines.

In summary, this dissertation provides a unified theoretical and algorithmic framework for federated reinforcement learning in heterogeneous and offline environments. The proposed methods offer new insights into the stability–efficiency trade-off in distributed policy learning and highlight the potential of FRL as a scalable, privacy-preserving, and communication-efficient paradigm for real-world multi-agent decision systems.
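The federated policy gradient and Fed-TD procedures described above share a common skeleton: each client performs its own number of local updates with its own batch size, after which the server averages the resulting parameters. The sketch below illustrates that skeleton on a simple least-squares surrogate objective; the objective, step size, and per-client configurations are assumptions for illustration only and do not reproduce the dissertation's algorithms.

```python
"""Sketch of heterogeneous local updates followed by global averaging.

Assumption: each client minimizes a least-squares surrogate with its own
number of local SGD steps and batch size; the server averages parameters.
"""
import numpy as np


def local_update(theta, data, num_steps, batch_size, lr=0.05, rng=None):
    """Run num_steps local SGD steps on this client's surrogate objective."""
    if rng is None:
        rng = np.random.default_rng()
    theta = theta.copy()
    X, y = data
    for _ in range(num_steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch_size
        theta -= lr * grad
    return theta


def federated_round(theta, client_data, local_steps, batch_sizes, rng):
    """One communication round: heterogeneous local updates, then averaging."""
    locals_ = [
        local_update(theta, d, k, b, rng=rng)
        for d, k, b in zip(client_data, local_steps, batch_sizes)
    ]
    return np.mean(locals_, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dim, n = 5, 200
    theta_true = rng.normal(size=dim)
    # Heterogeneous clients: different noise levels, local steps, batch sizes.
    client_data = []
    for noise in (0.1, 0.5, 1.0):
        X = rng.normal(size=(n, dim))
        y = X @ theta_true + noise * rng.normal(size=n)
        client_data.append((X, y))
    theta = np.zeros(dim)
    for _ in range(50):
        theta = federated_round(theta, client_data, local_steps=(1, 5, 20),
                                batch_sizes=(64, 32, 16), rng=rng)
    print("distance to target parameters:", np.linalg.norm(theta - theta_true))
```

The per-client choices of local_steps and batch_sizes stand in for the heterogeneous computational configurations analyzed in the dissertation; the trade-off between more local computation and more frequent aggregation is exactly the stability question studied there.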
