Publications
2024
Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer
Multi-Agent Reinforcement Learning: Foundations and Modern Approaches
MIT Press (print version scheduled for December 2024), 2024
Abstract | BibTex | Book website | Book codebase
MITP, multi-agent-rl, deep-rl, deep-learning, survey
Abstract:
Textbook published by MIT Press.
@book{marl-book,
author = {Stefano V. Albrecht and Filippos Christianos and Lukas Sch\"afer},
title = {Multi-Agent Reinforcement Learning: Foundations and Modern Approaches},
publisher = {MIT Press},
year = {2024},
url = {https://www.marl-book.com}
}
Trevor McInroe, Lukas Schäfer, Stefano V. Albrecht
Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning
Transactions on Machine Learning Research, 2024
Abstract | BibTex | arXiv | Code
TMLR, deep-rl
Abstract:
Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions. Instead, we propose Hierarchical k-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks and find that HKSL either reaches higher episodic returns or converges to maximum performance more quickly than several current baselines. Also, we find that levels in HKSL's hierarchy can learn to specialize in long- or short-term consequences of agent actions, thereby providing the downstream control policy with more informative representations. Finally, we determine that communication channels between hierarchy levels organize information based on both sides of the communication process, which improves sample efficiency.
@article{mcinroe2024hksl,
title = {Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning},
author = {Trevor McInroe and Lukas Sch\"afer and Stefano V. Albrecht},
journal = {Transactions on Machine Learning Research (TMLR)},
year = {2024}
}
Xuehui Yu, Mhairi Dunion, Xin Li, Stefano V. Albrecht
Skill-aware Mutual Information Optimisation for Generalisation in Reinforcement Learning
Conference on Neural Information Processing Systems, 2024
Abstract | BibTex | arXiv | Code
NeurIPS, deep-rl
Abstract:
Meta-Reinforcement Learning (Meta-RL) agents can struggle to operate across tasks with varying environmental features that require different optimal skills (i.e., different modes of behaviour). Using context encoders based on contrastive learning to enhance the generalisability of Meta-RL agents is now widely studied but faces challenges such as the requirement for a large sample size, also referred to as the log-K curse. To improve RL generalisation to different tasks, we first introduce Skill-aware Mutual Information (SaMI), an optimisation objective that aids in distinguishing context embeddings according to skills, thereby equipping RL agents with the ability to identify and execute different skills across tasks. We then propose Skill-aware Noise Contrastive Estimation (SaNCE), a K-sample estimator used to optimise the SaMI objective. We provide a framework for equipping an RL agent with SaNCE in practice and conduct experimental validation on modified MuJoCo and Panda-gym benchmarks. We empirically find that RL agents that learn by maximising SaMI achieve substantially improved zero-shot generalisation to unseen tasks. Additionally, the context encoder trained with SaNCE demonstrates greater robustness to a reduction in the number of available samples, thus possessing the potential to overcome the log-K curse.
@inproceedings{yu2024skillaware,
title={Skill-aware Mutual Information Optimisation for Generalisation in Reinforcement Learning},
author={Xuehui Yu and Mhairi Dunion and Xin Li and Stefano V. Albrecht},
booktitle={Conference on Neural Information Processing Systems},
year={2024}
}
Mhairi Dunion, Stefano V. Albrecht
Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras
Reinforcement Learning Conference, 2024
Abstract | BibTex | arXiv | Code
RLC, deep-rl, generalisation
Abstract:
The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that achieves zero-shot generalisation to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; however, our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.
@inproceedings{dunion2024mvd,
title={Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras},
author={Mhairi Dunion and Stefano V. Albrecht},
booktitle={1st Reinforcement Learning Conference},
year={2024}
}
Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Amos Storkey
Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning
Reinforcement Learning Conference, 2024
Abstract | BibTex | arXiv
RLC, deep-rl
Abstract:
Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data-collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, and UCB methods are myopic and it is unclear which learned-component's ensemble to use for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.
@inproceedings{mcinroe2024planning,
title={Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning},
author={Trevor McInroe and Adam Jelley and Stefano V. Albrecht and Amos Storkey},
booktitle={1st Reinforcement Learning Conference},
year={2024}
}
Aditya Kapoor, Sushant Swamy, Kale-ab Tessera, Mayank Baranwal, Mingfei Sun, Harshad Khadilkar, Stefano V. Albrecht
Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning
RLC Workshop on Coordination and Cooperation for Multi-Agent Reinforcement Learning Methods, 2024
Abstract | BibTex | Paper
RLC, deep-rl, multi-agent-rl
Abstract:
The ability of agents to learn optimal policies is hindered in multi-agent environments where all agents receive a global reward signal sparsely or only at the end of an episode. The delayed nature of these rewards, especially in long-horizon tasks, makes it challenging for agents to evaluate their actions at intermediate time steps. In this paper, we propose Agent-Temporal Reward Redistribution (ATRR), a novel approach to tackle the agent-temporal credit assignment problem by redistributing sparse environment rewards both temporally and at the agent level. ATRR first decomposes the sparse global rewards into rewards for each time step and then calculates agent-specific rewards by determining each agent's relative contribution to these decomposed temporal rewards. We theoretically prove that there exists a redistribution method equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirically, we demonstrate that ATRR stabilizes and expedites the learning process. We also show that ATRR, when used alongside single-agent reinforcement learning algorithms, performs as well as or better than their multi-agent counterparts.
@inproceedings{kapoor2024agenttemporal,
title={Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning},
author={Aditya Kapoor and Sushant Swamy and Kale-ab Tessera and Mayank Baranwal and Mingfei Sun and Harshad Khadilkar and Stefano V. Albrecht},
booktitle={Coordination and Cooperation for Multi-Agent Reinforcement Learning Methods Workshop},
year={2024},
url={https://openreview.net/forum?id=dGS1e3FXUH}
}
Samuel Garcin, James Doran, Shangmin Guo, Christopher G. Lucas, Stefano V. Albrecht
DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design
International Conference on Machine Learning, 2024
Abstract | BibTex | arXiv
ICML, deep-rl
Abstract:
Autonomous agents trained using deep reinforcement learning (RL) often lack the ability to successfully generalise to new environments, even when they share characteristics with the environments they have encountered during training. In this work, we investigate how the sampling of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents. We discover that, for deep actor-critic architectures sharing their base layers, prioritising levels according to their value loss minimises the mutual information between the agent's internal representation and the set of training levels in the generated training data. This provides a novel theoretical justification for the implicit regularisation achieved by certain adaptive sampling strategies. We then turn our attention to unsupervised environment design (UED) methods, which have more control over the data generation mechanism. We find that existing UED methods can significantly shift the training distribution, which translates to low ZSG performance. To prevent both overfitting and distributional shift, we introduce data-regularised environment design (DRED). DRED generates levels using a generative model trained over an initial set of level parameters, reducing distributional shift, and achieves significant improvements in ZSG over adaptive level sampling strategies and UED methods.
@inproceedings{garcin2024dred,
title={{DRED}: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design},
author={Samuel Garcin and James Doran and Shangmin Guo and Christopher G. Lucas and Stefano V. Albrecht},
year={2024},
booktitle={International Conference on Machine Learning (ICML)}
}
Guy Azran, Mohamad H. Danesh, Stefano V. Albrecht, Sarah Keren
Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning
AAAI Conference on Artificial Intelligence, 2024
Abstract | BibTex | arXiv | Code | Video
AAAI, deep-rl, causal
Abstract:
Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RMs), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Empirical results show that our representations improve sample efficiency and few-shot transfer in a variety of domains.
@inproceedings{azran2024contextual,
title={Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning},
author={Guy Azran and Mohamad H. Danesh and Stefano V. Albrecht and Sarah Keren},
booktitle={Proceedings of the 38th AAAI Conference on Artificial Intelligence},
year={2024}
}
Guy Azran, Mohamad H. Danesh, Stefano V. Albrecht, Sarah Keren
Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning
ICAPS Workshop on Planning and Reinforcement Learning, 2024
Abstract | BibTex | arXiv | Code | Video
ICAPS, deep-rl, causal
Abstract:
Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RMs), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Empirical results show that our representations improve sample efficiency and few-shot transfer in a variety of domains.
@inproceedings{Azran2022enhancing,
title={Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning},
author={Azran, Guy and Danesh, Mohamad H. and Albrecht, Stefano V. and Keren, Sarah},
booktitle={ICAPS Workshop on Planning and Reinforcement Learning},
url={https://prl-theworkshop.github.io/prl2024-icaps/},
year={2024}
}
2023
Arrasy Rahman, Ignacio Carlucho, Niklas Höpner, Stefano V. Albrecht
A General Learning Framework for Open Ad Hoc Teamwork Using Graph-based Policy Learning
Journal of Machine Learning Research, 2023
Abstract | BibTex | arXiv | Publisher | Code
JMLR, ad-hoc-teamwork, deep-rl, agent-modelling, multi-agent-rl
Abstract:
Open ad hoc teamwork is the problem of training a single agent to efficiently collaborate with an unknown group of teammates whose composition may change over time. A variable team composition creates challenges for the agent, such as the requirement to adapt to new team dynamics and dealing with changing state vector sizes. These challenges are aggravated in real-world applications where the controlled agent has no access to the full state of the environment. In this work, we develop a class of solutions for open ad hoc teamwork under full and partial observability. We start by developing a solution for the fully observable case that leverages graph neural network architectures to obtain an optimal policy based on reinforcement learning. We then extend this solution to partially observable scenarios by proposing different methodologies that maintain belief estimates over the latent environment states and team composition. These belief estimates are combined with our solution for the fully observable case to compute an agent's optimal policy under partial observability in open ad hoc teamwork. Empirical results demonstrate that our approach can learn efficient policies in open ad hoc teamwork in full and partially observable cases. Further analysis demonstrates that our methods' success is a result of effectively learning the effects of teammates' actions while also inferring the inherent state of the environment under partial observability.
@article{JRahman2022POGPL,
author = {Arrasy Rahman and Ignacio Carlucho and Niklas H\"opner and Stefano V. Albrecht},
title = {A General Learning Framework for Open Ad Hoc Teamwork Using Graph-based Policy Learning},
journal = {Journal of Machine Learning Research},
year = {2023},
volume = {24},
number = {298},
pages = {1--74},
url = {http://jmlr.org/papers/v24/22-099.html}
}
Filippos Christianos, Georgios Papoudakis, Stefano V. Albrecht
Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning
Transactions on Machine Learning Research, 2023
Abstract | BibTex | arXiv | Code
TMLR, deep-rl, multi-agent-rl
Abstract:
This work focuses on equilibrium selection in no-conflict multi-agent games, where we specifically study the problem of selecting a Pareto-optimal Nash equilibrium among several existing equilibria. It has been shown that many state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone to converging to Pareto-dominated equilibria due to the uncertainty each agent has about the policy of the other agents during training. To address sub-optimal equilibrium selection, we propose Pareto Actor-Critic (Pareto-AC), which is an actor-critic algorithm that utilises a simple property of no-conflict games (a superset of cooperative games): the Pareto-optimal equilibrium in a no-conflict game maximises the returns of all agents and, therefore, is the preferred outcome for all agents. We evaluate Pareto-AC in a diverse set of multi-agent games and show that it converges to higher episodic returns compared to seven state-of-the-art MARL algorithms and that it successfully converges to a Pareto-optimal equilibrium in a range of matrix games. Finally, we propose PACDCG, a graph neural network extension of Pareto-AC, which is shown to efficiently scale in games with a large number of agents.
@article{christianos2023pareto,
title={Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning},
author={Filippos Christianos and Georgios Papoudakis and Stefano V. Albrecht},
journal={Transactions on Machine Learning Research (TMLR)},
year={2023}
}
Arrasy Rahman, Elliot Fosong, Ignacio Carlucho, Stefano V. Albrecht
Generating Teammates for Training Robust Ad Hoc Teamwork Agents via Best-Response Diversity
Transactions on Machine Learning Research, 2023
Abstract | BibTex | arXiv | Code
TMLR, ad-hoc-teamwork, multi-agent-rl, deep-rl
Abstract:
Ad hoc teamwork (AHT) is the challenge of designing a robust learner agent that effectively collaborates with unknown teammates without prior coordination mechanisms. Early approaches address the AHT challenge by training the learner with a diverse set of handcrafted teammate policies, usually designed based on an expert's domain knowledge about the policies the learner may encounter. However, implementing teammate policies for training based on domain knowledge is not always feasible. In such cases, recent approaches attempted to improve the robustness of the learner by training it with teammate policies generated by optimising information-theoretic diversity metrics. The problem with optimising existing information-theoretic diversity metrics for teammate policy generation is the emergence of superficially different teammates. When used for AHT training, superficially different teammate behaviours may not improve a learner's robustness during collaboration with unknown teammates. In this paper, we present an automated teammate policy generation method optimising the Best-Response Diversity (BRDiv) metric, which measures diversity based on the compatibility of teammate policies in terms of returns. We evaluate our approach in environments with multiple valid coordination strategies, comparing against methods optimising information-theoretic diversity metrics and an ablation not optimising any diversity metric. Our experiments indicate that optimising BRDiv yields a diverse set of training teammate policies that improve the learner's performance relative to previous teammate generation approaches when collaborating with near-optimal previously unseen teammate policies.
@article{rahman2023BRDiv,
title={Generating Teammates for Training Robust Ad Hoc Teamwork Agents via Best-Response Diversity},
author={Arrasy Rahman and Elliot Fosong and Ignacio Carlucho and Stefano V. Albrecht},
journal={Transactions on Machine Learning Research (TMLR)},
year={2023}
}
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah Hanna, Stefano V. Albrecht
Conditional Mutual Information for Disentangled Representations in Reinforcement Learning
Conference on Neural Information Processing Systems, 2023
Abstract | BibTex | arXiv | Code
NeurIPS, deep-rl, causal, generalisation
Abstract:
Reinforcement Learning (RL) environments can produce training data with spurious correlations between features due to the amount of training data or its limited feature coverage. This can lead to RL agents encoding these misleading correlations in their latent representation, preventing the agent from generalising if the correlation changes within the environment or when deployed in the real world. Disentangled representations can improve robustness, but existing disentanglement techniques that minimise mutual information between features require independent features, thus they cannot disentangle correlated features. We propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation. We demonstrate experimentally, using continuous control tasks, that our approach improves generalisation under correlation shifts, as well as improving the training performance of RL algorithms in the presence of correlated features.
@inproceedings{dunion2023cmid,
title={Conditional Mutual Information for Disentangled Representations in Reinforcement Learning},
author={Mhairi Dunion and Trevor McInroe and Kevin Sebastian Luck and Josiah Hanna and Stefano V. Albrecht},
booktitle={Conference on Neural Information Processing Systems},
year={2023}
}
Lukas Schäfer, Filippos Christianos, Amos Storkey, Stefano V. Albrecht
Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning
NeurIPS Workshop on Generalization in Planning, 2023
Abstract | BibTex | arXiv | Code
NeurIPS, multi-agent-rl, deep-rl
Abstract:
Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.
@inproceedings{schaefer2023mate,
title={Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning},
author={Lukas Sch\"afer and Filippos Christianos and Amos Storkey and Stefano V. Albrecht},
booktitle={NeurIPS Workshop on Generalization in Planning},
year={2023}
}
Guy Azran, Mohamad H. Danesh, Stefano V. Albrecht, Sarah Keren
Contextual Pre-Planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning
NeurIPS Workshop on Generalization in Planning, 2023
Abstract | BibTex | arXiv
NeurIPS, deep-rl, causal
Abstract:
Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RM), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Our empirical evaluation shows that our representations improve sample efficiency and few-shot transfer in a variety of domains.
@inproceedings{azran2023contextual,
title={Contextual Pre-Planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning},
author={Guy Azran and Mohamad H. Danesh and Stefano V. Albrecht and Sarah Keren},
booktitle={NeurIPS Workshop on Generalization in Planning},
year={2023}
}
Samuel Garcin, James Doran, Shangmin Guo, Christopher G. Lucas, Stefano V. Albrecht
How the level sampling process impacts zero-shot generalisation in deep reinforcement learning
NeurIPS Workshop on Agent Learning in Open-Endedness, 2023
Abstract | BibTex | arXiv
NeurIPS, deep-rl
Abstract:
A key limitation preventing the wider adoption of autonomous agents trained via deep reinforcement learning (RL) is their limited ability to generalise to new environments, even when these share similar characteristics with environments encountered during training. In this work, we investigate how a non-uniform sampling strategy of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents, considering two failure modes: overfitting and over-generalisation. As a first step, we measure the mutual information (MI) between the agent's internal representation and the set of training levels, which we find to be well-correlated to instance overfitting. In contrast to uniform sampling, adaptive sampling strategies prioritising levels based on their value loss are more effective at maintaining lower MI, which provides a novel theoretical justification for this class of techniques. We then turn our attention to unsupervised environment design (UED) methods, which adaptively generate new training levels and minimise MI more effectively than methods sampling from a fixed set. However, we find UED methods significantly shift the training distribution, resulting in over-generalisation and worse ZSG performance over the distribution of interest. To prevent both instance overfitting and over-generalisation, we introduce self-supervised environment design (SSED). SSED generates levels using a variational autoencoder, effectively reducing MI while minimising the shift with the distribution of interest, and leads to statistically significant improvements in ZSG over fixed-set level sampling strategies and UED methods.
@inproceedings{garcin2023level,
title={How the level sampling process impacts zero-shot generalisation in deep reinforcement learning},
author={Samuel Garcin and James Doran and Shangmin Guo and Christopher G. Lucas and Stefano V. Albrecht},
booktitle={NeurIPS Workshop on Agent Learning in Open-Endedness},
year={2023}
}
Sabrina McCallum, Max Taylor-Davies, Stefano V. Albrecht, Alessandro Suglia
Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning
NeurIPS Workshop on Goal-Conditioned Reinforcement Learning, 2023
Abstract | BibTex | arXiv | Code
NeurIPS, deep-rl
Abstract:
Despite numerous successes, the field of reinforcement learning (RL) remains far from matching the impressive generalisation power of human behaviour learning. One way to help bridge this gap may be to provide RL agents with richer, more human-like feedback expressed in natural language. To investigate this idea, we first extend BabyAI to automatically generate language feedback from the environment dynamics and goal condition success. Then, we modify the Decision Transformer architecture to take advantage of this additional signal. We find that training with language feedback either in place of or in addition to the return-to-go or goal descriptions improves agents’ generalisation performance, and that agents can benefit from feedback even when this is only available during training, but not at inference.
@inproceedings{mccallum2023feedback,
title={Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning},
author={Sabrina McCallum and Max Taylor-Davies and Stefano V. Albrecht and Alessandro Suglia},
booktitle={NeurIPS Workshop on Goal-Conditioned Reinforcement Learning (GCRL)},
year={2023}
}
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah Hanna, Stefano V. Albrecht
Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning
International Conference on Learning Representations, 2023
Abstract | BibTex | arXiv | Code
ICLR, deep-rl, generalisation, causal
Abstract:
Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image, which can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, we also find that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).
@inproceedings{dunion2023ted,
title={Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning},
author={Mhairi Dunion and Trevor McInroe and Kevin Sebastian Luck and Josiah Hanna and Stefano V. Albrecht},
booktitle={International Conference on Learning Representations (ICLR)},
year={2023}
}
Filippos Christianos, Peter Karkus, Boris Ivanovic, Stefano V. Albrecht, Marco Pavone
Planning with Occluded Traffic Agents using Bi-Level Variational Occlusion Models
IEEE International Conference on Robotics and Automation, 2023
Abstract | BibTex | arXiv
ICRA deep-rl autonomous-driving
Abstract:
Reasoning with occluded traffic agents is a significant open challenge for planning for autonomous vehicles. Recent deep learning models have shown impressive results for predicting occluded agents based on the behaviour of nearby visible agents; however, as we show in experiments, these models are difficult to integrate into downstream planning. To this end, we propose Bi-level Variational Occlusion Models (BiVO), a two-step generative model that first predicts likely locations of occluded agents, and then generates likely trajectories for the occluded agents. In contrast to existing methods, BiVO outputs a trajectory distribution which can then be sampled from and integrated into standard downstream planning. We evaluate the method in closed-loop replay simulation using the real-world nuScenes dataset. Our results suggest that BiVO can successfully learn to predict occluded agent trajectories, and these predictions lead to better subsequent motion plans in critical scenarios.
@inproceedings{christianos2023planning,
title={Planning with Occluded Traffic Agents using Bi-Level Variational Occlusion Models},
author={Filippos Christianos and Peter Karkus and Boris Ivanovic and Stefano V. Albrecht and Marco Pavone},
booktitle={International Conference on Robotics and Automation (ICRA)},
year={2023}
}
Giuseppe Vecchio, Simone Palazzo, Dario C Guastella, Riccardo E. Sarpietro, Ignacio Carlucho, Stefano V. Albrecht, Giovanni Muscato, Concetto Spampinato
MIDGARD: A Simulation Platform for Autonomous Navigation in Unstructured Environments
RSS Workshop on Multi-Agent Planning and Navigation in Challenging Environments, 2023
Abstract | BibTex | arXiv
RSS simulator deep-rl
Abstract:
We present MIDGARD, an open-source simulation platform for autonomous robot navigation in outdoor unstructured environments. MIDGARD is designed to enable the training of autonomous agents (e.g., unmanned ground vehicles) in photorealistic 3D environments, and to support the generalization skills of learning-based agents through the variability in training scenarios. MIDGARD's main features include a configurable, extensible, and difficulty-driven procedural landscape generation pipeline, with fast and photorealistic scene rendering based on Unreal Engine. Additionally, MIDGARD has built-in support for OpenAI Gym, a programming interface for feature extension (e.g., integrating new types of sensors, customizing and exposing internal simulation variables), and a variety of simulated agent sensors (e.g., RGB, depth and instance/semantic segmentation). We evaluate MIDGARD's capabilities as a benchmarking tool for robot navigation utilizing a set of state-of-the-art reinforcement learning algorithms. The results demonstrate MIDGARD's suitability as a simulation and training environment, as well as the effectiveness of our procedural generation approach in controlling scene difficulty, which directly reflects on accuracy metrics.
@inproceedings{vecchio2022midgard,
title={MIDGARD: A Simulation Platform for Autonomous Navigation in Unstructured Environments},
author={Vecchio, Giuseppe and Palazzo, Simone and Guastella, Dario C and Sarpietro, Riccardo E. and Carlucho, Ignacio and Albrecht, Stefano V. and Muscato, Giovanni and Spampinato, Concetto},
booktitle={RSS 2023 Workshop on Multi-Agent Planning and Navigation in Challenging Environments},
year={2023}
}
Filippos Christianos, Georgios Papoudakis, Stefano V. Albrecht
Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning
AAMAS Workshop on Optimization and Learning in Multiagent Systems, 2023
Abstract | BibTex | arXiv
AAMAS deep-rl multi-agent-rl
Abstract:
This work focuses on equilibrium selection in no-conflict multi-agent games, where we specifically study the problem of selecting a Pareto-optimal equilibrium among several existing equilibria. It has been shown that many state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone to converging to Pareto-dominated equilibria due to the uncertainty each agent has about the policy of the other agents during training. To address suboptimal equilibrium selection, we propose Pareto Actor-Critic (Pareto-AC), an actor-critic algorithm that utilises a simple property of no-conflict games (a superset of cooperative games with identical rewards): each agent can assume the others will choose actions that will lead to a Pareto-optimal equilibrium. We evaluate Pareto-AC in a diverse set of multi-agent games and show that it converges to higher episodic returns compared to alternative MARL algorithms, as well as successfully converging to a Pareto-optimal equilibrium in a range of matrix games.
@inproceedings{christianos2023pareto,
title={Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning},
author={Filippos Christianos and Georgios Papoudakis and Stefano V. Albrecht},
booktitle={AAMAS Workshop on Optimization and Learning in Multiagent Systems},
year={2023}
}
Adam Michalski, Filippos Christianos, Stefano V. Albrecht
SMAClite: A Lightweight Environment for Multi-Agent Reinforcement Learning
AAMAS Workshop on Multiagent Sequential Decision Making Under Uncertainty, 2023
Abstract | BibTex | arXiv | Code
AAMAS deep-rl multi-agent-rl
Abstract:
There is a lack of standard benchmarks for Multi-Agent Reinforcement Learning (MARL) algorithms. The StarCraft Multi-Agent Challenge (SMAC) has been widely used in MARL research, but is built on top of a heavy, closed-source computer game, StarCraft II. Thus, SMAC is computationally expensive and requires knowledge and the use of proprietary tools specific to the game for any meaningful alteration or contribution to the environment. We introduce SMAClite -- a challenge based on SMAC that is both decoupled from StarCraft II and open-source, along with a framework which makes it possible to create new content for SMAClite without any special knowledge. We conduct experiments to show that SMAClite is equivalent to SMAC, by training MARL algorithms on SMAClite and reproducing SMAC results. We then show that SMAClite outperforms SMAC in both runtime speed and memory.
@inproceedings{michalski2023smaclite,
title={SMAClite: A Lightweight Environment for Multi-Agent Reinforcement Learning},
author={Adam Michalski and Filippos Christianos and Stefano V. Albrecht},
booktitle={AAMAS Workshop on Multiagent Sequential Decision Making Under Uncertainty (MSDM)},
year={2023}
}
Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, David Mguni
Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
AAMAS Workshop on Adaptive and Learning Agents, 2023
Abstract | BibTex | arXiv
AAMAS multi-agent-rl deep-rl
Abstract:
Cooperative multi-agent reinforcement learning (MARL) requires agents to explore to learn to cooperate. Existing value-based MARL algorithms commonly rely on random exploration, such as ϵ-greedy, which is inefficient in discovering multi-agent cooperation. Additionally, the environment in MARL appears non-stationary to any individual agent due to the simultaneous training of other agents, leading to highly variant and thus unstable optimisation signals. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to extend any value-based MARL algorithm. EMAX trains ensembles of value functions for each agent to address the key challenges of exploration and non-stationarity: (1) The uncertainty of value estimates across the ensemble is used in a UCB policy to guide the exploration of agents to parts of the environment which require cooperation. (2) Average value estimates across the ensemble serve as target values. These targets exhibit lower variance compared to commonly applied target networks and we show that they lead to more stable gradients during the optimisation. We instantiate three value-based MARL algorithms with EMAX, independent DQN, VDN and QMIX, and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 53%, 36%, and 498%, respectively, averaged across all 21 tasks.
@inproceedings{schaefer2023emax,
title={Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning},
author={Lukas Schäfer and Oliver Slumbers and Stephen McAleer and Yali Du and Stefano V. Albrecht and David Mguni},
year={2023},
booktitle={AAMAS Workshop on Adaptive and Learning Agents (ALA)},
}
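The two uses of the ensemble named in the EMAX abstract above (a UCB exploration policy and averaged target values) can be illustrated with a minimal single-agent sketch. This is not the authors' implementation: the function names, the `beta` coefficient, the use of the ensemble standard deviation as the uncertainty measure, and the DQN-style max over next actions are all assumptions made for illustration.

```python
import statistics


def ucb_action(ensemble_q, beta=1.0):
    """Return the action maximising mean + beta * std across the ensemble.

    ensemble_q[k][a] is the value of action a under ensemble member k.
    beta (hypothetical) weights the uncertainty bonus.
    """
    n_actions = len(ensemble_q[0])
    scores = []
    for a in range(n_actions):
        values = [member[a] for member in ensemble_q]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # disagreement between members (assumed uncertainty measure)
        scores.append(mean + beta * std)
    return max(range(n_actions), key=scores.__getitem__)


def ensemble_target(reward, next_ensemble_q, gamma=0.99):
    """Bootstrapped target built from the ensemble-average value estimates.

    Assumption: a DQN-style target that takes the max over next actions
    of the mean value across ensemble members.
    """
    n_actions = len(next_ensemble_q[0])
    mean_q = [
        statistics.fmean([member[a] for member in next_ensemble_q])
        for a in range(n_actions)
    ]
    return reward + gamma * max(mean_q)
```

With `beta=0` the policy reduces to greedy selection on the ensemble mean; larger values favour actions on which the ensemble members disagree.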
Callum Tilbury, Filippos Christianos, Stefano V. Albrecht
Revisiting the Gumbel-Softmax in MADDPG
AAMAS Workshop on Adaptive and Learning Agents, 2023
Abstract | BibTex | arXiv | Code
AAMAS multi-agent-rl deep-rl
Abstract:
MADDPG is an algorithm in multi-agent reinforcement learning (MARL) that extends the popular single-agent method, DDPG, to multi-agent scenarios. Importantly, DDPG is an algorithm designed for continuous action spaces, where the gradient of the state-action value function exists. For this algorithm to work in discrete action spaces, discrete gradient estimation must be performed. For MADDPG, the Gumbel-Softmax (GS) estimator is used -- a reparameterisation which relaxes a discrete distribution into a similar continuous one. This method, however, is statistically biased, and a recent MARL benchmarking paper suggests that this bias makes MADDPG perform poorly in grid-world situations, where the action space is discrete. Fortunately, many alternatives to the GS exist, boasting a wide range of properties. This paper explores several of these alternatives and integrates them into MADDPG for discrete grid-world scenarios. The corresponding impact on various performance metrics is then measured and analysed. It is found that one of the proposed estimators performs significantly better than the original GS in several tasks, achieving up to 55% higher returns, along with faster convergence.
@inproceedings{tilbury2023revisitingmaddpg,
title={Revisiting the Gumbel-Softmax in MADDPG},
author={Callum Tilbury and Filippos Christianos and Stefano V. Albrecht},
year={2023},
booktitle={AAMAS Workshop on Adaptive and Learning Agents (ALA)},
}
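The Gumbel-Softmax relaxation mentioned in the abstract above can be sketched in a few lines of plain Python. This is a generic illustration of the standard sampling step only, not the code from the paper: the function name, the `temperature` default, and the `rng` parameter are assumptions, and no gradient computation is shown.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed (continuous) sample from a categorical distribution.

    Gumbel(0, 1) noise is added to each logit, and the perturbed logits are
    passed through a temperature-scaled softmax. Lower temperatures push the
    output closer to a one-hot vector.
    """
    perturbed = []
    for logit in logits:
        u = rng.random()
        # Clamp away from 0 and 1 so both logarithms are well defined.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / temperature)

    # Numerically stable softmax over the perturbed logits.
    top = max(perturbed)
    exps = [math.exp(p - top) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned list is a point on the probability simplex rather than a discrete action, which is what allows gradients to flow through the sampling step in a differentiable framework.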
Alain Andres, Lukas Schäfer, Esther Villar-Rodriguez, Stefano V. Albrecht, Javier Del Ser
Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments
AAMAS Workshop on Adaptive and Learning Agents, 2023
Abstract | BibTex | arXiv
AAMAS deep-rl
Abstract:
One of the key challenges of Reinforcement Learning (RL) is the ability of agents to generalise their learned policy to unseen settings. Moreover, training RL agents requires large numbers of interactions with the environment. Motivated by the recent success of Offline RL and Imitation Learning (IL), we conduct a study to investigate whether agents can leverage offline data in the form of trajectories to improve the sample-efficiency in procedurally generated environments. We consider two settings of using IL from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyse the impact of the quality (optimality of trajectories) and diversity (number of trajectories and covered level) of available offline trajectories on the effectiveness of both approaches. Across four well-known sparse reward tasks in the MiniGrid environment, we find that using IL for pre-training and concurrently during online RL training both consistently improve the sample-efficiency while converging to optimal policies. Furthermore, we show that pre-training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.
@inproceedings{andres2023using,
title={Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments},
author={Andres, Alain and Schäfer, Lukas and Villar-Rodriguez, Esther and Albrecht, Stefano V. and Del Ser, Javier},
booktitle={AAMAS Workshop on Adaptive and Learning Agents (ALA)},
year={2023}
}
Guy Azran, Mohamad H. Danesh, Stefano V. Albrecht, Sarah Keren
Contextual Pre-Planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning
IJCAI Workshop on Planning and Reinforcement Learning, 2023
Abstract | BibTex | arXiv
IJCAI deep-rl causal
Abstract:
Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RM), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Our empirical evaluation shows that our representations improve sample efficiency and few-shot transfer in a variety of domains.
@inproceedings{azran2023contextual,
title={Contextual Pre-Planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning},
author={Guy Azran and Mohamad H. Danesh and Stefano V. Albrecht and Sarah Keren},
booktitle={IJCAI Workshop on Planning and Reinforcement Learning},
url={https://prl-theworkshop.github.io/},
year={2023}
}
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah Hanna, Stefano V. Albrecht
Conditional Mutual Information for Disentangled Representations in Reinforcement Learning
European Workshop on Reinforcement Learning, 2023
Abstract | BibTex | arXiv | Code
EWRL deep-rl causal generalisation
Abstract:
Reinforcement Learning (RL) environments can produce training data with spurious correlations between features due to the amount of training data or its limited feature coverage. This can lead to RL agents encoding these misleading correlations in their latent representation, preventing the agent from generalising if the correlation changes within the environment or when deployed in the real world. Disentangled representations can improve robustness, but existing disentanglement techniques that minimise mutual information between features require independent features, thus they cannot disentangle correlated features. We propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation. We demonstrate experimentally, using continuous control tasks, that our approach improves generalisation under correlation shifts, as well as improving the training performance of RL algorithms in the presence of correlated features.
@inproceedings{dunion2023cmid,
title={Conditional Mutual Information for Disentangled Representations in Reinforcement Learning},
author={Mhairi Dunion and Trevor McInroe and Kevin Sebastian Luck and Josiah Hanna and Stefano V. Albrecht},
booktitle={European Workshop on Reinforcement Learning},
year={2023}
}
Samuel Garcin, James Doran, Shangmin Guo, Christopher G. Lucas, Stefano V. Albrecht
How the level sampling process impacts zero-shot generalisation in deep reinforcement learning
arXiv:2310.03494, 2023
Abstract | BibTex | arXiv
deep-rl
Abstract:
A key limitation preventing the wider adoption of autonomous agents trained via deep reinforcement learning (RL) is their limited ability to generalise to new environments, even when these share similar characteristics with environments encountered during training. In this work, we investigate how a non-uniform sampling strategy of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents, considering two failure modes: overfitting and over-generalisation. As a first step, we measure the mutual information (MI) between the agent's internal representation and the set of training levels, which we find to be well-correlated to instance overfitting. In contrast to uniform sampling, adaptive sampling strategies prioritising levels based on their value loss are more effective at maintaining lower MI, which provides a novel theoretical justification for this class of techniques. We then turn our attention to unsupervised environment design (UED) methods, which adaptively generate new training levels and minimise MI more effectively than methods sampling from a fixed set. However, we find UED methods significantly shift the training distribution, resulting in over-generalisation and worse ZSG performance over the distribution of interest. To prevent both instance overfitting and over-generalisation, we introduce self-supervised environment design (SSED). SSED generates levels using a variational autoencoder, effectively reducing MI while minimising the shift with the distribution of interest, and leads to statistically significant improvements in ZSG over fixed-set level sampling strategies and UED methods.
@misc{garcin2023level,
title={How the level sampling process impacts zero-shot generalisation in deep reinforcement learning},
author={Samuel Garcin and James Doran and Shangmin Guo and Christopher G. Lucas and Stefano V. Albrecht},
year={2023},
eprint={2310.03494},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Trevor McInroe, Stefano V. Albrecht, Amos Storkey
Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning
arXiv:2310.05723, 2023
Abstract | BibTex | arXiv
deep-rl
Abstract:
Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm that is well matched to a real-world RL deployment process: in few real settings would one deploy an offline policy with no test runs and tuning. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but this unnecessarily limits policy performance if the behavior policy is far from optimal. Instead, we forgo policy constraints and frame OtO RL as an exploration problem: we must maximize the benefit of the online data-collection. We study major online RL exploration paradigms, adapting them to work well with the OtO setting. These adapted methods contribute several strong baselines. Also, we introduce an algorithm for planning to go out of distribution (PTGOOD), which targets online exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy. In that way the limited interaction budget is used effectively. We show that PTGOOD significantly improves agent returns during online fine-tuning and finds the optimal policy in as few as 10k online steps in Walker and in as few as 50k in complex control tasks like Humanoid. Also, we find that PTGOOD avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.
@misc{mcinroe2023planning,
title={Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning},
author={Trevor McInroe and Stefano V. Albrecht and Amos Storkey},
year={2023},
eprint={2310.05723},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
2022
Stefano V. Albrecht, Michael Wooldridge
Special Issue on Multi-Agent Systems Research in the United Kingdom: Guest Editorial
AI Communications, 2022
Abstract | BibTex | Publisher | Special Issue
AIC survey deep-rl multi-agent-rl agent-modelling
Abstract:
The purpose of this special issue is to showcase current multi-agent systems research led by university and industry groups based in the United Kingdom. Research groups and institutes in the UK which have significant activity in multi-agent systems research were invited to submit an article describing: (1) the technical problems in multi-agent systems tackled by the group (their core research agenda), including applications and industry collaboration; (2) the main approaches developed by the group and any key results achieved; and (3) important open challenges in multi-agent systems research from the perspective of the group.
@article{albrecht2020special,
title = {Special Issue on Multi-Agent Systems Research in the United Kingdom: Guest Editorial},
author = {Stefano V. Albrecht and Michael Wooldridge},
journal = {AI Communications},
volume = {35},
number = {4},
year = {2022},
publisher = {IOS Press},
url = {https://content.iospress.com/articles/ai-communications/aic229003}
}
Ibrahim H. Ahmed, Cillian Brewitt, Ignacio Carlucho, Filippos Christianos, Mhairi Dunion, Elliot Fosong, Samuel Garcin, Shangmin Guo, Balint Gyevnar, Trevor McInroe, Georgios Papoudakis, Arrasy Rahman, Lukas Schäfer, Massimiliano Tamborski, Giuseppe Vecchio, Cheng Wang, Stefano V. Albrecht
Deep Reinforcement Learning for Multi-Agent Interaction
AI Communications, 2022
Abstract | BibTex | arXiv | Publisher
AIC survey deep-rl multi-agent-rl ad-hoc-teamwork agent-modelling goal-recognition security explainable-ai autonomous-driving
Abstract:
The development of autonomous agents which can interact with other agents to accomplish a given task is a core area of research in artificial intelligence and machine learning. Towards this goal, the Autonomous Agents Research Group develops novel machine learning algorithms for autonomous systems control, with a specific focus on deep reinforcement learning and multi-agent reinforcement learning. Research problems include scalable learning of coordinated agent policies and inter-agent communication; reasoning about the behaviours, goals, and composition of other agents from limited observations; and sample-efficient learning based on intrinsic motivation, curriculum learning, causal inference, and representation learning. This article provides a broad overview of the ongoing research portfolio of the group and discusses open problems for future directions.
@article{albrecht2022aic,
author = {Ahmed, Ibrahim H. and Brewitt, Cillian and Carlucho, Ignacio and Christianos, Filippos and Dunion, Mhairi and Fosong, Elliot and Garcin, Samuel and Guo, Shangmin and Gyevnar, Balint and McInroe, Trevor and Papoudakis, Georgios and Rahman, Arrasy and Schäfer, Lukas and Tamborski, Massimiliano and Vecchio, Giuseppe and Wang, Cheng and Albrecht, Stefano V.},
title = {Deep Reinforcement Learning for Multi-Agent Interaction},
journal = {AI Communications, Special Issue on Multi-Agent Systems Research in the UK},
year = {2022}
}
Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht, Josiah P. Hanna
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning
Conference on Neural Information Processing Systems, 2022
Abstract | BibTex | arXiv | Code
NeurIPS deep-rl
Abstract:
Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories and this failure hinders data-efficient policy evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
@inproceedings{zhong2022datacollection,
title={Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning},
author={Rujie Zhong and Duohan Zhang and Lukas Sch\"afer and Stefano V. Albrecht and Josiah P. Hanna},
booktitle={Conference on Neural Information Processing Systems},
year={2022}
}
Trevor McInroe, Lukas Schäfer, Stefano V. Albrecht
Learning Representations for Reinforcement Learning with Hierarchical Forward Models
NeurIPS Workshop on Deep Reinforcement Learning, 2022
Abstract | BibTex | arXiv
NeurIPS deep-rl generalisation
Abstract:
Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may miss relevant information if important environmental changes take many steps to manifest. We propose Hierarchical k-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and a task of our creation. We find that HKSL either converges to higher episodic returns or reaches optimal returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.
@inproceedings{mcinroe2022hksl,
title={Learning Representations for Reinforcement Learning with Hierarchical Forward Models},
author={Trevor McInroe and Lukas Schäfer and Stefano V. Albrecht},
booktitle={NeurIPS Workshop on Deep RL},
year={2022}
}
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah Hanna, Stefano V. Albrecht
Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning
NeurIPS Workshop on Deep Reinforcement Learning, 2022
Abstract | BibTex | arXiv | Code
NeurIPS deep-rl generalisation causal
Abstract:
Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image, which can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, we also find that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).
@inproceedings{dunion2022ted,
title={Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning},
author={Mhairi Dunion and Trevor McInroe and Kevin Sebastian Luck and Josiah Hanna and Stefano V. Albrecht},
booktitle={NeurIPS Workshop on Deep Reinforcement Learning},
year={2022}
}
Guy Azran, Mohamad Hosein Danesh, Stefano V. Albrecht, Sarah Keren
Enhancing Transfer of Reinforcement Learning Agents with Abstract Contextual Embeddings
NeurIPS Workshop on Neuro Causal and Symbolic AI, 2022
Abstract | BibTex
NeurIPS deep-rl causal
Abstract:
Deep reinforcement learning (DRL) algorithms have seen great success in performing a plethora of tasks, but often have trouble adapting to changes in the environment. We address this issue by using reward machines (RM), a graph-based abstraction of the underlying task to represent the current setting or context. Using a graph neural network (GNN), we embed the RMs into deep latent vector representations and provide them to the agent to enhance its ability to adapt to new contexts. To the best of our knowledge, this is the first work to embed contextual abstractions and let the agent decide how to use them. Our preliminary empirical evaluation demonstrates improved sample efficiency of our approach upon context transfer on a set of grid navigation tasks.
@inproceedings{Azran2022enhancing,
title={Enhancing Transfer of Reinforcement Learning Agents with Abstract Contextual Embeddings},
author={Guy Azran and Mohamad Hosein Danesh and Stefano V. Albrecht and Sarah Keren},
booktitle={NeurIPS Workshop on Neuro Causal and Symbolic AI},
url={https://ncsi.cause-lab.net},
year={2022}
}
Lukas Schäfer, Filippos Christianos, Josiah P. Hanna, Stefano V. Albrecht
Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated Exploration
International Conference on Autonomous Agents and Multi-Agent Systems, 2022
Abstract | BibTex | arXiv | Code
AAMAS deep-rl intrinsic-reward
Abstract:
Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.
@inproceedings{schaefer2022derl,
title={Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated Exploration},
author={Lukas Schäfer and Filippos Christianos and Josiah P. Hanna and Stefano V. Albrecht},
booktitle={International Conference on Autonomous Agents and Multiagent Systems (AAMAS)},
year={2022}
}
Giuseppe Vecchio, Simone Palazzo, Dario C Guastella, Ignacio Carlucho, Stefano V. Albrecht, Giovanni Muscato, Concetto Spampinato
MIDGARD: A Simulation Platform for Autonomous Navigation in Unstructured Environments
ICRA Workshop on Releasing Robots into the Wild: Simulations, Benchmarks, and Deployment, 2022
Abstract | BibTex | arXiv
ICRA, deep-rl, simulator
Abstract:
We present MIDGARD, an open source simulation platform for autonomous robot navigation in unstructured outdoor environments. We specifically design MIDGARD to enable training of autonomous agents (e.g., unmanned ground vehicles) in photorealistic 3D environments, and to support the generalization skills of learning-based agents by means of diverse and variable training scenarios. MIDGARD differs from other major simulation platforms in that it proposes a highly configurable procedural landscape generation pipeline, which enables autonomous agents to be trained in diverse scenarios while reducing the efforts and costs needed to create digital content from scratch.
@misc{Vecchio2022MIDGARD,
title={MIDGARD: A Simulation Platform for Autonomous Navigation in Unstructured Environments},
author={Giuseppe Vecchio and Simone Palazzo and Dario C Guastella and Ignacio Carlucho and Stefano V. Albrecht and Giovanni Muscato and Concetto Spampinato},
year={2022},
eprint={2205.08389},
archivePrefix={arXiv},
primaryClass={cs.MA}
}
Arrasy Rahman, Ignacio Carlucho, Niklas Höpner, Stefano V. Albrecht
A General Learning Framework for Open Ad Hoc Teamwork Using Graph-based Policy Learning
arXiv:2210.05448, 2022
Abstract | BibTex | arXiv
ad-hoc-teamwork, deep-rl, agent-modelling
Abstract:
Open ad hoc teamwork is the problem of training a single agent to efficiently collaborate with an unknown group of teammates whose composition may change over time. A variable team composition creates challenges for the agent, such as the requirement to adapt to new team dynamics and dealing with changing state vector sizes. These challenges are aggravated in real-world applications where the controlled agent has no access to the full state of the environment. In this work, we develop a class of solutions for open ad hoc teamwork under full and partial observability. We start by developing a solution for the fully observable case that leverages graph neural network architectures to obtain an optimal policy based on reinforcement learning. We then extend this solution to partially observable scenarios by proposing different methodologies that maintain belief estimates over the latent environment states and team composition. These belief estimates are combined with our solution for the fully observable case to compute an agent's optimal policy under partial observability in open ad hoc teamwork. Empirical results demonstrate that our approach can learn efficient policies in open ad hoc teamwork in full and partially observable cases. Further analysis demonstrates that our methods' success is a result of effectively learning the effects of teammates' actions while also inferring the inherent state of the environment under partial observability.
@misc{Rahman2022POGPL,
title={A General Learning Framework for Open Ad Hoc Teamwork Using Graph-based Policy Learning},
author={Arrasy Rahman and Ignacio Carlucho and Niklas H\"opner and Stefano V. Albrecht},
year={2022},
eprint={2210.05448},
archivePrefix={arXiv}
}
Aleksandar Krnjaic, Jonathan D. Thomas, Georgios Papoudakis, Lukas Schäfer, Peter Börsting, Stefano V. Albrecht
Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers
arXiv:2212.11498, 2022
Abstract | BibTex | arXiv
deep-rl, multi-agent-rl
Abstract:
This project leverages advances in Multi-Agent Reinforcement Learning (MARL) to improve the efficiency and flexibility of order-picking systems for large-scale commercial warehouses. We envision a warehouse of the future in which dozens or even hundreds of mobile robots and humans work together to collect and deliver items. The fundamental problem we tackle - called the order-picking problem - is how these agents must coordinate their movement and actions in the warehouse to maximise performance (e.g. order throughput) under given resource constraints. MARL algorithms implement a paradigm whereby the agents learn via a process of trial-and-error how to optimally collaborate with one another. Established industry methods using fixed heuristics require a large engineering effort to operate in specific warehouse configurations and resource constraints, and their achievable performance is often limited by heuristic design limitations. In contrast, the MARL framework can be applied to any warehouse configuration (e.g. size, layout, number/types of workers, item replenishment frequency) and resource constraints, and the learning process maximises performance by optimising agent behaviours for the specified warehouse environment.
@misc{Krnjaic2022HSNAC,
title={Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers},
author={Aleksandar Krnjaic and Jonathan D. Thomas and Georgios Papoudakis and Lukas Sch\"afer and Peter B\"orsting and Stefano V. Albrecht},
year={2022},
eprint={2212.11498},
archivePrefix={arXiv}
}
Lukas Schäfer, Filippos Christianos, Amos Storkey, Stefano V. Albrecht
Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning
arXiv:2207.02249, 2022
Abstract | BibTex | arXiv
deep-rl, multi-agent-rl
Abstract:
Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.
@misc{schaefer2022mate,
title={Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning},
author={Lukas Sch\"afer and Filippos Christianos and Amos Storkey and Stefano V. Albrecht},
year={2022},
eprint={2207.02249},
archivePrefix={arXiv},
primaryClass={cs.MA}
}
Filippos Christianos, Georgios Papoudakis, Stefano V. Albrecht
Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning
arXiv:2209.14344, 2022
Abstract | BibTex | arXiv
deep-rl, multi-agent-rl
Abstract:
Equilibrium selection in multi-agent games refers to the problem of selecting a Pareto-optimal equilibrium. It has been shown that many state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone to converging to Pareto-dominated equilibria due to the uncertainty each agent has about the policy of the other agents during training. To address suboptimal equilibrium selection, we propose Pareto-AC (PAC), an actor-critic algorithm that utilises a simple principle of no-conflict games (a superset of cooperative games with identical rewards): each agent can assume the others will choose actions that will lead to a Pareto-optimal equilibrium. We evaluate PAC in a diverse set of multi-agent games and show that it converges to higher episodic returns compared to alternative MARL algorithms, as well as successfully converging to a Pareto-optimal equilibrium in a range of matrix games. Finally, we propose a graph neural network extension which is shown to efficiently scale in games with up to 15 agents.
@misc{christianos2022pareto,
title={Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning},
author={Filippos Christianos and Georgios Papoudakis and Stefano V. Albrecht},
year={2022},
eprint={2209.14344},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
2021
Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, Stefano V. Albrecht
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks
Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2021
Abstract | BibTex | arXiv | Code
NeurIPS, deep-rl, multi-agent-rl
Abstract:
Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we consistently evaluate and compare three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase [Samvelyan et al., 2019] to include additional algorithms and allow for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research which focus on coordination under sparse rewards.
@inproceedings{papoudakis2021benchmarking,
title={Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks},
author={Georgios Papoudakis and Filippos Christianos and Lukas Sch\"afer and Stefano V. Albrecht},
booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS)},
year={2021},
url = {http://arxiv.org/abs/2006.07869},
openreview = {https://openreview.net/forum?id=cIrPX-Sn5n},
code = {https://github.com/uoe-agents/epymarl}
}
Georgios Papoudakis, Filippos Christianos, Stefano V. Albrecht
Agent Modelling under Partial Observability for Deep Reinforcement Learning
Conference on Neural Information Processing Systems, 2021
Abstract | BibTex | arXiv | Code
NeurIPS, deep-rl, agent-modelling
Abstract:
Modelling the behaviours of other agents is essential for understanding how agents interact and making effective decisions. Existing methods for agent modelling commonly assume knowledge of the local observations and chosen actions of the modelled agents during execution. To eliminate this assumption, we extract representations from the local information of the controlled agent using encoder-decoder architectures. Using the observations and actions of the modelled agents during training, our models learn to extract representations about the modelled agents conditioned only on the local observations of the controlled agent. The representations are used to augment the controlled agent's decision policy which is trained via deep reinforcement learning; thus, during execution, the policy does not require access to other agents' information. We provide a comprehensive evaluation and ablation studies in cooperative, competitive and mixed multi-agent environments, showing that our method achieves significantly higher returns than baseline methods which do not use the learned representations.
@inproceedings{papoudakis2021local,
title={Agent Modelling under Partial Observability for Deep Reinforcement Learning},
author={Georgios Papoudakis and Filippos Christianos and Stefano V. Albrecht},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2021}
}
Rujie Zhong, Josiah P. Hanna, Lukas Schäfer, Stefano V. Albrecht
Robust On-Policy Data Collection for Data-Efficient Policy Evaluation
NeurIPS Workshop on Offline Reinforcement Learning, 2021
Abstract | BibTex | arXiv | Code
NeurIPS, deep-rl
Abstract:
This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that simply running the evaluation policy – on-policy data collection – is sub-optimal for this setting. We then introduce two new data collection strategies for policy evaluation, both of which consider previously collected data when collecting future data so as to reduce distribution shift (or sampling error) in the entire dataset collected. Our empirical results show that compared to on-policy sampling, our strategies produce data with lower sampling error and generally lead to lower mean-squared error in policy evaluation for any total dataset size. We also show that these strategies can start from initial off-policy data, collect additional data, and then use both the initial and new data to produce low mean-squared error policy evaluation without using off-policy corrections.
@inproceedings{zhong2021robust,
title={Robust On-Policy Data Collection for Data-Efficient Policy Evaluation},
author={Rujie Zhong and Josiah P. Hanna and Lukas Sch\"afer and Stefano V. Albrecht},
booktitle={NeurIPS Workshop on Offline Reinforcement Learning (OfflineRL)},
year={2021}
}
Arrasy Rahman, Niklas Höpner, Filippos Christianos, Stefano V. Albrecht
Towards Open Ad Hoc Teamwork Using Graph-based Policy Learning
International Conference on Machine Learning, 2021
Abstract | BibTex | arXiv | Video | Code
ICML, deep-rl, agent-modelling, ad-hoc-teamwork
Abstract:
Ad hoc teamwork is the challenging problem of designing an autonomous agent which can adapt quickly to collaborate with teammates without prior coordination mechanisms, including joint training. Prior work in this area has focused on closed teams in which the number of agents is fixed. In this work, we consider open teams by allowing agents with different fixed policies to enter and leave the environment without prior notification. Our solution builds on graph neural networks to learn agent models and joint-action value models under varying team compositions. We contribute a novel action-value computation that integrates the agent model and joint-action value model to produce action-value estimates. We empirically demonstrate that our approach successfully models the effects other agents have on the learner, leading to policies that robustly adapt to dynamic team compositions and significantly outperform several alternative methods.
@inproceedings{rahman2021open,
title={Towards Open Ad Hoc Teamwork Using Graph-based Policy Learning},
author={Arrasy Rahman and Niklas H\"opner and Filippos Christianos and Stefano V. Albrecht},
booktitle={International Conference on Machine Learning (ICML)},
year={2021}
}
Filippos Christianos, Georgios Papoudakis, Arrasy Rahman, Stefano V. Albrecht
Scaling Multi-Agent Reinforcement Learning with Selective Parameter Sharing
International Conference on Machine Learning, 2021
Abstract | BibTex | arXiv | Video | Code
ICML, deep-rl, multi-agent-rl
Abstract:
Sharing parameters in multi-agent deep reinforcement learning has played an essential role in allowing algorithms to scale to a large number of agents. Parameter sharing between agents significantly decreases the number of trainable parameters, shortening training times to tractable levels, and has been linked to more efficient learning. However, having all agents share the same parameters can also have a detrimental effect on learning. We demonstrate the impact of parameter sharing methods on training speed and converged returns, establishing that when applied indiscriminately, their effectiveness is highly dependent on the environment. We propose a novel method to automatically identify agents which may benefit from sharing parameters by partitioning them based on their abilities and goals. Our approach combines the increased sample efficiency of parameter sharing with the representational capacity of multiple independent networks to reduce training time and increase final returns.
@inproceedings{christianos2021scaling,
title={Scaling Multi-Agent Reinforcement Learning with Selective Parameter Sharing},
author={Filippos Christianos and Georgios Papoudakis and Arrasy Rahman and Stefano V. Albrecht},
booktitle={International Conference on Machine Learning (ICML)},
year={2021}
}
Lukas Schäfer, Filippos Christianos, Josiah Hanna, Stefano V. Albrecht
Decoupling Exploration and Exploitation in Reinforcement Learning
ICML Workshop on Unsupervised Reinforcement Learning, 2021
Abstract | BibTex | arXiv | Code
ICML, deep-rl, intrinsic-reward
Abstract:
Intrinsic rewards are commonly applied to improve exploration in reinforcement learning. However, these approaches suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we propose Decoupled RL (DeRL) which trains separate policies for exploration and exploitation. DeRL can be applied with on-policy and off-policy RL algorithms. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. We show that DeRL is more robust to scaling and speed of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically motivated baselines in fewer interactions.
@inproceedings{schaefer2021decoupling,
title={Decoupling Exploration and Exploitation in Reinforcement Learning},
author={Lukas Sch\"afer and Filippos Christianos and Josiah Hanna and Stefano V. Albrecht},
booktitle={ICML Workshop on Unsupervised Reinforcement Learning (URL)},
year={2021}
}
Trevor McInroe, Lukas Schäfer, Stefano V. Albrecht
Learning Temporally-Consistent Representations for Data-Efficient Reinforcement Learning
arXiv:2110.04935, 2021
Abstract | BibTex | arXiv | Code
deep-rl
Abstract:
Deep reinforcement learning (RL) agents that exist in high-dimensional state spaces, such as those composed of images, have interconnected learning burdens. Agents must learn an action-selection policy that completes their given task, which requires them to learn a representation of the state space that discerns between useful and useless information. The reward function is the only supervised feedback that RL agents receive, which causes a representation learning bottleneck that can manifest in poor sample efficiency. We present k-Step Latent (KSL), a new representation learning method that enforces temporal consistency of representations via a self-supervised auxiliary task wherein agents learn to recurrently predict action-conditioned representations of the state space. The state encoder learned by KSL produces low-dimensional representations that make optimization of the RL task more sample efficient. Altogether, KSL produces state-of-the-art results in both data efficiency and asymptotic performance in the popular PlaNet benchmark suite. Our analyses show that KSL produces encoders that generalize better to new tasks unseen during training, and its representations are more strongly tied to reward, are more invariant to perturbations in the state space, and move more smoothly through the temporal axis of the RL problem than other methods such as DrQ, RAD, CURL, and SAC-AE.
@misc{mcinroe2021learning,
title={Learning Temporally-Consistent Representations for Data-Efficient Reinforcement Learning},
author={Trevor McInroe and Lukas Sch\"afer and Stefano V. Albrecht},
year={2021},
eprint={2110.04935},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
2020
Filippos Christianos, Lukas Schäfer, Stefano V. Albrecht
Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning
Conference on Neural Information Processing Systems, 2020
Abstract | BibTex | arXiv
NeurIPS, deep-rl, multi-agent-rl
Abstract:
Exploration in multi-agent reinforcement learning is a challenging problem, especially in environments with sparse rewards. We propose a general method for efficient exploration by sharing experience amongst agents. Our proposed algorithm, called Shared Experience Actor-Critic (SEAC), applies experience sharing in an actor-critic framework. We evaluate SEAC in a collection of sparse-reward multi-agent environments and find that it consistently outperforms two baselines and two state-of-the-art algorithms by learning in fewer steps and converging to higher returns. In some harder environments, experience sharing makes the difference between learning to solve the task and not learning at all.
@inproceedings{christianos2020shared,
title={Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning},
author={Filippos Christianos and Lukas Sch\"afer and Stefano V. Albrecht},
booktitle={34th Conference on Neural Information Processing Systems},
year={2020}
}
Georgios Papoudakis, Stefano V. Albrecht
Variational Autoencoders for Opponent Modeling in Multi-Agent Systems
AAAI Workshop on Reinforcement Learning in Games, 2020
Abstract | BibTex | arXiv
AAAI, deep-rl, agent-modelling
Abstract:
Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment. In this work, we are interested in controlling one agent in a multi-agent system and successfully learn to interact with the other agents that have fixed policies. Modeling the behavior of other agents (opponents) is essential in understanding the interactions of the agents in the system. By taking advantage of recent advances in unsupervised learning, we propose modeling opponents using variational autoencoders. Additionally, many existing methods in the literature assume that the opponent models have access to opponent's observations and actions during both training and execution. To eliminate this assumption, we propose a modification that attempts to identify the underlying opponent model using only local information of our agent, such as its observations, actions, and rewards. The experiments indicate that our opponent modeling methods achieve equal or greater episodic returns in reinforcement learning tasks against another modeling method.
@inproceedings{papoudakis2020variational,
title={Variational Autoencoders for Opponent Modeling in Multi-Agent Systems},
author={Georgios Papoudakis and Stefano V. Albrecht},
booktitle={AAAI Workshop on Reinforcement Learning in Games},
year={2020}
}
Arrasy Rahman, Niklas Höpner, Filippos Christianos, Stefano V. Albrecht
Open Ad Hoc Teamwork using Graph-based Policy Learning
arXiv:2006.10412, 2020
Abstract | BibTex | arXiv
deep-rl, agent-modelling, ad-hoc-teamwork
Abstract:
Ad hoc teamwork is the challenging problem of designing an autonomous agent which can adapt quickly to collaborate with previously unknown teammates. Prior work in this area has focused on closed teams in which the number of agents is fixed. In this work, we consider open teams by allowing agents of varying types to enter and leave the team without prior notification. Our proposed solution builds on graph neural networks to learn scalable agent models and value decompositions under varying team sizes, which can be jointly trained with a reinforcement learning agent using discounted returns objectives. We demonstrate empirically that our approach results in agent policies which can robustly adapt to dynamic team composition, and is able to effectively generalize to larger teams than were seen during training.
@misc{rahman2020open,
title={Open Ad Hoc Teamwork using Graph-based Policy Learning},
author={Arrasy Rahman and Niklas H\"opner and Filippos Christianos and Stefano V. Albrecht},
year={2020},
eprint={2006.10412},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, Stefano V. Albrecht
Comparative Evaluation of Multi-Agent Deep Reinforcement Learning Algorithms
arXiv:2006.07869, 2020
Abstract | BibTex | arXiv
deep-rl, multi-agent-rl
Abstract:
Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we evaluate and compare three different classes of MARL algorithms (independent learners, centralised training with decentralised execution, and value decomposition) in a diverse range of multi-agent learning tasks. Our results show that (1) algorithm performance depends strongly on environment properties and no algorithm learns efficiently across all learning tasks; (2) independent learners often achieve equal or better performance than more complex algorithms; (3) tested algorithms struggle to solve multi-agent tasks with sparse rewards. We report detailed empirical data, including a reliability analysis, and provide insights into the limitations of the tested algorithms.
@misc{papoudakis2020comparative,
title={Comparative Evaluation of Multi-Agent Deep Reinforcement Learning Algorithms},
author={Georgios Papoudakis and Filippos Christianos and Lukas Sch\"afer and Stefano V. Albrecht},
year={2020},
eprint={2006.07869},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Georgios Papoudakis, Filippos Christianos, Stefano V. Albrecht
Local Information Opponent Modelling Using Variational Autoencoders
arXiv:2006.09447, 2020
Abstract | BibTex | arXiv
deep-rl, agent-modelling
Abstract:
Modelling the behaviours of other agents (opponents) is essential for understanding how agents interact and making effective decisions. Existing methods for opponent modelling commonly assume knowledge of the local observations and chosen actions of the modelled opponents, which can significantly limit their applicability. We propose a new modelling technique based on variational autoencoders, which are trained to reconstruct the local actions and observations of the opponent based on embeddings which depend only on the local observations of the modelling agent (its observed world state, chosen actions, and received rewards). The embeddings are used to augment the modelling agent's decision policy which is trained via deep reinforcement learning; thus the policy does not require access to opponent observations. We provide a comprehensive evaluation and ablation study in diverse multi-agent tasks, showing that our method achieves comparable performance to an ideal baseline which has full access to opponent's information, and significantly higher returns than a baseline method which does not use the learned embeddings.
@misc{papoudakis2020opponent,
title={Local Information Opponent Modelling Using Variational Autoencoders},
author={Georgios Papoudakis and Filippos Christianos and Stefano V. Albrecht},
year={2020},
eprint={2006.09447},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
2019
Georgios Papoudakis, Filippos Christianos, Arrasy Rahman, Stefano V. Albrecht
Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning
arXiv:1906.04737, 2019
Abstract | BibTex | arXiv
survey, deep-rl, multi-agent-rl
Abstract:
Recent developments in deep reinforcement learning are concerned with creating decision-making agents which can perform well in various complex domains. A particular approach which has received increasing attention is multi-agent reinforcement learning, in which multiple agents learn concurrently to coordinate their actions. In such multi-agent environments, additional learning problems arise due to the continually changing decision-making policies of agents. This paper surveys recent works that address the non-stationarity problem in multi-agent deep reinforcement learning. The surveyed methods range from modifications in the training procedure, such as centralized training, to learning representations of the opponent's policy, meta-learning, communication, and decentralized learning. The survey concludes with a list of open problems and possible lines of future research.
@misc{papoudakis2019dealing,
title={Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning},
author={Georgios Papoudakis and Filippos Christianos and Arrasy Rahman and Stefano V. Albrecht},
year={2019},
eprint={1906.04737},
archivePrefix={arXiv},
primaryClass={cs.LG}
}