Master's Dissertations in Reinforcement Learning
Since the foundation of our research group, a number of MSc (Master of Science) students have completed their dissertation projects with us. These dissertations cover a range of topics in reinforcement learning, and some have led to publications in top-tier conferences. This blog post lists the MSc dissertations completed in our group with a focus on reinforcement learning.
Context: In the School of Informatics, the MSc degree is typically a 1-year programme with a final dissertation project completed over a period of around 3 months.
2024
Integrating Agent Modelling in Multi-Agent Reinforcement Learning Algorithms
Author: Rahat Santosh
In multi-agent systems, agents often rely on implicit modelling, where they infer the behaviours of others through indirect observation, which can result in slower learning and suboptimal performance. In contrast, explicit agent modelling, where agents directly anticipate and adapt to the strategies of their peers, has the potential to significantly enhance learning efficiency and outcomes. This dissertation explores the potential of explicit agent modelling within Multi-Agent Reinforcement Learning (MARL), with a focus on autoencoder-based techniques designed to capture and predict opponent behaviours with greater precision.
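As a rough illustration of the autoencoder-based agent modelling this abstract refers to, the sketch below trains an encoder-decoder over observed opponent trajectory windows and exposes the latent embedding so it can condition the controlled agent's policy. The class name, network sizes, and training loop are illustrative assumptions, not the dissertation's implementation.

```python
# Minimal sketch (PyTorch assumed): an autoencoder over observed opponent
# trajectory windows whose latent code can condition the learning agent's policy.
# Names and dimensions are illustrative, not taken from the dissertation.
import torch
import torch.nn as nn

class OpponentAutoencoder(nn.Module):
    def __init__(self, traj_dim: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(traj_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, traj_dim)
        )

    def forward(self, traj: torch.Tensor):
        z = self.encoder(traj)      # latent summary of opponent behaviour
        recon = self.decoder(z)     # reconstruction used as the training signal
        return z, recon

# One training step: reconstruct flattened opponent observation/action windows;
# z.detach() could then be fed as an extra input to the RL agent's policy network.
model = OpponentAutoencoder(traj_dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(8, 32)          # placeholder opponent trajectory windows
z, recon = model(batch)
loss = nn.functional.mse_loss(recon, batch)
opt.zero_grad()
loss.backward()
opt.step()
```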
2023
Out-of-Distribution Dynamics Detection in Deep Reinforcement Learning
Author: Ivalin Chobanov
Detecting Out-of-Distribution Dynamics (OODD) in Deep Reinforcement Learning (DRL) is a problem of significant importance for real-world applications employing DRL agents. Failure to detect OODD could lead to serious problems or even result in life-threatening scenarios. For example, an autonomous vehicle could crash due to its inability to drive in an environment with OODD. So far, research in the area has focused on noise-free environments with deterministic transitions. The purpose of this project is to develop a framework that can operate in non-deterministic and noisy environments. Our results demonstrate that the developed framework is capable of detecting OODD in both noisy and non-deterministic environments. This makes our framework a step towards developing OODD detection methods for noisy, non-deterministic environments.
Environment Design for Cooperative AI
Author: Jake Amo
As artificial intelligence becomes more integrated into our society, the need for strong cooperation between agents, both virtual and real, becomes increasingly evident. Cooperative multi-agent reinforcement learning (MARL) aims to promote vital cooperative capabilities by training agents on specific tasks across a range of environments, such that agents may learn to solve the task through cooperation. However, the current landscape of cooperative MARL environments on which agents can be trained and develop these capabilities is limited, hindering progress in the field. This project presents three novel environments, each built for testing and developing different cooperative capabilities, thereby expanding the collection of environments used in cooperative MARL research. Designed with cooperation and flexibility at their core, these environments are presented and benchmarked to act as a catalyst for future research, and we hope they will provide a valuable contribution to the cooperative MARL research community for many years to come.
2022
Revisiting Discrete Gradient Estimation in MADDPG
Author: Callum Tilbury
MADDPG is an algorithm in Multi-Agent Reinforcement Learning that extends the popular single-agent method, DDPG, to multi-agent scenarios. Importantly, DDPG is an algorithm designed for continuous-action spaces, where the gradient of the state-action value function exists. For this algorithm to work in discrete-action spaces, then, discrete gradient estimation must be performed. For MADDPG, this is done using the Gumbel-Softmax estimator, a reparameterisation which relaxes a discrete distribution into a somewhat-similar continuous one. This method, however, is statistically biased, and some authors believe that this bias makes MADDPG perform poorly in grid-world situations, where the action-space is discrete. Fortunately, many alternatives to the Gumbel-Softmax exist, boasting a wide range of properties. This project details a handful of these alternatives, sourced from the literature, and integrates them into MADDPG for discrete grid-world scenarios. The corresponding impact on various performance metrics is then measured and analysed. It is found that one of the proposed estimators performs significantly better than the original Gumbel-Softmax in several tasks, both in the returns achieved and the time it takes for the training to converge. Based on these results, it is argued that a rich field of future work exists.
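For readers unfamiliar with the estimator under discussion, here is a minimal sketch of straight-through Gumbel-Softmax sampling as it is typically used to relax a discrete actor output; the temperature and tensor shapes are illustrative assumptions rather than the project's exact configuration.

```python
# Minimal sketch of the (straight-through) Gumbel-Softmax relaxation that
# MADDPG-style actors use to backpropagate through discrete action choices.
# Temperature and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0,
                          hard: bool = True) -> torch.Tensor:
    # Sample Gumbel(0, 1) noise and add it to the logits.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)  # relaxed, differentiable
    if not hard:
        return y_soft
    # Straight-through trick: the forward pass uses the one-hot argmax,
    # the backward pass uses the gradient of the soft sample.
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return y_hard + (y_soft - y_soft.detach())

actions = gumbel_softmax_sample(torch.randn(4, 5), tau=0.5)  # 4 agents, 5 actions
```

PyTorch also ships an equivalent built-in, `torch.nn.functional.gumbel_softmax`, which provides the same soft and straight-through variants.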
Value Decomposition based on the Actor-Critic Framework for Cooperative Multi-Agent Reinforcement Learning
Author: Yuyang Zhou
With the development of deep learning, Multi-Agent Reinforcement Learning (MARL) has become a popular method for multi-agent decision-making tasks. In recent years, algorithms under the Multi-Agent Actor-Critic (MAAC) framework, such as MAA2C, MADDPG, and MAPPO, have achieved competitive returns in cooperative MARL. However, the credit assignment problem remains challenging in MAAC algorithms. Although some MAAC methods, such as COMA, try to mitigate it, they do not achieve competitive returns. Value Decomposition (VD) methods have therefore been proposed to mitigate the credit assignment problem, but they sometimes cannot achieve returns as high as MAAC methods. Most recently, the Value Decomposition Actor-Critic (VDAC) framework, which applies VD to MAAC methods, was proposed. However, these algorithms are implemented with differing implementation details, so the advantages of VDAC methods remain unclear. We therefore reimplement VD methods for MAA2C and MADDPG with the same implementation details and compare four VDAC algorithms to their original VD and MAAC counterparts across a total of eight different tasks. The results of our experiments show that VDAC methods are well suited to environments containing numerous agents. Finally, we empirically find that MAA2C+VD methods mitigate the credit assignment problem in the MAAC framework.
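As a rough illustration of the value decomposition idea referenced above, the sketch below implements the simplest (VDN-style) decomposition, where the joint action value is taken to be the sum of per-agent utilities; the class names and network sizes are assumptions for illustration, not the dissertation's VDAC implementations.

```python
# Minimal sketch of VDN-style value decomposition: the joint action value is
# the sum of per-agent utilities, so a single team reward can train all agents
# while credit flows back through each agent's own utility network.
# Architecture and names are illustrative assumptions.
import torch
import torch.nn as nn

class AgentUtility(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)        # per-agent Q-values, one per action

def joint_value(utilities, obs, actions):
    # obs: [n_agents, batch, obs_dim]; actions: [n_agents, batch] (long tensor)
    per_agent_q = [
        u(o).gather(-1, a.unsqueeze(-1)).squeeze(-1)
        for u, o, a in zip(utilities, obs, actions)
    ]
    return torch.stack(per_agent_q, dim=0).sum(dim=0)  # Q_tot = sum_i Q_i
```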
StarCraft II "Lite" Environment For Multi-Agent Reinforcement Learning Research
Author: Adam Michalski
With artificially intelligent agents entering more and more areas of our daily lives, we need methods to ensure that only the best algorithms are deployed, to guarantee our safety and quality of life; yet there is a clear lack of standard benchmarks for Multi-Agent Reinforcement Learning (MARL) algorithms. The StarCraft Multi-Agent Challenge (SMAC) claims to be a good candidate for a standard MARL benchmark, but it is built entirely on top of the heavy, closed-source computer game StarCraft II. This makes it largely closed to extension for researchers not familiar with the game, and any meaningful contribution to the challenge requires knowledge and use of other proprietary tools specific to the game. We introduce SMAClite, a challenge based on SMAC that is both completely decoupled from StarCraft II and open-source, along with a framework which makes it possible to create new content for SMAClite without any special knowledge. We conduct experiments to show that it is equivalent to SMAC in many ways, and that it outperforms SMAC in both speed and memory usage.
SMAC-lite: A lightweight SMAC-based environment for Multi-agent Reinforcement Learning Research
Author: Tawqir Sohail
Multi-agent reinforcement learning (MARL) is an active area of research in machine learning with many important practical applications. The StarCraft Multi-Agent Challenge (SMAC), a set of fully-cooperative, partially observable MARL test environments using StarCraft II, is an important standardized benchmark in this field. Despite its value, SMAC remains inaccessible to many researchers as it requires expensive computer hardware to use in practice. This project introduces SMAC-lite, a MARL test environment that can be used as an alternative to SMAC, based on SMAC's 5m vs 6m scenario. We show that SMAC-lite is significantly faster than SMAC whilst achieving comparable performance with the QMIX and IQL MARL algorithms. With this project, we aim to introduce and lay the groundwork for an alternative to SMAC that is more accessible and inclusive for researchers working in this field.
2021
Data Collection for Policy Evaluation in Reinforcement Learning
Author: Rujie Zhong
In reinforcement learning (RL), policy evaluation is the task of estimating the expected return obtained when deploying a particular policy (called the evaluation policy) in the real environment. One common strategy for policy evaluation is On-policy Sampling (OS), which collects data by sampling actions from the evaluation policy and estimates the value by averaging the collected returns. However, due to sampling error, the empirical distribution of the data collected by OS can differ arbitrarily from the expected distribution under the evaluation policy, so the value estimate can suffer from large variance. In this work, we propose two data collection strategies, Robust On-policy Sampling and Robust On-policy Acting, both of which take the previously collected data into account to reduce sampling error in future data collection. Experiments in different RL domains show that, given the same amount of data, this low-sampling-error data generally enables more accurate policy value estimation.
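To make the baseline concrete, a standard on-policy Monte Carlo estimate of the evaluation policy's value (the OS strategy described above) can be written as follows; the notation is generic textbook notation rather than the dissertation's.

```latex
% On-policy Sampling (OS): roll out n trajectories tau_i from pi_e, compute each
% trajectory's (discounted) return, and average them to estimate the value.
\[
  \hat{V}^{\mathrm{OS}}(\pi_e)
    \;=\; \frac{1}{n} \sum_{i=1}^{n} G(\tau_i),
  \qquad
  G(\tau_i) \;=\; \sum_{t=0}^{T_i} \gamma^{t}\, r_t^{(i)},
  \qquad \tau_i \sim \pi_e .
\]
```

Because the trajectories are a finite sample, their empirical state-action visitation can deviate from its expectation under the evaluation policy; this deviation is the sampling error that the two proposed strategies aim to reduce.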
Reinforcement Learning with Non-Conventional Value Function Approximation
Author: Antreas Tsiakkas
The use of a function approximation model to represent the value function has been a major contributor to the success of Reinforcement Learning methods. Despite its importance, the choice of model defaults to either a linear or neural network-based approach, mainly due to their good empirical performance, the vast amount of research behind them, and the resources available. Still, these conventional approaches are not without limitations, and alternative approaches could offer advantages under certain criteria. Yet these remain under-utilised due to a lack of understanding of the problems or requirements where their use would be beneficial. This dissertation provides empirical results on the performance, reliability, sample efficiency, training time, and interpretability of non-conventional value function approximation methods, which are evaluated under a consistent evaluation framework and compared to the conventional approaches. This allows the identification of the relative strengths and weaknesses of each model, which are highly dependent on the environment and task at hand. Results suggest that both the linear and neural network models suffer from sample inefficiency, whilst alternative models, such as support vector regression, are significantly more sample efficient. Further, the neural network model is widely considered a black-box model, whilst alternative models, such as decision trees, offer interpretable architectures. Both limitations have become increasingly important in the adoption of Reinforcement Learning models in real-world applications, where they are required to be efficient, transparent and accountable. Hence, this work can inform future research and promote the use of non-conventional approaches as a viable alternative to the conventional ones.
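As a hedged illustration of how a non-conventional approximator such as support vector regression can stand in for a neural value function, the sketch below runs one round of generic fitted Q-iteration with scikit-learn's SVR; the feature construction, hyperparameters, and function names are assumptions for illustration and not the dissertation's evaluation framework.

```python
# Minimal sketch: one iteration of fitted Q-iteration using scikit-learn's SVR
# as the value function approximator instead of a neural network.
# Data shapes, kernel choice, and names are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

def fitted_q_iteration_step(transitions, q_model, n_actions, gamma=0.99):
    """transitions: list of (state, action, reward, next_state, done) with
    states as 1-D feature vectors and actions as integer indices."""
    states = np.array([t[0] for t in transitions])
    actions = np.array([t[1] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])
    dones = np.array([t[4] for t in transitions], dtype=float)

    def features(s, a):
        # Simple state-action features: state vector plus a one-hot action.
        one_hot = np.eye(n_actions)[a]
        return np.concatenate([s, one_hot], axis=-1)

    # Bootstrapped targets: r + gamma * max_a' Q(s', a'), zeroed on terminal steps.
    if q_model is None:
        targets = rewards
    else:
        next_q = np.stack([
            q_model.predict(features(next_states, np.full(len(next_states), a)))
            for a in range(n_actions)
        ], axis=1)
        targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

    new_model = SVR(kernel="rbf", C=1.0)
    new_model.fit(features(states, actions), targets)
    return new_model
```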
Multi-Agent Deep Reinforcement Learning: Revisiting MADDPG
Author: Joshua Wilkins
Multi-agent reinforcement learning (MARL) is a rapidly expanding field at the forefront of current research into artificial intelligence. We examine MADDPG, one of the first MARL algorithms to use deep reinforcement learning, on discrete-action environments to determine whether its use of the Gumbel-Softmax impacts its performance in terms of average and maximum returns. Our findings suggest that while the Gumbel-Softmax negatively impacts performance, the deterministic policy that necessitates its use performs far better with the multi-agent actor-critic method than the stochastic alternative we propose without the Gumbel-Softmax.
Mitigating Exploitation Caused by Incentivization in Multi-Agent Reinforcement Learning
Author: Paul Chelarescu
This thesis focuses on the issue of exploitation enabled by reinforcement learning agents being able to incentivize each other via reward-sending mechanisms. Motivated by creating cooperation in the face of sequential social dilemmas, incentivization is a method of allowing agents to directly shape the learning process of other agents by influencing their rewards to incentivize cooperative actions. This method is versatile, but it also leads to scenarios in which agents can be exploited through their collaboration, when their environmental returns together with their received incentives end up being lower than if they had rejected collaboration and independently accumulated only environmental rewards. In my work, I have defined this behavior as exploitation via a metric concerned with how a set of decision makers can act to maximize their cumulative rewards in the presence of each other. I have expanded the action space to include a "reject reward" action that allows an agent to control whether it receives incentives from others, in order to prevent exploitation. My results indicate that cooperation without exploitation is a delicate scenario to achieve, and that rejecting rewards from others usually leads to failure of cooperation.
Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?
Author: Panagiotis Kyriakou
Reinforcement learning is a machine learning sub-field in which an agent performs sequential decision making and learns through trial and error inside a predefined environment. An important design decision for a reinforcement learning algorithm is the return formulation, which specifies the expected future return that the agent receives after taking an action in a given environment state. In continuing tasks with value function approximation (VFA), average rewards and discounted returns can both be used as the return formulation, but it is unclear how the two formulations compare empirically. This dissertation aims at empirically comparing the two return formulations. We experiment with three continuing tasks of varying complexity, three learning algorithms and four different VFA methods. We conduct three experiments investigating the average performance over multiple hyperparameters, the performance with near-optimal hyperparameters, and the hyperparameter sensitivity of each return formulation. Our results show an apparent performance advantage in favour of the average reward formulation because it is less sensitive to hyperparameters. Once hyperparameters are optimized, the two formulations seem to perform similarly.
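For readers who want the two formulations side by side, a standard way to write them for a policy in a continuing task is shown below; this is textbook notation rather than anything specific to the dissertation.

```latex
% Discounted return: geometrically down-weight future rewards with gamma < 1.
\[
  G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k+1},
  \qquad 0 \le \gamma < 1 .
\]
% Average reward: the long-run reward rate of the policy pi, with values defined
% relative to this rate via the differential return.
\[
  r(\pi) \;=\; \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[ R_t \mid \pi \right],
  \qquad
  G_t^{\mathrm{diff}} \;=\; \sum_{k=0}^{\infty} \bigl( R_{t+k+1} - r(\pi) \bigr).
\]
```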
2020
Deep Reinforcement Learning with Relational Forward Models: Can Simpler Methods Perform Better?
Author: Niklas Höpner
Learning in multi-agent systems is difficult due to the multi-agent credit assignment problem: agents must obtain an understanding of how their reward depends on the other agents' actions. One approach to encourage coordination between a learning agent and teammate agents is to augment deep reinforcement learning algorithms with an agent model. Relational forward models (RFM) (Tacchetti et al., 2019), a class of recurrent graph neural networks, try to leverage the relational inductive bias inherent in the set of agents and environment objects to provide a powerful agent model. Augmenting an actor-critic algorithm with action predictions from an RFM has been shown to improve the sample efficiency of the actor-critic algorithm. However, it is an open question whether agents equipped with an RFM also reach a fixed level of reward in less computation time than agents without an embedded RFM; since utilizing an RFM introduces a computational overhead, this is not necessarily the case. To investigate this question, we follow the experimental methodology of Tacchetti et al. (2019) but additionally measure the computation time of the learning algorithm with and without an agent model. Further, we look into whether replacing the RFM with a computationally more efficient conditional action frequency (CAF) model provides an option for a better trade-off between sample efficiency and computational complexity. We replicate the improvement in sample efficiency caused by incorporating an RFM into the learning algorithm on one of the environments proposed by Tacchetti et al. (2019). Further, we show that in terms of computation time an agent without an agent model achieves the same reward up to ten hours faster. This suggests that, in practice, reporting only the sample efficiency gives an incomplete picture of the properties of any newly proposed algorithm. It should therefore become standard to report both the sample efficiency of an algorithm and the computation time needed to achieve a fixed reward.
2019
Curiosity in Multi-Agent Reinforcement Learning
Author: Lukas Schäfer
Multi-agent reinforcement learning has seen considerable achievements on a variety of tasks. However, suboptimal conditions involving sparse feedback and partial observability, as frequently encountered in applications, remain a significant challenge. In this thesis, we apply curiosity as exploration bonuses to such multi-agent systems and analyse their impact on a variety of cooperative and competitive tasks. In addition, we consider modified scenarios involving sparse rewards and partial observability to evaluate the influence of curiosity on these challenges. We apply independent Q-learning and the state-of-the-art multi-agent deep deterministic policy gradient method to these tasks with and without intrinsic rewards. Curiosity is defined using pseudo-counts of observations or using models that predict environment dynamics. Our evaluation illustrates that intrinsic rewards can cause considerable instability in training without benefiting exploration. This outcome can be observed on the original tasks and, against our expectation, under partial observability, where curiosity is unable to alleviate the introduced instability. However, curiosity leads to significantly improved stability and converged performance when applied to policy-gradient reinforcement learning with sparse rewards. While the sparsity causes training of such methods to be highly unstable, additional intrinsic rewards assist training and agents show the intended behaviour on most tasks. This work contributes to understanding the impact of intrinsic rewards in challenging multi-agent reinforcement learning environments and will serve as a foundation for further research to expand on.
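To make the dynamics-prediction form of curiosity mentioned above concrete, the sketch below computes an intrinsic reward as the prediction error of a learned forward model; the network size, scaling coefficient, and names are illustrative assumptions and not the thesis's implementation.

```python
# Minimal sketch of a prediction-error curiosity bonus: a forward model predicts
# the next observation from the current observation and action, and the squared
# prediction error is added to the environment reward as an intrinsic bonus.
# Sizes, the scaling coefficient, and names are illustrative assumptions.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                                 nn.Linear(128, obs_dim))

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # act is assumed to be a one-hot (or continuous) action vector.
        return self.net(torch.cat([obs, act], dim=-1))

def intrinsic_reward(model: ForwardModel, obs, act, next_obs, eta: float = 0.1):
    # Curiosity bonus = scaled squared error of the forward model's prediction.
    with torch.no_grad():
        pred = model(obs, act)
    return eta * (pred - next_obs).pow(2).mean(dim=-1)

# During training, each agent would fit the forward model on observed transitions
# and add intrinsic_reward(...) to its environment reward.
```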
Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?
Author: Lucas Descause
The discounted return, a common quantity used in Reinforcement Learning to estimate the value of a state or action, has been claimed to be ill-defined for continuing tasks using value function approximation. Instead, the average reward has been proposed as an alternative. We conduct an empirical comparison of discounted returns and average rewards on three different continuing tasks using various value function approximation methods. Results show a clear performance advantage for average rewards. Through further experiments we show that this advantage may not be caused by partial observability, as hypothesized by the claims against discounted returns. Instead, we show that this advantage may be due to average reward values requiring a simpler function to estimate than discounted returns.
Variational Autoencoders for Opponent Modelling in Multi-Agent Systems
Author: Georgios Papoudakis
Multi-agent systems exhibit complex behaviors that come from the interactions of multiple agents in a shared environment. Therefore, modelling the behavior of other agents is a key requirement in order to better understand the local perspective of each agent in the system. In this thesis, taking advantage of recent advances in unsupervised learning, we propose an approach to model opponents using variational autoencoders. Additionally, most existing methods in the literature assume that models of opponents require access to opponents' observations and actions both during training and execution. To address this, we propose an inference method that tries to identify the underlying opponent model using only local information, such as the observation, action and reward sequence of our agent. We provide a thorough experimental evaluation of the proposed methods, both for reinforcement learning and for unsupervised learning. The experiments indicate that our opponent modelling method achieves equal or better performance in reinforcement learning tasks than another recent modelling method and other baselines.
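The snippet below is a minimal sketch of the variational autoencoder machinery involved: an encoder maps a trajectory segment to a Gaussian latent code via the reparameterisation trick, and that code can then condition the agent's policy. All names, sizes, and the loss weighting are illustrative assumptions rather than the thesis's architecture.

```python
# Minimal sketch of a VAE over trajectory segments for opponent modelling:
# the latent code z summarising observed behaviour can be fed to the agent's
# policy. Names, sizes, and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn

class TrajectoryVAE(nn.Module):
    def __init__(self, traj_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.log_var = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, traj_dim))

    def forward(self, traj: torch.Tensor):
        h = self.encoder(traj)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterise
        return self.decoder(z), mu, log_var, z

def vae_loss(recon, traj, mu, log_var, beta: float = 1.0):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon_loss = nn.functional.mse_loss(recon, traj, reduction="mean")
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + beta * kl
```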
Exploration in Multi-Agent Deep Reinforcement Learning
Author: Filippos Christianos
Much emphasis in reinforcement learning research is placed on exploration, ensuring that optimal policies are found. In multi-agent deep reinforcement learning, efficiency in exploration is even more important since the state-action spaces grow beyond our computational capabilities. In this thesis, we motivate and experiment on coordinating exploration between agents, to improve efficiency. We propose three novel methods of increasing complexity, which coordinate agents that explore an environment. We demonstrate through experiments that coordinated exploration outperforms non-coordinated variants.
2018
Communication and Cooperation in Decentralized Multi-Agent Reinforcement Learning
Author: Nico Ring
Many real-world problems can only be solved when multiple agents work together, and each agent often has only a partial view of the whole environment. Therefore, communication and cooperation are important abilities for agents in these environments. In this thesis, we test the ability of an existing deep multi-agent actor-critic algorithm to cope with partially observable scenarios. In addition, we propose to adapt this algorithm to use recurrent neural networks, which enables it to use information from past steps to improve action-value predictions. We evaluate the algorithms in environments that require cooperation and communication. Furthermore, we simulate varying degrees of partial observability of the actions and observations of other agents. Our experiments show that our proposed recurrent adaptation achieves equally good or significantly better success rates compared to the original method when only the actions of other agents are partially observed. However, in other settings our adaptation can also perform worse than the original method, especially when the observations of other agents are only partially observed.