Master's Dissertations in Reinforcement Learning
Since the foundation of our research group, a number of MSc (Master of Science) students have completed their dissertation projects with us. These dissertations cover a range of topics in reinforcement learning, and some have led to publications in top-tier conferences. This blog post lists the MSc dissertations completed in our group with a focus on reinforcement learning.
Context: In the School of Informatics, the MSc degree is typically a 1-year programme with a final dissertation project completed over a period of around 3 months.
2024
Integrating Agent Modelling in Multi-Agent Reinforcement Learning Algorithms
Author: Rahat Santosh
In multi-agent systems, agents often rely on implicit modelling, where they infer the behaviours of others through indirect observation, which can result in slower learning and suboptimal performance. In contrast, explicit agent modelling, where agents directly anticipate and adapt to the strategies of their peers, has the potential to significantly enhance learning efficiency and outcomes. This dissertation explores the potential of explicit agent modelling within Multi-Agent Reinforcement Learning (MARL), with a focus on autoencoder-based techniques designed to capture and predict opponent behaviours with greater precision.
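As a rough illustration of the autoencoder-based agent modelling this abstract refers to, the sketch below trains an encoder-decoder over observed opponent trajectory windows and exposes the latent embedding so it can condition the controlled agent's policy. The class name, network sizes, and training loop are illustrative assumptions, not the dissertation's implementation.

```python
# Minimal sketch (PyTorch assumed): an autoencoder over observed opponent
# trajectory windows whose latent code can condition the learning agent's policy.
# Names and dimensions are illustrative, not taken from the dissertation.
import torch
import torch.nn as nn

class OpponentAutoencoder(nn.Module):
    def __init__(self, traj_dim: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(traj_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, traj_dim)
        )

    def forward(self, traj: torch.Tensor):
        z = self.encoder(traj)      # latent summary of opponent behaviour
        recon = self.decoder(z)     # reconstruction used as the training signal
        return z, recon

# One training step: reconstruct flattened opponent observation/action windows;
# z.detach() could then be fed as an extra input to the RL agent's policy network.
model = OpponentAutoencoder(traj_dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(8, 32)          # placeholder opponent trajectory windows
z, recon = model(batch)
loss = nn.functional.mse_loss(recon, batch)
opt.zero_grad()
loss.backward()
opt.step()
```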
2023
Out-of-Distribution Dynamics Detection in Deep Reinforcement Learning
Author: Ivalin Chobanov
Detecting Out-of-Distribution Dynamics (OODD) in Deep Reinforcement Learning (DRL) is a problem of significant importance for real-world applications employing DRL agents. Failure to detect OODD could lead to serious problems or even result in life-threatening scenarios. For example, an autonomous vehicle could crash due to its inability to drive in an environment with OODD. So far, research in the area has focused on noise-free environments with deterministic transitions. The purpose of this project is to develop a framework that can operate in non-deterministic and noisy environments. Our results demonstrate that the developed framework is capable of detecting OODD in both noisy and non-deterministic environments. This makes our framework a step towards developing OODD detection methods for noisy, non-deterministic environments.
Environment Design for Cooperative AI
Author: Jake Amo
As artificial intelligence becomes more integrated into our society, the need for strong cooperation between agents, both virtual and real, becomes increasingly evident. Cooperative multi-agent reinforcement learning (MARL) aims to promote vital cooperative capabilities by training agents on specific tasks across a range of environments, such that agents may learn to solve the task through cooperation. However, the current landscape of cooperative MARL environments on which agents can be trained and develop these capabilities is limited, hindering progress in the field. This project presents three novel environments, each built for testing and developing different cooperative capabilities, thereby expanding the collection of environments used in cooperative MARL research. Designed with cooperation and flexibility at their core, these environments are presented and benchmarked to act as a catalyst for future research, and we hope they will provide a valuable contribution to the cooperative MARL research community for many years to come.
2022
Revisiting Discrete Gradient Estimation in MADDPG
Author: Callum Tilbury
MADDPG is an algorithm in Multi-Agent Reinforcement Learning that extends the popular single-agent method, DDPG, to multi-agent scenarios. Importantly, DDPG is an algorithm designed for continuous-action spaces, where the gradient of the state-action value function exists. For this algorithm to work in discrete-action spaces, then, discrete gradient estimation must be performed. For MADDPG, this is done using the Gumbel-Softmax estimator, a reparameterisation which relaxes a discrete distribution into a somewhat-similar continuous one. This method, however, is statistically biased, and some authors believe that this bias makes MADDPG perform poorly in grid-world situations, where the action-space is discrete. Fortunately, many alternatives to the Gumbel-Softmax exist, boasting a wide range of properties. This project details a handful of these alternatives, sourced from the literature, and integrates them into MADDPG for discrete grid-world scenarios. The corresponding impact on various performance metrics is then measured and analysed. It is found that one of the proposed estimators performs significantly better than the original Gumbel-Softmax in several tasks, both in the returns achieved and the time it takes for the training to converge. Based on these results, it is argued that a rich field of future work exists.
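For readers unfamiliar with the estimator under discussion, here is a minimal sketch of straight-through Gumbel-Softmax sampling as it is typically used to relax a discrete actor output; the temperature and tensor shapes are illustrative assumptions rather than the project's exact configuration.

```python
# Minimal sketch of the (straight-through) Gumbel-Softmax relaxation that
# MADDPG-style actors use to backpropagate through discrete action choices.
# Temperature and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0,
                          hard: bool = True) -> torch.Tensor:
    # Sample Gumbel(0, 1) noise and add it to the logits.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)  # relaxed, differentiable
    if not hard:
        return y_soft
    # Straight-through trick: the forward pass uses the one-hot argmax,
    # the backward pass uses the gradient of the soft sample.
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return y_hard + (y_soft - y_soft.detach())

actions = gumbel_softmax_sample(torch.randn(4, 5), tau=0.5)  # 4 agents, 5 actions
```

PyTorch also ships an equivalent built-in, `torch.nn.functional.gumbel_softmax`, which provides the same soft and straight-through variants.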
Value Decomposition based on the Actor-Critic Framework for Cooperative Multi-Agent Reinforcement Learning
Author: Yuyang Zhou
With the development of deep learning, Multi-Agent Reinforcement Learning (MARL) has become a popular method for multi-agent decision-making tasks. In recent years, algorithms under the Multi-Agent Actor-Critic (MAAC) framework, such as MAA2C, MADDPG, and MAPPO, have achieved competitive returns in cooperative MARL. However, the credit assignment problem remains challenging in MAAC algorithms. Although some MAAC methods, such as COMA, try to mitigate it, they do not achieve competitive returns. Value Decomposition (VD) methods have therefore been proposed to mitigate the credit assignment problem, but they sometimes cannot achieve returns as high as MAAC methods. Most recently, the Value Decomposition Actor-Critic (VDAC) framework, which applies VD to MAAC methods, was proposed. However, these algorithms are implemented with differing implementation details, so the advantages of VDAC methods remain unclear. We therefore reimplement VD methods for MAA2C and MADDPG with the same implementation details and compare four VDAC algorithms to their original VD and MAAC counterparts across a total of eight different tasks. The results of our experiments show that VDAC methods are well suited to environments containing numerous agents. Finally, we empirically find that MAA2C+VD methods mitigate the credit assignment problem in the MAAC framework.
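As a rough illustration of the value decomposition idea referenced above, the sketch below implements the simplest (VDN-style) decomposition, where the joint action value is taken to be the sum of per-agent utilities; the class names and network sizes are assumptions for illustration, not the dissertation's VDAC implementations.

```python
# Minimal sketch of VDN-style value decomposition: the joint action value is
# the sum of per-agent utilities, so a single team reward can train all agents
# while credit flows back through each agent's own utility network.
# Architecture and names are illustrative assumptions.
import torch
import torch.nn as nn

class AgentUtility(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)        # per-agent Q-values, one per action

def joint_value(utilities, obs, actions):
    # obs: [n_agents, batch, obs_dim]; actions: [n_agents, batch] (long tensor)
    per_agent_q = [
        u(o).gather(-1, a.unsqueeze(-1)).squeeze(-1)
        for u, o, a in zip(utilities, obs, actions)
    ]
    return torch.stack(per_agent_q, dim=0).sum(dim=0)  # Q_tot = sum_i Q_i
```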
StarCraft II "Lite" Environment For Multi-Agent Reinforcement Learning Research
Author: Adam Michalski
With artificially intelligent agents entering more and more areas of our daily lives, we need methods to ensure that only the best algorithms are deployed, to guarantee our safety and quality of life; yet there is a clear lack of standard benchmarks for Multi-Agent Reinforcement Learning (MARL) algorithms. The StarCraft Multi-Agent Challenge (SMAC) claims to be a good candidate for a standard MARL benchmark, but it is built entirely on top of the heavy, closed-source computer game StarCraft II. This makes it largely closed to extension for researchers not familiar with the game, and any meaningful contribution to the challenge requires knowledge and use of other proprietary tools specific to the game. We introduce SMAClite, a challenge based on SMAC that is both completely decoupled from StarCraft II and open-source, along with a framework which makes it possible to create new content for SMAClite without any special knowledge. We conduct experiments to show that it is equivalent to SMAC in many ways, and that it outperforms SMAC in both speed and memory usage.
SMAC-lite: A lightweight SMAC-based environment for Multi-agent Reinforcement Learning Research
Author: Tawqir Sohail
Multi-agent reinforcement learning (MARL) is an active area of research in machine learning with many important practical applications. The StarCraft Multi-Agent Challenge (SMAC), a set of fully-cooperative, partially observable MARL test environments using StarCraft II, is an important standardized benchmark in this field. Despite its value, SMAC remains inaccessible to many researchers as it requires expensive computer hardware to use in practice. This project introduces SMAC-lite, a MARL test environment that can be used as an alternative to SMAC, based on SMAC's 5m vs 6m scenario. We show that SMAC-lite is significantly faster than SMAC whilst achieving comparable performance with the QMIX and IQL MARL algorithms. With this project, we aim to introduce and lay the groundwork for an alternative to SMAC that is more accessible and inclusive for researchers working in this field.
2021
Data Collection for Policy Evaluation in Reinforcement Learning
Author: Rujie Zhong
In reinforcement learning (RL), policy evaluation is the task of estimating the expected return obtained when deploying a particular policy (called the evaluation policy) in the real environment. One common strategy for policy evaluation is On-policy Sampling (OS), which collects data by sampling actions from the evaluation policy and estimates the value by averaging the collected returns. However, due to sampling error, the empirical distribution of the data collected by OS can differ arbitrarily from the expected distribution under the evaluation policy, so the value estimate can suffer from large variance. In this work, we propose two data collection strategies, Robust On-policy Sampling and Robust On-policy Acting, both of which take the previously collected data into account to reduce sampling error in future data collection. Experiments in different RL domains show that, given the same amount of data, this low-sampling-error data generally enables more accurate policy value estimation.
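To make the baseline concrete, a standard on-policy Monte Carlo estimate of the evaluation policy's value (the OS strategy described above) can be written as follows; the notation is generic textbook notation rather than the dissertation's.

```latex
% On-policy Sampling (OS): roll out n trajectories tau_i from pi_e, compute each
% trajectory's (discounted) return, and average them to estimate the value.
\[
  \hat{V}^{\mathrm{OS}}(\pi_e)
    \;=\; \frac{1}{n} \sum_{i=1}^{n} G(\tau_i),
  \qquad
  G(\tau_i) \;=\; \sum_{t=0}^{T_i} \gamma^{t}\, r_t^{(i)},
  \qquad \tau_i \sim \pi_e .
\]
```

Because the trajectories are a finite sample, their empirical state-action visitation can deviate from its expectation under the evaluation policy; this deviation is the sampling error that the two proposed strategies aim to reduce.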
Reinforcement Learning with Non-Conventional Value Function Approximation
Author: Antreas Tsiakkas
The use of a function approximation model to represent the value function has been a major contributor to the success of Reinforcement Learning methods. Despite its importance, the choice of model defaults to either a linear or neural network-based approach, mainly due to their good empirical performance, the vast amount of research behind them, and the resources available. Still, these conventional approaches are not without limitations, and alternative approaches could offer advantages under certain criteria. Yet these remain under-utilised due to a lack of understanding of the problems or requirements where their use would be beneficial. This dissertation provides empirical results on the performance, reliability, sample efficiency, training time, and interpretability of non-conventional value function approximation methods, which are evaluated under a consistent evaluation framework and compared to the conventional approaches. This allows the identification of the relative strengths and weaknesses of each model, which are highly dependent on the environment and task at hand. Results suggest that both the linear and neural network models suffer from sample inefficiency, whilst alternative models, such as support vector regression, are significantly more sample efficient. Further, the neural network model is widely considered a black-box model, whilst alternative models, such as decision trees, offer interpretable architectures. Both limitations have become increasingly important in the adoption of Reinforcement Learning models in real-world applications, where they are required to be efficient, transparent and accountable. Hence, this work can inform future research and promote the use of non-conventional approaches as a viable alternative to the conventional ones.
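As a hedged illustration of how a non-conventional approximator such as support vector regression can stand in for a neural value function, the sketch below runs one round of generic fitted Q-iteration with scikit-learn's SVR; the feature construction, hyperparameters, and function names are assumptions for illustration and not the dissertation's evaluation framework.

```python
# Minimal sketch: one iteration of fitted Q-iteration using scikit-learn's SVR
# as the value function approximator instead of a neural network.
# Data shapes, kernel choice, and names are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

def fitted_q_iteration_step(transitions, q_model, n_actions, gamma=0.99):
    """transitions: list of (state, action, reward, next_state, done) with
    states as 1-D feature vectors and actions as integer indices."""
    states = np.array([t[0] for t in transitions])
    actions = np.array([t[1] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])
    dones = np.array([t[4] for t in transitions], dtype=float)

    def features(s, a):
        # Simple state-action features: state vector plus a one-hot action.
        one_hot = np.eye(n_actions)[a]
        return np.concatenate([s, one_hot], axis=-1)

    # Bootstrapped targets: r + gamma * max_a' Q(s', a'), zeroed on terminal steps.
    if q_model is None:
        targets = rewards
    else:
        next_q = np.stack([
            q_model.predict(features(next_states, np.full(len(next_states), a)))
            for a in range(n_actions)
        ], axis=1)
        targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

    new_model = SVR(kernel="rbf", C=1.0)
    new_model.fit(features(states, actions), targets)
    return new_model
```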
Multi-Agent Deep Reinforcement Learning: Revisiting MADDPG
Author: Joshua Wilkins
Multi-agent reinforcement learning (MARL) is a rapidly expanding field at the forefront of current research into artificial intelligence. We examine MADDPG, one of the first MARL algorithms to use deep reinforcement learning, on discrete-action environments to determine whether its use of the Gumbel-Softmax impacts its performance in terms of average and maximum returns. Our findings suggest that while the Gumbel-Softmax negatively impacts performance, the deterministic policy that necessitates its use performs far better with the multi-agent actor-critic method than the stochastic alternative we propose without the Gumbel-Softmax.
Mitigating Exploitation Caused by Incentivization in Multi-Agent Reinforcement Learning
Author: Paul Chelarescu
This thesis focuses on the issue of exploitation enabled by reinforcement learning agents being able to incentivize each other via reward-sending mechanisms. Motivated by creating cooperation in the face of sequential social dilemmas, incentivization is a method of allowing agents to directly shape the learning process of other agents by influencing their rewards to incentivize cooperative actions. This method is versatile, but it also leads to scenarios in which agents can be exploited through their collaboration, when their environmental returns together with their received incentives end up being lower than if they had rejected collaboration and independently accumulated only environmental rewards. In my work, I have defined this behavior as exploitation via a metric concerned with how a set of decision makers can act to maximize their cumulative rewards in the presence of each other. I have expanded the action space to include a "reject reward" action that allows an agent to control whether it receives incentives from others, in order to prevent exploitation. My results indicate that cooperation without exploitation is a delicate scenario to achieve, and that rejecting rewards from others usually leads to failure of cooperation.
Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?
Author: Panagiotis Kyriakou
Reinforcement learning is a machine learning sub-field in which an agent performs sequential decision making and learns through trial and error inside a predefined environment. An important design decision for a reinforcement learning algorithm is the return formulation, which specifies the expected future return that the agent receives after taking an action in a given environment state. In continuing tasks with value function approximation (VFA), average rewards and discounted returns can both be used as the return formulation, but it is unclear how the two formulations compare empirically. This dissertation aims at empirically comparing the two return formulations. We experiment with three continuing tasks of varying complexity, three learning algorithms and four different VFA methods. We conduct three experiments investigating the average performance over multiple hyperparameters, the performance with near-optimal hyperparameters, and the hyperparameter sensitivity of each return formulation. Our results show an apparent performance advantage in favour of the average reward formulation because it is less sensitive to hyperparameters. Once hyperparameters are optimized, the two formulations seem to perform similarly.
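For readers who want the two formulations side by side, a standard way to write them for a policy in a continuing task is shown below; this is textbook notation rather than anything specific to the dissertation.

```latex
% Discounted return: geometrically down-weight future rewards with gamma < 1.
\[
  G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k+1},
  \qquad 0 \le \gamma < 1 .
\]
% Average reward: the long-run reward rate of the policy pi, with values defined
% relative to this rate via the differential return.
\[
  r(\pi) \;=\; \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[ R_t \mid \pi \right],
  \qquad
  G_t^{\mathrm{diff}} \;=\; \sum_{k=0}^{\infty} \bigl( R_{t+k+1} - r(\pi) \bigr).
\]
```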
2020
Deep Reinforcement Learning with Relational Forward Models: Can Simpler Methods Perform Better?
Author: Niklas Höpner
Learning in multi-agent systems is difficult due to the multi-agent credit assignment problem: agents must obtain an understanding of how their reward depends on the other agents' actions. One approach to encourage coordination between a learning agent and teammate agents is to augment deep reinforcement learning algorithms with an agent model. Relational forward models (RFM) (Tacchetti et al., 2019), a class of recurrent graph neural networks, try to leverage the relational inductive bias inherent in the set of agents and environment objects to provide a powerful agent model. Augmenting an actor-critic algorithm with action predictions from an RFM has been shown to improve the sample efficiency of the actor-critic algorithm. However, it is an open question whether agents equipped with an RFM also reach a fixed level of reward in less computation time than agents without an embedded RFM; since utilizing an RFM introduces a computational overhead, this is not necessarily the case. To investigate this question, we follow the experimental methodology of Tacchetti et al. (2019) but additionally measure the computation time of the learning algorithm with and without an agent model. Further, we look into whether replacing the RFM with a computationally more efficient conditional action frequency (CAF) model provides an option for a better trade-off between sample efficiency and computational complexity. We replicate the improvement in sample efficiency caused by incorporating an RFM into the learning algorithm on one of the environments proposed by Tacchetti et al. (2019). Further, we show that in terms of computation time an agent without an agent model achieves the same reward up to ten hours faster. This suggests that, in practice, reporting only the sample efficiency gives an incomplete picture of the properties of any newly proposed algorithm. It should therefore become standard to report both the sample efficiency of an algorithm and the computation time needed to achieve a fixed reward.
2019
Curiosity in Multi-Agent Reinforcement Learning
Author: Lukas Schäfer
Multi-agent reinforcement learning has seen considerable achievements on a variety of tasks. However, suboptimal conditions involving sparse feedback and partial observability, as frequently encountered in applications, remain a significant challenge. In this thesis, we apply curiosity as exploration bonuses to such multi-agent systems and analyse their impact on a variety of cooperative and competitive tasks. In addition, we consider modified scenarios involving sparse rewards and partial observability to evaluate the influence of curiosity on these challenges. We apply independent Q-learning and the state-of-the-art multi-agent deep deterministic policy gradient method to these tasks with and without intrinsic rewards. Curiosity is defined using pseudo-counts of observations or using models that predict environment dynamics. Our evaluation illustrates that intrinsic rewards can cause considerable instability in training without benefiting exploration. This outcome can be observed on the original tasks and, against our expectation, under partial observability, where curiosity is unable to alleviate the introduced instability. However, curiosity leads to significantly improved stability and converged performance when applied to policy-gradient reinforcement learning with sparse rewards. While the sparsity causes training of such methods to be highly unstable, additional intrinsic rewards assist training and agents show the intended behaviour on most tasks. This work contributes to understanding the impact of intrinsic rewards in challenging multi-agent reinforcement learning environments and will serve as a foundation for further research to expand on.
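To make the dynamics-prediction form of curiosity mentioned above concrete, the sketch below computes an intrinsic reward as the prediction error of a learned forward model; the network size, scaling coefficient, and names are illustrative assumptions and not the thesis's implementation.

```python
# Minimal sketch of a prediction-error curiosity bonus: a forward model predicts
# the next observation from the current observation and action, and the squared
# prediction error is added to the environment reward as an intrinsic bonus.
# Sizes, the scaling coefficient, and names are illustrative assumptions.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                                 nn.Linear(128, obs_dim))

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # act is assumed to be a one-hot (or continuous) action vector.
        return self.net(torch.cat([obs, act], dim=-1))

def intrinsic_reward(model: ForwardModel, obs, act, next_obs, eta: float = 0.1):
    # Curiosity bonus = scaled squared error of the forward model's prediction.
    with torch.no_grad():
        pred = model(obs, act)
    return eta * (pred - next_obs).pow(2).mean(dim=-1)

# During training, each agent would fit the forward model on observed transitions
# and add intrinsic_reward(...) to its environment reward.
```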
Reinforcement Learning with Function Approximation in Continuing Tasks: Discounted Return or Average Reward?
Author: Lucas Descause
The discounted return, a common quantity used in Reinforcement Learning to estimate the value of a state or action, has been claimed to be ill-defined for continuing tasks using value function approximation. Instead, the average reward has been proposed as an alternative. We conduct an empirical comparison of discounted returns and average rewards on three different continuing tasks using various value function approximation methods. Results show a clear performance advantage for average rewards. Through further experiments we show that this advantage may not be caused by partial observability, as hypothesized by the claims against discounted returns. Instead, we show that this advantage may be due to average reward values requiring a simpler function to estimate than discounted returns.
Variational Autoencoders for Opponent Modelling in Multi-Agent Systems
Author: Georgios Papoudakis
Multi-agent systems exhibit complex behaviors that come from the interactions of multiple agents in a shared environment. Therefore, modelling the behavior of other agents is a key requirement in order to better understand the local perspective of each agent in the system. In this thesis, taking advantage of recent advances in unsupervised learning, we propose an approach to model opponents using variational autoencoders. Additionally, most existing methods in the literature assume that models of opponents require access to opponents' observations and actions both during training and execution. To address this, we propose an inference method that tries to identify the underlying opponent model using only local information, such as the observation, action and reward sequence of our agent. We provide a thorough experimental evaluation of the proposed methods, both for reinforcement learning and for unsupervised learning. The experiments indicate that our opponent modelling method achieves equal or better performance in reinforcement learning tasks than another recent modelling method and other baselines.
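The snippet below is a minimal sketch of the variational autoencoder machinery involved: an encoder maps a trajectory segment to a Gaussian latent code via the reparameterisation trick, and that code can then condition the agent's policy. All names, sizes, and the loss weighting are illustrative assumptions rather than the thesis's architecture.

```python
# Minimal sketch of a VAE over trajectory segments for opponent modelling:
# the latent code z summarising observed behaviour can be fed to the agent's
# policy. Names, sizes, and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn

class TrajectoryVAE(nn.Module):
    def __init__(self, traj_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.log_var = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, traj_dim))

    def forward(self, traj: torch.Tensor):
        h = self.encoder(traj)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterise
        return self.decoder(z), mu, log_var, z

def vae_loss(recon, traj, mu, log_var, beta: float = 1.0):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon_loss = nn.functional.mse_loss(recon, traj, reduction="mean")
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + beta * kl
```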
Exploration in Multi-Agent Deep Reinforcement Learning
Author: Filippos Christianos
Much emphasis in reinforcement learning research is placed on exploration, ensuring that optimal policies are found. In multi-agent deep reinforcement learning, efficiency in exploration is even more important since the state-action spaces grow beyond our computational capabilities. In this thesis, we motivate and experiment on coordinating exploration between agents, to improve efficiency. We propose three novel methods of increasing complexity, which coordinate agents that explore an environment. We demonstrate through experiments that coordinated exploration outperforms non-coordinated variants.
2018
Communication and Cooperation in Decentralized Multi-Agent Reinforcement Learning
Author: Nico Ring
Many real-world problems can only be solved when multiple agents work together, and each agent often has only a partial view of the whole environment. Therefore, communication and cooperation are important abilities for agents in these environments. In this thesis, we test the ability of an existing deep multi-agent actor-critic algorithm to cope with partially observable scenarios. In addition, we propose to adapt this algorithm to use recurrent neural networks, which enables it to use information from past steps to improve action-value predictions. We evaluate the algorithms in environments that require cooperation and communication. Furthermore, we simulate varying degrees of partial observability of the actions and observations of other agents. Our experiments show that our proposed recurrent adaptation achieves equally good or significantly better success rates compared to the original method when only the actions of other agents are partially observed. However, in other settings our adaptation can also perform worse than the original method, especially when the observations of other agents are only partially observed.