EPyMARL 2.0: New Features and Major Update to our Environments

Date: 2024-07-26

Our research group has developed and open-sourced several libraries for multi-agent reinforcement learning (MARL) research. Most prominently, we released and actively maintain the following MARL libraries:

EPyMARL, released in 2021 as part of our NeurIPS benchmark paper [1], is a modular and extensible codebase that provides a unified interface for a variety of MARL algorithms and is designed to facilitate MARL research. We are very glad that since its release, EPyMARL has been used by many researchers in the community (e.g. [2, 3]) and continues to be used for MARL research. However, EPyMARL and our environments depended on the OpenAI Gym library, which has been deprecated and is no longer maintained. In addition, EPyMARL was originally designed for fully cooperative common-reward multi-agent settings and did not support training agents in environments with different reward functions for different agents despite several algorithms implemented within EPyMARL theoretically supporting this.

We are excited to announce that we have updated our environments and EPyMARL to address these limitations and add several new features. In this blog post, we provide an overview of the updated environments as well as introduce new features of EPyMARL 2.0 and how to use them. In short, EPyMARL 2.0 includes the following new features:

Update to Gymnasium library, including support for new environments
Support for environments with different reward functions for different agents
Support for logging data to Weights and Biases (wandb)
Provide a simple plotting script to visualise results

Gymnasium Integration

Previous versions of our environments and EPyMARL depended on the deprecated OpenAI Gym version 0.21 which became increasingly difficult to install and rely on. We made the decision to move our libraries to instead use the maintained Gymnasium library which comes with changes to the API of environments. In order to maintain support for commonly used environments, we updated several environments alongside EPyMARL to continue supporting training with them. These include the level-based foraging (LBF) environment, the multi-robot warehouse (RWARE) environment, SMAClite, and a library for matrix games.

With the change to Gymnasium comes a change in the API of environments. Most notably, resetting the environment for a new episode also returns a dictionary with potential metadata, the termination flag (often called "done") is returned as a single boolean instead of a list, and the step function returns an additional boolean indicator of whether the episode has been truncated (for more information on the difference between termination and truncation, see this page). Below, we provide an example of how the new gymnasium interface of our environments looks like at the example of an LBF task:

New Gymnasium Interface

import gymnasium as gym
env = gym.make("lbforaging:Foraging-8x8-2p-3f-v3")

env.unwrapped.n_agents  # 2
env.action_space  # Tuple(Discrete(6), Discrete(6))
env.observation_space  # Tuple(Box(XX,), Box(XX,))

obs, info = env.reset(seed=42)  # a tuple of observations; can initialise for a particular seed
done = False

while not done:
    actions = env.action_space.sample()  # the action space can be sampled
    print(actions)  # e.g. (1, 0)
    nobs, rewards, done, truncated, info = env.step(actions)
    print(done)  # e.g. False
    print(truncated)  # e.g. False
    print(rewards)  # e.g. [0.0, 0.0]

env.close()

To ensure continuing compatibility of EPyMARL with the StarCraft multi-agent challenge (SMAC) environment, we provide a wrapper that integrates the environment into the Gymnasium interface. Similarly, we also provide wrappers for the following environments which were previously not (fully) supported: SMACv2, SMAClite, and PettingZoo.

If you would like to run experiments in environments which are only supported under the deprecated OpenAI Gym environment, please refer to these instructions to load gym environments in gymnasium or use the previous version v1.0.0 of EPyMARL.

Support for Environments with Individual Reward Functions

Previously, EPyMARL only supported training MARL algorithms in environments with common rewards, i.e. all agents receive the same reward at every time step. To support environments which naturally provide individual rewards for agents (e.g. LBF and RWARE), we previously scalarised the rewards of all agents using a sum operation to obtain a single common reward that was then given to all agents.

With the most recent update, EPyMARL now natively supports training agents in environments with different reward functions for different agents, i.e. each agent can receive its own reward at every time step. We note that not all MARL algorithms support training in environments with individual rewards. Below, we list all MARL algorithms implemented in EPyMARL and whether they support training in any environment (individual or common rewards) or only in environments with common rewards:

Algorithms that support general-sum reward envs: IA2C, IPPO, MAA2C, MAPPO, IQL, PAC
Algorithms that only support common-reward envs: COMA, VDN, QMIX

To preserve compatibility with previous experiments, EPyMARL still defaults to training algorithms with common rewards and scalarises the rewards of all agents into a single common reward. To allow for more control over the scalarisation of rewards, we provide a new flag "reward_scalarisation'' which can be set to "sum" or "mean". By default, the scalarisation is set to "sum" to match the original EPyMARL scalarisation. This scalarisation only takes effect when training in environments with common rewards, marked by the configuration flag "common_reward'' (set to True by default).

To train agents in environments with individual rewards, set the configuration flag "common_reward'' to False. For example, to train MAPPO in an LBF task with individual rewards, run the following command:

python src/main.py --config=mappo --env-config=gymma with env_args.time_limit=50 env_args.key="lbforaging:Foraging-8x8-2p-3f-v3" common_reward=False

Identical to previous versions of EPyMARL, for experiments with common rewards, we log the mean and standard deviation of returns computed over the common rewards of all agents. For experiments with individual rewards, we log the mean and standard deviation of returns of each agent separately, as well as the mean and standard deviation of the sum of all individual rewards (total returns).

Logging to Weights and Biases

Weights and Biases (wandb) emerged as a popular tool to track experiments, log data, store model checkpoints, and do hyperparameter tuning. Due to its popularity, we decided to integrate wandb into EPyMARL which can now natively log data to your wandb dashboard. To use wandb, you first need to install it via pip and authenticate the library with your account. We refer to the wandb quickstart guide for further instructions. To enable wandb logging in EPyMARL, you need to specify the following configuration options:

use_wandb: True # Log results to wandb (by default disabled)
wandb_team: "your_team_name" # wandb team name (needed!)
wandb_project: "your_project_name" # wandb project name (needed!)
wandb_mode: "offline" # wandb mode (online/offline)
wandb_save_model: False # Save models to wandb

You need to specify which team and project wandb should log the run to. Furthermore, you can specify the mode of logging (directly online or offline) and whether to save model checkpoints to wandb. For model checkpoints, we only save these to wandb if model checkpointing is generally enabled in the configuration ("save_model" is set to True).

Plotting Script

To simplify the process of visualising results, we now provide a simple plotting script that can generate plots of any metric throughout training. Below, we provide a list of all arguments and their functionality before showing some examples of how the script can be used:

python plot_results.py --path path/to/results/dir [--metric log_metric --filter_by_algs alg1_to_show alg2_to_show --filter_by_envs env1_to_show env2_to_show --save_dir path/to/save/plots --y_min yaxis_min --y_max yaxis_max --log_scale --smoothing_window window_size --best_per_alg]

--path path/to/results/dir: Path to the directory containing the results to plot.
--metric log_metric: The metric to plot. This can be any metric logged during training, e.g. "test_return_mean" for the average returns (of common rewards) during evaluations, "test_total_return_mean" for the average total returns (sum of individual rewards) during evaluations, or "entropy" for the entropy of the policy in policy gradient algorithms.
--filter_by_algs alg1_to_show alg2_to_show ...: Only plot results for the specified algorithms. This can be a list of algorithms separated by spaces, e.g. "mappo iql". Algorithms are matched by substrings, so "ia2c" will also match with "ia2c_ns"!
--filter_by_envs env1_to_show env2_to_show ...: Only plot results for all environments that contain any of the listed strings separated by spaces. E.g. "Foraging" or "rware" to only visualise LBF and RWARE experiments.
--save_dir path/to/save/plots: Path to the directory where the plots should be saved (by default current working directory).
--y_min yaxis_min: Minimum value for the y-axis (or none if not provided).
--y_max yaxis_max: Maximum value for the y-axis (or none if not provided).
--log_scale: Use a logarithmic scale for the y-axis.
--smoothing_window window_size: Size of the window for smoothing the plotted data. Values are averaged within this window.
--best_per_alg: Plot only the best configuration per algorithm.

The plotting script will go through all subdirectories in the specified path of results and search for the "metrics.json" and "config.json" files which contains the logged data and configuration of each run. It will then group the results by environment and algorithm and by default generate a single plot for each environment. These plots show the time steps on the x-axis, the chosen metric on the y-axis for all found algorithms and configurations, and a legend with algorithm names (and differences in configurations in multiple configurations are found per algorithm; more on that below). The script computes the mean and standard deviation across all runs with the same configuration (excluding random seeds) and visualises the mean performance as lines and the standard deviation across all runs of the same configuration as shading. The generated plots will all be saved in the specified save directory. This means you will only need to run the plotting script once to generate all plots for all experiments in a directory of results. Below, we will show some examples of particular use cases of the plotting script.

Visualising gridsearch results: After running a gridsearch over multiple hyperparameter configurations of an algorithm, you can use the plotting script to visualise the results of all configurations in a single plot. For example, we might run the following command to visualise the average total returns of all configurations of IA2C in all environments:

python plot_results.py --path results --metric "test_total_return_mean" --filter_by_algs ia2c --save_dir path/to/save/plots

This will generate a plot like the one shown below. The title of the plot shows the environment, the y-axis the chosen metric, and the legend lists all plotted configurations. The script automatically identifies all hyperparameters that vary across found configurations and includes them in the legend, and we sort the legend by the average performance of the particular metric. This allows for easy identification of the best configuration. Whenever there are many algorithms and/ or configurations to visualise, we plot the legend below the plot to avoid cluttering the plot itself.

Visualising best results: After running a gridsearch of potentially multiple algorithms, we might want to compare only the best identified configuration of all algorithms with each other. For example, we might have run a gridseach of IA2C with and without parameter sharing and would like to visualise the average total returns of only the best configuration of both algorithms. We can use the following command to achieve this:

python plot_results.py --path results --metric "test_total_return_mean" --save_dir path/to/save/plots --best_per_alg

This will generate a plot like the one shown below. The title and axes are labelled similarly to before, but as we can see the legend is now shown within the plot rather than below (as there are fewer algorithms to show) and we only show the algorithm names in the legend since those already sufficiently identify the algorithms when plotting only the configuration.

Citing EPyMARL

If you use EPyMARL in your work, please cite:

@inproceedings{papoudakis2021benchmarking,
       title = {Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks},
       author = {Georgios Papoudakis and Filippos Christianos and Lukas Schäfer and Stefano V. Albrecht},
       booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS)},
       year = {2021},
       url = {http://arxiv.org/abs/2006.07869},
       openreview = {https://openreview.net/forum?id=cIrPX-Sn5n},
       code = {https://github.com/uoe-agents/epymarl},
    }

References

Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. "Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks." In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
Torbati, Reza J., Shubham Lohiya, Shivika Singh, Meher S. Nigam, and Harish Ravichandar. "MARBLER: An Open Platform for Standardized Evaluation of Multi-Robot Reinforcement Learning Algorithms." In 2023 International Symposium on Multi-Robot and Multi-Agent Systems (MRS), pp. 57-63. IEEE, 2023.
Leroy, Pascal, Pablo G. Morato, Jonathan Pisane, Athanasios Kolios, and Damien Ernst. "IMP-MARL: a suite of environments for large-scale infrastructure management planning via MARL." In Advances in Neural Information Processing Systems 36, 2024.