SMAClite: A Lightweight Environment for Multi-Agent Reinforcement Learning

Author: Adam Michalski

Date: 2023-05-28


In the area of multi-agent reinforcement learning, there is thus far no single benchmark that is widely accepted as a standard for evaluating the performance of new algorithms. One popular candidate for such a benchmark is the StarCraft Multi-Agent Challenge (SMAC) [1], which adapts the unit micromanagement mini-games of the popular real-time strategy game StarCraft II to the multi-agent setting.

Due to its necessary dependency on the StarCraft II game engine, SMAC poses several difficulties as a prospective standard. It is not trivial to set up and run, since it requires an installation of the game StarCraft II itself. For the same reason, it is quite demanding on computational resources, and any significant modification of the environment requires expert knowledge of the StarCraft II Map Editor – a tool not many researchers are familiar with.

In this post, we present SMAClite – a lightweight version of SMAC that is designed to be easy to install, fast to run training and inference on, and more accessible to the community of reinforcement learning researchers. This is achieved through numerous improvements over the original environment, including better computational performance and a simplified system for map creation.

StarCraft Multi-Agent Challenge

The corridor scenario as visible in the SMAC (left) and SMAClite (right) environments.

In this section, we provide a brief explanation of the SMAC mini-game and its rules. For a more detailed description, we refer the reader to the original SMAC paper [1]. Inside SMAC, the environment consists of a 2D rectangular map with several soldiers, called "units", each belonging to one of two teams. The goal of each team is to defeat all units belonging to the other team. One team, called "the enemy", is controlled by StarCraft II's in-game AI, while the other team, called "the allies", is controlled by reinforcement learning agents. Each unit in the allied team is controlled by a separate agent, and they need to work as a team to defeat the enemies. At the end of each episode (i.e. when one of the teams is eliminated or when a time limit is reached), the agents receive a shared reward based on how much damage they dealt, how many units they killed, and whether they won or lost the encounter.
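
For concreteness, the shape of this shared reward can be sketched as follows. This is only an illustration based on the description above – the bonus values are assumptions, not the exact coefficients used by SMAC or SMAClite.

def shared_reward(damage_dealt: float, enemies_killed: int, won: bool,
                  kill_bonus: float = 10.0, win_bonus: float = 200.0) -> float:
    """Illustrative shared team reward: damage dealt, plus a bonus per enemy
    unit killed, plus a large bonus for winning the episode. The bonus values
    here are assumptions for illustration only."""
    return damage_dealt + kill_bonus * enemies_killed + (win_bonus if won else 0.0)

# Example: a winning episode in which the team dealt 350 damage and killed 5 enemies.
print(shared_reward(damage_dealt=350.0, enemies_killed=5, won=True))  # 600.0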

Each agent has an observation range and a targeting range – these are the same for all agents. Agents do not receive any observations about units outside of their observation range, and they can only target units within their targeting range. The observations each agent receives roughly describe each unit it can see, as well as the agent itself. As an action, each agent can move in one of the four cardinal directions (up, down, left, or right), attack a unit within its targeting range, or do nothing.
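
As a rough sketch of the resulting per-agent discrete action space (the exact action ordering and indices used by SMAC and SMAClite may differ – this is an assumption for illustration):

from enum import IntEnum

class FixedAction(IntEnum):
    """Hypothetical fixed actions matching the description above: do nothing
    or move in one of the four cardinal directions."""
    NO_OP = 0
    MOVE_UP = 1
    MOVE_DOWN = 2
    MOVE_LEFT = 3
    MOVE_RIGHT = 4

N_ENEMIES = 5  # assumed enemy count for some scenario

def describe_action(action: int) -> str:
    """Map a discrete action index to a human-readable description; indices
    beyond the fixed actions correspond to attacking a specific enemy unit."""
    if action < len(FixedAction):
        return FixedAction(action).name
    return f"ATTACK_ENEMY_{action - len(FixedAction)}"

# The full action space has one attack action per enemy on top of the fixed actions.
print([describe_action(a) for a in range(len(FixedAction) + N_ENEMIES)])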

SMAC provides a range of different scenarios, which differ in unit types, unit numbers, and map terrain layouts. In essence, each scenario can be thought of as its own "sub-environment", each presenting a unique challenge – for example, some scenarios have agents being outnumbered, some have them using weaker unit types, etc.

Tenets of SMAClite

When designing SMAClite, we started with the basic idea that it should be a lightweight version of SMAC, with "lightweight" referring primarily to the computational resources required to run it. While we kept computational performance in mind throughout the project, during implementation we gradually converged on a set of extra tenets that we believe are important for a good benchmark for multi-agent reinforcement learning algorithms.

A major tenet that influenced our implementation is a complete decoupling of the SMAClite environment from the StarCraft II game engine. This means that the environment can be run on any machine that has a Python interpreter installed, without requiring any installation of StarCraft II itself. We decided to implement the engine in Python because it is a language that is widely used in the field of reinforcement learning, and thus the code should be easily readable and extendable by all users.

Next, the resulting environment needed to be easily modifiable, without any expert knowledge about StarCraft. To this end, the environment also serves as a framework capable of loading JSON files representing the combat scenarios and units to be used in the game. All of the scenarios and units shipped with the environment are defined in these JSON files, which therefore also serve as examples of valid input files for the framework.
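
To give a flavour of this, a custom unit definition could look roughly like the sketch below, here written out from Python. The field names are assumptions chosen for readability – the actual schema is defined by the JSON files shipped with the environment.

import json

# Purely illustrative unit definition; the real field names and values are
# given by the JSON files shipped with SMAClite, not by this sketch.
custom_unit = {
    "name": "example_ranged_unit",
    "hit_points": 45,
    "damage": 6,
    "attack_range": 5.0,
    "movement_speed": 3.15,
    "radius": 0.375,
}

with open("example_ranged_unit.json", "w") as f:
    json.dump(custom_unit, f, indent=2)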

Finally, while keeping in mind all of the above, we wished the environment to be as close to the original SMAC as possible. This means that the environment should be able to run the same scenarios as the original SMAC and that the same units should be available for use in the game, while maintaining as many game mechanics as possible identical to StarCraft II. This also means we kept the environment state, observation, action, and reward spaces nearly identical to the original SMAC. We demonstrate this in section 5.3 of our paper.

Assuming these tenets, we believe that SMAClite is a good candidate for a lightweight benchmark for multi-agent reinforcement learning algorithms, or that it at the very least improves upon the original SMAC in the above aspects.

Methodology

We implemented most of the game logic using the Numpy library for array operations in Python, and added a simple graphical interface using the Pygame library. The graphical interface, and the Pygame library itself, are not required to run the environment, but they are useful for verifying its correctness and visualizing the state of the game.
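
As a sketch of the kind of vectorised computation this enables (this is not the actual SMAClite code, and the range values are arbitrary), observation and targeting masks for all units can be obtained from their positions in a few array operations:

import numpy as np

OBSERVATION_RANGE = 9.0  # arbitrary values for illustration
TARGETING_RANGE = 6.0

def range_masks(positions: np.ndarray):
    """Given an (n_units, 2) array of unit positions, return boolean matrices
    indicating which units are within observation and targeting range of which."""
    deltas = positions[:, None, :] - positions[None, :, :]
    distances = np.linalg.norm(deltas, axis=-1)  # (n_units, n_units) pairwise distances
    return distances <= OBSERVATION_RANGE, distances <= TARGETING_RANGE

rng = np.random.default_rng(0)
observable, targetable = range_masks(rng.uniform(0.0, 32.0, size=(8, 2)))
print(observable.shape, int(targetable.sum()))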

The MMM2 scenario as visible in the SMAC (left) and SMAClite (right) environments.

One aspect of the engine worth mentioning is the collision avoidance algorithm. In StarCraft II, in order to move in a realistic manner and not bump into each other, the in-game units use a proprietary algorithm that is not publicly available (though there was an interesting talk about it from StarCraft II's developers at GDC 2011). We had to either implement our own algorithm mimicking the one in StarCraft II or use an open-source alternative. We decided to go with the latter and used the ORCA algorithm [2] for collision avoidance. Notably, this algorithm has also been used in commercial games, for example Warhammer 40,000: Space Marine.

Note that while we tried to maintain as much parity with StarCraft II as possible, there are some noticeable differences stemming from time constraints and from our simplifying the game's rules. Most notably, ranged units' projectiles are not simulated – each attack deals its damage instantly as soon as it is fired. Units also have no attack delay – as soon as a unit decides to attack and is within attack range, the attack happens. In StarCraft II, there is usually an animation delay between the moment the unit decides to attack and the moment it actually attacks.

We make available two versions of the ORCA algorithm compatible with SMAClite. One is written by us in Numpy, is shipped directly with the environment, and serves as a direct port of the original implementation. The other is a Python wrapper around the original C++ implementation of the algorithm, available in this code repository; this module is a fork of pre-existing Python bindings for the algorithm, which we modified to work with our engine. The latter version is much faster than the Numpy implementation, but requires extra installation steps. To use the C++ version of ORCA in the environment, one needs to set the use_cpp_rvo2 environment flag to True.
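
For reference, the sketch below shows how an ORCA/RVO2 simulation is typically driven in Python. It assumes the commonly used Python-RVO2 bindings (exposing rvo2.PyRVOSimulator); the forked module shipped for SMAClite may expose a slightly different interface, and all numeric parameters here are arbitrary.

import rvo2  # upstream Python-RVO2 bindings; the SMAClite fork may differ slightly

# Simulator arguments: time step, neighbour distance, max neighbours,
# time horizon (agents), time horizon (obstacles), agent radius, max speed.
sim = rvo2.PyRVOSimulator(1 / 16.0, 5.0, 10, 5.0, 5.0, 0.5, 3.0)

agents = [sim.addAgent((float(x), 0.0)) for x in (0, 2, 4)]
goal = (20.0, 0.0)
MAX_SPEED = 3.0

for _ in range(200):
    for agent_id in agents:
        # Preferred velocity: unit vector towards the goal, scaled by max speed.
        # ORCA then adjusts it so that agents do not collide with each other.
        px, py = sim.getAgentPosition(agent_id)
        vx, vy = goal[0] - px, goal[1] - py
        norm = max((vx * vx + vy * vy) ** 0.5, 1e-9)
        sim.setAgentPrefVelocity(agent_id, (vx / norm * MAX_SPEED, vy / norm * MAX_SPEED))
    sim.doStep()

print([sim.getAgentPosition(a) for a in agents])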

Installation and Usage

To install SMAClite, one can simply run pip install . in a cloned SMAClite repository. To install the optional C++ bindings, one can run python setup.py build && python setup.py install in a clone of their repository.
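
After installation, a minimal rollout might look like the sketch below. This assumes the standard Gym registration implied by the EPyMARL key shown further down ("smaclite/<scenario>-v0"); the exact keyword arguments and the reset/step return signatures may differ depending on the Gym version in use, so check the repository for the precise interface.

import gym
import smaclite  # noqa: F401  -- importing registers the SMAClite scenarios with Gym

# Sketch only: the environment id and kwargs follow the pattern used in the
# EPyMARL command below.
env = gym.make("smaclite/MMM2-v0", use_cpp_rvo2=False)

observations = env.reset()
for _ in range(10):
    actions = env.action_space.sample()  # one random action per agent
    observations, reward, done, info = env.step(actions)
env.close()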

To train and evaluate the models, we used the EPyMARL framework made available by our research group. An example command used to train a model using the C++ version of the RVO2 algorithm is shown below.

set -eu
ENV=2s_vs_1sc
TIME_LIMIT=300
SEED=1
ALGORITHM=iql

python src/main.py \
    --config="$ALGORITHM" \
    --env-config=gymma \
    with seed="$SEED" \
    env_args.time_limit="$TIME_LIMIT" \
    env_args.key="smaclite:smaclite/$ENV-v0" \
    env_args.use_cpp_rvo2=True

Agent performance

We trained a number of multi-agent reinforcement learning algorithms on various scenarios within the SMAClite environment. For each scenario/algorithm combination, we used 10 different seeds from 0 to 9, to account for the stochasticity of the environment, since in each game step the units execute their commands in random order. The results can be seen in the below image. Note that the graph formatting is identical to the one in our recent benchmark paper [3], allowing for a direct comparison to results observed in the original SMAC.

Results Graphs

In general, the ranking of the algorithms within the scope of each scenario is similar in SMAClite and SMAC. This is one of the reasons we believe that SMAClite poses an equivalent challenge to the original SMAC.

We also performed 0-shot transfer experiments by putting agents trained inside SMAClite into the original SMAC scenarios. We found that training inside SMAClite improved the agents' performance in each of the scenarios tested – this indicates a high level of similarity between the environments.

Agent behaviours

In this section, we present and briefly describe a few interesting behaviours observed in the SMAClite environment. The videos below use the Pygame graphical interface and show the behaviours of agents trained using EPyMARL. The agents were trained for 4 million timesteps in the first two scenarios, and 1 million in the last one.

The 3s_vs_5z scenario (algorithm: IQL)

In this scenario, the agents control 3 stalkers (powerful but fragile ranged units) and need to win against an enemy team consisting of 5 zealots (bulky units that can only attack in melee range). The terrain in the scenario is a square room, enclosed on each side by impassable walls. The video below shows the trained agents surviving the encounter with 2 of the stalkers left alive. In this and the following example, the agent units are green, and the enemy units are red.

As seen in the video, the agents learned to "attack and run", at all times keeping their distance from the zealots. This strategy is effective, as the zealots are unable to catch up to the stalkers, and the stalkers can deal damage from a safe distance most of the time. Notably, this technique, known as kiting, is used by human players in StarCraft II and many other real-time strategy games.

The MMM2 scenario (algorithm: MAPPO)

This scenario can be seen as simpler because it is fully symmetric, but it uses more complicated unit types. Each team has access to one medivac (a flying healer capable of restoring health to nearby units), some marines (versatile but fragile ranged units capable of attacking both ground and air units), as well as some marauders (bulkier units only capable of attacking ground units). The video below shows the agents winning the encounter with two marines and the medivac alive:

Notably, the agents learned to exploit the environment's reward scheme to exceed the theoretical maximum reward for this scenario (this can also be seen in the graphs above). Since the enemy medivac is capable of healing enemy units, and the agents are rewarded for dealing damage to enemy units, keeping the enemy medivac alive longer leaves more reward available to acquire. This is why the agents in the video above keep the enemy medivac alive for as long as possible, and only attack it when it is the last enemy unit alive. Note that this behaviour is not present in the original SMAC, because enemy healing there incurs a negative reward; we decided not to penalise enemy healing in SMAClite, as it leads to this interesting interaction. It can also be noted that, because the handwritten enemy AI is programmed to prioritise healers when attacking, the agents' medivac stays at a safe distance until all enemy marines are dead.

The bane_vs_bane scenario – 0-shot learning in SMAC (algorithm: IA2C)

In this scenario, both teams consist of some zerglings (fragile, fast-moving melee attackers) as well as banelings (suicide bombers that explode on impact, dealing damage to units in a circular area around them). Notably, this video shows the 0-shot performance of agents trained in the SMAClite environment when put inside the SMAC environment. In this example, the agent units are in red, and the enemy units are in blue.

The agents' strategy is simple – because the enemy banelings all run straight down and explode on impact, the agents first send a few units in to absorb these explosions, while the rest of their army waits to the side. Once the initial impact is over, the agents send in the rest of their army to finish off the enemy units. This proves to be an effective strategy, and was found by the agents very quickly during training.

Future work

StarCraft and StarCraft-like environments pose extremely interesting and versatile challenges in multi-agent reinforcement learning. We hope that the highly customizable and extensible nature of SMAClite inspires the research community to experiment more with various real-time strategy challenges, not necessarily confined to what StarCraft II has to offer. It would be interesting to experiment with more advanced terrain setups, new custom unit types, and more.

One particularly desirable improvement is adopting the stochasticity changes introduced by the authors of SMACv2 [4]. This version of the SMAC environment randomizes the allied unit types and their starting positions from episode to episode, making it a more difficult challenge for agents to beat.

The full report on the project can be found in the paper, where one can find more detailed descriptions of the environment and the algorithms used, as well as details on all the experiments conducted. All technical details are documented together with the code in the repository.

References

  1. Mikayel Samvelyan, Tabish Rashid, Christian Schröder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob N. Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and Multi-Agent Systems, 2019.
  2. Jur van den Berg, Stephen J. Guy, Ming Lin, and Dinesh Manocha. Reciprocal n-Body Collision Avoidance. In Robotics Research: The 14th International Symposium, Springer Tracts in Advanced Robotics, 2009.
  3. Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  4. Benjamin Ellis, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob N. Foerster, and Shimon Whiteson. SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2212.07489, 2022.