Decentralised Learning with many, many agents

back to our blogs

Decentralised Learning with many, many agents

David Mguni

By David Mguni (Senior Machine Learning Researcher)

Here at PROWLER.io, we believe the promise of AI can be fulfilled only when diverse fields of study – different ways of thinking – come together to provide novel solutions to problems. In this post, we'll outline how some of our recent research is tackling the growing need for huge numbers of agents to interact and make decisions. It's a good example of how Reinforcement Learning and Multi-agent Systems can come together to fix previously insoluble problems.

Certain phenomena underlie the dynamics of complex environments, whether they are natural weather systems and epidemics; designed communication, energy and transportation networks; or market-driven financial and economic systems. Even the human brain – and thinking itself – can be thought of as an immensely complex system of interacting agents. In applications as different as financial markets, fleet management, supply chain logistics and swarm robotics, the need for AI that can cope with many, many interacting physical and artificial agents is reaching a critical juncture.

In these settings, the decision space grows exponentially with the number of agents. Since each agent's decision depends on both the state of the environment and the actions of other agents, analysing these problems in multi-agent systems can become computationally intractable as the number of interacting agents grows. Moreover, agents may lack up-front knowledge of some or all of the following: the environment, the system dynamics, the actions required to obtain the highest rewards and the rewards themselves. Such complex systems are generally too complicated to hard code manually, model simply or solve analytically. The agents must discover solutions in a decentralised way, on their own – by learning.

Rlwmma B1

Yet as agents learn, changes in their policies affect the way they interact with the environment, impairing the ability of most existing methods to converge to best responses. Part of the solution is Multi-Agent Reinforcement Learning (MARL), which can help multiple agents find optimal actions through direct interaction with other agents in stochastic environments about which they have no knowledge up-front. MARL is a fundamental tool for understanding behaviour in complex systems, from multiplayer games to traffic networks and financial markets.

In markets, for example, agents (traders) performing actions (trades) affect each other's rewards (returns) and influence the environment (the market). In order to maximise long-term returns, a trader must find “best responses”, i.e. trading strategies that best respond to the collective behaviour of other traders. In game theory, such systems are modelled by stochastic games because they involve self-interested agents that are free to pursue their own objectives in a world whose state changes randomly over time.

But extending the applicability of MARL to systems with large populations of self-interested agents has represented a significant challenge for researchers. By itself, MARL has difficulty scaling with the number of agents. In applied problems and environments, it has typically been limited to systems with no more than a few agents.

That's where Mean-Field Games (MFG) can help. Individual agents generally have a negligible impact on these kinds of complex systems, what needs to be salient isn't their individual actions, but the results of decision-making by all agents as a collective whole (e.g. the movement of the market). What sometimes matters most is how the forest is changing, not the individual trees. Inspired by Mean Field Theory (a subfield of statistical physics that discusses the dynamics of particle motion) MFG studies decision-making in very large populations of interacting agents. This revolutionary model has been studied in depth by mathematicians and applied to describe complex multi-agent dynamic systems such as Mexican waves, stock markets and smart grids. In theory, MFG by itself could describe the behaviours of large population systems. But in practice, the models can require handling non-linear partial differential equations that are often unsolvable.

By combining MARL and MFG, we can get around these issues and use the benefits of both to find effective solutions across a very broad range of applications that had previously been prohibitively complex to analyse. As we've seen, the task of computing optimal actions quickly becomes intractable in multi-agent systems with more than a few agents. But the mean-field hypothesis greatly reduces the complexity of the problem by assuming the number of agents is infinite and that they have identical reward functions (much as traders can be assumed to seek to maximise their returns). These assumptions allow us to describe the collective behaviour of the agents as a probability density function. As explained in the paper, instead of agents responding to the actions of other agents individually, each agent now performs its actions in response to the mass which jointly represents the collection of states for all agents.

Rlwmma B2

Crucially in this body of work, solving mean field games allows us to closely approximate solutions for stochastic games. To demonstrate this in the paper, we show how stochastic games with thousands of agents can be approximated by a mean field game with infinite agents. Using our approach, agents learn to reason about their environment by repeatedly interacting with a collection of agents that can now be described by a probability density function.

We also show that discrete time MFGs belongs to a class of games known as dynamic potential games, in which Nash equilibria can be found by a single optimal control problem instead of a large interdependent system of partial differential equations (PDEs). This leads to a vast reduction in computational complexity.

Our method can even enable agents to plan long-term by performing a sequence of actions over time to maximise their individual cumulative reward. Here we are inspired by a classic problem within economics in which firms need to decide how to place supplies to meet changing demand in the presence of rival firms and transportation costs:

Rlwmma

In the above figure, a collection of objects (each with an associated reward) moves according to a path (depicted in red). The agents must select a sequence of movements that maximises their rewards as they accumulate over time. However, the more agents occupy a particular region, the lower the reward received by each agent. Moreover, the agents incur a high cost for movement, which makes the naive option of following the objects suboptimal.

With our method, the agents learn that by taking a shortcut (traversing horizontally) to meet the objects, they can increase their overall rewards. To behave in this way, the agents must first incur costs with low rewards as they traverse a horizontal path that doesn't coincide with the path of the objects. In this sense, the agents demonstrate planning by forgoing immediate rewards in favour of taking a path that maximises long-term rewards.

The theoretical component of this research should interest those studying analytic theory and seeking convergence guarantees. But going forward, multi-agent system researchers and practitioners in general will find this work especially useful if they wish to be ready for the explosion in numbers of agents that populate complex systems, as we're already seeing in fintech, vehicle fleets, IoT networks, micro-robot swarms and smart cities.