# Driving system optimisation with clever incentives

## How incentive design can help optimise the world's most complex systems

back to our blogsYou don't want to sit in a traffic jam or wait in the rain for a cab. But, every day, you’re interacting with a world where your actions (and the actions of others) can adversely affect your ability to perform certain tasks.

These problems don't just apply to transport networks and cab fleets - they affect any system with a group of individuals. This group may be a handful of people or a vast network of computers, but to effectively coordinate a crowd and optimise the efficiency of that system, we often rely on a central authority.

Although this central authority may have the power to dictate to agents in a system, calculating what to dictate is often prohibitively computationally expensive. This is because the outcomes following the joint actions of agents must be calculated and this scales exponentially with the number of agents.

For example, say we have a system of interconnected robots, working on a factory floor to perform a series of tasks. The system goal is to improve their joint productivity but it would be computationally expensive for a central computer to send commands to each robot so that all robots jointly optimise the system performance.

Autonomous agents work over a wide range of systems, improving systems that are already naturally distributed and increasing the efficiency of other systems through distribution.

Instead, we could allow the robots to have autonomy over their decision-making. To allow the robots to act autonomously, we need to give them their own reward signal, which is a signal that tells the robot how well it is performing its designated task. So, this is equivalent to creating a system of self-interested agents.

Moreover, in many environments the central authority may not be in a position to dictate to autonomous agents, and just telling agents how to behave does not necessarily lead to a satisfactory outcome for the system either. For example, in a fleet of self-employed cab drivers, the fleet manager may want to reduce the average waiting time for customers. Minimising the average wait time is the system goal. However, each driver chooses (to some extent) which tasks to complete, or to stay idle. What's more, the driver may not care about system goal and, instead, only care about their individual goals.

Many other real-world systems involve inherently self-interested agents, such as traders in the financial markets, performing individual tasks in a shared environment. Agents respond to what other agents are doing and complete tasks that they think will give them the highest achievable individual reward. Agents interacting in this manner (and where their actions are affected by the actions of others) often lead to poor system performance.

The agents may also lose out on an individual basis too. Paradoxically, despite carrying out a series of actions chosen to maximise their self-interest, agents don’t get the greatest individual reward.

No one wins. The system is highly inefficient. The implications are horrendous.

These inefficiencies are illustrated by Braess’ paradox:

In such a system with self-interested agents, the stable outcome is a Nash equilibrium. The Nash equilibrium describes a stable configuration in a strategic setting or game setting where no agent can gain by changing their current strategy.

A Nash equilibrium occurs when the agents respond optimally to one another. In other words, they are each using an individually tailored “best-response” policy to perform the best sequence of actions, in response to the actions of other agents.

However, Nash equilibria have notoriously inefficient outcomes at a system level because the agents' self-interested behaviour produces poor system outcomes.

So, how can we achieve an efficient Nash equilibrium for such systems?

If we could modify some component of the environment, then this component could drive system optimisation, help the system reach its goal and produce a computationally efficient solution.

However, a naive modification could produce a worse outcome for the system. As demonstrated in the Braess' paradox example, you may build a bridge in the traffic network to ease congestion - but this can actually increase congestion and reduce the system efficiency.

So, we need to intelligently modify the system to optimise it. This is the basis for a new approach from PROWLER.io to tackle this problem. It uses black-box optimisation techniques, game theory and multi-agent reinforcement learning to compute the optimal modification that maximises system performance. You can read the full paper here.

Our approach also demonstrates, for the first time, successful optimisation of multi-agent systems containing an unbounded number of interacting adaptive learners in certain settings.

This work has a direct impact on multi-agent reinforcement learning research and a range of real-world systems in industries including economics, logistics and finance.

It helps those seeking to apply such techniques to increase the efficiency or output of a wide range of systems where the smallest increases in efficiency can lead to massive increases in returns and social gains. For example, the introduction of a smart toll could minimise traffic congestion or dynamic pricing could evenly distribute electricity consumption in smart grids.

To achieve this, we need to alter the self-interested agents' behaviour. Since we cannot do this dictatorially, we modify their reward functions, which is akin to providing incentives.

## Introducing smart incentives

Consider a system that contains a set of self-interested agents. Each agent is given an incentive, which induces a change in the agent's behaviour. Each incentive is carefully calibrated so that the collective actions of the agents now produce the optimal system performance. We could use this method to optimise our system of factory worker robots or fleet of cab drivers, for example. As a result, the individual still benefits - but the outcome is also better for society.

Our method splits the problem into two parts. One part computes the agents' best-response policies with a given set of reward functions. The other part finds the best set of modifications to the reward functions (or incentives) given the agents' joint response.

This decomposes the problem in a way that decentralises the computation because the agents compute their best-response policies themselves. It involves multi-agent reinforcement learning to compute the Nash equilibrium and Bayesian optimisation to compute the optimal incentive, within a simulated environment.

Our method of computing the optimal incentives is computationally far less expensive than traditional methods because we allow autonomous reinforcement learning agents to compute their Nash equilibrium policy themselves. This method enables the optimal performance of a centralised problem to be computed in a distributed fashion - the agents compute their individual best-response policies and the incentive designer computes how to change the agents' rewards to produce optimal system outcomes.

Consequently, the computational load of the central agent (in our case, the incentive designer) is reduced since it doesn't need to compute the optimal joint actions for all agents.

## Background

Consider a set of agents **{1, 2, …N}** who each receive a cumulative reward after performing a sequence of actions over some period of time.

Each agent has a reward function that determines the immediate reward the agent receives given the agent's own action, the actions of all other agents and the state of the system at that time.

For example, if an agent takes a long time to complete their commute, they get a low reward. This reward may be a function of time or other factors such as fuel costs, for example.

So, other agent's actions influence the reward. Each agent **i **has a value function **v _{i} **which describes its expected cumulative set of rewards received by agent

**i**. The value function depends on the agent’s own sequence of actions, where each action is decided by a policy

**π**but also depends on the actions of all other agents - their actions are decided by the joint set of their own individual policies, we'll describe this joint set collectively by

_{i}**π**.

_{-i}The goal of each agent is to maximise its expected sum of cumulative rewards. For a single agent decision problem, the evolution of the system is described by the sequence of states that are induced by actions that maximise the agent’s value function.

In a multi-agent setting with self-interested agents, the outcome is more complicated since the actions of many agents affect both other agents' rewards and where the system transitions to.

So, we use a Markov game formalism, which is a multi-agent generalisation of a Markov decision process, because we need to analyse a strategic setting (or game) over a time horizon where the system is stochastic, which captures randomness over time.

This following tuple (which is an ordered set of elements) represents a Markov game:

A Markov-Nash equilibrium is the solution concept for Markov Games, where every agent plays a best-response against other agents' policies over the horizon of the game.

The outcome of the game is intimately tied to the set of reward functions. This is because these reward functions determine what immediate actions the agents find optimal to execute.

A Markov-Nash equilibrium is defined by the following:

This says that no agent can improve their rewards by deviating (unilaterally) from their current strategy.

We're interested in systems that involve agents competing over some common resources. For example, when traders buy and sell shares, consumers use an electrical supply, or cab drivers competing over the supply of customers. Such systems are described by potential games.

Formally, potential games are games that have the property that the change in reward for all agents caused by a deviation from their current strategy can be expressed using a single global function, this single global function is known as the potential function.

The appropriate framework is a Markov Potential Game, i.e. a Markov game that satisfies the potential game property. We use Markov Potential Games to model dynamic multi-agent systems where agents act non-cooperatively to maximise their own interests and in systems where agents use a common resource. For example, transportation networks, factory floor space, oligopoly (or market share) etc.

## Our method

We parametrise the reward functions and introduce an incentive designer that can modify the game by choosing the parameter for the agents’ reward functions.

This method does not require up-front knowledge of the system performance metric. Instead, our method arranges the problem into a hierarchy in which the incentive designer chooses the reward function within a simulated game, played by the agents, that models the joint behaviour of the agents.

The goal of the incentive designer is to modify the set of agent reward functions for the sub-game that induces behaviour that maximises the system performance. Crucially, the agents are required to behave rationally and hence reproduce the responses of self-interested agents in an environment with the given reward functions, i.e. a Nash equilibrium.

Using feedback from the simulated sub-game in response to changes to the agents' reward functions, the incentive designer can compute precisely the modifications to the agents' rewards that produce desirable equilibria among self-interested agents of the real-world game. The simulated environment avoids the need for costly acquisition of feedback data from real-world environments whilst ensuring the generated agent behaviour is consistent with real-world outcomes.

The incentive designer also has its own reward signal. We choose this reward function to be the same system performance metric that we would like to optimise e.g. the average waiting time for a cab, spread over traffic within some road network, etc.

We then use a combination of multi-agent reinforcement learning (MARL) and Bayesian optimisation to overcome the inefficiencies of multi-agent systems:

- MARL is used to simulate the agents’ actions and produce the Nash equilibrium behaviour by the agents for a given choice of parameter by the meta-agent.
- Bayesian optimisation is used to select the parameters of the game that lead to more desirable outcomes. Bayesian optimisations find the best model based on randomness, which matches the dynamics of the system.

We repeat the two MARL and Bayesian optimisation processes until an optimal parameterisation is found.

To converge to the optimal outcome, we need to put a theory in place that drives individuals to that optimal outcome for the system. In doing so, we will prove that we can converge to an optimal reward structure and that we understand the conditions and methods for which such a scheme would work. This takes us back to the original tuple:

We replace each agent's reward function with a new reward function, which is parameterised by **ω**. This allows the incentive designer to tune the agents' signals of their (individual) performance. So, the game becomes:

The agents are now playing a modified game. The inclusion of **(R _{i},ω)_{i∈N}** means the incentive designer can tune the reward functions of the agents and, consequently, the incentive designer can alter the fundamentals of the game and the strategic interaction among agents.

The incentive designer now has to find the vector of parameters **ω** that maximises its reward function whilst the Markov-Perfect Nash Equilibrium condition must be satisfied (since all agents maximise their self-interest interest).

In other words, the incentive designer has to find the optimal set of parameters, **ω**, that produces an outcome that both maximises the global system performance and ensures each agent plays their Nash equilibrium policy.

This changes the value function compared to a traditional Markov game, where the following value function is used:

With the inclusion of the incentive designer and **(R _{i},ω)_{i∈N}**, each agent's aim now is to maximise a new value function:

## Our results

We prove several important results with our work. First, there exists a value **ω** that maximises the incentive designer's reward function. Second, that the set of Nash equilibria is 'continuous' that is, we have the following:

The result above enables the incentive designer to generate an iterative sequence of games each with modified reward functions and that this sequence converges to the optimal solution. This then permits the use of black-box optimisation to solve the incentive designer's problem.

Additionally, despite altering the fundamentals of the game that determine the way the agents interact, the potential property of the game is preserved. In other words, **G(ω)** is a Markov potential game.

So, the following equation holds:

Since this equation holds, each agent inadvertently maximises **Φ _{π,w}**. So, the function

**Φ**is a potential of

_{π,w}**G(ω)**.

We have reduced the problem of finding Nash equilibria in the modified game to solving an optimal stochastic control problem. This is a very useful result since obtaining the solution to an optimal stochastic control problem is, in general, an easier task than standard methods to obtain the Nash equilibrium, which relies on fixed point arguments. This, in turn, allows us to rapidly simulate the responses of the agents and the outcome of the game with modified reward functions.

So, by combining multi-agent reinforcement learning and Bayesian optimisation to induce efficient behaviour in complex, multi-agent systems, this proves two important points:

**1. G(ω) is a Markov Potential Game**

Although we allow the incentive designer to modify the game, specifically the agents’ rewards, we've demonstrated that our incentive modifications preserve the important structural features of Markov potential games.

This means that to solve the game, i.e. characterise the Nash equilibrium, rather than finding a fixed point solution, which is in general difficult, we just need to find the local maxima of the potential function, which is typically much easier.

**2. Individual rewards translate to a system reward**

Each agent inadvertently maximises system performance. This is achieved by combining Bayesian optimisation and constructing calibrated incentives that induce self-interested agents to take actions that maximise the system or group performance.

## Experiment 1: A selfish routing game

Let's look at a traffic network problem where each agent wants to commute from node 1 to node 8. In this scenario, we want to minimise the travel time for all agents in the system.

If we leave the system untouched, the drivers all opt for the shortest route possible and congestion occurs.

Consider now a central authority that seeks to minimise each agent's travel time and minimise congestion. This central authority can introduce tolls. If we introduce a incentive designer who wants to minimise delays due to congestion by devising a dynamic system of toll charges, this induces an even flow of drivers over the network.

We can see that when an incentive designer is included, it learns to set tolls on the middle edges that induce equal flow of traffic in the network.

Our framework produces a technique that vastly reduces the decision complexity and convergence time when compared with selfish routing games within large networks. Crucially, our method produces socially optimal outcomes. The Nash equilibrium is reached, which are also Nash equilibria for the agents.

## Experiment 2: Supply and demand matching with 2,000 agents

Consider 2,000 self-interested agents, each seeking to locate themselves at desirable points in space throughout some time horizon. The desirability of a region changes with time and decreases with the number of agents located within the neighbourhood.

In this setting, the resulting Markov-Perfect Nash Equilibrium distribution is, in general, highly inefficient as agents tend to cluster around locations they deem to be desirable. For example, if the agents are cab drivers in a fleet then each driver (and their colleagues) may cluster around a football stadium when they know the game is due to finish and fans need a lift home.

This is because you think you have the greatest chance of picking up a fare in this location.

The effect of the agents' self-interested behaviour on the system is highly undesirable. It causes traffic congestion and leaves punters on the other side of the city without a means to get home.

The computational complexity of the problem scales exponentially with the number of agents.

Fortunately, in our previous work, we have developed a mean field approach that enables Nash equilibrium policies of certain systems to be computed even when the population of the system is potentially unbounded.

So, we apply our method where an incentive designer introduces a reward modifier to incentivise agents to adopt a desired distribution. This method also incorporates our reinforcement learning mean field game framework to handle large strategic populations.

Remember, each point in the state space has some level of desirability, which is determined by the agent’s location and the number of agents at that given location.

As a result, the fleet of 2,000 independent drivers now distributes in the desired manner. In other words, the Kullback-Leibler (KL) divergence (a measure of the difference between the desired and induced distributions) converges to almost zero in this one shot case:

We then applied this method to a dynamic case, where the desirable distribution changes over three set time periods t = 0, 1, 2. Without the influence of the incentive designer, we see the same default behaviour as we recorded in the one-shot case.

However, when we applied our method, we found that the KL divergence, again, converges to almost zero.

This dynamic case also demonstrates that agents learn to select policies that produce a distribution that matches the desired outcome over the time period of the problem. The agents successfully converge to the expected Nash equilibrium.

In other words, the self-interested agents perform actions that maximise the system performance.

## In conclusion

We have introduced an incentive design framework, which enables self-interested adaptive learners to converge to efficient Nash equilibria in Markov games. To summarise: This method acts as an automated incentive design framework that calibrates the incentives of a multi-agent system, to produce desirable outcomes for both the system and its agents.

- Our theoretical results prove a continuity property in the incentive designer's modifications to the game, which enables a method to use black box optimisation methods to optimise iteratively the reward functions to maximise the system.
- This method can reconfigure the rewards of systems. So, going back to our previous example of a system of connected robots working on a factory floor, we don't need to reprogram each robot to optimise the system performance. Instead, we modify each robot's individual reward signal.
- Our simulation method allows an optimal incentive/system design to be achieved without implementing costly adjustments to the system. This could potentially save a lot of money where even minor changes to a system could provide major benefits across a broad range of socioeconomic use cases.
- The framework works in systems for which the incentive designer's reward function does not need to be known up-front (or its gradients or the gradients of the agents' rewards). This enables implementation in unknown systems and complex systems that are analytically intractable.
- By integrating the mean field game framework, we can implement the framework in systems with millions of agents.
- We demonstrated how this would work using a traffic network system with selfish commuters (using a subsection of London) where we optimised traffic flows and optimised a supply and demand logistics problem with thousands of self-interested agents (which could be commuter, cab drivers in a fleet, etc).

To conclude, incentive design is a powerful method to maximise social welfare and financial returns for any crowd-based system. Using our method, we can calibrate and maximise the overall performance of these crowd-based systems and drive optimisation across a range of industries and for millions of agents.