Efficient, Adaptable, & Safe Reinforcement Learning




By Haitham Bou-Ammar

This is the second of three blog posts showing how each of PROWLER.io's research teams (Multi-Agent Systems (MAS), Reinforcement Learning (RL) and Probabilistic Modelling (PM)) works to make AI decision-making possible for high-dimensional problems. This week we'll look at how the RL team is finding ways to make Reinforcement Learning data-efficient and scalable in industrial applications such as finance, logistics and robotics.


These applications look very different on the surface, yet the challenges they face share an essential structure: all have agents that perform sequences of decisions in uncertain environments in the presence of other agents. Not coincidentally, these are the issues our three teams focus on: RL helps agents learn and perform sequences of actions, PM tries to forecast and quantify uncertainties in the environment, and MAS makes it possible for agents to interact with each other strategically. This three-pronged approach is helping us solve some of the most difficult problems in AI — and in Reinforcement Learning itself.

How RL works


Imagine an agent: it might be a computer trading in a market, a drone delivering a package, or a robot walking in a kitchen. It's dealing with an uncertain environment, so we can't script its actions in advance; it's going to have to make decisions interactively. In RL, we break those decisions down into sequences of individual actions: each time an agent performs an action in its environment — buys a stock, routes a package, takes a step forward — the environment transitions to its next state and delivers a numerical reward to the agent. If a robot trying to walk out of a room takes a step closer to the door, it gets a positive number - good move! - if not, it gets a negative number and tries a different direction. As it interacts with the environment over and over again, it collects a dataset of these transitions. The goal of a reinforcement learner is to use that dataset to learn a "policy": an action-selection rule that maps states to actions in such a way that good outcomes are rewarded and reinforced, and bad ones discouraged. This core concept of RL has been widely applied and has shown remarkable success in both the private sector and academia.
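To make that loop concrete, here is a minimal sketch of the reward-driven update at work: a toy one-dimensional "room" where the agent learns to step toward the door. The environment, rewards and names are illustrative assumptions for this post, not PROWLER.io code; the update rule itself is standard tabular Q-learning.

```python
import random

# Illustrative sketch (not PROWLER.io code): a 1-D "room" with positions
# 0..4 and a door at position 4. Actions: 0 = step left, 1 = step right.
# Stepping closer to the door earns +1, stepping away earns -1.

def step(state, action):
    """Environment transition: returns (next_state, reward)."""
    nxt = max(0, min(4, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt > state else -1.0)

def train(episodes=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s != 4:                       # the episode ends at the door
            # Epsilon-greedy: mostly exploit the policy, sometimes explore.
            a = (rng.choice((0, 1)) if rng.random() < eps
                 else max((0, 1), key=lambda x: Q[s, x]))
            s2, r = step(s, a)
            # Q-learning update: reinforce actions with good outcomes.
            Q[s, a] += alpha * (r + gamma * max(Q[s2, 0], Q[s2, 1]) - Q[s, a])
            s = s2
    # The learned policy: an action-selection rule mapping states to actions.
    return {s: max((0, 1), key=lambda x: Q[s, x]) for s in range(5)}

policy = train()
```

After a couple of hundred episodes the greedy policy should map every interior state to "step right", toward the door.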

Current Tech

One famous example is Deep Q-Networks (DQN), which allowed DeepMind to train an agent end-to-end for the first time on Atari games such as Breakout and Space Invaders. Here the state of the environment was a snapshot of the game — a simple graphic — and the output was a joystick action: go left, go right, fire and so on. By training the model with a deep convolutional neural network, the agent was able to outperform human players on many of these games. This success was extended to robotics, where the snapshot of the environment is mapped to low-level torques or rotations of simulated robotic joints. These results were impressive and generated a lot of interest in the power of deep neural networks.


And how it doesn't

But there's an obvious problem here: to be successful, this approach needs tens of millions of interactions, tens of millions of mistakes, before learning something useful in a simulated environment. Clearly, this can't be applied anywhere in industry; no one is interested in a financial AI that will make tons of mistakes before learning a good trading strategy. What industry needs is RL that scales, that learns not with millions, not with hundreds of thousands, but with tens of interactions in an environment. Why do current RL practices fail at this? First let's understand the problem: in Reinforcement Learning there are two levels of complexity: task complexity and agent complexity.

Task Complexity

[Figure: an example maze of two rooms, with the agent at the start and the goal marked by a red dot]

In this example maze, the agent (the navy and purple icon) needs to find its way to a goal (the red dot). The robot/agent sees this world only through low-level representations: all it "understands" is its position in the maze. It must randomly explore the entire maze until, with luck, it stumbles on the correct path to the goal (navy blue dots) and its behaviour is reinforced. Humans don't think this way; we see the problem at a much higher level of abstraction: we understand that we must decompose the problem and start by exiting the first room, entering the second room and so on.
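A quick way to feel the cost of that blind exploration is to count how many purely random moves a memoryless agent needs to stumble through a tiny two-room maze. This simulation is our own illustrative sketch; the maze layout and sizes are assumptions, not taken from the post.

```python
import random

# Illustrative sketch: a 4x7 grid of two rooms joined by a single door
# cell. How many uniformly random moves does it take to reach the goal?

ROWS, COLS = 4, 7
WALLS = {(r, 3) for r in range(ROWS) if r != 1}   # dividing wall, door at (1, 3)
START, GOAL = (0, 0), (3, 6)

def random_walk_steps(rng):
    """Wander uniformly at random until the goal is hit; count the moves."""
    pos, steps = START, 0
    while pos != GOAL:
        dr, dc = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        nxt = (pos[0] + dr, pos[1] + dc)
        if 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS and nxt not in WALLS:
            pos = nxt
        steps += 1
    return steps

rng = random.Random(0)
avg_steps = sum(random_walk_steps(rng) for _ in range(200)) / 200
```

The shortest path here is only nine moves; the random walker averages far more, and the gap grows explosively with maze size. That is exactly why higher-level abstraction pays off.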

While current RL practices don't really think in this way, we at PROWLER.io are finding ways to enable higher levels of abstraction in AI. We've accomplished this in much more demanding environments than simple mazes and can demonstrate the approach in the hardest Atari game of all: Montezuma's Revenge.

[Figure: the first level of Montezuma's Revenge, where the agent must reach the key]

Here the player controls a character whose goal is to reach the key, retrieve it and move on to the next level. The problem is that any time it makes a mistake, it dies and the episode restarts. As in the maze example, the agent needs to be very lucky to achieve its goal by going down the ladder, jumping to the rope and over the skull, then up the ladder to the key and back out. If its actions are random, that won't be possible. This is high task-level complexity.

Agent Complexity

[Figure: the maze again, now with a humanoid robot as the agent]

The other half of the problem in RL is agent-level complexity. Going back to our maze, what if the agent is not just a simple dot, but a complicated android: a humanoid robot with complex dynamics and motion? Even if we give it the correct path to follow, its joints still have to perform tons of complex, coordinated internal actions, just to enable it to walk without falling over.

Current practice in RL generally combines these two separate problems. DeepMind and others try to mix and solve them together, in the hope that one big neural network can handle these convoluted levels of complexity. It's an inefficient solution that requires vast amounts of data and computational resources.

PROWLER.io instead separates the task problem and the agent problem, modularises the solution of these separate complexities and then collates them back together to solve the overall problem.

How do we do this?

Divide and Conquer


When addressing task-level complexity, we divide and conquer the problem even further by splitting the task — the maze — into subproblems that can be solved individually. To do this, we create a hierarchy: a high-level representation of the problem decomposes it into lower-level subproblems, whose solutions can then be collated back together to solve the maze as a whole.
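A minimal sketch of that hierarchy, using a tiny two-room maze: a high-level "manager" hands a low-level learner one subgoal at a time (first the door, then the exit) and rewards it for reaching that subgoal. The maze, reward numbers and function names are our illustrative assumptions, not PROWLER.io's actual system.

```python
import random

# Illustrative sketch, not PROWLER.io's system: hierarchical decomposition
# of a maze into per-subgoal controllers that are then collated together.

ROWS, COLS = 4, 7
WALLS = {(r, 3) for r in range(ROWS) if r != 1}        # door cell at (1, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def move(pos, a):
    nxt = (pos[0] + a[0], pos[1] + a[1])
    inside = 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS
    return nxt if inside and nxt not in WALLS else pos

def learn_option(start, subgoal, episodes=500, seed=0):
    """Low level: tabular Q-learning rewarded for reaching `subgoal`."""
    rng, Q = random.Random(seed), {}
    for _ in range(episodes):
        pos = start
        for _ in range(200):
            if pos == subgoal:
                break
            a = (rng.choice(ACTIONS) if rng.random() < 0.2
                 else max(ACTIONS, key=lambda x: Q.get((pos, x), 0.0)))
            nxt = move(pos, a)
            r = 1.0 if nxt == subgoal else -0.01       # intrinsic subgoal reward
            best = max(Q.get((nxt, b), 0.0) for b in ACTIONS)
            Q[pos, a] = Q.get((pos, a), 0.0) + 0.5 * (r + 0.9 * best - Q.get((pos, a), 0.0))
            pos = nxt
    return Q

def run_option(pos, subgoal, Q, limit=100):
    """Follow a trained controller greedily until its subgoal is reached."""
    for _ in range(limit):
        if pos == subgoal:
            break
        pos = move(pos, max(ACTIONS, key=lambda x: Q.get((pos, x), 0.0)))
    return pos

# High level: decompose "solve the maze" into two chunks, then collate them.
subgoals = [(1, 3), (3, 6)]
pos = (0, 0)
for sg in subgoals:
    pos = run_option(pos, sg, learn_option(pos, sg))
```

Each controller only ever has to solve one small, dense-reward problem, which is what makes the decomposition cheaper than attacking the whole maze at once.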

When the game theorists in our Multi-Agent Systems team see this, they immediately recognise that the interaction between these two levels of the hierarchy is much like a two-player game: where the high-level network can interact with a lower level component and reward it each time it gets closer to a solution.  That's the beauty of working at PROWLER.io: our different teams solve issues by sharing completely different ways of viewing these traditionally intractable problems. So once a subsidiary agent solves one part of the problem, the higher level agent picks another task for it to perform. They cooperate and collaborate on the solution until we end up with chunks of controllers that together are capable of solving the whole problem. This should be much more efficient than previous solutions, and it is. Let's apply the idea to Montezuma's Revenge:

On the left, DeepMind's approach: after 15 million interactions it is still randomly, seemingly crazily, jumping around the level. On the right, our high-level agent directs simpler sub-agents to navigate each obstacle, and collates their findings to successfully retrieve the key and solve the task — after only 2.5 million interactions.

But that's still too many interactions for real industrial applications: we want to go down to the tens. For that, we need to bring in the Probabilistic Modelling team and use an approach called Model-Based RL. To prove our point, we'll move beyond games to something physical and continuous that has the complexity of real-world problems.


We have a benchmark in RL called the cart-pole problem that applies real-world physics to movement. Here, a cart with a pole attached to it moves back and forth as force is applied, and the goal is to balance the pole in an upright position. We now need to take real-world physics into account: this is clearly a more complex, continuous environment than a game with a character that can only be directed to go up, down, left or right.

In this great example of agent complexity, we need to devise a model that understands the effect each action has on how the agent moves. We let the agent learn about itself as it moves, and it starts to understand that if it goes left or right at a certain speed, the pole will move in a certain way. As the agent gathers more information about itself, it learns to simulate the results of any action it might take.
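The model-learning step can be sketched in a few lines: from a handful of observed transitions, fit a tiny dynamics model, then query the model instead of the real system. The linear "cart" dynamics below are a made-up stand-in for the real cart-pole physics, and the whole example is illustrative rather than PROWLER.io's method.

```python
import random

# Illustrative sketch: learn a dynamics model from a few transitions.
# The linear dynamics are a made-up stand-in for cart-pole physics.

TRUE_DRAG, TRUE_GAIN = 0.9, 0.1

def real_step(v, force):
    """The (unknown to the agent) real world: damped velocity plus a push."""
    return TRUE_DRAG * v + TRUE_GAIN * force

# 1. Collect a small dataset of (velocity, force) -> next-velocity triples.
rng = random.Random(0)
data = [(v, f, real_step(v, f))
        for v, f in ((rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20))]

# 2. Fit v' = drag*v + gain*f by least squares (2x2 normal equations).
svv = sum(v * v for v, f, y in data)
svf = sum(v * f for v, f, y in data)
sff = sum(f * f for v, f, y in data)
svy = sum(v * y for v, f, y in data)
sfy = sum(f * y for v, f, y in data)
det = svv * sff - svf * svf
drag = (svy * sff - sfy * svf) / det
gain = (sfy * svv - svy * svf) / det

def model_step(v, force):
    """The learned surrogate the agent can now train against."""
    return drag * v + gain * force
```

Because this toy data is noiseless and the model family matches the true dynamics, least squares recovers the coefficients exactly; real systems need richer, probabilistic models to capture noise and nonlinearity.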

When DeepMind tries to learn these dynamics, they use an end-to-end approach: starting from an image, they try to map every pixel at one timestep to every pixel at the next. They input an image to a neural network that learns from every single pixel, outputs another image, and iterates. That's nowhere near what humans do to solve this problem. When we look at the image, we only notice the salient features of the environment: the dynamics of the cart pole. As far as we're concerned, the rest of the image is background noise. We need to develop an AI approach that does just that: abstracting these salient features and ignoring the rest of the pixels. That's part of what probabilistic modelling allows us to do. The upper cart-pole image is the resulting model, a kind of "dream" of what is happening in the real image below. We'll take a closer look at how James and the PM team do this in the next blog post.

[Image: the model's "dream" of the cart pole (top) against the real frame (bottom)]
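Once such a surrogate exists, the agent can plan "in imagination", scoring candidate actions against the learned model and only then acting in the real world. Here is a toy sketch with a made-up linear stand-in for the learned cart dynamics; all names and numbers are our illustrative assumptions.

```python
# Illustrative sketch, not PROWLER.io's planner: score candidate actions
# inside the learned model, then commit to the best one.

def model_step(v, force):
    """Learned surrogate: damped velocity plus a push from the force."""
    return 0.9 * v + 0.1 * force

def plan(v, candidates=(-1.0, -0.5, 0.0, 0.5, 1.0), horizon=5):
    """Pick the action whose constant-action imagined rollout is cheapest."""
    def imagined_cost(a):
        vv, cost = v, 0.0
        for _ in range(horizon):
            vv = model_step(vv, a)
            cost += vv * vv          # we want velocity near zero (balanced)
        return cost
    return min(candidates, key=imagined_cost)

# Act for ten steps; for brevity the "real world" here is the same model.
v = 1.0
for _ in range(10):
    v = model_step(v, plan(v))
```

The key point is that most of the trial and error happens inside the surrogate, so only a handful of real interactions are needed.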


The resulting efficiency is impressive, and we're now able to train the model using 160 data points instead of thousands. When it comes to training our RL agent, instead of interacting with the real environment, we can now train on the much more data-efficient surrogate of the cart pole provided by the probabilistic model. The results are astounding:

[Chart: reward against number of training episodes for each approach]

Here, the Y-axis shows rewards, so the higher, the better. The X-axis shows episodes (the number of times the agent interacted with the environment), which we want to reduce.

The red curve is OpenAI, the green is DeepMind, and the orange is an academic model-based RL approach that uses Neural Networks (paper), but no probabilistic model. The blue curve is our approach. Not only do we learn better overall, receiving more rewards on the Y-axis, but we need only around 15 interactions to do so on the logarithmic X-axis, while the competition needs around 2,000. That's a reduction of more than two orders of magnitude in the number of interactions needed to learn these kinds of problems.

It's the kind of result that allows us to scale RL and AI decision-making to industrial problems, and it's all integrated into our VUKU™️ platform.
