AI tools for automated game testing




By Matthew Bedder (Machine Learning researcher), David Beattie (Senior software engineer) and Haitham Bou Ammar (Reinforcement Learning team leader).

Our team has crafted a specialised tool that solves a crucial problem for game developers: how can they ensure that levels are playable and completable after a design change? Like Tuneable AI, it's one way our agents can autonomously help developers save time and resources by tackling the core issues in gaming: is the user experience everything the designers intended? How can difficulty be personalised for each user? How can one be sure the whole thing just works?

Game development's mix of technical and non-technical people – designers, programmers, artists, sound engineers – can make it hard to identify when someone makes a change that “breaks” a game. Developers commonly try to solve this by using either hard-coded scripts or human testers to run through their games looking for issues. Neither approach is ideal. Simple scripts can break as a result of even the smallest change, and more complex ones are costly to implement and maintain. User testing can certainly bring to light both technical and experiential issues, but at even greater expense. It’s infeasible to expect users to test after every change, so developers must often do the job themselves, eating into valuable development time. When added to existing continuous integration tooling, our agents allow developers to identify the impacts of changes made to a game, replacing brittle test scripts and supplementing user testing.
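Wired into a CI pipeline, such a check can be as simple as replaying a previously trained agent against the new build and failing if the level is no longer completable or key metrics regress. Here is a minimal sketch; the thresholds, report fields and function names are illustrative assumptions, not our actual tooling:

```python
# Hypothetical CI gate: compare one replay of a trained agent against a
# stored baseline and collect any regressions worth failing the build over.
BASELINE = {"completed": True, "steps": 42, "avg_fps": 60.0}
MAX_STEP_REGRESSION = 1.25   # tolerate up to 25% more steps to the goal
MIN_FPS_FRACTION = 0.9       # tolerate up to a 10% frame-rate drop

def check_build(run_report, baseline=BASELINE):
    """Return a list of human-readable issues; an empty list passes the build."""
    issues = []
    if not run_report["completed"]:
        issues.append("level is no longer completable")
    if run_report["steps"] > baseline["steps"] * MAX_STEP_REGRESSION:
        issues.append(f"route length regressed: {run_report['steps']} steps "
                      f"vs baseline {baseline['steps']}")
    if run_report["avg_fps"] < baseline["avg_fps"] * MIN_FPS_FRACTION:
        issues.append(f"frame rate dropped to {run_report['avg_fps']:.1f} fps")
    return issues
```

Under this scheme a cosmetic change that leaves the metrics intact returns an empty list, while a blocked route or a heavy performance hit fails the pipeline with a message a developer can act on.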


To demonstrate this approach, we’ve created a highly simplified game development scenario. We start by hooking our testing platform up to Minecraft’s open-source Project Malmo interface, which gives us a realistic, complex codebase that faces common testing challenges. Once that’s done, we train an agent to run through a level and reach a goal within the game. For simplicity’s sake, the navigation task consists of a single level containing a few rooms with doors between them; the agent needs to find its way to a block, placed in one of the rooms, that represents the goal. When it starts training on the level, things can look a bit chaotic:
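As a toy stand-in for what such an agent learns (setting our platform's internals aside), consider tabular Q-learning on a corridor of rooms. Every detail here – the room count, rewards and hyperparameters – is an assumption made for the sketch, not our actual setup:

```python
import random

# Toy navigation task: rooms 0..5 in a corridor; the goal block is in room 5.
N_ROOMS, GOAL = 6, 5
ACTIONS = (-1, +1)  # move to the previous / next room

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_ROOMS) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = (rng.choice(ACTIONS) if rng.random() < epsilon
                 else max(ACTIONS, key=lambda b: q[(s, b)]))
            s2 = min(max(s + a, 0), N_ROOMS - 1)
            r = 1.0 if s2 == GOAL else -0.01          # small per-step penalty
            best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

def greedy_steps(q, max_steps=50):
    """Steps the learned greedy policy takes to reach the goal."""
    s, steps = 0, 0
    while s != GOAL and steps < max_steps:
        a = max(ACTIONS, key=lambda b: q[(s, b)])
        s = min(max(s + a, 0), N_ROOMS - 1)
        steps += 1
    return steps
```

After training, the greedy policy walks straight to the goal in the minimum five steps – the "optimal route" behaviour described below, in miniature.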

But our agents are quick learners: it doesn’t take long for the agent to achieve the mission goal reliably, and it soon figures out an optimal route. Once it becomes proficient, we can report statistics on what it did and how long it took, performance metrics for the game engine, and so on. If we’re happy with the numbers, we have a baseline for comparison.
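Concretely, such a baseline can be little more than aggregated statistics over repeated replays of the trained agent. A hypothetical helper, where the `(reached_goal, steps)` run format and the field names are assumptions for illustration:

```python
import statistics

def summarise_runs(run_fn, n=20):
    """Replay a trained agent n times and aggregate the metrics we baseline on.
    run_fn is any callable returning (reached_goal, steps) for one episode."""
    runs = [run_fn() for _ in range(n)]
    steps = [s for ok, s in runs if ok]
    return {
        "success_rate": sum(ok for ok, _ in runs) / n,
        "mean_steps": statistics.mean(steps) if steps else None,
        "worst_steps": max(steps) if steps else None,
    }
```

Storing a summary like this alongside the build gives later runs something concrete to be compared against.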

Post Training Optimal

Time to try and break things. First, an easy challenge: our game artists come in and replace all the temporary level textures with new artwork. We rerun the previously trained agent to confirm that it’s still able to complete the mission.

Palette Change No Issues

When we compare the generated report to our baseline, we can see that the changes don’t have a negative impact on our metrics. We can accept the changes and adopt the new numbers as our performance baseline.

If we follow the same process after a more significant change, such as moving some elements of the level around, we can identify and report on this as well. Here we’ve made a relatively innocuous change: replacing the easily avoided lava pit with a section of wall while keeping everything else the same. We simply re-run the previously trained agent, and the automatically generated report highlights areas that might be of concern.

Minor Change Minor Issues

There’s a small impact on performance – it turns out that the new wall adds more light sources, slightly increasing CPU usage and locally decreasing the frame rate. We can then use this data to make an informed choice about whether the change was worth the cost.

Finally, even larger changes that force the agent to completely alter its trajectory are detected and reported on, alerting us to potential issues. By adding levers to the doors, the designers have created a much more challenging problem that forces the agent to take far longer to find a solution (we've sped up the middle section of the video to shorten viewing time). The agent eventually manages to reach the goal, but reports what could be a serious issue to the testing or design team.
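One way a report could grade what it finds – a purely hypothetical sketch, not our platform's actual logic – is by how far a metric drifts from its baseline:

```python
# Illustrative severity triage: the relative drift of a metric from its
# baseline decides whether an issue is worth a human's attention.
def classify(metric, baseline, minor=0.10, serious=0.50):
    """Return 'ok', 'minor' or 'serious' for a relative regression."""
    drift = abs(metric - baseline) / baseline
    if drift >= serious:
        return "serious"
    return "minor" if drift >= minor else "ok"
```

A few percent of drift passes silently, the wall's slight frame-rate dip lands in the "minor" bucket, and a lever-induced blow-up in solve time is flagged "serious" for review.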

Major Change Possible Serious Issue

And so it goes. The agent, and thousands like it, can run through levels at great speed detecting and reporting on the impacts of changes big and small, freeing up human testers for more interesting tasks and letting designers and developers get on with what they do best: making great games.

This is just the start for the games testing platform. We’ll keep expanding our gaming toolbox with core technologies for finding hard-to-track bugs, load testing servers in MMORPGs and making user experiences ever more personalised – and fun.