Level Up

The Overlooked Aspects Of Data Cleaning In Game Development

Ph.D. and AI researcher Daniele Gravina highlights how clean data is integral to utilising artificial intelligence in game development

Paige Cook

Published

February 14, 2023

Game development continues to evolve every day, and a major lift in the industry has come from AI and machine learning, writes Daniele Gravina, Ph.D., AI Researcher at modl.ai. An essential, but often overlooked, part of machine learning in game development is data cleaning. Unfortunately, even the most sophisticated game studios with their own data scientists struggle with this because it involves some heavy lifting. Data cleaning may involve standardising data sets, correcting mistakes such as blank fields and spelling or syntax errors, and recognising duplicated data points.
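As a rough illustration of those three chores (standardising values, dropping rows with blank or malformed fields, and removing duplicates), here is a minimal Python sketch. The field names `player_id`, `level_id` and `score`, the sample rows and the helper functions are all hypothetical; a real pipeline would follow the studio's own telemetry schema and might repair bad rows rather than discard them.

```python
from typing import Optional

# Hypothetical raw telemetry rows; the field names are illustrative only.
RAW_ROWS = [
    {"player_id": "p1", "level_id": " Level_01 ", "score": "1200"},
    {"player_id": "p1", "level_id": "level_01", "score": "1200"},  # duplicate once standardised
    {"player_id": "p2", "level_id": "LEVEL_02", "score": ""},      # blank field
    {"player_id": "p3", "level_id": "level_02", "score": "12o0"},  # syntax error in a number
    {"player_id": "p4", "level_id": "level_02", "score": "900"},
]


def clean_row(row: dict) -> Optional[dict]:
    """Return a standardised copy of the row, or None if it cannot be used."""
    player_id = row.get("player_id", "").strip()
    level_id = row.get("level_id", "").strip().lower()  # one spelling per level
    score_text = row.get("score", "").strip()
    if not player_id or not level_id or not score_text:
        return None  # blank field
    try:
        score = int(score_text)
    except ValueError:
        return None  # malformed number
    return {"player_id": player_id, "level_id": level_id, "score": score}


def clean_rows(rows: list) -> list:
    """Standardise rows, drop unusable ones, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        fixed = clean_row(row)
        if fixed is None:
            continue
        key = (fixed["player_id"], fixed["level_id"], fixed["score"])
        if key in seen:
            continue  # same data point already recorded
        seen.add(key)
        cleaned.append(fixed)
    return cleaned


CLEANED = clean_rows(RAW_ROWS)
for row in CLEANED:
    print(row)
```

Of the five sample rows, only two survive: one copy of the duplicated `p1` row and the `p4` row.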

Clean data is a fundamental step towards artificial intelligence (AI) that can support game developers in their quest to take their games to the next level.

Gaming’s new generation relies on data not only to analyse players’ behaviours but also to incorporate AI and machine learning for optimised development. It’s a missed opportunity if a games studio doesn’t bring in a data scientist at the beginning of the development process. Data scientists are essential because they create mathematical and automated models for analysing the game and identifying optimisation points.

For game development to be successful, the game needs to be data-ABLE, which means its data has to meet two main conditions: it has to be cleaned and it has to be versioned.

Cleaning the Data

Games are tricky and complex. To make the game stable and avoid unnecessary patching, game developers require that the collected data is normalised and versioned. The data may look correct at first sight, but ‘dirty’ data can jeopardise the accuracy of the AI algorithms.

It’s not uncommon for gamers, especially those with kids, to let other people play their games. If a player lets somebody else play the game they usually play, it immediately contaminates the game data, because the new player may not have the same level of expertise as the original player. We call these sessions outliers, and it’s important to filter them out of the data set to get the results needed from a machine learning bot.
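One simple way to flag such sessions is to compare each score with the same player's typical score. The sketch below uses a modified z-score built on the median absolute deviation (MAD); that choice, the 3.5 threshold and the function name `filter_outlier_sessions` are assumptions made for illustration, not a method described in this article.

```python
from statistics import median


def filter_outlier_sessions(scores: list, threshold: float = 3.5) -> list:
    """Keep the scores whose modified z-score is within the threshold.

    The modified z-score is 0.6745 * |score - median| / MAD, a common
    convention that is less sensitive to the outliers themselves than a
    mean-and-standard-deviation test would be.
    """
    if len(scores) < 3:
        return list(scores)  # too little history to judge
    centre = median(scores)
    mad = median(abs(s - centre) for s in scores)
    if mad == 0:
        return list(scores)  # no measurable spread; keep everything
    return [s for s in scores if 0.6745 * abs(s - centre) / mad <= threshold]


# Hypothetical scores for one account; 120 is the session played by someone else.
session_scores = [980, 1010, 995, 1005, 120, 990]
print(filter_outlier_sessions(session_scores))
# prints [980, 1010, 995, 1005, 990]
```

Running the filter per player, rather than across the whole player base, matters here: a score that is normal for a beginner can still be an outlier for an expert's account.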

It’s also necessary to normalise your data, that is, to keep the values in similar ranges, because machine learning models generally work best when numbers are on a similar scale. For example, if a bot detects a really big number, it may pay disproportionate attention to that number instead of the others, simply because it is hugely different from your regular data.
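A minimal sketch of one common normalisation, min-max scaling, is shown below. It rescales each feature to the range 0 to 1 so that a feature measured in hundreds of thousands does not drown out one measured in tens. The feature names and values are made up for the example, and other schemes (such as z-score standardisation) may suit a given model better.

```python
def min_max_normalise(values: list) -> list:
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical features on very different scales.
completion_seconds = [30, 45, 60]
coins_collected = [0, 250_000, 1_000_000]

print(min_max_normalise(completion_seconds))  # prints [0.0, 0.5, 1.0]
print(min_max_normalise(coins_collected))     # prints [0.0, 0.25, 1.0]
```

Min-max scaling is itself sensitive to extreme values, which is one more reason to filter outliers before normalising.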

Versioned Data

Games get patches, and they get changed a lot. For example, a puzzle game can start with 15 levels and grow to 4,000. It’s common to see studios getting creative and changing the level but not the ID of the level. So now you have data that says this level is hard and data that says the level is easy, but it’s probably not even the same level. Another example: is Player A’s performance on Level X the same as Player B’s performance on Level X six months later? It’s hard to know with un-versioned data.
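One way to guard against a level changing under an unchanged ID is to key every statistic on the level ID plus a fingerprint of the level's content. The sketch below is only an illustration of that idea: the event fields, the `level_fingerprint` helper and the choice of a truncated SHA-256 hash are assumptions, and a studio could equally log an explicit build or level version number.

```python
import hashlib
import json
from collections import defaultdict


def level_fingerprint(level_definition: dict) -> str:
    """Short, stable hash of a level's content (hypothetical helper)."""
    canonical = json.dumps(level_definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def completion_rate_by_version(events: list) -> dict:
    """Completion rate per (level_id, fingerprint) instead of per level_id."""
    totals = defaultdict(lambda: [0, 0])  # key -> [completions, attempts]
    for event in events:
        key = (event["level_id"], level_fingerprint(event["level_definition"]))
        totals[key][1] += 1
        if event["completed"]:
            totals[key][0] += 1
    return {key: done / attempts for key, (done, attempts) in totals.items()}


# Hypothetical events: "level_7" was redesigned but kept its ID.
old_level = {"grid": [5, 5], "moves": 30}
new_level = {"grid": [5, 5], "moves": 12}
events = [
    {"level_id": "level_7", "level_definition": old_level, "completed": True},
    {"level_id": "level_7", "level_definition": old_level, "completed": True},
    {"level_id": "level_7", "level_definition": new_level, "completed": False},
    {"level_id": "level_7", "level_definition": new_level, "completed": False},
]

rates = completion_rate_by_version(events)
for (level_id, fingerprint), rate in rates.items():
    print(level_id, fingerprint, rate)
```

Grouped by ID alone, "level_7" would show a 50% completion rate that describes neither design; grouped by ID and fingerprint, the easy and hard variants are reported separately. In practice the fingerprint would be computed once per build and logged with each event rather than recomputed from a full level definition.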

A good data system can dynamically adapt to new game versions and updates so that you can ask new questions without changing the game or the way the data is collected. In conclusion, before getting into the complexity of artificial intelligence and machine learning, you need your data to be clean and ready before you add functional and smart bots to a game.

Written By Paige Cook

Paige Cook is a writer with a multi-media background. She has experience covering video games and technology and also has freelance experience in video editing, graphic design, and photography. Paige is a massive fan of the movie industry and loves a good TV show. If she is not watching something interesting, then she's probably playing video games or buried in a good book. Her latest addiction is virtual photography, and she currently spends far too much time taking pretty pictures in games rather than actually finishing them.
