How to build AI that actually works #4: monitoring after launch
A final reason to pay attention to your data
You’ve launched…so what comes next?
This post is the final part of a series on building AI that works in the real world. Before reading it, you should be familiar with the earlier posts in the series; this one covers monitoring after launch.
You’ve broken down the problem, labelled data to set up an evaluation, and iterated on your models to perform well against these evaluations. You followed the playbook and launched your AI product - congrats!
Unfortunately, no launch survives first contact with the enemy. Your AI will get hit with unexpected requests and will sometimes produce wrong outputs. This can be scary and frustrating - what was the point of labelling all that data if the AI still doesn’t work?
There’s no need to panic - this is a completely normal experience. Your AI will likely require ongoing maintenance and monitoring. This post discusses how to monitor effectively, touching on topics like:
The simple way to monitor your data, and how to be smarter
Being too smart for your own good with fully automated monitoring
The holy grail of having your users monitor output for you
Like the other parts of building AI, the key to effective monitoring is paying attention to your data.
Data doesn’t stay still: it drifts
Some software systems that don’t involve AI can be left to run once you’ve set them up. Some websites haven’t changed in decades but still work. For example: Vortex.com has been around for nearly 40 years. That’s because the underlying software doesn’t need to change over time.
AI isn’t like that, because the underlying tech isn’t just software - it’s also data. Data nearly always changes over time, because it’s a representation of the real world. When the world changes, the data should change too, which means your system will need new data to stay aligned with the world. This problem is called data drift.
Let’s take an extreme example: Covid-19. Suppose you’re building an AI chatbot that advises sick people whether to stay at home or go to hospital, and a 75-year-old with a mild cough who’s losing their sense of smell asks it for advice. In April 2019, the chatbot should probably tell them to wait it out for a few days. In April 2020, this would be a terrible answer! The AI no longer works because the data it was trained on no longer represents the world, which has drifted.
A less extreme problem applies to most AI products. People change over time, so your users’ behaviour will change too. For example, how people query Google has changed, so Google needs to update the data its search algorithm is trained on. If tastes in TV shift, Netflix needs to update its recommendation system. If you’re building an AI product, you need to assume that data drift will happen at some point, so you’ll need some way to monitor this.
A simple solution: monitor the data yourself
The simplest way to deal with data drift is to monitor the data yourself. Monitoring means storing data on how users interact with your product:
The input to the AI pipeline e.g. the query the user has submitted.
The output of any intermediate steps such as data cleaning. If you don’t have any intermediate steps, you may not have broken the problem down correctly.
The final output of your pipeline e.g. the response to the query.
The version of your pipeline that has produced this output.
This allows you to review outputs and make changes if they’re not what you expect. For example: with a chatbot, you’ll likely notice new queries that weren’t in the training data. If the bot gives the wrong responses, you’ll need to update your training data and recalibrate the bot.
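To make this concrete, here’s a minimal sketch of what logging one interaction might look like. It’s illustrative rather than prescriptive: the field names, the log_interaction helper and the flat-file storage are all assumptions - in practice you’d probably write to a database or a logging service - but the fields being captured are the ones listed above.

```python
import json
import uuid
from datetime import datetime, timezone

def log_interaction(query, cleaned_query, response, pipeline_version,
                    path="monitoring_log.jsonl"):
    """Append one user interaction to a JSON Lines file for later review."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_query": query,            # the query the user submitted
        "cleaned_query": cleaned_query,  # output of an intermediate cleaning step
        "final_response": response,      # what the pipeline returned to the user
        "pipeline_version": pipeline_version,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: logging one chatbot interaction
log_interaction(
    query="I have a mild cough, should I see a doctor?",
    cleaned_query="mild cough see doctor",
    response="Rest at home and seek help if symptoms worsen.",
    pipeline_version="v1.3.0",
)
```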
This isn’t a bad starting point: if you do this, you’re ahead of most people deploying AI products. However, depending on how heavily your product is used, it may be a lot of work to label all of your data, and it may not be necessary. Instead, you can just label a random sample - this may be 10%, 5% or even less than 1% depending on volume. There isn’t a one-size-fits-all answer: it depends on how quickly data drift occurs, and how well your model can adapt to it without needing retraining.
To get smarter, we need a way to tilt our sample to where the AI is making mistakes. In most cases, you can give users a way to flag bad outputs. You’ll still need to check them yourself: after all, in most cases users can make mistakes too. But this is a better way to choose what data to check than random sampling.
Another way to tilt your sample is to use unsupervised methods (these don’t require labelling) to automatically select data that’s more likely to be troublesome. Outlier detection is a good example. Suppose all the queries in your training data have at most 100 words, then after launch users start submitting queries with 500 words. You’ll want to check the model’s responses to these, as the model is more likely to get wrong the kinds of queries it hasn’t seen before. In this case, an automated filter on query length is a sensible way to detect outliers.
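As a rough sketch of how the sampling might come together (building on the hypothetical log records above), the snippet below always sends user-flagged outputs and length-based outliers for human review, and randomly samples the rest. The thresholds and the user_flagged field are assumptions for illustration, not recommendations.

```python
import random

MAX_TRAIN_QUERY_WORDS = 100   # assumed longest query seen in the training data
RANDOM_SAMPLE_RATE = 0.01     # label ~1% of ordinary traffic; tune to your volume

def select_for_labelling(records):
    """Decide which logged interactions a human should label.

    User-flagged outputs and outliers always go to review; the rest of the
    traffic is randomly sampled at a low rate.
    """
    to_label = []
    for record in records:
        flagged = record.get("user_flagged", False)
        too_long = len(record["input_query"].split()) > MAX_TRAIN_QUERY_WORDS
        if flagged or too_long or random.random() < RANDOM_SAMPLE_RATE:
            to_label.append(record)
    return to_label
```

The exact rates matter far less than having some rule at all - start simple and adjust once you see how much the human review queue can handle.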
Don’t be too smart for your own good
People don’t like labelling data. It’s boring, and it brings you face-to-face with your model’s errors. I get it. This often leads them to devise “smart” alternatives, like fully automated systems for monitoring their AI output in production. None of these work, and none will ever work.
There’s an inherent flaw with fully automated monitoring: if an automated step can reliably detect errors, that step should become part of your model. And if it becomes part of your model, it will no longer be able to detect errors from your model. This insight may seem anywhere from patronisingly obvious to baffling, depending on your background, so I’m going to unpack it further.
Since LLMs took off, many solutions now promise fully automated monitoring with them. You provide the query and the model’s output to the LLM “checker”, which then decides whether the output is consistent with the query. If it’s not, the checker labels this as an error so that you don’t need humans to label the data anymore.
If this actually worked, you’d include the “checker” as the final step in your model. Rather than just flagging errors, it could send the query back to the model and tell it to give a different response. This would improve the model, but the checker would no longer be an effective monitor, because it can’t catch its own mistakes.
The reality is that these systems don’t work. Using LLMs to judge the output of other LLMs is partially accurate at best - it’s like grading your own homework. It’s also not a given that the LLM can fix the errors it detects. Imagine the checker detects an error and feeds it back to the model, but the model still gets it wrong. The pipeline could get caught in that loop forever!
Instead, automated checks should be treated the same as other unsupervised methods: they’re another way to tilt the sample sent to humans for labelling. Don’t be too smart for your own good by trying to avoid labelling entirely!
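If you do wire up an LLM checker, the sketch below shows the spirit of this: the checker’s verdict decides whether an output joins the human review queue, never whether it counts as correct. The llm_checker_flags_output function is a stand-in for whatever judge you use, not a real API.

```python
def llm_checker_flags_output(query, response):
    """Stand-in for an LLM 'judge' call that returns True when it thinks
    the response doesn't answer the query. How you implement it is up to
    you -- what matters is how its verdict gets used."""
    raise NotImplementedError

def route_for_review(record, review_queue):
    """Use the automated check to prioritise human review, never to auto-label."""
    if llm_checker_flags_output(record["input_query"], record["final_response"]):
        review_queue.append(record)  # a human makes the final call
```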
The holy grail: your users monitor for you
So we’ve established that a human needs to label some data to monitor your AI in production. However, that human doesn’t necessarily need to be you or someone you pay. There are good examples of companies devising clever strategies to get their users to label data for them.
Google Search is the best-known example. Their AI needs to return search results that are maximally relevant to the user’s query. In production, they monitor their users’ behaviour such as whether they click the returned links or search again. Users choosing to search again indicates the results weren’t relevant, which automatically labels the data as “incorrect”. In turn, Google can then automatically update their search algorithm with this labelled data.
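A toy version of this kind of implicit labelling might look like the sketch below, where a quick follow-up search is treated as a weak “results weren’t relevant” label. The five-minute window and the event structure are assumptions for illustration - this is not how Google actually does it.

```python
from datetime import timedelta

REFINEMENT_WINDOW = timedelta(minutes=5)  # assumed window for a "searched again" signal

def weak_labels_from_searches(searches):
    """Derive weak labels from a time-ordered list of search events
    (dicts with 'timestamp', 'query' and 'results').

    A quick follow-up search is treated as a signal that the previous
    results weren't relevant."""
    labels = []
    for prev, curr in zip(searches, searches[1:]):
        searched_again = (curr["timestamp"] - prev["timestamp"]) < REFINEMENT_WINDOW
        labels.append({
            "query": prev["query"],
            "results": prev["results"],
            "label": "not_relevant" if searched_again else "possibly_relevant",
        })
    return labels
```

Even then, these weak labels are noisy - a follow-up search doesn’t always mean the first results were bad - which is part of why Google still pays people to review search results too.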
Captcha is my favourite. You may know Captcha as the annoying check that slows you down when browsing websites. The tasks are tedious: I don’t know how many times I’ve got the “select all images with cars” one wrong. But as you do the task, you’re labelling image data that can be used to train AI models. Captcha providers sell this data to people training AI models, so those people don’t need to label any of it themselves.
Unfortunately, the vast majority of AI products don’t have the potential for these fully automated monitoring systems. If you go chasing them, you’ll likely find yourself disappointed. Just like people who went chasing the holy grail!
Wrapping up: it’s ok to be clever, but not too clever
For AI systems to work, the data they’re trained on must reflect the real world. If the world changes, your data will drift and you’ll need to update it. The simple way to do this is to sample a fraction of your AI’s outputs and manually label whether they’re right or wrong.
You can get a bit more clever by finding ways to tilt the sample towards likely errors. Automated techniques such as outlier detection can help with this, and reduce the volume you need to monitor if implemented well. However, you still need a human to review the “tilted” sample these techniques select.
Be wary of anyone promising fully automated monitoring, as this is essentially the holy grail. No product ever really gets there - even Google still pays people to dig into their search results. Instead, it’s better to roll up your sleeves and continue to monitor at least a fraction of the data with human eyes.
This marks the end of the series, for now. If you made it this far: thank you very much. These posts have been a pleasure to write, and it’s a highlight of my day whenever someone reaches out about them. So…don’t be shy!