Background
This is the third post in a series about building AI that works in the real world. Before reading this, it’s critical that you understand the first two posts in the series:
Breaking down the problem (post #1)
Defining success and setting up performance evaluation (post #2)
Fitting models (this post)
Fitting models often feels like the fun part. At its best, it’s like playing a video game where you make rapid progress. But if you jump straight to fitting models before understanding the concepts above, the fun quickly turns into feeling stuck in the mud, spinning your wheels.
Using real-world data raises challenges that don’t come up in clean environments like Kaggle competitions or a Master’s project. You no longer need strong coding skills - modern software packages handle most of that - but other challenges remain. This post will cover key real-world topics like:
What does “accurate enough” mean? Hint: It can’t be 100%
Using a model as a “second opinion” on your data
Iterating, in a methodical way, until you get to good enough
What does good enough look like?
AI usually replaces some human workflow. Humans aren’t 100% reliable, so the bar for the AI can’t be 100% either. If the bar truly is 100%, don’t use AI, as you’ll never get there. Instead, you need to build something using traditional, reliable software functions.
You can look to human performance as a starting point for what’s “good enough”. However, it won’t give you a definitive answer. Self-driving cars are a great example of this. If we measure accuracy as the % of trips completed safely and without delay, they’re over 99.9% (and probably well above human performance). But we’re still a long way from rolling them out, because we’re less comfortable with accidents caused by self-driving cars than with those caused by human drivers.
Ultimately, the market decides what’s good enough. It is not a technical question and can’t be answered in the lab. My favourite example comes from Midjourney, an image generator. I use Midjourney to make personalised celebration cards for my friends. On average, I try about 50 images before landing on one that feels good enough. So Midjourney’s accuracy here is about 2%. Yet I’m happy to keep paying $10/month for it, so in this case 2% is good enough!
So there’s no way to know what good enough looks like without going to market. If you don’t know - maybe you’re a startup with an entirely new product - you should set the bar low. After all, sometimes 2% is good enough! And you’ll learn quickly if you’re not good enough, because you’ll hear about it from your customers.
Using a basic model to check your data
Now that you’ve got a target, it’s time to build your first model. Fit something quick to a small sample of your data. In the likely event you find errors, do not start iterating yet. In the real world, some of your data has probably been mislabelled. There are many reasons why this happens:
Labelling your data requires specialist domain knowledge, which the labeller does not possess.
The labeller(s) aren’t themselves data scientists, so they don’t understand the impact that mistakes have on model performance.
The labellers are only human, and will make mistakes. The frequency of these mistakes goes up when they get bored and/or tired.
Anyway, the solution to labelling mistakes isn’t to improve the model. Instead, you need to fix the data, i.e. re-label it. A basic model provides an automatic “second opinion” on your data and finds possible labelling mistakes. You need to fix these mistakes before you begin iterating. Otherwise, you’ll iterate towards a model that repeats them!
You should repeat this process until the errors you find are no longer due to labelling mistakes, but to a model that isn’t yet good enough.
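To make this concrete, here’s a minimal sketch of that “second opinion” check. It assumes a tabular classification problem with numeric features in X and labels in y, and uses scikit-learn’s logistic regression purely as a stand-in for whatever basic model you prefer. The idea is to get out-of-fold predictions from a quick model and surface the rows where it confidently disagrees with the label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y, threshold=0.9):
    """Return row indices where a basic model confidently disagrees with the label."""
    y = np.asarray(y)
    classes = np.unique(y)
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities: each row is scored by a model that never saw it in training.
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    predicted = classes[proba.argmax(axis=1)]
    confidence = proba.max(axis=1)
    # Confident disagreements are the best candidates for labelling mistakes.
    return np.where((predicted != y) & (confidence >= threshold))[0]

# e.g. suspect_rows = flag_suspect_labels(X, y)
```

Review the flagged rows by hand: where the model turns out to be right, fix the label rather than the model.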
Iterating until you get to “good enough”
Once you’ve fixed the mistakes in your data, you should establish a “baseline” model. This is the first version that you measure accuracy with, and it doesn’t need to be good. It just needs to give you something to compare future iterations against.
A sensible way to set a baseline is to first overfit to a small sample of data. Overfitting involves building a model that’s 100% accurate on the small sample. This is a useful proof of concept, as it shows your model can at least learn the patterns it needs to. You then run this overfit model across your whole dataset, where it will likely perform much worse. This is ok - it’s just a starting point.
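As a rough sketch of this - again assuming features in X and labels in y, and using an unconstrained decision tree as the deliberately overfit model:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Deliberately overfit a small sample (here, the first 200 rows) as a proof of concept.
small_X, small_y = X[:200], y[:200]
model = DecisionTreeClassifier()  # no depth limit, so it can memorise the sample
model.fit(small_X, small_y)

print("Accuracy on the small sample:", accuracy_score(small_y, model.predict(small_X)))
# Expect ~100% above. Now score the same model on the whole dataset - this
# (much lower) number is the baseline you'll try to beat.
print("Baseline accuracy on everything:", accuracy_score(y, model.predict(X)))
```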
You next need to inspect the errors from your baseline model. Find the most frequent type of error, then propose an experiment to fix it. In the case of LLMs, this may mean adjusting the prompt. For other models, it may involve changing some training parameters.
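For a classification model, a simple way to find the most frequent error is to tally the most common confusions among the baseline’s mistakes - a minimal sketch, reusing model, X and y from the snippet above:

```python
from collections import Counter

# Count which (true label, predicted label) confusions the baseline makes most often.
predictions = model.predict(X)
errors = Counter((true, pred) for true, pred in zip(y, predictions) if true != pred)
for (true, pred), count in errors.most_common(5):
    print(f"{count:4d} rows labelled {true!r} were predicted as {pred!r}")
```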
It’s crucial to write down your experiment before running it. It’s tempting to run experiments in a rush: improving your model can feel like playing a video game where you just need to push the right button. But you’ll quickly reach a point where most experiments won’t improve on the baseline. It’s easy to forget what you’ve tried, so without logging you may repeat failed experiments and spin your wheels.
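The log doesn’t need to be fancy - a spreadsheet works, as does something like the sketch below (the file name and columns are just illustrative):

```python
import csv
import datetime

def log_experiment(description, metric, beat_baseline, path="experiments.csv"):
    """Append one experiment to a running log so failed ideas aren't repeated."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.date.today().isoformat(),
            description,       # what you changed - written down *before* running it
            round(metric, 4),  # the evaluation metric you agreed in post #2
            beat_baseline,     # did this become the new baseline?
        ])

# e.g. log_experiment("Limit tree depth to 5", metric=0.87, beat_baseline=True)
```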
Change only one thing at a time. It’s tempting to tweak several knobs at once, but then you can’t tell which change caused which effect. It will end up taking longer than making one change at a time, methodically.
If an experiment improves on the baseline, it becomes the new baseline model. This iterative process can be summarised as follows (there’s a code sketch after the list):
Find the most common error in the current baseline
Propose an experiment to fix the error
Log results from the experiment
If the experiment succeeds, establish a new baseline
Repeat
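Here’s what that loop can look like in code - a minimal sketch that assumes the evaluation set (X_eval, y_eval) you set up in post #2, the log_experiment helper above, and a hypothetical experiments_to_try list of (description, candidate model) pairs you’ve written down after inspecting the errors. The error inspection itself stays a manual step:

```python
from sklearn.metrics import accuracy_score

baseline_metric = accuracy_score(y_eval, model.predict(X_eval))

for description, candidate in experiments_to_try:
    candidate.fit(X, y)
    metric = accuracy_score(y_eval, candidate.predict(X_eval))
    beat_baseline = metric > baseline_metric
    log_experiment(description, metric, beat_baseline)
    if beat_baseline:
        baseline_metric, model = metric, candidate  # the experiment becomes the new baseline
```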
You keep going with this until you get to “good enough”. If you get stuck at some point, i.e. your iterations are no longer improving things, it’s worth looking again at the errors to check for mislabelled data. Never blindly assume data in the real world is correctly labelled! But if the data is correctly labelled and you’re out of ideas, that’s when it’s time to get external help (shameless plug).
Wrapping Up
People often jump to building and iterating on their models because it’s the fun part. At its best, it can feel like a video game. But it’s very important not to start playing this game until you’ve:
Broken down the problem correctly - see post #1
Defined success and set up performance evaluation - see post #2
Set a bar for “accurate enough”
Used a basic model to check the data is correctly labelled
If you skip these steps, what begins as a fun video game quickly becomes a frustrating game of whack-a-mole with a blindfold on.
Once you’ve done these, you’re ready to begin iterating. As you do so, you must have the discipline to log your experiments so you don’t end up repeating yourself. Follow all the above steps, and you’ll see how building AI that actually works often doesn’t need a PhD!