
Testing & Debugging
Imagine you're at a friend's house and they're making Spaghetti Marinara. They ask you to bring the pot with the spaghetti over to them. But as you grab it you realize: ouch! It's hot! Your friend laughs and hands you oven mitts to pick the pot up with. A few minutes later, they ask you to bring the sauce pan over. You go to the stove, but this time put on the oven mitts first, and only then pick up and carry the obviously hot pan over. You got burnt once, so you don't repeat it. You learn from your mistakes.
AI doesn't learn from its mistakes. And that's a problem.
If you are still in the same session, it can of course recall things that happened in that session. But start a new session (or have the old one get compacted), and there's no memory of what happened. It's going to get burned on the stove, every day, until you take pity on it and remind it not to touch the stove.
There are tricks AI can use to try to avoid this problem. There's usually an AGENT.md file of some sort where notes are kept. But the problem is that all of these notes have to be loaded into the context window for every conversation. In theory, you could fine-tune or train your model to avoid mistakes, and the AI companies do hire technical experts to RLHF the code output. But that's about avoiding mistakes in the first place, not fixing them afterwards.
Testing is not one of an AI agent's strongest points. I was trying to vibe-debug a problem and the agent just couldn't figure it out. So I had to do it myself. And that's when I noticed something that should have been obvious: I used a debugger! I could put breakpoints in the running application to stop it at key points, I could look at (and change) variables in real time, and I could single-step through the code to watch the execution and see when things went wrong. One of the biggest intuitions a programmer can have is "something looks wrong". AI, on the other hand, is locked into a sort of late-1960s batch-programming approach to debugging: put lots of debugging print statements in the code, run it, look at the output, put in more print statements, rinse and repeat. (I guess I should point out that this is, unfortunately, still roughly how you debug Apex classes in Salesforce today.)
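To make the contrast concrete, here's a sketch in Python (the function and field names are made up for illustration). The first version is the batch style an agent uses; the second is the interactive style a human uses with pdb:

```python
# Batch style: sprinkle prints, rerun the whole program, read the output,
# edit, rerun again. Every iteration costs a full run.
def summarize_prints(records):
    total = 0
    for r in records:
        print("DEBUG record:", r, "running total:", total)
        total += r["amount"]
    return total

# Interactive style: one breakpoint, then inspect and even change
# variables live, stepping line by line until something looks wrong.
def summarize_debugger(records):
    total = 0
    for r in records:
        # breakpoint()  # uncomment to drop into pdb right here
        total += r["amount"]
    return total
```

The print version has to be edited and rerun for every new question you want to ask; the breakpoint version lets you ask those questions on the spot.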
As a result, on its own, AI is terrible at debugging. It just brute-forces its way through. You might say: so what, let AI do its thing and it will get done eventually. But it's very time consuming, and while it's doing its thing you're not making forward progress.
And when agents test, they tend to test either way too much at once and never make progress, or way too little to see the problem. And often, even when the problem is right in front of them, they still think their approach is the best way to do things and seem incredulous when you tell them it's actually broken.
But LLMs, no matter how smart, don't remember anything unless it happened extremely recently: they have only a very short-term memory, and no long-term memory at all. If it were the AI visiting your friend instead of you, it would once again think that picking up the hot sauce pan with bare hands is a great idea, and get burnt yet again.
There are tricks the chat apps use to remember things, but they're inefficient: they append their learnings to a giant list with the intent of "remembering". It's like keeping a notebook of rules that you'd have to read all the way through every single time you did anything. It's a crutch, and not a great one.
This is why, sometimes, AI seems unable to fix a problem: it never learns what went wrong before. It might transiently get it right, but only because you've complained enough that it avoids the approach it thinks is best. Once it forgets that you've complained, it goes right back to thinking the broken approach is the best one.
The way to fix this is to decompose the problem into small bits, have the AI coding agent understand exactly what each bit is expected to do, and make each bit independently verifiable, so you can check whether it works or not. It helps if you've kept the application modular with lots of unit tests. Lots of unit tests.
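As a minimal sketch of what one of those small, verifiable bits looks like (the function and test names here are illustrative, not from any real project):

```python
import unittest

def parse_amount(raw: str) -> float:
    """One small, independently testable piece of the larger problem."""
    return float(raw.strip().replace(",", ""))

class TestParseAmount(unittest.TestCase):
    def test_plain(self):
        self.assertEqual(parse_amount("42"), 42.0)

    def test_commas_and_whitespace(self):
        self.assertEqual(parse_amount(" 1,250.50 "), 1250.5)

# Run with: python -m unittest <this_module>
```

When a bit is this small, the agent can run its test, see pass or fail, and know immediately whether that piece works, without touching the rest of the system.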
As an example, I was working on a process that involved downloading data from an external system, persisting it in a database, and then presenting a summary in the UI. The agent tried to debug the whole path from the remote server to the UI every single time, which made progress very slow because the download step was sluggish.
So I first made it verify just that it could download the external data. I had it extract that chunk of code, which was embedded in a huge source file, into its very own source file with a built-in unit test, and then debugged just that piece with the unit test until it worked.
Then I made the rest of the application use that small source file whenever it wanted to pull data from the external system. I also had it put a huge comment at the top of the small file saying not to change the code without explicit human permission, to prevent it from butchering it in the future (as AI is wont to do).
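The extracted file might look something like this sketch (the module name, URL handling, and record format are all hypothetical; the point is the warning comment plus the built-in test that doesn't need the slow external system):

```python
# fetch_remote.py
#
# DO NOT MODIFY THIS FILE WITHOUT EXPLICIT HUMAN PERMISSION.
# This module has been debugged and verified in isolation.

import json
import urllib.request

def decode_records(payload: bytes) -> list:
    """Decode one batch of records from the external system's JSON payload."""
    return json.loads(payload.decode("utf-8"))

def fetch_records(url: str) -> list:
    """Download and decode one batch of records."""
    with urllib.request.urlopen(url) as resp:
        return decode_records(resp.read())

if __name__ == "__main__":
    # Built-in smoke test: exercise the decode step against a canned
    # payload instead of hitting the slow external system.
    sample = decode_records(b'[{"id": 1}, {"id": 2}]')
    assert [r["id"] for r in sample] == [1, 2]
    print("fetch_remote self-test passed")
```

Splitting the decode step out of the network call is what makes the built-in test fast: the slow download only happens in production, never in the test loop.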
Rinse and repeat on the rest of the steps.