ai-strategy · data · production-ai · lessons-learned

What Happens When You Skip the Data Audit (We Found Out)

Stephen Martin · March 29, 2026

Early in a project for a mid-sized logistics company, we did something we've since stopped doing: we jumped straight to model selection.

The client had a clear use case — automating the classification of inbound freight documents — and clean-looking sample data. The business problem was real, the ROI was obvious, and there was pressure to show progress quickly. So we evaluated models, picked a good one, started building.

Six weeks in, we hit a wall.

What we thought we knew

The sample data the client shared was a representative set of about 2,000 documents. They looked consistent: fields were mostly populated, document formats were standard, labels were clear. We used them to test a few approaches, and the results were promising — 90%+ accuracy on the sample set. We felt good about the direction.

What we didn't know was where that sample came from.

The ops team had pulled a "clean" set for us — documents that had already been manually reviewed and corrected. The actual production pipeline included everything that hadn't been reviewed yet: documents with missing fields, formats from a dozen different carriers with slightly different conventions, scanned images of variable quality, and a non-trivial percentage of cases where the classification itself was ambiguous.

We found this out when we started running the model against production documents and watched accuracy drop to 71%.

The gap we'd missed

The real distribution of production data was fundamentally different from the sample we'd trained and tested on. Not dramatically — but enough.

About 15% of production documents came in as low-resolution scans. We hadn't seen those in the sample. Another 10% were from carriers using non-standard field names for the same information. Again, not in the sample. And roughly 8% of all documents fell into categories where two experienced humans would have classified them differently.

Those last ones were the most important. When we dug into them with the client's team, we discovered that the classification rules had informal exceptions that existed nowhere in writing. Different team members had slightly different mental models for the edge cases. The "clean" training data had been labeled by one person, consistently — but consistently applying one person's interpretation of ambiguous rules.

A proper data audit would have surfaced all of this in week one. Instead we found it in week six, after we'd built infrastructure around the wrong assumptions.

What we did differently after that

We added a data audit phase to the front of every project. It's not long — typically three to five days depending on data complexity — but it changed everything.

What we look at:

Actual production data, not samples. We ask for a random draw from the real pipeline, not a curated set. If that's not possible due to privacy constraints, we want to understand exactly why the sample was chosen and what it might be missing.

Distribution of edge cases. For classification problems, what percentage of cases are genuinely ambiguous? Who currently makes those calls, and how? The answer shapes how the model needs to handle uncertainty.

Consistency of labels. If the training data was labeled by humans, was it the same human? Do multiple reviewers agree on the edge cases? Inconsistent labeling puts a ceiling on model accuracy that no architecture decision can overcome.

Data completeness by field. Missing fields aren't always problems — sometimes a field being missing is itself informative. But you need to know which fields are reliably populated and which aren't before you build a system that depends on them.

Data lineage. Where does the data come from, how does it get transformed before it reaches us, and what could change upstream? A system trained on today's data needs to be resilient to reasonable changes in tomorrow's data.
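Two of these checks — completeness by field and label consistency — are easy to run mechanically before any modeling starts. Here's a minimal sketch in Python using only the standard library; the document fields and reviewer labels are hypothetical placeholders, not the client's actual schema:

```python
from collections import Counter

# Hypothetical sample of production documents. In practice this would be
# a random draw from the live pipeline, not a curated set.
docs = [
    {"carrier": "ACME", "weight_kg": 120, "doc_type": "bill_of_lading"},
    {"carrier": "ACME", "weight_kg": None, "doc_type": "invoice"},
    {"carrier": None,   "weight_kg": 80,  "doc_type": "bill_of_lading"},
    {"carrier": "Beta", "weight_kg": 95,  "doc_type": None},
]

def field_completeness(records):
    """Fraction of records with a non-null value, per field."""
    fields = {f for r in records for f in r}
    n = len(records)
    return {f: sum(r.get(f) is not None for r in records) / n
            for f in sorted(fields)}

def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

print(field_completeness(docs))

# Two reviewers labeling the same eight documents. Disagreement on the
# ambiguous edge cases shows up directly as a lower kappa.
reviewer_1 = ["bol", "bol", "inv", "pod", "bol", "inv", "pod", "bol"]
reviewer_2 = ["bol", "inv", "inv", "pod", "bol", "inv", "bol", "bol"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))
```

A kappa well below 1.0 on a double-labeled sample is exactly the ceiling-on-accuracy signal described above: if the reviewers can't agree, the model can't either.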

The thing we tell clients now

We've made this point to dozens of clients since: the model is the last thing you should worry about.

Models are good. The major providers have closed most of the capability gaps for standard classification and extraction tasks. In most cases, the model is not the constraint. The data is.

A mediocre model on clean, well-labeled, representative data will outperform an excellent model on poorly understood data almost every time. And you can't know what you have until you look.

The few days we saved by skipping the audit on that logistics project cost us three weeks of rework, a difficult conversation about revised timelines, and significant rethinking of architectural decisions we'd already made. We've never skipped it since.

If you're at the early stages of an AI project and wondering whether a structured diagnostic is worth the time before you start building, that's the question our AI Automation Audit is built to answer. One week, a clear picture of what you have and what it would take. Book a discovery call if you want to talk through what that would look like for your situation.

Ready to talk through your AI project?

Book a free 30-minute discovery call. No pitch, no commitment — just a direct conversation about what you're building and whether we can help.

Book a Discovery Call