Is Your Data Ready for AI? What to Check Before You Build
The most common reason AI projects stall in planning is a version of this statement: "We'd love to do this, but I'm not sure our data is good enough."
Sometimes that's correct. Sometimes it's a fear that isn't grounded in anything specific. And sometimes it's masking a different concern entirely (timeline, budget, internal buy-in) that has nothing to do with data.
Before you greenlight or kill a project based on data concerns, it helps to know what "ready" actually means. Not in an abstract sense. In a practical one that applies to your specific project and your specific data.
The myth of the perfect dataset
No production dataset is clean. None. Every company I've worked with has data issues: missing fields, inconsistent formatting, records from legacy systems that don't match records from current systems, labels that were applied differently before a policy change two years ago. This is not a signal that you can't build. It's just Tuesday.
The question is not "is our data perfect?" It's "are the problems in our data the kind that would actually affect what we're trying to build?"
A classification model that identifies high-risk invoices doesn't care much if some non-critical address fields are missing. It cares a lot if the "high-risk" label was applied inconsistently across different time periods. Those are different problems with different stakes.
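A quick way to spot that kind of label inconsistency is to plot or print the label rate over time. Here is a minimal sketch in Python with pandas, using a toy stand-in for a real invoice table (the column names `created_at` and `high_risk` are hypothetical; substitute your own):

```python
import pandas as pd

# Toy stand-in for a real invoice table; in practice, load your own data
# and use your own column names.
df = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2022-03-01", "2022-07-15", "2024-02-10", "2024-06-05"]
    ),
    "high_risk": [1, 1, 0, 0],
})

# Share of invoices labeled high-risk, per year. A sharp jump around a
# known policy change suggests the label definition changed, not the risk.
rate_by_year = (
    df.assign(year=df["created_at"].dt.year)
      .groupby("year")["high_risk"]
      .mean()
)
print(rate_by_year)
```

If the rate moves sharply at the date of a known policy change, you are likely looking at a labeling change rather than a real shift in risk, and the older labels may need a relabeling pass before training.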
Four questions to answer before you start
1. Do you have examples of the thing you want the model to do?
If you want a model to classify customer requests, you need examples of customer requests that have already been classified. If you want a model to predict which deals will close, you need historical deals with outcomes. If you want a model to extract information from contracts, you need contracts and the information you'd want extracted.
This sounds obvious, but it's where many projects actually hit a wall. The company wants to predict X, but they've never tracked X systematically. That's not a data quality problem. It's a data existence problem, and it requires a different solution.
2. How many examples do you have?
The answer doesn't need to be "a lot." With a pre-trained base model and fine-tuning, a few hundred labeled examples can get you to a meaningful baseline on many classification tasks. For more complex problems, or for custom models trained from scratch, you'll need more.
The useful framing is: do you have enough examples that a new employee could learn the pattern from them? If yes, there's usually something to work with. If your team is still making judgment calls on edge cases that they can't articulate rules for, the model will struggle too.
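Before debating whether the total is "enough," it helps to count examples per class, since one well-covered class can hide several starved ones. A minimal sketch (the label list and the 50-example floor are illustrative assumptions, not a universal threshold):

```python
from collections import Counter

# Toy label list; in practice, pull the label column from your tracked data.
labels = ["billing", "billing", "refund", "refund", "refund", "other"]

counts = Counter(labels)
MIN_PER_CLASS = 50  # illustrative floor for fine-tuning; tune to your task

for label, n in sorted(counts.items()):
    status = "ok" if n >= MIN_PER_CLASS else "too few examples"
    print(f"{label}: {n} ({status})")
```

Classes that come back with a handful of examples are the ones where the model will make the same inconsistent judgment calls your team does.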
3. Is the data accessible?
Can your team actually get the data out of wherever it lives, in a format that a model can read? Data locked in a system with no export function, no API, and no engineering access is not workable, regardless of its quality. This is not an unsolvable problem, but it adds time and cost to the project.
If the data exists in spreadsheets, a SQL database, an API, or a modern data warehouse, you're in reasonable shape. If it's in scanned PDFs, a legacy CRM with no export, or a system that requires manual data entry to access, flag that before the project scoping conversation.
4. Does your historical data reflect current reality?
This one is underrated. A model trained on data from three years ago will learn the patterns from three years ago. If your business has changed, if your customer base has shifted, if your product has evolved, the model will reflect the old state, not the current one.
For stable processes, this is fine. For processes that have changed significantly, it means your training data may need to be filtered, reweighted, or supplemented before a model trained on it will generalize to today's conditions.
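One cheap check for this kind of drift is to compare the distribution of a key categorical field between your oldest and newest data. A minimal sketch using total variation distance (the deal-segment values here are made up for illustration):

```python
from collections import Counter

def distribution(labels):
    """Relative frequency of each value in a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

old = ["enterprise"] * 8 + ["smb"] * 2     # deals from three years ago
recent = ["enterprise"] * 3 + ["smb"] * 7  # deals from the last quarter

drift = total_variation(distribution(old), distribution(recent))
print(f"distribution shift: {drift:.2f}")  # 0.0 = identical, 1.0 = disjoint
```

There is no universal cutoff, but a large shift on a field the model depends on is a signal that older records should be filtered, reweighted, or supplemented before training.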
What to do if your data isn't ready
If the answers to those four questions surface real gaps, the project isn't necessarily dead. It needs a different starting point.
Some gaps are quick to close. Data access issues are usually solvable in days or weeks, not months. A labeling effort on existing unlabeled data can produce a working training set faster than most teams expect, especially with modern labeling tools and a clear labeling guide.
Other gaps take longer. If you don't have examples at all, you may need to run a data collection phase before a build phase. If your labels are inconsistent, you may need a relabeling pass with reconciliation. Neither is a reason to abandon the project. They're reasons to sequence it correctly.
The companies that get the most out of AI investments are the ones that do this assessment before they commit to a build scope. A week spent understanding your data state saves months of mid-project rework.
If you want an honest read on whether your data is workable for the project you have in mind, book a discovery call. We do a data readiness assessment as part of every discovery engagement, and we'll tell you exactly what we'd need to see before recommending a build.