How to Hire the Right AI Development Agency (and the Questions That Separate Good from Bad)

Most companies evaluating AI agencies are doing it wrong. They're comparing proposals, checking credentials, and asking about team size. None of that tells you what you actually need to know.
What you need to know is whether this agency has shipped AI that works in production, whether they know how to handle the things that go wrong, and whether you'll own the outcome or be dependent on them forever.
Here's how to find out.
The only question that separates agencies that have done this from agencies that say they have
Ask for a production reference. Not a case study. Not a demo. A customer you can call who will tell you: they shipped something, it runs today, here's what it does.
Most agencies that are good at AI demos are not good at production AI. The demo workflow is: understand the problem, stitch together some APIs, build a compelling prototype in two weeks. The production workflow is: design for failure modes, build monitoring and observability, handle edge cases, integrate with the existing stack, do the boring infrastructure work that makes it reliable. These require different skills, and most AI agencies skipped the second part.
A production reference cuts through everything. If they can't provide one, or if the reference is vague about what actually shipped, weight that heavily.
Four questions to ask every AI agency
What broke in a project that seemed to be going well, and how did you fix it?
Good agencies have a ready answer. They've debugged strange failure modes, handled model behavior that diverged from testing, dealt with integration problems that weren't obvious until production. They'll tell you a specific story.
Agencies that haven't done enough production work either have no answer or give you a clean narrative where everything went smoothly. Real AI projects don't work that way.
What does the monitoring setup look like after it ships?
Production AI systems need observability. You need to know if output quality is degrading, if costs are spiking, if user behavior is drifting away from the test distribution. An agency that builds AI without building monitoring is handing you a black box.
Ask them: what would I see on a dashboard six months after launch? What would tell me something was wrong before a user complained? If they don't have a concrete answer, they haven't thought this through.
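To make that question concrete, here is a minimal, hypothetical sketch of the kind of daily checks a post-launch dashboard could surface. The field names, thresholds, and baselines are illustrative assumptions, not a prescription; the point is that quality, cost, and user feedback are tracked against a launch-time baseline rather than left invisible.

```python
# Hypothetical sketch: daily health checks against launch-time baselines.
# Thresholds and field names are assumptions for illustration only.
from dataclasses import dataclass
from statistics import mean

@dataclass
class InteractionLog:
    cost_usd: float       # what this request cost to serve
    eval_score: float     # 0-1 quality score from an automated grader
    user_flagged: bool    # explicit "this answer was wrong" feedback

def daily_health(logs: list[InteractionLog],
                 baseline_cost: float,
                 baseline_score: float) -> dict:
    """Summarise one day of traffic against the baselines recorded at launch."""
    avg_cost = mean(l.cost_usd for l in logs)
    avg_score = mean(l.eval_score for l in logs)
    flag_rate = sum(l.user_flagged for l in logs) / len(logs)
    return {
        "avg_cost_usd": avg_cost,
        "cost_spike": avg_cost > 1.5 * baseline_cost,       # costs creeping up
        "avg_eval_score": avg_score,
        "quality_drop": avg_score < baseline_score - 0.05,  # output quality degrading
        "user_flag_rate": flag_rate,                        # users noticing before you do
    }
```

An agency that has run AI in production will have its own version of this, and a concrete answer about where the baselines come from.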
Who owns the code and model configuration when the engagement ends?
This varies more than it should. Some agencies build on proprietary platforms you can't export. Some use architectures that require their ongoing involvement to maintain. Some deliver clean, documented code that your team can run without them.
There is a right answer here: you should own the output. The agency should want you to be able to maintain and extend it without them. An agency whose business model depends on keeping you dependent has a different incentive structure than one whose business model depends on you coming back for the next project because the last one went well.
What's your process when the AI doesn't behave as expected?
This isn't a hypothetical. AI systems produce unexpected outputs. The question is whether the agency has a systematic approach to debugging and improving them, or whether their plan is to adjust the prompt and hope.
A good answer involves evaluation datasets, automated testing for regressions, a framework for identifying whether the problem is the model, the data, or the architecture, and a plan for retraining or fine-tuning when needed. A bad answer is "we'll work closely with your team to iterate."
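For a sense of what "evaluation datasets and automated testing for regressions" means in practice, here is a hedged sketch of a regression gate: re-run a fixed evaluation set on every change and fail if quality drops below the last known-good pass rate. The `generate_answer` callable, the exact-match grader, and the JSONL file format are stand-ins for whatever the project actually uses.

```python
# Hypothetical sketch of a regression gate over a fixed evaluation set.
# `generate_answer`, the grader, and the file format are illustrative assumptions.
import json

def grade(expected: str, actual: str) -> bool:
    """Toy grader: normalised exact match. Real projects use task-specific
    checks or a model-based grader."""
    return expected.strip().lower() == actual.strip().lower()

def run_eval(generate_answer, eval_path: str, baseline_pass_rate: float) -> float:
    """Run every case in the eval set and fail if the pass rate regresses."""
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]  # each line: {"input": ..., "expected": ...}
    passed = sum(grade(c["expected"], generate_answer(c["input"])) for c in cases)
    pass_rate = passed / len(cases)
    assert pass_rate >= baseline_pass_rate, (
        f"Regression: pass rate {pass_rate:.2%} fell below baseline {baseline_pass_rate:.2%}"
    )
    return pass_rate
```

The eval set itself is the asset: every failure mode found in production becomes a new case, so the same mistake can't ship twice. An agency with a real process can describe something like this without being prompted.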
The red flags that are easy to miss
They talk about AI capabilities, not your problem. An agency that leads with "we work with GPT-4, Claude, and Gemini" or "we use the latest models" is telling you about their tools, not their approach. The model choice is a downstream decision. The upstream decision is understanding what you're actually trying to achieve and whether AI is the right tool for it. If that conversation doesn't happen early, the project is starting from the wrong place.
They give you a fixed-price quote for undefined scope. AI projects have inherent uncertainty. The right pricing structure accounts for that — milestones, defined deliverables, a discovery phase before commitment. A fixed-price quote for a custom AI build on a problem that isn't fully scoped is either a quote that's padded to cover uncertainty, or a quote that's going to result in disputes about scope later. Neither is good.
Their proposal is heavy on slides and light on specifics. Proposals that explain what AI can do for your industry in general, without engaging with your specific workflow, data, and integration requirements, are recycled pitch decks. A good agency asks questions before writing a proposal. If they didn't ask questions, they don't have enough information to write an accurate proposal.
They can't explain what they'd do if it didn't work. The difference between an agency that has shipped production AI and one that hasn't is that the experienced one has a plan for failure. They've been in situations where the model behavior diverged from expectations, where the integration broke, where the business metric didn't move. Ask about those situations. The answer tells you everything.
What to look for instead
The best signal for a good AI agency is honest, specific communication about scope and uncertainty. They tell you what they know, what they don't know yet, and what they'll need to learn during the project. They have opinions about your problem before they have opinions about which tools to use. They have a framework for measuring success before they start building.
That's a different kind of conversation than the one most agencies want to have, which is about capabilities and case studies and team credentials. Those things matter, but they're table stakes. The thing that separates the ones who ship production AI from the ones who demo it is how they talk about what goes wrong.
If you're evaluating AI partners and want an independent assessment of what you actually need — before you commit to a proposal — that's part of what an AI Automation Audit covers. We scope the right approach before any build begins.