What Makes a Good RAG System (And Why Most of Them Aren't)

Retrieval-augmented generation gets pitched as the solution to LLM hallucinations. Build a knowledge base, hook it up to an LLM, and suddenly your AI answers questions from your actual data instead of making things up. Simple.
In practice, most RAG systems don't work that way. They look fine in demos and fall apart when real users start asking real questions. The failure isn't usually the LLM. It's everything underneath it.
Here's what actually separates the RAG systems that hold up in production from the ones that don't.
The retrieval problem is harder than it looks
A RAG system has two jobs: retrieve the right context, then generate a useful answer. Most teams spend 80% of their time on the generation side — prompt engineering, model selection, response formatting — and then wonder why their system produces confident, fluent answers that are completely wrong.
The problem is almost always in retrieval.
If you pull the wrong chunks from your knowledge base, no amount of prompt engineering will save you. The model can only work with what you give it. And the dominant retrieval approach — cosine similarity on text embeddings — has real limits.
It retrieves by semantic similarity, not by relevance to the actual question. These are not the same thing. A user asking "what's your refund policy?" might get chunks about customer satisfaction surveys, return shipping costs, and your terms of service footer — all semantically related to "refund" — but not the two sentences that actually answer the question.
This is the core retrieval problem. Solving it requires more than a bigger embedding model.
What good retrieval actually looks like
Teams that ship reliable RAG systems typically layer multiple retrieval strategies rather than relying on a single embedding search.
Hybrid search combines dense vector search (embeddings) with sparse keyword search (BM25 or similar). Keyword search is better at exact matches and proper nouns; vector search handles paraphrasing and synonyms. Together they cover more ground.
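One common way to merge the two result lists is reciprocal rank fusion (RRF). The sketch below assumes your dense and sparse searches each return a best-first list of chunk IDs; everything else is self-contained.

```python
# Hybrid retrieval via reciprocal rank fusion (RRF): merge two ranked
# lists by summing 1/(k + rank) contributions. `dense_ranked` and
# `sparse_ranked` are assumed outputs of your vector and BM25 searches.

def reciprocal_rank_fusion(dense_ranked, sparse_ranked, k=60):
    """Merge two best-first ranked lists; k=60 is the conventional constant."""
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked well by both searches beats one ranked well by only one.
merged = reciprocal_rank_fusion(
    dense_ranked=["c3", "c1", "c7"],
    sparse_ranked=["c1", "c9", "c3"],
)
```

RRF is attractive precisely because it needs no score calibration between the two searches: only ranks matter, so BM25 scores and cosine similarities never have to live on the same scale.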
Metadata filtering narrows the candidate pool before you even run similarity search. If you know the user is asking about a specific product, a specific date range, or a specific department, filter first. Searching 10,000 chunks when you could search 200 is slower and noisier.
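The filter-then-search order is the whole point. A minimal sketch, with hypothetical chunk records and a toy two-dimensional embedding standing in for your real store and model:

```python
# Filter on metadata first, so similarity search only sees in-scope chunks.
# Chunk records and vectors are illustrative stand-ins for a real store.

chunks = [
    {"id": "c1", "product": "billing", "year": 2024, "vec": [0.9, 0.1]},
    {"id": "c2", "product": "billing", "year": 2021, "vec": [0.2, 0.8]},
    {"id": "c3", "product": "search",  "year": 2024, "vec": [0.9, 0.2]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def filtered_search(query_vec, product, min_year, top_k=5):
    # Metadata filter runs before any similarity scoring.
    pool = [c for c in chunks if c["product"] == product and c["year"] >= min_year]
    pool.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in pool[:top_k]]

results = filtered_search([1.0, 0.0], product="billing", min_year=2023)
```

Most vector databases expose this as a pre-filter on the query itself, so the narrowing happens inside the index rather than in application code.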
Re-ranking runs a second pass over the top retrieved chunks using a cross-encoder model that scores relevance more precisely than embedding similarity. This step alone often moves the needle more than anything else.
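The two-stage shape matters more than any particular model. In the sketch below, `overlap_score` is a toy lexical stand-in for a real cross-encoder (which would score each (query, chunk) pair jointly); the retrieve-then-rerank structure is the pattern being illustrated.

```python
# Retrieve-then-rerank: a cheap first stage produces candidates, then a
# second pass rescores every (query, chunk) pair and keeps the best.
# `overlap_score` is a toy stand-in for a cross-encoder model.

def overlap_score(query, chunk):
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

def rerank(query, candidates, top_k=3):
    # Rescore each candidate against the query; keep the top_k.
    scored = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return scored[:top_k]

candidates = [
    "Customer satisfaction survey results for Q3",
    "Refunds are issued within 14 days of a return request",
    "Return shipping costs are covered for defective items",
]
top = rerank("when are refunds issued", candidates, top_k=1)
```

Because the second pass only sees a few dozen candidates, you can afford a much more expensive scorer there than in the first-stage search over the whole corpus.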
Contextual chunking means splitting documents in ways that preserve meaning, not just character counts. A chunk that starts in the middle of a table or ends before the conclusion of a paragraph is almost useless even if it has a high similarity score. Good chunking treats document structure as information.
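A minimal version of structure-aware chunking: split on paragraph boundaries and pack whole paragraphs into chunks up to a size budget, rather than cutting at a fixed character offset. The budget and separator below are illustrative choices.

```python
# Structure-aware chunking sketch: paragraphs are never split mid-way;
# a paragraph that would overflow the budget starts a new chunk instead.

def chunk_by_paragraph(text, max_chars=500):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # paragraph stays whole, even if oversized
    if current:
        chunks.append(current)
    return chunks

doc = ("First paragraph about refunds.\n\n"
       "Second paragraph about returns.\n\n"
       "Third paragraph about shipping.")
pieces = chunk_by_paragraph(doc, max_chars=60)
```

A production version would extend the same idea to headings, tables, and list items, since those boundaries carry even more meaning than paragraph breaks.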
None of these are exotic. They're patterns that come up every time we build a RAG system that needs to work for more than demo purposes.
The data quality problem nobody talks about in demos
Here's what rarely makes it into the conference talks: your retrieval is only as good as your source data.
If your knowledge base contains outdated documents, contradictory policies, redundant content from five different team members answering the same question differently, and a healthy amount of internal jargon — your RAG system will faithfully retrieve all of it. The LLM will do its best, but it's working with garbage.
Before you optimize retrieval, you need to audit the content. That means:
- Removing or archiving stale documents
- Reconciling contradictions (two support articles that give different answers to the same question)
- Normalizing structure so similar content is formatted consistently
- Adding metadata that makes filtering actually work (document type, date, product area, access level)
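One way to make the metadata item concrete is to commit to a small per-document schema up front. The field names below are illustrative, not a standard:

```python
# Minimal per-document metadata schema sketch. Fields are examples of
# what makes filtering, staleness checks, and access control possible.

from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class DocMeta:
    doc_type: str      # e.g. "policy", "how-to", "release-note"
    updated: date      # drives staleness audits and date-range filters
    product_area: str  # enables pre-retrieval metadata filtering
    access_level: str  # e.g. "public", "internal"

meta = DocMeta(doc_type="policy", updated=date(2024, 6, 1),
               product_area="billing", access_level="public")
record = asdict(meta)  # ready to attach to each chunk in the vector store
```

Attaching this record to every chunk at ingestion time is what makes the filtering described earlier possible at query time.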
This is unglamorous work. Nobody builds a demo around it. But it's usually the difference between a system users trust and one they stop using after two weeks.
Evaluation: if you can't measure it, you can't improve it
A good RAG system has an evaluation pipeline from day one — not as an afterthought after you notice it's giving bad answers.
At minimum, you want:
Retrieval evaluation — given a set of test queries, did the system retrieve the right chunks? You can measure this with precision and recall against a labeled test set. This tells you whether your retrieval strategy is working independently of your generation quality.
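Precision and recall at k are straightforward to compute once you have labels. In the sketch below, `labeled` maps each test query to the set of chunk IDs a human marked relevant, and `retrieved` holds what the system actually returned; both are illustrative.

```python
# Retrieval-only evaluation: precision@k and recall@k against a labeled set.

def precision_recall_at_k(retrieved, relevant, k):
    top = retrieved[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical test set: one query, two human-labeled relevant chunks.
labeled = {"what's your refund policy?": {"c2", "c5"}}
retrieved = {"what's your refund policy?": ["c2", "c9", "c5", "c1"]}

for query, relevant in labeled.items():
    p, r = precision_recall_at_k(retrieved[query], relevant, k=4)
    # Here: 2 of the 4 retrieved chunks are relevant, and both relevant
    # chunks were found, so precision@4 = 0.5 and recall@4 = 1.0.
```

Because this measures retrieval in isolation, you can iterate on chunking and search strategy without ever calling the LLM.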
Answer faithfulness — does the generated answer actually follow from the retrieved context? This catches hallucinations that slip through when the model draws on training data instead of the retrieved chunks.
Answer relevance — is the answer actually useful for the question that was asked? A faithful answer can still miss the point.
Running these evaluations systematically before and after changes tells you whether you're making things better or worse. Without them, you're guessing.
Evaluation frameworks like RAGAS make this easier to set up, but the harder part is building the test set. That requires domain knowledge — someone who understands what "a good answer" looks like for your specific use case.
Latency and cost are design constraints, not afterthoughts
A RAG system that takes 8 seconds to respond is a RAG system that users stop using. Latency is a product quality issue, not just an engineering detail.
The main latency drivers in a RAG pipeline are embedding generation (especially at query time), the number of chunks you retrieve and re-rank, and the LLM response time. Each can be optimized, but there are tradeoffs.
Caching commonly asked queries can cut response times significantly. Smaller, faster models work fine for retrieval tasks where you don't need the full capability of a frontier model. Async prefetching can help when you can predict what a user is likely to ask next.
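A query cache can be as simple as normalizing the query string and memoizing the full pipeline result with a TTL. Here `answer_query` is a stand-in for your retrieve-and-generate pipeline; the TTL and normalization rules are illustrative.

```python
# Query-cache sketch: normalize, then cache the full pipeline result.

import time

CACHE_TTL_SECONDS = 3600
_cache = {}  # normalized query -> (timestamp, answer)

def normalize(query):
    # Collapse whitespace and case so trivial variants share a cache entry.
    return " ".join(query.lower().split())

def cached_answer(query, answer_query):
    key = normalize(query)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    answer = answer_query(query)  # the expensive part: retrieval + LLM call
    _cache[key] = (time.time(), answer)
    return answer

# Demo: the second, differently-spaced query hits the cache.
calls = []
def fake_pipeline(q):
    calls.append(q)
    return "answer"

cached_answer("What's your refund policy?", fake_pipeline)
cached_answer("  what's your  refund policy? ", fake_pipeline)
```

The catch is invalidation: when the underlying documents change, cached answers built from stale chunks need to be evicted, so tie the TTL to how often your knowledge base updates.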
Cost scales with volume in ways that surprise people. If you're planning to serve thousands of queries per day, run the numbers on embedding API costs, LLM token costs at your average context length, and vector database hosting before you commit to an architecture.
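Running the numbers doesn't require anything fancy. Every price in this back-of-envelope model is a placeholder, not a real quote; plug in your own providers' rates.

```python
# Back-of-envelope monthly cost model for a RAG pipeline at volume.
# All prices are hypothetical placeholders.

def monthly_cost(queries_per_day,
                 avg_context_tokens, avg_output_tokens,
                 llm_input_price_per_1k, llm_output_price_per_1k,
                 embed_tokens_per_query, embed_price_per_1k):
    per_query = (
        avg_context_tokens / 1000 * llm_input_price_per_1k    # prompt + chunks
        + avg_output_tokens / 1000 * llm_output_price_per_1k  # generated answer
        + embed_tokens_per_query / 1000 * embed_price_per_1k  # query embedding
    )
    return per_query * queries_per_day * 30

# Illustrative rates; note how context length dominates the total.
cost = monthly_cost(queries_per_day=5000,
                    avg_context_tokens=4000, avg_output_tokens=300,
                    llm_input_price_per_1k=0.003, llm_output_price_per_1k=0.015,
                    embed_tokens_per_query=20, embed_price_per_1k=0.0001)
```

With these placeholder numbers, input tokens (the retrieved context) account for most of the per-query cost, which is why retrieving fewer, better chunks is a cost optimization as well as a quality one.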
When RAG is the wrong tool
Not every question-answering problem needs RAG.
If your knowledge base is small and static (under a few hundred documents), just put it in the context window. Retrieval adds complexity without adding much value at that scale.
If your users are asking questions that require reasoning across multiple documents simultaneously — comparing two policies, synthesizing findings from ten research papers — standard RAG will struggle. You need either a better chunking strategy that preserves cross-document relationships, or a different architecture entirely.
If accuracy requirements are high enough that a wrong answer has serious consequences, RAG alone isn't sufficient. You need human review steps, confidence scoring, and fallback paths — not just better embeddings.
RAG is a good fit for large, dynamic knowledge bases where users need specific answers and some tolerance for occasional errors is acceptable. Outside that zone, there are usually better options.
What this means in practice
The teams that build reliable RAG systems don't treat it as a commodity problem you solve by picking the right framework. They invest in:
- Understanding the retrieval failure modes before assuming the LLM is the problem
- Data quality work that isn't glamorous but determines half the outcome
- Evaluation infrastructure that makes it possible to measure progress
- Architecture decisions that account for latency, cost, and accuracy together
If you're scoping a RAG system and want to know which of these will matter most for your specific use case, that's exactly the kind of question we dig into in the AI Automation Audit. A week of focused analysis, a clear picture of what to build (and what to skip), and a realistic roadmap.
Book a discovery call if you want to talk through what you're building.
Martin Tech Labs builds custom AI systems for early-stage founders and growing companies. We specialize in production-ready AI — not demos.