Explainer
How RAG works for business AI assistants (and when it's the wrong tool)
Retrieval-Augmented Generation is the technique behind most useful business AI assistants. We break down what it actually does, where it earns its keep, and the five situations where it's the wrong answer.
In short
RAG lets an AI model answer questions using your private documents, data and systems, without retraining the model. It is the right approach when content changes often, when answers need to cite a source, and when fine-tuning a model on your data is overkill. It is the wrong approach when the answer never lives in a document, when latency is critical, or when you are trying to change how the model reasons rather than what it knows.
Plain language
What RAG actually is
RAG, short for Retrieval-Augmented Generation, is a way to make a general-purpose AI model answer questions using your specific knowledge. Instead of stuffing every document you own into the prompt, or paying to fine-tune the model on your data, RAG fetches only the few passages that matter for the question, hands them to the model, and asks the model to answer using those passages as evidence.
Practically, you build a small pipeline that turns your documents into searchable vectors, looks up the most relevant ones for each question, and feeds them to the model alongside the user's question. The model then writes the answer, ideally citing the passages it used. The point is to ground generic AI in your actual content, so the assistant talks about your business, your policies, your tickets, your invoices, not the internet at large.
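To make that pipeline concrete, here is a minimal sketch in Python. It rests on loud assumptions: the `embed` function is a toy word-count vector standing in for a real embedding model, the three passages are invented, and the function names (`embed`, `cosine`, `retrieve`, `build_prompt`) are ours for illustration, not any library's API. The final model call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Number the passages so the model can cite them, and tell it to stay grounded."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the passages below and cite them by number. "
        "If the answer is not in the passages, say so.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}"
    )

# Invented example content, for illustration only.
PASSAGES = [
    "A refund is accepted within 30 days of purchase.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available Monday to Friday, 9am to 5pm.",
]

question = "Can I get a refund after purchase?"
top = retrieve(question, PASSAGES, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping the toy `embed` for a real embedding model and the in-memory list for a vector store changes the quality of the matches, not the shape of the pipeline.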
Why you would want this
Why businesses end up needing RAG
Three pain points that usually trigger the conversation.
Generic AI cannot see your data
Off-the-shelf assistants only know what was in their training set. Your policies, your tickets, your product catalogue, your supplier terms are all invisible to them. Without retrieval, the model either refuses, guesses, or makes something up.
Your knowledge keeps changing
Prices change. Policies get rewritten. New SKUs ship. Fine-tuning freezes a snapshot, which goes stale within weeks. RAG reads from an index of your live document store, so when you update a source and it is re-indexed, the assistant's answers update with it.
You need to trust the answer
Business users do not trust answers they cannot verify. RAG retrieves real passages from real documents, so the assistant can show its work: the answer, plus the source paragraph and the link to the original document. That single feature is usually the difference between adoption and abandonment.
Inside the pipeline
The four moving parts
A working RAG system is four steps wired together. Each one has real engineering decisions behind it.
1. Ingest
Pull the documents that should power the assistant: PDFs, wiki pages, knowledge base articles, ticket history, ERP data, CRM notes, internal SharePoint, anything the answer might live in. Then chunk them into passages small enough to be useful but large enough to keep context. The quality of this step decides the quality of the system more than any model choice.
2. Embed and index
Turn each passage into a vector using an embedding model, and store the vectors in a vector database. The vector is a numeric fingerprint of meaning: two passages saying the same thing in different words end up close together. Choices here are not cosmetic: which embedding model, which vector store, how often to re-embed when content changes.
3. Retrieve
When a user asks something, the question gets embedded and the vector store returns the few most relevant passages. Good retrieval also reranks the results, mixes in keyword search for proper-noun queries, and filters by user permissions so people only see what they are allowed to see. Retrieval quality is where most production RAG systems live or die.
4. Generate
The model receives the user's question plus the retrieved passages, and writes the answer using them as evidence. The prompt instructs the model to stick to the passages, cite which ones, and refuse if the answer is not in there. Good systems then show the citations to the user, so the answer is verifiable rather than a black box.
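Two of the decisions described in these steps are easy to show in code: chunking with overlap at ingest, and permission filtering at retrieval. The sketch below is one possible shape, not a recommended implementation. The `Passage` class, its `allowed_groups` field, and the word-based chunk sizes are all assumptions made for illustration; real systems often chunk by tokens, headings, or sentences instead.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # hypothetical field: groups allowed to read this passage

def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one so context is not cut mid-thought."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def visible_to(passages: list[Passage], user_groups: set[str]) -> list[Passage]:
    """Keep only passages the user may see. Run this before ranking,
    so restricted text never reaches the prompt."""
    return [p for p in passages if p.allowed_groups & user_groups]

# Invented example data.
text = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(text, size=5, overlap=2)

passages = [
    Passage("hr-policy", "Salary bands are reviewed yearly.", frozenset({"hr"})),
    Passage("handbook", "Office hours are 9 to 5.", frozenset({"hr", "staff"})),
]
staff_view = visible_to(passages, {"staff"})
print(chunks)
print([p.doc_id for p in staff_view])
```

Filtering before ranking, rather than hiding results afterwards, is the design choice that matters here: a passage the user cannot see should never be handed to the model in the first place.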
Read this before building
When RAG is the wrong tool
Five situations where RAG is overkill, the wrong abstraction, or actively harmful.
The answer does not live in a document
If the assistant needs to calculate something, query a live database, or trigger an action in your CRM, RAG alone will not do it. You need function calling, agents, or a hybrid: RAG for the knowledge, structured tools for the actions. Building pure RAG when the real need is workflow is the most common over-engineering trap.
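A hybrid can be sketched as a small dispatcher. The assumption here is that the model has already produced a structured "intent" saying whether it wants a tool or a knowledge lookup, which is roughly how function calling works in practice. Every name below (`open_invoice_total`, `search_passages`, `dispatch`, the intent fields) is hypothetical, and the retrieval function is a stub returning a fixed passage.

```python
def open_invoice_total(invoices: list[dict]) -> float:
    """Structured tool: computes a value from records instead of searching text."""
    return sum(inv["amount"] for inv in invoices if inv["status"] == "open")

def search_passages(question: str) -> list[str]:
    """Stand-in for the retrieval step; a real system would query the index."""
    return ["Invoices are due 30 days after issue."]

def dispatch(intent: dict, invoices: list[dict]):
    """Route the model's structured intent: tools for actions and calculations,
    retrieval for knowledge questions."""
    if intent["type"] == "tool" and intent["name"] == "open_invoice_total":
        return open_invoice_total(invoices)
    if intent["type"] == "knowledge":
        return search_passages(intent["question"])
    raise ValueError(f"unsupported intent: {intent!r}")

# Invented example records.
INVOICES = [
    {"amount": 100.0, "status": "open"},
    {"amount": 50.0, "status": "paid"},
    {"amount": 25.5, "status": "open"},
]

total = dispatch({"type": "tool", "name": "open_invoice_total"}, INVOICES)
passages = dispatch({"type": "knowledge", "question": "When are invoices due?"}, INVOICES)
print(total, passages)
```

The point of the split: "how much is outstanding?" is a calculation over records, so no amount of retrieval tuning will answer it reliably, while "when are invoices due?" is a knowledge question that retrieval handles well.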
Your knowledge base is tiny
If you have 20 pages of FAQ and they are stable, you do not need a retrieval pipeline. Just put them in the prompt. RAG infrastructure pays off when content is too large for a context window, or changes frequently. Below that threshold, it is complexity for no gain.
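A quick way to check which side of that threshold you are on is to estimate whether the whole knowledge base fits in the prompt. The sketch below uses a rough heuristic of about four characters per token for English text; that ratio, the 2,000-token reserve for the question and answer, the page size, and the context window sizes are all assumptions. Use your model's real tokenizer and documented limits for an actual decision.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return math.ceil(len(text) / chars_per_token)

def fits_in_prompt(documents: list[str], context_window_tokens: int,
                   reserved_tokens: int = 2000) -> bool:
    """True if all documents fit, leaving `reserved_tokens` for the
    instructions, the user's question, and the model's answer."""
    total = sum(estimate_tokens(d) for d in documents)
    return total <= context_window_tokens - reserved_tokens

# Invented example: 20 FAQ pages of about 3,000 characters each.
faq_pages = ["x" * 3000] * 20

print(fits_in_prompt(faq_pages, context_window_tokens=32_000))
print(fits_in_prompt(faq_pages, context_window_tokens=16_000))
```

If the first check passes comfortably and the content rarely changes, putting everything in the prompt is the simpler system to build and maintain.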
Latency is critical
Every RAG query adds an embedding step, a vector lookup, and often a reranking pass before generation starts. Together these add hundreds of milliseconds, sometimes seconds. For real-time agents handling live customer calls or sub-second decisions, you may need cached answers, smaller models, or a different architecture altogether.
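Before deciding, it helps to measure where the time goes. The sketch below times each stage separately and compares the total against a budget. The `time.sleep` calls are placeholders for real calls, their durations are made up, and the names `timed`, `over_budget`, and `answer` are ours; only the measurement pattern is the point.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record how long the wrapped block took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0

def over_budget(timings: dict[str, float], budget_ms: float) -> bool:
    """True if the stages together exceed the latency budget."""
    return sum(timings.values()) > budget_ms

def answer(question: str, timings: dict[str, float]) -> str:
    # Each sleep stands in for a real call; the durations are invented.
    with timed("embed", timings):
        time.sleep(0.005)
    with timed("vector_lookup", timings):
        time.sleep(0.005)
    with timed("rerank", timings):
        time.sleep(0.005)
    with timed("generate", timings):
        time.sleep(0.02)
    return "placeholder answer"

timings: dict[str, float] = {}
answer("Where is my order?", timings)
for stage, ms in timings.items():
    print(f"{stage}: {ms:.1f} ms")
print("over 1000 ms budget:", over_budget(timings, budget_ms=1000.0))
```

Per-stage numbers tell you which fix applies: a slow rerank can be dropped or cached, while slow generation points to a smaller model or a different architecture.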
You want to change how the model reasons
RAG changes what the model knows. It does not change how the model writes, what tone it uses, or how it reasons through a domain-specific problem. If you need a model that speaks in your brand voice or thinks like a domain expert in a niche, that is fine-tuning territory. The two techniques are complementary, not interchangeable.
Your source data is not trustworthy
RAG retrieves whatever you indexed. If your knowledge base is full of contradictions, outdated drafts, or unstructured email threads, the assistant will faithfully retrieve and surface that mess. Garbage in, citations out. Source curation matters before retrieval architecture.
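Curation can start with a simple filter that runs before anything is indexed. The sketch below drops unpublished documents and keeps only the newest version of each one. The `SourceDoc` fields (`doc_id`, `status`, `updated`) are hypothetical; your sources may not carry this metadata, in which case adding it is part of the data work.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SourceDoc:
    doc_id: str    # stable identifier shared by all versions of a document
    status: str    # hypothetical field, e.g. "published" or "draft"
    updated: date
    text: str

def curate(docs: list[SourceDoc]) -> list[SourceDoc]:
    """Keep only published documents, and only the newest version of each."""
    latest: dict[str, SourceDoc] = {}
    for doc in docs:
        if doc.status != "published":
            continue  # drafts never reach the index
        current = latest.get(doc.doc_id)
        if current is None or doc.updated > current.updated:
            latest[doc.doc_id] = doc
    return list(latest.values())

# Invented example documents.
DOCS = [
    SourceDoc("refund-policy", "published", date(2023, 1, 10), "Refunds within 14 days."),
    SourceDoc("refund-policy", "published", date(2024, 3, 1), "Refunds within 30 days."),
    SourceDoc("refund-policy", "draft", date(2024, 6, 1), "Refunds within 60 days."),
    SourceDoc("shipping", "published", date(2024, 2, 2), "Ships in 3 to 5 days."),
]

for doc in curate(DOCS):
    print(doc.doc_id, doc.updated, doc.text)
```

Without this step, all three refund policies would be indexed, and the assistant could cite any of them with equal confidence.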
RAG vs the alternatives
RAG, fine-tuning, prompt engineering
Three techniques people often confuse. They solve different problems and stack rather than compete.
| Dimension | RAG | Fine-tuning | Prompt engineering |
|---|---|---|---|
| Changes | What the model knows | How the model writes and reasons | How the model is instructed for one task |
| Cost to update | Re-index changed documents, typically minutes | Retrain the model, hours to days | Edit the prompt, instant |
| Best for | Private knowledge, frequent updates, citations | Style, tone, domain reasoning, structured outputs | Quick wins, single-task assistants, prototypes |
| Limit | Retrieval quality caps answer quality | Frozen snapshot, no live knowledge | Context window size, no memory between calls |
| Signal you need this | Users ask about company-specific facts | Output format or voice is repeatedly wrong | First prototype, before infrastructure |
What drives the bill
What a real implementation actually costs
We do not quote ranges in articles because the variance is real. The drivers, in order of impact: how messy the source data is (this dominates), how many distinct sources need to be unified (each one is its own ingest pipeline), how strict the permissions model is (per-user retrieval filtering is engineering work), whether the assistant has to take actions or just answer questions, and how much production load it has to hold.
The pattern we see: clients underestimate the data work and overestimate the AI work. The model is the easy part. The hard part is turning your scattered, inconsistent, half-indexed content into something a retrieval pipeline can actually use. A real number lives at the end of a 30-minute review, after we look at your actual sources, not before.
Frequently asked questions
What businesses ask before they build their first RAG assistant.
How is RAG different from just using ChatGPT?
ChatGPT (and any general assistant) only knows what was in its training data, plus whatever you paste into a single conversation. RAG plugs an assistant into your private knowledge: it can answer about your contracts, your policies, your tickets, your products, with citations to the actual source paragraph. ChatGPT is a generic assistant. RAG turns a model into your assistant.
Is RAG better than fine-tuning?
They solve different problems. RAG changes what the model knows. Fine-tuning changes how the model writes and reasons. The two combine well: RAG to ground the model in your live knowledge, fine-tuning to lock in tone and structured output formats. Fine-tuning alone for knowledge is a common mistake: it freezes a snapshot that goes stale within weeks.
How long does a RAG project take?
A working prototype that retrieves from one source can ship in a few weeks. A production system that handles multiple sources, permissions, monitoring, and quality evaluation takes longer, mostly because the data work is real. The model and infrastructure pieces are the fast part. Cleaning, structuring and chunking source content is what sets the pace.
What kind of sources can RAG read from?
Anything you can extract text from: PDFs, Word docs, wiki pages, knowledge base articles, support tickets, CRM notes, ERP records, internal SharePoint, Notion, Confluence, email archives. The harder question is permissions and freshness: who is allowed to see what, and how stale the answer can be. Both are solvable, but they shape the architecture.
Can RAG hallucinate?
Yes. It happens less often than with a raw model, but it still happens. Two scenarios: the retrieval missed the right passage and the model filled the gap; or the retrieval found something that looks relevant but is not. Good RAG systems mitigate this with strict prompts ('answer only from the passages, refuse if not in there'), citations the user can click, and an evaluation loop that catches regressions. Hallucination does not disappear, but it becomes auditable.
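An evaluation loop can begin with one number: how often the right document shows up in the top results for a set of known questions. The sketch below computes that hit rate. The golden set, the document ids, and `fake_retrieve` are invented stand-ins; in a real setup `retrieve` would call your actual retrieval step, and the golden set would come from questions your users really ask.

```python
from typing import Callable

def hit_rate_at_k(golden: list[tuple[str, str]],
                  retrieve: Callable[[str, int], list[str]],
                  k: int = 3) -> float:
    """Share of questions whose expected document id appears in the top-k results."""
    if not golden:
        return 0.0
    hits = sum(1 for question, expected in golden if expected in retrieve(question, k))
    return hits / len(golden)

# Invented retrieval results, standing in for a real retriever.
FAKE_RESULTS = {
    "What is the refund window?": ["policy-refunds", "policy-shipping"],
    "Who approves travel expenses?": ["hr-onboarding", "policy-shipping"],
    "How long does shipping take?": ["policy-shipping"],
}

def fake_retrieve(question: str, k: int) -> list[str]:
    return FAKE_RESULTS.get(question, [])[:k]

# Each entry: (question, id of the document that should be retrieved).
GOLDEN = [
    ("What is the refund window?", "policy-refunds"),
    ("Who approves travel expenses?", "policy-expenses"),
    ("How long does shipping take?", "policy-shipping"),
]

score = hit_rate_at_k(GOLDEN, fake_retrieve, k=2)
print(f"hit rate@2: {score:.2f}")
```

Run on every change to chunking, embeddings, or prompts, a falling score flags a retrieval regression before users see it. It measures retrieval only; whether the generated answer is faithful to the passages needs a separate check.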
Can RAG run on private infrastructure?
Yes. The model, the embedding service and the vector store can all run on private infrastructure if data residency or compliance requires it. The trade-off is engineering work: managed services are faster to ship, dedicated infrastructure gives you control. We help clients pick the right point on that spectrum based on actual data sensitivity, not theatre.
Thinking about building a RAG assistant?
Book a 30-minute review. We look at your actual sources, your real use case, and what a fit-for-purpose pipeline would look like. You leave with a 1-page recommendation tailored to your data and constraints, even if you don't engage us.