Explainer

How RAG works for business AI assistants (and when it's the wrong tool)

Retrieval-Augmented Generation is the technique behind most useful business AI assistants. We break down what it actually does, where it earns its keep, and the five situations where it's the wrong answer.

See our internal AI assistant service

In short

RAG lets an AI model answer questions using your private documents, data and systems, without retraining the model. It is the right approach when content changes often, when answers need to cite a source, and when fine-tuning a model on your data is overkill. It is the wrong approach when the answer never lives in a document, when latency is critical, or when you are trying to change how the model reasons rather than what it knows.

Plain language

What RAG actually is

RAG, short for Retrieval-Augmented Generation, is a way to make a general-purpose AI model answer questions using your specific knowledge. Instead of stuffing every document you own into the prompt, or paying to fine-tune the model on your data, RAG fetches only the few passages that matter for the question, hands them to the model, and asks the model to answer using those passages as evidence.

Practically, you build a small pipeline that turns your documents into searchable vectors, looks up the most relevant ones for each question, and feeds them to the model alongside the user's question. The model then writes the answer, ideally citing the passages it used. The point is to ground generic AI in your actual content, so the assistant talks about your business, your policies, your tickets, your invoices, not the internet at large.

Why you would want this

Why businesses end up needing RAG

Three pain points that usually trigger the conversation.

Generic AI cannot see your data

Off-the-shelf assistants only know what was in their training set. Your policies, your tickets, your product catalogue, your supplier terms are all invisible to them. Without retrieval, the model either refuses, guesses, or makes something up.

Your knowledge keeps changing

Prices change. Policies get rewritten. New SKUs ship. Fine-tuning freezes a snapshot, which goes stale within weeks. RAG reads from your live document store, so once an updated source is re-indexed, the assistant answers from the new version.

You need to trust the answer

Business users do not trust answers they cannot verify. RAG retrieves real passages from real documents, so the assistant can show its work: the answer, plus the source paragraph and the link to the original document. That single feature is usually the difference between adoption and abandonment.

Inside the pipeline

The four moving parts

A working RAG system is four steps wired together. Each one has real engineering decisions behind it.

1. Ingest

Pull the documents that should power the assistant: PDFs, wiki pages, knowledge base articles, ticket history, ERP data, CRM notes, internal SharePoint, anything the answer might live in. Then chunk them into passages small enough to be useful but large enough to keep context. The quality of this step decides the quality of the system more than any model choice.

2. Embed and index

Turn each passage into a vector using an embedding model, and store the vectors in a vector database. The vector is a numeric fingerprint of meaning: two passages saying the same thing in different words end up close together. Choices here are not cosmetic: which embedding model, which vector store, how often to re-embed when content changes.

3. Retrieve

When a user asks something, the question gets embedded and the vector store returns the few most relevant passages. Good retrieval also reranks the results, mixes in keyword search for proper-noun queries, and filters by user permissions so people only see what they are allowed to see. Retrieval quality is where most production RAG systems live or die.

4. Generate

The model receives the user's question plus the retrieved passages, and writes the answer using them as evidence. The prompt instructs the model to stick to the passages, cite which ones, and refuse if the answer is not in there. Good systems then show the citations to the user, so the answer is verifiable rather than a black box.
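
The four steps above can be sketched end to end in a few dozen lines. This is a minimal illustration under loud assumptions, not a production design: the embedding is a toy word counter standing in for a real embedding model, the index is a Python list standing in for a vector database, the sample documents are invented, and every name (`Passage`, `chunk`, `embed`, `index`, `retrieve`, `build_prompt`, `allowed_groups`) is hypothetical. The sketch stops at the prompt, because the model call depends on your provider.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical permission tag per passage
    vector: Counter


def chunk(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Step 1: split a document into overlapping word windows."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Step 2 (toy): word counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def index(docs: dict[str, tuple[str, frozenset]]) -> list[Passage]:
    """Steps 1 and 2 together: chunk every document and embed each chunk."""
    return [Passage(doc_id, piece, groups, embed(piece))
            for doc_id, (text, groups) in docs.items()
            for piece in chunk(text)]


def retrieve(question: str, passages: list[Passage],
             user_groups: set, k: int = 3) -> list[Passage]:
    """Step 3: filter by permission first, then rank by similarity."""
    q = embed(question)
    scored = [(cosine(q, p.vector), p)
              for p in passages if p.allowed_groups & user_groups]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question: str, hits: list[Passage]) -> str:
    """Step 4: the prompt the model would receive."""
    sources = "\n".join(f"[{i}] ({p.doc_id}) {p.text}"
                        for i, p in enumerate(hits, 1))
    return ("Answer using only the passages below. Cite passage numbers. "
            "If the answer is not in the passages, say you do not know.\n\n"
            f"{sources}\n\nQuestion: {question}")


# Invented sample content, for illustration only.
docs = {
    "refund-policy": ("Refunds are issued within 14 days of purchase.",
                      frozenset({"support"})),
    "salary-bands": ("Salary bands are reviewed every January.",
                     frozenset({"hr"})),
}
passages = index(docs)
question = "How many days do refunds take?"
hits = retrieve(question, passages, user_groups={"support"})
prompt = build_prompt(question, hits)
```

Note the order in `retrieve`: permissions are applied before ranking, so a passage the user may not see never reaches the prompt, however relevant it is.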

Read this before building

When RAG is the wrong tool

Five situations where RAG is overkill, the wrong abstraction, or actively harmful.

The answer does not live in a document

If the assistant needs to calculate something, query a live database, or trigger an action in your CRM, RAG alone will not do it. You need function calling, agents, or a hybrid: RAG for the knowledge, structured tools for the actions. Building pure RAG when the real need is workflow is the most common over-engineering trap.
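
A hybrid can be sketched as a small dispatcher. In the sketch below, the `request` dictionary stands in for the structured tool call that a model with function calling would emit; the tool `order_total`, the `TOOLS` registry and `answer_from_documents` are all hypothetical placeholders, not any library's API.

```python
from typing import Callable


def order_total(quantity: int, unit_price: float) -> float:
    """Hypothetical tool: a calculation the model should not guess at."""
    return round(quantity * unit_price, 2)


# Registry of structured tools the assistant is allowed to call.
TOOLS: dict[str, Callable[..., object]] = {"order_total": order_total}


def answer_from_documents(question: str) -> str:
    """Placeholder for the retrieval path (RAG)."""
    return f"RAG answer for: {question}"


def handle(request: dict) -> object:
    """Dispatch to a registered tool if one is named, else fall back to RAG."""
    tool = TOOLS.get(request.get("tool", ""))
    if tool is not None:
        return tool(**request.get("args", {}))
    return answer_from_documents(request["question"])


total = handle({"tool": "order_total",
                "args": {"quantity": 3, "unit_price": 19.99}})
reply = handle({"question": "What is our refund policy?"})
```

The calculation goes through code, the policy question goes through retrieval, and neither path has to pretend to be the other.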

Your knowledge base is tiny

If you have 20 pages of FAQ and they are stable, you do not need a retrieval pipeline. Just put them in the prompt. RAG infrastructure pays off when content is too large for a context window, or changes frequently. Below that threshold, it is complexity for no gain.
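
A quick way to check which side of that threshold you are on is to estimate whether the whole knowledge base fits in the prompt. The sketch below assumes roughly four characters per token, a rough rule of thumb for English text that varies by tokenizer, and the window sizes in the example are placeholders rather than any particular model's limits. The function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate; real counts depend on the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_prompt(documents: list[str], context_window: int,
                   reserve_for_answer: int = 1000) -> bool:
    """True if all documents fit, leaving room for the model's answer."""
    total = sum(estimate_tokens(d) for d in documents)
    return total <= context_window - reserve_for_answer


# 20 stand-in pages of 4,000 characters each, about 20,000 tokens in total.
faq_pages = ["x" * 4000] * 20
small_enough = fits_in_prompt(faq_pages, context_window=128_000)
too_big = not fits_in_prompt(faq_pages, context_window=16_000)
```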

Latency is critical

Every RAG query adds an embedding step, a vector lookup, often a reranking pass, then generation. That is hundreds of milliseconds, sometimes seconds. For real-time agents handling live customer calls or sub-second decisions, you may need cached answers, smaller models, or a different architecture altogether.
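
Before changing the architecture, it helps to measure where the time goes. The sketch below times each stage of a query and compares the total against a budget. The `time.sleep` calls are stand-ins for the real stages, so the durations are not measurements of any real system, and the names `timed` and `run_query` are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000


def run_query(budget_ms: float) -> tuple[dict, bool]:
    """Run one query and report per-stage timings and whether it blew the budget."""
    timings: dict[str, float] = {}
    # Each sleep is a placeholder for the real call at that stage.
    with timed("embed_question", timings):
        time.sleep(0.005)
    with timed("vector_lookup", timings):
        time.sleep(0.005)
    with timed("rerank", timings):
        time.sleep(0.005)
    with timed("generate", timings):
        time.sleep(0.02)
    return timings, sum(timings.values()) > budget_ms


timings, over_budget = run_query(budget_ms=1.0)
```

Per-stage numbers tell you whether the fix is a cache, a smaller model, or dropping the reranking pass.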

You want to change how the model reasons

RAG changes what the model knows. It does not change how the model writes, what tone it uses, or how it reasons through a domain-specific problem. If you need a model that speaks in your brand voice or thinks like a domain expert in a niche, that is fine-tuning territory. The two techniques are complementary, not interchangeable.

Your source data is not trustworthy

RAG retrieves whatever you indexed. If your knowledge base is full of contradictions, outdated drafts, or unstructured email threads, the assistant will faithfully retrieve and surface that mess. Garbage in, citations out. Source curation matters before retrieval architecture.
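
Part of that curation can be automated before anything is indexed. The sketch below assumes each document carries a `status` and an `updated` date, which is hypothetical metadata your sources may or may not have, and keeps only the newest published version of each document. It does not resolve contradictions between different documents; that still needs a human owner.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    doc_id: str
    status: str    # hypothetical metadata: "published" or "draft"
    updated: date
    text: str


def curate(docs: list[Doc]) -> list[Doc]:
    """Keep only published documents, and only the newest version of each id."""
    latest: dict[str, Doc] = {}
    for d in docs:
        if d.status != "published":
            continue
        if d.doc_id not in latest or d.updated > latest[d.doc_id].updated:
            latest[d.doc_id] = d
    return list(latest.values())


# Invented sample documents, for illustration only.
raw = [
    Doc("returns", "published", date(2023, 1, 10), "Old returns policy."),
    Doc("returns", "published", date(2024, 3, 2), "Current returns policy."),
    Doc("pricing", "draft", date(2024, 5, 1), "Unapproved pricing draft."),
]
to_index = curate(raw)
```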

RAG vs the alternatives

RAG, fine-tuning, prompt engineering

Three techniques people often confuse. They solve different problems and stack rather than compete.

Dimension	RAG	Fine-tuning	Prompt engineering
Changes	What the model knows	How the model writes and reasons	How the model is instructed for one task
Cost to update	Re-index changed documents, typically minutes	Retrain the model, hours to days	Edit the prompt, instant
Best for	Private knowledge, frequent updates, citations	Style, tone, domain reasoning, structured outputs	Quick wins, single-task assistants, prototypes
Limit	Retrieval quality caps answer quality	Frozen snapshot, no live knowledge	Context window size, no memory between calls
Signal you need this	Users ask about company-specific facts	Output format or voice is repeatedly wrong	First prototype, before infrastructure

What drives the bill

What a real implementation actually costs

We do not quote ranges in articles because the variance is real. The drivers, in order of impact: how messy the source data is (this dominates), how many distinct sources need to be unified (each one is its own ingest pipeline), how strict the permissions model is (per-user retrieval filtering is engineering work), whether the assistant has to take actions or just answer questions, and how much production load it has to hold.

The pattern we see: clients underestimate the data work and overestimate the AI work. The model is the easy part. The hard part is turning your scattered, inconsistent, half-indexed content into something a retrieval pipeline can actually use. A real number lives at the end of a 30-minute review, after we look at your actual sources, not before.

Frequently asked questions

What businesses ask before they build their first RAG assistant.

How is RAG different from just using ChatGPT?

ChatGPT (and any general assistant) only knows what was in its training data, plus whatever you paste into a single conversation. RAG plugs an assistant into your private knowledge: it can answer about your contracts, your policies, your tickets, your products, with citations to the actual source paragraph. ChatGPT is a generic assistant. RAG turns a model into your assistant.

Is RAG better than fine-tuning?

They solve different problems. RAG changes what the model knows. Fine-tuning changes how the model writes and reasons. Many production assistants combine both: RAG to ground the model in your live knowledge, fine-tuning to lock in tone and structured output formats. Fine-tuning alone for knowledge is a common mistake: it freezes a snapshot that goes stale within weeks.

How long does a RAG project take?

A working prototype that retrieves from one source can ship in a few weeks. A production system that handles multiple sources, permissions, monitoring, and quality evaluation takes longer, mostly because the data work is real. The model and infrastructure pieces are the fast part. Cleaning, structuring and chunking source content is what sets the pace.

What kind of sources can RAG read from?

Anything you can extract text from: PDFs, Word docs, wiki pages, knowledge base articles, support tickets, CRM notes, ERP records, internal SharePoint, Notion, Confluence, email archives. The harder question is permissions and freshness: who is allowed to see what, and how stale the answer can be. Both are solvable, but they shape the architecture.

Can RAG hallucinate?

Yes, less often than a raw model, but it still can. Two scenarios: the retrieval missed the right passage and the model filled the gap; or the retrieval found something that looks relevant but is not. Good RAG systems mitigate this with strict prompts ('answer only from the passages, refuse if not in there'), citations the user can click, and an evaluation loop that catches regressions. Hallucination does not disappear, but it becomes auditable.
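
The evaluation loop can start very small. The sketch below computes a retrieval hit rate over a handful of labelled questions: for each question, did the expected document show up in what was retrieved? The test cases, the `toy_retrieve` function and the baseline value are invented for illustration; in practice you would plug in your own retriever and a question set written by people who know the content.

```python
from typing import Callable


def hit_rate(cases: list[tuple[str, str]],
             retrieve: Callable[[str], list[str]]) -> float:
    """Fraction of questions whose expected doc id appears in the results."""
    if not cases:
        return 0.0
    hits = sum(1 for question, expected in cases
               if expected in retrieve(question))
    return hits / len(cases)


def toy_retrieve(question: str) -> list[str]:
    """Stand-in retriever: matches a keyword per document."""
    keywords = [("refund-policy", "refund"), ("shipping-terms", "shipping")]
    q = question.lower()
    return [doc_id for doc_id, keyword in keywords if keyword in q]


cases = [
    ("How do refunds work?", "refund-policy"),
    ("What are the shipping costs?", "shipping-terms"),
    ("Can I return an opened item?", "refund-policy"),  # no keyword match
]
score = hit_rate(cases, toy_retrieve)
baseline = 0.9  # hypothetical score from the previous release
regressed = score < baseline
```

The third case is the useful one: the question is about refunds but never uses the word, so a keyword-style retriever misses it. Running the same set after every change to chunking, embeddings or prompts is what catches regressions.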

Can RAG run on private infrastructure?

Yes. The model, the embedding service and the vector store can all run on private infrastructure if data residency or compliance requires it. The trade-off is engineering work: managed services are faster to ship, dedicated infrastructure gives you control. We help clients pick the right point on that spectrum based on actual data sensitivity, not theatre.

Thinking about building a RAG assistant?

Book a 30-minute review. We look at your actual sources, your real use case, and what a fit-for-purpose pipeline would look like. You leave with a 1-page recommendation tailored to your data and constraints, even if you don't engage us.

Email contact@morsof.com