RAG: Index the knowledge, retrieve at query time, splice into the prompt. The library-card approach.
Fine-tuning: Bake the knowledge into the weights. Pay once at train time, ship a smaller prompt.
Long context: Paste everything relevant into the prompt. Let attention sort it out, pay per token every call.
Pick RAG when your knowledge base is large, changes often, or needs citations. The default for customer support, internal docs, and legal research where "what is your source?" matters.
Pick Fine-tuning when you need a fixed style, format, or high-QPS inference on a smaller base. Style transfer, domain tone, and compliance-sensitive outputs benefit from learned weights over prompts.
Pick Long Context when your corpus fits the window and iteration speed beats cost. A 500-page PDF plus Claude 4.x in one API call beats a 3-week RAG buildout for most internal tools.
Combine in production - the default stack is: fine-tune the base model for tone and format, use RAG to inject fresh knowledge with citations, and let long context hold the conversation history plus the retrieved chunks. Every major product you use already does this.
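A minimal sketch of that combined stack, assuming the fine-tuned model is served behind an OpenAI-compatible endpoint (as vLLM and most serving stacks provide). The model name `acme/support-llama-lora`, the local server URL, and the `retrieve()` stub are all hypothetical; `retrieve()` stands in for the vector-DB lookup shown in the code comparison later in this article.

```python
# Hybrid stack: fine-tuned weights + RAG retrieval + long-context history.
# Sketch only - names and endpoint are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",  # e.g. a local vLLM server
                api_key="EMPTY")
history: list[dict] = []  # long context: the running conversation

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Stand-in for the Qdrant lookup shown in the RAG snippet below."""
    raise NotImplementedError

def answer(question: str) -> str:
    chunks = retrieve(question)                     # RAG: fresh, citable facts
    context = "\n\n".join(chunks)
    history.append({"role": "user",
                    "content": f"Context:\n{context}\n\nQ: {question}"})
    resp = client.chat.completions.create(
        model="acme/support-llama-lora",            # fine-tuned: tone + format
        messages=history,                           # whole dialogue in context
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```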
Each vertex is one approach. A point inside the triangle is a workload whose ideal mix leans toward the nearest vertex. Pure vertices are rare in production - most real systems sit inside the triangle, with some weight on all three.
Illustrative positioning drawn from published reference architectures (Perplexity, Claude Projects, Notion AI, GitHub Copilot Workspace, enterprise RAG vendors). Real weights at any company shift quarterly as model context windows grow and fine-tuning costs drop. Use this as a starting map, not a measurement.
A 6-step mental model that walks from "where does the knowledge live" to "how do I stack all three in production."
Three storage locations: external index (RAG), model weights (fine-tuning), prompt buffer (long context). Every decision follows from where you put the data - not from which technique is fashionable this month.
**Freshness.** Why it matters: RAG reads the latest vector DB entry; fine-tuning only knows what was in the training set; long context is fresh but only for what you paste. RAG wins decisively on data that changes often.
**Build effort.** Why it matters: RAG needs a full pipeline: embedding model, vector DB, chunking strategy, retriever, reranker. Fine-tuning needs a clean dataset and a GPU. Long context needs nothing - paste the docs and ship.
**Per-query cost.** Why it matters: Fine-tuning ships the smallest prompt because knowledge lives in weights. Long context pays for every token every call. RAG sits in the middle - only retrieved chunks hit the model. (A back-of-envelope cost sketch follows this list.)
**Latency.** Why it matters: Fine-tuning wins latency - small prompt, no retrieval hop. Long context is slowest because attention over 200k+ tokens is expensive even with flash attention.
**Scale.** Why it matters: RAG scales to terabytes of knowledge because only the top-k retrieved chunks hit the model. Long context is hard-capped by the window. Fine-tuning compresses everything into weights with inevitable loss.
**Style and format.** Why it matters: If you need the model to write like your brand voice, always emit a specific JSON shape, or never apologize - fine-tuning is the right tool. Prompts can do a lot, but learned examples do more.
**Citations.** Why it matters: RAG naturally produces "here is the source chunk" citations users can click. Fine-tuning produces confidently-wrong answers with no provenance. Long context can cite the prompt but not itself.
**Update cost.** Why this is not a win: RAG and long context are cheap to update; fine-tuning is not. But long context can only update what you paste every call - you still pay tokens.
**Raw quality.** Why this is not a win: All three can reach high quality on domain knowledge when used correctly. RAG fails on bad retrieval; fine-tuning fails on stale data; long context fails on the "needle in a haystack" problem.
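To make the cost step concrete, a back-of-envelope sketch. Every per-token price and token count below is an illustrative assumption, not a quote from any provider's price list, so the outputs will not match the table below exactly.

```python
# Back-of-envelope per-query cost. All numbers are illustrative assumptions.
FRONTIER_IN, FRONTIER_OUT = 3 / 1e6, 15 / 1e6  # $/token, frontier API model
SMALL_IN, SMALL_OUT = 0.1 / 1e6, 0.4 / 1e6     # $/token, self-hosted small model

def per_query(prompt_tokens: int, output_tokens: int,
              p_in: float, p_out: float) -> float:
    return prompt_tokens * p_in + output_tokens * p_out

# Long context: the whole 50k-token corpus rides along in every prompt
print(f"long context: ${per_query(50_000, 500, FRONTIER_IN, FRONTIER_OUT):.4f}")  # $0.1575
# RAG: only ~2k tokens of retrieved chunks hit the frontier model
print(f"RAG:          ${per_query(2_000, 500, FRONTIER_IN, FRONTIER_OUT):.4f}")   # $0.0135
# Fine-tuned small model: tiny prompt, cheap per-token price
print(f"fine-tuned:   ${per_query(100, 500, SMALL_IN, SMALL_OUT):.4f}")           # $0.0002
```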
Illustrative shapes calibrated against published numbers from the Meta RAG paper (Lewis et al. 2020), Anthropic's long-context evaluations, and community benchmarks on LoRA / DoRA / QDoRA fine-tuning runs. Exact numbers depend on model size, corpus, retrieval quality, and hardware.
| Operation | Dataset | RAG | Fine-tuning | Long context | Delta |
|---|---|---|---|---|---|
| Time-to-first-answer (prototype) | From zero to working demo | ~2-5 days (pipeline build) | ~1-3 days (dataset + training) | ~2 hours (paste corpus) | ~10-50x |
| Per-query inference cost | 50k-token knowledge, 2026 frontier pricing | ~$0.003 (retrieved top-k + answer) | ~$0.0005 (small prompt, fine-tuned model) | ~$0.15 (full 50k prompt) | ~6-300x |
| Answer quality on static FAQ | Internal docs, 500 questions | ~89% accuracy (good retrieval) | ~92% accuracy (fine-tuned on FAQ) | ~91% accuracy (full docs in context) | Tradeoff |
| Answer quality on last-week's data | Fresh news, 100 questions | ~85% (re-indexed daily) | ~12% (pre-training cutoff) | ~88% (paste the news in) | Tradeoff |
| Needle-in-haystack recall at 200k tokens | Claude 4.x, single-fact lookup | ~100% (retrieval is exact) | N/A (facts not in weights) | ~75-90% (attention drift) | - |
Each approach gets the same question: "what is our PTO policy?" against the same 200-page employee handbook. The code below shows the minimum viable version of each. Production implementations of RAG add a reranker, hybrid search, and query rewriting; production fine-tuning uses evaluation harnesses and multiple epochs; production long context adds cache headers and prompt caching.
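The RAG snippet additionally assumes the handbook was already chunked, embedded, and loaded into a Qdrant collection named `handbook`. A minimal one-time indexing sketch, assuming naive fixed-size character chunking (real pipelines chunk on document structure, not character counts):

```python
# One-time RAG indexing: chunk, embed, upsert (naive fixed-size chunks)
import uuid
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

text = open("handbook.txt").read()
chunks = [text[i:i + 2000] for i in range(0, len(text), 2000)]

qdrant.create_collection(
    collection_name="handbook",
    vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
)  # text-embedding-3-large produces 3072-dim vectors

# Embed in one batch call (split into smaller batches for large corpora)
embeds = client.embeddings.create(model="text-embedding-3-large", input=chunks)
qdrant.upsert(
    collection_name="handbook",
    points=[PointStruct(id=str(uuid.uuid4()), vector=e.embedding,
                        payload={"text": c})
            for c, e in zip(chunks, embeds.data)],
)
```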
```python
# RAG: embed once, retrieve per query, generate
# OpenAI for embeddings; Anthropic for generation. qdrant-client 1.17+ uses query_points.
from openai import OpenAI
from anthropic import Anthropic
from qdrant_client import QdrantClient

embed_client = OpenAI()
gen_client = Anthropic()
qdrant = QdrantClient(url="http://localhost:6333")

def answer(question: str) -> str:
    # Retrieve top 5 relevant chunks
    q_embed = embed_client.embeddings.create(
        model="text-embedding-3-large",
        input=question,
    ).data[0].embedding
    hits = qdrant.query_points(
        collection_name="handbook",
        query=q_embed,
        limit=5,
    ).points
    context = "\n\n".join(h.payload["text"] for h in hits)
    msg = gen_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system="Answer using ONLY this context.",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    )
    return msg.content[0].text
```
```python
# Fine-tuning: train once on handbook Q&A pairs, ship small prompts
# ---- One-time training step ----
# peft + trl, 2026 style:
# dataset = load_dataset("json", data_files="handbook_qa.jsonl")["train"]
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.1-8B-Instruct",
#     quantization_config=bnb_config)  # 4-bit BitsAndBytesConfig
# lora = LoraConfig(r=32, lora_alpha=64, target_modules="all-linear")
# model = get_peft_model(model, lora)
# trainer = SFTTrainer(model=model, train_dataset=dataset, max_seq_length=2048)
# trainer.train(); model.push_to_hub("acme/handbook-llama-lora")
# ---- Runtime ----
from transformers import pipeline

qa = pipeline("text-generation", model="acme/handbook-llama-lora")

def answer(question: str) -> str:
    # return_full_text=False returns only the completion, not the prompt
    return qa(f"Q: {question}\nA:", max_new_tokens=256,
              return_full_text=False)[0]["generated_text"]
```
```python
# Long context: paste entire handbook every call
from anthropic import Anthropic

client = Anthropic()
with open("handbook.txt") as f:
    HANDBOOK = f.read()  # ~180k tokens, fits in Claude 4.x

def answer(question: str) -> str:
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": f"You are a policy assistant. Knowledge base:\n{HANDBOOK}",
                "cache_control": {"type": "ephemeral"},  # prompt caching
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text
```

Note: RAG is the most code but scales to millions of documents. Fine-tuning has a one-time training cost and then ships the tiniest inference call. Long context is the shortest but pays the full prompt price on every call (prompt caching reduces that cost but does not eliminate it). Production systems typically run all three.
```python
# RAG: one write to the vector DB, every future query sees it
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from openai import OpenAI

qdrant = QdrantClient(url="http://localhost:6333")
client = OpenAI()

def update_policy(chunk_id: str, new_text: str):
    embed = client.embeddings.create(
        model="text-embedding-3-large",
        input=new_text,
    ).data[0].embedding
    qdrant.upsert(
        collection_name="handbook",
        points=[PointStruct(id=chunk_id, vector=embed,
                            payload={"text": new_text})],
    )
# Done. Next query retrieves the updated chunk.
```
```python
# Fine-tuning: append to dataset, re-run training, redeploy
# 1. Update training data
with open("handbook_qa.jsonl", "a") as f:
    f.write('{"q": "new policy?", "a": "new answer"}\n')
# 2. Retrain the LoRA adapter (hours on a GPU)
# trainer = SFTTrainer(...); trainer.train()
# Costs ~$50-500 depending on dataset and model size.
# 3. Push new adapter, roll out in production
# model.push_to_hub("acme/handbook-llama-lora-v2")
# Update model serving config to point at v2, then A/B test.
```
with open("handbook.txt", "w") as f:
f.write(updated_handbook_text)
# Every subsequent call to answer() reads the new handbook.
# Prompt caching will be invalidated - first call after update
# pays full prompt cost again, subsequent calls are cached.Note: The update story is where RAG and long context pull far ahead. RAG is a single DB write. Long context is a file edit. Fine-tuning is a full retraining cycle and a redeploy. For anything that changes weekly or faster, learned weights are the wrong storage layer.
**What is the difference between RAG, fine-tuning, and long context?**
They are three ways to give an LLM knowledge. RAG stores knowledge in an external vector database and retrieves relevant chunks at query time. Fine-tuning updates the model weights so knowledge is baked in. Long context pastes everything into the prompt every call. RAG optimizes for large and fresh corpora; fine-tuning for style, format, and low per-call cost; long context for simplicity.
**Which approach should I start with?**
In 2026, start with RAG. It is the default for production AI systems because your knowledge changes, you need citations, and the infrastructure is mature (Pinecone, Qdrant, Weaviate, pgvector). Add fine-tuning when prompt engineering cannot fix style, format, or latency problems - not as a substitute for retrieval.
**Does long context make RAG obsolete?**
For small, static knowledge bases that fit in 200k tokens: yes, long context eats RAG's lunch. For large, frequently-updated corpora: no. RAG still wins decisively at terabyte-scale knowledge, citation-heavy workloads, and anywhere retrieval recall matters more than attention coverage. The two are complementary, not competitive.
**Can I combine RAG and fine-tuning?**
Yes, and it is the default production stack. Fine-tune the base model for domain tone and format (LoRA or DoRA is cheap), then use RAG at inference time to inject fresh citable knowledge. This is what Perplexity, Notion AI, Claude Projects, and most enterprise AI assistants do in 2026.
**Why do long-context models miss facts in the middle of the prompt?**
Transformer attention does not weight all tokens equally. Models attend most strongly to the beginning and end of the prompt, and underweight the middle. On 200k+ token contexts, needle-in-haystack recall drops from near-100% at the edges to 70-80% in the middle. This is why RAG still wins on precise fact retrieval from large corpora.
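That drop-off is typically measured with a needle-in-a-haystack probe: plant one fact at varying depths inside filler text and check whether the model retrieves it. A minimal sketch, where the filler text, the planted fact, and the substring pass check are all illustrative simplifications of real evaluation harnesses:

```python
# Needle-in-a-haystack probe (illustrative; real evals use many needles,
# many context lengths, and graded scoring rather than substring match)
from anthropic import Anthropic

client = Anthropic()
NEEDLE = "The magic deployment code is 7481."
FILLER = "The quick brown fox jumps over the lazy dog. " * 20_000  # roughly 200k tokens

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):  # fraction of the way into the prompt
    pos = int(len(FILLER) * depth)
    haystack = FILLER[:pos] + "\n" + NEEDLE + "\n" + FILLER[pos:]
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=50,
        messages=[{"role": "user",
                   "content": f"{haystack}\n\nWhat is the magic deployment code?"}],
    )
    found = "7481" in msg.content[0].text
    print(f"depth {depth:.0%}: {'recalled' if found else 'missed'}")
```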
**Is fine-tuning dead?**
Absolutely not. LoRA, QLoRA, and DoRA made fine-tuning cheaper than ever, and every production stack that needs a specific brand voice, format, or low inference cost uses it. What changed is that fine-tuning is no longer the default answer to "how do I inject knowledge" - RAG took that job. Fine-tuning is now the right tool for style, format, and high-QPS inference.
**What does each approach cost?**
RAG: embedding costs plus per-query retrieval plus generation - typically $0.001-$0.01 per query. Fine-tuning: $50-$5,000 one-time for a LoRA / QLoRA run, then near-zero marginal inference cost. Long context: full per-token input pricing every call - $0.05-$0.50 per query depending on context size and model tier. At scale, fine-tuning is cheapest per call; at low volume, long context is cheapest total.
**Which approach handles citations best?**
RAG wins on citations by a wide margin. Every answer can point at the exact retrieved source chunks, with document IDs, line numbers, and relevance scores. Fine-tuning has no citation mechanism at all - the model says things confidently with no provenance. Long context can cite back to the pasted prompt but cannot differentiate between "I learned this in pre-training" and "you just told me this."
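A sketch of what RAG citations look like in code, mirroring the RAG snippet from the code comparison above. The `doc_id` payload field is an assumption about how the chunks were indexed; swap in whatever metadata your indexer stores.

```python
# RAG with citations: return source chunks and scores alongside the answer
from openai import OpenAI
from anthropic import Anthropic
from qdrant_client import QdrantClient

embed_client, gen_client = OpenAI(), Anthropic()
qdrant = QdrantClient(url="http://localhost:6333")

def answer_with_citations(question: str) -> dict:
    q_embed = embed_client.embeddings.create(
        model="text-embedding-3-large", input=question,
    ).data[0].embedding
    hits = qdrant.query_points(collection_name="handbook",
                               query=q_embed, limit=5).points
    # Number each chunk so the model can cite it as [n]
    context = "\n\n".join(f"[{i}] {h.payload['text']}"
                          for i, h in enumerate(hits, 1))
    msg = gen_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system="Answer using ONLY this context. Cite sources as [n].",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQ: {question}"}],
    )
    return {
        "answer": msg.content[0].text,
        "citations": [{"n": i,
                       "doc_id": h.payload.get("doc_id"),  # assumed payload field
                       "score": h.score}
                      for i, h in enumerate(hits, 1)],
    }
```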