Introduction to RAG

March 28, 2026
#genai #ai #llm #rag #embeddings

LLMs are trained on public data up to a cutoff date. They don’t know about your company’s documentation, your product’s API, or the email you received this morning. RAG (Retrieval-Augmented Generation) solves this by fetching relevant information at query time and feeding it to the model as context.

RAG is the most important pattern in applied GenAI. It’s how you build chatbots that answer questions about your docs, search engines that understand intent, and assistants that stay grounded in facts instead of hallucinating.

This tutorial covers the concepts and architecture. You should be familiar with Embeddings & Vector Search and Calling LLM APIs with JavaScript (or the Python equivalent) before reading this.

The Problem RAG Solves

Imagine you’re building a support bot for your product. You try this:

User: How do I reset my password?

The LLM gives a generic answer about password resets that doesn’t match your product’s actual flow. It might even invent a “Settings > Security > Reset Password” path that doesn’t exist in your app.

You could put your entire documentation in the system prompt, but:

  • Your docs are 500 pages — they won’t fit in the context window
  • You’d pay for all those tokens on every single request
  • Most of the docs are irrelevant to any given question

RAG solves this by retrieving only the relevant documents for each query.

How RAG Works

The RAG pipeline has two phases: indexing (done once, ahead of time) and querying (done for each user request).

Phase 1: Indexing

  1. Load your documents (PDFs, markdown files, database records, web pages)
  2. Chunk them into smaller pieces (paragraphs, sections, or fixed-size segments)
  3. Embed each chunk using an embedding model to get a vector
  4. Store the vectors and their source text in a vector database

Documents → Chunks → Embeddings → Vector Database

Phase 2: Querying

  1. Embed the user’s question using the same embedding model
  2. Search the vector database for the most similar chunks
  3. Augment the prompt by injecting the retrieved chunks as context
  4. Generate a response using the LLM, grounded in the retrieved context

Question → Embedding → Vector Search → Top Chunks → LLM → Answer

A Minimal RAG Implementation

Let’s build a simple RAG system from scratch. We’ll use OpenAI for embeddings and completions, and an in-memory store for simplicity.

Step 1: Prepare and Chunk Documents

// Simulate loading documents — in practice, read from files or a database
const documents = [
  {
    title: "Password Reset",
    content:
      "To reset your password, go to the login page and click 'Forgot Password'. " +
      "Enter your email address and we'll send a reset link. The link expires after 24 hours. " +
      "If you don't receive the email, check your spam folder.",
  },
  {
    title: "Billing",
    content:
      "We bill monthly on the anniversary of your signup date. " +
      "You can view invoices in Settings > Billing. " +
      "To cancel, go to Settings > Billing > Cancel Subscription. " +
      "Refunds are available within 14 days of charge.",
  },
  {
    title: "API Rate Limits",
    content:
      "Free tier: 100 requests/minute. Pro tier: 1000 requests/minute. " +
      "Enterprise: custom limits. Rate limit headers are included in every response. " +
      "If you exceed your limit, you'll receive a 429 status code.",
  },
];

For this example, each document is small enough to be a single chunk. In practice, you’d split longer documents — we’ll cover chunking strategies in Building a RAG Pipeline.

Step 2: Build the Index

import OpenAI from "openai";

const client = new OpenAI();

async function embed(text) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// Build the index
const index = await Promise.all(
  documents.map(async (doc) => ({
    text: doc.content,
    title: doc.title,
    vector: await embed(doc.content),
  }))
);

Step 3: Search for Relevant Chunks

function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

async function retrieve(query, topK = 2) {
  const queryVec = await embed(query);
  return index
    .map((doc) => ({ ...doc, score: cosineSimilarity(queryVec, doc.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

Step 4: Generate an Answer

async function rag(question) {
  const chunks = await retrieve(question);

  const context = chunks.map((c) => `[${c.title}]\n${c.text}`).join("\n\n");

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "Answer the user's question using ONLY the provided context. " +
          "If the context doesn't contain the answer, say so. " +
          "Cite which document you're referencing.",
      },
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${question}`,
      },
    ],
  });

  return response.choices[0].message.content;
}

Try It Out

console.log(await rag("How do I reset my password?"));
// → "To reset your password, go to the login page and click 'Forgot Password'..."

console.log(await rag("What happens if I exceed the rate limit?"));
// → "You'll receive a 429 status code. Free tier allows 100 requests/minute..."

console.log(await rag("What's the weather like today?"));
// → "The provided context doesn't contain information about the weather."

The model answers from the retrieved documents, not from its general knowledge. And when the answer isn’t in the context, it says so instead of hallucinating.

Why RAG Works

RAG is effective because it combines the strengths of two systems:

  • Retrieval (vector search) is good at finding relevant information from large datasets quickly and cheaply
  • Generation (LLM) is good at synthesizing information, understanding nuance, and producing natural language answers

Neither system alone is sufficient. Search without generation gives you a list of documents — the user has to read and synthesize them. Generation without retrieval gives you fluent but potentially hallucinated answers.

Key Design Decisions

Chunking Strategy

How you split documents into chunks significantly affects retrieval quality:

  • Too small (individual sentences) — Loses context. The chunk might not contain enough information to be useful.
  • Too large (entire documents) — Dilutes relevance. The embedding represents the average meaning of the whole document, not the specific relevant section.
  • Sweet spot — Paragraphs or sections of 200–500 tokens tend to work well for most use cases.
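To size chunks in tokens without pulling in a tokenizer, a common rough heuristic for English text is about four characters per token. This is only an approximation, not a substitute for a real tokenizer like tiktoken:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// Good enough for sizing chunks; use a real tokenizer for exact counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// A ~1,200-character paragraph lands near the 200-500 token sweet spot.
estimateTokens("x".repeat(1200)); // → 300
```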

Number of Retrieved Chunks

More chunks means more context for the LLM, but also:

  • More tokens = higher cost and latency
  • Risk of including irrelevant chunks that confuse the model
  • Less room for the model’s response in the context window

Start with 3–5 chunks and adjust based on your results.
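One way to keep irrelevant chunks out of the prompt is to pair the topK cap with a minimum similarity score. This is a sketch, not part of the implementation above, and the 0.3 cutoff is a made-up starting point — useful thresholds vary by embedding model and corpus, so tune against your own queries:

```javascript
// Keep at most topK chunks, and drop anything below a minimum
// similarity score so weak matches never reach the prompt.
// Expects scored results like those produced by retrieve() above.
function filterChunks(scored, topK = 3, minScore = 0.3) {
  return scored
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

A nice side effect: if nothing survives the threshold, you can answer "I don't know" directly without calling the LLM at all.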

The System Prompt

The system prompt is critical for RAG quality. Key instructions:

  • Only use the provided context — Prevents the model from falling back to general knowledge
  • Say when you don’t know — Prevents hallucination when the context doesn’t contain the answer
  • Cite sources — Helps users verify the answer and builds trust

RAG vs. Stuffing Everything in the Prompt

Why not just put all your documents in the system prompt?

  • Token cost per query: RAG is low (only relevant chunks); full context is high (all documents on every request)
  • Scaling to large datasets: RAG handles millions of documents; full context is limited by the context window
  • Latency: RAG is moderate (retrieval + generation); full context is high (the model must process a huge prompt)
  • Accuracy: RAG stays high with a focused context; full context can degrade as the model loses focus in long inputs

For a handful of short documents, stuffing them in the prompt is simpler and works fine. For anything larger, RAG is the way to go.

Common Pitfalls

Poor chunking — If your chunks split important information across boundaries, the retriever might find a chunk that’s only half-useful. Overlap your chunks slightly (e.g., include the last sentence of the previous chunk) to mitigate this.
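Here's a minimal sketch of fixed-size chunking with overlap, assuming character counts as the size unit (a production version would split on sentence or section boundaries and measure tokens instead):

```javascript
// Split text into fixed-size chunks, overlapping each chunk with the
// tail of the previous one so information at chunk boundaries isn't
// lost. overlap must be smaller than chunkSize.
function chunkText(text, chunkSize = 500, overlap = 100) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap;
  }
  return chunks;
}
```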

Wrong embedding model — Use an embedding model suited to your content. Code-heavy content benefits from code-optimized embeddings. Multilingual content needs a multilingual model.

Ignoring retrieval quality — If the retriever doesn’t find the right chunks, the LLM can’t give a good answer. Test your retrieval separately — check that relevant chunks actually rank highest for your test queries.

Not handling “no answer” cases — Without explicit instructions, the model will try to answer every question, even when the context doesn’t contain the answer. Always instruct the model to say when it doesn’t know.

What’s Next?

This tutorial covered the concepts and a minimal implementation. In Building a RAG Pipeline, we’ll build a production-quality pipeline with proper document loading, chunking strategies, a vector database, and evaluation techniques.
