Building a RAG Pipeline
In Introduction to RAG, we built a minimal RAG system with an in-memory store and simple documents. Now let’s build something closer to production: loading real documents, chunking them intelligently, using a vector database, and evaluating the results.
We’ll build a documentation Q&A system — the most common RAG use case — using JavaScript, OpenAI, and Chroma as our vector database.
Architecture
Markdown files → Chunker → Embeddings → ChromaDB
↓
User question → Embedding → Vector search → Top chunks → LLM → Answer
Setup
mkdir rag-pipeline && cd rag-pipeline
npm init -y
npm install openai chromadb
Add "type": "module" to package.json.
Step 1: Load Documents
For this example, we’ll load markdown files from a directory. In a real application, you might load from a CMS, database, or web scraper.
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";
async function loadMarkdownFiles(dir) {
const files = await readdir(dir);
const docs = [];
for (const file of files.filter((f) => f.endsWith(".md"))) {
const content = await readFile(join(dir, file), "utf-8");
docs.push({ id: file, content });
}
return docs;
}
Step 2: Chunk Documents
Chunking is where most RAG pipelines succeed or fail. The goal is to create chunks that are:
- Small enough to be specific and relevant
- Large enough to contain complete, useful information
- Overlapping slightly so information at boundaries isn’t lost
Recursive Text Splitting
The most common strategy splits on natural boundaries (headings, paragraphs, sentences) and falls back to character-level splitting for very long sections. The simplified version below splits on paragraph boundaries and carries a word-level overlap between chunks:
function chunkDocument(doc, { maxChunkSize = 500, overlap = 50 } = {}) {
  const chunks = [];
  // Split on blank lines (paragraph boundaries) first
  const paragraphs = doc.content.split(/\n\n+/);
  let current = "";
  for (const para of paragraphs) {
    if (current.length + para.length > maxChunkSize && current.length > 0) {
      chunks.push({ id: `${doc.id}#${chunks.length}`, text: current.trim() });
      // Carry the last `overlap` words of the finished chunk into the next one.
      // Note: maxChunkSize is measured in characters, overlap in words.
      const words = current.split(/\s+/);
      current = words.slice(-overlap).join(" ") + "\n\n" + para;
    } else {
      current += (current ? "\n\n" : "") + para;
    }
  }
  if (current.trim()) {
    chunks.push({ id: `${doc.id}#${chunks.length}`, text: current.trim() });
  }
  return chunks;
}
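One gap in the chunker above: a single paragraph longer than maxChunkSize passes through unsplit. A minimal character-level fallback might look like this (`splitLongParagraph` is a hypothetical helper, not part of any library):

```javascript
// Character-level fallback: hard-split a paragraph that exceeds maxChunkSize,
// carrying `overlapChars` characters between consecutive pieces so no
// sentence is cut off without context in the next piece.
function splitLongParagraph(text, maxChunkSize = 500, overlapChars = 50) {
  const pieces = [];
  let start = 0;
  while (start < text.length) {
    pieces.push(text.slice(start, start + maxChunkSize));
    if (start + maxChunkSize >= text.length) break;
    // Step forward by less than maxChunkSize to create the overlap
    start += maxChunkSize - overlapChars;
  }
  return pieces;
}
```

You would call this inside the paragraph loop whenever `para.length > maxChunkSize`, pushing each piece as its own chunk.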
Chunking Tips
- Respect document structure. Split on headings and sections when possible, not in the middle of a paragraph.
- Include metadata. Attach the document title, section heading, or URL to each chunk. This helps the LLM cite sources and helps you debug retrieval issues.
- Tune chunk size for your content. Technical documentation often works well at 300–500 tokens. Conversational content might need larger chunks for context.
- Test your chunks. Read through a sample of chunks manually. If a chunk doesn’t make sense on its own, your chunking strategy needs work.
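The metadata tip can be sketched as a small wrapper over the chunker's output. The field names and the title extraction (first markdown H1) here are illustrative assumptions, not a fixed schema:

```javascript
// Attach document-level metadata to every chunk so retrieval results can be
// traced back to their source. Field names here are illustrative.
function withMetadata(chunks, doc) {
  // Pull the first markdown H1 as a title, if one exists
  const title = doc.content.match(/^#\s+(.+)/m)?.[1] ?? doc.id;
  return chunks.map((chunk) => ({
    ...chunk,
    metadata: { source: doc.id, title },
  }));
}
```

Passing these objects' `metadata` through to the vector store at index time is what makes metadata filtering possible at query time.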
Step 3: Embed and Store in ChromaDB
Chroma is an open-source vector database that’s easy to set up for development. The Python client can run in-process, but the JavaScript client talks to a Chroma server over HTTP (start one locally with `chroma run`). Chroma can also generate embeddings for you if you configure an embedding function, but we’ll use OpenAI embeddings for consistency with the rest of this series.
import OpenAI from "openai";
import { ChromaClient } from "chromadb";
const openai = new OpenAI();
const chroma = new ChromaClient();
async function embedTexts(texts) {
const res = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return res.data.map((d) => d.embedding);
}
async function buildIndex(chunks) {
const collection = await chroma.getOrCreateCollection({ name: "docs" });
// Process in batches to stay within API limits
const batchSize = 50;
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const embeddings = await embedTexts(batch.map((c) => c.text));
await collection.add({
ids: batch.map((c) => c.id),
documents: batch.map((c) => c.text),
embeddings,
});
console.log(`Indexed ${Math.min(i + batchSize, chunks.length)}/${chunks.length} chunks`);
}
return collection;
}
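Embedding calls can fail transiently (rate limits, network blips), and a failed batch mid-index is annoying to recover from. A generic retry helper with exponential backoff — a sketch, not part of the OpenAI SDK — can wrap `embedTexts`:

```javascript
// Retry an async function with exponential backoff on failure.
// Useful around embedding calls, which can hit transient rate limits.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      // Wait 500ms, 1s, 2s, ... before the next attempt
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Inside `buildIndex`, the embedding call would then become `await withRetry(() => embedTexts(batch.map((c) => c.text)))`.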
Step 4: Retrieve Relevant Chunks
async function retrieve(collection, query, topK = 5) {
const [queryEmbedding] = await embedTexts([query]);
const results = await collection.query({
queryEmbeddings: [queryEmbedding],
nResults: topK,
});
return results.documents[0].map((text, i) => ({
text,
id: results.ids[0][i],
distance: results.distances[0][i],
}));
}
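Since each result carries a distance, you can drop weak matches before they reach the prompt rather than always passing the full top k. This helper and the 0.8 cutoff are illustrative; a useful threshold depends on your embedding model and data, so calibrate it against real queries:

```javascript
// Keep only chunks whose distance is below a cutoff — weak matches add
// noise to the prompt and invite the LLM to answer from irrelevant text.
// The default threshold is an illustrative starting point, not a recommendation.
function filterByDistance(chunks, maxDistance = 0.8) {
  return chunks.filter((c) => c.distance <= maxDistance);
}
```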
Step 5: Generate an Answer
async function answer(collection, question) {
const chunks = await retrieve(collection, question);
const context = chunks
.map((c, i) => `[Source ${i + 1}: ${c.id}]\n${c.text}`)
.join("\n\n---\n\n");
const response = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
messages: [
{
role: "system",
content: `You are a documentation assistant. Answer questions using ONLY the provided context.
Rules:
- If the context doesn't contain the answer, say "I couldn't find that in the documentation."
- Cite your sources using [Source N] references.
- Be concise and direct.`,
},
{
role: "user",
content: `Context:\n${context}\n\n---\n\nQuestion: ${question}`,
},
],
});
return {
answer: response.choices[0].message.content,
sources: chunks.map((c) => c.id),
};
}
Putting It All Together
// Index
const docs = await loadMarkdownFiles("./docs");
const chunks = docs.flatMap((doc) => chunkDocument(doc));
console.log(`Created ${chunks.length} chunks from ${docs.length} documents`);
const collection = await buildIndex(chunks);
// Query
const result = await answer(collection, "How do I configure authentication?");
console.log(result.answer);
console.log("Sources:", result.sources);
Improving Retrieval Quality
The basic pipeline above works, but there are several techniques to improve results.
Hybrid Search
Combine vector search (semantic) with keyword search (BM25) for better coverage. Vector search finds semantically similar content, while keyword search catches exact term matches that embeddings might miss.
Many vector databases support hybrid search natively. Chroma doesn’t fuse BM25 and vector scores for you, but you can narrow the candidate set with metadata filters (this assumes you passed a matching `metadatas` array to `collection.add` at index time):
const results = await collection.query({
queryEmbeddings: queryEmbedding,
nResults: 10,
where: { category: "authentication" }, // metadata filter
});
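If your database won’t merge keyword and vector results for you, you can run both searches separately and fuse the ranked id lists yourself. Reciprocal rank fusion is a simple, widely used recipe; this function is a sketch, and k = 60 is the conventional constant from the original RRF formulation:

```javascript
// Reciprocal rank fusion: merge two ranked lists of ids.
// Each id earns 1 / (k + rank) per list it appears in; ids found by both
// searches accumulate score from both, so they tend to rise to the top.
function reciprocalRankFusion(vectorIds, keywordIds, k = 60) {
  const scores = new Map();
  for (const ids of [vectorIds, keywordIds]) {
    ids.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by combined score, highest first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Here `vectorIds` would come from the Chroma query and `keywordIds` from whatever BM25 implementation you pair with it.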
Query Transformation
Sometimes the user’s question isn’t a good search query. You can use the LLM to rewrite it:
async function rewriteQuery(question) {
const res = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
messages: [
{
role: "system",
content:
"Rewrite this question as a search query optimized for finding relevant documentation. " +
"Return only the search query, nothing else.",
},
{ role: "user", content: question },
],
});
return res.choices[0].message.content;
}
“Why isn’t my login working?” becomes something like “authentication login troubleshooting error” — a much better search query.
Re-ranking
Retrieve more chunks than you need, then use a second pass to re-rank them by relevance:
async function rerankChunks(question, chunks) {
const res = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
response_format: { type: "json_object" },
messages: [
{
role: "system",
content:
'Given a question and document chunks, return a JSON object with a "rankings" array ' +
"of chunk indices sorted by relevance (most relevant first). Only include relevant chunks.",
},
{
role: "user",
content: `Question: ${question}\n\nChunks:\n${chunks.map((c, i) => `[${i}] ${c.text}`).join("\n\n")}`,
},
],
});
const { rankings } = JSON.parse(res.choices[0].message.content);
// Guard against out-of-range or malformed indices the model might return
return rankings
.filter((i) => Number.isInteger(i) && i >= 0 && i < chunks.length)
.map((i) => chunks[i]);
}
Evaluating Your RAG Pipeline
You can’t improve what you don’t measure. Here’s a simple evaluation framework:
Create a Test Set
Write 10–20 question-answer pairs based on your documents:
const testCases = [
{
question: "How do I reset my password?",
expectedAnswer: "Go to login page, click Forgot Password, enter email",
relevantDoc: "auth.md",
},
// ... more test cases
];
Measure Retrieval Quality
For each test question, check if the relevant document appears in the retrieved chunks:
async function evaluateRetrieval(collection, testCases) {
let hits = 0;
for (const tc of testCases) {
const chunks = await retrieve(collection, tc.question);
if (chunks.some((c) => c.id.startsWith(tc.relevantDoc))) hits++;
}
console.log(`Retrieval accuracy: ${hits}/${testCases.length}`);
}
If retrieval accuracy is low, focus on improving chunking, embeddings, or adding hybrid search before tuning the generation step.
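Hit rate counts a relevant document anywhere in the top k as a success. If you also care where it ranks, mean reciprocal rank is a simple upgrade. This sketch assumes each test case has been augmented with the list of retrieved chunk ids:

```javascript
// Mean reciprocal rank: credit 1/rank of the first relevant chunk per query,
// so a hit at position 1 counts more than a hit at position 5.
// Expects cases like { retrievedIds: [...], relevantDoc: "auth.md" }.
function meanReciprocalRank(cases) {
  const total = cases.reduce((sum, c) => {
    const rank = c.retrievedIds.findIndex((id) => id.startsWith(c.relevantDoc));
    return sum + (rank === -1 ? 0 : 1 / (rank + 1)); // miss scores 0
  }, 0);
  return total / cases.length;
}
```

A falling MRR with a steady hit rate tells you the relevant chunks are still retrieved but slipping down the ranking — a signal to try re-ranking.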
Measure Answer Quality
For answer quality, you can use an LLM as a judge:
async function gradeAnswer(question, expected, actual) {
const res = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
response_format: { type: "json_object" },
messages: [
{
role: "system",
content:
"Grade whether the actual answer correctly addresses the question based on the expected answer. " +
'Return JSON: {"correct": true/false, "reason": "brief explanation"}',
},
{
role: "user",
content: `Question: ${question}\nExpected: ${expected}\nActual: ${actual}`,
},
],
});
return JSON.parse(res.choices[0].message.content);
}
What’s Next?
You now have a working RAG pipeline with document loading, chunking, vector storage, retrieval, and generation. For production deployments, you’ll want a dedicated vector database — we’ll cover Pinecone, pgvector, and other options in an upcoming tutorial on production-ready vector storage.