Building a RAG Pipeline
In Introduction to RAG, we built a minimal RAG system with an in-memory store and simple documents. Now let’s build something closer to production: loading real documents, chunking them intelligently, using a vector database, and evaluating the results.
We’ll build a documentation Q&A system — the most common RAG use case — using JavaScript, OpenAI, and Chroma as our vector database.
Architecture
Markdown files → Chunker → Embeddings → ChromaDB
↓
User question → Embedding → Vector search → Top chunks → LLM → Answer
Setup
mkdir rag-pipeline && cd rag-pipeline
npm init -y
npm install openai chromadb
Add "type": "module" to package.json.
Step 1: Load Documents
For this example, we’ll load markdown files from a directory. In a real application, you might load from a CMS, database, or web scraper.
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";
async function loadMarkdownFiles(dir) {
const files = await readdir(dir);
const docs = [];
for (const file of files.filter((f) => f.endsWith(".md"))) {
const content = await readFile(join(dir, file), "utf-8");
docs.push({ id: file, content });
}
return docs;
}
Step 2: Chunk Documents
Chunking is where most RAG pipelines succeed or fail. The goal is to create chunks that are:
- Small enough to be specific and relevant
- Large enough to contain complete, useful information
- Overlapping slightly so information at boundaries isn’t lost
Recursive Text Splitting
The most common strategy splits on natural boundaries (headings, paragraphs, sentences) and falls back to character-level splitting for very long sections. The simplified version below splits on paragraph boundaries and carries a word-level overlap between chunks:
function chunkDocument(doc, { maxChunkSize = 500, overlap = 50 } = {}) {
  const chunks = [];
  // Split on blank lines (paragraph boundaries) first
  const paragraphs = doc.content.split(/\n\n+/);
  let current = "";
  for (const para of paragraphs) {
    if (current.length + para.length > maxChunkSize && current.length > 0) {
      chunks.push({ id: `${doc.id}#${chunks.length}`, text: current.trim() });
      // Carry the last `overlap` words of the finished chunk into the next one.
      // Note: maxChunkSize is measured in characters, overlap in words.
      const words = current.split(/\s+/);
      current = words.slice(-overlap).join(" ") + "\n\n" + para;
    } else {
      current += (current ? "\n\n" : "") + para;
    }
  }
  if (current.trim()) {
    chunks.push({ id: `${doc.id}#${chunks.length}`, text: current.trim() });
  }
  return chunks;
}
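One gap in the chunker above: a single paragraph longer than maxChunkSize passes through unsplit. A minimal character-level fallback might look like this (`splitLongParagraph` is a hypothetical helper, not part of any library):

```javascript
// Character-level fallback: hard-split a paragraph that exceeds maxChunkSize,
// carrying `overlapChars` characters between consecutive pieces so no
// sentence is cut off without context in the next piece.
function splitLongParagraph(text, maxChunkSize = 500, overlapChars = 50) {
  const pieces = [];
  let start = 0;
  while (start < text.length) {
    pieces.push(text.slice(start, start + maxChunkSize));
    if (start + maxChunkSize >= text.length) break;
    // Step forward by less than maxChunkSize to create the overlap
    start += maxChunkSize - overlapChars;
  }
  return pieces;
}
```

You would call this inside the paragraph loop whenever `para.length > maxChunkSize`, pushing each piece as its own chunk.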
Chunking Tips
- Respect document structure. Split on headings and sections when possible, not in the middle of a paragraph.
- Include metadata. Attach the document title, section heading, or URL to each chunk. This helps the LLM cite sources and helps you debug retrieval issues.
- Tune chunk size for your content. Technical documentation often works well at 300–500 tokens. Conversational content might need larger chunks for context.
- Test your chunks. Read through a sample of chunks manually. If a chunk doesn’t make sense on its own, your chunking strategy needs work.
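The metadata tip can be sketched as a small wrapper over the chunker's output. The field names and the title extraction (first markdown H1) here are illustrative assumptions, not a fixed schema:

```javascript
// Attach document-level metadata to every chunk so retrieval results can be
// traced back to their source. Field names here are illustrative.
function withMetadata(chunks, doc) {
  // Pull the first markdown H1 as a title, if one exists
  const title = doc.content.match(/^#\s+(.+)/m)?.[1] ?? doc.id;
  return chunks.map((chunk) => ({
    ...chunk,
    metadata: { source: doc.id, title },
  }));
}
```

Passing these objects' `metadata` through to the vector store at index time is what makes metadata filtering possible at query time.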
Step 3: Embed and Store in ChromaDB
Chroma is an open-source vector database that’s easy to set up for development. The Python client can run in-process, but the JavaScript client talks to a Chroma server over HTTP (start one locally with `chroma run`). Chroma can also generate embeddings for you if you configure an embedding function, but we’ll use OpenAI embeddings for consistency with the rest of this series.
import OpenAI from "openai";
import { ChromaClient } from "chromadb";
const openai = new OpenAI();
const chroma = new ChromaClient();
async function embedTexts(texts) {
const res = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return res.data.map((d) => d.embedding);
}
async function buildIndex(chunks) {
const collection = await chroma.getOrCreateCollection({ name: "docs" });
// Process in batches to stay within API limits
const batchSize = 50;
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const embeddings = await embedTexts(batch.map((c) => c.text));
await collection.add({
ids: batch.map((c) => c.id),
documents: batch.map((c) => c.text),
embeddings,
});
console.log(`Indexed ${Math.min(i + batchSize, chunks.length)}/${chunks.length} chunks`);
}
return collection;
}
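Embedding calls can fail transiently (rate limits, network blips), and a failed batch mid-index is annoying to recover from. A generic retry helper with exponential backoff — a sketch, not part of the OpenAI SDK — can wrap `embedTexts`:

```javascript
// Retry an async function with exponential backoff on failure.
// Useful around embedding calls, which can hit transient rate limits.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      // Wait 500ms, 1s, 2s, ... before the next attempt
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Inside `buildIndex`, the embedding call would then become `await withRetry(() => embedTexts(batch.map((c) => c.text)))`.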
Step 4: Retrieve Relevant Chunks
async function retrieve(collection, query, topK = 5) {
const [queryEmbedding] = await embedTexts([query]);
const results = await collection.query({
queryEmbeddings: [queryEmbedding],
nResults: topK,
});
return results.documents[0].map((text, i) => ({
text,
id: results.ids[0][i],
distance: results.distances[0][i],
}));
}
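Since each result carries a distance, you can drop weak matches before they reach the prompt rather than always passing the full top k. This helper and the 0.8 cutoff are illustrative; a useful threshold depends on your embedding model and data, so calibrate it against real queries:

```javascript
// Keep only chunks whose distance is below a cutoff — weak matches add
// noise to the prompt and invite the LLM to answer from irrelevant text.
// The default threshold is an illustrative starting point, not a recommendation.
function filterByDistance(chunks, maxDistance = 0.8) {
  return chunks.filter((c) => c.distance <= maxDistance);
}
```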
Step 5: Generate an Answer
async function answer(collection, question) {
const chunks = await retrieve(collection, question);
const context = chunks
.map((c, i) => `[Source ${i + 1}: ${c.id}]\n${c.text}`)
.join("\n\n---\n\n");
const response = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
messages: [
{
role: "system",
content: `You are a documentation assistant. Answer questions using ONLY the provided context.
Rules:
- If the context doesn't contain the answer, say "I couldn't find that in the documentation."
- Cite your sources using [Source N] references.
- Be concise and direct.`,
},
{
role: "user",
content: `Context:\n${context}\n\n---\n\nQuestion: ${question}`,
},
],
});
return {
answer: response.choices[0].message.content,
sources: chunks.map((c) => c.id),
};
}
Putting It All Together
// Index
const docs = await loadMarkdownFiles("./docs");
const chunks = docs.flatMap((doc) => chunkDocument(doc));
console.log(`Created ${chunks.length} chunks from ${docs.length} documents`);
const collection = await buildIndex(chunks);
// Query
const result = await answer(collection, "How do I configure authentication?");
console.log(result.answer);
console.log("Sources:", result.sources);
Improving Retrieval Quality
The basic pipeline above works, but there are several techniques to improve results.
Hybrid Search
Combine vector search (semantic) with keyword search (BM25) for better coverage. Vector search finds semantically similar content, while keyword search catches exact term matches that embeddings might miss.
Many vector databases support hybrid search natively. Chroma doesn’t fuse BM25 and vector scores for you, but you can narrow the candidate set with metadata filters (this assumes you passed a matching `metadatas` array to `collection.add` at index time):
const results = await collection.query({
queryEmbeddings: queryEmbedding,
nResults: 10,
where: { category: "authentication" }, // metadata filter
});
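If your database won’t merge keyword and vector results for you, you can run both searches separately and fuse the ranked id lists yourself. Reciprocal rank fusion is a simple, widely used recipe; this function is a sketch, and k = 60 is the conventional constant from the original RRF formulation:

```javascript
// Reciprocal rank fusion: merge two ranked lists of ids.
// Each id earns 1 / (k + rank) per list it appears in; ids found by both
// searches accumulate score from both, so they tend to rise to the top.
function reciprocalRankFusion(vectorIds, keywordIds, k = 60) {
  const scores = new Map();
  for (const ids of [vectorIds, keywordIds]) {
    ids.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by combined score, highest first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Here `vectorIds` would come from the Chroma query and `keywordIds` from whatever BM25 implementation you pair with it.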
Query Transformation
Sometimes the user’s question isn’t a good search query. You can use the LLM to rewrite it:
async function rewriteQuery(question) {
const res = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
messages: [
{
role: "system",
content:
"Rewrite this question as a search query optimized for finding relevant documentation. " +
"Return only the search query, nothing else.",
},
{ role: "user", content: question },
],
});
return res.choices[0].message.content;
}
“Why isn’t my login working?” becomes something like “authentication login troubleshooting error” — a much better search query.
Re-ranking
Retrieve more chunks than you need, then use a second pass to re-rank them by relevance:
async function rerankChunks(question, chunks) {
const res = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
response_format: { type: "json_object" },
messages: [
{
role: "system",
content:
'Given a question and document chunks, return a JSON object with a "rankings" array ' +
"of chunk indices sorted by relevance (most relevant first). Only include relevant chunks.",
},
{
role: "user",
content: `Question: ${question}\n\nChunks:\n${chunks.map((c, i) => `[${i}] ${c.text}`).join("\n\n")}`,
},
],
});
const { rankings } = JSON.parse(res.choices[0].message.content);
// Guard against out-of-range or malformed indices the model might return
return rankings
.filter((i) => Number.isInteger(i) && i >= 0 && i < chunks.length)
.map((i) => chunks[i]);
}
Evaluating Your RAG Pipeline
You can’t improve what you don’t measure. Here’s a simple evaluation framework:
Create a Test Set
Write 10–20 question-answer pairs based on your documents:
const testCases = [
{
question: "How do I reset my password?",
expectedAnswer: "Go to login page, click Forgot Password, enter email",
relevantDoc: "auth.md",
},
// ... more test cases
];
Measure Retrieval Quality
For each test question, check if the relevant document appears in the retrieved chunks:
async function evaluateRetrieval(collection, testCases) {
let hits = 0;
for (const tc of testCases) {
const chunks = await retrieve(collection, tc.question);
if (chunks.some((c) => c.id.startsWith(tc.relevantDoc))) hits++;
}
console.log(`Retrieval accuracy: ${hits}/${testCases.length}`);
}
If retrieval accuracy is low, focus on improving chunking, embeddings, or adding hybrid search before tuning the generation step.
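Hit rate counts a relevant document anywhere in the top k as a success. If you also care where it ranks, mean reciprocal rank is a simple upgrade. This sketch assumes each test case has been augmented with the list of retrieved chunk ids:

```javascript
// Mean reciprocal rank: credit 1/rank of the first relevant chunk per query,
// so a hit at position 1 counts more than a hit at position 5.
// Expects cases like { retrievedIds: [...], relevantDoc: "auth.md" }.
function meanReciprocalRank(cases) {
  const total = cases.reduce((sum, c) => {
    const rank = c.retrievedIds.findIndex((id) => id.startsWith(c.relevantDoc));
    return sum + (rank === -1 ? 0 : 1 / (rank + 1)); // miss scores 0
  }, 0);
  return total / cases.length;
}
```

A falling MRR with a steady hit rate tells you the relevant chunks are still retrieved but slipping down the ranking — a signal to try re-ranking.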
Measure Answer Quality
For answer quality, you can use an LLM as a judge:
async function gradeAnswer(question, expected, actual) {
const res = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0,
response_format: { type: "json_object" },
messages: [
{
role: "system",
content:
"Grade whether the actual answer correctly addresses the question based on the expected answer. " +
'Return JSON: {"correct": true/false, "reason": "brief explanation"}',
},
{
role: "user",
content: `Question: ${question}\nExpected: ${expected}\nActual: ${actual}`,
},
],
});
return JSON.parse(res.choices[0].message.content);
}
What’s Next?
You now have a working RAG pipeline with document loading, chunking, vector storage, retrieval, and generation. For production deployments, you’ll want a dedicated vector database — we’ll cover Pinecone, pgvector, and other options in an upcoming tutorial on production-ready vector storage.