Embeddings & Vector Search
So far in this series, we’ve focused on LLMs that generate text. But there’s another fundamental capability that powers many AI applications: embeddings. Embeddings let you convert text into numbers that capture meaning, making it possible to search, compare, and cluster content based on what it means rather than what words it contains.
This is the technology behind semantic search, recommendation systems, and RAG (Retrieval-Augmented Generation) — which we’ll cover later in this series. You should be familiar with How LLMs Work before reading this.
What Are Embeddings?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. The key insight is that texts with similar meanings end up with similar vectors, even if they use completely different words.
For example, these two sentences mean roughly the same thing:
- “The cat sat on the mat”
- “A feline rested on the rug”
A traditional keyword search would see almost no overlap — the only words they share are stopwords like “on” and “the”. But their embedding vectors would be very close together because they describe the same scene.
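You can check the keyword side of this claim in a few lines. The sketch below strips stopwords (using a tiny, illustrative list, not a standard one) and measures word overlap with Jaccard similarity; the helper names are invented for this example:

```javascript
// Strip stopwords and measure word overlap (Jaccard similarity)
// between two sentences — the crude core of keyword matching.
const STOPWORDS = new Set(["a", "an", "the", "on", "in", "of"]);

function contentWords(sentence) {
  return new Set(
    sentence
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => w && !STOPWORDS.has(w))
  );
}

function jaccard(a, b) {
  const intersection = [...a].filter((w) => b.has(w)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

const wordsA = contentWords("The cat sat on the mat"); // {cat, sat, mat}
const wordsB = contentWords("A feline rested on the rug"); // {feline, rested, rug}

console.log(jaccard(wordsA, wordsB)); // 0 — no content words in common
```

Once stopwords are removed, the two sentences share no words at all, so any bag-of-words scorer rates them completely dissimilar.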
How Embeddings Work
Embedding models are trained to map text into a high-dimensional space (typically 256–3072 dimensions) where distance corresponds to semantic similarity. During training, the model learns that:
- “happy” and “joyful” should be close together
- “happy” and “sad” should be far apart
- “Python programming” and “coding in Python” should be close
- “Python programming” and “python snake” should be farther apart
You don’t need to understand the math behind the training process to use embeddings effectively. What matters is the result: a function that takes text in and produces a vector out.
Generating Embeddings
Here’s how to generate embeddings using the OpenAI API:
```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: "How do I deploy a Node.js app to AWS?",
});

const vector = response.data[0].embedding;
console.log(vector.length); // 1536
console.log(vector.slice(0, 5)); // [0.0023, -0.0091, 0.0152, ...]
```
The result is an array of 1,536 floating-point numbers. By itself, a single vector isn’t useful — the power comes from comparing vectors.
Measuring Similarity
To determine how similar two pieces of text are, you compare their embedding vectors using a similarity metric. The most common is cosine similarity, which measures the angle between two vectors:
- 1.0 = identical meaning
- 0.0 = completely unrelated
- -1.0 = opposite meaning (rare in practice)
```javascript
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
```
Let’s see it in action:
```javascript
async function embed(text) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

const a = await embed("How do I deploy to AWS?");
const b = await embed("What's the process for deploying on Amazon Web Services?");
const c = await embed("What's a good recipe for chocolate cake?");

console.log(cosineSimilarity(a, b)); // ~0.92 (very similar)
console.log(cosineSimilarity(a, c)); // ~0.15 (unrelated)
```
Even though sentences A and B use different words, their embeddings are very close together because they mean the same thing.
Vector Search
Vector search (also called semantic search) uses embeddings to find content that’s semantically similar to a query, rather than matching keywords.
The basic workflow is:
- Index — Generate embeddings for all your documents and store them
- Query — Generate an embedding for the user’s search query
- Search — Find the stored embeddings closest to the query embedding
- Return — Return the corresponding documents
Here’s a simple in-memory implementation:
```javascript
// 1. Index: embed your documents
const docs = [
  "JavaScript arrays have methods like map, filter, and reduce",
  "AWS Lambda lets you run code without managing servers",
  "Git branches allow parallel development workflows",
  "CSS Grid is a two-dimensional layout system",
];

const index = await Promise.all(
  docs.map(async (text) => ({ text, vector: await embed(text) }))
);

// 2-3. Query and search
async function search(query, topK = 2) {
  const queryVec = await embed(query);
  return index
    .map((doc) => ({ text: doc.text, score: cosineSimilarity(queryVec, doc.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// 4. Return results
const results = await search("serverless functions");
console.log(results);
// [
//   { text: "AWS Lambda lets you run code without managing servers", score: 0.82 },
//   { text: "JavaScript arrays have methods like map, filter...", score: 0.31 }
// ]
```
The query “serverless functions” matched the Lambda document even though neither word appears in it. That’s the power of semantic search.
Vector Databases
The in-memory approach above works for small datasets, but real applications need a vector database — a database optimized for storing and searching high-dimensional vectors efficiently.
Popular options include:
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud service | Production apps, zero ops |
| pgvector | PostgreSQL extension | Teams already using Postgres |
| Chroma | Open source, embeddable | Prototyping, local development |
| Weaviate | Open source / cloud | Complex filtering + search |
| Qdrant | Open source / cloud | High performance, Rust-based |
Vector databases use specialized indexing algorithms (like HNSW — Hierarchical Navigable Small World graphs) to search millions of vectors in milliseconds, rather than comparing against every single vector.
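HNSW itself is well beyond the scope of this post, but its core idea (greedily walking a neighbor graph toward the query instead of scanning every vector) can be sketched in a few lines. This is a single-layer toy under simplifying assumptions, not real HNSW, which adds multiple layers and beam search to avoid getting stuck in local minima; all helper names here are invented for illustration:

```javascript
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// For each point, link it to its k nearest neighbors.
// This brute-force pass happens once, at index-build time.
function buildGraph(points, k = 3) {
  return points.map((p, i) =>
    points
      .map((q, j) => ({ j, d: euclidean(p, q) }))
      .filter(({ j }) => j !== i)
      .sort((a, b) => a.d - b.d)
      .slice(0, k)
      .map(({ j }) => j)
  );
}

// Greedy navigation: repeatedly hop to whichever neighbor is closer
// to the query, and stop when no neighbor improves. Each query touches
// only a handful of nodes instead of every stored vector.
function greedySearch(points, graph, query, start = 0) {
  let current = start;
  while (true) {
    let best = current;
    for (const n of graph[current]) {
      if (euclidean(points[n], query) < euclidean(points[best], query)) best = n;
    }
    if (best === current) return current;
    current = best;
  }
}
```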
We’ll cover vector databases in more depth in a dedicated tutorial later in this series.
Where Embeddings Are Used
Semantic Search
Replace keyword search with meaning-based search. Users find what they’re looking for even when they don’t use the exact right terms.
RAG (Retrieval-Augmented Generation)
This is the most important application for developers building with LLMs. RAG uses embeddings to find relevant documents, then feeds those documents to an LLM as context so it can answer questions about your specific data. We cover this in Introduction to RAG.
Recommendations
“Users who read this article also liked…” — embed all your content, then find items with similar vectors to what the user has already engaged with.
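One common way to implement this (a sketch, not the only approach) is to average the vectors of items the user engaged with into a "taste profile", then rank the rest of the catalog by similarity to that profile. The toy 3-dimensional vectors below stand in for real embeddings, and the helper names are invented:

```javascript
// Same formula as cosineSimilarity earlier in this post.
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Average the engaged-item vectors into a single profile vector.
function averageVector(vectors) {
  const mean = new Array(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < v.length; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}

// Rank the catalog by similarity to the user's profile.
function recommend(engagedVectors, catalog, topK = 2) {
  const profile = averageVector(engagedVectors);
  return catalog
    .map((item) => ({ ...item, score: cosineSim(profile, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```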
Clustering and Classification
Group similar documents together automatically, or classify new documents by finding which cluster they’re closest to.
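The classification half of this can be sketched as nearest-centroid assignment: average each labeled group's embeddings into a centroid, then give a new document the label of the closest centroid. The toy 2-dimensional vectors and helper names below are illustrative assumptions:

```javascript
// Same formula as cosineSimilarity earlier in this post.
function cosine(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Average a group's vectors into a single centroid.
function centroid(vectors) {
  const mean = new Array(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < v.length; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}

// Assign the label whose centroid is most similar to the new vector.
function classify(labeledGroups, vector) {
  let best = null;
  for (const [label, vectors] of Object.entries(labeledGroups)) {
    const score = cosine(centroid(vectors), vector);
    if (!best || score > best.score) best = { label, score };
  }
  return best.label;
}
```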
Anomaly Detection
If a new data point’s embedding is far from all existing embeddings, it might be an outlier worth investigating.
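A minimal version of this check: flag a new point as an outlier when its best similarity to any known embedding falls below a threshold. The 0.5 cutoff below is an arbitrary illustration, not a recommended value; in practice you would tune it on your own data:

```javascript
// Same formula as cosineSimilarity earlier in this post.
function cosSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// A point is anomalous if even its closest known neighbor isn't very similar.
function isAnomaly(knownVectors, vector, threshold = 0.5) {
  const best = Math.max(...knownVectors.map((k) => cosSim(k, vector)));
  return best < threshold;
}
```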
Choosing an Embedding Model
| Model | Dimensions | Provider | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | Good balance of cost and quality |
| text-embedding-3-large | 3072 | OpenAI | Higher quality, higher cost |
| Cohere embed-v3 | 1024 | Cohere | Strong multilingual support |
| Voyage 3 | 1024 | Voyage AI | Optimized for code and technical content |
| BGE / GTE | Varies | Open source | Free, run locally |
Key considerations:
- Dimensions — More dimensions can capture more nuance but use more storage and compute
- Cost — Embedding API calls are cheap (fractions of a cent per request) but add up at scale
- Consistency — Once you pick a model, stick with it. You can’t compare embeddings from different models — they live in different vector spaces
What’s Next?
Now that you understand embeddings and vector search, you have the building blocks for one of the most powerful patterns in GenAI: RAG. But before we get there, let’s step back and look at the big picture. In Fine-Tuning vs RAG vs Prompt Engineering, we’ll compare the three main approaches to customizing LLM behavior and help you decide which one to use for your use case.