Embeddings & Vector Search
So far in this series, we’ve focused on LLMs that generate text. But there’s another fundamental capability that powers many AI applications: embeddings. Embeddings let you convert text into numbers that capture meaning, making it possible to search, compare, and cluster content based on what it means rather than what words it contains.
This is the technology behind semantic search, recommendation systems, and RAG (Retrieval-Augmented Generation) — which we’ll cover later in this series. You should be familiar with How LLMs Work before reading this.
What Are Embeddings?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. The key insight is that texts with similar meanings end up with similar vectors, even if they use completely different words.
For example, these two sentences mean roughly the same thing:
- “The cat sat on the mat”
- “A feline rested on the rug”
A traditional keyword search would see almost no overlap — the only words they share are stopwords like “on” and “the”. But their embedding vectors would be very close together because they describe the same scene.
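You can check the keyword side of this claim in a few lines. The sketch below strips stopwords (using a tiny, illustrative list, not a standard one) and measures word overlap with Jaccard similarity; the helper names are invented for this example:

```javascript
// Strip stopwords and measure word overlap (Jaccard similarity)
// between two sentences — the crude core of keyword matching.
const STOPWORDS = new Set(["a", "an", "the", "on", "in", "of"]);

function contentWords(sentence) {
  return new Set(
    sentence
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => w && !STOPWORDS.has(w))
  );
}

function jaccard(a, b) {
  const intersection = [...a].filter((w) => b.has(w)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

const wordsA = contentWords("The cat sat on the mat"); // {cat, sat, mat}
const wordsB = contentWords("A feline rested on the rug"); // {feline, rested, rug}

console.log(jaccard(wordsA, wordsB)); // 0 — no content words in common
```

Once stopwords are removed, the two sentences share no words at all, so any bag-of-words scorer rates them completely dissimilar.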
How Embeddings Work
Embedding models are trained to map text into a high-dimensional space (typically 256–3072 dimensions) where distance corresponds to semantic similarity. During training, the model learns that:
- “happy” and “joyful” should be close together
- “happy” and “sad” should be far apart
- “Python programming” and “coding in Python” should be close
- “Python programming” and “python snake” should be farther apart
You don’t need to understand the math behind the training process to use embeddings effectively. What matters is the result: a function that takes text in and produces a vector out.
Generating Embeddings
Here’s how to generate embeddings using the OpenAI API:
```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: "How do I deploy a Node.js app to AWS?",
});

const vector = response.data[0].embedding;
console.log(vector.length); // 1536
console.log(vector.slice(0, 5)); // [0.0023, -0.0091, 0.0152, ...]
```
The result is an array of 1,536 floating-point numbers. By itself, a single vector isn’t useful — the power comes from comparing vectors.
Measuring Similarity
To determine how similar two pieces of text are, you compare their embedding vectors using a similarity metric. The most common is cosine similarity, which measures the angle between two vectors:
- 1.0 = identical meaning
- 0.0 = completely unrelated
- -1.0 = opposite meaning (rare in practice)
```javascript
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
```
Let’s see it in action:
```javascript
async function embed(text) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

const a = await embed("How do I deploy to AWS?");
const b = await embed("What's the process for deploying on Amazon Web Services?");
const c = await embed("What's a good recipe for chocolate cake?");

console.log(cosineSimilarity(a, b)); // ~0.92 (very similar)
console.log(cosineSimilarity(a, c)); // ~0.15 (unrelated)
```
Even though sentences A and B use different words, their embeddings are very close together because they mean the same thing.
Vector Search
Vector search (also called semantic search) uses embeddings to find content that’s semantically similar to a query, rather than matching keywords.
The basic workflow is:
- Index — Generate embeddings for all your documents and store them
- Query — Generate an embedding for the user’s search query
- Search — Find the stored embeddings closest to the query embedding
- Return — Return the corresponding documents
Here’s a simple in-memory implementation:
```javascript
// 1. Index: embed your documents
const docs = [
  "JavaScript arrays have methods like map, filter, and reduce",
  "AWS Lambda lets you run code without managing servers",
  "Git branches allow parallel development workflows",
  "CSS Grid is a two-dimensional layout system",
];

const index = await Promise.all(
  docs.map(async (text) => ({ text, vector: await embed(text) }))
);

// 2-3. Query and search
async function search(query, topK = 2) {
  const queryVec = await embed(query);
  return index
    .map((doc) => ({ text: doc.text, score: cosineSimilarity(queryVec, doc.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// 4. Return results
const results = await search("serverless functions");
console.log(results);
// [
//   { text: "AWS Lambda lets you run code without managing servers", score: 0.82 },
//   { text: "JavaScript arrays have methods like map, filter...", score: 0.31 }
// ]
```
The query “serverless functions” matched the Lambda document even though neither word appears in it. That’s the power of semantic search.
Vector Databases
The in-memory approach above works for small datasets, but real applications need a vector database — a database optimized for storing and searching high-dimensional vectors efficiently.
Popular options include:
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud service | Production apps, zero ops |
| pgvector | PostgreSQL extension | Teams already using Postgres |
| Chroma | Open source, embeddable | Prototyping, local development |
| Weaviate | Open source / cloud | Complex filtering + search |
| Qdrant | Open source / cloud | High performance, Rust-based |
Vector databases use specialized indexing algorithms (like HNSW — Hierarchical Navigable Small World graphs) to search millions of vectors in milliseconds, rather than comparing against every single vector.
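HNSW itself is well beyond the scope of this post, but its core idea (greedily walking a neighbor graph toward the query instead of scanning every vector) can be sketched in a few lines. This is a single-layer toy under simplifying assumptions, not real HNSW, which adds multiple layers and beam search to avoid getting stuck in local minima; all helper names here are invented for illustration:

```javascript
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// For each point, link it to its k nearest neighbors.
// This brute-force pass happens once, at index-build time.
function buildGraph(points, k = 3) {
  return points.map((p, i) =>
    points
      .map((q, j) => ({ j, d: euclidean(p, q) }))
      .filter(({ j }) => j !== i)
      .sort((a, b) => a.d - b.d)
      .slice(0, k)
      .map(({ j }) => j)
  );
}

// Greedy navigation: repeatedly hop to whichever neighbor is closer
// to the query, and stop when no neighbor improves. Each query touches
// only a handful of nodes instead of every stored vector.
function greedySearch(points, graph, query, start = 0) {
  let current = start;
  while (true) {
    let best = current;
    for (const n of graph[current]) {
      if (euclidean(points[n], query) < euclidean(points[best], query)) best = n;
    }
    if (best === current) return current;
    current = best;
  }
}
```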
We’ll cover vector databases in more depth in a dedicated tutorial later in this series.
Where Embeddings Are Used
Semantic Search
Replace keyword search with meaning-based search. Users find what they’re looking for even when they don’t use the exact right terms.
RAG (Retrieval-Augmented Generation)
This is the most important application for developers building with LLMs. RAG uses embeddings to find relevant documents, then feeds those documents to an LLM as context so it can answer questions about your specific data. We cover this in Introduction to RAG.
Recommendations
“Users who read this article also liked…” — embed all your content, then find items with similar vectors to what the user has already engaged with.
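One common way to implement this (a sketch, not the only approach) is to average the vectors of items the user engaged with into a "taste profile", then rank the rest of the catalog by similarity to that profile. The toy 3-dimensional vectors below stand in for real embeddings, and the helper names are invented:

```javascript
// Same formula as cosineSimilarity earlier in this post.
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Average the engaged-item vectors into a single profile vector.
function averageVector(vectors) {
  const mean = new Array(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < v.length; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}

// Rank the catalog by similarity to the user's profile.
function recommend(engagedVectors, catalog, topK = 2) {
  const profile = averageVector(engagedVectors);
  return catalog
    .map((item) => ({ ...item, score: cosineSim(profile, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```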
Clustering and Classification
Group similar documents together automatically, or classify new documents by finding which cluster they’re closest to.
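The classification half of this can be sketched as nearest-centroid assignment: average each labeled group's embeddings into a centroid, then give a new document the label of the closest centroid. The toy 2-dimensional vectors and helper names below are illustrative assumptions:

```javascript
// Same formula as cosineSimilarity earlier in this post.
function cosine(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Average a group's vectors into a single centroid.
function centroid(vectors) {
  const mean = new Array(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < v.length; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}

// Assign the label whose centroid is most similar to the new vector.
function classify(labeledGroups, vector) {
  let best = null;
  for (const [label, vectors] of Object.entries(labeledGroups)) {
    const score = cosine(centroid(vectors), vector);
    if (!best || score > best.score) best = { label, score };
  }
  return best.label;
}
```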
Anomaly Detection
If a new data point’s embedding is far from all existing embeddings, it might be an outlier worth investigating.
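A minimal version of this check: flag a new point as an outlier when its best similarity to any known embedding falls below a threshold. The 0.5 cutoff below is an arbitrary illustration, not a recommended value; in practice you would tune it on your own data:

```javascript
// Same formula as cosineSimilarity earlier in this post.
function cosSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// A point is anomalous if even its closest known neighbor isn't very similar.
function isAnomaly(knownVectors, vector, threshold = 0.5) {
  const best = Math.max(...knownVectors.map((k) => cosSim(k, vector)));
  return best < threshold;
}
```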
Choosing an Embedding Model
| Model | Dimensions | Provider | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | Good balance of cost and quality |
| text-embedding-3-large | 3072 | OpenAI | Higher quality, higher cost |
| Cohere embed-v3 | 1024 | Cohere | Strong multilingual support |
| Voyage 3 | 1024 | Voyage AI | Optimized for code and technical content |
| BGE / GTE | Varies | Open source | Free, run locally |
Key considerations:
- Dimensions — More dimensions can capture more nuance but use more storage and compute
- Cost — Embedding API calls are cheap (fractions of a cent per request) but add up at scale
- Consistency — Once you pick a model, stick with it. You can’t compare embeddings from different models — they live in different vector spaces
What’s Next?
Now that you understand embeddings and vector search, you have the building blocks for one of the most powerful patterns in GenAI: RAG. But before we get there, let’s step back and look at the big picture. In Fine-Tuning vs RAG vs Prompt Engineering, we’ll compare the three main approaches to customizing LLM behavior and help you decide which one to use for your use case.