Skip to main content J Williams
Blog
8 min read
Blog

How Semantic Search Actually Works: A Tiny Search Engine You Can Run in This Tab

Building a real lexical search engine from scratch to understand exactly what neural embeddings add on top of it

8 min read

Every explanation of embeddings starts with the same hand-wave: text becomes a vector, similar meanings end up close together. The honest version is more interesting — and you can build the simpler half of it yourself, in about forty lines of JavaScript, and watch it work in real time.

Semantic Search

“Semantic search” gets explained so often as magic that it’s worth building the boring, fully transparent version first. Underneath every vector search system — whether it’s a simple keyword index or a billion-parameter embedding model — is the same two-step recipe: turn each piece of text into a list of numbers, and measure how close two lists of numbers are. The interesting engineering is entirely in step one. Step two is almost always the same handful of lines of arithmetic.

This post builds step one the simple way — counting words — so you can see exactly what that arithmetic is doing, and exactly where it breaks down.

The Geometry Underneath “Similarity”

The comparison at the heart of nearly every search and recommendation system is cosine similarity: treat each vector as an arrow from the origin, and measure the angle between them. Two arrows pointing in almost the same direction score close to 1, regardless of how long either arrow is. Two arrows at right angles score 0. Pointing in opposite directions scores -1.

Two vectors from the origin with the angle theta between them, illustrating that cosine similarity measures direction, not length. θ vector A vector B small θ → cos θ close to 1 → "similar"

The only question left is how you build the vector in the first place. The simplest defensible answer — and the one this post implements for real — is to count words.

Building a Real Vector: TF-IDF

Term frequency–inverse document frequency turns a document into a vector with one dimension per word that appears anywhere in your collection. Each entry is weighted by two competing forces: how often the word appears in this document (term frequency — appearing five times should count for more than once), divided down by how often that word appears across all documents (inverse document frequency — a word that’s in every single document, like “the”, carries almost no information about which document is relevant, so its weight gets pushed towards zero). A rare word repeated several times in one document ends up with a high weight in exactly the right dimension. That is, in its entirety, what TF-IDF is doing.

function idf(term, docs) {
  const df = docs.filter(doc => doc.includes(term)).length;
  return Math.log((docs.length + 1) / (df + 1)) + 1;   // smoothed inverse document frequency
}
function tfidfVector(tokens, vocabulary, docs) {
  const counts = {};
  tokens.forEach(t => counts[t] = (counts[t] || 0) + 1);
  const vector = {};
  vocabulary.forEach(term => {
    if (counts[term]) vector[term] = (counts[term] / tokens.length) * idf(term, docs);
  });
  return vector;   // sparse: only non-zero dimensions are stored
}

That’s the entire vectoriser. No model weights, no training, no GPU — just counting.

A Real Search Engine, Running Now

The box below implements exactly the two functions above, indexes eight real sentences drawn from topics this blog has actually covered, and searches them live as you type. Every score you see is computed in your browser at that instant — open your devtools and you’ll find the same tfidfVector and cosine similarity functions sitting in the page source.

Try “hexagonal grid” first — it scores the H3 sentence highly because both words appear there. Then try “six-sided shape”, which is a near-perfect paraphrase of “hexagonal” and shares precisely zero words with any document in the corpus. You’ll get nothing, or close to it. That empty result is the whole limitation of lexical search in one click: it has no concept that “six-sided shape” and “hexagonal” mean approximately the same thing. It only knows whether the same characters showed up.

What Neural Embeddings Actually Change

A neural embedding model — the kind behind OpenAI’s text-embedding-3-small, Cohere’s embed-v3, or any sentence-transformer — produces a vector through the same cosine-similarity comparison, but builds the vector completely differently. Instead of counting words, the text is passed through a network trained on enormous amounts of text to place semantically similar passages near each other in a fixed-size vector space — typically 1536 dimensions for text-embedding-3-small, regardless of whether the input is one word or one thousand. Crucially, that training objective is explicitly about meaning, not vocabulary overlap — which is precisely the dimension the TF-IDF demo above cannot see.

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_a = embed("a hexagonal grid for spatial indexing")
vec_b = embed("a six-sided tiling system for dividing up a map")
print(cosine_similarity(vec_a, vec_b))  # high — despite zero shared keywords

Run that, and the pair that defeated the in-browser demo scores highly — because the comparison is still cosine similarity, but the vectors now encode something closer to meaning than vocabulary. The arithmetic from the first section of this post hasn’t changed at all. Only the vector-building step has.

Once text is a vector, “find similar things” becomes “find nearby points,” and that’s the operation underneath far more than search boxes: retrieval-augmented generation retrieves the passages nearest to a question’s embedding before asking a model to answer; recommendation systems find items near a user’s taste vector; deduplication finds near-duplicate documents by distance rather than exact match. At production scale, computing cosine similarity against every document — what this post’s demo does, brute-force, against eight sentences — stops being fast enough, which is why real systems use approximate nearest-neighbour indexes like HNSW (used inside FAISS, pgvector, and most managed vector databases) to find near-matches in a fraction of the comparisons, trading a small amount of recall for a very large amount of speed.

The mechanism, though, is exactly what you watched run in your browser a moment ago: vectorise, compare by angle, rank by score. Everything else is making that one idea faster and the vectors smarter.