Build Your Own Semantic Search: A Local Embeddings Engine on Apple Silicon

Semantic search has been sold to you as a product. Sign up, get an API key, push your documents to a cloud index, pay per query and per stored vector forever. The pitch is that the math is hard and the infrastructure is harder, so you should rent both.

It's mostly marketing. Semantic search is three moving parts: a model that turns text into a list of numbers, a place to keep those numbers, and a function that measures how close two lists are. None of the three requires a vendor, and all of them run on a Mac you already own. I run search over my own blog and catalog this way — locally, at zero marginal cost, in about fifteen lines of code.

Here's how the stack actually works, where the real boundaries are, and the point past which you'd genuinely want something heavier.

What You're Actually Paying Pinecone For

A managed vector database bundles three things, and it helps to unbundle them before you decide they're worth renting.

The first is an embedding model — the thing that reads "tunnel won't start" and "ingress rule ordering" and produces two number-lists that land near each other because the meaning is similar. Most managed search products don't even sell you this; they assume you call a cloud model and pay that bill separately. So the embedding step is a cost you carry regardless, often twice.

The second is a store — somewhere to keep the vectors so you don't recompute them on every search. That's a database. You have several already.

The third is a similarity function plus an index — the cosine math, wrapped in an approximate-nearest-neighbor structure (HNSW, IVF) so it stays fast at scale. The index is the only genuinely hard engineering in the product, and you don't need it until you have a lot of vectors. That's the part the pricing page is quietly built around.

Strip the bundle down and what's left is: embed, write rows to a table, loop. The managed product is renting you a solution to a scale problem you probably don't have yet.

The Local Stack: MiniLM as an On-Demand Sidecar

My embedding model is sentence-transformers/all-MiniLM-L6-v2. It produces 384-dimensional vectors, it's small enough to live in memory without a thought, and on Apple Silicon it runs around 80 items per second on CPU — no GPU cloud, no inference bill. For a blog, a catalog, or a wiki, 384 dimensions is plenty; reaching for a 1,536-dim model is mostly vanity, and the smaller one stores in a quarter of the space.
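To put numbers on the storage claim, here is a back-of-the-envelope sketch. It assumes each dimension is stored as a 4-byte float32 and ignores SQLite's per-row overhead; the function name and the 100,000-row figure are illustrative only.

```python
FLOAT32_BYTES = 4  # assumption: every dimension is stored as a 4-byte float32


def vector_storage_bytes(n_vectors: int, dims: int) -> int:
    """Raw bytes needed for n_vectors embeddings, excluding database overhead."""
    return n_vectors * dims * FLOAT32_BYTES


small = vector_storage_bytes(100_000, 384)    # MiniLM-sized vectors
large = vector_storage_bytes(100_000, 1536)   # a 1,536-dim model

print(f"384-dim:   {small / 1e6:.1f} MB")   # 384-dim:   153.6 MB
print(f"1,536-dim: {large / 1e6:.1f} MB")   # 1,536-dim: 614.4 MB
print(f"ratio:     {small / large:.2f}")    # ratio:     0.25
```

Even at the hundred-thousand-row mark, the 384-dim vectors fit comfortably in memory on a laptop.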

I don't run it as a daemon. It's a sidecar I boot on demand — a tiny Python service that loads the model once and exposes a single /embed endpoint. I start it when I'm indexing or testing search and let it go otherwise. The download is one-time; after that, every embedding is free and never leaves the machine.
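A sketch of what such a sidecar could look like, using only the standard library for the HTTP part. The JSON shapes ({"texts": [...]} in, {"vectors": [...]} out), the port, and every function name here are assumptions made for illustration; the only fixed points from the text are the model name and the /embed path.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def embed_response(raw_body: bytes, embed_fn) -> bytes:
    """Turn a JSON request body {"texts": [...]} into a JSON response body."""
    payload = json.loads(raw_body or b"{}")
    vectors = embed_fn(payload.get("texts", []))
    return json.dumps({"vectors": vectors}).encode("utf-8")


def make_handler(embed_fn):
    """Build a handler class around embed_fn(list[str]) -> list[list[float]]."""

    class EmbedHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/embed":
                self.send_error(404, "unknown endpoint")
                return
            length = int(self.headers.get("Content-Length", 0))
            body = embed_response(self.rfile.read(length), embed_fn)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sidecar quiet

    return EmbedHandler


def serve_with_minilm(host="127.0.0.1", port=8765):
    """Load the model once, then serve until interrupted with Ctrl-C."""
    # Imported lazily so the rest of the module works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def embed(texts):
        # normalize_embeddings=True returns unit-length vectors.
        return model.encode(texts, normalize_embeddings=True).tolist()

    ThreadingHTTPServer((host, port), make_handler(embed)).serve_forever()
```

Calling serve_with_minilm() starts the service; binding to 127.0.0.1 keeps the endpoint reachable only from the same machine.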

That last part is the whole reason to self-host. Send your private corpus to a cloud embedding API and you've handed your knowledge base to a third party to feed an endpoint you don't control. Local embeddings keep the corpus where it belongs.

Store the Vectors in Plain SQLite

Here is the claim that saves the most money and pain: you don't need a dedicated vector engine until you're well past a hundred thousand rows. Below that, plain SQLite is the right answer, and it isn't close.

The pattern is to keep the embedding beside the source row, not in a separate system. If you already have a posts table, you add an embedding BLOB column and write the vector right next to the title and body. One database, one backup, one source of truth. No syncing a primary store to a vector store and reconciling the drift between them — the failure mode that eats afternoons.

Storing a vector is just serializing a float array. I pack the 384 floats into bytes (numpy's tobytes(), or struct.pack) and write the blob. To search, I read the blobs back, unpack them, and compare. SQLite reads tens of thousands of small blob rows in milliseconds, so "load them all and compare in memory" is not the bottleneck people assume it is.
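The round trip can be sketched as follows. The posts schema, the helper names, and the random stand-in vector are assumptions for illustration; a real site would open its existing database file and write vectors that came from the model.

```python
import sqlite3

import numpy as np

DIMS = 384  # output size of all-MiniLM-L6-v2


def to_blob(vec) -> bytes:
    """Serialize a vector as raw float32 bytes (4 bytes per dimension)."""
    arr = np.asarray(vec, dtype=np.float32)
    if arr.shape != (DIMS,):
        raise ValueError(f"expected {DIMS} floats, got shape {arr.shape}")
    return arr.tobytes()


def from_blob(blob: bytes) -> np.ndarray:
    """Inverse of to_blob; the dtype must match the one used when writing."""
    return np.frombuffer(blob, dtype=np.float32)


conn = sqlite3.connect(":memory:")  # stand-in for the site's real database
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")

# The embedding becomes one more column on the table that already exists.
conn.execute("ALTER TABLE posts ADD COLUMN embedding BLOB")

vec = np.random.default_rng(0).standard_normal(DIMS)  # stand-in for a real embedding
conn.execute(
    "INSERT INTO posts (title, body, embedding) VALUES (?, ?, ?)",
    ("Tunnel won't start", "Check the ingress rule ordering.", to_blob(vec)),
)
conn.commit()

row_id, blob = conn.execute("SELECT id, embedding FROM posts").fetchone()
restored = from_blob(blob)
print(row_id, len(blob), restored.shape)  # 1 1536 (384,)
```

The one thing to keep consistent is the dtype: a blob written as float32 and read back as float64 unpacks into half as many garbage values without raising an error.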

The unlock is that your embeddings are just another column of your existing data. Treat them that way and most of the supposed complexity disappears.

The Cosine-Similarity Query in About Fifteen Lines

The search itself is brute force, and at this scale brute force is correct. Embed the query, compare it against every stored vector, sort, take the top. With numpy it's almost nothing:

import numpy as np

def search(query_vec, rows, k=10):
    """Return the top-k (score, row_id) pairs; rows yields (row_id, float32 blob)."""
    q = query_vec / np.linalg.norm(query_vec)        # normalize the query
    scored = []
    for row_id, blob in rows:
        v = np.frombuffer(blob, dtype=np.float32)    # dtype must match what was written
        v = v / np.linalg.norm(v)                     # normalize the candidate
        score = float(np.dot(q, v))                   # cosine = dot of unit vectors
        scored.append((score, row_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # rank by score only
    return scored[:k]

That's the engine. Cosine similarity is the dot product of two unit vectors, so once everything is normalized, "how similar" is one multiply-and-sum. Normalize each stored vector once at write time and the per-row normalization drops out of the query path, leaving only the dot products.

People assume brute force can't be fast enough, but the arithmetic says otherwise. Ten thousand vectors at 384 dimensions is under four million multiply-adds (10,000 × 384 = 3,840,000), and a vectorized numpy operation finishes that before you notice. The scan doesn't start to drag until you're past roughly a hundred thousand rows, and a blog or catalog is nowhere near that.
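For a sense of what "vectorized" means here, this is a minimal sketch that stacks pre-normalized vectors into one matrix and scores every row with a single matrix-vector product. The function names and the random test data are assumptions for illustration, not part of the stack described above.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; meant to run once, at write time."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)  # leave all-zero rows untouched


def search_matrix(query_vec, unit_matrix, ids, k=10):
    """Rank pre-normalized rows against one query with a single matrix product."""
    if len(unit_matrix) == 0:
        return []
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = unit_matrix @ q                    # one dot product per stored row
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]   # the k best rows, unordered
    top = top[np.argsort(-scores[top])]         # sort only those k
    return [(float(scores[i]), ids[i]) for i in top]


rng = np.random.default_rng(0)
vectors = normalize_rows(rng.standard_normal((10_000, 384)).astype(np.float32))
ids = list(range(len(vectors)))

results = search_matrix(vectors[42], vectors, ids, k=5)
print(10_000 * 384)  # 3840000 multiply-adds for one query
print(results[0])    # the query row itself (id 42), with a score of about 1.0
```

Using argpartition avoids sorting all ten thousand scores when only the top few are needed.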

Wiring It Into a Real Site

This is where it stops being a demo and becomes infrastructure. My production sites are Next.js processes on one Mac behind a single Cloudflare tunnel — the same one-machine, multi-domain pattern I run everything on. Search drops straight into that.

A search request hits the site server-side. The server calls the local sidecar to vectorize the query, runs the cosine loop against the SQLite table that already holds the content, and renders the ranked results into the page. Because it's server-rendered, the model and the vectors never touch the browser — the client just gets HTML. The whole round trip happens between processes on one machine, behind the tunnel: no API key in the frontend, no per-query fee, nothing metered.
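The request flow above can be sketched in Python, even though the sites themselves are Next.js; the same three steps would apply in a server-side route. The sidecar URL, the JSON shapes, the posts schema, and the function names are all assumptions for illustration.

```python
import json
import sqlite3
import urllib.request

import numpy as np

SIDECAR_URL = "http://127.0.0.1:8765/embed"  # hypothetical local address


def embed_via_sidecar(text: str) -> np.ndarray:
    """POST the query text to the local sidecar and return its vector."""
    request = urllib.request.Request(
        SIDECAR_URL,
        data=json.dumps({"texts": [text]}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        payload = json.load(response)
    return np.asarray(payload["vectors"][0], dtype=np.float32)


def search_posts(conn, query, embed=embed_via_sidecar, k=10):
    """Embed the query, score every stored post, and return the top k hits."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    rows = conn.execute(
        "SELECT id, title, embedding FROM posts WHERE embedding IS NOT NULL"
    )
    hits = []
    for post_id, title, blob in rows:
        v = np.frombuffer(blob, dtype=np.float32)
        score = float(q @ v / np.linalg.norm(v))  # cosine similarity
        hits.append({"id": post_id, "title": title, "score": score})
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:k]
```

The page template then renders the returned list to HTML. Passing embed as a parameter makes it possible to test the ranking without the sidecar running.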

The result is search that understands intent over a corpus I fully own, served from the same box that holds the data. That's the sovereign thesis in one feature: the machine that stores your knowledge is also the one that searches it.

Where It Breaks and What to Add Next

Honesty section. The pattern has real edges, and you should hit them on purpose.

Indexing one item at a time is slow. The first thing to add is batch embedding: feed the model an array of texts instead of looping, and throughput jumps because the model processes the whole batch in one forward pass instead of paying per-call overhead for every item. On a first index of a few thousand documents, that can be the difference between minutes and seconds.
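A minimal sketch of the batching loop, with the model call passed in as a function so the chunking logic can be exercised without the model installed. The names and the batch size of 64 are assumptions, not tuned values.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    encode_batch: Callable[[List[str]], Sequence],
    batch_size: int = 64,
) -> list:
    """Embed all texts with one encode_batch call per chunk, preserving order."""
    vectors = []
    for batch in chunks(texts, batch_size):
        vectors.extend(encode_batch(list(batch)))
    return vectors


# With sentence-transformers installed, encode_batch could be something like:
#   model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
#   encode_batch = lambda batch: model.encode(batch, batch_size=len(batch))
```

sentence-transformers also batches internally when encode() is handed a list, so model.encode(texts, batch_size=64) on its own is the simplest version. The explicit loop earns its place when you want to write each chunk to the database as it completes.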

Brute force does eventually lose. Past roughly a hundred thousand vectors, scanning every row per query starts to drag. That's the moment to add an approximate-nearest-neighbor index such as FAISS or hnswlib, which trades a sliver of recall for a large speed win by not comparing against everything. (sqlite-vec is worth a look if you want to stay inside SQLite, but check its current docs: its early releases run an accelerated brute-force scan rather than a true ANN index.) Add it when the latency tells you to, not because a pricing page told you to up front.
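One way to let latency make the call is to time the brute-force scan at your real row count. The sketch below uses random unit vectors as a stand-in for a real corpus; the function name, trial count, and row counts are assumptions, and the numbers it prints will vary by machine.

```python
import time

import numpy as np


def scan_latency_ms(n_rows: int, dims: int = 384, trials: int = 20, seed: int = 0) -> float:
    """Median wall-clock milliseconds for one brute-force scan over n_rows unit vectors."""
    rng = np.random.default_rng(seed)
    matrix = rng.standard_normal((n_rows, dims), dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    query = matrix[0]
    kth = min(10, n_rows - 1)  # keep the partition index valid for tiny inputs
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        scores = matrix @ query              # score every row
        np.argpartition(-scores, kth)[:10]   # pick out the top ten
        timings.append((time.perf_counter() - start) * 1000)
    return float(np.median(timings))


for n in (10_000, 100_000):
    print(f"{n:>7} rows: {scan_latency_ms(n):.2f} ms")
```

This measures only the in-memory scan; reading the blobs out of SQLite adds to the total unless the matrix is cached between requests.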

Pure vector search isn't always the best ranking. Embeddings nail meaning but can miss exact keywords and rare terms. The mature move is a re-ranking pass: pull a wider candidate set with cosine similarity, then re-score the top handful with a cross-encoder or a keyword signal. That hybrid is what the expensive products quietly do under the hood — a few dozen lines you bolt on when quality, not cost, becomes the constraint.
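A minimal sketch of the keyword variant of that re-ranking pass, assuming the vector search has already produced (score, row_id) candidates. The overlap measure and the 0.3 weight are illustrative assumptions, not tuned values; a cross-encoder would slot into the same place as keyword_overlap.

```python
import re


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the text."""
    terms = set(re.findall(r"\w+", query.lower()))
    if not terms:
        return 0.0
    words = set(re.findall(r"\w+", text.lower()))
    return len(terms & words) / len(terms)


def rerank(query, candidates, texts, keep=10, keyword_weight=0.3):
    """Blend cosine scores with keyword overlap and return the best `keep` rows.

    candidates: (cosine_score, row_id) pairs from the vector search.
    texts: mapping of row_id to the document text.
    """
    blended = []
    for cosine, row_id in candidates:
        overlap = keyword_overlap(query, texts[row_id])
        score = (1 - keyword_weight) * cosine + keyword_weight * overlap
        blended.append((score, row_id))
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return blended[:keep]


texts = {
    1: "General networking notes",
    2: "Fixing the cloudflared ingress rule ordering",
}
candidates = [(0.62, 1), (0.60, 2)]  # cosine alone narrowly prefers row 1
print(rerank("ingress rule", candidates, texts))  # row 2 now ranks first
```

In practice you would pull a wider candidate set than the final page needs (say the top 50) and keep the top 10 after blending.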

Start with the fifteen-line loop. Add batch embedding, then an ANN index, then re-ranking — each one only when your own numbers demand it. You'll own every layer, and you'll have paid no one to learn you didn't need most of them yet.

FAQ

Do I really not need a vector database?

Not until you're past roughly a hundred thousand vectors. Below that, a plain SQLite table with embeddings stored beside the source rows and scanned with a brute-force cosine loop is fast enough on Apple Silicon and far simpler to operate. Reach for a dedicated index when latency at your real row count tells you to, not before.

Why all-MiniLM-L6-v2 instead of a larger model?

The 384-dimensional MiniLM model retrieves the right documents for blog, catalog, and wiki search while using a quarter of the storage of a 1,536-dim model and running locally at around 80 items per second. Bigger embeddings help for very large, nuanced corpora, but for most self-hosted search they're cost you don't get back.

Is cosine similarity actually all there is to it?

For ranking by meaning, yes — cosine similarity is the dot product of two normalized vectors, one multiply-and-sum per comparison. The sophistication in commercial products is the ANN index that keeps it fast at scale and optional re-ranking for quality. The core relevance math is genuinely this simple.

What does this stack cost to run?

Zero marginal cost. The Mac is already on, the Cloudflare tunnel is free, and the model is a one-time download. You're not adding a line item — you're deleting the vector-database subscription and the per-query embedding bill you'd otherwise pay forever.