How the chat works
A first-class portfolio artifact, not a bottom-right widget. Every response is grounded in indexed content and cites its sources.
Overview
The chat page is powered by a hybrid RAG (Retrieval-Augmented Generation) pipeline running entirely on Cloudflare's free tier — no dedicated server, no database, zero monthly cost.
Pipeline
1. Content chunking (build time)
At build time, every MDX post and page is parsed into an AST. The text is split on
## and ### headings, then paragraphs within each section
are packed into chunks targeting 400 tokens (hard cap 600), with 50-token overlap
between adjacent chunks to preserve context across boundaries. Code blocks and
paragraphs are never split mid-way.
Each chunk records { slug, chunkIndex, headingPath, text } —
the headingPath later powers the deep-link in every citation.
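The packing step can be sketched as follows. This is an illustrative sketch, not the repo's actual `scripts/chunk-content.ts`: the token estimate, the overlap handling, and the function names are assumptions, and the 600-token hard cap is omitted for brevity.

```typescript
// Hypothetical sketch of the greedy packing step: paragraphs within one
// heading section accumulate until the ~400-token target is reached, and
// a small trailing paragraph is carried forward as overlap context.
interface Chunk {
  slug: string;
  chunkIndex: number;
  headingPath: string[];
  text: string;
}

// Crude token estimate (~4 chars per token); the real build step would
// use the embedding model's tokenizer.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

function packChunks(
  slug: string,
  headingPath: string[],
  paragraphs: string[],
  target = 400,
  overlap = 50,
): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokens = 0;

  for (const para of paragraphs) {
    const paraTokens = estimateTokens(para);
    // Paragraphs are never split: flush the current chunk when adding
    // this one would exceed the target.
    if (tokens > 0 && tokens + paraTokens > target) {
      chunks.push({ slug, chunkIndex: chunks.length, headingPath, text: current.join("\n\n") });
      // Carry the last paragraph forward as overlap if it is small enough.
      const tail = current[current.length - 1];
      current = estimateTokens(tail) <= overlap ? [tail] : [];
      tokens = current.reduce((n, p) => n + estimateTokens(p), 0);
    }
    current.push(para);
    tokens += paraTokens;
  }
  if (current.length > 0) {
    chunks.push({ slug, chunkIndex: chunks.length, headingPath, text: current.join("\n\n") });
  }
  return chunks;
}
```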
2. Embedding + int8 quantization (build time)
Each chunk is embedded via Gemini gemini-embedding-001 (Matryoshka-truncated
to 768 dimensions, float32, L2-normalized). To minimize cold-start artifact size, vectors are quantized to int8
using per-vector symmetric quantization: scale = max(|v|) / 127, then
each component is stored as round(v / scale) alongside the float32 scale.
Measured recall@10 vs float32 baseline: ≥ 0.98 on a 100-chunk / 20-query sample.
The quantized vectors are written to generated/embeddings.bin — a compact binary blob
published alongside the static site.
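The quantization scheme above amounts to a few lines. A minimal sketch (function names are mine, not the repo's):

```typescript
// Per-vector symmetric int8 quantization as described above:
// scale = max(|v|) / 127, components stored as round(v / scale).
function quantize(v: Float32Array): { q: Int8Array; scale: number } {
  let maxAbs = 0;
  for (const x of v) maxAbs = Math.max(maxAbs, Math.abs(x));
  const scale = maxAbs / 127 || 1; // guard against an all-zero vector
  const q = new Int8Array(v.length);
  for (let i = 0; i < v.length; i++) q[i] = Math.round(v[i] / scale);
  return { q, scale };
}

// At request time the stored int8 components are scaled back to float.
function dequantize(q: Int8Array, scale: number): Float32Array {
  const v = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) v[i] = q[i] * scale;
  return v;
}
```

Because the scheme is symmetric, the round-trip error per component is bounded by scale/2, which is what keeps recall@10 near the float32 baseline.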
3. BM25 lexical index (build time)
MiniSearch builds
a BM25-style full-text index over { text, title, headingPath, tags }.
The serialized index (generated/search-index.json) is also shipped as a static asset.
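MiniSearch handles tokenization and scoring internally, but the BM25-style scoring involved looks roughly like this self-contained sketch (illustrative only — not MiniSearch's actual code, and the k1/b constants are conventional defaults, not values from the repo):

```typescript
// Illustrative BM25 scorer over a tiny in-memory corpus.
const K1 = 1.2; // term-frequency saturation
const B = 0.75; // document-length normalization

const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(docs: string[], query: string): number[] {
  const tokened = docs.map(tokenize);
  const avgLen = tokened.reduce((n, d) => n + d.length, 0) / docs.length;
  const terms = tokenize(query);
  return tokened.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = tokened.filter((d) => d.includes(term)).length;
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      score += (idf * tf * (K1 + 1)) / (tf + K1 * (1 - B + B * (doc.length / avgLen)));
    }
    return score;
  });
}
```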
4. Hybrid retrieval (request time)
When you send a message:
- Your query is embedded via a single Gemini API call (~80ms).
- Vector top-20: cosine similarity over all dequantized int8 embeddings in memory.
- BM25 top-20: MiniSearch lexical search.
- Results are merged and reranked by a weighted formula:
  0.55 × vector + 0.30 × bm25 + 0.10 × recency_decay + 0.05 × type_boost
  where recency_decay = exp(−age_days / 365) and type_boost is 1.1 for projects/case-studies, 1.0 for blog posts, 0.8 for pages.
- Top-5 chunks are passed to the LLM as grounding context.
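The rerank formula translates directly into code. A sketch, assuming vector and BM25 scores are already normalized to [0, 1] (the interface and field names are mine, not the repo's):

```typescript
// Weighted rerank: 0.55·vector + 0.30·bm25 + 0.10·recency_decay + 0.05·type_boost.
interface Scored {
  chunkId: string;
  vector: number; // cosine similarity from the vector top-20
  bm25: number;   // normalized lexical score from the BM25 top-20
  ageDays: number;
  type: "project" | "case-study" | "blog" | "page";
}

const typeBoost = (t: Scored["type"]): number =>
  t === "project" || t === "case-study" ? 1.1 : t === "blog" ? 1.0 : 0.8;

const recencyDecay = (ageDays: number): number => Math.exp(-ageDays / 365);

const combined = (c: Scored): number =>
  0.55 * c.vector + 0.3 * c.bm25 + 0.1 * recencyDecay(c.ageDays) + 0.05 * typeBoost(c.type);

const rerank = (candidates: Scored[], topK = 5): Scored[] =>
  [...candidates].sort((a, b) => combined(b) - combined(a)).slice(0, topK);
```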
5. Gemini 2.5 Flash streaming
The top-5 chunks are wrapped in <source id="chunk-xyz"> tags and sent
to Gemini 2.5 Flash with a system prompt instructing it to cite sources using
<cite id="chunk-xyz"/> inline. The model's streaming response is validated
server-side: any <cite> referencing an ID not in the current prompt's whitelist
is replaced with [unknown source] to prevent hallucinated citations.
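The whitelist check boils down to a regex pass over the model's output. A minimal sketch (the real endpoint also has to buffer partial tags across stream chunk boundaries, which is omitted here):

```typescript
// Server-side citation guard: any <cite id="…"/> whose id is not in the
// current prompt's whitelist is replaced with a literal marker.
function sanitizeCitations(text: string, allowedIds: Set<string>): string {
  return text.replace(/<cite\s+id="([^"]*)"\s*\/>/g, (match, id: string) =>
    allowedIds.has(id) ? match : "[unknown source]",
  );
}
```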
6. Citation deep-links
Valid <cite> tags render as superscripted footnote numbers in the UI.
Each links to the source chunk's URL: /blog/<slug>#<heading-anchor>.
You can click any citation to jump directly to the passage the model cited.
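Building the deep-link from a chunk's headingPath can be sketched like this. The slugification rule is an assumption — it has to match whatever rule the site uses to generate heading ids — and both function names are hypothetical:

```typescript
// Turn a heading into a URL anchor (assumed rule: lowercase,
// strip punctuation, spaces to hyphens).
const slugifyHeading = (heading: string): string =>
  heading
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "")
    .trim()
    .replace(/\s+/g, "-");

// Citation URL: /blog/<slug>#<anchor of the deepest heading>.
const citationUrl = (slug: string, headingPath: string[]): string => {
  const leaf = headingPath[headingPath.length - 1];
  return leaf ? `/blog/${slug}#${slugifyHeading(leaf)}` : `/blog/${slug}`;
};
```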
Infrastructure
- Hosting: Cloudflare Pages (static site + serverless Functions).
- Chat endpoint: src/pages/api/chat.ts — a Pages Function at /api/chat.
- Rate limit: 3 requests/IP/hour, stored in Cloudflare KV with 1-hour TTL. No PII — keys are SHA-256 of ip + hour_bucket.
- Spam protection: Cloudflare Turnstile (invisible widget) on every request.
- Preview guard: preview deploys return HTTP 503 on /api/chat so public preview URLs can't drain the Gemini API quota.
- Cost: $0/month (Cloudflare free tier + Gemini free tier).
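The PII-free rate-limit key described above can be sketched in a few lines. The delimiter between ip and hour bucket is an assumption, and a real Pages Function would use Web Crypto's crypto.subtle.digest rather than node:crypto, which is used here only so the sketch runs under Node:

```typescript
import { createHash } from "node:crypto";

// Rate-limit key: SHA-256 over ip + hour bucket. The hash is stored in KV
// with a 1-hour TTL, so no raw IP ever persists.
function rateLimitKey(ip: string, now: Date = new Date()): string {
  const hourBucket = Math.floor(now.getTime() / 3_600_000); // hours since epoch
  return createHash("sha256").update(`${ip}:${hourBucket}`).digest("hex");
}
```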
Scaling ceiling
The bundled static-asset architecture works comfortably to ~1,500 posts (about 7,500 chunks, ~6 MB binary blob). Beyond that, the natural upgrade is Cloudflare Vectorize as the vector store — a 2-hour swap when the pain is real. For now, everything fits in a Cloudflare Pages Function isolate with no external vector DB.
Source
All code is in the portfolio repo.
Key files: src/pages/api/chat.ts, src/lib/retrieve.ts,
src/lib/embeddings.ts, scripts/chunk-content.ts,
scripts/embed-content.ts.