portfolio Anshul Bisen
chat architecture

How the chat works.

A first-class portfolio artifact, not a bottom-right widget. Every response is grounded in indexed content and cites its sources.

Overview

The chat page is powered by a hybrid RAG (Retrieval-Augmented Generation) pipeline running entirely on Cloudflare's free tier — no dedicated server, no database, zero monthly cost.

Pipeline

1. Content chunking (build time)

At build time, every MDX post and page is parsed into an AST. The text is split on ## and ### headings, then paragraphs within each section are packed into chunks targeting 400 tokens (hard cap 600), with 50-token overlap between adjacent chunks to preserve context across boundaries. Code blocks and paragraphs are never split mid-way.

Each chunk records { slug, chunkIndex, headingPath, text } — the headingPath later powers the deep-link in every citation.
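The packing step can be sketched roughly as follows. This is a minimal illustration, not the actual `scripts/chunk-content.ts`: the `packParagraphs` name is hypothetical, and a whitespace word count stands in for the real tokenizer.

```typescript
interface Chunk {
  slug: string;
  chunkIndex: number;
  headingPath: string[];
  text: string;
}

const TARGET = 400;   // soft target tokens per chunk
const HARD_CAP = 600; // never exceeded by packing
const OVERLAP = 50;   // tokens carried into the next chunk

// Crude stand-in for the real tokenizer: whitespace word count.
const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length;

function packParagraphs(slug: string, headingPath: string[], paragraphs: string[]): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokens = 0;

  const flush = () => {
    chunks.push({ slug, chunkIndex: chunks.length, headingPath, text: current.join("\n\n") });
    // Carry the last OVERLAP tokens forward to preserve context across the boundary.
    const tail = current.join(" ").split(/\s+/).filter(Boolean).slice(-OVERLAP).join(" ");
    current = tail ? [tail] : [];
    tokens = countTokens(tail);
  };

  for (const p of paragraphs) {
    const pTokens = countTokens(p);
    // Paragraphs are never split: flush before one that would overflow the hard cap,
    // or once the soft target has already been reached.
    if (tokens > 0 && (tokens + pTokens > HARD_CAP || tokens >= TARGET)) flush();
    current.push(p);
    tokens += pTokens;
  }
  if (tokens > 0) {
    chunks.push({ slug, chunkIndex: chunks.length, headingPath, text: current.join("\n\n") });
  }
  return chunks;
}
```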

2. Embedding + int8 quantization (build time)

Each chunk is embedded via Gemini gemini-embedding-001 (Matryoshka-truncated to 768 dimensions, float32, L2-normalized). To minimize cold-start artifact size, vectors are quantized with per-vector symmetric int8 quantization: scale = max(|v|) / 127, then each component is stored as round(v / scale) alongside the float32 scale.
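In code, the quantize/dequantize round-trip is a few lines. A minimal sketch of the scheme described above (function names are illustrative):

```typescript
// Per-vector symmetric int8 quantization: scale = max(|v|) / 127.
function quantize(v: Float32Array): { scale: number; q: Int8Array } {
  let maxAbs = 0;
  for (const x of v) maxAbs = Math.max(maxAbs, Math.abs(x));
  const scale = maxAbs / 127 || 1; // guard against an all-zero vector
  const q = new Int8Array(v.length);
  for (let i = 0; i < v.length; i++) q[i] = Math.round(v[i] / scale);
  return { scale, q };
}

// Dequantization is a single multiply per component.
function dequantize(scale: number, q: Int8Array): Float32Array {
  const v = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) v[i] = q[i] * scale;
  return v;
}
```

The per-component round-trip error is bounded by scale/2, which is why recall barely moves against the float32 baseline.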

Measured recall@10 vs the float32 baseline: ≥ 0.98 on a 100-chunk / 20-query sample. The quantized vectors are serialized to generated/embeddings.bin — a compact binary blob published alongside the static site.

3. BM25 lexical index (build time)

MiniSearch builds a BM25-style full-text index over { text, title, headingPath, tags }. The serialized index (generated/search-index.json) is also shipped as a static asset.
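For intuition, this is the ranking function MiniSearch implements under the hood — a minimal standalone BM25 scorer (not MiniSearch's actual code), using the common defaults k1 = 1.2 and b = 0.75:

```typescript
const K1 = 1.2, B = 0.75;
const tokenize = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean);

// Score every document against the query with classic BM25.
function bm25Scores(query: string, docs: string[]): number[] {
  const tokenized = docs.map(tokenize);
  const N = docs.length;
  const avgdl = tokenized.reduce((s, d) => s + d.length, 0) / N;

  // Document frequency per term (how many docs contain it at least once).
  const df = new Map<string, number>();
  for (const d of tokenized)
    for (const t of new Set(d)) df.set(t, (df.get(t) ?? 0) + 1);

  return tokenized.map((d) => {
    let score = 0;
    for (const t of tokenize(query)) {
      const n = df.get(t) ?? 0;
      if (n === 0) continue;
      const f = d.filter((w) => w === t).length; // term frequency in this doc
      const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
      score += (idf * f * (K1 + 1)) / (f + K1 * (1 - B + (B * d.length) / avgdl));
    }
    return score;
  });
}
```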

4. Hybrid retrieval (request time)

When you send a message:

  1. Your query is embedded via a single Gemini API call (~80ms).
  2. Vector top-20: cosine similarity over all dequantized int8 embeddings in memory.
  3. BM25 top-20: MiniSearch lexical search.
  4. Results are merged and reranked by a weighted formula: 0.55 × vector + 0.30 × bm25 + 0.10 × recency_decay + 0.05 × type_boost, where recency_decay = exp(−age_days / 365) and type_boost is 1.1 for projects/case-studies, 1.0 for blog posts, 0.8 for pages.
  5. Top-5 chunks are passed to the LLM as grounding context.
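The rerank step above reduces to a pure scoring function. A minimal sketch — the `Candidate` shape and names are assumptions, not the actual `src/lib/retrieve.ts` types:

```typescript
interface Candidate {
  id: string;
  vectorScore: number; // cosine similarity, in [0, 1]
  bm25Score: number;   // lexical score, normalized to [0, 1]
  ageDays: number;
  type: "project" | "case-study" | "blog" | "page";
}

const typeBoost = (t: Candidate["type"]): number =>
  t === "project" || t === "case-study" ? 1.1 : t === "blog" ? 1.0 : 0.8;

// The weighted formula from step 4.
const finalScore = (c: Candidate): number =>
  0.55 * c.vectorScore +
  0.30 * c.bm25Score +
  0.10 * Math.exp(-c.ageDays / 365) +
  0.05 * typeBoost(c.type);

// Sort the merged vector + BM25 candidates; callers take the top 5.
const rerank = (cands: Candidate[]): Candidate[] =>
  [...cands].sort((a, b) => finalScore(b) - finalScore(a));
```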

5. Gemini 2.5 Flash streaming

The top-5 chunks are wrapped in <source id="chunk-xyz"> tags and sent to Gemini 2.5 Flash with a system prompt instructing it to cite sources using <cite id="chunk-xyz"/> inline. The model's streaming response is validated server-side: any <cite> referencing an ID not in the current prompt's whitelist is replaced with [unknown source] to prevent hallucinated citations.
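The whitelist check is essentially one regex replace. A sketch of the idea (the regex and function name are assumptions; in a real streaming pipeline it would run over buffered chunks, since a `<cite>` tag can be split across stream boundaries):

```typescript
// Replace any <cite id="…"/> whose id is not in the current prompt's
// whitelist with a visible placeholder, preventing hallucinated citations.
function sanitizeCitations(text: string, allowedIds: Set<string>): string {
  return text.replace(/<cite id="([^"]+)"\s*\/>/g, (match, id: string) =>
    allowedIds.has(id) ? match : "[unknown source]",
  );
}
```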

6. Citation deep-links

Valid <cite> tags render as superscripted footnote numbers in the UI. Each links to the source chunk's URL: /blog/<slug>#<heading-anchor>. You can click any citation to jump directly to the passage the model cited.
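Building the deep-link from a chunk's metadata might look like this — a sketch assuming a GitHub-style heading slugifier; both helper names are hypothetical:

```typescript
// Lowercase, strip punctuation, hyphenate: "Hybrid retrieval (request time)"
// becomes "hybrid-retrieval-request-time".
const slugifyHeading = (h: string): string =>
  h.toLowerCase().trim().replace(/[^\w\s-]/g, "").replace(/\s+/g, "-");

// The last element of headingPath is the nearest enclosing heading,
// which becomes the anchor on the post's page.
const citationUrl = (slug: string, headingPath: string[]): string => {
  const leaf = headingPath[headingPath.length - 1];
  return leaf ? `/blog/${slug}#${slugifyHeading(leaf)}` : `/blog/${slug}`;
};
```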

Infrastructure

Scaling ceiling

The bundled static-asset architecture works comfortably up to ~1,500 posts (about 7,500 chunks, ~6 MB binary blob). Beyond that, the natural upgrade is Cloudflare Vectorize as the vector store — a 2-hour swap when the pain is real. For now, everything fits in a Cloudflare Pages Function isolate with no external vector DB.

Source

All code is in the portfolio repo. Key files: src/pages/api/chat.ts, src/lib/retrieve.ts, src/lib/embeddings.ts, scripts/chunk-content.ts, scripts/embed-content.ts.