
Ask This Blog: Adding AI to the Edge with Workers AI, Vectorize, and AI Gateway

What I Built

Visit saltwaterbrc.com/ask.html and you’ll see a text box. Ask it anything about my blog posts — “How do I set up Durable Objects?” or “What’s the cost of R2?” — and it generates an answer based on the content I’ve published.

That’s the user-facing feature. But the architecture underneath is what I really want to talk about, because it uses three Cloudflare products that I couldn’t explain credibly until I built this.

The Architecture in Plain English

Here’s how it works:

Step 1: Content becomes vectors. When I publish a blog post, my deployment pipeline chunks the content into small pieces — a few hundred words each. Each chunk gets converted into a vector, which is a mathematical representation of its meaning. The word “Durable Objects” and the phrase “persistent state machine” end up with very similar vectors because they mean nearly the same thing. Keywords don’t matter. Meaning does.
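Step 1 can be sketched as a small chunking helper. The 200-word chunk size and the function name are my assumptions, not the exact chunker my pipeline uses:

```javascript
// Hypothetical chunker: split post content into ~200-word pieces
// before embedding. Word-based splitting keeps each chunk small
// enough for an embedding model's input limit.
function chunkText(text, maxWords = 200) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}
```

Each chunk then goes to the embedding model; chunking first matters because one vector per whole post would blur together every topic the post covers.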

Step 2: Vectors get stored in a database. Those vectors live in Vectorize, which is Cloudflare’s vector database. It’s optimized for similarity search — give it a vector, and it quickly finds the most similar ones. No scanning through millions of rows; lookups come back in milliseconds.
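A sketch of step 2: a helper that pairs each chunk’s embedding with its original text before handing everything to Vectorize. The binding name `env.VECTORIZE`, the embedding model ID, and the ID scheme are assumptions, not my exact setup:

```javascript
// Build the Vectorize upsert payload from embedding vectors.
// Each record pairs a chunk's vector with its text, so search
// results can hand the original prose back to the model later.
function toVectorRecords(slug, chunks, embeddings) {
  return chunks.map((text, i) => ({
    id: `${slug}#${i}`,          // stable per-chunk ID
    values: embeddings[i],       // the embedding vector
    metadata: { text },          // keep the source text alongside it
  }));
}

// In the Worker (binding names are illustrative):
//   const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: chunks });
//   await env.VECTORIZE.upsert(toVectorRecords(slug, chunks, data));
```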

Step 3: Question becomes vector. When you ask a question, it gets converted to a vector using the same model that processed my blog content. Now we have a question and a database of content vectors.

Step 4: Find the relevant content. Vectorize searches for vectors most similar to your question. It returns the top three chunks from my blog posts that are most relevant to what you’re asking — not because of keywords, but because the meaning is closest.

Step 5: AI generates the answer. I feed those three chunks to Llama 3.1 running on Workers AI. I include your question and the context, and the model generates a natural language answer. Everything happens at the edge. No external API calls. No data leaving Cloudflare’s network.
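Steps 3 through 5 might look like this inside the Worker. The model IDs and binding names are assumptions based on Cloudflare’s public catalog, and `buildPrompt` is a hypothetical helper:

```javascript
// Assemble the chat messages for the answer model from the question
// and the top matching chunks. Numbering the excerpts makes it easy
// for the model (and for debugging) to see what context it was given.
function buildPrompt(question, contextChunks) {
  return [
    {
      role: "system",
      content:
        "Answer using only the blog excerpts below.\n\n" +
        contextChunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n"),
    },
    { role: "user", content: question },
  ];
}

// In the Worker handler (a sketch, not my exact code):
//   const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });
//   const results = await env.VECTORIZE.query(data[0], { topK: 3, returnMetadata: true });
//   const chunks = results.matches.map((m) => m.metadata.text);
//   const answer = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
//     messages: buildPrompt(question, chunks),
//   });
```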

Step 6: Route through the gateway. Every request — embeddings, vector search, and AI inference — routes through AI Gateway, which logs every single interaction: which model was called, how many tokens were used, latency, cache hits, everything.
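One way to sketch step 6: the Workers AI binding accepts a gateway option that routes the call through a named AI Gateway. The gateway ID below is an assumption, and the wrapper is purely illustrative:

```javascript
// Thin wrapper that adds AI Gateway routing to any AI.run options,
// so every inference call in the Worker goes through one gateway
// and shows up in its logs. The gateway ID is a placeholder.
function withGateway(options = {}) {
  return { ...options, gateway: { id: "saltwaterbrc-gateway" } };
}

// Usage in the Worker:
//   await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages }, withGateway());
```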

Before I built this, I could explain these products individually. Now I can explain how they work together to solve a real problem: taking customer questions and answering them with AI, without ever leaving the network.

Why This Matters for Sales Conversations

When you’re sitting across from a prospect in healthcare, financial services, or any regulated industry, the first question isn’t “Can your AI answer questions?” It’s “Where does my data go?”

Here’s what I can now tell them: “Every request stays inside Cloudflare’s network. Your content gets vectorized, stored, and queried — all at the edge. The AI inference happens at the edge. No external calls. No data leaving your security perimeter. This is the same architecture that powers support chatbots, product search, and customer service automation, and it’s secure by default.”

That’s a conversation you can’t have without building it yourself.

But there’s a second piece that’s equally important: observability. The moment I deployed “Ask This Blog,” I opened the AI Gateway dashboard and could see everything. Every request. Every model call. Tokens consumed. Cache hits. Latency. Cost.

That’s the moment the sales conversation changes. “You wouldn’t run a database without monitoring it. Why would you run AI inference without visibility?” That’s not theory. I was looking at my own data.

AI Gateway turns inference from a black box into a fully observable system. That’s what enterprise customers need.

What Makes This Different from Just Using ChatGPT

There’s a reason enterprises can’t just call OpenAI’s API:

Data residency. When you call an external API, your data leaves your network. For healthcare providers, financial institutions, and companies subject to data localization laws, that’s not an option. Everything here happens at the edge.

Rate limiting and authentication. The same Workers that run the AI also enforce rate limits, validate requests, and authenticate users. No external gateway needed. No vendor lock-in to an API management platform.

Caching. If two customers ask the same question, the first one pays for the inference. The second one hits the cache and costs zero. Try that with the ChatGPT API.

DLP (Data Loss Prevention). I can inject Cloudflare’s DLP rules into the request — scan for credit card numbers, PII, classified data — before it ever reaches the model. Same way I’d protect a database.

This is AI infrastructure, not just an AI feature. Most companies treat AI like a feature bolted onto their product. This is the infrastructure that scales to millions of concurrent users without vendor lock-in.

The Stack Now

saltwaterbrc.com now runs on seven Cloudflare products:

  - Pages — hosting and auto-deployment from GitHub
  - Workers — request routing and API logic
  - Durable Objects — visitor counter with persistent state
  - R2 — object storage for downloadable PDFs
  - Workers AI — AI model inference (Llama 3.1)
  - Vectorize — vector database for semantic search
  - AI Gateway — request logging, monitoring, and analytics
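For a sense of how these wire together, here’s a hypothetical wrangler.toml with the relevant bindings. Every name, ID, and bucket below is illustrative, not my actual config:

```toml
# Hypothetical Worker config (names and IDs are placeholders)
name = "ask-this-blog"
main = "src/index.js"

[ai]
binding = "AI"                 # Workers AI (also routes via AI Gateway)

[[vectorize]]
binding = "VECTORIZE"
index_name = "blog-posts"

[[durable_objects.bindings]]
name = "COUNTER"
class_name = "VisitorCounter"

[[r2_buckets]]
binding = "PDFS"
bucket_name = "downloads"
```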

Total cost: $0/month.

And that’s before caching, rate limiting, and all the other products layered on top. The point isn’t that I’m saving money on a blog. The point is that I’m running an architecture that scales from a solo project to enterprise AI infrastructure without changing the code.

Protecting the Endpoint: Rate Limiting with WAF

Every question someone asks triggers Workers AI inference — embedding the question, querying Vectorize, and generating a response with Llama 3.1. That’s real compute. If someone scripts a bot to hit /ask a thousand times a minute, that’s a thousand AI inference calls you’re paying for.

Cloudflare’s WAF has advanced rate limiting built in. Here’s how to set it up:

  1. Go to Security → Security rules in the Cloudflare dashboard
  2. Under Advanced rate limiting rules, click Create rule
  3. Rule name: Rate Limit Ask Endpoint
  4. If matching: Field = URI Path, Operator = contains, Value = /ask
  5. Rate: 10 requests per 1 minute
  6. Action: Block
  7. Duration: 60 seconds
  8. Deploy the rule

That’s it. Any single IP that sends more than 10 questions per minute gets blocked for 60 seconds. Generous enough for a real user clicking around, but stops bots from hammering your AI inference.
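To make the policy concrete, here’s a toy fixed-window limiter that mirrors the math of the WAF rule. This is illustration only; the real enforcement happens at Cloudflare’s edge before traffic ever reaches the Worker:

```javascript
// Toy fixed-window rate limiter: at most `limit` requests per
// `windowMs` per key (e.g. client IP). Returns true if the request
// is allowed, false once the key exceeds the limit in the window.
function makeRateLimiter(limit, windowMs) {
  const hits = new Map(); // key -> { windowStart, count }
  return (key, now = Date.now()) => {
    const entry = hits.get(key);
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(key, { windowStart: now, count: 1 });
      return true; // fresh window, first request allowed
    }
    entry.count += 1;
    return entry.count <= limit; // block once over the limit
  };
}
```

With `makeRateLimiter(10, 60_000)`, the 11th request from one key inside a minute is rejected, which is exactly the 10-per-minute, 60-second-block behavior configured in the dashboard.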

This is the same WAF that’s already running three managed rulesets on the site — Cloudflare Managed Ruleset (SQLi, XSS, RCE), OWASP Core Ruleset, and Leaked Credentials Check. Rate limiting is just another layer on top. One dashboard, one policy engine.

For enterprise customers running AI endpoints in production, this is table stakes. You wouldn’t expose a database without rate limiting. Don’t expose an AI endpoint without it either.

What’s Next

Phase 5 is the Sandbox SDK — an interactive code playground where visitors can write and deploy a Worker directly in the browser. No CLI. No local setup. Just click, code, deploy.

After that, I’m migrating the entire site to Astro, which will give me a better framework for managing content and templates.

But for now, try the AI assistant. Ask it questions about blog posts. See what it gets right and wrong. And if you’re selling a developer platform or an AI product, build something like this. Hit the errors. Learn the limitations. That’s what turns a pitch into a conversation.

