Building a Self-Hosted LLM Stack That Actually Scales
Running a local language model is easy. Running one reliably under load, with a clean API, proper auth, and logging, is a different problem.
Most self-hosted LLM tutorials stop at 'run Ollama, see it respond.' That's fine for experiments. For production use — a coding assistant used by a 20-person engineering team, for example — you need a layer of infrastructure around the model.
The stack I settled on: Ollama for model serving (Llama 3 70B quantised at Q4_K_M), FastAPI for the gateway, Bearer token auth, Redis for in-flight request dedup, and Prometheus + Grafana for latency dashboards. Containerised with Docker Compose, orchestrated on a single beefy VM to start.
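The two gateway responsibilities that trip people up are the Bearer token check and the in-flight dedup. A minimal sketch of both, with a plain dict standing in for Redis (in production the dedup set would be Redis `SETNX` with a TTL); all names here are illustrative, not taken from the actual gateway:

```python
import hashlib
import json

VALID_TOKENS = {"team-token-abc123"}   # hypothetical Bearer tokens
_in_flight: set[str] = set()           # stand-in for the Redis dedup set

def is_authorized(auth_header: str) -> bool:
    """Check a 'Bearer <token>' header against the allow-list."""
    if not auth_header.startswith("Bearer "):
        return False
    return auth_header.removeprefix("Bearer ") in VALID_TOKENS

def dedup_key(payload: dict) -> str:
    """Stable hash of the request body, used as the dedup key."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def try_acquire(payload: dict) -> bool:
    """True if this request is new; False if an identical request is
    already in flight (the caller should wait for its result instead)."""
    key = dedup_key(payload)
    if key in _in_flight:
        return False
    _in_flight.add(key)
    return True

def release(payload: dict) -> None:
    """Clear the in-flight marker once the model has responded."""
    _in_flight.discard(dedup_key(payload))
```

Hashing the canonicalised JSON body (keys sorted) means two clients submitting the same prompt in a different field order still dedup to one inference call.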
The main subtlety is prompt caching. Ollama reuses its KV cache only when the prompt prefix is identical across requests, so clients must send exactly the same system prompt. Storing a shared system prompt in Redis, versioned by content hash, reduced average time to first token (TTFT) by 34% in my tests.
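The versioning-by-hash idea can be sketched as follows. A dict stands in for Redis, and the key format and function names are assumptions for illustration:

```python
import hashlib

_prompt_store: dict[str, str] = {}  # stand-in for Redis

def publish_system_prompt(text: str) -> str:
    """Store a system prompt keyed by the hash of its content and
    return the version id clients should pin to."""
    version = hashlib.sha256(text.encode()).hexdigest()[:12]
    _prompt_store[f"sysprompt:{version}"] = text
    return version

def fetch_system_prompt(version: str) -> str:
    """Fetch the exact prompt bytes for a version. Every client that
    pins the same version sends byte-identical system prompts, which
    keeps the model's KV cache for the prompt prefix warm."""
    return _prompt_store[f"sysprompt:{version}"]
```

Because the version is derived from the content, republishing an unchanged prompt yields the same id, and any edit to the prompt produces a new id, so clients can never silently drift onto divergent prompt text.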
Would I run this on Kubernetes? For a team of five or more, yes. The horizontal scaling story for inference is awkward (GPUs are expensive), but you can run the gateway layer on cheap pods and scale the Ollama instances separately on bare-metal GPU machines. That split has worked well for me.