Building a Self-Hosted LLM Stack That Actually Scales
Running a local language model is easy. Running one reliably under load, with a clean API, proper auth, and logging, is a different problem.
Most self-hosted LLM tutorials stop at 'run Ollama, see it respond.' That's fine for experiments. For production use — a coding assistant used by a 20-person engineering team, for example — you need a layer of infrastructure around the model.
The stack I settled on: Ollama for model serving (Llama 3 70B quantised at Q4_K_M), FastAPI for the gateway, Bearer token auth, Redis for in-flight request dedup, and Prometheus + Grafana for latency dashboards. Containerised with Docker Compose, orchestrated on a single beefy VM to start.
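The two gateway responsibilities that trip people up are the Bearer token check and the in-flight dedup. A minimal sketch of both, with a plain dict standing in for Redis (in production the dedup set would be Redis `SETNX` with a TTL); all names here are illustrative, not taken from the actual gateway:

```python
import hashlib
import json

VALID_TOKENS = {"team-token-abc123"}   # hypothetical Bearer tokens
_in_flight: set[str] = set()           # stand-in for the Redis dedup set

def is_authorized(auth_header: str) -> bool:
    """Check a 'Bearer <token>' header against the allow-list."""
    if not auth_header.startswith("Bearer "):
        return False
    return auth_header.removeprefix("Bearer ") in VALID_TOKENS

def dedup_key(payload: dict) -> str:
    """Stable hash of the request body, used as the dedup key."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def try_acquire(payload: dict) -> bool:
    """True if this request is new; False if an identical request is
    already in flight (the caller should wait for its result instead)."""
    key = dedup_key(payload)
    if key in _in_flight:
        return False
    _in_flight.add(key)
    return True

def release(payload: dict) -> None:
    """Clear the in-flight marker once the model has responded."""
    _in_flight.discard(dedup_key(payload))
```

Hashing the canonicalised JSON body (keys sorted) means two clients submitting the same prompt in a different field order still dedup to one inference call.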
The main subtlety is prompt caching. Ollama reuses its KV cache only when the prompt prefix is identical across requests, so clients must send exactly the same system prompt. Storing a shared system prompt in Redis, versioned by content hash, reduced average time to first token (TTFT) by 34% in my tests.
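The versioning-by-hash idea can be sketched as follows. A dict stands in for Redis, and the key format and function names are assumptions for illustration:

```python
import hashlib

_prompt_store: dict[str, str] = {}  # stand-in for Redis

def publish_system_prompt(text: str) -> str:
    """Store a system prompt keyed by the hash of its content and
    return the version id clients should pin to."""
    version = hashlib.sha256(text.encode()).hexdigest()[:12]
    _prompt_store[f"sysprompt:{version}"] = text
    return version

def fetch_system_prompt(version: str) -> str:
    """Fetch the exact prompt bytes for a version. Every client that
    pins the same version sends byte-identical system prompts, which
    keeps the model's KV cache for the prompt prefix warm."""
    return _prompt_store[f"sysprompt:{version}"]
```

Because the version is derived from the content, republishing an unchanged prompt yields the same id, and any edit to the prompt produces a new id, so clients can never silently drift onto divergent prompt text.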
Would I run this on Kubernetes? For a team of five or more, yes. The horizontal scaling story for inference is awkward (GPUs are expensive), but you can run the gateway layer on cheap pods and scale the Ollama instances separately on bare-metal GPU machines. That split has worked well for me.