At AINA, we build local AI solutions — systems where data never leaves the building. No cloud API, no ongoing token costs, full control. Sounds great in theory. But how far can you push it?
To find out, I deployed our RAG demo — an AI assistant for a fictional municipal administration — on a Hetzner VPS. The specs: 2 ARM CPU cores, 4 GB RAM, no GPU. Monthly cost: €3.29.
The result after one night of optimization: Response time from 80 seconds down to 7 seconds — an 11× speedup on identical hardware. Software decisions only.
Here's how we got there.
Phase 1: The Naive Approach — Ollama
The first instinct for LLM deployment is often: install Ollama, load a model, done. That's exactly what I did. Qwen2.5-1.5B as the model, Ollama as the runtime.
The result was sobering: 80 to 120 seconds per request. On a server with 2 cores and 4 GB RAM, that's not surprising — but it's not presentable either. No demo visitor waits two minutes for an answer.
To make matters worse, the LLM hallucinated freely. A question about waste collection fees returned "€3.58" — a completely fabricated number. This is a well-known problem with language models lacking context grounding, but for a demo targeting municipal staff, it's a dealbreaker.
Phase 2: Streaming and Prompt Engineering
Next step: enable streaming. Instead of making the user wait for the complete response, the first token appears after about 2 seconds. This dramatically changes the perceived speed — the visitor immediately sees something is happening.
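Under the hood, llama.cpp's bundled `llama-server` exposes an OpenAI-compatible `/v1/chat/completions` endpoint; with `stream: true`, each token arrives as a server-sent-events line. A minimal sketch of the client-side parsing, here fed with a simulated stream instead of a live server:

```python
import json

def iter_sse_tokens(lines):
    """Yield content deltas from OpenAI-style SSE lines.

    Each streamed line looks like: data: {"choices":[{"delta":{"content":"..."}}]}
    The stream ends with:          data: [DONE]
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Simulated stream, shaped like llama-server's SSE output:
sample = [
    'data: {"choices":[{"delta":{"content":"The"}}]}',
    'data: {"choices":[{"delta":{"content":" fee"}}]}',
    'data: {"choices":[{"delta":{"content":" is"}}]}',
    'data: [DONE]',
]
print("".join(iter_sse_tokens(sample)))  # -> The fee is
```

In the demo, each yielded delta is appended to the visible answer immediately, which is all it takes for the "first token after ~2 seconds" effect.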
In parallel, I tightened the prompts and reduced token limits. The RAG column of the demo now delivers verifiable answers: "€264.00 — Source: Waste Collection Fee Ordinance". Hallucination versus cited facts, side by side — as a demo effect, this is actually more powerful than a model that never hallucinates.
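A tightened RAG prompt follows a simple pattern: context excerpts first, strict rules after, cite the source by name. A hypothetical sketch of such a prompt builder (wording and document names are illustrative, not the demo's actual prompt):

```python
def build_rag_prompt(question, chunks):
    """Assemble a grounded prompt from retriever output.

    chunks: list of (source_name, text) tuples from the retriever.
    """
    context = "\n\n".join(f"[{src}]\n{text}" for src, text in chunks)
    return (
        "Answer using ONLY the excerpts below. "
        "Cite the source in the form 'Source: <name>'. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is the annual waste collection fee?",
    [("Waste Collection Fee Ordinance",
      "The annual fee for a 120 l bin is €264.00.")],
)
print(prompt)
```

The instruction to refuse when the excerpts are silent matters as much as the citation rule: it is what separates the cited column from the hallucinating one.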
Still, the total response time remained above 60 seconds.
Phase 3: The Breakthrough — llama.cpp Instead of Ollama
At this point, I replaced Ollama with llama.cpp — compiled directly, no abstraction layer. Less overhead, full control over context window, batch size, and threading.
The difference was immediately measurable: llama.cpp was roughly 4× faster than Ollama on identical hardware.
But it didn't stop there. A further optimization made the decisive difference:
Sequential instead of parallel. On a server with only 2 CPU cores, parallel processing is counterproductive. The overhead from context switching eats up any theoretical gain. In sequential mode — first the LLM, then the RAG retrieval — response time dropped by another 65 %.
Result: 7 seconds end-to-end. On an ARM server costing €3.29 per month.
Validation on x86: AVX2 and the Hidden Switch
To validate the results, I set up a second VPS — this time x86 (AMD EPYC, 4 cores, 8 GB RAM) at €5.49 per month.
First measurement: 14 seconds. Slower than expected for double the core count. The cause: The AVX2 compiler flag was disabled. GGML_AVX2 was set to OFF — the SIMD vectorization that gives x86 processors their speed advantage in matrix operations simply wasn't being used.
After recompiling with AVX2 enabled and adjusted threading (4 instead of 3 threads on a 4-core VPS in sequential mode): 11.5 seconds.
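For reference, the relevant switches in llama.cpp's CMake build look roughly like this (option names as of current llama.cpp; they have changed between versions, so treat this as a sketch):

```shell
# Recompile with x86 SIMD explicitly enabled.
# GGML_NATIVE additionally lets the compiler target the host CPU.
cmake -B build -DGGML_AVX2=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 4
```

Verifying the flags in the build output before benchmarking would have saved the detour.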
The x86 server is thus slower than the ARM server despite having double the cores. This comes down to architecture: ARM Ampere cores hold up surprisingly well in LLM inference, and with only 2 cores, scheduling overhead is minimal.
What I Learned
1. The framework isn't the solution. Ollama is great for local development. For deployment on minimal hardware, the overhead is too large. Compiling llama.cpp directly yielded a 4× speedup — without any hardware change.
2. Fewer cores, less overhead. On systems with ≤2 cores, sequential processing is faster than parallel. This defies intuition but is measurable: −65 % response time.
3. Check your compiler flags. AVX2 made a difference of +53 % tokens/s on x86. A single compiler flag that was disabled by default.
4. Context size matters for RAG. 512 tokens are too few when RAG context needs to be meaningfully embedded in the prompt. 1024 is the minimum — and still feasible on minimal hardware.
5. Hallucination is a feature, not a bug — at least in the demo. The side-by-side comparison of hallucinated LLM responses and cited RAG answers makes the value of Retrieval Augmented Generation more tangible than any slide deck.
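Lesson 4 in practice: before the prompt is built, the retrieved chunks have to fit the context window alongside the instructions, the question, and the answer budget. A rough sketch (the 4-characters-per-token estimate is a crude heuristic, not a real tokenizer):

```python
def fit_chunks(chunks, n_ctx=1024, reserve=300, chars_per_token=4):
    """Keep retrieved chunks (best-ranked first) until the token budget is spent.

    reserve: tokens held back for instruction text, the question,
    and the generated answer.
    """
    budget = (n_ctx - reserve) * chars_per_token
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept

# With n_ctx=512 barely one chunk fits; 1024 leaves usable room:
print(len(fit_chunks(["x" * 800] * 5, n_ctx=512)))   # -> 1
print(len(fit_chunks(["x" * 800] * 5, n_ctx=1024)))  # -> 3
```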
The Architecture
The demo runs strictly sequentially: the LLM streams its answer (the RAG column stays empty), after a brief pause the RAG retrieval starts, and the RAG column streams its cited answer.
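In Gradio terms, that sequence is a single generator driving both columns. A simplified sketch with canned token streams in place of the live llama.cpp calls:

```python
import time

def demo_stream(llm_tokens, rag_tokens, pause=0.0):
    """Yield (llm_text, rag_text) UI states: LLM column first, then RAG column."""
    llm_text, rag_text = "", ""
    for tok in llm_tokens:          # phase 1: LLM streams, RAG column stays empty
        llm_text += tok
        yield llm_text, rag_text
    time.sleep(pause)               # brief pause between the two phases
    for tok in rag_tokens:          # phase 2: cited RAG answer streams
        rag_text += tok
        yield llm_text, rag_text

states = list(demo_stream(
    ["€3", ".58"],
    ["€264.00 ", "Source: Waste Collection Fee Ordinance"],
))
print(states[-1])
```

Gradio re-renders both output components on every yielded tuple, so the two-column contrast builds up live in front of the visitor.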
The tech stack:
- Model: Qwen2.5-1.5B (Q4_K_M quantized, ~1 GB)
- Embeddings: Nomic Embed Text v1.5 (Q8, for RAG retrieval)
- Inference: llama.cpp (compiled directly)
- RAG: ChromaDB with local embeddings
- Frontend: Gradio (Python)
- Hosting: Hetzner Cloud VPS (ARM, 2 vCPU, 4 GB RAM)
Everything runs as systemd services, auto-starting. Total cost: €3.29 per month, no additional running costs.
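A hypothetical unit file for the inference server, to show the shape of the setup (paths, ports, and the unit name are illustrative, not the demo's actual configuration):

```ini
# /etc/systemd/system/llama-server.service  (illustrative)
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    --model /opt/models/qwen2.5-1.5b-q4_k_m.gguf \
    --ctx-size 1024 --threads 2 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

`systemctl enable --now llama-server` then makes it survive reboots.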
Who Is This Relevant For?
Not every organization needs GPT-4 level capability. Many use cases — FAQ bots, internal knowledge assistants, form helpers — work excellently with small, specialized models. And sometimes "all data stays in-house" isn't optional — it's mandatory.
Local AI doesn't mean "start a model and you're done." It means understanding every layer — from compiler flags to the kernel scheduler. But the result is a system that runs on a server costing €3.29 per month, needs no GPU, no cloud API, no ongoing token costs. And responds in 7 seconds.
Live demo: local-ai.aina.technology