Why Self-Host LLM + RAG Instead of Paying API Fees?
Every month, companies spend hundreds of dollars on OpenAI and Anthropic API calls, and every document they embed or query leaves their own infrastructure along the way. A single RAG pipeline processing 10,000 documents can cost $20–50/month just for embeddings, plus another $50–200/month for chat completions.
Self-hosting solves both problems: cost and privacy. With Ollama running a quantized 7B–13B model locally on a budget VPS, plus ChromaDB for vector storage, you get a fully private AI knowledge base with zero marginal cost per query. Your documents never leave your server, and your API bill drops to $0.
This guide walks through deploying a production-ready self-hosted RAG stack on a budget VPS — comparing RackNerd, Hostinger, and Vultr for LLM workloads.
Disclosure: This article contains affiliate links. If you purchase through these links, we may earn a commission at no extra cost to you. We only recommend services we’ve tested and believe offer genuine value.
The Self-Hosted RAG Stack Explained
A RAG (Retrieval-Augmented Generation) system has five layers:
┌─────────────────────────────────────────────────────┐
│  VPS ($5-10/month, 4-8GB RAM)                       │
│                                                     │
│   ┌─────────────────────────────────────────────┐   │
│   │  Cloudflare Tunnel → HTTPS / No Open Ports  │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  FastAPI / Gradio (Chat UI)                 │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  Ollama (LLM Inference: Llama 3.1 / Qwen)   │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  ChromaDB / Qdrant (Vector Store)           │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  Document Pipeline (PDF/Markdown/HTML)      │   │
│   └─────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
Layer breakdown:
| Layer | Component | RAM Usage | Purpose |
|---|---|---|---|
| LLM | Ollama + Llama 3.1 8B (Q4 quantized) | ~5GB | Generates answers from retrieved context |
| Vector DB | ChromaDB (persistent mode) | ~512MB | Stores and retrieves document embeddings |
| Embedding | nomic-embed-text (via Ollama) | ~512MB | Converts documents to searchable vectors |
| API/UI | FastAPI + simple chat frontend | ~256MB | Exposes the RAG pipeline via HTTP |
| Pipeline | Python doc loader + chunker | ~256MB | Ingests PDFs, Markdown, HTML into the KB |
Total: roughly 6.5GB with the Q4 model loaded. A 4GB VPS is the absolute minimum and only works with a smaller quantization plus swap. 8GB RAM is recommended for comfortable operation with room for concurrent users.
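The RAM figures in the table can be added up to check whether a given plan fits. The following is a minimal sketch using the approximate values from the table; the `fits` helper and the 0.5GB of operating-system headroom are illustrative assumptions, not measurements.

```python
# Approximate per-component RAM usage in GB, taken from the layer table.
RAM_GB = {
    "llm": 5.0,
    "vector_db": 0.5,
    "embedding": 0.5,
    "api_ui": 0.25,
    "pipeline": 0.25,
}


def total_ram_gb(components: dict) -> float:
    """Sum the RAM estimates for all components."""
    return sum(components.values())


def fits(plan_gb: float, components: dict, headroom_gb: float = 0.5) -> bool:
    """Return True if the plan covers the stack plus OS headroom (assumed value)."""
    return total_ram_gb(components) + headroom_gb <= plan_gb


if __name__ == "__main__":
    print(f"Total: {total_ram_gb(RAM_GB)}GB")
    for plan in (4, 8):
        print(f"{plan}GB plan fits without swap: {fits(plan, RAM_GB)}")
```

Swapping the `llm` entry for a smaller quantization shows how much memory a lower-precision model frees up on a constrained plan.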
VPS Provider Comparison for LLM Workloads
RackNerd — Best Entry Point for Learning
RackNerd’s $5.99/month plans offer surprising value for getting started with self-hosted LLMs.
- CPU: 1 vCPU (adequate for 7B model inference at ~5-10 tokens/sec)
- RAM: 2–4GB (use swap for 7B models; 8GB plan ideal)
- Storage: 40GB SSD (enough for model weights + document store)
- Network: Solid US West connectivity
Why RackNerd? Lowest barrier to entry. You can test the entire RAG stack on a $5.99/month instance and upgrade when you’re ready for production traffic.
Hostinger — Best Balance of RAM and Performance
Hostinger’s KVM VPS plans are ideal for LLM workloads because they guarantee RAM — no overselling surprises.
- CPU: AMD EPYC dedicated cores
- RAM: 8GB plan at $19.99/month (or $9.99 with referral code JZ1ZL8465QCG for 4GB)
- Storage: 200GB NVMe SSD (fast document ingestion)
- Network: Premium routing, 5 data center regions
Why Hostinger for LLM? Guaranteed RAM is critical for LLM inference. Shared RAM on budget hosts causes OOM kills during peak usage. NVMe storage accelerates vector database operations by 3-5x.
Vultr — Best for Production with GPU Option
Vultr offers both standard CPU instances and GPU instances, which makes it the only provider in this comparison with a GPU option.
- CPU: AMD High Frequency or Intel
- RAM: 4GB–64GB scalable
- GPU: Optional add-on ($3.75/hr for A100, or $60/month for dedicated GPU instances)
- Storage: NVMe SSD
- Network: 30+ global data centers
Why Vultr for LLM? If you need to run larger models (30B+) or want GPU acceleration for faster inference, Vultr’s GPU instances are the lowest-cost entry point of the three providers. For 7B–13B CPU-only inference, their standard instances are competitive.
Complete Docker Compose Deployment
This is the full stack — one docker-compose.yml file, everything running:
version: "3.8"

services:
  # Ollama: self-hosted LLM runtime
  ollama:
    image: ollama/ollama:latest
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  # ChromaDB: persistent vector database
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      # Data path used by pre-1.0 Chroma images; verify it for the tag you pin
      - chroma_data:/chroma/chroma
    environment:
      CHROMA_AUTH_TOKEN: ${CHROMA_AUTH_TOKEN}
      # Must match the variable name defined in .env
      CHROMA_SERVER_AUTH_CREDENTIALS: ${CHROMA_SERVER_AUTH_CREDENTIALS}
    restart: unless-stopped

  # RAG API: FastAPI service connecting everything
  rag-api:
    build: ./rag-api
    ports:
      - "127.0.0.1:8080:8080"
    environment:
      OLLAMA_BASE_URL: http://ollama:11434
      CHROMADB_URL: http://chromadb:8000
      EMBEDDING_MODEL: nomic-embed-text
      LLM_MODEL: llama3.1:8b
      COLLECTION_NAME: company-kb
    depends_on:
      - ollama
      - chromadb
    restart: unless-stopped

  # Cloudflare Tunnel: secure access without open ports
  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel --no-autoupdate run --token ${CLOUDFLARE_TUNNEL_TOKEN}
    restart: unless-stopped
    depends_on:
      - rag-api

volumes:
  ollama_data:
  chroma_data:
Preparing the RAG API
Create rag-api/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies before copying the code so this layer stays cached
RUN pip install --no-cache-dir fastapi uvicorn pydantic ollama chromadb langchain-community langchain-text-splitters pypdf
COPY . .
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
Create rag-api/main.py:
import hashlib
import os
from urllib.parse import urlparse

import chromadb
import ollama
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Self-Hosted RAG API")

OLLAMA_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
CHROMADB_URL = os.getenv("CHROMADB_URL", "http://localhost:8000")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.1:8b")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "company-kb")

# Initialize clients. The ollama client takes the server URL as `host`.
ollama_client = ollama.Client(host=OLLAMA_URL)
chroma_url = urlparse(CHROMADB_URL)
chroma_client = chromadb.HttpClient(host=chroma_url.hostname, port=chroma_url.port or 8000)
collection = chroma_client.get_or_create_collection(COLLECTION_NAME)


class QueryRequest(BaseModel):
    question: str
    top_k: int = 3


class IngestRequest(BaseModel):
    text: str
    metadata: dict = {}


@app.post("/ingest")
def ingest_document(req: IngestRequest):
    """Ingest a document chunk into the vector store."""
    try:
        # Generate embedding
        embedding_response = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=req.text)
        embedding = embedding_response["embedding"]
        # Derive the ID from the content so re-ingesting the same chunk
        # overwrites it instead of creating a duplicate
        doc_id = "doc_" + hashlib.sha256(req.text.encode("utf-8")).hexdigest()[:16]
        # Store in ChromaDB. Some ChromaDB versions reject empty metadata
        # dicts, so omit metadata entirely when none was supplied.
        collection.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[req.text],
            metadatas=[req.metadata] if req.metadata else None,
        )
        return {"status": "ok", "document_id": doc_id}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/query")
def query_rag(req: QueryRequest):
    """Query the RAG system: retrieve relevant docs + generate answer."""
    try:
        # Get embedding for the question
        question_embedding = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=req.question)["embedding"]
        # Retrieve relevant documents
        results = collection.query(query_embeddings=[question_embedding], n_results=req.top_k)
        if not results["documents"][0]:
            return {"answer": "No relevant documents found in the knowledge base.", "sources": []}
        # Build context from retrieved documents
        context = "\n\n".join(results["documents"][0])
        # Generate answer using LLM
        prompt = (
            "You are a helpful AI assistant. Answer the user's question "
            "based only on the provided context.\n"
            f"Context:\n{context}\n"
            f"Question: {req.question}\n"
            "Answer:"
        )
        response = ollama_client.generate(
            model=LLM_MODEL,
            prompt=prompt,
            stream=False,
        )
        return {
            "answer": response["response"],
            "sources": results["documents"][0],
            "distances": results["distances"][0],
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
def health():
    return {"status": "healthy", "ollama_model": LLM_MODEL, "embedding_model": EMBEDDING_MODEL}
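The retrieval step in `/query` boils down to nearest-neighbor search over embedding vectors. As a rough illustration of what the vector store does, here is a brute-force sketch in plain Python. It assumes cosine similarity and tiny hand-written vectors; ChromaDB uses an approximate index, and its configured distance metric may differ, so treat the function names and data here as hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list, docs: dict, k: int = 3) -> list:
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical 2-dimensional "embeddings" for three chunks.
    docs = {"pto-policy": [1.0, 0.0], "expense-policy": [0.0, 1.0], "handbook-intro": [1.0, 1.0]}
    print(top_k([1.0, 0.0], docs, k=2))
```

A real embedding model produces vectors with hundreds of dimensions, but the ranking logic is the same: score every candidate against the question and keep the best `top_k`.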
Pull Required Models
After starting the containers:
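# Pull the LLM and embedding model
# (docker compose exec targets the service name; Compose-generated container names carry a project prefix)
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text
# Verify models are downloaded
docker compose exec ollama ollama list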
Environment Variables
Create .env:
CHROMA_AUTH_TOKEN=your-chroma-auth-token
CHROMA_SERVER_AUTH_CREDENTIALS=admin:secret
CLOUDFLARE_TUNNEL_TOKEN=your-tunnel-token
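The placeholder values above should be replaced with random secrets before deployment. One way to generate them is Python's standard `secrets` module; the sketch below is a suggestion rather than a requirement of any component in the stack, and `make_token` is a hypothetical helper name.

```python
import secrets


def make_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token built from `num_bytes` random bytes."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed line into .env in place of the placeholder.
    print(f"CHROMA_AUTH_TOKEN={make_token()}")
```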
Document Ingestion Pipeline
Once the API is running, ingest your documents:
# Ingest a text snippet
curl -X POST http://localhost:8080/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Our company policy states that employees get 20 days of paid time off per year...",
    "metadata": {"source": "employee-handbook.pdf", "page": 12}
  }'

# Query the knowledge base
curl -X POST http://localhost:8080/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How many days of PTO do employees get?",
    "top_k": 3
  }'
For bulk ingestion, use this Python script:
from pathlib import Path

import requests
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

KB_URL = "http://localhost:8080/ingest"


def ingest_directory(directory: str):
    """Ingest all PDFs and text files from a directory."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " "],
    )
    for filepath in Path(directory).rglob("*"):
        # Compare the lowercased suffix so files like REPORT.PDF are included
        suffix = filepath.suffix.lower()
        if suffix not in (".pdf", ".txt", ".md"):
            continue
        print(f"Ingesting: {filepath}")
        if suffix == ".pdf":
            loader = PyPDFLoader(str(filepath))
        else:
            loader = TextLoader(str(filepath), encoding="utf-8")
        documents = loader.load()
        chunks = splitter.split_documents(documents)
        for i, chunk in enumerate(chunks):
            # One request per chunk; the generous timeout allows for slow CPU embedding
            response = requests.post(
                KB_URL,
                json={
                    "text": chunk.page_content,
                    "metadata": {"source": str(filepath), "chunk": i},
                },
                timeout=120,
            )
            if response.status_code != 200:
                print(f"  Failed chunk {i}: {response.text}")
    print("Ingestion complete.")


# Usage: ingest_directory("./documents/")
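To see what `chunk_size` and `chunk_overlap` control, it can help to look at a stripped-down chunker. The sketch below uses fixed-size character windows and is a deliberate simplification: it is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on the listed separators), and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(piece)
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.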
Cost Breakdown: Self-Hosted vs API-Based
| Cost Item | Self-Hosted (8GB VPS) | API-Based (OpenAI) |
|---|---|---|
| Infrastructure | $10–20/month (VPS) | $0 |
| Embeddings | Free (local model) | ~$0.02/1K tokens |
| Chat Completions | Free (local model) | ~$0.015/1K tokens |
| 10K docs, 1K queries/day | $10–20/month | $150–300/month |
| Data privacy | Full (everything local) | Third-party |
| Customization | Fine-tune any model | Limited to provider options |
The math is clear: at 1,000 queries per day, the API bill in the table exceeds the VPS cost by at least $130 per month, so self-hosting pays for itself within the first month. And the more you use it, the more you save.
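The comparison can be reproduced with a few lines of arithmetic. This sketch uses the table's chat-completion rate of about $0.015 per 1K tokens; the figure of 500 tokens per query is an assumption chosen for illustration, and real usage will vary with prompt and answer length.

```python
def monthly_api_cost(
    queries_per_day: int,
    tokens_per_query: int,
    price_per_1k_tokens: float = 0.015,
    days: int = 30,
) -> float:
    """Estimate the monthly chat-completion bill in dollars."""
    tokens = queries_per_day * tokens_per_query * days
    return round(tokens / 1000 * price_per_1k_tokens, 2)


def monthly_savings(api_cost: float, vps_cost: float) -> float:
    """Difference between the API bill and a flat VPS fee."""
    return round(api_cost - vps_cost, 2)


if __name__ == "__main__":
    api = monthly_api_cost(queries_per_day=1000, tokens_per_query=500)  # assumed 500 tokens/query
    print(f"API: ${api}/month, savings vs $20 VPS: ${monthly_savings(api, 20)}")
```

Under these assumptions the API estimate lands inside the table's $150–300/month range; halving the tokens per query halves the estimate, which is why low-volume prototypes are often cheaper on an API.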
Performance Tuning Tips
1. Swap Space for 7B Models
If your VPS has only 4GB RAM, add swap so the model can load. Generation slows down sharply whenever weights are paged to disk, so treat this as a stopgap:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
2. Quantized Models Save RAM
Ollama supports multiple quantization levels:
| Model | Precision | RAM Needed | Speed | Quality |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~4.7GB | Fast | 95% of FP16 |
| Llama 3.1 8B | Q3_K_S | ~3.5GB | Faster | 90% of FP16 |
| Llama 3.1 8B | Q2_K | ~2.6GB | Fastest | 80% of FP16 |
For production, Q4_K_M is the sweet spot. Drop to Q3 if you’re RAM-constrained.
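The selection rule can be written down as a small lookup. The sketch below encodes the RAM figures from the table and returns the highest-quality quantization that fits; `pick_quantization` is a hypothetical helper, and the available-RAM figure you pass in should already exclude what the other services need.

```python
from typing import Optional

# (quantization, approximate RAM needed in GB), ordered from best quality
# to smallest footprint, using the values from the table.
QUANT_OPTIONS = [
    ("Q4_K_M", 4.7),
    ("Q3_K_S", 3.5),
    ("Q2_K", 2.6),
]


def pick_quantization(available_gb: float) -> Optional[str]:
    """Return the highest-quality quantization that fits, or None if none does."""
    for name, ram_needed in QUANT_OPTIONS:
        if ram_needed <= available_gb:
            return name
    return None


if __name__ == "__main__":
    for free_gb in (6.0, 4.0, 3.0, 2.0):
        print(f"{free_gb}GB free -> {pick_quantization(free_gb)}")
```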
3. ChromaDB Persistence
Always use persistent ChromaDB mode (included in the Docker Compose above). Without it, your vector store resets on container restart.
4. Monitoring
# Watch memory usage of every container in the stack
docker stats $(docker compose ps -q)
# Check inference speed
time curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Hello","stream":false}'
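Wall-clock time from `time curl` includes model loading and prompt processing. For a generation-only figure, the non-streaming `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), which can be turned into tokens per second. A minimal sketch, with a made-up response dict standing in for real output:

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response."""
    eval_count = response["eval_count"]  # number of generated tokens
    eval_duration_ns = response["eval_duration"]  # generation time in nanoseconds
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    # Hypothetical values for illustration, not a real measurement.
    sample = {"eval_count": 50, "eval_duration": 10_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/sec")
```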
Security Hardening
Since your RAG system processes sensitive documents, these steps are non-negotiable:
# 1. Bind all services to localhost only (already in docker-compose)
# 2. Use Cloudflare Tunnel for encrypted access
# 3. Enable ChromaDB auth
# 4. Rate-limit the API
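# Add slowapi to the pip install line in the rag-api Dockerfile:
# pip install slowapi
# Add to main.py:
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Behind Cloudflare Tunnel every request arrives from the cloudflared container,
# so keying on the remote address applies one shared limit to all clients.
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Return HTTP 429 instead of a 500 when a limit is exceeded
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")
def query_rag(request: Request, req: QueryRequest):  # slowapi requires the `request` argument
    ...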
When Self-Hosting Makes Sense (and When It Doesn’t)
Self-host if:
- You process sensitive/confidential documents (legal, medical, financial)
- Your query volume exceeds 500/day (API costs add up fast)
- You need offline/intranet access
- You want full control over model selection and fine-tuning
Stick with API if:
- You’re a solo developer testing a prototype (< 100 queries/day)
- You need 70B+ model quality (requires GPU infrastructure)
- You don’t want to manage infrastructure
Summary
A self-hosted LLM + RAG system on a budget VPS delivers enterprise-grade AI capability for $10–20/month. With Ollama handling inference, ChromaDB storing your knowledge base, and Cloudflare Tunnel securing access, you get a fully private AI assistant that pays for itself within weeks.
Recommended path: Start with RackNerd’s $5.99/month plan for testing. Migrate to Hostinger’s 8GB VPS ($10–20/month) for production. Scale to Vultr GPU instances only if you need 30B+ model sizes.
