Self-Host LLM + RAG on a Budget VPS: Build a Private AI Knowledge Base for Under $10/Month

Wed, 01 Jul 2026 00:00:00 +0000

Why Self-Host LLM + RAG Instead of Paying API Fees?

Every month, companies burn hundreds of dollars on OpenAI and Anthropic API calls. A single RAG pipeline processing 10,000 documents can cost $20–50/month just for embeddings, plus another $50–200/month for chat completions. On top of the bill, every document and query is sent to a third party.

Self-hosting solves both problems. With Ollama running a 7B–13B model locally on a $5–10 VPS, plus ChromaDB for vector storage, you get a fully private AI knowledge base with zero marginal cost per query. Your documents never leave your server. Your API bills drop to $0.

This guide walks through deploying a production-ready self-hosted RAG stack on a budget VPS — comparing RackNerd, Hostinger, and Vultr for LLM workloads.

Disclosure: This article contains affiliate links. If you purchase through these links, we may earn a commission at no extra cost to you. We only recommend services we’ve tested and believe offer genuine value.

The Self-Hosted RAG Stack Explained

A RAG (Retrieval-Augmented Generation) system, as deployed here, has five layers:

┌─────────────────────────────────────────────────────┐
│ VPS ($5-10/month, 4-8GB RAM) │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Cloudflare Tunnel → HTTPS / No Open Ports │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────────────▼──────────────────────┐ │
│ │ FastAPI / Gradio (Chat UI) │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────────────▼──────────────────────┐ │
│ │ Ollama (LLM Inference: Llama 3.1 / Qwen) │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────────────▼──────────────────────┐ │
│ │ ChromaDB / Qdrant (Vector Store) │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────────────▼──────────────────────┐ │
│ │ Document Pipeline (PDF/Markdown/HTML) │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘

Layer breakdown:

Layer	Component	RAM Usage	Purpose
LLM	Ollama + Llama 3.1 8B (Q4 quantized)	~5GB	Generates answers from retrieved context
Vector DB	ChromaDB (persistent mode)	~512MB	Stores and retrieves document embeddings
Embedding	nomic-embed-text (via Ollama)	~512MB	Converts documents to searchable vectors
API/UI	FastAPI + simple chat frontend	~256MB	Exposes the RAG pipeline via HTTP
Pipeline	Python doc loader + chunker	~256MB	Ingests PDFs, Markdown, HTML into the KB

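Total: The components above add up to roughly 6.5GB, so 8GB RAM is recommended for comfortable operation with room for concurrent users. A 4GB VPS is the absolute minimum, and it only works with swap or a smaller quantization (see Performance Tuning Tips).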
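To sanity-check a plan before buying it, the table can be added up in a few lines. This is a minimal sketch: the per-component figures are the approximate values from the table, while the 0.5GB operating-system overhead and the helper names are assumptions for illustration.

```python
# Approximate per-component RAM figures copied from the layer table, in MB.
RAM_MB = {
    "llm_llama3.1_8b_q4": 5 * 1024,
    "vector_db_chromadb": 512,
    "embedding_nomic_embed_text": 512,
    "api_ui": 256,
    "pipeline": 256,
}


def total_ram_gb(components: dict) -> float:
    """Sum the component estimates and convert MB to GB."""
    return sum(components.values()) / 1024


def fits(components: dict, vps_ram_gb: float, os_overhead_gb: float = 0.5) -> bool:
    """True if the stack plus an assumed OS overhead fits in physical RAM."""
    return total_ram_gb(components) + os_overhead_gb <= vps_ram_gb


print(f"{total_ram_gb(RAM_MB):.1f} GB")  # 6.5 GB
print(fits(RAM_MB, 8))  # True
print(fits(RAM_MB, 4))  # False: needs swap or a smaller quantization
```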

VPS Provider Comparison for LLM Workloads

RackNerd — Best Entry Point for Learning

RackNerd’s $5.99/month plans offer surprising value for getting started with self-hosted LLMs.

CPU: 1 vCPU (adequate for 7B model inference at ~5-10 tokens/sec)
RAM: 2–4GB (use swap for 7B models; 8GB plan ideal)
Storage: 40GB SSD (enough for model weights + document store)
Network: Solid US West connectivity

Why RackNerd? Lowest barrier to entry. You can test the entire RAG stack on a $5.99/month instance, then upgrade when you’re ready for production traffic.

RackNerd VPS Plans →

Hostinger — Best Balance of RAM and Performance

Hostinger’s KVM VPS plans are ideal for LLM workloads because they guarantee RAM — no overselling surprises.

CPU: AMD EPYC dedicated cores
RAM: 8GB plan at $19.99/month (or the 4GB plan at $9.99/month with referral code JZ1ZL8465QCG)
Storage: 200GB NVMe SSD (fast document ingestion)
Network: Premium routing, 5 data center regions

Why Hostinger for LLM? Guaranteed RAM is critical for LLM inference. Shared RAM on budget hosts causes OOM kills during peak usage. NVMe storage accelerates vector database operations by 3-5x.

Hostinger VPS with Discount →

Vultr — Best for Production with GPU Option

Vultr offers both standard CPU instances and affordable GPU instances — the only budget provider with this option.

CPU: AMD High Frequency or Intel
RAM: 4GB–64GB scalable
GPU: Optional add-on ($3.75/hr for A100, or $60/month for dedicated GPU instances)
Storage: NVMe SSD
Network: 30+ global data centers

Why Vultr for LLM? If you need to run larger models (30B+) or want GPU acceleration for faster inference, Vultr’s GPU instances are the cheapest entry point available. For 7B–13B CPU-only inference, their standard instances are competitive.

Vultr VPS with $100 Credit →

Complete Docker Compose Deployment

This is the full stack — one docker-compose.yml file, everything running:

version: "3.8"

services:
  # Ollama: self-hosted LLM runtime
  ollama:
    image: ollama/ollama:latest
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  # ChromaDB: persistent vector database
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      # Values come from .env. Auth variable names differ between Chroma
      # releases, so check the docs for the image tag you pin.
      CHROMA_AUTH_TOKEN: ${CHROMA_AUTH_TOKEN}
      CHROMA_SERVER_AUTH_CREDENTIALS: ${CHROMA_SERVER_AUTH_CREDENTIALS}
    restart: unless-stopped

  # RAG API: FastAPI service connecting everything
  rag-api:
    build: ./rag-api
    ports:
      - "127.0.0.1:8080:8080"
    environment:
      OLLAMA_BASE_URL: http://ollama:11434
      CHROMADB_URL: http://chromadb:8000
      EMBEDDING_MODEL: nomic-embed-text
      LLM_MODEL: llama3.1:8b
      COLLECTION_NAME: company-kb
    depends_on:
      - ollama
      - chromadb
    restart: unless-stopped

  # Cloudflare Tunnel: secure access without open ports
  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel --no-autoupdate run --token ${CLOUDFLARE_TUNNEL_TOKEN}
    restart: unless-stopped
    depends_on:
      - rag-api

volumes:
  ollama_data:
  chroma_data:

Preparing the RAG API

Create rag-api/Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install the API dependencies in one layer. The langchain packages and pypdf
# are only needed if you run the bulk ingestion script inside this image.
RUN pip install --no-cache-dir fastapi uvicorn pydantic ollama chromadb langchain-community langchain-text-splitters pypdf

COPY . .

EXPOSE 8080

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

Create rag-api/main.py:

import hashlib
import os
from urllib.parse import urlparse

import chromadb
import ollama
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Self-Hosted RAG API")

OLLAMA_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
CHROMADB_URL = os.getenv("CHROMADB_URL", "http://localhost:8000")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.1:8b")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "company-kb")

# Initialize clients. The ollama client takes the server URL as `host`.
ollama_client = ollama.Client(host=OLLAMA_URL)
chroma_url = urlparse(CHROMADB_URL)
chroma_client = chromadb.HttpClient(host=chroma_url.hostname, port=chroma_url.port or 8000)
collection = chroma_client.get_or_create_collection(COLLECTION_NAME)


class QueryRequest(BaseModel):
    question: str
    top_k: int = 3


class IngestRequest(BaseModel):
    text: str
    metadata: dict = {}


@app.post("/ingest")
def ingest_document(req: IngestRequest):
    """Ingest a document chunk into the vector store."""
    try:
        # Generate embedding
        embedding_response = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=req.text)
        embedding = embedding_response["embedding"]

        # Derive the ID from the text so re-ingesting the same chunk overwrites it
        # instead of duplicating it, and so we never load every stored ID to count them
        doc_id = "doc_" + hashlib.sha256(req.text.encode("utf-8")).hexdigest()[:16]

        # Store in ChromaDB. Some ChromaDB versions reject an empty metadata dict,
        # so pass None when the caller sent no metadata.
        collection.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[req.text],
            metadatas=[req.metadata] if req.metadata else None,
        )
        return {"status": "ok", "document_id": doc_id}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/query")
def query_rag(req: QueryRequest):
    """Query the RAG system: retrieve relevant docs + generate answer."""
    try:
        # Get embedding for the question
        question_embedding = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=req.question)["embedding"]

        # Retrieve relevant documents
        results = collection.query(query_embeddings=[question_embedding], n_results=req.top_k)

        if not results["documents"][0]:
            return {"answer": "No relevant documents found in the knowledge base.", "sources": []}

        # Build context from retrieved documents
        context = "\n\n".join(results["documents"][0])

        # Generate answer using LLM
        prompt = f"""You are a helpful AI assistant. Answer the user's question based only on the provided context.

Context:
{context}

Question: {req.question}

Answer:"""

        response = ollama_client.generate(
            model=LLM_MODEL,
            prompt=prompt,
            stream=False,
        )

        return {
            "answer": response["response"],
            "sources": results["documents"][0],
            "distances": results["distances"][0],
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
def health():
    return {"status": "healthy", "ollama_model": LLM_MODEL, "embedding_model": EMBEDDING_MODEL}
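One caveat with the /query endpoint: a vector search returns the nearest top_k chunks no matter how far away they are, so the "no relevant documents" branch only fires when the collection is empty. A possible refinement is to drop weak matches before building the context. The sketch below is an illustration under assumptions: filter_by_distance is a hypothetical helper, the sample distances are made up, and the 0.8 cutoff must be tuned for your embedding model and the collection's distance metric.

```python
def filter_by_distance(documents, distances, max_distance):
    """Keep only (document, distance) pairs at or below max_distance."""
    return [
        (doc, dist)
        for doc, dist in zip(documents, distances)
        if dist <= max_distance
    ]


# Made-up example values standing in for results["documents"][0]
# and results["distances"][0] from the query endpoint.
docs = ["PTO policy: 20 days per year", "Office parking rules"]
dists = [0.31, 1.42]

kept = filter_by_distance(docs, dists, max_distance=0.8)
print(kept)  # [('PTO policy: 20 days per year', 0.31)]
```

If nothing survives the filter, the endpoint could return the "no relevant documents" message instead of asking the model to answer from unrelated text.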

Pull Required Models

After starting the containers:

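# Pull the LLM and embedding model (run from the directory containing docker-compose.yml;
# Compose names the container <project>-ollama-1, so address it by service name)
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text

# Verify models are downloaded
docker compose exec ollama ollama list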
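The same check can be scripted against Ollama's HTTP API, which lists downloaded models at /api/tags. This is a minimal sketch, assuming Ollama is reachable on localhost:11434 as in the Compose file; missing_models, normalize, and fetch_tags are hypothetical helper names.

```python
import json
import urllib.request

REQUIRED = ["llama3.1:8b", "nomic-embed-text"]


def normalize(name: str) -> str:
    """Ollama reports untagged models as name:latest, so compare in that form."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags: dict, required: list) -> list:
    """Return the required models that do not appear in an /api/tags response."""
    installed = {normalize(m["name"]) for m in tags.get("models", [])}
    return [name for name in required if normalize(name) not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


# Usage (requires the stack to be running):
#   print(missing_models(fetch_tags(), REQUIRED))
```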

Environment Variables

Create .env:

CHROMA_AUTH_TOKEN=your-chroma-auth-token
CHROMA_SERVER_AUTH_CREDENTIALS=admin:secret
CLOUDFLARE_TUNNEL_TOKEN=your-tunnel-token

Document Ingestion Pipeline

Once the API is running, ingest your documents:

# Ingest a text snippet
curl -X POST http://localhost:8080/ingest \
 -H "Content-Type: application/json" \
 -d '{
 "text": "Our company policy states that employees get 20 days of paid time off per year...",
 "metadata": {"source": "employee-handbook.pdf", "page": 12}
 }'

# Query the knowledge base
curl -X POST http://localhost:8080/query \
 -H "Content-Type: application/json" \
 -d '{
 "question": "How many days of PTO do employees get?",
 "top_k": 3
 }'

For bulk ingestion, use this Python script:

from pathlib import Path

import requests
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

KB_URL = "http://localhost:8080/ingest"


def ingest_directory(directory: str):
    """Ingest all PDFs and text files from a directory."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " "],
    )

    for filepath in Path(directory).rglob("*"):
        # Compare suffixes case-insensitively so "REPORT.PDF" is picked up too
        suffix = filepath.suffix.lower()
        if suffix in (".pdf", ".txt", ".md"):
            print(f"Ingesting: {filepath}")

            if suffix == ".pdf":
                loader = PyPDFLoader(str(filepath))
            else:
                loader = TextLoader(str(filepath), encoding="utf-8")

            documents = loader.load()
            chunks = splitter.split_documents(documents)

            for i, chunk in enumerate(chunks):
                # Embedding on a CPU-only VPS can be slow, so allow a generous timeout
                response = requests.post(
                    KB_URL,
                    json={
                        "text": chunk.page_content,
                        "metadata": {
                            "source": str(filepath),
                            "chunk": i,
                        },
                    },
                    timeout=120,
                )
                if response.status_code != 200:
                    print(f"  Failed chunk {i}: {response.text}")

    print("Ingestion complete.")

# Usage: ingest_directory("./documents/")
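To see what chunk_size=1000 and chunk_overlap=200 mean in practice, here is a simplified fixed-window chunker. It is a sketch of the overlap idea only, not the algorithm RecursiveCharacterTextSplitter uses (that one also tries to break on the listed separators); chunk_text is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.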

Cost Breakdown: Self-Hosted vs API-Based

Cost Item	Self-Hosted (8GB VPS)	API-Based (OpenAI)
Infrastructure	$10–20/month (VPS)	$0
Embeddings	Free (local model)	~$0.02/1M tokens (text-embedding-3-small)
Chat Completions	Free (local model)	~$0.015/1K tokens
10K docs, 1K queries/day	$10–20/month	$150–300/month
Data privacy	Full (everything local)	Third-party
Customization	Fine-tune any model	Limited to provider options

The math is clear: At just 1,000 queries per day, self-hosting pays for itself within the first month compared to API pricing. And the more you use it, the more you save.
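The API side of that comparison can be reproduced with simple arithmetic. This is a rough sketch: the $0.015 per 1K tokens price is the chat completion figure from the table, while the 500 tokens per query, the 30-day month, and the name monthly_api_cost are assumptions for illustration. Substitute your own token counts and current provider pricing.

```python
def monthly_api_cost(
    queries_per_day: int,
    tokens_per_query: int,
    price_per_1k_tokens: float,
    days: int = 30,
) -> float:
    """Estimated monthly chat completion spend in dollars."""
    total_tokens = queries_per_day * days * tokens_per_query
    return total_tokens / 1000 * price_per_1k_tokens


# 1,000 queries/day at an assumed 500 tokens per query
cost = monthly_api_cost(1000, 500, 0.015)
print(f"${cost:.2f}/month")  # $225.00/month
```

Under these assumptions the result lands inside the table's $150–300/month range, against a flat $10–20/month for the VPS.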

Performance Tuning Tips

1. Swap Space for 7B Models

If your VPS has only 4GB RAM, add swap so a 7B–8B model can load. Expect noticeably slower inference whenever the model spills into swap:

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

2. Quantized Models Save RAM

Ollama supports multiple quantization levels:

Model	Precision	RAM Needed	Speed	Quality
Llama 3.1 8B	Q4_K_M	~4.7GB	Fast	95% of FP16
Llama 3.1 8B	Q3_K_S	~3.5GB	Faster	90% of FP16
Llama 3.1 8B	Q2_K	~2.6GB	Fastest	80% of FP16

For production, Q4_K_M is the sweet spot. Drop to Q3 if you’re RAM-constrained.

3. ChromaDB Persistence

Always use persistent ChromaDB mode (included in the Docker Compose above). Without it, your vector store resets on container restart.

4. Monitoring

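# Watch memory usage for every container in the stack
# (Compose prefixes container names with the project name, so look them up by ID)
docker stats $(docker compose ps -q)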

# Check inference speed
time curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Hello","stream":false}'
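Wall-clock time from curl includes model loading and prompt processing. For a tokens-per-second figure, the non-streaming /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds). A minimal sketch, assuming Ollama on localhost:11434; generate and tokens_per_second are hypothetical helper names.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "llama3.1:8b",
             base_url: str = "http://localhost:11434") -> dict:
    """Send one non-streaming generate request to a running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)


# Usage (requires the stack to be running):
#   print(f"{tokens_per_second(generate('Hello')):.1f} tokens/sec")
```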

Security Hardening

Since your RAG system processes sensitive documents, these steps are non-negotiable:

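# 1. Bind all services to localhost only (already in docker-compose)
# 2. Use Cloudflare Tunnel for encrypted access
# 3. Enable ChromaDB auth
# 4. Rate-limit the API

# Add to rag-api Dockerfile:
# pip install slowapi

# Add to main.py:
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Return HTTP 429 (instead of an unhandled error) when a client exceeds the limit
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")
def query_rag(request: Request, req: QueryRequest):  # slowapi requires the Request argument
    ...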

When Self-Hosting Makes Sense (and When It Doesn’t)

Self-host if:

You process sensitive/confidential documents (legal, medical, financial)
Your query volume exceeds 500/day (API costs add up fast)
You need offline/intranet access
You want full control over model selection and fine-tuning

Stick with API if:

You’re a solo developer testing a prototype (< 100 queries/day)
You need 70B+ model quality (requires GPU infrastructure)
You don’t want to manage infrastructure

Summary

A self-hosted LLM + RAG system on a budget VPS delivers enterprise-grade AI capability for $10–20/month. With Ollama handling inference, ChromaDB storing your knowledge base, and Cloudflare Tunnel securing access, you get a fully private AI assistant that pays for itself within weeks.

Recommended path: Start with RackNerd’s $5.99/month plan for testing. Migrate to Hostinger’s 8GB VPS ($10–20/month) for production. Scale to Vultr GPU instances only if you need 30B+ model sizes.

RackNerd VPS Plans → · Hostinger VPS → · Vultr VPS →
