Self-Host LLM + RAG on a Budget VPS: Build a Private AI Knowledge Base for Under $10/Month

Complete guide to self-hosting Ollama + ChromaDB + a RAG pipeline on a $5-10/month VPS. Turn any document collection into a private AI assistant — zero API costs, full data privacy.

Why Self-Host LLM + RAG Instead of Paying API Fees?

Every month, companies burn hundreds of dollars on OpenAI and Anthropic API calls. A single RAG pipeline processing 10,000 documents can cost $20–50/month just for embeddings, plus another $50–200/month for chat completions.

Self-hosting removes the per-call bill and keeps your data on your own machine. With Ollama running a 7B–13B model locally on a $5–10 VPS, plus ChromaDB for vector storage, you get a fully private AI knowledge base with zero marginal cost per query. Your documents never leave your server. Your API bills drop to $0.

This guide walks through deploying a production-ready self-hosted RAG stack on a budget VPS — comparing RackNerd, Hostinger, and Vultr for LLM workloads.

Disclosure: This article contains affiliate links. If you purchase through these links, we may earn a commission at no extra cost to you. We only recommend services we’ve tested and believe offer genuine value.


The Self-Hosted RAG Stack Explained

A RAG (Retrieval-Augmented Generation) system has five layers:

┌─────────────────────────────────────────────────────┐
│              VPS ($5-10/month, 4-8GB RAM)            │
│                                                      │
│  ┌─────────────────────────────────────────────┐    │
│  │   Cloudflare Tunnel → HTTPS / No Open Ports │    │
│  └──────────────────────┬──────────────────────┘    │
│                         │                           │
│  ┌──────────────────────▼──────────────────────┐    │
│  │       FastAPI / Gradio (Chat UI)            │    │
│  └──────────────────────┬──────────────────────┘    │
│                         │                           │
│  ┌──────────────────────▼──────────────────────┐    │
│  │   Ollama (LLM Inference: Llama 3.1 / Qwen)  │    │
│  └──────────────────────┬──────────────────────┘    │
│                         │                           │
│  ┌──────────────────────▼──────────────────────┐    │
│  │   ChromaDB / Qdrant (Vector Store)          │    │
│  └──────────────────────┬──────────────────────┘    │
│                         │                           │
│  ┌──────────────────────▼──────────────────────┐    │
│  │   Document Pipeline (PDF/Markdown/HTML)     │    │
│  └─────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────┘

Layer breakdown:

| Layer | Component | RAM Usage | Purpose |
|---|---|---|---|
| LLM | Ollama + Llama 3.1 8B (Q4 quantized) | ~5GB | Generates answers from retrieved context |
| Vector DB | ChromaDB (persistent mode) | ~512MB | Stores and retrieves document embeddings |
| Embedding | nomic-embed-text (via Ollama) | ~512MB | Converts documents to searchable vectors |
| API/UI | FastAPI + simple chat frontend | ~256MB | Exposes the RAG pipeline via HTTP |
| Pipeline | Python doc loader + chunker | ~256MB | Ingests PDFs, Markdown, HTML into the KB |

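Total: roughly 6.5GB with every component loaded. A 4GB VPS is the absolute minimum and only works with swap or a smaller quantization (both covered under Performance Tuning Tips). 8GB RAM is recommended for comfortable operation with room for concurrent users.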
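
The RAM budget can be checked with a few lines of arithmetic. The sketch below sums the estimates from the layer table and reports the headroom on a given VPS size. The dictionary keys are made-up labels, and treating "~5GB" as exactly 5 × 1024 MB is an assumption; real usage will vary with context length and load.

```python
# RAM estimates copied from the layer table (in MB).
# Assumption: "~5GB" is taken as exactly 5 * 1024 MB.
COMPONENT_RAM_MB = {
    "ollama_llama3.1_8b_q4": 5 * 1024,
    "chromadb": 512,
    "nomic_embed_text": 512,
    "fastapi_ui": 256,
    "doc_pipeline": 256,
}


def total_ram_gb(components=COMPONENT_RAM_MB):
    """Sum all component estimates and convert MB to GB."""
    return sum(components.values()) / 1024


def headroom_gb(vps_ram_gb, components=COMPONENT_RAM_MB):
    """Positive means spare RAM; negative means the stack needs swap."""
    return vps_ram_gb - total_ram_gb(components)


for size in (4, 8):
    print(f"{size}GB VPS: headroom {headroom_gb(size):+.1f}GB")
```

A negative headroom on the 4GB plan is the reason swap or a smaller quantization becomes necessary there.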


VPS Provider Comparison for LLM Workloads

RackNerd — Best Entry Point for Learning

RackNerd’s $5.99/month plans offer surprising value for getting started with self-hosted LLMs.

  • CPU: 1 vCPU (adequate for 7B model inference at ~5-10 tokens/sec)
  • RAM: 2–4GB (use swap for 7B models; 8GB plan ideal)
  • Storage: 40GB SSD (enough for model weights + document store)
  • Network: Solid US West connectivity

Why RackNerd? Lowest barrier to entry. You can test the entire RAG stack on a $5.99/month instance. Upgrade when you’re ready for production traffic.

RackNerd VPS Plans →

Hostinger — Best Balance of RAM and Performance

Hostinger’s KVM VPS plans are ideal for LLM workloads because they guarantee RAM — no overselling surprises.

  • CPU: AMD EPYC dedicated cores
  • RAM: 8GB plan at $19.99/month, or 4GB plan at $9.99/month with referral code JZ1ZL8465QCG
  • Storage: 200GB NVMe SSD (fast document ingestion)
  • Network: Premium routing, 5 data center regions

Why Hostinger for LLM? Guaranteed RAM is critical for LLM inference. Shared RAM on budget hosts causes OOM kills during peak usage. NVMe storage accelerates vector database operations by 3-5x.

Hostinger VPS with Discount →

Vultr — Best for Production with GPU Option

Vultr offers both standard CPU instances and affordable GPU instances, which makes it the only one of the three providers compared here with a GPU option.

  • CPU: AMD High Frequency or Intel
  • RAM: 4GB–64GB scalable
  • GPU: Optional add-on ($3.75/hr for A100, or $60/month for dedicated GPU instances)
  • Storage: NVMe SSD
  • Network: 30+ global data centers

Why Vultr for LLM? If you need to run larger models (30B+) or want GPU acceleration for faster inference, Vultr’s GPU instances are the cheapest entry point available. For 7B–13B CPU-only inference, their standard instances are competitive.

Vultr VPS with $100 Credit →


Complete Docker Compose Deployment

This is the full stack — one docker-compose.yml file, everything running:

version: "3.8"

services:
  # Ollama — Self-hosted LLM runtime
  ollama:
    image: ollama/ollama:latest
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  # ChromaDB — Persistent vector database
  chromadb:
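    image: chromadb/chroma:latest
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      # The data path depends on the image version (/chroma/chroma on 0.x images,
      # /data on 1.x images); verify it against the image you actually pull
      - chroma_data:/chroma/chroma
    environment:
      CHROMA_AUTH_TOKEN: ${CHROMA_AUTH_TOKEN}
      # Variable name matches the entry in .env
      CHROMA_SERVER_AUTH_CREDENTIALS: ${CHROMA_SERVER_AUTH_CREDENTIALS}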
    restart: unless-stopped

  # RAG API — FastAPI service connecting everything
  rag-api:
    build: ./rag-api
    ports:
      - "127.0.0.1:8080:8080"
    environment:
      OLLAMA_BASE_URL: http://ollama:11434
      CHROMADB_URL: http://chromadb:8000
      EMBEDDING_MODEL: nomic-embed-text
      LLM_MODEL: llama3.1:8b
      COLLECTION_NAME: company-kb
    depends_on:
      - ollama
      - chromadb
    restart: unless-stopped

  # Cloudflare Tunnel — Secure access without open ports
  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel --no-autoupdate run --token ${CLOUDFLARE_TUNNEL_TOKEN}
    restart: unless-stopped
    depends_on:
      - rag-api

volumes:
  ollama_data:
  chroma_data:

Preparing the RAG API

Create rag-api/Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies directly; no separate requirements.txt is needed
RUN pip install --no-cache-dir fastapi uvicorn pydantic ollama chromadb langchain-community langchain-text-splitters pypdf

COPY . .

EXPOSE 8080

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

Create rag-api/main.py:

import os
from urllib.parse import urlparse

import chromadb
import ollama
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Self-Hosted RAG API")

OLLAMA_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
CHROMADB_URL = os.getenv("CHROMADB_URL", "http://localhost:8000")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.1:8b")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "company-kb")

# Initialize clients
# The ollama client takes the server URL through its `host` argument
ollama_client = ollama.Client(host=OLLAMA_URL)
# Parse host and port from the URL instead of splitting on ":" by hand
_chroma = urlparse(CHROMADB_URL)
chroma_client = chromadb.HttpClient(host=_chroma.hostname, port=_chroma.port or 8000)
collection = chroma_client.get_or_create_collection(COLLECTION_NAME)


class QueryRequest(BaseModel):
    question: str
    top_k: int = 3


class IngestRequest(BaseModel):
    text: str
    metadata: dict = {}


@app.post("/ingest")
def ingest_document(req: IngestRequest):
    """Ingest a document chunk into the vector store."""
    try:
        # Generate embedding
        embedding_response = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=req.text)
        embedding = embedding_response["embedding"]

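        # Store in ChromaDB. count() avoids fetching every stored ID just to number
        # the new one; count-based IDs can still collide under concurrent ingests.
        doc_id = f"doc_{collection.count() + 1}"
        collection.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[req.text],
            # Some ChromaDB versions reject empty metadata dicts, so pass None instead
            metadatas=[req.metadata] if req.metadata else None,
        )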
        return {"status": "ok", "document_id": doc_id}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/query")
def query_rag(req: QueryRequest):
    """Query the RAG system: retrieve relevant docs + generate answer."""
    try:
        # Get embedding for the question
        question_embedding = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=req.question)["embedding"]

        # Retrieve relevant documents
        results = collection.query(query_embeddings=[question_embedding], n_results=req.top_k)

        if not results["documents"][0]:
            return {"answer": "No relevant documents found in the knowledge base.", "sources": []}

        # Build context from retrieved documents
        context = "\n\n".join(results["documents"][0])

        # Generate answer using LLM
        prompt = f"""You are a helpful AI assistant. Answer the user's question based only on the provided context.

Context:
{context}

Question: {req.question}

Answer:"""

        response = ollama_client.generate(
            model=LLM_MODEL,
            prompt=prompt,
            stream=False
        )

        return {
            "answer": response["response"],
            "sources": results["documents"][0],
            "distances": results["distances"][0]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
def health():
    return {"status": "healthy", "ollama_model": LLM_MODEL, "embedding_model": EMBEDDING_MODEL}
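
To make the retrieve-then-generate flow in /query easier to picture, here is a toy sketch of the two steps without any services: rank stored vectors against a question vector, then assemble the prompt. It uses brute-force cosine similarity on hand-written 2-D vectors, which is an illustration only; ChromaDB uses its own index and distance metric, and real embeddings come from nomic-embed-text. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=3):
    """docs is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, contexts):
    """Join retrieved chunks into the context section of the prompt."""
    context = "\n\n".join(contexts)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


# Toy 2-D "embeddings" chosen by hand for illustration
docs = [
    ("PTO policy: 20 days per year", [1.0, 0.0]),
    ("Parking rules", [0.0, 1.0]),
    ("Sick leave policy", [0.9, 0.1]),
]
hits = top_k([1.0, 0.0], docs, k=2)
print(build_prompt("How many PTO days do employees get?", hits))
```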

Pull Required Models

After starting the containers:

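# Pull the LLM and embedding model
# (docker compose exec targets the service name; plain `docker exec ollama` only
# works if the container itself is named "ollama")
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text

# Verify models are downloaded
docker compose exec ollama ollama list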
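
The same check can be scripted against Ollama's /api/tags endpoint, which lists the locally available models. This is a minimal sketch: the helper names are hypothetical, and it assumes the API is reachable on localhost:11434 as in the Compose file. Untagged names are treated as `:latest`, which is how Ollama lists them.

```python
import json
from urllib.request import urlopen

REQUIRED_MODELS = ["llama3.1:8b", "nomic-embed-text"]


def normalize(name):
    """Ollama lists a model pulled without a tag as '<name>:latest'."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_response, required=REQUIRED_MODELS):
    """Return the required models absent from an /api/tags response."""
    installed = {normalize(m["name"]) for m in tags_response.get("models", [])}
    return [name for name in required if normalize(name) not in installed]


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the model list from a running Ollama instance."""
    with urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


# Usage against a running stack:
#   print(missing_models(fetch_tags()) or "all models present")
```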

Environment Variables

Create .env:

CHROMA_AUTH_TOKEN=your-chroma-auth-token
CHROMA_SERVER_AUTH_CREDENTIALS=admin:secret
CLOUDFLARE_TUNNEL_TOKEN=your-tunnel-token
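
The placeholder values above should be replaced with random secrets. One way to generate them is Python's standard `secrets` module, sketched below. The function name is hypothetical and the `admin` username is taken from the sample above. The Cloudflare token is not generated here because it is issued by Cloudflare when you create the tunnel.

```python
import secrets


def make_env_lines():
    """Build .env lines with freshly generated random secrets."""
    return [
        f"CHROMA_AUTH_TOKEN={secrets.token_urlsafe(32)}",
        f"CHROMA_SERVER_AUTH_CREDENTIALS=admin:{secrets.token_urlsafe(24)}",
    ]


# Print the lines so they can be pasted into .env
for line in make_env_lines():
    print(line)
```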

Document Ingestion Pipeline

Once the API is running, ingest your documents:

# Ingest a text snippet
curl -X POST http://localhost:8080/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Our company policy states that employees get 20 days of paid time off per year...",
    "metadata": {"source": "employee-handbook.pdf", "page": 12}
  }'

# Query the knowledge base
curl -X POST http://localhost:8080/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How many days of PTO do employees get?",
    "top_k": 3
  }'
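
The same two calls can be made from Python with only the standard library. This is a sketch: `build_request` and `post_json` are hypothetical helper names, and the base URL assumes the API is published on localhost:8080 as in the Compose file.

```python
import json
from urllib.request import Request, urlopen

API_BASE = "http://localhost:8080"


def build_request(path, payload):
    """Create a JSON POST request for the RAG API without sending it."""
    return Request(
        f"{API_BASE}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=120):
    """Send the request and decode the JSON response."""
    with urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.load(resp)


# Usage against a running stack:
#   post_json("/ingest", {"text": "Employees get 20 days of PTO.", "metadata": {"source": "handbook"}})
#   print(post_json("/query", {"question": "How many days of PTO?", "top_k": 3})["answer"])
```

The generous timeout is deliberate, since CPU-only generation can take a while on a small VPS.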

For bulk ingestion, use this Python script:

# Requires on the host: pip install requests langchain-community langchain-text-splitters pypdf
from pathlib import Path

import requests
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

KB_URL = "http://localhost:8080/ingest"

def ingest_directory(directory: str):
    """Ingest all PDFs and text files from a directory."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        # The trailing "" lets the splitter cut text that contains no spaces at all
        separators=["\n\n", "\n", ". ", " ", ""]
    )

    for filepath in Path(directory).rglob("*"):
        # Compare lowercase suffixes so files like REPORT.PDF are not skipped
        if filepath.suffix.lower() in (".pdf", ".txt", ".md"):
            print(f"Ingesting: {filepath}")

            if filepath.suffix.lower() == ".pdf":
                loader = PyPDFLoader(str(filepath))
            else:
                loader = TextLoader(str(filepath), encoding="utf-8")

            documents = loader.load()
            chunks = splitter.split_documents(documents)

            for i, chunk in enumerate(chunks):
                # Embedding on a CPU can be slow, so allow a generous timeout
                response = requests.post(KB_URL, json={
                    "text": chunk.page_content,
                    "metadata": {
                        "source": str(filepath),
                        "chunk": i
                    }
                }, timeout=120)
                if response.status_code != 200:
                    print(f"  Failed chunk {i}: {response.text}")

    print("Ingestion complete.")

# Usage: ingest_directory("./documents/")
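
To see what chunk_size=1000 and chunk_overlap=200 mean in practice, the sketch below implements a plain sliding window over characters. It is a simplified stand-in, not what RecursiveCharacterTextSplitter does: the real splitter prefers to break at the configured separators, so its chunks are usually shorter and fall on paragraph or sentence boundaries. The function name is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size windows where each window repeats the
    last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


# A 2,600-character sample: windows start at 0, 800 and 1600
sample = "".join(str(i % 10) for i in range(2600))
pieces = sliding_chunks(sample)
print(len(pieces), [len(p) for p in pieces])
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.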

Cost Breakdown: Self-Hosted vs API-Based

| Cost Item | Self-Hosted (8GB VPS) | API-Based (OpenAI) |
|---|---|---|
| Infrastructure | $10–20/month (VPS) | $0 |
| Embeddings | Free (local model) | ~$0.02/1K tokens |
| Chat Completions | Free (local model) | ~$0.015/1K tokens |
| 10K docs, 1K queries/day | $10–20/month | $150–300/month |
| Data privacy | Full (everything local) | Third-party |
| Customization | Fine-tune any model | Limited to provider options |

The math is clear: At just 1,000 queries per day, self-hosting pays for itself within the first month compared to API pricing. And the more you use it, the more you save.
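
The arithmetic behind that comparison can be reproduced with a short sketch. The per-token price is the chat completion figure from the table. The 500 tokens per query is an assumption chosen for illustration, not a measured value, so substitute your own numbers. The function names are hypothetical.

```python
def monthly_api_cost(queries_per_day, tokens_per_query, price_per_1k_tokens, days=30):
    """Monthly spend on a usage-priced API."""
    return queries_per_day * days * tokens_per_query / 1000 * price_per_1k_tokens


def break_even_queries_per_day(vps_monthly_cost, tokens_per_query, price_per_1k_tokens, days=30):
    """Daily query volume at which API spend equals the fixed VPS cost."""
    cost_per_query = tokens_per_query / 1000 * price_per_1k_tokens
    return vps_monthly_cost / (cost_per_query * days)


# Assumption: ~500 tokens per query (prompt + context + answer)
api = monthly_api_cost(1000, 500, 0.015)
threshold = break_even_queries_per_day(20, 500, 0.015)
print(f"API cost at 1K queries/day: ${api:.0f}/month")
print(f"Break-even vs a $20 VPS: {threshold:.0f} queries/day")
```

Under these assumptions the API bill lands at about $225/month, inside the table's $150–300 range, and a $20 VPS breaks even at just under 90 queries per day.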


Performance Tuning Tips

1. Swap Space for 7B Models

If your VPS has only 4GB RAM, add swap so a 7B–8B model can load at all. Expect noticeably slower inference whenever the model spills into swap:

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
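
To confirm the swap file is active, the kernel's /proc/meminfo can be parsed. The sketch below is Linux-only and the function names are hypothetical. It parses the text format rather than calling any system API, so it can also be run on a pasted sample.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kilobytes} dict."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_total_gb(meminfo):
    """SwapTotal converted from kB to GB (0 if swap is not configured)."""
    return meminfo.get("SwapTotal", 0) / (1024 * 1024)


# Usage on the VPS:
#   with open("/proc/meminfo") as f:
#       print(f"{swap_total_gb(parse_meminfo(f.read())):.1f} GB swap")
```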

2. Quantized Models Save RAM

Ollama supports multiple quantization levels:

| Model | Quantization | RAM Needed | Speed | Quality |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~4.7GB | Fast | 95% of FP16 |
| Llama 3.1 8B | Q3_K_S | ~3.5GB | Faster | 90% of FP16 |
| Llama 3.1 8B | Q2_K | ~2.6GB | Fastest | 80% of FP16 |

For production, Q4_K_M is the sweet spot. Drop to Q3 if you’re RAM-constrained.
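
That rule of thumb can be written down as a small selection helper. The RAM figures come from the quantization table. The 1.5GB reserved for the other services is an assumption (the sum of the non-LLM rows in the layer table), and the function name is hypothetical.

```python
# (quantization, approximate RAM in GB), ordered from best quality to worst
QUANT_LEVELS = [("Q4_K_M", 4.7), ("Q3_K_S", 3.5), ("Q2_K", 2.6)]


def pick_quantization(total_ram_gb, reserved_gb=1.5):
    """Return the highest-quality quantization that fits in RAM after
    reserving space for ChromaDB, embeddings, the API and the pipeline.
    Returns None when nothing fits without swap."""
    budget = total_ram_gb - reserved_gb
    for name, needed_gb in QUANT_LEVELS:
        if needed_gb <= budget:
            return name
    return None


for ram in (8, 6, 4):
    print(f"{ram}GB VPS -> {pick_quantization(ram)}")
```

A result of None means even Q2_K does not fit in physical RAM, so the model will only run with swap.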

3. ChromaDB Persistence

Always use persistent ChromaDB mode (included in the Docker Compose above). Without it, your vector store resets on container restart.

4. Monitoring

# Watch memory usage of every container in the stack
# (container names are generated by Compose, so look them up by ID)
docker stats $(docker compose ps -q)

# Check inference speed
time curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Hello","stream":false}'
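
The wall-clock time above includes model loading and prompt processing. For the generation speed alone, the non-streaming /api/generate response carries `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which the sketch below converts to tokens per second. The function name is hypothetical.

```python
def tokens_per_second(response):
    """Generation speed from an Ollama /api/generate response dict.

    Returns None if the timing fields are missing or zero."""
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)


# Example with made-up numbers: 120 tokens generated in 15 seconds
example = {"response": "...", "eval_count": 120, "eval_duration": 15_000_000_000}
print(f"{tokens_per_second(example):.1f} tokens/sec")
```

To use it with real data, save the curl output to a file and load it with `json.load` before passing it in.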

Security Hardening

Since your RAG system processes sensitive documents, these steps are non-negotiable:

# 1. Bind all services to localhost only (already in docker-compose)
# 2. Use Cloudflare Tunnel for encrypted access
# 3. Enable ChromaDB auth
# 4. Rate-limit the API

# Add to rag-api Dockerfile:
# pip install slowapi

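# Add to main.py:
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Return HTTP 429 when a client exceeds its limit
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")
def query_rag(request: Request, req: QueryRequest):  # slowapi needs the `request` argument
    ...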
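
One caveat with rate limiting behind Cloudflare Tunnel: the API sees every request as coming from the cloudflared container, so an address-based key puts all users in one bucket. Cloudflare forwards the visitor address in the `CF-Connecting-IP` header, and a key function can read it. The sketch below is hypothetical code, not part of slowapi, and it only makes sense when the API is reachable solely through the tunnel, since a direct client could forge the header.

```python
from types import SimpleNamespace


def client_ip(request):
    """Rate-limit key: prefer Cloudflare's CF-Connecting-IP header and
    fall back to the socket peer address."""
    forwarded = request.headers.get("CF-Connecting-IP")
    if forwarded:
        return forwarded.strip()
    client = getattr(request, "client", None)
    return client.host if client else "unknown"


# Stand-in request objects for a quick local check; real ones come from Starlette
via_tunnel = SimpleNamespace(
    headers={"CF-Connecting-IP": "203.0.113.7"},
    client=SimpleNamespace(host="172.18.0.5"),
)
direct = SimpleNamespace(headers={}, client=SimpleNamespace(host="127.0.0.1"))
print(client_ip(via_tunnel), client_ip(direct))
```

It would be passed to the limiter as `Limiter(key_func=client_ip)` in place of `get_remote_address`.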

When Self-Hosting Makes Sense (and When It Doesn’t)

Self-host if:

  • You process sensitive/confidential documents (legal, medical, financial)
  • Your query volume exceeds 500/day (API costs add up fast)
  • You need offline/intranet access
  • You want full control over model selection and fine-tuning

Stick with API if:

  • You’re a solo developer testing a prototype (< 100 queries/day)
  • You need 70B+ model quality (requires GPU infrastructure)
  • You don’t want to manage infrastructure

Summary

A self-hosted LLM + RAG system on a budget VPS delivers enterprise-grade AI capability for $10–20/month. With Ollama handling inference, ChromaDB storing your knowledge base, and Cloudflare Tunnel securing access, you get a fully private AI assistant that pays for itself within weeks.

Recommended path: Start with RackNerd’s $5.99/month plan for testing. Migrate to Hostinger’s 8GB VPS ($19.99/month) for production. Scale to Vultr GPU instances only if you need 30B+ model sizes.

RackNerd VPS Plans → · Hostinger VPS → · Vultr VPS →