<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Private AI on 诚实雷达</title><link>https://honestradar.com/tags/private-ai/</link><description>Recent content in Private AI on 诚实雷达</description><generator>Hugo -- gohugo.io</generator><language>zh-cn</language><lastBuildDate>Wed, 01 Jul 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://honestradar.com/tags/private-ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Self-Host LLM + RAG on a Budget VPS: Build a Private AI Knowledge Base for Under $10/Month</title><link>https://honestradar.com/vps-hosting/self-host-llm-rag-budget-vps-2026/</link><pubDate>Wed, 01 Jul 2026 00:00:00 +0000</pubDate><guid>https://honestradar.com/vps-hosting/self-host-llm-rag-budget-vps-2026/</guid><description>&lt;img src="https://honestradar.com/images/self-host-llm-rag-budget-vps-2026.jpg" alt="Featured image of post Self-Host LLM + RAG on a Budget VPS: Build a Private AI Knowledge Base for Under $10/Month" /&gt;&lt;h2 id="why-self-host-llm--rag-instead-of-paying-api-fees"&gt;Why Self-Host LLM + RAG Instead of Paying API Fees?
&lt;/h2&gt;&lt;p&gt;Every month, companies burn hundreds of dollars on OpenAI and Anthropic API calls. A single RAG pipeline processing 10,000 documents can cost $20–50/month just for embeddings, plus another $50–200/month for chat completions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-hosting addresses both cost and privacy.&lt;/strong&gt; With Ollama running a 7B–13B model locally on a $5–10 VPS, plus ChromaDB for vector storage, you get a private AI knowledge base at a flat monthly price, with no per-query fees. Your documents never leave your server, and your API bill drops to $0.&lt;/p&gt;
&lt;p&gt;This guide walks through deploying a production-ready self-hosted RAG stack on a budget VPS — comparing RackNerd, Hostinger, and Vultr for LLM workloads.&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&lt;strong&gt;Disclosure&lt;/strong&gt;: This article contains affiliate links. If you purchase through these links, we may earn a commission at no extra cost to you. We only recommend services we&amp;rsquo;ve tested and believe offer genuine value.&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="the-self-hosted-rag-stack-explained"&gt;The Self-Hosted RAG Stack Explained
&lt;/h2&gt;&lt;p&gt;A RAG (Retrieval-Augmented Generation) system has five layers:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│            VPS ($5-10/month, 4-8GB RAM)             │
│                                                     │
│   ┌─────────────────────────────────────────────┐   │
│   │  Cloudflare Tunnel → HTTPS / No Open Ports  │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  FastAPI / Gradio (Chat UI)                 │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  Ollama (LLM Inference: Llama 3.1 / Qwen)   │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  ChromaDB / Qdrant (Vector Store)           │   │
│   └──────────────────────┬──────────────────────┘   │
│                          │                          │
│   ┌──────────────────────▼──────────────────────┐   │
│   │  Document Pipeline (PDF/Markdown/HTML)      │   │
│   └─────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Layer breakdown:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Layer&lt;/th&gt;
 &lt;th&gt;Component&lt;/th&gt;
 &lt;th&gt;RAM Usage&lt;/th&gt;
 &lt;th&gt;Purpose&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Ollama + Llama 3.1 8B (Q4 quantized)&lt;/td&gt;
 &lt;td&gt;~5GB&lt;/td&gt;
 &lt;td&gt;Generates answers from retrieved context&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Vector DB&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;ChromaDB (persistent mode)&lt;/td&gt;
 &lt;td&gt;~512MB&lt;/td&gt;
 &lt;td&gt;Stores and retrieves document embeddings&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Embedding&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;nomic-embed-text (via Ollama)&lt;/td&gt;
 &lt;td&gt;~512MB&lt;/td&gt;
 &lt;td&gt;Converts documents to searchable vectors&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;API/UI&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;FastAPI + simple chat frontend&lt;/td&gt;
 &lt;td&gt;~256MB&lt;/td&gt;
 &lt;td&gt;Exposes the RAG pipeline via HTTP&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Pipeline&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Python doc loader + chunker&lt;/td&gt;
 &lt;td&gt;~256MB&lt;/td&gt;
 &lt;td&gt;Ingests PDFs, Markdown, HTML into the KB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Total&lt;/strong&gt;: The estimates above add up to roughly 6.5GB. A 4GB VPS is the absolute minimum and relies on swap to fit the model. &lt;strong&gt;8GB RAM is recommended&lt;/strong&gt; for comfortable operation with room for concurrent users.&lt;/p&gt;
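&lt;p&gt;The per-layer RAM estimates from the table can be totalled to check a plan size before ordering. The following is a minimal sketch: the figures are copied from the table (5GB taken as 5120MB), while the names &lt;code&gt;LAYER_RAM_MB&lt;/code&gt;, &lt;code&gt;total_ram_mb&lt;/code&gt; and &lt;code&gt;headroom_mb&lt;/code&gt; are hypothetical helpers, not part of any library.&lt;/p&gt;

```python
# RAM estimates copied from the layer table, in MB (5GB taken as 5120MB).
# These are rough planning numbers, not measurements.
LAYER_RAM_MB = {
    "LLM (Ollama + Llama 3.1 8B, Q4)": 5120,
    "Vector DB (ChromaDB)": 512,
    "Embedding (nomic-embed-text)": 512,
    "API/UI (FastAPI)": 256,
    "Pipeline (doc loader + chunker)": 256,
}


def total_ram_mb():
    """Sum the per-layer estimates."""
    return sum(LAYER_RAM_MB.values())


def headroom_mb(plan_gb: int):
    """RAM left on a plan of plan_gb gigabytes.

    A negative result means the stack does not fit in RAM
    and would have to lean on swap.
    """
    return plan_gb * 1024 - total_ram_mb()


if __name__ == "__main__":
    print(f"Estimated total: {total_ram_mb()} MB")
    for plan_gb in (4, 8):
        print(f"{plan_gb}GB plan headroom: {headroom_mb(plan_gb)} MB")
```

&lt;p&gt;With these estimates, an 8GB plan leaves about 1.5GB free for the operating system and caches, while a 4GB plan comes out about 2.5GB short and depends on swap.&lt;/p&gt;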
&lt;hr&gt;
&lt;h2 id="vps-provider-comparison-for-llm-workloads"&gt;VPS Provider Comparison for LLM Workloads
&lt;/h2&gt;&lt;h3 id="racknerd--best-entry-point-for-learning"&gt;RackNerd — Best Entry Point for Learning
&lt;/h3&gt;&lt;p&gt;RackNerd&amp;rsquo;s $5.99/month plans offer surprising value for getting started with self-hosted LLMs.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: 1 vCPU (adequate for 7B model inference at ~5-10 tokens/sec)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM&lt;/strong&gt;: 2–4GB (use swap for 7B models; 8GB plan ideal)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage&lt;/strong&gt;: 40GB SSD (enough for model weights + document store)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network&lt;/strong&gt;: Solid US West connectivity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Why RackNerd?&lt;/strong&gt; Lowest barrier to entry. You can test the entire RAG stack on the $5.99/month instance and upgrade when you&amp;rsquo;re ready for production traffic.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://racknerd.com/aff/19978" target="_blank" rel="noopener"
 &gt;RackNerd VPS Plans →&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="hostinger--best-balance-of-ram-and-performance"&gt;Hostinger — Best Balance of RAM and Performance
&lt;/h3&gt;&lt;p&gt;Hostinger&amp;rsquo;s KVM VPS plans are ideal for LLM workloads because they guarantee RAM — no overselling surprises.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: AMD EPYC dedicated cores&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM&lt;/strong&gt;: 8GB plan at $19.99/month, or 4GB plan at $9.99/month with referral code JZ1ZL8465QCG&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage&lt;/strong&gt;: 200GB NVMe SSD (fast document ingestion)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network&lt;/strong&gt;: Premium routing, 5 data center regions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Why Hostinger for LLM?&lt;/strong&gt; Guaranteed RAM is critical for LLM inference. Shared RAM on budget hosts causes OOM kills during peak usage. NVMe storage accelerates vector database operations by 3-5x.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.hostinger.com/vps-hosting?REFERRALCODE=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Hostinger VPS with Discount →&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="vultr--best-for-production-with-gpu-option"&gt;Vultr — Best for Production with GPU Option
&lt;/h3&gt;&lt;p&gt;Vultr offers both standard CPU instances and GPU instances, which makes it the only one of these three providers with a GPU option.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: AMD High Frequency or Intel&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM&lt;/strong&gt;: 4GB–64GB scalable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: Optional add-on ($3.75/hr for A100, or $60/month for dedicated GPU instances)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage&lt;/strong&gt;: NVMe SSD&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network&lt;/strong&gt;: 30+ global data centers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Why Vultr for LLM?&lt;/strong&gt; If you need to run larger models (30B+) or want GPU acceleration for faster inference, Vultr&amp;rsquo;s GPU instances are the cheapest entry point available. For 7B–13B CPU-only inference, their standard instances are competitive.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.vultr.com/?ref=9706229" target="_blank" rel="noopener"
 &gt;Vultr VPS with $100 Credit →&lt;/a&gt;&lt;/p&gt;
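&lt;p&gt;As a rough check on the two GPU price points quoted for Vultr, the sketch below computes what the hourly A100 rate costs over a full month and where it crosses the $60/month figure. The prices are the ones listed in this section; the 730-hour month and the function names are assumptions for illustration. The two offers are not the same hardware, so treat this as a budgeting aid rather than a like-for-like comparison.&lt;/p&gt;

```python
# Prices as quoted in the Vultr bullet list; they are snapshots and may change.
A100_HOURLY_USD = 3.75
DEDICATED_GPU_MONTHLY_USD = 60.0
HOURS_PER_MONTH = 730  # assumption: 365 days * 24 hours / 12 months


def hourly_bill(hours: float):
    """Cost in USD of renting the hourly A100 for the given number of hours."""
    return A100_HOURLY_USD * hours


def break_even_hours():
    """Hours per month at which hourly billing equals the flat monthly price."""
    return DEDICATED_GPU_MONTHLY_USD / A100_HOURLY_USD


if __name__ == "__main__":
    print(f"A100 running all month: ${hourly_bill(HOURS_PER_MONTH):,.2f}")
    print(f"Break-even against $60/month: {break_even_hours():.0f} hours")
```

&lt;p&gt;With these figures, hourly billing only undercuts the flat monthly price for short bursts of about 16 hours a month, so an always-on knowledge base is better served by a CPU instance or a monthly GPU plan.&lt;/p&gt;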
&lt;hr&gt;
&lt;h2 id="complete-docker-compose-deployment"&gt;Complete Docker Compose Deployment
&lt;/h2&gt;&lt;p&gt;This is the full stack — one &lt;code&gt;docker-compose.yml&lt;/code&gt; file, everything running:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;version&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;3.8&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;services&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# Ollama: self-hosted LLM runtime&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;ollama&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;ollama/ollama:latest&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;ports&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#e6db74"&gt;&amp;#34;127.0.0.1:11434:11434&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;volumes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;ollama_data:/root/.ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;restart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;unless-stopped&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# ChromaDB: persistent vector database&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;chromadb&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;chromadb/chroma:latest&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;ports&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#e6db74"&gt;&amp;#34;127.0.0.1:8000:8000&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;volumes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;chroma_data:/chroma/chroma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;environment&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;CHROMA_AUTH_TOKEN&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;${CHROMA_AUTH_TOKEN}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;CHROMA_SERVER_AUTH_CREDENTIALS&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;${CHROMA_CREDENTIALS}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;restart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;unless-stopped&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# RAG API: FastAPI service connecting everything&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;rag-api&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;build&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;./rag-api&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;ports&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#e6db74"&gt;&amp;#34;127.0.0.1:8080:8080&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;environment&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;OLLAMA_BASE_URL&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;http://ollama:11434&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;CHROMADB_URL&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;http://chromadb:8000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;EMBEDDING_MODEL&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;nomic-embed-text&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;LLM_MODEL&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;llama3.1:8b&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;COLLECTION_NAME&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;company-kb&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;depends_on&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;chromadb&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;restart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;unless-stopped&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# Cloudflare Tunnel: secure access without open ports&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;cloudflared&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;cloudflare/cloudflared:latest&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;command&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;tunnel --no-autoupdate run --token ${CLOUDFLARE_TUNNEL_TOKEN}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;restart&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;unless-stopped&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;depends_on&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;rag-api&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;volumes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;ollama_data&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;chroma_data&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="preparing-the-rag-api"&gt;Preparing the RAG API
&lt;/h3&gt;&lt;p&gt;Create &lt;code&gt;rag-api/Dockerfile&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-dockerfile" data-lang="dockerfile"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;FROM&lt;/span&gt; &lt;span style="color:#e6db74"&gt;python:3.11-slim&lt;/span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;WORKDIR&lt;/span&gt; &lt;span style="color:#e6db74"&gt;/app&lt;/span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;RUN&lt;/span&gt; pip install --no-cache-dir fastapi uvicorn pydantic ollama chromadb langchain-community langchain-text-splitters pypdf&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;COPY&lt;/span&gt; . .&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;EXPOSE&lt;/span&gt; &lt;span style="color:#e6db74"&gt;8080&lt;/span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;CMD&lt;/span&gt; [&lt;span style="color:#e6db74"&gt;&amp;#34;uvicorn&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;main:app&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;--host&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;0.0.0.0&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;--port&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;8080&amp;#34;&lt;/span&gt;]&lt;span style="color:#960050;background-color:#1e0010"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Create &lt;code&gt;rag-api/main.py&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; os
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; fastapi &lt;span style="color:#f92672"&gt;import&lt;/span&gt; FastAPI, HTTPException
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; pydantic &lt;span style="color:#f92672"&gt;import&lt;/span&gt; BaseModel
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; typing &lt;span style="color:#f92672"&gt;import&lt;/span&gt; List
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; ollama
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; chromadb
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;app &lt;span style="color:#f92672"&gt;=&lt;/span&gt; FastAPI(title&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;Self-Hosted RAG API&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;OLLAMA_URL &lt;span style="color:#f92672"&gt;=&lt;/span&gt; os&lt;span style="color:#f92672"&gt;.&lt;/span&gt;getenv(&lt;span style="color:#e6db74"&gt;&amp;#34;OLLAMA_BASE_URL&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;http://localhost:11434&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;CHROMADB_URL &lt;span style="color:#f92672"&gt;=&lt;/span&gt; os&lt;span style="color:#f92672"&gt;.&lt;/span&gt;getenv(&lt;span style="color:#e6db74"&gt;&amp;#34;CHROMADB_URL&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;http://localhost:8000&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;EMBEDDING_MODEL &lt;span style="color:#f92672"&gt;=&lt;/span&gt; os&lt;span style="color:#f92672"&gt;.&lt;/span&gt;getenv(&lt;span style="color:#e6db74"&gt;&amp;#34;EMBEDDING_MODEL&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;nomic-embed-text&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;LLM_MODEL &lt;span style="color:#f92672"&gt;=&lt;/span&gt; os&lt;span style="color:#f92672"&gt;.&lt;/span&gt;getenv(&lt;span style="color:#e6db74"&gt;&amp;#34;LLM_MODEL&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;llama3.1:8b&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;COLLECTION_NAME &lt;span style="color:#f92672"&gt;=&lt;/span&gt; os&lt;span style="color:#f92672"&gt;.&lt;/span&gt;getenv(&lt;span style="color:#e6db74"&gt;&amp;#34;COLLECTION_NAME&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;company-kb&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Initialize clients&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama_client &lt;span style="color:#f92672"&gt;=&lt;/span&gt; ollama&lt;span style="color:#f92672"&gt;.&lt;/span&gt;Client(host&lt;span style="color:#f92672"&gt;=&lt;/span&gt;OLLAMA_URL)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;chroma_client &lt;span style="color:#f92672"&gt;=&lt;/span&gt; chromadb&lt;span style="color:#f92672"&gt;.&lt;/span&gt;HttpClient(host&lt;span style="color:#f92672"&gt;=&lt;/span&gt;CHROMADB_URL&lt;span style="color:#f92672"&gt;.&lt;/span&gt;split(&lt;span style="color:#e6db74"&gt;&amp;#34;:&amp;#34;&lt;/span&gt;)[&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;]&lt;span style="color:#f92672"&gt;.&lt;/span&gt;strip(&lt;span style="color:#e6db74"&gt;&amp;#34;/&amp;#34;&lt;/span&gt;), port&lt;span style="color:#f92672"&gt;=&lt;/span&gt;int(CHROMADB_URL&lt;span style="color:#f92672"&gt;.&lt;/span&gt;split(&lt;span style="color:#e6db74"&gt;&amp;#34;:&amp;#34;&lt;/span&gt;)[&lt;span style="color:#ae81ff"&gt;2&lt;/span&gt;]))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;collection &lt;span style="color:#f92672"&gt;=&lt;/span&gt; chroma_client&lt;span style="color:#f92672"&gt;.&lt;/span&gt;get_or_create_collection(COLLECTION_NAME)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;class&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;QueryRequest&lt;/span&gt;(BaseModel):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    question: str
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    top_k: int &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;class&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;IngestRequest&lt;/span&gt;(BaseModel):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    text: str
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    metadata: dict &lt;span style="color:#f92672"&gt;=&lt;/span&gt; {}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;@app.post&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;/ingest&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ingest_document&lt;/span&gt;(req: IngestRequest):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#e6db74"&gt;&amp;#34;&amp;#34;&amp;#34;Ingest a document chunk into the vector store.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;try&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#75715e"&gt;# Generate embedding&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        embedding_response &lt;span style="color:#f92672"&gt;=&lt;/span&gt; ollama_client&lt;span style="color:#f92672"&gt;.&lt;/span&gt;embeddings(model&lt;span style="color:#f92672"&gt;=&lt;/span&gt;EMBEDDING_MODEL, prompt&lt;span style="color:#f92672"&gt;=&lt;/span&gt;req&lt;span style="color:#f92672"&gt;.&lt;/span&gt;text)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        embedding &lt;span style="color:#f92672"&gt;=&lt;/span&gt; embedding_response[&lt;span style="color:#e6db74"&gt;&amp;#34;embedding&amp;#34;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#75715e"&gt;# Store in ChromaDB; the ID is derived from the current item count&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        doc_id &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#e6db74"&gt;f&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;doc_&lt;/span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;collection&lt;span style="color:#f92672"&gt;.&lt;/span&gt;count() &lt;span style="color:#f92672"&gt;+&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        collection&lt;span style="color:#f92672"&gt;.&lt;/span&gt;upsert(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            ids&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[doc_id],
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            embeddings&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[embedding],
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            documents&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[req&lt;span style="color:#f92672"&gt;.&lt;/span&gt;text],
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            metadatas&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[req&lt;span style="color:#f92672"&gt;.&lt;/span&gt;metadata]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        )
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; {&lt;span style="color:#e6db74"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;ok&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;document_id&amp;#34;&lt;/span&gt;: doc_id}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;except&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Exception&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;as&lt;/span&gt; e:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#66d9ef"&gt;raise&lt;/span&gt; HTTPException(status_code&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;500&lt;/span&gt;, detail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;str(e))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;@app.post&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;/query&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;query_rag&lt;/span&gt;(req: QueryRequest):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#e6db74"&gt;&amp;#34;&amp;#34;&amp;#34;Query the RAG system: retrieve relevant docs + generate answer.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;try&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#75715e"&gt;# Get embedding for the question&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        question_embedding &lt;span style="color:#f92672"&gt;=&lt;/span&gt; ollama_client&lt;span style="color:#f92672"&gt;.&lt;/span&gt;embeddings(model&lt;span style="color:#f92672"&gt;=&lt;/span&gt;EMBEDDING_MODEL, prompt&lt;span style="color:#f92672"&gt;=&lt;/span&gt;req&lt;span style="color:#f92672"&gt;.&lt;/span&gt;question)[&lt;span style="color:#e6db74"&gt;&amp;#34;embedding&amp;#34;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#75715e"&gt;# Retrieve relevant documents&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        results &lt;span style="color:#f92672"&gt;=&lt;/span&gt; collection&lt;span style="color:#f92672"&gt;.&lt;/span&gt;query(query_embeddings&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[question_embedding], n_results&lt;span style="color:#f92672"&gt;=&lt;/span&gt;req&lt;span style="color:#f92672"&gt;.&lt;/span&gt;top_k)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#f92672"&gt;not&lt;/span&gt; results[&lt;span style="color:#e6db74"&gt;&amp;#34;documents&amp;#34;&lt;/span&gt;][&lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;]:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; {&lt;span style="color:#e6db74"&gt;&amp;#34;answer&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;No relevant documents found in the knowledge base.&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;sources&amp;#34;&lt;/span&gt;: []}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#75715e"&gt;# Build context from retrieved documents&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        context &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;\n\n&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#f92672"&gt;.&lt;/span&gt;join(results[&lt;span style="color:#e6db74"&gt;&amp;#34;documents&amp;#34;&lt;/span&gt;][&lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#75715e"&gt;# Generate answer using LLM&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        prompt &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#e6db74"&gt;f&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&amp;#34;&amp;#34;You are a helpful AI assistant. Answer the user&amp;#39;s question based only on the provided context.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Context:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;context&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Question: &lt;/span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;req&lt;span style="color:#f92672"&gt;.&lt;/span&gt;question&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Answer:&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        response &lt;span style="color:#f92672"&gt;=&lt;/span&gt; ollama_client&lt;span style="color:#f92672"&gt;.&lt;/span&gt;generate(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            model&lt;span style="color:#f92672"&gt;=&lt;/span&gt;LLM_MODEL,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            prompt&lt;span style="color:#f92672"&gt;=&lt;/span&gt;prompt,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            stream&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;False&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        )
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;answer&amp;#34;&lt;/span&gt;: response[&lt;span style="color:#e6db74"&gt;&amp;#34;response&amp;#34;&lt;/span&gt;],
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;sources&amp;#34;&lt;/span&gt;: results[&lt;span style="color:#e6db74"&gt;&amp;#34;documents&amp;#34;&lt;/span&gt;][&lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;],
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;distances&amp;#34;&lt;/span&gt;: results[&lt;span style="color:#e6db74"&gt;&amp;#34;distances&amp;#34;&lt;/span&gt;][&lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;except&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Exception&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;as&lt;/span&gt; e:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#66d9ef"&gt;raise&lt;/span&gt; HTTPException(status_code&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;500&lt;/span&gt;, detail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;str(e))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;@app.get&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;/health&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;health&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; {&lt;span style="color:#e6db74"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;healthy&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;ollama_model&amp;#34;&lt;/span&gt;: LLM_MODEL, &lt;span style="color:#e6db74"&gt;&amp;#34;embedding_model&amp;#34;&lt;/span&gt;: EMBEDDING_MODEL}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="pull-required-models"&gt;Pull Required Models
&lt;/h3&gt;&lt;p&gt;After starting the containers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Pull the LLM and embedding model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker exec ollama ollama pull llama3.1:8b
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker exec ollama ollama pull nomic-embed-text
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Verify models are loaded&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker exec ollama ollama list
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="environment-variables"&gt;Environment Variables
&lt;/h3&gt;&lt;p&gt;Create &lt;code&gt;.env&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;CHROMA_AUTH_TOKEN&lt;span style="color:#f92672"&gt;=&lt;/span&gt;your-chroma-auth-token
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;CHROMA_SERVER_AUTH_CREDENTIALS&lt;span style="color:#f92672"&gt;=&lt;/span&gt;admin:secret
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;CLOUDFLARE_TUNNEL_TOKEN&lt;span style="color:#f92672"&gt;=&lt;/span&gt;your-tunnel-token
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="document-ingestion-pipeline"&gt;Document Ingestion Pipeline
&lt;/h2&gt;&lt;p&gt;Once the API is running, ingest your documents:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Ingest a text snippet&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -X POST http://localhost:8080/ingest &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt; &amp;#34;text&amp;#34;: &amp;#34;Our company policy states that employees get 20 days of paid time off per year...&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt; &amp;#34;metadata&amp;#34;: {&amp;#34;source&amp;#34;: &amp;#34;employee-handbook.pdf&amp;#34;, &amp;#34;page&amp;#34;: 12}
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt; }&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Query the knowledge base&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -X POST http://localhost:8080/query &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -H &lt;span style="color:#e6db74"&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -d &lt;span style="color:#e6db74"&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt; &amp;#34;question&amp;#34;: &amp;#34;How many days of PTO do employees get?&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt; &amp;#34;top_k&amp;#34;: 3
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt; }&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For bulk ingestion, use this Python script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; os
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; pathlib &lt;span style="color:#f92672"&gt;import&lt;/span&gt; Path
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; langchain_community.document_loaders &lt;span style="color:#f92672"&gt;import&lt;/span&gt; PyPDFLoader, TextLoader
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; langchain_text_splitters &lt;span style="color:#f92672"&gt;import&lt;/span&gt; RecursiveCharacterTextSplitter
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; requests
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;KB_URL &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;http://localhost:8080/ingest&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ingest_directory&lt;/span&gt;(directory: str):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#e6db74"&gt;&amp;#34;&amp;#34;&amp;#34;Ingest all PDFs and text files from a directory.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    splitter &lt;span style="color:#f92672"&gt;=&lt;/span&gt; RecursiveCharacterTextSplitter(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        chunk_size&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1000&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        chunk_overlap&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        separators&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;\n\n&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;\n&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;. &amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34; &amp;#34;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    )
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; filepath &lt;span style="color:#f92672"&gt;in&lt;/span&gt; Path(directory)&lt;span style="color:#f92672"&gt;.&lt;/span&gt;rglob(&lt;span style="color:#e6db74"&gt;&amp;#34;*&amp;#34;&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; filepath&lt;span style="color:#f92672"&gt;.&lt;/span&gt;suffix &lt;span style="color:#f92672"&gt;in&lt;/span&gt; (&lt;span style="color:#e6db74"&gt;&amp;#34;.pdf&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;.txt&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;.md&amp;#34;&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            print(&lt;span style="color:#e6db74"&gt;f&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;Ingesting: &lt;/span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;filepath&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; filepath&lt;span style="color:#f92672"&gt;.&lt;/span&gt;suffix &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;.pdf&amp;#34;&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                loader &lt;span style="color:#f92672"&gt;=&lt;/span&gt; PyPDFLoader(str(filepath))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#66d9ef"&gt;else&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                loader &lt;span style="color:#f92672"&gt;=&lt;/span&gt; TextLoader(str(filepath))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            documents &lt;span style="color:#f92672"&gt;=&lt;/span&gt; loader&lt;span style="color:#f92672"&gt;.&lt;/span&gt;load()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            chunks &lt;span style="color:#f92672"&gt;=&lt;/span&gt; splitter&lt;span style="color:#f92672"&gt;.&lt;/span&gt;split_documents(documents)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; i, chunk &lt;span style="color:#f92672"&gt;in&lt;/span&gt; enumerate(chunks):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                response &lt;span style="color:#f92672"&gt;=&lt;/span&gt; requests&lt;span style="color:#f92672"&gt;.&lt;/span&gt;post(KB_URL, json&lt;span style="color:#f92672"&gt;=&lt;/span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    &lt;span style="color:#e6db74"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;: chunk&lt;span style="color:#f92672"&gt;.&lt;/span&gt;page_content,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    &lt;span style="color:#e6db74"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                        &lt;span style="color:#e6db74"&gt;&amp;#34;source&amp;#34;&lt;/span&gt;: str(filepath),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                        &lt;span style="color:#e6db74"&gt;&amp;#34;chunk&amp;#34;&lt;/span&gt;: i
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                })
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; response&lt;span style="color:#f92672"&gt;.&lt;/span&gt;status_code &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    print(&lt;span style="color:#e6db74"&gt;f&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34; Failed chunk &lt;/span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;i&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;: &lt;/span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;response&lt;span style="color:#f92672"&gt;.&lt;/span&gt;text&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    print(&lt;span style="color:#e6db74"&gt;&amp;#34;Ingestion complete.&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Usage: ingest_directory(&amp;#34;./documents/&amp;#34;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="cost-breakdown-self-hosted-vs-api-based"&gt;Cost Breakdown: Self-Hosted vs API-Based
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Cost Item&lt;/th&gt;
 &lt;th&gt;Self-Hosted (8GB VPS)&lt;/th&gt;
 &lt;th&gt;API-Based (OpenAI)&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;$10–20/month (VPS)&lt;/td&gt;
 &lt;td&gt;$0&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Free (local model)&lt;/td&gt;
 &lt;td&gt;~$0.02/1M tokens (text-embedding-3-small)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Chat Completions&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Free (local model)&lt;/td&gt;
 &lt;td&gt;~$0.015/1K tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;10K docs, 1K queries/day&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$10–20/month&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$150–300/month&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Data privacy&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Full (everything local)&lt;/td&gt;
 &lt;td&gt;Third-party&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Customization&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Fine-tune any model&lt;/td&gt;
 &lt;td&gt;Limited to provider options&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;The math is clear&lt;/strong&gt;: at 1,000 queries per day, a $10–20/month VPS replaces an estimated $150–300/month in API fees, so self-hosting pays for itself within the first month. Because the VPS cost is fixed, the savings grow with query volume.&lt;/p&gt;
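&lt;p&gt;As a rough check on the table, the monthly API bill can be estimated from query volume and tokens per query. This is a minimal sketch, not a pricing calculator: the function names and the 500-tokens-per-query workload are assumptions for illustration, and $0.015/1K tokens is the chat-completion estimate from the table.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;def monthly_api_cost(queries_per_day, tokens_per_query, usd_per_1k_tokens, days=30):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &amp;#34;&amp;#34;&amp;#34;Estimated monthly API spend in USD for a given query volume.&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    total_tokens = queries_per_day * days * tokens_per_query
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    return total_tokens / 1000 * usd_per_1k_tokens
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;def break_even_queries_per_day(vps_usd_per_month, tokens_per_query, usd_per_1k_tokens, days=30):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &amp;#34;&amp;#34;&amp;#34;Daily query volume at which API spend equals the fixed VPS cost.&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    usd_per_query = tokens_per_query / 1000 * usd_per_1k_tokens
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    return vps_usd_per_month / (usd_per_query * days)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# Assumed workload: 1,000 queries/day at ~500 tokens each (prompt + answer)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;api_cost = monthly_api_cost(1000, 500, 0.015)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;break_even = break_even_queries_per_day(20, 500, 0.015)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;print(f&amp;#34;API estimate: ${api_cost:.0f}/month&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;print(f&amp;#34;Break-even vs a $20 VPS: {break_even:.0f} queries/day&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Under these assumptions the estimate comes to about $225/month, inside the table&amp;rsquo;s $150–300 range, and a $20 VPS breaks even at roughly 90 queries per day. Longer prompts or answers push the API figure up proportionally.&lt;/p&gt;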
&lt;hr&gt;
&lt;h2 id="performance-tuning-tips"&gt;Performance Tuning Tips
&lt;/h2&gt;&lt;h3 id="1-swap-space-for-7b-models"&gt;1. Swap Space for 7B Models
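&lt;/h3&gt;&lt;p&gt;If your VPS has only 4GB RAM, add swap so the quantized 8B model (about 4.7GB at Q4_K_M) can load. Expect much slower inference whenever the model spills into swap:&lt;/p&gt;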
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo fallocate -l 4G /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo chmod &lt;span style="color:#ae81ff"&gt;600&lt;/span&gt; /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo mkswap /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo swapon /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;echo &lt;span style="color:#e6db74"&gt;&amp;#39;/swapfile none swap sw 0 0&amp;#39;&lt;/span&gt; | sudo tee -a /etc/fstab
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="2-quantized-models-save-ram"&gt;2. Quantized Models Save RAM
&lt;/h3&gt;&lt;p&gt;Ollama supports multiple quantization levels:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Precision&lt;/th&gt;
 &lt;th&gt;RAM Needed&lt;/th&gt;
 &lt;th&gt;Speed&lt;/th&gt;
 &lt;th&gt;Quality&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 3.1 8B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;~4.7GB&lt;/td&gt;
 &lt;td&gt;Fast&lt;/td&gt;
 &lt;td&gt;95% of FP16&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 3.1 8B&lt;/td&gt;
 &lt;td&gt;Q3_K_S&lt;/td&gt;
 &lt;td&gt;~3.5GB&lt;/td&gt;
 &lt;td&gt;Faster&lt;/td&gt;
 &lt;td&gt;90% of FP16&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 3.1 8B&lt;/td&gt;
 &lt;td&gt;Q2_K&lt;/td&gt;
 &lt;td&gt;~2.6GB&lt;/td&gt;
 &lt;td&gt;Fastest&lt;/td&gt;
 &lt;td&gt;80% of FP16&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For production, &lt;strong&gt;Q4_K_M is the sweet spot&lt;/strong&gt;. Drop to Q3 if you&amp;rsquo;re RAM-constrained.&lt;/p&gt;
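&lt;p&gt;The choice can be sketched as a small lookup over the table. This is only an illustration: &lt;code&gt;pick_quantization&lt;/code&gt; is a hypothetical helper, the RAM figures are copied from the table, and the 1GB reserve for the OS, ChromaDB, and the API is an assumption you should adjust after measuring your own stack.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# RAM figures copied from the table above (GB), ordered best quality first
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;QUANT_RAM_GB = [(&amp;#34;Q4_K_M&amp;#34;, 4.7), (&amp;#34;Q3_K_S&amp;#34;, 3.5), (&amp;#34;Q2_K&amp;#34;, 2.6)]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;def pick_quantization(total_ram_gb, reserved_gb=1.0):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &amp;#34;&amp;#34;&amp;#34;Return the highest-quality quantization that fits in RAM, or None.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    reserved_gb is an assumed allowance for the OS, ChromaDB, and the API.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    budget = total_ram_gb - reserved_gb
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    for name, needed in QUANT_RAM_GB:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        if needed &amp;lt;= budget:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            return name
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    return None
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;for ram in (8, 5, 4, 3):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    print(f&amp;#34;{ram}GB VPS: {pick_quantization(ram)}&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With the assumed 1GB reserve, an 8GB VPS fits Q4_K_M in RAM, while a 4GB VPS fits only Q2_K. Running Q4_K_M on the smaller plan therefore depends on the swap step described earlier.&lt;/p&gt;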
&lt;h3 id="3-chromadb-persistence"&gt;3. ChromaDB Persistence
&lt;/h3&gt;&lt;p&gt;Always use persistent ChromaDB mode (included in the Docker Compose above). Without it, your vector store resets on container restart.&lt;/p&gt;
&lt;h3 id="4-monitoring"&gt;4. Monitoring
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Watch Ollama memory usage&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker stats ollama chromadb rag-api
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Check inference speed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;time curl -s http://localhost:11434/api/generate -d &lt;span style="color:#e6db74"&gt;&amp;#39;{&amp;#34;model&amp;#34;:&amp;#34;llama3.1:8b&amp;#34;,&amp;#34;prompt&amp;#34;:&amp;#34;Hello&amp;#34;,&amp;#34;stream&amp;#34;:false}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
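&lt;p&gt;The &lt;code&gt;time curl&lt;/code&gt; check in the monitoring step measures total wall-clock time, including model load. For a tokens-per-second figure, a non-streaming &lt;code&gt;/api/generate&lt;/code&gt; response from Ollama reports &lt;code&gt;eval_count&lt;/code&gt; (tokens generated) and &lt;code&gt;eval_duration&lt;/code&gt; (nanoseconds). The sketch below assumes those two fields are present; &lt;code&gt;tokens_per_second&lt;/code&gt; and &lt;code&gt;benchmark&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;import json
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;import urllib.request
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# Same endpoint as the curl check in the monitoring step
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;OLLAMA_URL = &amp;#34;http://localhost:11434/api/generate&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;def tokens_per_second(result):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &amp;#34;&amp;#34;&amp;#34;Generation speed from an Ollama response (eval_duration is in nanoseconds).&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    return result[&amp;#34;eval_count&amp;#34;] / result[&amp;#34;eval_duration&amp;#34;] * 1e9
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;def benchmark(model=&amp;#34;llama3.1:8b&amp;#34;, prompt=&amp;#34;Hello&amp;#34;):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &amp;#34;&amp;#34;&amp;#34;Send one non-streaming request and return tokens generated per second.&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    payload = json.dumps({&amp;#34;model&amp;#34;: model, &amp;#34;prompt&amp;#34;: prompt, &amp;#34;stream&amp;#34;: False}).encode()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    request = urllib.request.Request(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        OLLAMA_URL, data=payload, headers={&amp;#34;Content-Type&amp;#34;: &amp;#34;application/json&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    )
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    # CPU-only inference can be slow, so allow a generous timeout
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    with urllib.request.urlopen(request, timeout=300) as response:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        return tokens_per_second(json.load(response))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# Usage (requires the Ollama container to be running):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# print(f&amp;#34;{benchmark():.1f} tokens/s&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run it once to warm the model, then again for a steady-state number. Comparing the result across quantization levels or before and after adding swap shows whether a tuning change actually helped.&lt;/p&gt;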
&lt;h2 id="security-hardening"&gt;Security Hardening
&lt;/h2&gt;&lt;p&gt;Since your RAG system processes sensitive documents, these steps are non-negotiable:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 1. Bind all services to localhost only (already in docker-compose)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 2. Use Cloudflare Tunnel for encrypted access&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 3. Enable ChromaDB auth&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 4. Rate-limit the API&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Add to rag-api Dockerfile:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# pip install slowapi&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Add to main.py:&lt;/span&gt;
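&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;from fastapi import Request
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;from slowapi import Limiter, _rate_limit_exceeded_handler
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;from slowapi.errors import RateLimitExceeded
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;from slowapi.util import get_remote_address
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;limiter &lt;span style="color:#f92672"&gt;=&lt;/span&gt; Limiter&lt;span style="color:#f92672"&gt;(&lt;/span&gt;key_func&lt;span style="color:#f92672"&gt;=&lt;/span&gt;get_remote_address&lt;span style="color:#f92672"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;app.state.limiter &lt;span style="color:#f92672"&gt;=&lt;/span&gt; limiter
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Return HTTP 429 when a client exceeds its limit&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;app.add_exception_handler&lt;span style="color:#f92672"&gt;(&lt;/span&gt;RateLimitExceeded, _rate_limit_exceeded_handler&lt;span style="color:#f92672"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# slowapi needs a parameter named &amp;#34;request&amp;#34; of type Request on the endpoint&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;@app.post&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;/query&amp;#34;&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;@limiter.limit&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;10/minute&amp;#34;&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;def query_rag&lt;span style="color:#f92672"&gt;(&lt;/span&gt;request: Request, req: QueryRequest&lt;span style="color:#f92672"&gt;)&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    ...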
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="when-self-hosting-makes-sense-and-when-it-doesnt"&gt;When Self-Hosting Makes Sense (and When It Doesn&amp;rsquo;t)
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Self-host if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You process sensitive/confidential documents (legal, medical, financial)&lt;/li&gt;
&lt;li&gt;Your query volume exceeds 500/day (API costs add up fast)&lt;/li&gt;
&lt;li&gt;You need offline/intranet access&lt;/li&gt;
&lt;li&gt;You want full control over model selection and fine-tuning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Stick with API if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You&amp;rsquo;re a solo developer testing a prototype (&amp;lt; 100 queries/day)&lt;/li&gt;
&lt;li&gt;You need 70B+ model quality (requires GPU infrastructure)&lt;/li&gt;
&lt;li&gt;You don&amp;rsquo;t want to manage infrastructure&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="summary"&gt;Summary
&lt;/h2&gt;&lt;p&gt;A self-hosted LLM + RAG system on a budget VPS gives you a private, always-on AI knowledge base for $10–20/month. With Ollama handling inference, ChromaDB storing your knowledge base, and Cloudflare Tunnel securing access, you get a fully private AI assistant that pays for itself within weeks at moderate query volumes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Recommended path&lt;/strong&gt;: Start with RackNerd&amp;rsquo;s $5.99/month plan for testing. Migrate to Hostinger&amp;rsquo;s 8GB VPS ($10–20/month) for production. Scale to Vultr GPU instances only if you need 30B+ model sizes.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://racknerd.com/aff/19978" target="_blank" rel="noopener"
 &gt;RackNerd VPS Plans →&lt;/a&gt; · &lt;a class="link" href="https://www.hostinger.com/vps-hosting?REFERRALCODE=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Hostinger VPS →&lt;/a&gt; · &lt;a class="link" href="https://www.vultr.com/?ref=9706229" target="_blank" rel="noopener"
 &gt;Vultr VPS →&lt;/a&gt;&lt;/p&gt;</description></item></channel></rss>