<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Vector-Db on 诚实雷达</title><link>https://honestradar.com/tags/vector-db/</link><description>Recent content in Vector-Db on 诚实雷达</description><generator>Hugo -- gohugo.io</generator><language>zh-cn</language><lastBuildDate>Thu, 25 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://honestradar.com/tags/vector-db/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Multi-Service Stack Cost Optimization: Run LLM, Vector DB, Agent &amp; Monitoring on One VPS</title><link>https://honestradar.com/vps-hosting/ai-multi-service-stack-cost-optimization-2026/</link><pubDate>Thu, 25 Jun 2026 00:00:00 +0000</pubDate><guid>https://honestradar.com/vps-hosting/ai-multi-service-stack-cost-optimization-2026/</guid><description>&lt;img src="https://honestradar.com/images/ai-multi-service-stack-cost-optimization-2026.jpg" alt="Featured image of post AI Multi-Service Stack Cost Optimization: Run LLM, Vector DB, Agent &amp; Monitoring on One VPS" /&gt;&lt;p&gt;Running a full AI stack — LLM inference, vector database, AI agents, workflow automation, and observability — typically costs $100+/month across multiple cloud providers. But with the right VPS strategy, you can consolidate everything onto a single affordable host and cut costs by 60–80%.&lt;/p&gt;
&lt;p&gt;This guide covers the real-world approach we use: selecting the right VPS specs, choosing lightweight open-source alternatives, implementing resource-aware scheduling, and monitoring cost-per-inference.&lt;/p&gt;
&lt;h2 id="why-consolidate-your-ai-stack"&gt;Why Consolidate Your AI Stack?
&lt;/h2&gt;&lt;p&gt;Most AI projects spread services across multiple providers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LLM inference&lt;/strong&gt; on a GPU VPS ($30–200/mo)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector database&lt;/strong&gt; on a separate cloud instance ($15–50/mo)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent/workflow engine&lt;/strong&gt; on another VPS ($10–30/mo)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; (Langfuse, Grafana) on yet another host ($10–20/mo)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&amp;rsquo;s up to four separate monthly bills, each with its own networking overhead, attack surface, and operational complexity.&lt;/p&gt;
&lt;p&gt;Consolidating to a single well-provisioned VPS gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Predictable costs&lt;/strong&gt; — one bill, easy to budget&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lower latency&lt;/strong&gt; — no cross-network hops between services&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simpler backups&lt;/strong&gt; — one snapshot captures everything&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easier security&lt;/strong&gt; — one firewall to manage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tradeoff is resource contention. We&amp;rsquo;ll cover how to mitigate that below.&lt;/p&gt;
&lt;h2 id="vps-specs-what-you-actually-need"&gt;VPS Specs: What You Actually Need
&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s the minimum viable configuration for running a full AI stack on one VPS:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Service&lt;/th&gt;
 &lt;th&gt;CPU&lt;/th&gt;
 &lt;th&gt;RAM&lt;/th&gt;
 &lt;th&gt;Disk&lt;/th&gt;
 &lt;th&gt;Notes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;LLM Inference&lt;/strong&gt; (7B param, quantized)&lt;/td&gt;
 &lt;td&gt;2 cores&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;40 GB SSD&lt;/td&gt;
 &lt;td&gt;llama.cpp, Ollama, or vLLM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Vector Database&lt;/strong&gt; (Chroma/Qdrant)&lt;/td&gt;
 &lt;td&gt;1 core&lt;/td&gt;
 &lt;td&gt;4 GB&lt;/td&gt;
 &lt;td&gt;20 GB SSD&lt;/td&gt;
 &lt;td&gt;SQLite-based Chroma for small datasets&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;AI Agents&lt;/strong&gt; (OpenClaw, n8n)&lt;/td&gt;
 &lt;td&gt;1 core&lt;/td&gt;
 &lt;td&gt;2 GB&lt;/td&gt;
 &lt;td&gt;10 GB SSD&lt;/td&gt;
 &lt;td&gt;Async workloads, bursty&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt; (Langfuse + Grafana)&lt;/td&gt;
 &lt;td&gt;0.5 core&lt;/td&gt;
 &lt;td&gt;1 GB&lt;/td&gt;
 &lt;td&gt;10 GB SSD&lt;/td&gt;
 &lt;td&gt;Low throughput, storage-heavy&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;~4.5 cores&lt;/td&gt;
 &lt;td&gt;~15 GB&lt;/td&gt;
 &lt;td&gt;~80 GB&lt;/td&gt;
 &lt;td&gt;Add 20% buffer&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
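&lt;p&gt;As a quick sanity check on the table, the totals and the suggested 20% buffer can be recomputed in a few lines. This is a minimal sketch: the per-service figures are copied from the table, while the dictionary keys and function names are our own illustrative choices.&lt;/p&gt;

```python
# Per-service requirements copied from the table: (cores, RAM in GB, disk in GB).
SERVICES = {
    "llm_inference": (2.0, 8, 40),
    "vector_db": (1.0, 4, 20),
    "agents": (1.0, 2, 10),
    "observability": (0.5, 1, 10),
}
BUFFER = 0.20  # the 20% headroom suggested in the Total row


def totals(services):
    """Sum each resource column across all services."""
    cores = sum(s[0] for s in services.values())
    ram_gb = sum(s[1] for s in services.values())
    disk_gb = sum(s[2] for s in services.values())
    return cores, ram_gb, disk_gb


def with_buffer(values, buffer=BUFFER):
    """Scale every total by (1 + buffer), rounded to one decimal place."""
    return tuple(round(v * (1 + buffer), 1) for v in values)


base = totals(SERVICES)
padded = with_buffer(base)
print("base totals (cores, GB RAM, GB disk):", base)
print("with 20% buffer:", padded)
```

&lt;p&gt;With the buffer applied, the stack calls for roughly 5.4 cores, 18 GB of RAM, and 96 GB of disk. That is slightly above a 4 vCPU / 16 GB plan, so a host of that size relies on the services not all peaking at once.&lt;/p&gt;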
&lt;h3 id="budget-option-4-vcpu--16-gb-ram"&gt;Budget Option: 4 vCPU / 16 GB RAM
&lt;/h3&gt;&lt;p&gt;This is the sweet spot. You can run:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ollama&lt;/strong&gt; with a 7B model (Q4_K_M quantized) for inference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qdrant&lt;/strong&gt; for vector search (up to ~500K embeddings)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt; or &lt;strong&gt;n8n&lt;/strong&gt; for agent workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Langfuse&lt;/strong&gt; for LLM observability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended hosts:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class="link" href="https://honestradar.com/recommend?aff=19978" target="_blank" rel="noopener"
 &gt;RackNerd&lt;/a&gt;&lt;/strong&gt; — Their annual plans often offer 4 vCPU / 16 GB for $50–80/year. Not the fastest network, but perfectly adequate for self-hosted AI. Use code &lt;code&gt;19978&lt;/code&gt; when signing up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class="link" href="https://honestradar.com/recommend?ref=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Hostinger VPS&lt;/a&gt;&lt;/strong&gt; — The 16 GB plan runs ~$15/mo. Better performance per dollar than most competitors. Use referral code &lt;code&gt;JZ1ZL8465QCG&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class="link" href="https://honestradar.com/recommend?ref=9706229" target="_blank" rel="noopener"
 &gt;Vultr&lt;/a&gt;&lt;/strong&gt; — Their 16 GB VPS is ~$24/mo. Good for bursty workloads since you can scale up/down quickly. Use ref &lt;code&gt;9706229&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="performance-option-8-vcpu--32-gb-ram--gpu"&gt;Performance Option: 8 vCPU / 32 GB RAM + GPU
&lt;/h3&gt;&lt;p&gt;If you want to run larger models (13B–34B params) or handle multiple concurrent users, you&amp;rsquo;ll need more resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU VPS only&lt;/strong&gt;: 8 vCPU / 32 GB / 100 GB NVMe (~$30–50/mo)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU VPS&lt;/strong&gt;: Adds $50–150/mo but enables much faster inference&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For most indie developers and small teams, the CPU-only path with quantized models is sufficient. A 7B model quantized to Q4_K_M runs acceptably on CPU and, by our rough estimate, delivers around 80% of the quality of a full-precision 13B model on most everyday tasks. Treat that figure as a rule of thumb rather than a benchmark result.&lt;/p&gt;
&lt;h2 id="architecture-how-to-organize-services"&gt;Architecture: How to Organize Services
&lt;/h2&gt;&lt;h3 id="container-orchestration-with-docker-compose"&gt;Container Orchestration with Docker Compose
&lt;/h3&gt;&lt;p&gt;Docker Compose is the simplest way to manage multiple services on a single VPS. Here&amp;rsquo;s our recommended &lt;code&gt;docker-compose.yml&lt;/code&gt; structure:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;version&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;3.8&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;services&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# LLM Inference&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;ollama&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;ollama/ollama:latest&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;ports&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#e6db74"&gt;&amp;#34;11434:11434&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;volumes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;./ollama:/root/.ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;deploy&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;resources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;limits&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;cpus&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;2.0&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;memory&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;6G&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;reservations&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;cpus&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;0.5&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;memory&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;2G&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# Vector Database&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;qdrant&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;qdrant/qdrant:latest&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;ports&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#e6db74"&gt;&amp;#34;6333:6333&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;volumes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;./qdrant/storage:/qdrant/storage&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;deploy&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;resources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;limits&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;cpus&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;1.0&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;memory&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;4G&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# AI Agent Framework&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;openclaw&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;ghcr.io/openclaw/openclaw:latest&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;ports&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#e6db74"&gt;&amp;#34;3000:3000&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;environment&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;OLLAMA_BASE_URL=http://ollama:11434&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;QDRANT_URL=http://qdrant:6333&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;depends_on&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;qdrant&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;deploy&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;resources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;limits&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;cpus&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;1.0&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;memory&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;2G&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# Observability&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;langfuse&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# The v2 image runs on Postgres alone; v3 also needs ClickHouse, Redis, and S3-compatible storage&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;langfuse/langfuse:2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;ports&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#e6db74"&gt;&amp;#34;3001:3000&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;environment&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#75715e"&gt;# Placeholder credentials and secrets; replace them before deploying&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;DATABASE_URL=postgresql://langfuse:password@postgres:5432/langfuse&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;NEXTAUTH_URL=http://localhost:3001&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;NEXTAUTH_SECRET=change-me&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;SALT=change-me&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;depends_on&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;postgres&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;deploy&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;resources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;limits&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;cpus&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;0.5&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;memory&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;1G&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#75715e"&gt;# PostgreSQL (backing database for Langfuse; n8n can share it)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;postgres&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;image&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;postgres:16-alpine&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;volumes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#ae81ff"&gt;./postgres/data:/var/lib/postgresql/data&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;environment&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#75715e"&gt;# User and database must match the DATABASE_URL used by Langfuse&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;POSTGRES_USER&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;langfuse&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;POSTGRES_PASSWORD&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;password&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;POSTGRES_DB&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;langfuse&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;deploy&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;resources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;limits&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;cpus&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;0.5&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;memory&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;1G&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Key principles:&lt;/p&gt;
&lt;ol&gt;
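&lt;li&gt;&lt;strong&gt;Resource limits per service&lt;/strong&gt;: Cap CPU and memory so one service cannot starve the others&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internal networking&lt;/strong&gt;: Services reach each other by service name on the Docker network (for example &lt;code&gt;http://ollama:11434&lt;/code&gt;), not via localhost&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Persistent volumes&lt;/strong&gt;: Data survives container restarts and image upgrades&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sequential startup&lt;/strong&gt;: Use &lt;code&gt;depends_on&lt;/code&gt; so Ollama starts before the agents. On its own this only orders container startup; pair it with a &lt;code&gt;healthcheck&lt;/code&gt; and &lt;code&gt;condition: service_healthy&lt;/code&gt; if agents must wait until Ollama is actually ready&lt;/li&gt;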
&lt;/ol&gt;
&lt;h3 id="resource-aware-scheduling"&gt;Resource-Aware Scheduling
&lt;/h3&gt;&lt;p&gt;Not all AI workloads are equal. Implement scheduling based on resource availability:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Priority tiers:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tier&lt;/th&gt;
 &lt;th&gt;Service&lt;/th&gt;
 &lt;th&gt;Behavior&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;P0&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;LLM inference&lt;/td&gt;
 &lt;td&gt;Always running, highest CPU allocation&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;P1&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Vector DB&lt;/td&gt;
 &lt;td&gt;Always running, moderate RAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;P2&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Agent framework&lt;/td&gt;
 &lt;td&gt;Runs on-demand, burst-capable&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;P3&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Observability&lt;/td&gt;
 &lt;td&gt;Low priority, can be paused&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Implementation with systemd:&lt;/strong&gt;&lt;/p&gt;
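&lt;p&gt;If Ollama runs as a native systemd service rather than in a container, create a drop-in override (for example with &lt;code&gt;systemctl edit ollama&lt;/code&gt;) so it gets CPU priority:&lt;/p&gt;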
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-ini" data-lang="ini"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;CPUWeight&lt;/span&gt;&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;800&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;MemoryHigh&lt;/span&gt;&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;70%&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;MemoryMax&lt;/span&gt;&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;80%&lt;/span&gt;
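&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;CPUWeight=800&lt;/code&gt; (the default is 100) tells the kernel to favor Ollama during CPU contention while still allowing other services to run. &lt;code&gt;MemoryHigh&lt;/code&gt; starts throttling and reclaiming memory once the service passes 70% of RAM, and &lt;code&gt;MemoryMax&lt;/code&gt; sets the hard cap at 80%.&lt;/p&gt;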
&lt;h2 id="cost-per-inference-tracking-your-real-spend"&gt;Cost Per Inference: Tracking Your Real Spend
&lt;/h2&gt;&lt;p&gt;One of the biggest mistakes we see is not tracking the cost per operation. Here&amp;rsquo;s how to measure it:&lt;/p&gt;
&lt;h3 id="using-langfuse-for-cost-attribution"&gt;Using Langfuse for Cost Attribution
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://honestradar.com/recommend?ref=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Langfuse&lt;/a&gt; is an open-source LLM observability platform that tracks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Token usage per request&lt;/li&gt;
&lt;li&gt;Latency per model&lt;/li&gt;
&lt;li&gt;Cost per user/session&lt;/li&gt;
&lt;li&gt;Error rates by service&lt;/li&gt;
&lt;/ul&gt;
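&lt;p&gt;Setup is simple: run Langfuse as a Docker container (see compose above) and send your Ollama calls through Langfuse&amp;rsquo;s OpenAI-compatible client wrapper, which records each request as a trace:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Langfuse wraps the OpenAI client; Ollama serves an OpenAI-compatible API under /v1.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST (e.g. http://localhost:3001) are set.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; langfuse.openai &lt;span style="color:#f92672"&gt;import&lt;/span&gt; OpenAI
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# The client requires an api_key value, but Ollama ignores it.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;client &lt;span style="color:#f92672"&gt;=&lt;/span&gt; OpenAI(base_url&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;http://localhost:11434/v1&amp;#34;&lt;/span&gt;, api_key&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;ollama&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;response &lt;span style="color:#f92672"&gt;=&lt;/span&gt; client&lt;span style="color:#f92672"&gt;.&lt;/span&gt;chat&lt;span style="color:#f92672"&gt;.&lt;/span&gt;completions&lt;span style="color:#f92672"&gt;.&lt;/span&gt;create(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    model&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;llama3.2&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    messages&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[{&lt;span style="color:#e6db74"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;user&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;Hello&amp;#34;&lt;/span&gt;}],
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;print(response&lt;span style="color:#f92672"&gt;.&lt;/span&gt;choices[&lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;]&lt;span style="color:#f92672"&gt;.&lt;/span&gt;message&lt;span style="color:#f92672"&gt;.&lt;/span&gt;content)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="typical-costs-on-a-15mo-vps"&gt;Typical Costs on a $15/mo VPS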
&lt;/h3&gt;&lt;p&gt;With a single VPS running all services:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Metric&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Monthly VPS cost&lt;/td&gt;
 &lt;td&gt;$15 (Hostinger 16 GB)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Daily requests supported&lt;/td&gt;
 &lt;td&gt;~10,000 (7B model, Q4)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Requests per month&lt;/td&gt;
 &lt;td&gt;~300,000 (10K req/day × 30 days)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Cost per inference&lt;/td&gt;
 &lt;td&gt;~$0.00005 ($15 ÷ 300,000 requests)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Total effective cost&lt;/td&gt;
 &lt;td&gt;$15/month (fixed, regardless of request volume)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
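&lt;p&gt;Cost per inference on a flat-rate VPS is simply the monthly bill divided by the number of requests served in that month. The sketch below assumes a 30-day month and uses the $15 plan and 10,000 requests/day from the table; the function name is illustrative.&lt;/p&gt;

```python
def cost_per_inference(monthly_cost_usd, requests_per_day, days_per_month=30):
    """Spread a fixed monthly bill evenly over every request served that month."""
    monthly_requests = requests_per_day * days_per_month
    return monthly_cost_usd / monthly_requests


full_load = cost_per_inference(15.0, 10_000)  # server busy at its stated capacity
light_load = cost_per_inference(15.0, 500)    # same server, far less traffic

print(f"10,000 req/day: ${full_load:.5f} per request")
print(f"   500 req/day: ${light_load:.5f} per request")
```

&lt;p&gt;Because the bill is fixed, the per-request figure rises as traffic falls: at 500 requests a day the same server works out to about $0.001 per request.&lt;/p&gt;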
&lt;p&gt;Compare this to cloud APIs:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Provider&lt;/th&gt;
 &lt;th&gt;Cost per 1M tokens&lt;/th&gt;
 &lt;th&gt;Equivalent monthly (10K req/day)&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
 &lt;td&gt;$2.50/$10.00 (input/output)&lt;/td&gt;
 &lt;td&gt;$75–300/month&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Anthropic Claude (Sonnet tier)&lt;/td&gt;
 &lt;td&gt;$3.00/$15.00&lt;/td&gt;
 &lt;td&gt;$90–450/month&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Self-hosted Ollama&lt;/td&gt;
 &lt;td&gt;$0 (hardware only)&lt;/td&gt;
 &lt;td&gt;$15/month&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The savings are dramatic, especially at scale.&lt;/p&gt;
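&lt;p&gt;The monthly API estimates depend heavily on how many tokens each request uses. As a rough sketch, assuming about 100 tokens per request (our assumption, chosen because it reproduces the ranges in the table) and per-million-token prices taken from the table:&lt;/p&gt;

```python
REQUESTS_PER_DAY = 10_000
DAYS_PER_MONTH = 30
TOKENS_PER_REQUEST = 100  # assumption: short prompts and short replies


def monthly_api_cost(price_per_million_usd, tokens_per_request=TOKENS_PER_REQUEST):
    """Monthly bill if every token is charged at a single per-million rate."""
    monthly_tokens = REQUESTS_PER_DAY * DAYS_PER_MONTH * tokens_per_request
    return monthly_tokens / 1_000_000 * price_per_million_usd


# Per-million-token prices as listed in the comparison table.
for price in (2.50, 3.00, 15.00):
    print(f"${price:.2f} per 1M tokens: ${monthly_api_cost(price):,.2f} per month")
```

&lt;p&gt;The low end of each range corresponds to billing every token at the input rate and the high end to billing every token at the output rate. Longer prompts or responses scale the bill proportionally, while the VPS cost stays flat.&lt;/p&gt;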
&lt;h2 id="model-selection-quality-vs-cost-tradeoffs"&gt;Model Selection: Quality vs. Cost Tradeoffs
&lt;/h2&gt;&lt;p&gt;Not all models are created equal. Here&amp;rsquo;s a practical guide:&lt;/p&gt;
&lt;h3 id="recommended-models-for-different-tasks"&gt;Recommended Models for Different Tasks
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Task&lt;/th&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Size&lt;/th&gt;
 &lt;th&gt;Quantization&lt;/th&gt;
 &lt;th&gt;Approx. RAM&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Code generation&lt;/td&gt;
 &lt;td&gt;Qwen2.5-Coder&lt;/td&gt;
 &lt;td&gt;7B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;General chat&lt;/td&gt;
 &lt;td&gt;Mistral 7B&lt;/td&gt;
 &lt;td&gt;7B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Reasoning&lt;/td&gt;
 &lt;td&gt;Llama 3.2&lt;/td&gt;
 &lt;td&gt;3B&lt;/td&gt;
 &lt;td&gt;Q4_K_S&lt;/td&gt;
 &lt;td&gt;2.5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Multilingual&lt;/td&gt;
 &lt;td&gt;Qwen2.5&lt;/td&gt;
 &lt;td&gt;7B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Large context&lt;/td&gt;
 &lt;td&gt;Mixtral&lt;/td&gt;
 &lt;td&gt;8x22B&lt;/td&gt;
 &lt;td&gt;AWQ (4-bit)&lt;/td&gt;
 &lt;td&gt;~80 GB (needs a larger GPU server)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="when-to-upgrade-to-gpu"&gt;When to Upgrade to GPU
&lt;/h3&gt;&lt;p&gt;GPU acceleration becomes worthwhile when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You serve &lt;strong&gt;100+ concurrent users&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;You run &lt;strong&gt;models larger than 13B params&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Latency matters (&lt;strong&gt;sub-second response times&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;re doing &lt;strong&gt;real-time streaming&lt;/strong&gt; responses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For most personal projects and small teams, CPU inference with quantized models is perfectly adequate. A quantized 7B model on a modern multi-core CPU can reach roughly 10–20 tokens/sec, depending on core count and memory bandwidth, which is fast enough for chat interfaces and agent workflows.&lt;/p&gt;
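&lt;p&gt;To see what 10–20 tokens/sec means for capacity, the generation rate can be converted into an upper bound on requests per day. A minimal sketch, assuming one request is processed at a time, the server generates continuously, and each response is about 100 tokens (an illustrative figure, not a measurement):&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_requests_per_day(tokens_per_second, tokens_per_response):
    """Upper bound for one request at a time, generating nonstop for 24 hours."""
    return tokens_per_second * SECONDS_PER_DAY // tokens_per_response


for rate in (10, 20):
    # 100 tokens per response is an assumed, fairly short reply
    print(f"{rate} tok/s: up to {max_requests_per_day(rate, 100):,} requests/day")
```

&lt;p&gt;That puts the ceiling between roughly 8,600 and 17,300 short responses per day. Longer answers or idle periods lower it accordingly.&lt;/p&gt;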
&lt;h2 id="backup-strategy-dont-lose-your-ai-stack"&gt;Backup Strategy: Don&amp;rsquo;t Lose Your AI Stack
&lt;/h2&gt;&lt;p&gt;A single VPS means a single point of failure. Here&amp;rsquo;s how to protect your investment:&lt;/p&gt;
&lt;h3 id="automated-backups"&gt;Automated Backups
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# backup-ai-stack.sh&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Assumes the containers are named ollama, qdrant, and postgres (e.g. via container_name).&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;set -euo pipefail &lt;span style="color:#75715e"&gt;# stop at the first failed command&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;TIMESTAMP&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;$(&lt;/span&gt;date +%Y%m%d-%H%M%S&lt;span style="color:#66d9ef"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;BACKUP_DIR&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;/backups/&lt;/span&gt;&lt;span style="color:#e6db74"&gt;${&lt;/span&gt;TIMESTAMP&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;mkdir -p &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#e6db74"&gt;${&lt;/span&gt;BACKUP_DIR&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#75715e"&gt;# the redirects below fail if this directory is missing&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Backup Ollama models and data&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker exec ollama tar czf - /root/.ollama &amp;gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#e6db74"&gt;${&lt;/span&gt;BACKUP_DIR&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;/ollama.tar.gz&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Backup Qdrant vectors&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker exec qdrant tar czf - /qdrant/storage &amp;gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#e6db74"&gt;${&lt;/span&gt;BACKUP_DIR&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;/qdrant.tar.gz&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Backup PostgreSQL (Langfuse + n8n); the role given to -U must exist and be a superuser&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker exec postgres pg_dumpall -U langfuse &amp;gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#e6db74"&gt;${&lt;/span&gt;BACKUP_DIR&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;/postgres.sql&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Upload to object storage (for Backblaze B2 or Wasabi, add --endpoint-url with the provider S3 endpoint)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;aws s3 cp &lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;&lt;span style="color:#e6db74"&gt;${&lt;/span&gt;BACKUP_DIR&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;s3://ai-stack-backups/&lt;/span&gt;&lt;span style="color:#e6db74"&gt;${&lt;/span&gt;TIMESTAMP&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;/&amp;#34;&lt;/span&gt; --recursive
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Keep only last 7 days (-mindepth 1 protects /backups itself)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;find /backups -mindepth &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt; -maxdepth &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt; -type d -mtime +7 -exec rm -rf &lt;span style="color:#f92672"&gt;{}&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Schedule this daily with cron:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-cron" data-lang="cron"&gt;0 3 * * * /usr/local/bin/backup-ai-stack.sh
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="disaster-recovery-checklist"&gt;Disaster Recovery Checklist
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Provision new VPS&lt;/strong&gt; — Same specs, fresh Ubuntu/Debian&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pull docker-compose.yml&lt;/strong&gt; — From git repo&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Restore backups&lt;/strong&gt; — Download from object storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify services&lt;/strong&gt; — Check each container status&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test inference&lt;/strong&gt; — Run a quick Ollama query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Update DNS&lt;/strong&gt; — Point domain to new IP&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Total recovery time: ~30 minutes with automated backups.&lt;/p&gt;
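&lt;p&gt;Steps 3 to 5 of the checklist can be sketched as one script. This is a minimal sketch, not a tested recovery procedure: the container name &lt;code&gt;ollama&lt;/code&gt;, the &lt;code&gt;/tmp&lt;/code&gt; staging paths, the model tag &lt;code&gt;llama3.2:3b&lt;/code&gt;, and a compose file in the current directory are all assumptions. It mirrors the tar-based backup above and defaults to a dry run that only prints the commands.&lt;/p&gt;

```shell
#!/bin/sh
# Restore sketch for checklist steps 3-5. Assumed, not taken from the article:
# the container name ollama, a compose file in the current directory,
# the model tag llama3.2:3b, and the /tmp staging paths.
set -eu

# Print each command; run it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# restore_stack TIMESTAMP: TIMESTAMP is the folder name used at upload time.
restore_stack() {
  ts="$1"
  dir="/backups/${ts}"

  # Step 3: download the backup set, then start the containers.
  run aws s3 cp "s3://ai-stack-backups/${ts}/" "${dir}" --recursive
  run docker compose up -d

  # Qdrant: the archive stores paths as qdrant/storage/..., so unpack at /.
  run docker cp "${dir}/qdrant.tar.gz" qdrant:/tmp/qdrant.tar.gz
  run docker exec qdrant tar xzf /tmp/qdrant.tar.gz -C /
  run docker restart qdrant

  # PostgreSQL: replay the pg_dumpall output with psql.
  run docker cp "${dir}/postgres.sql" postgres:/tmp/postgres.sql
  run docker exec postgres psql -U langfuse -d postgres -f /tmp/postgres.sql

  # Steps 4 and 5: container status, then one short Ollama prompt.
  run docker compose ps
  run docker exec ollama ollama run llama3.2:3b "Reply with OK"
}
```

&lt;p&gt;Call &lt;code&gt;restore_stack&lt;/code&gt; with the timestamp folder to restore and review the printed commands first; setting &lt;code&gt;DRY_RUN=0&lt;/code&gt; executes them.&lt;/p&gt;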
&lt;h2 id="security-hardening-for-public-facing-ai-services"&gt;Security Hardening for Public-Facing AI Services
&lt;/h2&gt;&lt;p&gt;Running AI services on a public VPS requires careful security configuration:&lt;/p&gt;
&lt;h3 id="essential-steps"&gt;Essential Steps
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Firewall rules&lt;/strong&gt; — Only expose necessary ports:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw allow 22/tcp &lt;span style="color:#75715e"&gt;# SSH&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw allow 443/tcp &lt;span style="color:#75715e"&gt;# HTTPS (reverse proxy)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw allow 11434/tcp &lt;span style="color:#75715e"&gt;# Ollama API (internal only, behind proxy)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw enable
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Reverse proxy with Cloudflare Tunnel&lt;/strong&gt; — Hide your VPS IP entirely:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;cloudflared tunnel --url http://localhost:11434
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt; — Protect Langfuse and agent dashboards:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-nginx" data-lang="nginx"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;server&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;listen&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;443&lt;/span&gt; &lt;span style="color:#e6db74"&gt;ssl&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;server_name&lt;/span&gt; &lt;span style="color:#e6db74"&gt;ai.yourdomain.com&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;location&lt;/span&gt; &lt;span style="color:#e6db74"&gt;/&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;auth_basic&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;AI&lt;/span&gt; &lt;span style="color:#e6db74"&gt;Dashboard&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;auth_basic_user_file&lt;/span&gt; &lt;span style="color:#e6db74"&gt;/etc/nginx/.htpasswd&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_pass&lt;/span&gt; &lt;span style="color:#e6db74"&gt;http://localhost:3001&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt; — Prevent abuse:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-nginx" data-lang="nginx"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;limit_req_zone&lt;/span&gt; $binary_remote_addr &lt;span style="color:#e6db74"&gt;zone=ai_limit:10m&lt;/span&gt; &lt;span style="color:#e6db74"&gt;rate=10r/s&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;location&lt;/span&gt; &lt;span style="color:#e6db74"&gt;/api/&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;limit_req&lt;/span&gt; &lt;span style="color:#e6db74"&gt;zone=ai_limit&lt;/span&gt; &lt;span style="color:#e6db74"&gt;burst=20&lt;/span&gt; &lt;span style="color:#e6db74"&gt;nodelay&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_pass&lt;/span&gt; &lt;span style="color:#e6db74"&gt;http://localhost:11434&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Regular updates&lt;/strong&gt; — Keep the base OS patched:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-cron" data-lang="cron"&gt;0 4 * * 0 apt update &amp;amp;&amp;amp; apt upgrade -y
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="scaling-beyond-a-single-vps"&gt;Scaling Beyond a Single VPS
&lt;/h2&gt;&lt;p&gt;Eventually you&amp;rsquo;ll outgrow one machine. Here&amp;rsquo;s the migration path:&lt;/p&gt;
&lt;h3 id="phase-1-split-vector-db-month-13"&gt;Phase 1: Split Vector DB (Month 1–3)
&lt;/h3&gt;&lt;p&gt;Move Qdrant to a separate VPS when your embedding collection exceeds 1M vectors. This frees 4 GB RAM on the main host.&lt;/p&gt;
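&lt;p&gt;One way to watch for that threshold is a small check like the sketch below. The helper names and the collection name you pass in are hypothetical; the count lookup assumes Qdrant&amp;rsquo;s REST API on its default port 6333 and that &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; are installed.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helpers: compare a collection's point count with the
# 1M-vector threshold named above.
QDRANT_SPLIT_THRESHOLD=1000000

# qdrant_points_count COLLECTION: read points_count from the collection info.
# Assumes Qdrant listens on localhost:6333 and jq is available.
qdrant_points_count() {
  curl -s "http://localhost:6333/collections/$1" | jq '.result.points_count'
}

# should_split_qdrant COUNT: print "split" above the threshold, else "stay".
should_split_qdrant() {
  if [ "$1" -gt "$QDRANT_SPLIT_THRESHOLD" ]; then
    echo "split"
  else
    echo "stay"
  fi
}
```

&lt;p&gt;For example, &lt;code&gt;should_split_qdrant "$(qdrant_points_count my_collection)"&lt;/code&gt; could run from the same cron schedule as the backup, with &lt;code&gt;my_collection&lt;/code&gt; replaced by your own collection name.&lt;/p&gt;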
&lt;h3 id="phase-2-gpu-inference-month-36"&gt;Phase 2: GPU Inference (Month 3–6)
&lt;/h3&gt;&lt;p&gt;Move LLM inference to a GPU VPS (&lt;a class="link" href="https://honestradar.com/recommend?ref=9706229" target="_blank" rel="noopener"
 &gt;Vultr GPU&lt;/a&gt; starts at $50/mo for A10G) when concurrent users exceed 50 or latency becomes unacceptable.&lt;/p&gt;
&lt;h3 id="phase-3-full-microservices-month-6"&gt;Phase 3: Full Microservices (Month 6+)
&lt;/h3&gt;&lt;p&gt;Split each service onto its own VPS when you need independent scaling. Use &lt;a class="link" href="https://honestradar.com/recommend?aff=19978" target="_blank" rel="noopener"
 &gt;RackNerd&lt;/a&gt; budget VPSes for non-critical services (monitoring, databases).&lt;/p&gt;
&lt;h2 id="summary-your-action-plan"&gt;Summary: Your Action Plan
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pick a VPS&lt;/strong&gt; — Hostinger 16 GB (~$15/mo) or RackNerd annual deal (~$80/year) for the budget path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploy Docker Compose&lt;/strong&gt; — Use the template above as a starting point&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pull a quantized model&lt;/strong&gt; — &lt;code&gt;ollama pull llama3.2:3b-instruct-q4_K_M&lt;/code&gt; (about 2 GB)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set up Langfuse&lt;/strong&gt; — Enable observability from day one&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure backups&lt;/strong&gt; — Automate daily snapshots to object storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor costs&lt;/strong&gt; — Track token usage and adjust as needed&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The total cost for a fully functional AI stack: &lt;strong&gt;$15–20/month&lt;/strong&gt; on a single VPS, versus $100–300/month if you used managed cloud services for each component.&lt;/p&gt;
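&lt;p&gt;As a quick worked check, the savings implied by those two price ranges can be computed with shell integer arithmetic. The function name &lt;code&gt;savings_pct&lt;/code&gt; is made up for this illustration, and the inputs are simply the endpoints of the ranges quoted above, taken at face value.&lt;/p&gt;

```shell
#!/bin/sh
# savings_pct SELF_HOSTED MANAGED: whole-number percent saved (integer math).
savings_pct() {
  echo $(( ($2 - $1) * 100 / $2 ))
}

savings_pct 20 100   # smallest gap in the quoted ranges, prints 80
savings_pct 15 300   # largest gap in the quoted ranges, prints 95
```

&lt;p&gt;Your own figure depends on which managed services you would otherwise pay for, so substitute your real monthly bills for the second argument.&lt;/p&gt;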
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This article is based on real deployment experience running multiple AI services on budget VPS infrastructure. All affiliate links support Honest Radar&amp;rsquo;s research and testing.&lt;/em&gt;&lt;/p&gt;</description></item></channel></rss>