<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Cost Savings on 诚实雷达</title><link>https://honestradar.com/tags/cost-savings/</link><description>Recent content in Cost Savings on 诚实雷达</description><generator>Hugo -- gohugo.io</generator><language>zh-cn</language><lastBuildDate>Thu, 18 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://honestradar.com/tags/cost-savings/index.xml" rel="self" type="application/rss+xml"/><item><title>How to Self-Host an LLM on a Cheap VPS in 2026: Save Hundreds with Ollama + Open WebUI</title><link>https://honestradar.com/vps-hosting/vps-self-host-llm-guide-2026/</link><pubDate>Thu, 18 Jun 2026 00:00:00 +0000</pubDate><guid>https://honestradar.com/vps-hosting/vps-self-host-llm-guide-2026/</guid><description>&lt;h2 id="introduction"&gt;Introduction
&lt;/h2&gt;&lt;p&gt;If you&amp;rsquo;ve been building AI agents, chatbots, or internal knowledge bases in 2026, you&amp;rsquo;ve probably felt the pain of API bills. Send 1 million tokens a day through Claude or GPT-5 and you&amp;rsquo;re looking at $50-$270 per month. Scale that to 10 million tokens a day and the bill hits thousands.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what changed: the open-source LLMs of 2026 are genuinely good. Llama 4, Qwen 3.6, Gemma 4, and GLM-5.1 now match or beat proprietary models on key benchmarks. And the tools to run them — Ollama, vLLM, Open WebUI — have matured to the point where you can go from zero to a fully functional AI chat interface in under 10 minutes.&lt;/p&gt;
&lt;p&gt;This guide walks you through everything: choosing the right VPS, installing Ollama, deploying Open WebUI, and doing the math on whether self-hosting actually saves you money.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; We may earn a commission when you use our affiliate links. This doesn&amp;rsquo;t affect our pricing or recommendations.&lt;/p&gt;
&lt;h2 id="why-self-host-your-llm"&gt;Why Self-Host Your LLM?
&lt;/h2&gt;&lt;p&gt;Three reasons dominate the decision:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Cost at scale.&lt;/strong&gt; At 500K tokens per day, API costs run $200-$500/month for flagship models. A $48/month VPS running the same workload costs less than a quarter of that. The breakeven point — where self-hosting becomes cheaper than API — is typically around 2-4 million tokens per month for 7B-14B models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Data privacy.&lt;/strong&gt; Your prompts, your documents, your customer data — none of it leaves your server. For healthcare, legal, finance, or any regulated industry, this isn&amp;rsquo;t optional. It&amp;rsquo;s a compliance requirement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. No vendor lock-in.&lt;/strong&gt; API providers change pricing overnight. They deprecate models. They impose rate limits. When the model lives on your VPS, you control the upgrade path, the version, the quantization — everything.&lt;/p&gt;
&lt;h2 id="hardware-requirements-what-vps-specs-do-you-actually-need"&gt;Hardware Requirements: What VPS Specs Do You Actually Need?
&lt;/h2&gt;&lt;p&gt;The specs depend entirely on which model you want to run. Here&amp;rsquo;s the realistic breakdown for CPU-only inference (no GPU):&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Min RAM&lt;/th&gt;
 &lt;th&gt;Comfortable RAM&lt;/th&gt;
 &lt;th&gt;Storage&lt;/th&gt;
 &lt;th&gt;Tokens/sec (CPU)&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Qwen 3.6 3B (Q4)&lt;/td&gt;
 &lt;td&gt;4 GB&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;3 GB&lt;/td&gt;
 &lt;td&gt;8-15&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 4 8B (Q4)&lt;/td&gt;
 &lt;td&gt;6 GB&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;5 GB&lt;/td&gt;
 &lt;td&gt;5-10&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Gemma 4 12B (Q4)&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;16 GB&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;3-7&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 4 70B (Q4)&lt;/td&gt;
 &lt;td&gt;48 GB&lt;/td&gt;
 &lt;td&gt;64 GB&lt;/td&gt;
 &lt;td&gt;40 GB&lt;/td&gt;
 &lt;td&gt;1-3&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; For most use cases (coding assistance, document analysis, chat), a 7B-14B model with 8GB RAM is the sweet spot. You get 80% of the quality at 20% of the cost.&lt;/p&gt;
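&lt;p&gt;The RAM figures in the table can be sanity-checked with a rough rule of thumb: weight size is about the parameter count times the bits per weight, divided by 8. The sketch below assumes 4.5 bits per weight for a 4-bit format that stores per-block scales. That figure and the helper name &lt;code&gt;model_size_gb&lt;/code&gt; are illustrative assumptions, and real deployments need extra memory for the context window and runtime buffers.&lt;/p&gt;

```shell
# Approximate weight size in GB for a quantized model.
# $1 = parameters in billions, $2 = assumed bits per weight (default 4.5).
model_size_gb() {
  awk -v params="$1" -v bits="${2:-4.5}" \
    'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

# Estimate the three model sizes that appear in the table above.
for size in 3 8 70; do
  echo "${size}B parameters: about $(model_size_gb "$size") GB of weights"
done
```

&lt;p&gt;For the 8B and 70B rows the estimate lands just under the table&amp;rsquo;s storage column (5 GB and 40 GB), which suggests the remaining RAM in the minimum column is headroom for context and the operating system.&lt;/p&gt;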
&lt;p&gt;If you need serious throughput (concurrent users, RAG pipelines, function calling), look at GPU VPS instances. Vultr and specialized GPU hosts offer RTX-based instances starting at ~$0.50/hr, but that&amp;rsquo;s a different cost category entirely.&lt;/p&gt;
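&lt;p&gt;The tokens/sec column also sets a hard ceiling on daily volume, which is worth checking before comparing against API usage. A minimal sketch, assuming the server generates continuously for all 86,400 seconds of a day (real workloads will see less); &lt;code&gt;max_tokens_per_day&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Upper bound on tokens generated per day at a sustained integer rate
# (tokens/sec), assuming the server never sits idle.
max_tokens_per_day() {
  echo $(( $1 * 86400 ))
}

# Rates taken from the low, middle, and high end of the CPU table above.
for rate in 3 7 10; do
  echo "${rate} tok/s sustained: at most $(max_tokens_per_day "$rate") tokens/day"
done
```

&lt;p&gt;At 3 to 7 tokens/sec, a 12B model on CPU tops out at roughly 260K to 605K generated tokens per day, so heavier workloads point toward the GPU options mentioned above.&lt;/p&gt;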
&lt;h2 id="tool-comparison-ollama-vs-vllm"&gt;Tool Comparison: Ollama vs vLLM
&lt;/h2&gt;&lt;p&gt;Two tools dominate the self-hosted LLM space in 2026. Here&amp;rsquo;s how they compare:&lt;/p&gt;
&lt;h3 id="ollama--best-for-getting-started"&gt;Ollama — Best for Getting Started
&lt;/h3&gt;&lt;p&gt;Ollama is a single binary. Install it, pull a model, run it. Done. Under the hood it uses llama.cpp with GGUF quantization, which means excellent CPU support and compact model files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zero configuration — &lt;code&gt;ollama pull llama4:8b&lt;/code&gt; and you&amp;rsquo;re running&lt;/li&gt;
&lt;li&gt;Automatic CPU fallback — works on any Linux VPS, no GPU required&lt;/li&gt;
&lt;li&gt;REST API compatible with OpenAI format — swap in your existing code&lt;/li&gt;
&lt;li&gt;Built-in model library — 50+ models available with one command&lt;/li&gt;
&lt;li&gt;Great for development, personal projects, small teams&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single-stream inference — handles one request at a time&lt;/li&gt;
&lt;li&gt;Throughput limited to ~10 tokens/sec on CPU hardware&lt;/li&gt;
&lt;li&gt;No continuous batching — concurrent requests queue up&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="vllm--best-for-production"&gt;vLLM — Best for Production
&lt;/h3&gt;&lt;p&gt;vLLM uses PagedAttention to achieve 14-24x throughput improvement over naive implementations. It&amp;rsquo;s what you reach for when you need to serve dozens of concurrent users.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Continuous batching — handles many requests simultaneously&lt;/li&gt;
&lt;li&gt;PagedAttention — near-zero KV cache memory waste (&amp;lt;4% vs 60-80%)&lt;/li&gt;
&lt;li&gt;OpenAI-compatible API — drop-in replacement for API calls&lt;/li&gt;
&lt;li&gt;GPU-optimized — saturates GPU memory efficiently&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires NVIDIA GPU with CUDA — CPU mode exists but is impractical&lt;/li&gt;
&lt;li&gt;More complex setup — Docker, CUDA toolkit, model loading&lt;/li&gt;
&lt;li&gt;Overkill for single-user or development scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Start with Ollama. If you hit throughput limits, migrate to vLLM. Many teams use both — Ollama for development, vLLM for production.&lt;/p&gt;
&lt;h2 id="vps-provider-comparison-for-self-hosted-llm"&gt;VPS Provider Comparison for Self-Hosted LLM
&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s how three budget-friendly VPS providers stack up for running self-hosted LLM workloads:&lt;/p&gt;
&lt;h3 id="racknerd--best-budget-option"&gt;RackNerd — Best Budget Option
&lt;/h3&gt;&lt;p&gt;RackNerd offers KVM VPS plans starting at $11.29/year (~$0.94/month). Their 3.5GB plan ($32.49/year) gives you 2 vCPU cores, 3.5GB RAM, 65GB SSD, and 7TB bandwidth. That is enough for a 3B model with Q4 quantization, but it falls short of the 6GB minimum listed above for an 8B model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Extremely cheap, annual price lock (no renewal shock), 21 data center locations, KVM virtualization with Docker support.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; SolusVM control panel feels dated and basic, no snapshots/backups, community support only.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Hobbyists, personal AI assistants, development environments, low-traffic internal tools.&lt;/p&gt;
&lt;p&gt;👉 &lt;a class="link" href="https://my.racknerd.com/aff.php?aff=19978" target="_blank" rel="noopener"
 &gt;Check RackNerd VPS Plans&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="hostinger--best-balance-of-price-and-ease"&gt;Hostinger — Best Balance of Price and Ease
&lt;/h3&gt;&lt;p&gt;Hostinger&amp;rsquo;s KVM VPS starts at $6.49/month (KVM 1) for 1 vCPU, 4GB RAM, 50GB NVMe, and 4TB bandwidth. The KVM 2 plan ($8.99/month) bumps you to 2 vCPU, 8GB RAM, and 100GB NVMe, which is comfortable for an 8B model and meets the minimum for a 12B model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Excellent hPanel control panel, NVMe storage (fast I/O for model loading), weekly backups included, 30-day money-back guarantee, Kodee AI assistant built in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Renewal prices jump 2-3x (KVM 1 renews at $19.49/month), no GPU options, 1Gbps network cap on lower tiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams needing reliability plus ease of use, production workloads that need backups.&lt;/p&gt;
&lt;p&gt;👉 &lt;a class="link" href="https://www.hostinger.com/vps-hosting?REFERRALCODE=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Check Hostinger VPS Plans&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="vultr--best-for-gpu-workloads"&gt;Vultr — Best for GPU Workloads
&lt;/h3&gt;&lt;p&gt;Vultr&amp;rsquo;s High Frequency instances start at $6/month (1GB RAM, 1 vCPU). For LLM workloads, their optimized instances with GPU access are the real draw — RTX 4090 instances for deep learning, or their VX1 line for cost-efficient general compute.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Largest global presence (32 locations), GPU instances available, hourly billing (scale up/down), NVMe storage, High Frequency option with dedicated CPU.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Higher base prices than RackNerd, GPU instances are expensive ($0.50-4.00/hr), bandwidth costs add up quickly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams needing GPU acceleration, global deployment, or hourly-scaling workloads.&lt;/p&gt;
&lt;p&gt;👉 &lt;a class="link" href="https://www.vultr.com/?ref=9706229" target="_blank" rel="noopener"
 &gt;Check Vultr VPS Plans&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="step-by-step-deploy-ollama--open-webui-on-your-vps"&gt;Step-by-Step: Deploy Ollama + Open WebUI on Your VPS
&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s the complete walkthrough. We&amp;rsquo;ll use a $6-12/month VPS (any of the providers above) and deploy Ollama with Open WebUI as the frontend.&lt;/p&gt;
&lt;h3 id="step-1-provision-your-vps"&gt;Step 1: Provision Your VPS
&lt;/h3&gt;&lt;p&gt;Choose your VPS provider and create a new instance. Recommended specs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OS:&lt;/strong&gt; Ubuntu 22.04 or 24.04 LTS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM:&lt;/strong&gt; 8GB recommended (4GB is enough for a 3B model; plan on 16GB for 12B+)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage:&lt;/strong&gt; 50GB+ NVMe/SSD (models take 3-40GB depending on size)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU:&lt;/strong&gt; 2+ cores recommended&lt;/li&gt;
&lt;/ul&gt;
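&lt;p&gt;Once the instance is up, the specs above can be checked from a shell before installing anything. This is a sketch for a Linux VPS with GNU coreutils. The helper names &lt;code&gt;meets_min&lt;/code&gt; and &lt;code&gt;report&lt;/code&gt; and the exact thresholds are assumptions chosen for illustration, so adjust them to the model you plan to run.&lt;/p&gt;

```shell
# Succeeds when an observed integer value meets a required minimum.
meets_min() {
  [ "$1" -ge "$2" ]
}

# Print one pass/fail line for a named resource.
report() {
  if meets_min "$2" "$3"; then
    echo "$1: ok ($2, need $3)"
  else
    echo "$1: below target ($2, need $3)"
  fi
}

# Read total RAM (MiB), CPU cores, and free disk on / (GiB) from the system.
ram_mib=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
cores=$(nproc)
disk_gib=$(df -BG --output=avail / | tail -n 1 | tr -dc '0-9')

# Thresholds are assumptions: 7500 MiB allows for an "8GB" plan reporting
# slightly less than 8192 MiB once the kernel reserves memory.
report "RAM (MiB)" "$ram_mib" 7500
report "CPU cores" "$cores" 2
report "Free disk (GiB)" "$disk_gib" 50
```

&lt;p&gt;A line reading &lt;code&gt;below target&lt;/code&gt; does not block the install. It only flags that a smaller model or a larger plan may be the better fit.&lt;/p&gt;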
&lt;h3 id="step-2-install-ollama"&gt;Step 2: Install Ollama
&lt;/h3&gt;&lt;p&gt;SSH into your VPS and run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -fsSL https://ollama.com/install.sh | sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This installs Ollama as a systemd service. Start it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl start ollama
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl enable ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="step-3-pull-your-first-model"&gt;Step 3: Pull Your First Model
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull llama4:8b-q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This downloads the 8B model (quantized to 4-bit, ~5GB). You can also try:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;qwen3:4bit&lt;/code&gt; — Qwen 3.6, excellent for coding and reasoning&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma4:8b-q4&lt;/code&gt; — Google&amp;rsquo;s Gemma 4, strong multilingual support&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama4:70b-q4&lt;/code&gt; — if your VPS has 48GB+ RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Check it&amp;rsquo;s working:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama list
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama run llama4:8b-q4_K_M &lt;span style="color:#e6db74"&gt;&amp;#34;Hello, world!&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="step-4-configure-ollama-for-remote-access"&gt;Step 4: Configure Ollama for Remote Access
&lt;/h3&gt;&lt;p&gt;By default, Ollama only listens on localhost. Edit the systemd service:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;mkdir -p /etc/systemd/system/ollama.service.d
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;cat &lt;span style="color:#e6db74"&gt;&amp;lt;&amp;lt;EOF &amp;gt; /etc/systemd/system/ollama.service.d/env.conf
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;[Service]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Environment=&amp;#34;OLLAMA_HOST=0.0.0.0:11434&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Environment=&amp;#34;OLLAMA_MAX_LOADED_MODELS=1&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Environment=&amp;#34;OLLAMA_NUM_PARALLEL=1&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;EOF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl daemon-reload
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl restart ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="step-5-deploy-open-webui"&gt;Step 5: Deploy Open WebUI
&lt;/h3&gt;&lt;p&gt;Open WebUI is a beautiful, feature-rich chat interface for Ollama. Deploy with Docker:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;docker run -d -p 3000:8080 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -v open-webui:/app/backend/data &lt;span style="color:#ae81ff"&gt;\
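&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --add-host&lt;span style="color:#f92672"&gt;=&lt;/span&gt;host.docker.internal:host-gateway &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e OLLAMA_BASE_URL&lt;span style="color:#f92672"&gt;=&lt;/span&gt;http://host.docker.internal:11434 &lt;span style="color:#ae81ff"&gt;\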
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --name open-webui &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; ghcr.io/open-webui/open-webui:main
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Visit &lt;code&gt;http://your-vps-ip:3000&lt;/code&gt; and you&amp;rsquo;ll see the Open WebUI login page. Create an admin account and start chatting.&lt;/p&gt;
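&lt;p&gt;If the chat page loads but no models appear, it can help to confirm that Ollama itself answers over HTTP from the VPS shell. The sketch below builds a request body for Ollama&amp;rsquo;s &lt;code&gt;/api/generate&lt;/code&gt; endpoint and only sends it when asked to. The helper &lt;code&gt;generate_payload&lt;/code&gt; and the &lt;code&gt;RUN_OLLAMA_CHECK&lt;/code&gt; switch are hypothetical names introduced for this example, and the model tag is the one pulled in Step 3.&lt;/p&gt;

```shell
# Build a JSON request body for Ollama's /api/generate endpoint.
# Note: no JSON escaping is done, so keep quotes and backslashes out of the prompt.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

payload=$(generate_payload "llama4:8b-q4_K_M" "Say hello in five words.")
echo "$payload"

# Only contact the server when asked to, since this needs a running Ollama service.
# RUN_OLLAMA_CHECK is a hypothetical switch used for this sketch.
if [ "${RUN_OLLAMA_CHECK:-0}" = "1" ]; then
  curl -s http://localhost:11434/api/generate -d "$payload"
fi
```

&lt;p&gt;Running the script with &lt;code&gt;RUN_OLLAMA_CHECK=1&lt;/code&gt; set should return a JSON reply from the model. A connection error points at the Ollama service rather than at Open WebUI.&lt;/p&gt;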
&lt;h3 id="step-6-add-a-reverse-proxy-optional-but-recommended"&gt;Step 6: Add a Reverse Proxy (Optional but Recommended)
&lt;/h3&gt;&lt;p&gt;For HTTPS and a proper domain, set up Nginx as a reverse proxy:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo apt install nginx certbot python3-certbot-nginx -y
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo certbot --nginx -d chat.yourdomain.com
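&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then create &lt;code&gt;/etc/nginx/sites-available/open-webui&lt;/code&gt; with the following contents, symlink it into &lt;code&gt;/etc/nginx/sites-enabled/&lt;/code&gt;, and reload Nginx to activate it:&lt;/p&gt;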
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-nginx" data-lang="nginx"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;server&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;listen&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;80&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;server_name&lt;/span&gt; &lt;span style="color:#e6db74"&gt;chat.yourdomain.com&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;return&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;301&lt;/span&gt; &lt;span style="color:#e6db74"&gt;https://&lt;/span&gt;$server_name$request_uri;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;server&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;listen&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;443&lt;/span&gt; &lt;span style="color:#e6db74"&gt;ssl&lt;/span&gt; &lt;span style="color:#e6db74"&gt;http2&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;server_name&lt;/span&gt; &lt;span style="color:#e6db74"&gt;chat.yourdomain.com&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;ssl_certificate&lt;/span&gt; &lt;span style="color:#e6db74"&gt;/etc/letsencrypt/live/chat.yourdomain.com/fullchain.pem&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;ssl_certificate_key&lt;/span&gt; &lt;span style="color:#e6db74"&gt;/etc/letsencrypt/live/chat.yourdomain.com/privkey.pem&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;location&lt;/span&gt; &lt;span style="color:#e6db74"&gt;/&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_pass&lt;/span&gt; &lt;span style="color:#e6db74"&gt;http://localhost:3000&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_set_header&lt;/span&gt; &lt;span style="color:#e6db74"&gt;Host&lt;/span&gt; $host;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_set_header&lt;/span&gt; &lt;span style="color:#e6db74"&gt;X-Real-IP&lt;/span&gt; $remote_addr;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_set_header&lt;/span&gt; &lt;span style="color:#e6db74"&gt;X-Forwarded-For&lt;/span&gt; $proxy_add_x_forwarded_for;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_set_header&lt;/span&gt; &lt;span style="color:#e6db74"&gt;X-Forwarded-Proto&lt;/span&gt; $scheme;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;# WebSocket support for streaming
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_http_version&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;span style="color:#e6db74"&gt;.1&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_set_header&lt;/span&gt; &lt;span style="color:#e6db74"&gt;Upgrade&lt;/span&gt; $http_upgrade;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;proxy_set_header&lt;/span&gt; &lt;span style="color:#e6db74"&gt;Connection&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;upgrade&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="step-7-secure-your-instance"&gt;Step 7: Secure Your Instance
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Firewall rules&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw allow 22/tcp
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw allow 80/tcp
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw allow 443/tcp
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo ufw enable
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Fail2ban for SSH protection&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo apt install fail2ban -y
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo systemctl enable fail2ban
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="cost-comparison-self-hosted-vs-api"&gt;Cost Comparison: Self-Hosted vs API
&lt;/h2&gt;&lt;p&gt;Let&amp;rsquo;s do the real math for a typical usage scenario: 500K input tokens + 500K output tokens per day (15M input + 15M output per month, 30M tokens in total).&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Scenario&lt;/th&gt;
 &lt;th&gt;Monthly Cost&lt;/th&gt;
 &lt;th&gt;Notes&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;GPT-5 API&lt;/td&gt;
 &lt;td&gt;~$168&lt;/td&gt;
 &lt;td&gt;$1.25/$10 per 1M tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Claude Sonnet 4.5 API&lt;/td&gt;
 &lt;td&gt;~$270&lt;/td&gt;
 &lt;td&gt;$3/$15 per 1M tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;RackNerd 3.5GB VPS&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$2.71&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;$32.49/year, 3.5GB RAM fits a 3B model (Q4)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Hostinger KVM 1 VPS&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$6.49&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;4GB RAM, meets the minimum for a 3B model (Q4)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Hostinger KVM 2 VPS&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$8.99&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;8GB RAM, comfortable for 8B, minimum for 12B (Q4)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Vultr High Frequency $6&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$6.00&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;1GB RAM, limited for LLM but cheap&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;The savings are dramatic.&lt;/strong&gt; Even at the higher-end Hostinger KVM 2 plan, you&amp;rsquo;re paying $8.99/month for flat-rate inference on a 12B model that rivals GPT-4-tier models on many benchmarks, with volume capped only by how fast the CPU can generate tokens. The API equivalent costs roughly 19x to 30x more.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Breakeven calculation:&lt;/strong&gt; If you&amp;rsquo;re spending more than $10/month on AI APIs, a VPS pays for itself. For teams running 10M+ tokens/month, the savings exceed 95%.&lt;/p&gt;
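&lt;p&gt;The API rows in the table follow from simple arithmetic on the listed per-token prices, and the same calculation can be rerun with your own volumes. A minimal sketch, using the prices and the $8.99 plan from the table; the function names &lt;code&gt;api_monthly_cost&lt;/code&gt; and &lt;code&gt;cost_ratio&lt;/code&gt; are hypothetical helpers, and a 30-day month is assumed.&lt;/p&gt;

```shell
# Monthly API cost in USD.
# $1 = input tokens (millions/month), $2 = output tokens (millions/month),
# $3 = input price per 1M tokens,     $4 = output price per 1M tokens.
api_monthly_cost() {
  awk -v in_m="$1" -v out_m="$2" -v in_p="$3" -v out_p="$4" \
    'BEGIN { printf "%.2f\n", in_m * in_p + out_m * out_p }'
}

# How many times more the API costs than a flat-rate VPS.
cost_ratio() {
  awk -v api="$1" -v vps="$2" 'BEGIN { printf "%.1f\n", api / vps }'
}

# 500K input + 500K output tokens per day over 30 days = 15M of each per month.
gpt5=$(api_monthly_cost 15 15 1.25 10)
claude=$(api_monthly_cost 15 15 3 15)
echo "GPT-5 API:  \$${gpt5}/month"
echo "Claude API: \$${claude}/month"
echo "GPT-5 vs \$8.99 VPS:  $(cost_ratio "$gpt5" 8.99)x"
echo "Claude vs \$8.99 VPS: $(cost_ratio "$claude" 8.99)x"
```

&lt;p&gt;Swapping in a lower daily volume shows quickly where the API bill drops below the flat VPS price, which is the breakeven point for your own usage.&lt;/p&gt;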
&lt;h2 id="when-not-to-self-host"&gt;When NOT to Self-Host
&lt;/h2&gt;&lt;p&gt;Self-hosting isn&amp;rsquo;t for everyone. Consider API access if:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;You need 70B+ models regularly.&lt;/strong&gt; CPU inference on 70B models is painfully slow (1-3 tokens/sec). GPU VPS or API is better.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your usage is sporadic.&lt;/strong&gt; If you only run AI queries a few times per week, pay-per-token API access will likely cost less than a $6/month VPS you barely use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You need zero maintenance.&lt;/strong&gt; APIs just work. Self-hosted means you handle updates, security, backups, and downtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You need the absolute latest model.&lt;/strong&gt; API providers ship new models instantly. Self-hosted requires you to pull and test new models manually.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="buying-decision-guide"&gt;Buying Decision Guide
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Use Case&lt;/th&gt;
 &lt;th&gt;Recommended Setup&lt;/th&gt;
 &lt;th&gt;Estimated Cost&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Personal AI assistant, coding help&lt;/td&gt;
 &lt;td&gt;RackNerd 3.5GB + Ollama + a 3B model (Q4)&lt;/td&gt;
 &lt;td&gt;$2.71/mo&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Team internal chatbot, knowledge base&lt;/td&gt;
 &lt;td&gt;Hostinger KVM 2 + Ollama + Open WebUI&lt;/td&gt;
 &lt;td&gt;$8.99/mo&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Production RAG pipeline, concurrent users&lt;/td&gt;
 &lt;td&gt;Vultr GPU instance + vLLM&lt;/td&gt;
 &lt;td&gt;$30-100/mo&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Occasional AI queries, prototyping&lt;/td&gt;
 &lt;td&gt;API access (DeepSeek/GPT-4.1 Nano)&lt;/td&gt;
 &lt;td&gt;$6-8/mo&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Maximum quality, unlimited tokens&lt;/td&gt;
 &lt;td&gt;API access (GPT-5/Claude Sonnet)&lt;/td&gt;
 &lt;td&gt;$168-270/mo&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; For anyone running AI workloads regularly, self-hosting on a budget VPS is the smartest move in 2026. The open-source models are good enough, the tools are easy to deploy, and the cost savings are enormous.&lt;/p&gt;
&lt;p&gt;Start with Ollama on a $6/month VPS. If you outgrow it, migrate to vLLM or add GPU capacity. The path from hobby project to production AI infrastructure has never been this accessible.&lt;/p&gt;
&lt;p&gt;👉 &lt;a class="link" href="https://my.racknerd.com/aff.php?aff=19978" target="_blank" rel="noopener"
 &gt;Get Started with RackNerd VPS&lt;/a&gt; — From $11.29/year
👉 &lt;a class="link" href="https://www.hostinger.com/vps-hosting?REFERRALCODE=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Check Hostinger VPS Plans&lt;/a&gt; — From $6.49/month
👉 &lt;a class="link" href="https://www.vultr.com/?ref=9706229" target="_blank" rel="noopener"
 &gt;Explore Vultr GPU VPS&lt;/a&gt; — From $6/month&lt;/p&gt;
&lt;h2 id="faq"&gt;FAQ
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Can I run LLMs on a 4GB RAM VPS?&lt;/strong&gt; Yes, but only smaller models. Qwen 3.6 3B or Llama 4 8B with aggressive quantization (Q3) will run on 4GB, but expect 3-5 tokens/sec. For comfortable performance, 8GB+ RAM is recommended.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How long does it take to set up?&lt;/strong&gt; The Ollama install takes 2 minutes. Open WebUI deployment takes 5 minutes. Total: under 10 minutes from zero to a working AI chat interface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Can I use my own API key with self-hosted models?&lt;/strong&gt; Self-hosted models don&amp;rsquo;t need API keys — you&amp;rsquo;re running the model locally. However, Open WebUI supports hybrid setups where you can add external API providers alongside your local models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What happens when the VPS goes down?&lt;/strong&gt; You lose access to your AI service until it&amp;rsquo;s back up. This is why production deployments often use redundant VPS instances or cloud auto-scaling. For personal use, a single VPS downtime of a few hours is rarely disruptive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do I need a domain name?&lt;/strong&gt; Not for Ollama itself — it works on IP addresses. But Open WebUI looks better with a domain, and you&amp;rsquo;ll need one for SSL certificates via Let&amp;rsquo;s Encrypt.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Can I fine-tune models on a VPS?&lt;/strong&gt; Fine-tuning requires significantly more resources than inference. A single GPU VPS (RTX 4090 or better) is the minimum for practical fine-tuning. For most use cases, RAG (Retrieval-Augmented Generation) with your existing models achieves similar results at a fraction of the cost.&lt;/p&gt;</description></item></channel></rss>