<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llama on 诚实雷达</title><link>https://honestradar.com/tags/llama/</link><description>Recent content in Llama on 诚实雷达</description><generator>Hugo -- gohugo.io</generator><language>zh-cn</language><lastBuildDate>Fri, 19 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://honestradar.com/tags/llama/index.xml" rel="self" type="application/rss+xml"/><item><title>Budget VPS Self-Host LLM in 2026: Complete Guide to Running AI Models Cheaply</title><link>https://honestradar.com/vps-hosting/self-hosted-llm-vps-guide-2026/</link><pubDate>Fri, 19 Jun 2026 00:00:00 +0000</pubDate><guid>https://honestradar.com/vps-hosting/self-hosted-llm-vps-guide-2026/</guid><description>&lt;img src="https://honestradar.com/images/self-hosted-llm-vps-guide-2026.jpg" alt="Featured image of post Budget VPS Self-Host LLM in 2026: Complete Guide to Running AI Models Cheaply" /&gt;&lt;h2 id="introduction"&gt;Introduction
&lt;/h2&gt;&lt;p&gt;Running your own AI models no longer requires an expensive cloud GPU. With the right VPS configuration, you can self-host small quantized models such as Llama 3.2 3B for under $10/month on CPU, and 7B-8B models such as Mistral, Qwen, and Llama 3.1 8B on plans with 8 GB of RAM or more. 70B-class models need well over 40 GB of RAM or an hourly-billed GPU instance.&lt;/p&gt;
&lt;p&gt;This guide walks through practical approaches for AI developers, indie hackers, and teams who want full control over their LLM stack — no API rate limits, no vendor lock-in, no data leaving your server.&lt;/p&gt;
&lt;p&gt;We cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU inference&lt;/strong&gt; with quantized models (q4/q8) on $5-20/month VPS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU VPS options&lt;/strong&gt; for faster inference, billed hourly from about $0.25/hour (roughly $180/month and up if left running 24/7)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provider comparison&lt;/strong&gt; across RackNerd, Hostinger, Vultr, and GPU specialists&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step-by-step deployment&lt;/strong&gt; using Ollama, llama.cpp, and vLLM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost optimization&lt;/strong&gt; strategies for long-running AI workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Whether you&amp;rsquo;re building an AI agent, a RAG pipeline, or a custom chatbot, self-hosting gives you privacy, unlimited usage, and total cost predictability.&lt;/p&gt;
&lt;h2 id="why-self-host-llms-on-a-vps"&gt;Why Self-Host LLMs on a VPS?
&lt;/h2&gt;&lt;p&gt;API-based LLM access is convenient but comes with real costs at scale:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Cost Factor&lt;/th&gt;
 &lt;th&gt;API Usage&lt;/th&gt;
 &lt;th&gt;Self-Hosted VPS&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Per-token cost&lt;/td&gt;
 &lt;td&gt;$0.001-$0.03 per 1K tokens&lt;/td&gt;
 &lt;td&gt;$0 marginal (flat monthly hosting fee)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Data privacy&lt;/td&gt;
 &lt;td&gt;Third-party processing&lt;/td&gt;
 &lt;td&gt;Full control&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Rate limits&lt;/td&gt;
 &lt;td&gt;Strict quotas&lt;/td&gt;
 &lt;td&gt;Limited only by your hardware&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Customization&lt;/td&gt;
 &lt;td&gt;Fine-tuning expensive&lt;/td&gt;
 &lt;td&gt;Full control over weights and tuning (fine-tuning itself needs a GPU)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Uptime dependency&lt;/td&gt;
 &lt;td&gt;Provider outages affect you&lt;/td&gt;
 &lt;td&gt;Your infrastructure&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For AI application developers, the math becomes compelling quickly. A moderate AI agent processing 1,000 requests/day at $0.005/request costs about $150/month in API fees (1,000 x $0.005 x 30 days). The same workload on a well-configured VPS with quantized models runs for $5-20/month in hosting, provided CPU throughput is enough for your traffic.&lt;/p&gt;
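&lt;p&gt;The comparison above is plain arithmetic: requests per day times cost per request times days in the month, set against a flat hosting fee. A minimal Python sketch follows; the function names, the 30-day month, and the $20 VPS fee are assumptions made for illustration, not figures from any provider.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: compare a monthly API bill with a flat VPS fee.
# All inputs are illustrative assumptions, not quotes from any provider.

def monthly_api_cost(requests_per_day, cost_per_request, days=30):
    """Return the API bill for one month in dollars."""
    return round(requests_per_day * cost_per_request * days, 2)


def monthly_savings(requests_per_day, cost_per_request, vps_fee):
    """Return what a flat-fee VPS saves per month (negative if it costs more)."""
    return round(monthly_api_cost(requests_per_day, cost_per_request) - vps_fee, 2)


if __name__ == "__main__":
    for volume in (1_000, 10_000):
        api = monthly_api_cost(volume, 0.005)
        saved = monthly_savings(volume, 0.005, 20)
        print(f"{volume} requests/day: API ${api:.2f}/month, "
              f"saves ${saved:.2f} against a $20 VPS")
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At $0.005/request, 1,000 requests/day comes to $150/month and 10,000 requests/day to $1,500/month, so the gap over a flat hosting fee widens quickly as volume grows.&lt;/p&gt;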
&lt;h2 id="cpu-based-llm-hosting-the-budget-approach"&gt;CPU-Based LLM Hosting: The Budget Approach
&lt;/h2&gt;&lt;h3 id="model-sizes-and-ram-requirements"&gt;Model Sizes and RAM Requirements
&lt;/h3&gt;&lt;p&gt;For CPU inference with quantized models, here are realistic VPS requirements:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Quantization&lt;/th&gt;
 &lt;th&gt;Min RAM&lt;/th&gt;
 &lt;th&gt;Recommended RAM&lt;/th&gt;
 &lt;th&gt;Tokens/sec (8-core CPU)&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 3.2 1B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;2 GB&lt;/td&gt;
 &lt;td&gt;4 GB&lt;/td&gt;
 &lt;td&gt;~25-40&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 3.2 3B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;4 GB&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;~15-25&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mistral 7B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;6 GB&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;~8-15&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 3.1 8B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;6 GB&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;~8-15&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Qwen 2.5 7B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;6 GB&lt;/td&gt;
 &lt;td&gt;8 GB&lt;/td&gt;
 &lt;td&gt;~8-15&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Llama 3.1 70B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;48 GB&lt;/td&gt;
 &lt;td&gt;64 GB&lt;/td&gt;
 &lt;td&gt;~1-2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mixtral 8x7B&lt;/td&gt;
 &lt;td&gt;Q4_K_M&lt;/td&gt;
 &lt;td&gt;32 GB&lt;/td&gt;
 &lt;td&gt;48 GB&lt;/td&gt;
 &lt;td&gt;~5-10&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
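&lt;p&gt;A tokens/sec figure is easier to judge once it is converted into reply latency and daily capacity. The sketch below does that conversion; the 300-token reply length is an assumption chosen for illustration, and prompt processing time is ignored, so real latency will be somewhat higher.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: turn a tokens/sec figure into latency and daily capacity.
# The 300-token reply length is an assumption chosen for illustration.

def seconds_per_reply(reply_tokens, tokens_per_sec):
    """Generation time for one reply, ignoring prompt processing."""
    return reply_tokens / tokens_per_sec


def replies_per_day(reply_tokens, tokens_per_sec):
    """Upper bound on sequential replies in 24 hours of nonstop generation."""
    return int(86_400 // seconds_per_reply(reply_tokens, tokens_per_sec))


if __name__ == "__main__":
    for speed in (8, 15):  # the 7B-8B range from the table above
        print(speed, "tok/s:",
              round(seconds_per_reply(300, speed), 1), "s per reply,",
              replies_per_day(300, speed), "replies/day at most")
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At 8-15 tokens/sec a 300-token reply takes about 20-38 seconds, which suits background jobs and low-traffic chat better than latency-sensitive use.&lt;/p&gt;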
&lt;h3 id="best-budget-vps-providers-for-cpu-llm-hosting"&gt;Best Budget VPS Providers for CPU LLM Hosting
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;RackNerd&lt;/strong&gt; (affiliate: 19978): plans start at $4.99/month with 1-2 vCPU and 1-2 GB RAM. Their annual deals offer good value for the smallest models. Look for their $19.99/year plans with 2GB RAM and 1 vCPU, which is enough for a 1B model; 3B models need about 4GB.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://racknerd.com/aff.php?aff=19978" target="_blank" rel="noopener"
 &gt;Check RackNerd VPS deals&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hostinger&lt;/strong&gt; (referral: JZ1ZL8465QCG): plans start at $4.99/month with 4GB RAM (KVM 1 plan). That is enough for 3B-class models with Q4 quantization; 7B-class models need a plan with at least 6-8GB. NVMe storage keeps model loading fast.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.hostinger.com/vps-hosting?REFERRALCODE=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Explore Hostinger VPS plans&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vultr&lt;/strong&gt; (ref: 9706229) — Starting at $3.50/month for basic plans, $60/month for their 32GB RAM option. Good middle-ground with reliable uptime and global locations.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.vultr.com/?ref=9706229" target="_blank" rel="noopener"
 &gt;Vultr VPS hosting&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="step-by-step-deploy-ollama-on-a-5-vps"&gt;Step-by-Step: Deploy Ollama on a $5 VPS
&lt;/h3&gt;&lt;p&gt;Here&amp;rsquo;s a deployment walkthrough for running Llama 3.2 3B on a small VPS (4GB RAM or more, per the table above), assuming a Debian or Ubuntu image:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 1. Connect to your VPS&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ssh root@your-vps-ip
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 2. Update system&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;apt update &lt;span style="color:#f92672"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt upgrade -y
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 3. Install Ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -fsSL https://ollama.com/install.sh | sh
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 4. Pull Llama 3.2 3B (ollama run llama3.2:3b would open an interactive chat and block the remaining steps)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull llama3.2:3b
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 5. Listen on all interfaces (this export affects only the current shell; step 6 sets it for the service)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export OLLAMA_HOST&lt;span style="color:#f92672"&gt;=&lt;/span&gt;0.0.0.0:11434
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 6. Create systemd service for persistence&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;cat &lt;span style="color:#e6db74"&gt;&amp;lt;&amp;lt; EOF &amp;gt; /etc/systemd/system/ollama.service
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;[Unit]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Description=Ollama Service
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;After=network-online.target
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;[Service]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;ExecStart=/usr/local/bin/ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Environment=&amp;#34;OLLAMA_HOST=0.0.0.0:11434&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;User=ollama
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Group=ollama
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;Restart=always
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;RestartSec=3
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;[Install]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;WantedBy=default.target
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;EOF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl daemon-reload
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl enable ollama
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;systemctl restart ollama
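&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Once deployed, the Ollama API is reachable at &lt;code&gt;http://your-vps-ip:11434&lt;/code&gt;, and OpenAI-compatible clients can use the &lt;code&gt;/v1&lt;/code&gt; endpoints at the same address. Ollama has no built-in authentication, so restrict who can reach this port (see the security section).&lt;/p&gt;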
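&lt;p&gt;To check the deployment from another machine, a short client is enough. The sketch below posts to Ollama&amp;rsquo;s native &lt;code&gt;/api/generate&lt;/code&gt; endpoint using only the Python standard library; &lt;code&gt;build_payload&lt;/code&gt; and &lt;code&gt;generate&lt;/code&gt; are hypothetical helper names, and &lt;code&gt;your-vps-ip&lt;/code&gt; is the same placeholder used above.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: call Ollama's native /api/generate endpoint with the standard library.
# "your-vps-ip" is the placeholder used in this guide; replace it before running.
import json
import urllib.request

OLLAMA_URL = "http://your-vps-ip:11434/api/generate"


def build_payload(prompt, model="llama3.2:3b"):
    """Build the JSON body; stream=False asks for one JSON object, not a stream."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, url=OLLAMA_URL, timeout=120):
    """POST the prompt and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Calling &lt;code&gt;generate("Explain quantization in one sentence.")&lt;/code&gt; should return the model&amp;rsquo;s reply as a string once the server is reachable. The generous timeout is deliberate, since CPU inference can take a while.&lt;/p&gt;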
&lt;h2 id="gpu-vps-for-llm-inference-when-cpu-isnt-enough"&gt;GPU VPS for LLM Inference: When CPU Isn&amp;rsquo;t Enough
&lt;/h2&gt;&lt;p&gt;For 70B models, real-time RAG pipelines, or fine-tuning, GPU acceleration becomes essential. Here&amp;rsquo;s what to expect:&lt;/p&gt;
&lt;h3 id="gpu-vps-pricing-comparison"&gt;GPU VPS Pricing Comparison
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Provider&lt;/th&gt;
 &lt;th&gt;GPU Type&lt;/th&gt;
 &lt;th&gt;VRAM&lt;/th&gt;
 &lt;th&gt;Hourly Rate&lt;/th&gt;
 &lt;th&gt;Monthly Estimate (24/7)&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;RunPod&lt;/td&gt;
 &lt;td&gt;RTX 4090&lt;/td&gt;
 &lt;td&gt;24GB&lt;/td&gt;
 &lt;td&gt;$0.40&lt;/td&gt;
 &lt;td&gt;~$290&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Vast.ai&lt;/td&gt;
 &lt;td&gt;RTX 3090&lt;/td&gt;
 &lt;td&gt;24GB&lt;/td&gt;
 &lt;td&gt;$0.25&lt;/td&gt;
 &lt;td&gt;~$180&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Lambda Labs&lt;/td&gt;
 &lt;td&gt;A100&lt;/td&gt;
 &lt;td&gt;80GB&lt;/td&gt;
 &lt;td&gt;$1.50&lt;/td&gt;
 &lt;td&gt;~$1,090&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Vultr&lt;/td&gt;
 &lt;td&gt;A100&lt;/td&gt;
 &lt;td&gt;24GB&lt;/td&gt;
 &lt;td&gt;$0.85&lt;/td&gt;
 &lt;td&gt;~$620&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Hostinger&lt;/td&gt;
 &lt;td&gt;N/A (CPU only)&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
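&lt;p&gt;The monthly estimates above assume the instance runs around the clock. The sketch below shows the conversion and the reverse question of how many GPU hours a fixed budget buys; the 730-hour month and the $50 budget are assumptions for illustration.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: convert hourly GPU pricing into monthly cost and budgeted hours.
HOURS_PER_MONTH = 730  # average month: 365 * 24 / 12


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of running an instance for the given number of hours."""
    return round(hourly_rate * hours, 2)


def hours_within_budget(hourly_rate, budget):
    """Whole hours of GPU time a budget buys (epsilon guards float rounding)."""
    return int(budget / hourly_rate + 1e-9)


if __name__ == "__main__":
    for rate in (0.25, 0.40, 0.85, 1.50):  # hourly rates from the table above
        print(f"${rate:.2f}/hour: ${monthly_cost(rate):.2f} for a full month, "
              f"{hours_within_budget(rate, 50)} hours on a $50 budget")
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At $0.25/hour a $50 budget covers about 200 hours, so a low monthly GPU bill is realistic only for on-demand use where the instance is stopped between jobs.&lt;/p&gt;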
&lt;h3 id="when-you-need-gpu-vs-cpu"&gt;When You Need GPU vs CPU
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Use Case&lt;/th&gt;
 &lt;th&gt;Recommendation&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Chatbot with 7B model, low traffic&lt;/td&gt;
 &lt;td&gt;CPU VPS ($5-20/mo)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;AI agent with RAG, 7B-13B&lt;/td&gt;
 &lt;td&gt;CPU VPS with 16-32GB RAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Real-time streaming inference&lt;/td&gt;
 &lt;td&gt;GPU VPS (RTX 4090+)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Fine-tuning small models&lt;/td&gt;
 &lt;td&gt;GPU VPS (A100/H100)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;70B+ model serving&lt;/td&gt;
 &lt;td&gt;GPU VPS (A100 80GB+)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Batch processing/embeddings&lt;/td&gt;
 &lt;td&gt;CPU VPS with high RAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For most self-hosted AI applications, a well-configured CPU VPS with 16-32GB RAM running quantized 7B-13B models delivers usable throughput (roughly 8-15 tokens/sec for 7B-8B models) at a fraction of GPU costs.&lt;/p&gt;
&lt;h2 id="advanced-deployment-patterns"&gt;Advanced Deployment Patterns
&lt;/h2&gt;&lt;h3 id="rag-pipeline-on-self-hosted-vps"&gt;RAG Pipeline on Self-Hosted VPS
&lt;/h3&gt;&lt;p&gt;Combine your self-hosted LLM with a vector database for retrieval-augmented generation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 1. Install Ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -fsSL https://ollama.com/install.sh | sh
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull mistral
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 2. Install ChromaDB (vector store) and the Python packages the server imports&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;pip install chromadb fastapi uvicorn langchain langchain-community
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 3. Install embedding model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull nomic-embed-text
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 4. Python RAG setup&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;cat &lt;span style="color:#e6db74"&gt;&amp;lt;&amp;lt; &amp;#39;PYEOF&amp;#39; &amp;gt; rag_server.py
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;# Imports follow the langchain-community layout; newer LangChain releases may move these classes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;from fastapi import FastAPI
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;from pydantic import BaseModel
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;import chromadb
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;from langchain_community.llms.ollama import Ollama
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;from langchain_community.embeddings import OllamaEmbeddings
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;from langchain_community.vectorstores import Chroma
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;from langchain.chains import RetrievalQA
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;app = FastAPI()
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;# Initialize components (assumes Ollama is running locally with both models pulled)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;llm = Ollama(model=&amp;#34;mistral&amp;#34;, temperature=0)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;embeddings = OllamaEmbeddings(model=&amp;#34;nomic-embed-text&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;client = chromadb.PersistentClient(path=&amp;#34;/var/lib/chroma/db&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;vectorstore = Chroma(client=client, embedding_function=embeddings)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;class QueryRequest(BaseModel):
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    question: str
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    documents: list[str] = []
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;@app.post(&amp;#34;/query&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;def query(request: QueryRequest):
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    if request.documents:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;        # Add new documents to the vector store before querying
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;        vectorstore.add_texts(request.documents)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    # Retrieve relevant chunks and pass them to the LLM in a single prompt
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    retriever = vectorstore.as_retriever()
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    qa_chain = RetrievalQA.from_chain_type(
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;        llm=llm,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;        chain_type=&amp;#34;stuff&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;        retriever=retriever,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    )
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;    return {&amp;#34;answer&amp;#34;: qa_chain.invoke({&amp;#34;query&amp;#34;: request.question})[&amp;#34;result&amp;#34;]}
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;PYEOF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 5. Run with uvicorn&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;uvicorn rag_server:app --host 0.0.0.0 --port &lt;span style="color:#ae81ff"&gt;8000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="ai-agent-self-hosting-with-n8n--ollama"&gt;AI Agent Self-Hosting with n8n + Ollama
&lt;/h3&gt;&lt;p&gt;For workflow automation with AI:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 1. Deploy Ollama on your VPS&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;curl -fsSL https://ollama.com/install.sh | sh
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull llama3.2:3b
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 2. Install n8n (workflow automation)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;npm install n8n -g
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
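&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 3. Configure n8n to use local Ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# If your n8n version ignores these variables, add an Ollama credential in the editor UI with base URL http://localhost:11434&lt;/span&gt;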
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export N8N_AI_PROVIDER&lt;span style="color:#f92672"&gt;=&lt;/span&gt;ollama
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export N8N_AI_MODEL&lt;span style="color:#f92672"&gt;=&lt;/span&gt;llama3.2:3b
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export N8N_AI_BASE_URL&lt;span style="color:#f92672"&gt;=&lt;/span&gt;http://localhost:11434
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 4. Start n8n&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;n8n start
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This combination lets you build autonomous AI agents that call your local LLM for reasoning while managing workflows through n8n&amp;rsquo;s visual interface.&lt;/p&gt;
&lt;h2 id="cost-optimization-strategies"&gt;Cost Optimization Strategies
&lt;/h2&gt;&lt;h3 id="model-selection-for-your-budget"&gt;Model Selection for Your Budget
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Monthly Budget&lt;/th&gt;
 &lt;th&gt;Model Size&lt;/th&gt;
 &lt;th&gt;Provider Recommendation&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;$5-10&lt;/td&gt;
 &lt;td&gt;1B-3B (Q4)&lt;/td&gt;
 &lt;td&gt;RackNerd annual deal&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;$10-20&lt;/td&gt;
 &lt;td&gt;7B (Q4/Q8)&lt;/td&gt;
 &lt;td&gt;Hostinger KVM 2-3&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;$20-30&lt;/td&gt;
 &lt;td&gt;13B (Q4)&lt;/td&gt;
 &lt;td&gt;Vultr 16GB plans&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;$60+&lt;/td&gt;
 &lt;td&gt;70B (Q4)&lt;/td&gt;
 &lt;td&gt;High-RAM VPS (48GB+) or hourly GPU rental&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="quantization-trade-offs"&gt;Quantization Trade-offs
&lt;/h3&gt;&lt;pre tabindex="0"&gt;&lt;code&gt;Sizes are relative to Q4_K_M; quality figures are rough estimates.
Q8_0 (8-bit): ~98% quality, ~1.8x model size, ~1.8x memory
Q4_K_M: ~95% quality, 1x model size, 1x memory (recommended)
Q3_K_M: ~90% quality, ~0.8x model size, ~0.8x memory
Q2_K: ~85% quality, ~0.7x model size, ~0.7x memory
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For most production AI applications, Q4_K_M provides the best balance. The roughly 5% quality drop from full precision is typically hard to notice in chatbot and agent use cases, while memory requirements fall by about half relative to Q8 and by about 70% relative to FP16.&lt;/p&gt;
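&lt;p&gt;The size side of this trade-off can be estimated from the parameter count and the average bits stored per weight. The sketch below is a rough estimator, not a measurement: the bits-per-weight values are approximations that vary by model architecture, and the 1.5GB overhead for KV cache and the operating system is an assumption.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: estimate weight size from parameter count and bits per weight.
# The bits-per-weight values are approximations and vary by model architecture.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def weight_size_gb(params_billion, quant):
    """Approximate size of the weights in GB (10**9 bytes)."""
    return round(params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8, 2)


def fits_in_ram(params_billion, quant, ram_gb, overhead_gb=1.5):
    """True if weights plus an assumed overhead (KV cache, OS) fit in RAM."""
    return ram_gb >= weight_size_gb(params_billion, quant) + overhead_gb


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(quant, weight_size_gb(7, quant), "GB for a 7B model")
    print("70B Q4_K_M in 40GB RAM:", fits_in_ram(70, "Q4_K_M", 40))
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;By this estimate a 7B model needs roughly 4.2GB at Q4_K_M and 7.4GB at Q8_0, while a 70B model at Q4_K_M comes to about 42GB of weights alone.&lt;/p&gt;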
&lt;h3 id="storage-optimization"&gt;Storage Optimization
&lt;/h3&gt;&lt;p&gt;LLM models consume significant disk space. Optimize with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Check model sizes&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama list
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Remove unused models&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama rm mistral
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Use model compression&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Quantize an F16 GGUF with llama.cpp (the tool is llama-quantize in recent builds, quantize in older ones)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;./llama-quantize original.F16.gguf compressed.Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A typical 7B model in Q8 takes ~8GB. The same model in Q4_K_M takes ~4.5GB — nearly halving your storage and memory footprint.&lt;/p&gt;
&lt;h2 id="security-best-practices-for-self-hosted-llms"&gt;Security Best Practices for Self-Hosted LLMs
&lt;/h2&gt;&lt;p&gt;Running an LLM on a public VPS requires security hardening:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 1. Configure firewall&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ufw allow 22/tcp
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ufw allow from your-client-ip to any port 11434 proto tcp &lt;span style="color:#75715e"&gt;# Ollama API, trusted client only&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ufw enable
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 2. Use Cloudflare Tunnel for secure exposure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Instead of opening ports directly&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;cloudflared tunnel --url http://localhost:11434
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
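&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 3. Restrict allowed browser origins (CORS). This is not authentication: Ollama has none built in, so put a tunnel or an authenticating reverse proxy in front&lt;/span&gt;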
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export OLLAMA_ORIGINS&lt;span style="color:#f92672"&gt;=&lt;/span&gt;https://your-domain.com
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 4. Regular model updates&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull llama3.2:3b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For production deployments, consider using &lt;a class="link" href="https://honestradar.com/vps-hosting/cloudflare-tunnel-vps-self-host/" &gt;Cloudflare Tunnel&lt;/a&gt; to expose your LLM API securely without opening firewall ports directly.&lt;/p&gt;
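&lt;p&gt;After changing firewall rules, it is worth confirming from a different machine that the Ollama port is reachable only where intended. The sketch below is a plain TCP connection check using the Python standard library; &lt;code&gt;is_port_open&lt;/code&gt; is a hypothetical helper name, and a successful connection says nothing about whether the service behind the port is authenticated.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: check from another machine whether a TCP port accepts connections.
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Run &lt;code&gt;is_port_open("your-vps-ip", 11434)&lt;/code&gt; from a machine that should not have access; a result of &lt;code&gt;False&lt;/code&gt; suggests the firewall or tunnel setup is blocking direct connections.&lt;/p&gt;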
&lt;h2 id="faq"&gt;FAQ
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Can I run Llama 3 70B on a budget VPS?&lt;/strong&gt;
Yes, but only on a high-memory plan: the Q4_K_M weights alone are roughly 42GB, so 40GB of RAM is not quite enough without swap. Look for Vultr&amp;rsquo;s 40GB plan (~$60/month) or Hostinger&amp;rsquo;s KVM 8 (32GB RAM, ~$78/month) with swap space, and expect swapping to slow generation considerably. CPU inference will be slow (a few tokens/sec at best) but functional for batch processing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What&amp;rsquo;s the cheapest VPS for a self-hosted AI chatbot?&lt;/strong&gt;
A $5/month RackNerd VPS with 2GB RAM can run Llama 3.2 1B. Phi-3 mini (3.8B) needs about 4GB. For better quality, upgrade to 8GB RAM for Mistral 7B or Llama 3.1 8B with Q4 quantization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ollama vs llama.cpp vs vLLM — which should I use?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ollama&lt;/strong&gt;: Easiest setup, great for 1-7B models, built-in API compatibility&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;: Most efficient memory usage, best for constrained VPS, supports all quantizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;vLLM&lt;/strong&gt;: Highest throughput for GPU, PagedAttention for memory efficiency&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How do I monitor my self-hosted LLM&amp;rsquo;s performance?&lt;/strong&gt;
Use &lt;a class="link" href="https://honestradar.com/vps-hosting/ai-vps-monitoring-automation-tutorial/" &gt;our VPS monitoring guide&lt;/a&gt; to set up Prometheus + Grafana for tracking token generation speed, memory usage, and request latency.&lt;/p&gt;
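&lt;p&gt;For a quick measurement without a full monitoring stack, Ollama&amp;rsquo;s responses include timing fields that give generation speed directly. The sketch below computes tokens/sec from &lt;code&gt;eval_count&lt;/code&gt; and &lt;code&gt;eval_duration&lt;/code&gt; (reported in nanoseconds); the function name and the sample values are made up for illustration.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: derive generation speed from the timing fields Ollama returns.
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.

def tokens_per_second(response):
    """Compute tokens/sec from a parsed /api/generate or /api/chat response."""
    duration_ns = response.get("eval_duration", 0)
    if not duration_ns:
        return 0.0  # field missing or zero: nothing to measure
    return response.get("eval_count", 0) / (duration_ns / 1e9)


if __name__ == "__main__":
    sample = {"eval_count": 120, "eval_duration": 10_000_000_000}  # made-up values
    print(round(tokens_per_second(sample), 1), "tokens/sec")
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Logging this value per request gives a simple baseline to compare against the tokens/sec ranges quoted earlier in this guide.&lt;/p&gt;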
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Self-hosting LLMs on a budget VPS is now practical for many AI applications. For small models (1B-7B), a $5-20/month CPU VPS handles inference well with quantized models. For larger models or fine-tuning, GPU instances bill by the hour, from about $0.25/hour: roughly $180 to over $1,000/month if left running 24/7, and less for occasional use.&lt;/p&gt;
&lt;p&gt;The key decisions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start with CPU&lt;/strong&gt; for 7B models — quantization makes them viable on $10-20 VPS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Ollama&lt;/strong&gt; for quick setup and OpenAI-compatible API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantize aggressively&lt;/strong&gt;: Q4_K_M keeps most of the quality while roughly halving memory and disk needs relative to Q8&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add RAG&lt;/strong&gt; with ChromaDB for knowledge-base applications&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor costs&lt;/strong&gt; — your self-hosted LLM should cost less than equivalent API usage&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For most AI developers, the sweet spot is an 8-16GB RAM VPS running Mistral 7B or Llama 3.1 8B with Q4 quantization, delivering capable AI at roughly $10-20/month with zero API costs and full data privacy.&lt;/p&gt;
&lt;p&gt;Ready to get started? Compare &lt;a class="link" href="https://racknerd.com/aff.php?aff=19978" target="_blank" rel="noopener"
 &gt;RackNerd&lt;/a&gt;, &lt;a class="link" href="https://www.hostinger.com/vps-hosting?REFERRALCODE=JZ1ZL8465QCG" target="_blank" rel="noopener"
 &gt;Hostinger&lt;/a&gt;, and &lt;a class="link" href="https://www.vultr.com/?ref=9706229" target="_blank" rel="noopener"
 &gt;Vultr&lt;/a&gt; VPS plans to find your ideal hosting setup.&lt;/p&gt;</description></item></channel></rss>