Self-Host Ollama and Open WebUI on Proxmox LXC
Run a complete self-hosted AI inference stack with Ollama, Open WebUI, and Whisper across lightweight Proxmox LXC containers — no GPU passthrough VM required.
Running large language models locally has gone from a fringe experiment to a legitimate homelab staple in under two years. If you're already running Proxmox, you're closer to a full self-hosted AI stack than you think — and you don't need to dedicate an entire GPU passthrough VM to get there. This guide walks you through deploying Ollama, Open WebUI, and Whisper across lightweight LXC containers, keeping your hypervisor lean while giving you a capable local AI inference environment.
Why LXC Instead of a Full VM?
The instinct is to spin up a VM for anything AI-related, especially since GPU passthrough tutorials dominate the space. But for CPU-based inference — which covers a surprising range of useful models — LXC containers are a better fit.
Containers share the host kernel, which means lower memory overhead, faster startup, and less disk I/O. A VM running Ubuntu for Ollama might consume 2–3 GB of RAM before you load a single model. An LXC container doing the same job sits closer to 300–500 MB baseline.
The tradeoff: LXC containers require a privileged setup or specific kernel capabilities for GPU access. If you're on a system with an NVIDIA or AMD GPU and want hardware acceleration, you'll need a privileged container with the right device bindings — covered below.
What You'll Build
By the end of this guide you'll have:
- Ollama LXC — the model runner and inference backend
- Open WebUI LXC — a ChatGPT-style frontend that connects to Ollama
- Whisper LXC (optional) — local speech-to-text that Open WebUI can use for voice input
All three containers sit on the same Proxmox bridge and communicate over a private network. Open WebUI is exposed on your LAN; Ollama and Whisper stay internal.
Prerequisites
Before starting, make sure you have:
- Proxmox VE 8.x or newer
- At least 16 GB RAM (32 GB recommended if running 13B+ models)
- 50+ GB of free storage for model weights
- A Debian 12 LXC template downloaded in Proxmox
- Basic familiarity with the Proxmox shell
Download the Debian 12 template from the Proxmox UI under Datacenter → your node → local storage → CT Templates → Templates, then search for debian-12.
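Or grab it from the host shell with pveam (the version suffix below is only an example; use whichever build the available list actually shows):

pveam update
pveam available --section system | grep debian-12
pveam download local debian-12-standard_12.7-1_amd64.tar.zst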
Step 1: Create the Ollama LXC Container
From the Proxmox web UI, click Create CT. Use these settings as a baseline:
- Hostname: ollama
- Template: debian-12-standard
- Disk: 60 GB (models are large — Llama 3.1 8B is ~4.7 GB, 70B is ~40 GB)
- CPU: 4–8 cores
- Memory: 8192 MB minimum; 16384 MB if you plan to run 13B models
- Network: Bridge vmbr0, static IP (e.g., 192.168.1.50/24)
- Uncheck "Start after created"
If you want GPU access from this container, enable Privileged mode in the General tab before creating it. You cannot change this after the fact.
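If you prefer the CLI, a pct create equivalent looks like this. It's a sketch: it assumes a local-lvm rootfs, the template filename you downloaded earlier, and a .1 gateway, so adjust all three to your setup (and use --unprivileged 0 if you need GPU access):

pct create 100 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
  --hostname ollama \
  --cores 4 \
  --memory 8192 \
  --rootfs local-lvm:60 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1 \
  --unprivileged 1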
GPU Passthrough for the Ollama Container (Optional)
If you have an NVIDIA GPU on the host, add the following to the container config file after creation. The container ID in this example is 100:
# On the Proxmox host
nano /etc/pve/lxc/100.conf
Add these lines:
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
For AMD GPUs (ROCm), the relevant device paths are /dev/kfd and /dev/dri. The specifics vary by driver version — check the Ollama GPU documentation for your card.
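As a rough sketch of the AMD equivalent: the /dev/dri devices use major 226, but /dev/kfd's major is assigned dynamically, so check it first and substitute the placeholder below.

# Check the majors with: ls -l /dev/kfd /dev/dri/*
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c <kfd-major>:* rwm
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir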
Start the container and open a shell:
pct start 100
pct enter 100
Install Ollama
Inside the container:
apt update && apt install -y curl
curl -fsSL https://ollama.com/install.sh | sh
Ollama installs as a systemd service. By default it only listens on 127.0.0.1:11434. Since Open WebUI will be in a separate container, you need to change this.
Rather than editing the unit file directly (the install script rewrites it when you rerun the installer to update Ollama), create a systemd drop-in override:

systemctl edit ollama

In the editor that opens, add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Reload and restart:
systemctl daemon-reload
systemctl restart ollama
systemctl enable ollama
Verify it's listening:
curl http://localhost:11434
# Should return: Ollama is running
Pull Your First Model
ollama pull llama3.1:8b
This downloads about 4.7 GB. Mistral 7B is a similarly sized alternative; Qwen 2.5 14B is heavier but noticeably more capable:
ollama pull mistral:7b
ollama pull qwen2.5:14b
Test it from the CLI before moving on:
ollama run llama3.1:8b "What is Proxmox VE?"
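You can also hit the HTTP API directly, which is exactly what Open WebUI will do. A quick non-streaming request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "What is Proxmox VE?",
  "stream": false
}'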
Step 2: Create the Open WebUI LXC Container
Create a second container with these settings:
- Hostname: open-webui
- Template: debian-12-standard
- Disk: 10 GB
- CPU: 2 cores
- Memory: 2048 MB
- Network: Same bridge, static IP (e.g., 192.168.1.51/24)
Open WebUI is a Python application that ships with a prebuilt web frontend. The easiest way to run it in an LXC is via Docker, but since we're trying to avoid nested-Docker complexity, we'll use the pip-based install instead.
Start the container and enter it:
pct start 101
pct enter 101
Install Python and Open WebUI
apt update && apt install -y python3 python3-pip python3-venv git curl
# Create a dedicated user
useradd -m -s /bin/bash openwebui
su - openwebui

# Set up a virtual environment and install Open WebUI
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install open-webui
This will take a few minutes. Once done, configure the connection to your Ollama container:
export OLLAMA_BASE_URL=http://192.168.1.50:11434
export DATA_DIR=/home/openwebui/data
mkdir -p $DATA_DIR
# Start it
open-webui serve
Open WebUI will start on port 8080. Visit http://192.168.1.51:8080 from your browser to confirm it's working.
Run Open WebUI as a Systemd Service
Exit back to root (exit from the openwebui user) and create a service file:
nano /etc/systemd/system/open-webui.service
[Unit]
Description=Open WebUI
After=network.target
[Service]
User=openwebui
WorkingDirectory=/home/openwebui
Environment="PATH=/home/openwebui/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_BASE_URL=http://192.168.1.50:11434"
Environment="DATA_DIR=/home/openwebui/data"
ExecStart=/home/openwebui/venv/bin/open-webui serve
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable --now open-webui
Now visit http://192.168.1.51:8080 and create your admin account on the first launch.
Step 3: Add Whisper for Voice Input (Optional)
Open WebUI supports speech-to-text via a compatible Whisper API. The fastest way to get this running on Proxmox LXC is with whisper.cpp compiled for CPU inference, or via faster-whisper through a lightweight Python server.
Create a third container:
- Hostname: whisper
- Template: debian-12-standard
- Disk: 10 GB
- CPU: 2–4 cores
- Memory: 2048 MB
- Network: Same bridge, static IP (e.g., 192.168.1.52/24)
pct start 102
pct enter 102
Install Faster-Whisper API Server
apt update && apt install -y python3 python3-pip python3-venv ffmpeg
useradd -m -s /bin/bash whisper
su - whisper

python3 -m venv ~/venv
source ~/venv/bin/activate

# faster-whisper + a lightweight OpenAI-compatible server
pip install faster-whisper flask
Create a simple OpenAI-compatible transcription endpoint:
# /home/whisper/server.py
from flask import Flask, request, jsonify
from faster_whisper import WhisperModel
import tempfile, os
app = Flask(__name__)
model = WhisperModel("base", device="cpu", compute_type="int8")

@app.route("/v1/audio/transcriptions", methods=["POST"])
def transcribe():
    audio = request.files.get("file")
    if not audio:
        return jsonify({"error": "No file"}), 400
    # Write the upload to a temp file faster-whisper can read
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        audio.save(tmp.name)
    segments, _ = model.transcribe(tmp.name)
    text = " ".join(s.text for s in segments)
    os.unlink(tmp.name)
    return jsonify({"text": text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9000)
python server.py
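With the server running in the foreground, sanity-check the endpoint from another machine; test.wav is any short audio clip you have on hand:

curl -F "file=@test.wav" http://192.168.1.52:9000/v1/audio/transcriptions
# {"text": " ..."}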
Create a systemd service for this the same way you did for Open WebUI, with ExecStart pointing at the venv interpreter: /home/whisper/venv/bin/python /home/whisper/server.py. The system python won't see the packages installed inside the venv.
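For reference, a minimal unit file following the same pattern as the Open WebUI one:

[Unit]
Description=Whisper transcription API
After=network.target

[Service]
User=whisper
WorkingDirectory=/home/whisper
ExecStart=/home/whisper/venv/bin/python /home/whisper/server.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target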
Connect Whisper to Open WebUI
In the Open WebUI dashboard, go to Settings → Audio and set:
- Speech to Text Engine: OpenAI
- API Base URL: http://192.168.1.52:9000/v1
- API Key: anything (our server ignores it)
- Model: whisper-1 (Open WebUI sends this; our server ignores model selection)
Click the microphone icon in a chat to test voice input.
Keeping Models Organized
Ollama's Linux installer runs the service as a dedicated ollama user, so inside the container models land under /usr/share/ollama/.ollama/models by default (or under ~/.ollama/models if you run the binary manually as another user). With large models this fills up quickly.
A practical approach: mount a Proxmox storage volume directly into the Ollama container so models live on your NAS or a dedicated storage pool.
On the Proxmox host, add a bind mount to the container config:
nano /etc/pve/lxc/100.conf
Add:
mp0: /mnt/storage/ollama-models,mp=/usr/share/ollama/.ollama/models
Replace /mnt/storage/ollama-models with whatever path on the Proxmox host you want to use for model storage, and adjust the mp= target if your models live elsewhere (the OLLAMA_MODELS environment variable also lets you pick an explicit location). This survives container rebuilds and lets you share models between multiple Ollama instances if needed. On unprivileged containers, remember the uid shift: the host directory must be writable by the mapped uid (container uid + 100000 by default), or Ollama will hit permission errors.
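The same mount can be added from the CLI; a restart applies it:

pct set 100 -mp0 /mnt/storage/ollama-models,mp=/usr/share/ollama/.ollama/models
pct reboot 100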
Performance Tips for CPU Inference
Running models on CPU is slower than GPU, but perfectly usable for many workloads. A few things help:
Pin CPU cores to the container. In Proxmox, under the container's Resources tab, set the CPU affinity to specific physical cores. This reduces cache thrashing from the hypervisor scheduler.
Use quantized models. Ollama pulls quantized versions by default (q4_K_M quantization for most models). Stick with these unless you have a specific quality reason to go up to q8_0 or full precision.
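If you do want a different quantization, most models in the library expose tagged variants. The exact tags vary per model, so check its page on ollama.com before pulling; this one is illustrative:

ollama pull llama3.1:8b-instruct-q8_0   # larger download, marginally better output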
Limit concurrent requests. By default Ollama handles one request at a time for CPU inference. Set OLLAMA_NUM_PARALLEL=1 in your service environment to make this explicit and avoid memory thrashing.
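Set it alongside OLLAMA_HOST in the drop-in from earlier (systemctl edit ollama), then restart the service:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_NUM_PARALLEL=1"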
Watch ARC memory on ZFS hosts. If your Proxmox host uses ZFS, the ARC cache can compete with Ollama's model memory. Set a sane ARC max:
# On the Proxmox host
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u
This caps ARC at 8 GB. Adjust to your available RAM.
Securing the Stack
Open WebUI has built-in authentication, but Ollama's API has none by default. Since Ollama is bound to 0.0.0.0, any device on your LAN can query it directly.
If your Proxmox network uses VLANs, the cleanest fix is to put Ollama and Whisper on a dedicated VLAN that only Open WebUI can reach. If you're on a flat network, add a Proxmox firewall rule to restrict port 11434 access to only the Open WebUI container IP:
# In /etc/pve/firewall/100.fw (Ollama container firewall)
[OPTIONS]
enable: 1

[RULES]
IN ACCEPT -source 192.168.1.51 -p tcp -dport 11434
IN DROP -p tcp -dport 11434

Note that Proxmox's rule syntax uses single-dash options (-dport, not iptables-style --dport), and the rules only take effect once the firewall is enabled both here and on the container's network interface (the firewall=1 flag on net0).
For Open WebUI, consider putting it behind an nginx reverse proxy with HTTPS if you're exposing it beyond your LAN. Caddy works well in an LXC for this:
apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list
apt update && apt install caddy
Then configure /etc/caddy/Caddyfile to proxy ai.yourdomain.com to the Open WebUI container.
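A minimal Caddyfile, assuming DNS for ai.yourdomain.com resolves to the Caddy container and ports 80/443 are reachable for automatic certificate issuance:

ai.yourdomain.com {
    reverse_proxy 192.168.1.51:8080
}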
Updating Your AI Stack
Updating each component is straightforward:
Ollama:

# Inside the Ollama container
curl -fsSL https://ollama.com/install.sh | sh
systemctl restart ollama
Open WebUI:

# Inside the Open WebUI container, as the openwebui user
source ~/venv/bin/activate
pip install --upgrade open-webui

# Then, back as root
systemctl restart open-webui
Pull new model versions:

ollama pull llama3.1:8b   # Re-pulls if a newer version exists
ollama list               # See all downloaded models
ollama rm llama3.1:8b     # Remove to free space
Conclusion
Three lightweight LXC containers, a handful of systemd services, and you have a fully self-hosted AI inference stack running on your existing Proxmox homelab. No cloud subscription, no usage limits, no data leaving your network.
The CPU-only setup described here handles day-to-day tasks well — summarizing notes, drafting emails, answering questions about your infrastructure docs. If you find yourself wanting faster generation for larger models, the same container setup scales up: add GPU device bindings to the Ollama container and let Ollama detect and use the hardware automatically.
Start with Llama 3.1 8B, get comfortable with the stack, then explore the model library at ollama.com/library. There are coding assistants, embedding models for RAG pipelines, and multimodal models that can analyze images — all runnable in the same Ollama container you just built.