Self-Host Ollama and Open WebUI on Proxmox LXC
Run a complete self-hosted AI inference stack with Ollama, Open WebUI, and Whisper across lightweight Proxmox LXC containers — no GPU passthrough VM required.
Running large language models locally has gone from a fringe experiment to a legitimate homelab staple in under two years. If you're already running Proxmox, you're closer to a full self-hosted AI stack than you think — and you don't need to dedicate an entire GPU passthrough VM to get there. This guide walks you through deploying Ollama, Open WebUI, and Whisper across lightweight LXC containers, keeping your hypervisor lean while giving you a capable local AI inference environment.
Why LXC Instead of a Full VM?
The instinct is to spin up a VM for anything AI-related, especially since GPU passthrough tutorials dominate the space. But for CPU-based inference — which covers a surprising range of useful models — LXC containers are a better fit.
Containers share the host kernel, which means lower memory overhead, faster startup, and less disk I/O. A VM running Ubuntu for Ollama might consume 2–3 GB of RAM before you load a single model. An LXC container doing the same job sits closer to 300–500 MB baseline.
The tradeoff: LXC containers require a privileged setup or specific kernel capabilities for GPU access. If you're on a system with an NVIDIA or AMD GPU and want hardware acceleration, you'll need a privileged container with the right device bindings — covered below.
What You'll Build
By the end of this guide you'll have:
- Ollama LXC — the model runner and inference backend
- Open WebUI LXC — a ChatGPT-style frontend that connects to Ollama
- Whisper LXC (optional) — local speech-to-text that Open WebUI can use for voice input
All three containers sit on the same Proxmox bridge and communicate over a private network. Open WebUI is exposed on your LAN; Ollama and Whisper stay internal.
Prerequisites
Before starting, make sure you have:
- Proxmox VE 8.x or newer
- At least 16 GB RAM (32 GB recommended if running 13B+ models)
- 50+ GB of free storage for model weights
- A Debian 12 LXC template downloaded in Proxmox
- Basic familiarity with the Proxmox shell
Download the Debian 12 template from the Proxmox UI under Datacenter → your node → local storage → CT Templates → Templates, then search for debian-12.
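Or grab it from the host shell with pveam (the version suffix below is only an example; use whichever build the available list actually shows):

pveam update
pveam available --section system | grep debian-12
pveam download local debian-12-standard_12.7-1_amd64.tar.zst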
Step 1: Create the Ollama LXC Container
From the Proxmox web UI, click Create CT. Use these settings as a baseline:
- Hostname: ollama
- Template: debian-12-standard
- Disk: 60 GB (models are large — Llama 3.1 8B is ~4.7 GB, 70B is ~40 GB)
- CPU: 4–8 cores
- Memory: 8192 MB minimum; 16384 MB if you plan to run 13B models
- Network: Bridge vmbr0, static IP (e.g., 192.168.1.50/24)
- Uncheck "Start after created"
If you want GPU access from this container, enable Privileged mode in the General tab before creating it. You cannot change this after the fact.
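If you prefer the CLI, a pct create equivalent looks like this. It's a sketch: it assumes a local-lvm rootfs, the template filename you downloaded earlier, and a .1 gateway, so adjust all three to your setup (and use --unprivileged 0 if you need GPU access):

pct create 100 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
  --hostname ollama \
  --cores 4 \
  --memory 8192 \
  --rootfs local-lvm:60 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1 \
  --unprivileged 1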
GPU Passthrough for the Ollama Container (Optional)
If you have an NVIDIA GPU on the host, add the following to the container config file after creation. The container ID in this example is 100:
# On the Proxmox host
nano /etc/pve/lxc/100.conf
Add these lines:
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
For AMD GPUs (ROCm), the relevant device paths are /dev/kfd and /dev/dri. The specifics vary by driver version — check the Ollama GPU documentation for your card.
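As a rough sketch of the AMD equivalent: the /dev/dri devices use major 226, but /dev/kfd's major is assigned dynamically, so check it first and substitute the placeholder below.

# Check the majors with: ls -l /dev/kfd /dev/dri/*
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c <kfd-major>:* rwm
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir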
Start the container and open a shell:
pct start 100
pct enter 100
Install Ollama
Inside the container:
apt update && apt install -y curl
curl -fsSL https://ollama.com/install.sh | sh
Ollama installs as a systemd service. By default it only listens on 127.0.0.1:11434. Since Open WebUI will be in a separate container, you need to change this.
Rather than editing the unit file directly (the install script rewrites it when you rerun the installer to update Ollama), create a systemd drop-in override:

systemctl edit ollama

In the editor that opens, add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Reload and restart:
systemctl daemon-reload
systemctl restart ollama
systemctl enable ollama
Verify it's listening:
curl http://localhost:11434
# Should return: Ollama is running
Pull Your First Model
ollama pull llama3.1:8b
This downloads about 4.7 GB. Mistral 7B is a similarly sized alternative; Qwen 2.5 14B is heavier but noticeably more capable:
ollama pull mistral:7b
ollama pull qwen2.5:14b
Test it from the CLI before moving on:
ollama run llama3.1:8b "What is Proxmox VE?"
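You can also hit the HTTP API directly, which is exactly what Open WebUI will do. A quick non-streaming request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "What is Proxmox VE?",
  "stream": false
}'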
Step 2: Create the Open WebUI LXC Container
Create a second container with these settings:
- Hostname: open-webui
- Template: debian-12-standard
- Disk: 10 GB
- CPU: 2 cores
- Memory: 2048 MB
- Network: Same bridge, static IP (e.g., 192.168.1.51/24)
Open WebUI is a Python application that ships with a prebuilt web frontend. The easiest way to run it in an LXC is via Docker, but since we're trying to avoid nested-Docker complexity, we'll use the pip-based install instead.
Start the container and enter it:
pct start 101
pct enter 101
Install Python and Open WebUI
apt update && apt install -y python3 python3-pip python3-venv git curl
# Create a dedicated user
useradd -m -s /bin/bash openwebui
su - openwebui

# Set up a virtual environment and install Open WebUI
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install open-webui
This will take a few minutes. Once done, configure the connection to your Ollama container:
export OLLAMA_BASE_URL=http://192.168.1.50:11434
export DATA_DIR=/home/openwebui/data
mkdir -p $DATA_DIR
# Start it
open-webui serve
Open WebUI will start on port 8080. Visit http://192.168.1.51:8080 from your browser to confirm it's working.
Run Open WebUI as a Systemd Service
Exit back to root (exit from the openwebui user) and create a service file:
nano /etc/systemd/system/open-webui.service
[Unit]
Description=Open WebUI
After=network.target
[Service]
User=openwebui
WorkingDirectory=/home/openwebui
Environment="PATH=/home/openwebui/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_BASE_URL=http://192.168.1.50:11434"
Environment="DATA_DIR=/home/openwebui/data"
ExecStart=/home/openwebui/venv/bin/open-webui serve
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable --now open-webui
Now visit http://192.168.1.51:8080 and create your admin account on the first launch.
Step 3: Add Whisper for Voice Input (Optional)
Open WebUI supports speech-to-text via a compatible Whisper API. The fastest way to get this running on Proxmox LXC is with whisper.cpp compiled for CPU inference, or via faster-whisper through a lightweight Python server.
Create a third container:
- Hostname: whisper
- Template: debian-12-standard
- Disk: 10 GB
- CPU: 2–4 cores
- Memory: 2048 MB
- Network: Same bridge, static IP (e.g., 192.168.1.52/24)
pct start 102
pct enter 102
Install Faster-Whisper API Server
apt update && apt install -y python3 python3-pip python3-venv ffmpeg
useradd -m -s /bin/bash whisper
su - whisper

python3 -m venv ~/venv
source ~/venv/bin/activate

# faster-whisper + a lightweight OpenAI-compatible server
pip install faster-whisper flask
Create a simple OpenAI-compatible transcription endpoint:
# /home/whisper/server.py
from flask import Flask, request, jsonify
from faster_whisper import WhisperModel
import tempfile, os
app = Flask(__name__)
model = WhisperModel("base", device="cpu", compute_type="int8")

@app.route("/v1/audio/transcriptions", methods=["POST"])
def transcribe():
    audio = request.files.get("file")
    if not audio:
        return jsonify({"error": "No file"}), 400
    # Write the upload to a temp file faster-whisper can read
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        audio.save(tmp.name)
    segments, _ = model.transcribe(tmp.name)
    text = " ".join(s.text for s in segments)
    os.unlink(tmp.name)
    return jsonify({"text": text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9000)
python server.py
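With the server running in the foreground, sanity-check the endpoint from another machine; test.wav is any short audio clip you have on hand:

curl -F "file=@test.wav" http://192.168.1.52:9000/v1/audio/transcriptions
# {"text": " ..."}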
Create a systemd service for this the same way you did for Open WebUI, with ExecStart pointing at the venv interpreter: /home/whisper/venv/bin/python /home/whisper/server.py. The system python won't see the packages installed inside the venv.
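For reference, a minimal unit file following the same pattern as the Open WebUI one:

[Unit]
Description=Whisper transcription API
After=network.target

[Service]
User=whisper
WorkingDirectory=/home/whisper
ExecStart=/home/whisper/venv/bin/python /home/whisper/server.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target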
Connect Whisper to Open WebUI
In the Open WebUI dashboard, go to Settings → Audio and set:
- Speech to Text Engine: OpenAI
- API Base URL: http://192.168.1.52:9000/v1
- API Key: anything (our server ignores it)
- Model: whisper-1 (Open WebUI sends this; our server ignores model selection)
Click the microphone icon in a chat to test voice input.
Keeping Models Organized
Ollama's Linux installer runs the service as a dedicated ollama user, so inside the container models land under /usr/share/ollama/.ollama/models by default (or under ~/.ollama/models if you run the binary manually as another user). With large models this fills up quickly.
A practical approach: mount a Proxmox storage volume directly into the Ollama container so models live on your NAS or a dedicated storage pool.
On the Proxmox host, add a bind mount to the container config:
nano /etc/pve/lxc/100.conf
Add:
mp0: /mnt/storage/ollama-models,mp=/usr/share/ollama/.ollama/models
Replace /mnt/storage/ollama-models with whatever path on the Proxmox host you want to use for model storage, and adjust the mp= target if your models live elsewhere (the OLLAMA_MODELS environment variable also lets you pick an explicit location). This survives container rebuilds and lets you share models between multiple Ollama instances if needed. On unprivileged containers, remember the uid shift: the host directory must be writable by the mapped uid (container uid + 100000 by default), or Ollama will hit permission errors.
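The same mount can be added from the CLI; a restart applies it:

pct set 100 -mp0 /mnt/storage/ollama-models,mp=/usr/share/ollama/.ollama/models
pct reboot 100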
Performance Tips for CPU Inference
Running models on CPU is slower than GPU, but perfectly usable for many workloads. A few things help:
Pin CPU cores to the container. In Proxmox, under the container's Resources tab, set the CPU affinity to specific physical cores. This reduces cache thrashing from the hypervisor scheduler.
Use quantized models. Ollama pulls quantized versions by default (q4_K_M quantization for most models). Stick with these unless you have a specific quality reason to go up to q8_0 or full precision.
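If you do want a different quantization, most models in the library expose tagged variants. The exact tags vary per model, so check its page on ollama.com before pulling; this one is illustrative:

ollama pull llama3.1:8b-instruct-q8_0   # larger download, marginally better output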
Limit concurrent requests. By default Ollama handles one request at a time for CPU inference. Set OLLAMA_NUM_PARALLEL=1 in your service environment to make this explicit and avoid memory thrashing.
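Set it alongside OLLAMA_HOST in the drop-in from earlier (systemctl edit ollama), then restart the service:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_NUM_PARALLEL=1"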
Watch ARC memory on ZFS hosts. If your Proxmox host uses ZFS, the ARC cache can compete with Ollama's model memory. Set a sane ARC max:
# On the Proxmox host
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u
This caps ARC at 8 GB. Adjust to your available RAM.
Securing the Stack
Open WebUI has built-in authentication, but Ollama's API has none by default. Since Ollama is bound to 0.0.0.0, any device on your LAN can query it directly.
If your Proxmox network uses VLANs, the cleanest fix is to put Ollama and Whisper on a dedicated VLAN that only Open WebUI can reach. If you're on a flat network, add a Proxmox firewall rule to restrict port 11434 access to only the Open WebUI container IP:
# In /etc/pve/firewall/100.fw (Ollama container firewall)
[OPTIONS]
enable: 1

[RULES]
IN ACCEPT -source 192.168.1.51 -p tcp -dport 11434
IN DROP -p tcp -dport 11434

Note that Proxmox's rule syntax uses single-dash options (-dport, not iptables-style --dport), and the rules only take effect once the firewall is enabled both here and on the container's network interface (the firewall=1 flag on net0).
For Open WebUI, consider putting it behind an nginx reverse proxy with HTTPS if you're exposing it beyond your LAN. Caddy works well in an LXC for this:
apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list
apt update && apt install caddy
Then configure /etc/caddy/Caddyfile to proxy ai.yourdomain.com to the Open WebUI container.
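A minimal Caddyfile, assuming DNS for ai.yourdomain.com resolves to the Caddy container and ports 80/443 are reachable for automatic certificate issuance:

ai.yourdomain.com {
    reverse_proxy 192.168.1.51:8080
}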
Updating Your AI Stack
Updating each component is straightforward:
Ollama:

# Inside the Ollama container
curl -fsSL https://ollama.com/install.sh | sh
systemctl restart ollama
Open WebUI:

# Inside the Open WebUI container, as the openwebui user
source ~/venv/bin/activate
pip install --upgrade open-webui

# Then, back as root
systemctl restart open-webui
Pull new model versions:

ollama pull llama3.1:8b   # Re-pulls if a newer version exists
ollama list               # See all downloaded models
ollama rm llama3.1:8b     # Remove to free space
Conclusion
Three lightweight LXC containers, a handful of systemd services, and you have a fully self-hosted AI inference stack running on your existing Proxmox homelab. No cloud subscription, no usage limits, no data leaving your network.
The CPU-only setup described here handles day-to-day tasks well — summarizing notes, drafting emails, answering questions about your infrastructure docs. If you find yourself wanting faster generation for larger models, the same container setup scales up: add GPU device bindings to the Ollama container and let Ollama detect and use the hardware automatically.
Start with Llama 3.1 8B, get comfortable with the stack, then explore the model library at ollama.com/library. There are coding assistants, embedding models for RAG pipelines, and multimodal models that can analyze images — all runnable in the same Ollama container you just built.