Run Multiple AI Models on Proxmox LXC Containers
Learn how to isolate multiple AI models in separate Proxmox LXC containers with resource limits, shared GPU access, and an Open WebUI hub for multi-tenant inference.
Running a single Ollama instance on Proxmox works fine for personal use, but the moment you start juggling Llama 3, Mistral, Qwen, and a code model simultaneously, things get messy fast. Models compete for VRAM, one runaway inference job can starve everything else, and debugging becomes a nightmare. The solution is straightforward: give each model stack its own LXC container with hard resource limits, then front them all with a single Open WebUI instance that routes requests intelligently.
This guide walks you through exactly that architecture — multiple isolated Ollama LXC containers, cgroup-based resource limits, optional GPU passthrough, and a reverse-proxy hub. It works equally well on a single powerful node or a small cluster.
Why Isolate AI Models in Separate LXC Containers
Before diving into commands, it's worth understanding why container-per-model beats the alternatives.
A single Ollama instance with multiple models loaded shares one process namespace. If one model's inference thread leaks memory or spins a CPU core, every other model suffers. There's no way to say "give Llama 3.3 70B a maximum of 48 GB RAM and four CPU cores" without external tooling.
Docker Compose stacks help, but LXC gives you tighter OS-level isolation with less overhead than full VMs. Each container gets its own PID namespace, network stack, and cgroup hierarchy. You can limit CPU shares, set hard memory ceilings, and even pin containers to specific NUMA nodes — all from the Proxmox UI or CLI.
Architecture Overview
Here's the layout we'll build:
- lxc-ollama-general — Llama 3.2, Mistral 7B (general chat, low resource ceiling)
- lxc-ollama-heavy — Llama 3.3 70B, Qwen 72B (high-memory, restricted concurrency)
- lxc-ollama-code — Qwen Coder 32B, DeepSeek Coder (dedicated coding assistant)
- lxc-openwebui — Open WebUI hub connecting to all three Ollama endpoints
Each Ollama container exposes port 11434 on its own internal IP. Open WebUI connects to all three as separate "connections" and lets users pick which backend to route a conversation to.
[User Browser]
      │
      ▼
[lxc-openwebui :8080]
      │
      ├──▶ lxc-ollama-general :11434  (8B/7B models)
      ├──▶ lxc-ollama-heavy   :11434  (70B+ models)
      └──▶ lxc-ollama-code    :11434  (code models)
Creating the LXC Containers
We'll use Ubuntu 24.04 as the base template. Start by downloading it if you haven't already:
pveam update
pveam download local ubuntu-24.04-standard_24.04-2_amd64.tar.zst
Create the three Ollama containers. Adjust resource numbers based on your hardware — these values assume a machine with 128 GB RAM and a modern multi-core CPU:
# General models — lighter workload
pct create 200 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
--hostname lxc-ollama-general \
--memory 24576 \
--swap 4096 \
--cores 6 \
--rootfs local-lvm:40 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.0.200/24,gw=10.10.0.1 \
--unprivileged 1 \
--features nesting=1
# Heavy models — high memory ceiling
pct create 201 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
--hostname lxc-ollama-heavy \
--memory 81920 \
--swap 8192 \
--cores 12 \
--rootfs local-lvm:80 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.0.201/24,gw=10.10.0.1 \
--unprivileged 1 \
--features nesting=1
# Code models — dedicated inference
pct create 202 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
--hostname lxc-ollama-code \
--memory 40960 \
--swap 4096 \
--cores 8 \
--rootfs local-lvm:60 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.0.202/24,gw=10.10.0.1 \
--unprivileged 1 \
--features nesting=1
# Open WebUI hub
pct create 203 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
--hostname lxc-openwebui \
--memory 4096 \
--cores 2 \
--rootfs local-lvm:20 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.0.203/24,gw=10.10.0.1 \
--unprivileged 1 \
--features nesting=1
Start all four containers:
for id in 200 201 202 203; do pct start $id; done
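Before installing anything, it's worth confirming all four containers actually came up. A quick status sweep over the IDs used above (run on the Proxmox host):

```shell
# Print each container's state; all should report "running"
for id in 200 201 202 203; do
  state=$(pct status "$id" | awk '{print $2}')
  echo "CT $id: $state"
done
```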
Installing Ollama in Each Container
Install Ollama in each model container (200, 201, 202) with the same steps. Running them via pct exec from the host keeps it scriptable:
pct exec 200 -- bash -c '
apt-get update -qq && apt-get install -y curl
curl -fsSL https://ollama.com/install.sh | sh
systemctl enable ollama
systemctl start ollama
'
Repeat for containers 201 and 202. Then pull the models appropriate to each container:
# General container — lightweight models
pct exec 200 -- ollama pull llama3.2
pct exec 200 -- ollama pull mistral
# Heavy container — large models (this will take a while)
pct exec 201 -- ollama pull llama3.3:70b
pct exec 201 -- ollama pull qwen2.5:72b
# Code container
pct exec 202 -- ollama pull qwen2.5-coder:32b
pct exec 202 -- ollama pull deepseek-coder-v2
Binding Ollama to All Interfaces
By default, Ollama only listens on 127.0.0.1. For Open WebUI to reach it from another container, you need to bind to 0.0.0.0. Edit the systemd service override in each container:
pct exec 200 -- bash -c '
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
EOF
systemctl daemon-reload
systemctl restart ollama
'
For the heavy container, restrict parallel inference more aggressively to avoid OOM:
pct exec 201 -- bash -c '
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
EOF
systemctl daemon-reload
systemctl restart ollama
'
OLLAMA_MAX_LOADED_MODELS=1 ensures only one 70B model is warm at a time, preventing the container from trying to hold two massive models in RAM simultaneously.
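With the overrides applied, a quick way to confirm each backend is reachable on its container IP is to hit Ollama's /api/tags route from the Proxmox host. This is just a sanity check, using the IPs assigned at creation:

```shell
# Each endpoint should return a JSON list of its pulled models
for ip in 10.10.0.200 10.10.0.201 10.10.0.202; do
  echo "== $ip =="
  curl -fsS --connect-timeout 3 "http://$ip:11434/api/tags" || echo "unreachable"
done
```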
Setting Hard Resource Limits with cgroups
The LXC memory limit set at creation time is enforced by cgroups v2 — the kernel will OOM-kill processes in a container before they can consume more than the allocated RAM. You can verify cgroup enforcement is active:
pct exec 201 -- cat /sys/fs/cgroup/memory.max
# Should output: 85899345920 (80 GB in bytes)
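That figure is just the container's --memory value converted from MiB to bytes, which you can verify with shell arithmetic:

```shell
# 81920 MiB (the --memory setting for CT 201) expressed in bytes,
# which is what cgroup v2 memory.max reports
echo $((81920 * 1024 * 1024))
# → 85899345920
```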
For CPU throttling, Proxmox translates --cores into cpuset pins or CFS quota depending on your cgroup version. To pin container 201 to specific physical cores (useful on NUMA systems):
# Proxmox host — pin heavy container to cores 8-19
echo 'lxc.cgroup2.cpuset.cpus: 8-19' >> /etc/pve/lxc/201.conf
pct reboot 201
CPU Shares for Fair Scheduling
If containers compete for the same cores, CPU shares determine relative priority:
# In /etc/pve/lxc/200.conf — lower priority for general models
cpuunits: 512
# In /etc/pve/lxc/202.conf — higher priority for code models
cpuunits: 1024
The default is 1024 under cgroup v1; current Proxmox releases use cgroup v2, where the default weight is 100. Either way, only the ratio matters: a container with 512 shares gets half the CPU time of one with 1024 when both are contending.
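You can apply the same weights with pct set instead of editing the config files by hand (a sketch using the container IDs from this guide):

```shell
# Lower relative CPU priority for the general container,
# keep the code container at full weight
pct set 200 --cpuunits 512
pct set 202 --cpuunits 1024
# Under contention, CT 200 gets 512/(512+1024) ≈ 1/3 of the shared cores
```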
GPU Passthrough to LXC Containers
If your host has an NVIDIA or AMD GPU, you can pass it through to one or more LXC containers. Sharing a single GPU across multiple containers requires careful management — the containers' inference processes all contend for the same pool of VRAM unless you partition the card with NVIDIA's MIG (Multi-Instance GPU) feature on supported hardware.
NVIDIA GPU Passthrough
On the Proxmox host, install the NVIDIA driver and ensure the kernel modules load (on Proxmox's Debian base, nvidia-driver and firmware-misc-nonfree come from the non-free repos, which must be enabled in your APT sources):
apt install nvidia-driver firmware-misc-nonfree
# Reboot, then verify:
nvidia-smi
Add the GPU devices to your container config. Edit /etc/pve/lxc/201.conf (the heavy container is the best candidate for GPU access):
# Add these lines to /etc/pve/lxc/201.conf. 195 is the NVIDIA
# character-device major; the nvidia-uvm major (509 here) varies
# per system — check yours with: ls -l /dev/nvidia-uvm
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
Inside the container, install the NVIDIA userspace libraries (matching the host driver version exactly):
pct exec 201 -- bash -c '
apt-get update -qq && apt-get install -y nvidia-cuda-toolkit
# Verify GPU is visible:
nvidia-smi
'
Ollama will automatically detect and use the GPU once it's visible inside the container.
Sharing GPU Across Containers
If you want GPU access in multiple containers simultaneously, the simplest approach is time-slicing: only one container's inference job runs on GPU at a time, with CPU fallback for others. Ollama handles this transparently — if VRAM is full, it offloads layers to RAM.
For true parallel GPU sharing, NVIDIA MIG on A100/H100 cards lets you partition the GPU into isolated instances. That's beyond the scope of most homelabs but worth knowing exists.
Setting Up the Open WebUI Hub
In container 203, install Open WebUI with Docker or as a Python package. Using pip keeps the footprint small:
pct exec 203 -- bash -c '
apt-get update && apt-get install -y python3-pip python3-venv
python3 -m venv /opt/openwebui
/opt/openwebui/bin/pip install open-webui
'
Create a systemd service for Open WebUI:
pct exec 203 -- bash -c 'cat > /etc/systemd/system/openwebui.service <<EOF
[Unit]
Description=Open WebUI
After=network.target
[Service]
Type=simple
User=root
Environment="DATA_DIR=/var/lib/openwebui"
ExecStart=/opt/openwebui/bin/open-webui serve --host 0.0.0.0 --port 8080
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now openwebui
'
Once Open WebUI is running, access it at http://10.10.0.203:8080. On first launch, create your admin account, then navigate to Admin Panel → Settings → Connections and add each Ollama endpoint:
- http://10.10.0.200:11434 — General models
- http://10.10.0.201:11434 — Heavy models
- http://10.10.0.202:11434 — Code models
Open WebUI will enumerate the available models from each connection and present them as a unified model picker.
Proxmox Firewall Rules for Container Isolation
By default, containers on the same bridge can reach each other freely. That's fine for this setup, but you should restrict which external IPs can reach the heavy container's port 11434 — you don't want random LAN devices hammering your 70B model.
Enable the Proxmox firewall at the datacenter and node level, then add rules to /etc/pve/firewall/201.fw:
[RULES]
# Only allow Ollama access from the Open WebUI container
IN ACCEPT -source 10.10.0.203 -p tcp -dport 11434
IN DROP -p tcp -dport 11434
This blocks any host other than Open WebUI from hitting the heavy model endpoint directly.
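A quick negative test from any LAN machine other than the Open WebUI container confirms the rules took effect (a sanity check, not part of the setup):

```shell
# From an unauthorized host this should fail to connect
curl -s --connect-timeout 3 http://10.10.0.201:11434/api/tags \
  && echo "still reachable: check the rules" \
  || echo "blocked as expected"
```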
Monitoring Resource Usage Across Containers
Proxmox's built-in summary view shows per-container CPU and memory in real time. For a terminal-friendly view of all four containers at once:
# On the Proxmox host
watch -n 2 'for id in 200 201 202 203; do
echo "=== CT $id ==="
pct exec $id -- sh -c "free -h | grep Mem && top -bn1 | grep ollama | head -2"
done'
For proper long-term monitoring, integrate with Prometheus using the node_exporter inside each container and the Proxmox VE exporter on the host. If you already have a Grafana stack, dashboards like Node Exporter Full work out of the box.
Practical Tips for Multi-Model Management
Model storage: Ollama stores models in /root/.ollama/models by default. For large models, mount a dedicated dataset from your ZFS pool into each container to keep the rootfs lean:
# On Proxmox host — bind-mount ZFS dataset into container
zfs create rpool/ollama-heavy
echo 'mp0: /rpool/ollama-heavy,mp=/root/.ollama/models' >> /etc/pve/lxc/201.conf
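After adding the mount point, restart the container and confirm the dataset shows up where Ollama expects its model store:

```shell
# Reboot so mp0 takes effect, then verify the mount
pct reboot 201
pct exec 201 -- df -h /root/.ollama/models
```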
Automatic model unloading: Ollama keeps models warm in memory for 5 minutes by default. On the heavy container, reduce this to free VRAM faster:
Environment="OLLAMA_KEEP_ALIVE=60s"
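One way to apply it is to append the line to the override drop-in created earlier and restart the service (assuming that file already exists from the binding step):

```shell
pct exec 201 -- bash -c '
# Shorten model keep-alive to 60 seconds on the heavy container
echo "Environment=\"OLLAMA_KEEP_ALIVE=60s\"" \
  >> /etc/systemd/system/ollama.service.d/override.conf
systemctl daemon-reload
systemctl restart ollama
'
```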
Container snapshots: Before pulling a new large model, snapshot the container. If a model corrupts its cache or the pull fails mid-way, rolling back takes seconds:
pct snapshot 201 pre-qwen72b --description "Before Qwen 72B pull"
Log rotation: Ollama can generate verbose logs. Add a logrotate rule inside each container to prevent disk fill:
pct exec 200 -- bash -c 'cat > /etc/logrotate.d/ollama <<EOF
/var/log/ollama.log {
weekly
rotate 4
compress
missingok
}
EOF'
Conclusion
The per-container isolation pattern transforms a chaotic single-Ollama setup into something you can actually reason about. Each model stack has a fixed memory ceiling, its own process namespace, and independent restart behavior. When your 70B model OOMs, it takes down one container — not your coding assistant and Open WebUI along with it.
The Open WebUI hub approach also gives you a clean upgrade path: swap out one container's model lineup without touching the others, or add a fourth container for image generation models without restructuring anything. As AI inference workloads grow more demanding, this architecture scales naturally — just spin up another container, add it as a connection, and you're done.
Start with the general container if you're resource-constrained, validate the networking and resource limits work as expected, then layer in the heavy and code containers as your hardware allows.