Run Multiple AI Models on Proxmox LXC Containers

Learn how to isolate multiple AI models in separate Proxmox LXC containers with resource limits, shared GPU access, and an Open WebUI hub for multi-tenant inference.


Running a single Ollama instance on Proxmox works fine for personal use, but the moment you start juggling Llama 3, Mistral, Qwen, and a code model simultaneously, things get messy fast. Models compete for VRAM, one runaway inference job can starve everything else, and debugging becomes a nightmare. The solution is straightforward: give each model stack its own LXC container with hard resource limits, then front them all with a single Open WebUI instance that routes requests intelligently.

This guide walks you through exactly that architecture — multiple isolated Ollama LXC containers, cgroup-based resource limits, optional GPU passthrough, and a reverse-proxy hub. It works equally well on a single powerful node or a small cluster.

Why Isolate AI Models in Separate LXC Containers

Before diving into commands, it's worth understanding why container-per-model beats the alternatives.

A single Ollama instance with multiple models loaded shares one process namespace. If one model's inference thread leaks memory or spins a CPU core, every other model suffers. There's no way to say "give Llama 3.3 70B a maximum of 48 GB RAM and four CPU cores" without external tooling.

Docker Compose stacks help, but LXC gives you tighter OS-level isolation with less overhead than full VMs. Each container gets its own PID namespace, network stack, and cgroup hierarchy. You can limit CPU shares, set hard memory ceilings, and even pin containers to specific NUMA nodes — all from the Proxmox UI or CLI.
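
As a minimal sketch of what that looks like in practice (the container ID and numbers are illustrative), capping an existing container at 48 GB of RAM and four cores is a single pct command:

# Hard-cap an existing container (ID 201 used as an example) at 48 GiB RAM and 4 cores
pct set 201 --memory 49152 --cores 4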

Architecture Overview

Here's the layout we'll build:

  • lxc-ollama-general — Llama 3.2, Mistral 7B (general chat, low resource ceiling)
  • lxc-ollama-heavy — Llama 3.3 70B, Qwen 72B (high-memory, restricted concurrency)
  • lxc-ollama-code — Qwen Coder 32B, DeepSeek Coder (dedicated coding assistant)
  • lxc-openwebui — Open WebUI hub connecting to all three Ollama endpoints

Each Ollama container exposes port 11434 on its own internal IP. Open WebUI connects to all three as separate "connections" and lets users pick which backend to route a conversation to.

[User Browser]
      │
      ▼
[lxc-openwebui :8080]
      │
      ├──▶ lxc-ollama-general :11434  (8B/7B models)
      ├──▶ lxc-ollama-heavy   :11434  (70B+ models)
      └──▶ lxc-ollama-code    :11434  (code models)

Creating the LXC Containers

We'll use Ubuntu 24.04 as the base template. Start by downloading it if you haven't already:

pveam update
pveam download local ubuntu-24.04-standard_24.04-2_amd64.tar.zst

Create the three Ollama containers. Adjust resource numbers based on your hardware — these values assume a machine with 128 GB RAM and a modern multi-core CPU:

# General models — lighter workload
pct create 200 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
  --hostname lxc-ollama-general \
  --memory 24576 \
  --swap 4096 \
  --cores 6 \
  --rootfs local-lvm:40 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.0.200/24,gw=10.10.0.1 \
  --unprivileged 1 \
  --features nesting=1

# Heavy models — high memory ceiling
pct create 201 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
  --hostname lxc-ollama-heavy \
  --memory 81920 \
  --swap 8192 \
  --cores 12 \
  --rootfs local-lvm:80 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.0.201/24,gw=10.10.0.1 \
  --unprivileged 1 \
  --features nesting=1

# Code models — dedicated inference
pct create 202 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
  --hostname lxc-ollama-code \
  --memory 40960 \
  --swap 4096 \
  --cores 8 \
  --rootfs local-lvm:60 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.0.202/24,gw=10.10.0.1 \
  --unprivileged 1 \
  --features nesting=1

# Open WebUI hub
pct create 203 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
  --hostname lxc-openwebui \
  --memory 4096 \
  --cores 2 \
  --rootfs local-lvm:20 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.0.203/24,gw=10.10.0.1 \
  --unprivileged 1 \
  --features nesting=1

Start all four containers:

for id in 200 201 202 203; do pct start $id; done
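
Optionally, mark the containers to start automatically after a host reboot so the whole stack comes back on its own:

for id in 200 201 202 203; do pct set $id --onboot 1; done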

Installing Ollama in Each Container

Run the same installation steps in each Ollama container (200, 201, 202). Driving them with pct exec from the Proxmox host keeps it clean:

pct exec 200 -- bash -c '
apt-get update -qq && apt-get install -y curl
curl -fsSL https://ollama.com/install.sh | sh
systemctl enable ollama
systemctl start ollama
'
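
The same steps work for containers 201 and 202; a small loop from the host keeps it to one command:

for id in 201 202; do
  pct exec $id -- bash -c '
    apt-get update -qq && apt-get install -y curl
    curl -fsSL https://ollama.com/install.sh | sh
    systemctl enable ollama
    systemctl start ollama
  '
done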

Then pull the models appropriate to each container:

# General container — lightweight models
pct exec 200 -- ollama pull llama3.2
pct exec 200 -- ollama pull mistral

# Heavy container — large models (this will take a while)
pct exec 201 -- ollama pull llama3.3:70b
pct exec 201 -- ollama pull qwen2.5:72b

# Code container
pct exec 202 -- ollama pull qwen2.5-coder:32b
pct exec 202 -- ollama pull deepseek-coder-v2

Binding Ollama to All Interfaces

By default, Ollama only listens on 127.0.0.1. For Open WebUI to reach it from another container, you need to bind to 0.0.0.0. Create a systemd drop-in override in each container:

pct exec 200 -- bash -c '
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
EOF
systemctl daemon-reload
systemctl restart ollama
'

For the heavy container, restrict parallel inference more aggressively to avoid OOM:

pct exec 201 -- bash -c '
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
EOF
systemctl daemon-reload
systemctl restart ollama
'

OLLAMA_MAX_LOADED_MODELS=1 ensures only one 70B model is warm at a time, preventing the container from trying to hold two massive models in RAM simultaneously.
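
To confirm the override took effect, inspect the unit's effective environment and check which models are currently resident:

pct exec 201 -- systemctl show ollama --property=Environment
pct exec 201 -- ollama ps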

Setting Hard Resource Limits with cgroups

The LXC memory limit set at creation time is enforced by cgroups v2 — the kernel will OOM-kill processes in a container before they can consume more than the allocated RAM. You can verify cgroup enforcement is active:

pct exec 201 -- cat /sys/fs/cgroup/memory.max
# Should output: 85899345920  (80 GB in bytes)

For CPU throttling, Proxmox translates --cores into cpuset pins or CFS quota depending on your cgroup version. To pin container 201 to specific physical cores (useful on NUMA systems):

# Proxmox host — pin heavy container to cores 8-19
echo 'lxc.cgroup2.cpuset.cpus: 8-19' >> /etc/pve/lxc/201.conf
pct reboot 201

CPU Shares for Fair Scheduling

If containers compete for the same cores, CPU shares determine relative priority:

# In /etc/pve/lxc/200.conf — lower priority for general models
cpuunits: 512

# In /etc/pve/lxc/202.conf — higher priority for code models
cpuunits: 1024

The default is 1024 on cgroup v1 and 100 on cgroup v2, but the values are relative either way: a container with 512 shares gets half the CPU time of one with 1024 when both are contending.
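
If you prefer not to edit the config files by hand, the same values can be applied with pct set:

pct set 200 --cpuunits 512
pct set 202 --cpuunits 1024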

GPU Passthrough to LXC Containers

If your host has an NVIDIA or AMD GPU, you can pass it through to one or more LXC containers. Sharing a single GPU across multiple containers requires careful management — only one inference process can hold the VRAM at a time unless you use NVIDIA's MIG (Multi-Instance GPU) feature on supported hardware.

NVIDIA GPU Passthrough

On the Proxmox host, install the NVIDIA driver and ensure the kernel modules load:

# Proxmox host (Debian): nvidia-driver comes from the non-free repos, enable them first
apt install nvidia-driver firmware-misc-nonfree
# Reboot, then verify:
nvidia-smi

Add the GPU devices to your container config. Edit /etc/pve/lxc/201.conf (the heavy container is the best candidate for GPU access):

# Add these lines to /etc/pve/lxc/201.conf
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
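
The character-device major numbers (195 and 509 above) vary between driver versions. Check the actual majors on your host and adjust the devices.allow lines to match:

# On the Proxmox host: the number before the comma is the major to use in devices.allow
ls -l /dev/nvidia*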

Inside the container, install the NVIDIA userspace libraries (matching the host driver version exactly):

pct exec 201 -- bash -c '
# Userspace driver libraries and tools; match the version suffix to the host driver
# release (550 here is an example, check nvidia-smi on the host)
apt install -y libnvidia-compute-550 nvidia-utils-550
# Verify the GPU is visible:
nvidia-smi
'

Ollama will automatically detect and use the GPU once it's visible inside the container.
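
A quick sanity check, assuming the 70B model from earlier is present in CT 201: start a short generation and confirm the GPU is actually being used. The sleep value is a rough guess; large models can take longer to load.

pct exec 201 -- bash -c '
  ollama run llama3.3:70b "Say hello" >/dev/null &
  sleep 30        # give the model a moment to start loading
  nvidia-smi      # VRAM usage should climb as layers load
  ollama ps       # the PROCESSOR column should report GPU rather than CPU
'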

Sharing GPU Across Containers

If you want GPU access in multiple containers simultaneously, the simplest approach is time-slicing: only one container's inference job runs on GPU at a time, with CPU fallback for others. Ollama handles this transparently — if VRAM is full, it offloads layers to RAM.

For true parallel GPU sharing, NVIDIA MIG on A100/H100 cards lets you partition the GPU into isolated instances. That's beyond the scope of most homelabs but worth knowing exists.

Setting Up the Open WebUI Hub

In container 203, install Open WebUI with Docker or as a Python package. Using pip keeps the footprint small:

pct exec 203 -- bash -c '
apt-get update && apt-get install -y python3-pip python3-venv
python3 -m venv /opt/openwebui
/opt/openwebui/bin/pip install open-webui
'

Create a systemd service for Open WebUI:

pct exec 203 -- bash -c 'cat > /etc/systemd/system/openwebui.service <<EOF
[Unit]
Description=Open WebUI
After=network.target

[Service]
Type=simple
User=root
Environment="DATA_DIR=/var/lib/openwebui"
ExecStart=/opt/openwebui/bin/open-webui serve --host 0.0.0.0 --port 8080
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now openwebui
'

Once Open WebUI is running, access it at http://10.10.0.203:8080. On first launch, create your admin account, then navigate to Admin Panel → Settings → Connections and add each Ollama endpoint:

  • http://10.10.0.200:11434 — General models
  • http://10.10.0.201:11434 — Heavy models
  • http://10.10.0.202:11434 — Code models

Open WebUI will enumerate the available models from each connection and present them as a unified model picker.
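
If a backend does not show up, check from the hub container whether each Ollama endpoint is reachable and what it advertises (the /api/tags route is Ollama's model listing endpoint; curl must be installed in CT 203):

for ip in 10.10.0.200 10.10.0.201 10.10.0.202; do
  echo "=== $ip ==="
  pct exec 203 -- curl -s http://$ip:11434/api/tags
done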

Proxmox Firewall Rules for Container Isolation

By default, containers on the same bridge can reach each other freely. That's fine for this setup, but you should restrict which external IPs can reach the heavy container's port 11434 — you don't want random LAN devices hammering your 70B model.

Enable the Proxmox firewall at the datacenter and node level, then add rules to the guest firewall config at /etc/pve/firewall/201.fw:

[OPTIONS]
enable: 1

[RULES]
# Only allow Ollama access from Open WebUI container
IN ACCEPT -source 10.10.0.203 -p tcp -dport 11434
IN DROP -p tcp -dport 11434

This blocks any host other than Open WebUI from hitting the heavy model endpoint directly.
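
Guest firewall rules only apply to network devices that have the firewall flag set. Re-applying the net0 definition from the creation step with firewall=1 appended takes care of that:

pct set 201 --net0 name=eth0,bridge=vmbr0,ip=10.10.0.201/24,gw=10.10.0.1,firewall=1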

Monitoring Resource Usage Across Containers

Proxmox's built-in summary view shows per-container CPU and memory in real time. For a terminal-friendly view of all four containers at once:

# On the Proxmox host
watch -n 2 'for id in 200 201 202 203; do
  echo "=== CT $id ==="
  pct exec $id -- sh -c "free -h | grep Mem && top -bn1 | grep ollama | head -2"
done'

For proper long-term monitoring, integrate with Prometheus using the node_exporter inside each container and the Proxmox VE exporter on the host. If you already have a Grafana stack, dashboards like Node Exporter Full work out of the box.

Practical Tips for Multi-Model Management

Model storage: when installed with the official script, Ollama runs as the ollama system user and stores models under /usr/share/ollama/.ollama/models (it only uses /root/.ollama/models when run directly as root). For large models, mount a dedicated dataset from your ZFS pool into each container at that path to keep the rootfs lean:

# On Proxmox host — bind-mount ZFS dataset into container
zfs create rpool/ollama-heavy
echo 'mp0: /rpool/ollama-heavy,mp=/usr/share/ollama/.ollama/models' >> /etc/pve/lxc/201.conf
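
One caveat with bind mounts into unprivileged containers: UIDs are shifted, so container UID N appears as 100000 + N on the host by default. The dataset must be owned by the mapped UID of the ollama user; the UID below is purely an example.

# Find the ollama user's UID inside the container...
pct exec 201 -- id -u ollama
# ...then chown the dataset on the host to 100000 + that UID
# (if the command above prints 999, for example, use 100999)
chown -R 100999:100999 /rpool/ollama-heavy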

Automatic model unloading: Ollama keeps models warm in memory for 5 minutes by default. On the heavy container, reduce this to free VRAM faster:

Environment="OLLAMA_KEEP_ALIVE=60s"

Container snapshots: Before pulling a new large model, snapshot the container. If a model corrupts its cache or the pull fails mid-way, rolling back takes seconds:

pct snapshot 201 pre-qwen72b --description "Before Qwen 72B pull"
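
Rolling back is a single command:

pct rollback 201 pre-qwen72b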

Log growth: Ollama runs under systemd, so its output goes to the journal rather than a standalone log file. Cap the journal size inside each container to prevent disk fill:

pct exec 200 -- bash -c '
mkdir -p /etc/systemd/journald.conf.d
cat > /etc/systemd/journald.conf.d/size.conf <<EOF
[Journal]
SystemMaxUse=200M
EOF
systemctl restart systemd-journald
'

Conclusion

The per-container isolation pattern transforms a chaotic single-Ollama setup into something you can actually reason about. Each model stack has a fixed memory ceiling, its own process namespace, and independent restart behavior. When your 70B model OOMs, it takes down one container — not your coding assistant and Open WebUI along with it.

The Open WebUI hub approach also gives you a clean upgrade path: swap out one container's model lineup without touching the others, or add a fourth container for image generation models without restructuring anything. As AI inference workloads grow more demanding, this architecture scales naturally — just spin up another container, add it as a connection, and you're done.

Start with the general container if you're resource-constrained, validate the networking and resource limits work as expected, then layer in the heavy and code containers as your hardware allows.
