Proxmox Clustering: Setup, Quorum, and Live Migration

How to build a Proxmox cluster with proper quorum, corosync networking, HA manager, and live migration. Includes shared storage and failover testing.


Proxmox clustering is where things get serious. A single Proxmox node is great for a homelab or small deployment, but once you need high availability, live migration, or just the ability to manage multiple hosts from a single pane of glass, you're looking at a cluster.

I've built clusters ranging from 3-node homelabs to 16-node production environments. The technology is solid, but there are concepts you need to understand before you start joining nodes together — especially around quorum, because getting that wrong can take down your entire infrastructure.

When Clustering Makes Sense (And When It Doesn't)

Cluster if you have:

  • 3+ Proxmox nodes (or plan to get there soon)
  • A need for live migration (move VMs between hosts without downtime)
  • High availability requirements (VMs auto-restart on another node if one dies)
  • Multiple admins who need a unified management interface

Don't cluster if:

  • You have exactly 2 nodes (quorum issues — I'll explain why this is a trap)
  • Your nodes are in different geographic locations with unreliable links (corosync hates latency)
  • You just want shared storage without the management overhead (use standalone nodes with NFS)

A common mistake: people cluster two nodes thinking it gives them redundancy. It doesn't. With two nodes, losing one means losing quorum, and the surviving node will refuse to run VMs. I've gotten panicked messages from admins whose "highly available" two-node setup went completely dark because one node lost power. More on this later.

Network Requirements

Clustering adds a network dependency that doesn't exist with standalone nodes. Corosync — the cluster communication layer — needs reliable, low-latency connectivity between all nodes.

Dedicated Corosync Network

This is not optional in production. Run corosync on its own network, separate from VM traffic and management. Reasons:

  • VM traffic bursts can cause corosync timeouts, triggering false failovers
  • Corosync traffic is chatty, latency-sensitive UDP (unicast via the knet transport on current Proxmox releases)
  • If the corosync link goes down, the cluster loses quorum regardless of whether the nodes are actually healthy

My standard setup uses three networks per node:

Network   VLAN       Subnet           Purpose
vmbr0     untagged   192.168.1.0/24   Management + API
vmbr1     10         10.10.10.0/24    Corosync cluster
vmbr2     20         10.20.0.0/24     VM traffic + storage

Configure the corosync network on each node before creating the cluster. Edit /etc/network/interfaces:

auto eno2
iface eno2 inet static
    address 10.10.10.1/24

auto vmbr1
iface vmbr1 inet static
    address 10.10.10.1/24
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0

Adjust the address for each node: .1, .2, .3, etc.
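If you're configuring several nodes, a tiny templating helper keeps the stanzas consistent. This is a throwaway sketch, not a Proxmox tool: gen_corosync_iface is a hypothetical name, and eno2 plus the 10.10.10.0/24 subnet are just the example's assumptions. It prints the stanza instead of editing files, so you can review before applying:

```shell
# Print the corosync bridge stanza for node N (hypothetical helper;
# assumes the eno2 NIC and 10.10.10.0/24 subnet from the example above).
gen_corosync_iface() {
    local node_id="$1"
    cat <<EOF
auto vmbr1
iface vmbr1 inet static
    address 10.10.10.${node_id}/24
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
EOF
}

gen_corosync_iface 2   # stanza for the second node
```

Redirect the output into each node's /etc/network/interfaces after reviewing it, then apply as shown next.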

Apply the network config:

ifreload -a

Verify connectivity between nodes on the corosync network:

ping -c 3 10.10.10.2
ping -c 3 10.10.10.3

DNS and Hostname Resolution

Every node must be able to resolve every other node's hostname. The simplest approach is /etc/hosts:

192.168.1.10  pve1.lab.local pve1
192.168.1.11  pve2.lab.local pve2
192.168.1.12  pve3.lab.local pve3

Add these entries on all nodes. Corosync uses hostnames internally, and resolution failures cause cryptic cluster errors.
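It's worth scripting a resolution check rather than eyeballing /etc/hosts. A minimal sketch (check_resolution is a hypothetical helper; pve1 through pve3 are the example hostnames):

```shell
# Verify that every cluster hostname resolves locally; report any that don't.
check_resolution() {
    local missing=0
    for host in "$@"; do
        if ! getent hosts "$host" > /dev/null; then
            echo "UNRESOLVED: $host" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_resolution pve1 pve2 pve3 && echo "all nodes resolve" || echo "fix /etc/hosts first"
```

Run it on every node, since each node keeps its own /etc/hosts.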

Creating the Cluster

On the first node (this becomes the initial cluster member):

pvecm create lab-cluster --link0 10.10.10.1

The --link0 flag specifies the corosync bind address. If you want redundant corosync links (highly recommended for production), add a second:

pvecm create lab-cluster --link0 10.10.10.1 --link1 192.168.1.10

Corosync will use link0 as primary and fail over to link1 if the primary goes down.

Verify the cluster was created:

pvecm status
Cluster information
-------------------
Name:             lab-cluster
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Mar  4 14:23:17 2026
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.1 (local)

Joining Nodes

On each additional node, join the cluster. You need the root password of the first node:

pvecm add 192.168.1.10 --link0 10.10.10.2

Replace 192.168.1.10 with the management IP of the first node, and 10.10.10.2 with this node's corosync address.

If you set up dual corosync links:

pvecm add 192.168.1.10 --link0 10.10.10.2 --link1 192.168.1.11

The join process copies the cluster configuration, SSH keys, and certificates. It takes about 10-20 seconds.

After joining all nodes, verify the cluster status:

pvecm status
Cluster information
-------------------
Name:             lab-cluster
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Mar  4 14:31:42 2026
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.28
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.1 (local)
0x00000002          1 10.10.10.2
0x00000003          1 10.10.10.3

Three nodes, quorum of 2. This is the sweet spot.
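That Quorate flag is also the thing to monitor. A minimal sketch for a cron or monitoring check: on a live node you'd pipe `pvecm status` straight in; here the relevant line is pasted in for illustration.

```shell
# Exit 0 if pvecm-status-style output on stdin reports "Quorate: Yes", else 1.
check_quorate() {
    awk -F': *' '/^Quorate/ { q = $2 } END { if (q == "Yes") exit 0; exit 1 }'
}

# Live usage would be: pvecm status | check_quorate
if printf 'Quorate:          Yes\n' | check_quorate; then
    echo "cluster is quorate"
else
    echo "ALERT: quorum lost"
fi
```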

Quorum: Why It Matters

Quorum is the minimum number of votes needed for the cluster to operate. The formula is simple:

quorum = (total_nodes / 2) + 1     # integer division

  • 3 nodes: quorum = 2 (can lose 1 node)
  • 5 nodes: quorum = 3 (can lose 2 nodes)
  • 7 nodes: quorum = 4 (can lose 3 nodes)
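Shell's integer division makes the formula directly checkable. A throwaway sketch:

```shell
# quorum(N) = floor(N/2) + 1; $(( )) arithmetic is already integer division
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 2 3 4 5 7; do
    printf '%d nodes: quorum %d, tolerates %d failure(s)\n' \
        "$n" "$(quorum "$n")" "$(( n - $(quorum "$n") ))"
done
```

The n=2 row is the two-node trap in numbers: quorum is 2, so the cluster tolerates zero failures.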

Why You Need Odd Numbers

With an even number of nodes, a network partition can split the cluster into two equal halves, and neither half has quorum. Both sides stop running VMs. This is called a split-brain scenario, and it's exactly what quorum is designed to prevent — but with even numbers, it can prevent both halves from operating instead of just one.

With odd numbers, one side always has more votes than the other.

The Two-Node Trap

Here's the scenario I've seen play out multiple times:

You have two Proxmox nodes in a cluster. Node 2 loses power. Node 1 is perfectly healthy with all your VMs running. But node 1 only has 1 vote out of 2 expected, and quorum requires 2 votes. Node 1 stops all VMs and refuses to run them.

Your "highly available" cluster just went fully offline because one node failed. The exact opposite of what you wanted.

The reactions I've seen range from confusion to genuine panic, especially when it's a production environment with clients waiting.

Workarounds for two-node setups:

  1. QDevice (recommended) — A lightweight external arbitrator that provides an additional vote. Run it on any Linux box, even a Raspberry Pi:
# On the QDevice host (a separate Linux machine)
apt install corosync-qnetd

# On every Proxmox node
apt install corosync-qdevice

# Then, on one of the Proxmox nodes
pvecm qdevice setup 192.168.1.100

The QDevice gives you an effective 3-vote cluster with only 2 physical Proxmox nodes. If one node dies, the surviving node plus the QDevice still have quorum.

  2. Force quorum (emergency only) — If you're already in the situation where a surviving node has lost quorum:
pvecm expected 1

This tells the cluster "1 vote is enough." Use this only as an emergency measure, and remove it when the second node returns. Running permanently with expected 1 defeats the purpose of clustering — you'll have no split-brain protection.

Fencing and STONITH

Fencing is the mechanism that ensures a failed node is truly dead before its VMs are started elsewhere. Without fencing, you risk two nodes running the same VM simultaneously, which corrupts shared storage.

STONITH (Shoot The Other Node In The Head) is the aggressive but effective approach: when a node is unresponsive, the cluster power-cycles it via IPMI/iLO/iDRAC, a managed PDU, or a similar out-of-band mechanism.

Proxmox's HA manager includes a watchdog-based fencing mechanism. Each node runs the watchdog-mux service that periodically resets a hardware or software watchdog timer. If the HA service stops feeding the watchdog (because the node crashed, hung, or lost cluster communication), the watchdog triggers a hard reset.

Configure the watchdog:

# Check current watchdog
cat /etc/default/pve-ha-manager

For production, use a hardware watchdog if available (most server-class motherboards have one):

# Check for hardware watchdog
ls /dev/watchdog*
dmesg | grep -i watchdog

The default software watchdog (softdog) works but can be bypassed by a kernel panic. Hardware watchdogs can't — they reset the machine regardless of the OS state.

HA Manager Setup

The HA (High Availability) manager automates VM failover. When a node goes down and fencing confirms it's offline, HA restarts the protected VMs on surviving nodes.

Configuring HA Groups

HA groups define which nodes can run a particular set of VMs:

Through the web UI: Datacenter > HA > Groups > Create.

Or via CLI:

ha-manager groupadd primary-nodes --nodes pve1,pve2,pve3 --restricted 0 --nofailback 0

  • --restricted 0: VMs can run on any cluster node, but prefer group members
  • --nofailback 0: VMs migrate back to preferred nodes when they come back online

Adding VMs to HA

ha-manager add vm:100 --group primary-nodes --state started --max_restart 3 --max_relocate 2

This tells the HA manager:

  • VM 100 should always be running (--state started)
  • Try restarting it up to 3 times on the current node before giving up
  • Relocate to another node up to 2 times
  • Prefer nodes in the primary-nodes group

Verify HA status:

ha-manager status
quorum OK, master pve1 (active since Thu Mar  4 14:45:22 2026)
vm:100   pve1   started
vm:101   pve2   started
vm:102   pve1   started

Testing HA Failover

Don't wait for a real failure to discover your HA setup doesn't work. Test it deliberately:

  1. Note which node each HA-managed VM is running on
  2. Trigger a controlled failover by stopping the pve-ha-crm service:
# On pve1
systemctl stop pve-ha-crm

Or, for a more realistic test, reboot the node:

reboot
  3. Watch the other nodes' logs:
# On pve2
journalctl -u pve-ha-crm -f

You should see messages about the fencing timeout elapsing, the node being fenced, and VMs being relocated:

Mar  4 15:02:34 pve2 pve-ha-crm[1842]: node 'pve1': state changed from 'online' to 'unknown'
Mar  4 15:02:49 pve2 pve-ha-crm[1842]: node 'pve1': state changed from 'unknown' to 'fence'
Mar  4 15:03:11 pve2 pve-ha-crm[1842]: node 'pve1': fencing successful
Mar  4 15:03:12 pve2 pve-ha-crm[1842]: relocate vm:100 to node 'pve2'
Mar  4 15:03:15 pve2 pve-ha-crm[1842]: service vm:100: state changed from 'fence' to 'started' (node = pve2)

The default fence timeout is 60 seconds. You can tune this, but going much lower than 30 seconds risks false positives.

Live Migration

Live migration is the reason most people set up a cluster. Moving a running VM from one node to another with zero downtime.

How It Works

  1. The source node starts copying the VM's memory to the destination node
  2. While copying, the VM keeps running and dirtying pages
  3. The source re-sends any pages that changed since the last pass (iterative convergence)
  4. When the remaining dirty pages are small enough, the VM is briefly paused (typically 20-100ms)
  5. Final pages are transferred, the VM resumes on the destination node
  6. The source releases resources

For the guest, the pause is imperceptible. Established TCP connections survive. Running services don't notice.
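Steps 1-4 are an iterative pre-copy loop, and a toy model shows why it only converges when the link outruns the guest's dirty rate. This is not how QEMU actually schedules passes, just the geometry of it; all figures are MiB and MiB/s, and purely illustrative:

```shell
# Toy pre-copy model: each pass transfers everything outstanding while the
# guest dirties pages at dirty_rate; what got dirtied becomes the next pass.
# Shrinks only when link_speed > dirty_rate.
migration_passes() {
    local remaining=$1 link_speed=$2 dirty_rate=$3 passes=0
    while [ "$remaining" -gt 100 ] && [ "$passes" -lt 30 ]; do
        remaining=$(( remaining * dirty_rate / link_speed ))
        passes=$(( passes + 1 ))
    done
    echo "$passes"
}

# 8 GiB of guest RAM, ~10 GbE (1200 MiB/s), 100 MiB/s dirty rate: a few passes
migration_passes 8192 1200 100
# Same VM on ~1 GbE (120 MiB/s): dozens of passes before the pause threshold
migration_passes 8192 120 100
```

The 100 MiB threshold stands in for the point where the hypervisor pauses the VM and ships the remainder; once dirty_rate matches or exceeds link_speed, the loop never shrinks at all.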

Requirements for Live Migration

Shared storage is mandatory. Both nodes must access the same storage backend for the VM's disks. The options:

  • Ceph (RBD) — Distributed storage across your Proxmox nodes. Best for clusters that want self-contained storage without external dependencies. Requires at least 3 nodes and dedicated SSDs/NVMes for OSDs.
  • NFS — Simple and reliable. Mount the same NFS export on all nodes. Works well with a TrueNAS or Synology as the storage server.
  • iSCSI — Block-level shared storage. Pair with LVM for multi-VM support on a single iSCSI target.
  • GlusterFS — Distributed filesystem, similar concept to Ceph but filesystem-based rather than block-based.

Local storage (LVM, ZFS on local disks, directory storage) doesn't support standard live migration — the destination node can't access the source node's local disks. Recent Proxmox versions can live-migrate local disks with qm migrate --online --with-local-disks, which streams the disk contents over the network during the migration, but it's slow and bandwidth-heavy. The traditional fallback is offline migration, which copies the disk over the network with the VM stopped — that means downtime.

Performing Live Migration

Via the web UI: right-click a VM > Migrate > select target node > Online migration.

Via CLI:

qm migrate 100 pve2 --online

Monitor the progress:

# On the source node, watch the task log
tail -f /var/log/pve/tasks/active

Typical output during migration:

2026-03-04 15:12:33 starting migration of VM 100 to node 'pve2' (192.168.1.11)
2026-03-04 15:12:33 using storage 'ceph-pool' on both source and target
2026-03-04 15:12:34 start migrate disk drive-scsi0 (ceph-pool:vm-100-disk-0) -- already on same storage
2026-03-04 15:12:34 starting VM 100 on remote node 'pve2'
2026-03-04 15:12:36 migration active, transferred 0 bytes of 8.0 GiB VM-state
2026-03-04 15:12:38 migration active, transferred 524.3 MiB of 8.0 GiB VM-state
2026-03-04 15:12:40 migration active, transferred 2.1 GiB of 8.0 GiB VM-state
2026-03-04 15:12:44 migration active, transferred 6.8 GiB of 8.0 GiB VM-state
2026-03-04 15:12:45 migration status: completed
2026-03-04 15:12:45 migration finished successfully (duration 00:00:12)

12 seconds for an 8 GB VM on a 10 GbE link. The VM was running the entire time.

Migration Performance Tips

Network bandwidth matters. Migration speed is directly proportional to the bandwidth between nodes. On 1 GbE, migrating a 32 GB VM takes forever and might even fail if the VM is dirtying memory faster than 1 Gbps. I won't run live migration on anything less than 10 GbE.

Set a dedicated migration network to avoid saturating your management link:

# On each node, in /etc/pve/datacenter.cfg
migration: network=10.20.0.0/24,type=secure

Reduce memory churn. VMs that are actively writing large amounts of memory (database commits, video encoding) are harder to migrate because dirty pages accumulate faster than they can be transferred. The migration will converge eventually, but the downtime window during final switchover might be longer.

CPU compatibility. The destination node must support all CPU features the VM is using. If your VM is configured with cpu: host and the source node is a newer CPU generation than the destination, migration will fail. For heterogeneous clusters, use a specific CPU model like x86-64-v3 instead of host.

Shared Storage with Ceph

If you don't have external storage infrastructure, Ceph is the natural choice for Proxmox clusters. Each node contributes local disks to a distributed storage pool.

Minimum Requirements

  • 3 nodes (Ceph needs 3 replicas by default)
  • Dedicated SSD/NVMe on each node for Ceph OSDs (don't share with the OS disk)
  • 10 GbE between nodes (Ceph replication is bandwidth-heavy)
  • 1-2 GB RAM per TB of raw Ceph storage
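The RAM line item is the one people undersize, so run the arithmetic before ordering hardware. A trivial sketch of the 1-2 GB per TB rule from the list above (ceph_ram_range is a hypothetical helper):

```shell
# RAM to budget for Ceph per node: roughly 1-2 GB per TB of raw OSD storage.
ceph_ram_range() { echo "${1}-$(( $1 * 2 )) GB"; }

# e.g. two 2 TB NVMe OSDs per node = 4 TB raw
ceph_ram_range 4
```

And that's on top of whatever your VMs and Proxmox itself need.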

Quick Ceph Setup

Install Ceph on all nodes via the web UI: Datacenter > node > Ceph > Install.

Or via CLI on each node:

pveceph install --repository no-subscription

Create a Ceph monitor on each node:

pveceph mon create

Create OSDs from your dedicated disks:

# On each node, for each dedicated disk
pveceph osd create /dev/nvme1n1

Create a storage pool:

pveceph pool create vm-pool --size 3 --min_size 2 --pg_autoscale_mode on

This creates a pool with 3 replicas (data survives 2 node failures) and a minimum of 2 replicas to accept writes (operates with 1 node down).
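A consequence of size 3 that surprises people at purchase time: usable capacity is raw capacity divided by the replica count, before Ceph's own overhead. A one-liner to keep handy (usable_tb is a hypothetical helper):

```shell
# Usable capacity of a replicated pool: raw TB divided by replica count (size).
usable_tb() { echo $(( $1 / $2 )); }

echo "$(usable_tb 6 3) TB usable from 6 TB raw at size=3"
```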

Verify the cluster health:

ceph status
  cluster:
    id:     a1b2c3d4-e5f6-7890-abcd-ef1234567890
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 2h)
    mgr: pve1(active, since 2h), standbys: pve2, pve3
    osd: 6 osds: 6 up (since 2h), 6 in (since 2h)

  data:
    pools:   1 pools, 128 pgs
    objects: 1.24k objects, 4.8 GiB
    usage:   18 GiB used, 5.5 TiB / 5.5 TiB avail
    pgs:     128 active+clean

HEALTH_OK is what you want. With Ceph running, you can now create VMs on the vm-pool storage and live migrate them between any nodes in the cluster.

Cluster Maintenance

Rolling Updates

Update Proxmox one node at a time. For each node:

  1. Migrate all VMs off the node (or let HA relocate them)
  2. Run the update:
apt update && apt full-upgrade -y
  3. Reboot if a kernel update was installed
  4. Verify the node rejoins the cluster cleanly
  5. Migrate some VMs back
  6. Move to the next node

Never update all nodes simultaneously. If the update breaks something, you want working nodes to fall back on.
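If you script the rolling update, sequence it explicitly and keep a dry-run mode. A sketch: rolling_update is hypothetical, the echo lines stand in for real ssh and migration commands, and before going live you'd add a quorum check (pvecm status) between nodes:

```shell
# Dry-run rolling-update sequencer: prints the per-node plan, in order.
# Replace the echoes with the real commands once the sequence looks right.
rolling_update() {
    for node in "$@"; do
        echo "[$node] migrate all VMs off (or let HA relocate them)"
        echo "[$node] apt update && apt full-upgrade -y"
        echo "[$node] reboot if kernel changed, then wait for cluster rejoin"
    done
}

rolling_update pve1 pve2 pve3
```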

Removing a Node

If you need to permanently remove a node from the cluster:

  1. Migrate all VMs and containers off the node
  2. Remove any HA resources assigned to it
  3. On one of the remaining nodes (never on the node being removed), delete it from the cluster:

# On pve1 or pve2, NOT on pve3
pvecm delnode pve3

Then on the removed node, reset it to standalone:

systemctl stop pve-cluster corosync
pmxcfs -l
rm -f /etc/corosync/*
rm -f /etc/pve/corosync.conf
killall pmxcfs
systemctl start pve-cluster

I've seen people run delnode on the wrong node and end up with a broken cluster. Always double-check which node you're running commands on.

Wrapping Up

Proxmox clustering is powerful but demands respect for the fundamentals. The technology works — I've had clusters running for years with automatic failover saving me from hardware failures multiple times. But every one of those successes was built on proper planning: odd node counts, dedicated corosync networks, shared storage, and tested failover procedures.

If you're starting fresh, my advice is: build a 3-node cluster, use Ceph or NFS for shared storage, set up a couple of test VMs with HA, and deliberately break things. Pull power cables, kill network interfaces, force-reboot nodes. Find out how your cluster behaves under failure conditions in your lab, not in production at 2 AM.

And if you're running two nodes — get a QDevice. Seriously. Your future self will thank you.


Written by

Proxmox Pulse

Sysadmin-driven guides for getting the most out of Proxmox VE in production and homelab environments.
