Proxmox High Availability Setup for Automatic VM Failover
Set up Proxmox HA Manager to automatically restart VMs after a node failure. Covers fencing requirements, HA group config, and live failover testing on PVE 9.1.
Proxmox High Availability Manager restarts your VMs automatically on a surviving node within about 60-90 seconds of detecting a node failure — no manual intervention, no SSH session at 3am. By the end of this guide, you'll have a working HA cluster with properly configured fencing, HA groups, and a tested failover. I'm running this on a three-node Proxmox VE 9.1 cluster with Ceph shared storage, but the procedure is identical for iSCSI or NFS-backed clusters.
Key Takeaways
- 3 nodes minimum: Two-node clusters can't maintain quorum after a single failure — HA needs a majority vote to proceed.
- Fencing is mandatory: Without a working watchdog or IPMI fence agent, Proxmox HA will refuse to restart VMs to avoid split-brain data corruption.
- Shared storage required: VMs must live on storage accessible from all nodes — Ceph, iSCSI, NFS, or shared ZFS over FC.
- Recovery takes 60-90 seconds: The delay is deliberate — Proxmox waits for fencing confirmation before restarting anything.
- Test with a hard power-off: A graceful shutdown doesn't replicate a real failure scenario.
How Proxmox HA Actually Works
Proxmox HA runs on two daemons: pve-ha-lrm (Local Resource Manager, one per node) and pve-ha-crm (Cluster Resource Manager, one elected leader per cluster). The CRM watches resource states; the LRM executes commands on its local node.
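You can watch both daemons from any node. Both commands below are standard Proxmox tooling; ha-manager status reports the elected CRM master, each node's LRM state, and every managed resource:
# Confirm both HA daemons are active on this node
systemctl is-active pve-ha-crm pve-ha-lrm
# Cluster-wide view: quorum state, elected CRM master, per-node LRM
# status, and the current state of each HA-managed resource
ha-manager status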
When a node goes down, the sequence is:
- Corosync marks the node unreachable after missed heartbeats.
- The CRM waits for fencing confirmation — either a watchdog reset or an IPMI power-cycle that proves the failed node is genuinely off.
- Once fenced, the CRM issues relocate or restart commands for all HA-managed VMs.
- The LRM on a surviving node starts each VM from the shared storage pool.
Step 2 is where most misconfigured HA setups stall. Without fencing, the CRM correctly refuses to restart VMs — the original node might still be running and holding disk locks, and starting a second instance would corrupt the VM's filesystem.
Why the Three-Node Minimum Matters
Corosync requires a majority (quorum) to operate. With two nodes, losing one leaves you at exactly 50% — no majority, cluster services halt. With three nodes, losing one leaves you at 66% — quorum maintained, HA proceeds normally.
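You can read the vote math straight from corosync. On a healthy three-node cluster, the votequorum section reports three expected votes and a quorum threshold of two:
# Quorum section shows Expected votes: 3, Quorum: 2, Flags: Quorate
pvecm status | grep -iA8 votequorum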
You can work around a two-node cluster with a lightweight qdevice (a tie-breaker service running on something like a Raspberry Pi), but three nodes is the cleaner path. If you're starting from scratch, the guide on building a private Proxmox cloud at home walks through the full multi-node cluster setup prerequisites.
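If you must stay at two nodes, the qdevice setup is short. This is a sketch: 192.168.1.10 stands in for your tie-breaker host, which is assumed to run Debian (or similar) and be reachable over SSH as root:
# On the tie-breaker host (e.g. a Raspberry Pi): run the arbiter daemon
apt install corosync-qnetd
# On every cluster node: install the qdevice client
apt install corosync-qdevice
# On one cluster node: register the tie-breaker with the cluster
pvecm qdevice setup 192.168.1.10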
What You Need Before Enabling HA
Check all of these before touching the HA configuration panel. Missing any one of them produces an HA setup that looks active but silently fails when you actually need it.
Cluster:
- Three or more PVE 9.1 nodes in the same cluster
- Corosync heartbeat latency under 5ms — use a dedicated cluster NIC if you can
- Synchronized time on all nodes: run chronyc tracking and confirm offset under 100ms (a multi-node sweep follows this list)
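A quick loop checks time sync everywhere at once; the node names are assumptions, substitute your own:
# Offset and leap status on every node (requires root SSH between nodes)
for node in pve1 pve2 pve3; do
    echo "== $node =="
    ssh root@$node "chronyc tracking | grep -E 'System time|Leap status'"
done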
Storage:
- Target VMs must use shared storage: Ceph RBD, iSCSI, NFS, or Fibre Channel
- Local storage (local-lvm, local-zfs) silently disqualifies a VM from HA eligibility; the command below shows which of your storages qualify
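If you are not sure which storages count as shared, the storage type is the tell:
# rbd, cephfs, nfs, and iscsi storages are cluster-wide by nature;
# dir, lvmthin, and zfspool are node-local unless explicitly marked shared
pvesm status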
Fencing:
- A hardware watchdog device at /dev/watchdog or /dev/watchdog0
- Or IPMI/iDRAC/iLO configured as a fence agent with tested, working credentials
Verify your watchdog device is present:
ls /dev/watchdog*
If nothing appears, load the software fallback as a stopgap (acceptable for testing, not for production):
modprobe softdog
echo "softdog" >> /etc/modules
Configure the Hardware Watchdog with watchdog-mux
Proxmox ships watchdog-mux, a daemon that multiplexes the watchdog device so multiple HA processes can share it safely. It must be running on every cluster node.
Check and enable it:
systemctl status watchdog-mux
systemctl enable --now watchdog-mux
Verify the LRM connected to it:
journalctl -u pve-ha-lrm --since "5 minutes ago" | grep -i watchdog
You should see a line confirming the LRM opened /run/watchdog-mux.sock. Errors here mean fencing is broken and recovery will hang indefinitely.
The watchdog timeout itself is fixed at about 60 seconds and is deliberately not configurable: a shorter window would trip on transient network blips, a longer one would delay recovery. What is configurable is which watchdog module loads. Proxmox blocks all hardware watchdog modules by default (a misconfigured one can reset a healthy node) and falls back to softdog; to use your board's hardware watchdog, name the module in /etc/default/pve-ha-manager and reboot the node:
# /etc/default/pve-ha-manager
WATCHDOG_MODULE=iTCO_wdt
Setting Up IPMI Fencing for Bare-Metal Nodes
For bare-metal servers with IPMI — which covers most enterprise hardware and many homelab boards — IPMI fencing is more reliable than a software watchdog alone. It gives you hard power control even when the OS is completely unresponsive.
Install the fence agents package on all nodes:
apt install fence-agents
Test your BMC credentials before configuring anything:
fence_ipmilan -a 192.168.1.52 -l admin -p yourpassword -o status
Expected output: Status: ON. If this fails, fix IPMI access first — there is no point configuring HA fencing around a broken BMC connection. While you're securing IPMI access, make sure it's restricted to your management VLAN; the Proxmox hardening guide has practical firewall rules for exactly this scenario.
Note that Proxmox HA fences through the watchdog by default; hardware fence agents are an experimental addition, and there is no per-node pvesh setting for them. To try them, switch the fencing mode in /etc/pve/datacenter.cfg:
# /etc/pve/datacenter.cfg
fencing: hardware
(use fencing: both to keep watchdog fencing as a fallback), then declare your fence devices in /etc/pve/ha/fence.cfg. The entry below is illustrative only; verify the exact syntax against the ha-manager documentation for your version:
# /etc/pve/ha/fence.cfg (illustrative; check your ha-manager docs)
device ipmi-pve2 fence_ipmilan ip="192.168.1.52" username="admin" password="yourpassword" lanplus=1
connect ipmi-pve2 node="pve2"
Because hardware fencing is marked experimental upstream, most production clusters run watchdog fencing and keep IPMI for manual intervention and failover testing.
How to Create HA Groups and Enroll VMs
Create an HA Group
HA groups control which nodes are eligible to run a set of VMs and in what priority order. Navigate to Datacenter → HA → Groups → Add, or use the API:
pvesh create /cluster/ha/groups \
--group critical-vms \
--nodes "pve1:3,pve2:2,pve3:1"
The trailing number is priority — higher wins. Equal priority means Proxmox picks the surviving node arbitrarily.
| Option | Effect |
|---|---|
| restricted | VMs only ever run on nodes listed in this group |
| nofailback | VMs don't migrate back when the preferred node recovers |
| Node priority | Determines which surviving node receives the VM first |
Start with one group containing all nodes at equal priority. Tune after watching real failovers.
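As a concrete command, that starter group looks like this (the name all-nodes is an assumption; omitting per-node priorities makes them equal):
pvesh create /cluster/ha/groups \
    --group all-nodes \
    --nodes "pve1,pve2,pve3"
# Append --restricted 1 or --nofailback 1 to set the options from the table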
Add VMs and Containers to the HA Group
In the web UI: select a VM, click More → Manage HA. Or with the CLI:
pvesh create /cluster/ha/resources \
--sid vm:101 \
--group critical-vms \
--state started \
--max_restart 3 \
--max_relocate 3
- --state started: the desired state HA will actively maintain
- --max_restart: restart attempts on the current node before escalating to relocation
- --max_relocate: relocation attempts across nodes before marking the resource failed
LXC containers use --sid ct:102. Confirm all enrolled resources:
pvesh get /cluster/ha/resources
Before adding a VM, always verify its disk is on shared storage:
qm config 101 | grep -E "^(scsi|virtio|ide|sata)"
# You want output like:
# scsi0: ceph-pool:vm-101-disk-0,size=32G
# Not:
# scsi0: local-lvm:vm-101-disk-0,size=32G
A VM on local-lvm appears enrolled and healthy in the HA panel, then silently fails to recover when you need it most. There is no warning at enrollment time.
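A loop like this catches the mistake across the whole HA config. It's a sketch: it assumes your HA resources are VMs (ct: entries would need pct config instead) and that your local storage IDs contain "local":
# Flag any HA-enrolled VM with a disk on local storage
for vmid in $(ha-manager config | grep -oP '^vm:\K[0-9]+'); do
    if qm config "$vmid" | grep -E '^(scsi|virtio|ide|sata)' \
            | grep -v 'media=cdrom' | grep -q 'local'; then
        echo "WARNING: vm:$vmid has a disk on local storage"
    fi
done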
How to Test HA Failover the Right Way
Do not use systemctl poweroff to test failover. A clean shutdown lets the node announce its departure to the cluster, which changes how the CRM handles the transition — it's not a realistic crash simulation.
Use a hard power-off instead. From a machine with IPMI access:
ipmitool -H 192.168.1.51 -U admin -P yourpassword chassis power off
Alternatively, on a dedicated test node, force a kernel panic:
# WARNING: This immediately crashes the system. Test nodes only.
# Debian's default sysrq mask blocks the crash trigger, so enable it first
sysctl -w kernel.sysrq=1
echo c > /proc/sysrq-trigger
Watch recovery in real time from a surviving node:
watch -n2 "pvesh get /cluster/ha/status/current"
Expected timeline:
- 0-30s: Corosync detects the absent node, CRM initiates fencing
- 30-60s: Watchdog resets the failed node, or IPMI confirms power-off
- 60-90s: CRM issues relocation commands; LRM brings VMs online on the surviving node
If the status stays in recovery past 90 seconds, the CRM is waiting on a fencing confirmation that never arrived:
journalctl -u pve-ha-crm -f
The log will tell you exactly which fence operation stalled. It's almost always either watchdog-mux not running on every node after a reboot, or stale IPMI credentials.
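Both failure modes are cheap to rule out. A sweep like this (node names assumed) verifies the fencing prerequisites after every reboot:
# watchdog-mux and the LRM must be active on every node
for node in pve1 pve2 pve3; do
    echo "== $node =="
    ssh root@$node "systemctl is-active watchdog-mux pve-ha-lrm"
done
# Re-test BMC credentials while you're at it
fence_ipmilan -a 192.168.1.52 -l admin -p yourpassword -o status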
Common HA Mistakes to Avoid
VM on local storage. Enrolled in HA, appears healthy, fails silently on recovery. Verify storage before adding any resource.
Skipping the IPMI fence test. fence_ipmilan ... -o status takes 10 seconds to run. Skipping it takes hours to debug when HA stalls at 3am.
Two nodes without a qdevice. One failure, no quorum, HA freezes. Either add a third node or deploy corosync-qnetd on a lightweight device before relying on HA for anything real.
NTP drift. Corosync is sensitive to clock skew. Offset over a few hundred milliseconds triggers spurious node-unreachable events. Run timedatectl status on each node and confirm NTP is active and synced.
max_restart set to 1. A VM that needs 45 seconds to complete its startup health check will relocate unnecessarily on the first failed check. Set max_restart to at least 3 for non-trivial workloads.
No N-1 capacity planning. HA restarts VMs, but if surviving nodes are already at 90% RAM utilization, the VMs fail to start anyway. For a three-node cluster with 128 GB per node, plan as though any single node may be absent — cap total allocated RAM at 256 GB.
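A rough audit of allocated RAM is one line. This assumes jq is installed and counts configured maximum memory (maxmem, reported in bytes), not actual usage:
# Total RAM allocated to all guests, in GiB; keep this under
# (node count - 1) x RAM per node
pvesh get /cluster/resources --type vm --output-format json \
    | jq '([.[] | .maxmem] | add) / 1073741824'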
Conclusion
With watchdog-mux confirmed running, shared storage in place, and VMs enrolled in HA groups, Proxmox automatically recovers critical workloads within 90 seconds of a node failure. Fencing isn't bureaucratic overhead — it's the safety mechanism that makes corruption-free restarts possible. Run the hard power-off test before you declare success.
Once HA is protecting your VMs at the infrastructure level, add point-in-time recovery at the data level: schedule regular backups via Proxmox Backup Server so that even a storage failure has a fallback beyond the last snapshot.