Proxmox Disaster Recovery Drill for Your Homelab

Run a full Proxmox homelab disaster recovery drill: restore VMs and LXCs from Proxmox Backup Server, rebuild a failed ZFS pool, and get your actual RTO numbers.

Proxmox Pulse

Most homelabs have a backup schedule configured. Far fewer have actually tested restoring from those backups when something breaks. I learned this the hard way when a ZFS pool degraded during a BIOS update and I discovered my PBS replication job had been silently failing for three weeks. Running a quarterly DR drill is what separates "I think my backups work" from actually knowing they do. This guide walks you through a complete recovery exercise using Proxmox VE 9.1 and Proxmox Backup Server 3.3: restoring a VM and an LXC from PBS, replacing a failed ZFS disk, and timing each step so you have a real recovery time objective you can defend.

Key Takeaways

  • Test restores, not just backups: A backup you have never restored from is not a backup you actually have.
  • The restore CLI is scriptable: qmrestore and pct restore give you repeatable, automatable recovery steps that clicking through the web UI does not.
  • ZFS resilver time is your real constraint: A 4 TB vdev on spinning rust takes 6–12 hours to resilver, and that number belongs in your RTO calculation.
  • NVMe-to-NVMe is fast: Expect under 3 minutes to restore a 20 GB VM from local PBS on NVMe storage.
  • Run the drill destructively: Mentally walking through steps teaches you nothing. Delete the VM and time the restore.

What You Need Before You Start

This drill requires:

  • Proxmox VE 9.1 host with SSH access
  • Proxmox Backup Server 3.3 configured as a storage backend in Proxmox
  • At least one complete VM backup and one LXC backup stored in PBS
  • A non-critical test VM (VMID 105 throughout this guide) and test container (CTID 200) you can destroy safely
  • A test ZFS pool, or room for a few sparse image files to build one, for the disk failure exercise

If you are still configuring your backup schedule, read through Automated Backups with Proxmox Backup Server first — this guide picks up where that one ends. If you are building out a larger homelab that this DR plan needs to cover, Build a Private Cloud at Home with Proxmox VE covers the infrastructure groundwork.

One mindset rule before you begin: the drill only counts if it is realistic. Destroying a test VM and timing the restore teaches you something concrete. Skipping the destruction and pretending teaches you nothing useful.

Part 1: How to Restore a VM from PBS

Verify the Backup Exists and Is Intact

Before destroying anything, confirm the backup is actually there and verified. The repository string takes the form user@realm@host:datastore:

proxmox-backup-client snapshot list --repository pbs-user@pbs@pbs-host:datastore-name

Or open the PBS web UI at https://pbs-host:8007, navigate to the datastore, and confirm the VM snapshot has a green verify checkmark. A snapshot without a recent verify result is a snapshot of unknown integrity.

Also confirm the PBS storage backend is reachable from the Proxmox host:

pvesm status

Look for your PBS storage ID in the output. If it shows inactive, fix the connectivity issue before proceeding — you cannot restore from an unreachable datastore.
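
If you script the drill, the same check can be made without reading the table by eye. This is a minimal sketch, assuming pvesm status prints the storage name in the first column and its status in the third; storage_is_active is a hypothetical helper name, not a Proxmox command.

```shell
#!/bin/sh
# storage_is_active ID
# Reads `pvesm status` output on stdin and succeeds only if a row named ID
# shows "active" in the Status column.
# Assumption: column 1 is Name and column 3 is Status.
storage_is_active() {
    awk -v id="$1" '$1 == id && $3 == "active" { ok = 1 } END { exit ok ? 0 : 1 }'
}

# Usage on the Proxmox host (storage ID "pbs" as used in this guide):
#   pvesm status | storage_is_active pbs || echo "PBS storage is not active" >&2
```

Failing early here is cheaper than discovering the problem after the test VM has been destroyed.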

Create the Failure

Stop the VM and destroy it, including its disks:

qm stop 105
qm destroy 105 --destroy-unreferenced-disks 1

Verify it is gone:

qm list

VMID 105 should no longer appear. The clock starts now.

Restore via qmrestore

qmrestore pbs:backup/vm/105/2026-05-10T03:00:00Z 105 \
  --storage local-zfs \
  --unique 0

Breaking this down:

  • pbs: is the storage ID you configured for PBS in Proxmox; confirm the exact name with pvesm status
  • backup/vm/105/2026-05-10T03:00:00Z is the snapshot path; get the exact timestamp from the PBS UI, proxmox-backup-client snapshot list, or pvesm list pbs --vmid 105
  • --storage local-zfs is where to restore the disk image; it must match an existing, writable storage pool
  • --unique 0 keeps the MAC addresses stored in the backup (this is also the default), which is critical for a genuine recovery so the VM gets its original IP. The VMID itself comes from the positional argument, 105 here
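
Rather than copying the timestamp by hand, a script can pick the newest snapshot for a guest. This is a sketch, assuming pvesm list prints one volume ID per row in the first column and that snapshot timestamps are ISO 8601 (so plain text sorting orders them by time); latest_backup_volid is a hypothetical helper name.

```shell
#!/bin/sh
# latest_backup_volid TYPE ID
# Reads `pvesm list <storage>` output on stdin and prints the newest backup
# volume ID for the given guest. TYPE is "vm" or "ct".
# Assumption: column 1 holds volume IDs like pbs:backup/vm/105/<timestamp>.
latest_backup_volid() {
    awk -v pat="backup/$1/$2/" 'index($1, pat) { print $1 }' \
        | LC_ALL=C sort \
        | tail -n 1
}

# Usage on the Proxmox host:
#   volid=$(pvesm list pbs --vmid 105 | latest_backup_volid vm 105)
#   qmrestore "$volid" 105 --storage local-zfs --unique 0
```

The trailing slash in the match pattern keeps VMID 105 from matching 1050.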

For a 20 GB Ubuntu Server VM on NVMe-to-NVMe with local PBS, this typically completes in 2–3 minutes. On HDD-to-HDD expect 8–12 minutes for the same image size.

Boot the VM and verify:

qm start 105
qm status 105

SSH in or check the noVNC console. Time from deletion to working login is your VM restore RTO. Write the number down.
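
A stopwatch works, but wrapping the commands in a small timer makes the numbers repeatable from drill to drill. The sketch below uses two hypothetical helpers, timed and fmt_duration. It measures only the wrapped command, so the time to reach a working login still has to be added by hand.

```shell
#!/bin/sh
# fmt_duration SECONDS
# Prints a whole number of seconds as "M min S sec".
fmt_duration() {
    printf '%d min %d sec\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# timed LABEL CMD...
# Runs CMD, reports its wall-clock time on stderr, and returns CMD's status.
timed() {
    timed_label=$1; shift
    timed_start=$(date +%s)
    "$@"
    timed_rc=$?
    timed_end=$(date +%s)
    echo "$timed_label: $(fmt_duration $(( timed_end - timed_start )))" >&2
    return "$timed_rc"
}

# Usage on the Proxmox host:
#   timed "VM 105 restore" qmrestore pbs:backup/vm/105/2026-05-10T03:00:00Z 105 \
#       --storage local-zfs --unique 0
```

Appending stderr to a log file gives you a history of restore times to compare across quarters.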

The --unique Flag Gotcha

If you restore with --unique 1, Proxmox assigns new random MAC addresses to the VM's network interfaces. The VMID is unaffected (it is whatever you pass as the second argument), but the VM will pull a different IP from DHCP and lose its original network identity. That behavior is correct for cloning; it is wrong for disaster recovery. Always use --unique 0 for real recovery scenarios.
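
One way to confirm the restored VM kept its network identity is to compare MAC addresses before and after. This is a sketch, assuming the guest config prints MACs in the usual colon-separated form (as in a net0: virtio=... line); list_macs is a hypothetical helper name and the file paths in the usage lines are placeholders.

```shell
#!/bin/sh
# list_macs
# Reads a guest config on stdin and prints every MAC address found,
# upper-cased and sorted, one per line.
list_macs() {
    grep -oE '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' \
        | tr 'a-f' 'A-F' \
        | LC_ALL=C sort
}

# Usage on the Proxmox host:
#   qm config 105 | list_macs > /root/vm105.macs        # before qm destroy
#   qm config 105 | list_macs | diff - /root/vm105.macs  # after qmrestore
```

An empty diff means every interface came back with the address it had before the drill.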

If PBS is using a self-signed certificate and the restore fails with a fingerprint error:

pvesm set pbs --fingerprint AA:BB:CC:DD:EE:...

Get the fingerprint from the PBS web UI (Dashboard → Show Fingerprint) or by running proxmox-backup-manager cert info on the PBS host.

Part 2: How to Restore an LXC Container

The process mirrors VM restore but uses pct.

Destroy the test container:

pct stop 200
pct destroy 200

Restore it:

pct restore 200 pbs:backup/ct/200/2026-05-10T03:00:00Z \
  --storage local-zfs \
  --unprivileged 1

The --unprivileged 1 flag must match the original container's mode. Restoring an unprivileged container as privileged (or vice versa) leaves file ownership out of step with the container's UID mapping and causes filesystem permission failures. Record the mode before you run the destroy step above: take a screenshot of the container's configuration in the UI or copy the config:

cat /etc/pve/lxc/200.conf
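
If you saved a copy of that config, the correct flag can be derived from it instead of remembered. This is a sketch, assuming the config carries an unprivileged: 1 line for unprivileged containers and no such line for privileged ones; unprivileged_flag is a hypothetical helper name.

```shell
#!/bin/sh
# unprivileged_flag
# Reads a saved container config on stdin and prints the matching
# --unprivileged option for pct restore.
# Only the main section is inspected; parsing stops at the first
# "[snapshot]" style section header.
unprivileged_flag() {
    awk '
        /^\[/ { exit }
        /^unprivileged:[ \t]*1[ \t]*$/ { found = 1 }
        END { print "--unprivileged " (found ? 1 : 0) }
    '
}

# Usage, with a copy of the config taken before the container was destroyed:
#   cp /etc/pve/lxc/200.conf /root/200.conf.bak
#   unprivileged_flag < /root/200.conf.bak
```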

For a 5 GB Debian LXC, restore time is typically under 60 seconds on local NVMe. Start the container and confirm it is healthy:

pct start 200
pct exec 200 -- systemctl is-system-running

A result of running or degraded means the container is up. degraded typically means a non-critical service such as user@1000.service failed — harmless inside containers and not a sign of a failed restore.
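
Immediately after pct start the container may still be booting, so a scripted drill should poll rather than check once. This is a minimal sketch; wait_until_up is a hypothetical helper that accepts either of the two healthy states described above.

```shell
#!/bin/sh
# wait_until_up TRIES CMD...
# Runs CMD up to TRIES times, one second apart, until it prints "running"
# or "degraded". Prints the state reached and succeeds, or fails on timeout.
wait_until_up() {
    tries=$1; shift
    state=
    while [ "$tries" -gt 0 ]; do
        state=$("$@" 2>/dev/null)
        case $state in
            running|degraded) echo "$state"; return 0 ;;
        esac
        tries=$(( tries - 1 ))
        if [ "$tries" -gt 0 ]; then sleep 1; fi
    done
    echo "timeout (last state: ${state:-unknown})" >&2
    return 1
}

# Usage on the Proxmox host:
#   wait_until_up 30 pct exec 200 -- systemctl is-system-running
```

The output of the command is checked rather than its exit status, because systemctl is-system-running exits non-zero for degraded.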

Part 3: ZFS Pool Disk Replacement Walkthrough

This is the scenario homelab admins fear most: a physical disk fails and takes the pool down. If you are running a mirror or RAIDZ, you can recover without data loss. Here is the drill using sparse image files as stand-in disks, so no real disks are at risk.

Simulate a Degraded Pool

# Create a mirror pool from two 4 GB sparse image files (file-backed vdevs)
truncate -s 4G /tmp/disk1.img /tmp/disk2.img
zpool create testpool mirror /tmp/disk1.img /tmp/disk2.img
zpool status testpool

You should see state: ONLINE with both vdevs listed. Now simulate a disk failure:

zpool offline testpool /tmp/disk1.img
zpool status testpool

The output now shows state: DEGRADED. In a real failure, the missing disk shows UNAVAIL or FAULTED without you doing anything — the drive just stops responding.

Replace the Failed Disk

In a real scenario you would have powered down, swapped the bad drive for a new one, and identified it with:

ls -la /dev/disk/by-id/ | grep -v part

Always use /dev/disk/by-id/ paths for zpool replace, never /dev/sdX — device names shift after reboots when drives are added or removed.

For the drill with image files:

truncate -s 4G /tmp/disk1-new.img
zpool replace testpool /tmp/disk1.img /tmp/disk1-new.img

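Watch the resilver. On an empty test pool it completes almost instantly, so copy some data into /testpool before the replace if you want to see it progress: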

watch -n 5 zpool status testpool

Typical output during resilver (the exact layout varies by OpenZFS version):

  scan: resilver in progress since Tue May 12 09:15:00 2026
        1.80G scanned at 450M/s, 900M issued at 225M/s, 4.00G total
        900M resilvered, 21.97% done, 00:00:14 to go

For a 1 TB NVMe mirror, resilver takes around 15–20 minutes. For a 4 TB HDD mirror, budget 6–12 hours depending on I/O load.
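
The arithmetic behind those budgets is remaining data divided by sustained rate. The sketch below uses a hypothetical helper, resilver_eta_seconds, with integer MiB values. Keep in mind that a resilver only copies allocated data, so the total is what the pool actually holds, not the raw disk size.

```shell
#!/bin/sh
# resilver_eta_seconds TOTAL_MIB ISSUED_MIB RATE_MIB_PER_S
# Prints the estimated seconds remaining, rounded down.
resilver_eta_seconds() {
    echo $(( ($1 - $2) / $3 ))
}

# The sample status output above: 4.00G total, 900M issued at 225M/s.
resilver_eta_seconds 4096 900 225

# A full 4 TiB mirror at an assumed average of 130 MiB/s (hypothetical rate):
resilver_eta_seconds 4194304 0 130
```

The second figure comes to just under nine hours, which is why a nearly full HDD mirror dominates the RTO while an NVMe mirror barely registers.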

Two things that matter during a real resilver:

  • Reduce VM I/O load on the host. Heavy guest disk activity during resilver can slow it by 3–4x.
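  • On Proxmox VE 9 with ZFS 2.3, you can give the resilver a larger share of I/O time by raising the zfs_resilver_min_time_ms module parameter (default 3000 ms):
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

This raises the minimum time ZFS spends on resilver I/O in each transaction group, trading slightly higher host I/O pressure for a faster completion time. The setting applies to every pool on the host and resets at reboot. Worth doing during a real failure event.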

Clean up the drill environment:

zpool destroy testpool
rm /tmp/disk1.img /tmp/disk2.img /tmp/disk1-new.img

Calculating Your Real RTO

With the drill complete, you have real numbers. Here is a sample RTO table from my Intel N100 homelab node with NVMe storage and local PBS on the same LAN:

Scenario                                                 Observed Time
20 GB VM restore, NVMe-to-NVMe, local PBS                2 min 40 sec
5 GB LXC restore, NVMe-to-NVMe, local PBS                48 sec
20 GB VM restore, HDD-to-HDD, local PBS                  11 min 20 sec
20 GB VM restore, NVMe-to-NVMe, remote PBS over 1 Gbps   6 min 10 sec
1 TB ZFS resilver, NVMe mirror                           18 min
4 TB ZFS resilver, HDD mirror                            ~9 hours

Your numbers will differ. The point is that you now have your numbers, not theoretical estimates from a vendor datasheet. If your VM restore RTO is 45 minutes because PBS is sitting behind a slow VPN, that is a design problem to fix before an incident, not during one.
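
Measured times convert into an effective throughput, which gives a rough way to extrapolate to guests you did not test. This is a sketch with two hypothetical helpers. It treats the sizes as GiB and assumes restore time scales linearly with size, which is only approximately true because compression and sparse disk regions change how much data actually moves.

```shell
#!/bin/sh
# effective_mib_s SIZE_GIB SECONDS
# Prints the effective restore throughput in MiB/s, rounded down.
effective_mib_s() {
    echo $(( $1 * 1024 / $2 ))
}

# estimate_seconds SIZE_GIB RATE_MIB_PER_S
# Prints the estimated restore time in seconds at the given throughput.
estimate_seconds() {
    echo $(( $1 * 1024 / $2 ))
}

# 20 GiB in 2 min 40 sec (160 s), the local NVMe figure from the sample numbers:
effective_mib_s 20 160

# A 100 GiB VM at that same rate:
estimate_seconds 100 128
```

Treat the estimate as a planning figure and replace it with a measured number the next time you run the drill on that guest.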

Gotchas I Have Actually Hit

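PBS storage config lost after a node rebuild. If the Proxmox host itself is destroyed and rebuilt from scratch, the PBS storage backend configuration in /etc/pve/storage.cfg is gone. Keep a copy of that file in a git repo or on a USB drive. It is a small file with a catastrophic impact when missing. The PBS password and any encryption key are stored separately under /etc/pve/priv/storage/ (pbs.pw and pbs.enc for a storage ID of pbs); without the encryption key, encrypted backups cannot be restored at all, so keep a copy of it somewhere safe as well.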
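
Keeping that copy current is easy to script. This is a minimal sketch; snapshot_file is a hypothetical helper and the destination directory in the usage line is a placeholder for wherever your off-host copy lives.

```shell
#!/bin/sh
# snapshot_file SRC DEST_DIR
# Copies SRC into DEST_DIR with a UTC timestamp appended to the file name
# and prints the path of the copy.
snapshot_file() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src").$(date -u +%Y%m%dT%H%M%SZ)"
    cp -p "$src" "$dest" && echo "$dest"
}

# Usage on the Proxmox host (destination path is a placeholder):
#   snapshot_file /etc/pve/storage.cfg /mnt/usb/pve-config
```

Timestamped copies also give you a record of when a storage definition changed, which helps when a restore fails against a datastore that was renamed.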

VMID conflict on restore. If you forgot to fully destroy the original VM before restoring, qmrestore exits with a conflict error. Always run qm list before restoring to confirm the VMID is free.
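
VMs and containers share one ID space, so a scripted pre-check should look at both lists. This is a sketch, assuming qm list and pct list print the guest ID in the first column; vmid_is_free is a hypothetical helper, and it only sees guests on the local node.

```shell
#!/bin/sh
# vmid_is_free ID
# Reads `qm list` and/or `pct list` output on stdin and succeeds only if
# no row has ID in its first column.
vmid_is_free() {
    awk -v id="$1" '$1 == id { used = 1 } END { exit used ? 1 : 0 }'
}

# Usage on the Proxmox host:
#   { qm list; pct list; } | vmid_is_free 105 || echo "VMID 105 is still in use" >&2
```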

Snapshot lock files. Occasionally a failed restore leaves a lock behind. If you see VM 105 is locked (backup) when trying to start the VM:

qm unlock 105

ZFS pool import after a full node loss. If the entire Proxmox host is gone and you are restoring onto new hardware, you first need to import the pool before PBS or Proxmox can see it:

zpool import -f your-pool-name

The -f flag tells ZFS to import a pool that was last active on a different system ID, which is exactly what a hardware replacement looks like.

Automating Partial DR Verification

Running a full manual drill quarterly is realistic. For the weeks between, automate what you can. PBS has a built-in verify job that checks chunk integrity:

proxmox-backup-manager verify-job create weekly-verify \
  --store your-datastore \
  --schedule "weekly" \
  --ignore-verified 1 \
  --outdated-after 7

Here weekly-verify is an example job ID. The job runs weekly, skips snapshots that were verified within the last 7 days, and re-verifies the rest, catching silent corruption before you need the data. It is not a restore test: it confirms chunk hashes, not that the VM actually boots. But it catches the most common failure mode, which is quietly corrupted backup data.

For more automated testing, I run a cron job that restores a minimal "canary" Debian LXC to a scratch CTID weekly, pings its loopback, and destroys it. If the ping fails, an alert fires. Pair that pattern with the approach described in Automate Proxmox VE with Ansible Full VM Playbooks and the entire canary test becomes a single Ansible play you can trigger on demand.
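
The canary pattern can be sketched in a few lines of shell. This is not the author's actual cron job. It is an illustration under stated assumptions: canary_check is a hypothetical function, the scratch CTID and volume ID in the usage line are placeholders, and the container image is assumed to ship a ping binary.

```shell
#!/bin/sh
# canary_check CTID VOLID STORAGE
# Restores VOLID to the scratch container CTID, starts it, pings loopback
# from inside, then stops and destroys it. Fails if any of the restore,
# start, or ping steps fail.
canary_check() {
    ctid=$1
    volid=$2
    storage=$3

    # --unique 1 gives the scratch copy fresh MAC addresses so it cannot
    # collide with the original container if that one is running.
    pct restore "$ctid" "$volid" --storage "$storage" --unique 1 || return 1

    rc=0
    pct start "$ctid" \
        && pct exec "$ctid" -- ping -c 1 127.0.0.1 >/dev/null \
        || rc=1

    # Always clean up, even when the check failed.
    pct stop "$ctid" 2>/dev/null
    pct destroy "$ctid"
    return "$rc"
}

# Usage from cron on the Proxmox host (CTID and snapshot are placeholders):
#   canary_check 9999 pbs:backup/ct/200/2026-05-10T03:00:00Z local-zfs \
#       || echo "canary restore failed" >&2
```

Replace the final echo with whatever alerting hook you already use. Cleanup runs unconditionally so a failed check does not leave the scratch CTID occupied for the next run.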

Conclusion

After running this drill you will have real RTO numbers, a tested recovery procedure, and at least one surprise about your backup configuration you did not know before you started. The drill itself takes 30–60 minutes. That investment pays for itself the first time something actually breaks at 2 AM. The logical next step is ensuring those backups survive a physical site loss: configure a PBS sync job to replicate the datastore to an offsite node so your playbook still works even if the primary hardware is gone entirely, not just individual VMs.
