Proxmox Disaster Recovery Drill for Your Homelab
Run a full Proxmox homelab disaster recovery drill: restore VMs and LXCs from Proxmox Backup Server, rebuild a failed ZFS pool, and get your actual RTO numbers.
Most homelabs have a backup schedule configured. Far fewer have actually tested restoring from those backups when something breaks. I learned this the hard way when a ZFS pool degraded during a BIOS update and I discovered my PBS replication job had been silently failing for three weeks. Running a quarterly DR drill is what separates "I think my backups work" from actually knowing they do. This guide walks you through a complete recovery exercise using Proxmox VE 9.1 and Proxmox Backup Server 3.3: restoring a VM and an LXC from PBS, replacing a failed ZFS disk, and timing each step so you have a real recovery time objective you can defend.
Key Takeaways
- Test restores, not just backups: A backup you have never restored from is not a backup you actually have.
- PBS CLI is scriptable: qmrestore and pct restore give you repeatable, automatable recovery steps that the web UI cannot.
- ZFS resilver time is your real constraint: A 4 TB vdev on spinning rust takes 6–12 hours to resilver, and that number belongs in your RTO calculation.
- NVMe-to-NVMe is fast: Expect under 3 minutes to restore a 20 GB VM from local PBS on NVMe storage.
- Run the drill destructively: Mentally walking through steps teaches you nothing. Delete the VM and time the restore.
What You Need Before You Start
This drill requires:
- Proxmox VE 9.1 host with SSH access
- Proxmox Backup Server 3.3 configured as a storage backend in Proxmox
- At least one complete VM backup and one LXC backup stored in PBS
- A non-critical test VM (VMID 105 throughout this guide) and test container (CTID 200) you can destroy safely
- A test ZFS pool or spare loop-device environment for the disk failure exercise
If you are still configuring your backup schedule, read through Automated Backups with Proxmox Backup Server first — this guide picks up where that one ends. If you are building out a larger homelab that this DR plan needs to cover, Build a Private Cloud at Home with Proxmox VE covers the infrastructure groundwork.
One mindset rule before you begin: the drill only counts if it is realistic. Destroying a test VM and timing the restore teaches you something concrete. Skipping the destruction and pretending teaches you nothing useful.
Part 1: How to Restore a VM from PBS
Verify the Backup Exists and Is Intact
Before destroying anything, confirm the backup is actually there and verified:
proxmox-backup-client snapshot list --repository pbs-user@pbs@pbs-host:datastore-name
Or open the PBS web UI at https://pbs-host:8007, navigate to the datastore, and confirm the VM snapshot has a green verify checkmark. A snapshot without a recent verify result is a snapshot of unknown integrity.
Also confirm the PBS storage backend is reachable from the Proxmox host:
pvesm status
Look for your PBS storage ID in the output. If it shows inactive, fix the connectivity issue before proceeding — you cannot restore from an unreachable datastore.
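To make that check scriptable, here is a minimal sketch that reads pvesm status output on stdin and prints the Status column for one storage ID. The function name storage_status is made up for this example, and it assumes the Name / Type / Status column order that pvesm status prints.

```shell
# Sketch: print the Status column for a storage ID from `pvesm status` output.
# Assumes columns are Name, Type, Status, ... ; prints "missing" if the ID is absent.
storage_status() {
    awk -v id="$1" '
        $1 == id { print $3; found = 1; exit }
        END      { if (!found) print "missing" }
    '
}
```

Usage on the Proxmox host would look like pvesm status | storage_status pbs, where pbs is the storage ID used throughout this guide. Anything other than active means the restore should wait.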
Create the Failure
Stop the VM and destroy it, including its disks:
qm stop 105
qm destroy 105 --destroy-unreferenced-disks 1
Verify it is gone:
qm list
VMID 105 should no longer appear. The clock starts now.
Restore via qmrestore
qmrestore pbs:backup/vm/105/2026-05-10T03:00:00Z 105 \
--storage local-zfs \
--unique 0
Breaking this down:
- pbs: is the storage ID you configured for PBS in Proxmox; confirm the exact name with pvesm status
- backup/vm/105/2026-05-10T03:00:00Z is the snapshot path; get the exact timestamp from the PBS UI or proxmox-backup-client snapshot list
- --storage local-zfs is where to restore the disk image; it must match an existing, writable storage pool
- --unique 0 keeps the MAC addresses recorded in the backup, which is critical for a genuine recovery so the VM gets its original IP; the VMID itself comes from the positional argument (105 here)
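If you script restores, it helps to assemble that volume ID from its parts instead of hand-editing a string. A minimal sketch, where pbs_volid is a hypothetical helper name and the layout simply mirrors the path shown above:

```shell
# Sketch: build a PBS volume ID in the form used by qmrestore and pct restore.
# $1 storage ID, $2 guest type (vm or ct), $3 guest ID, $4 snapshot timestamp (UTC)
pbs_volid() {
    printf '%s:backup/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}
```

For example, pbs_volid pbs vm 105 2026-05-10T03:00:00Z prints the same volume ID used in the qmrestore command above.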
For a 20 GB Ubuntu Server VM on NVMe-to-NVMe with local PBS, this typically completes in 2–3 minutes. On HDD-to-HDD expect 8–12 minutes for the same image size.
Boot the VM and verify:
qm start 105
qm status 105
SSH in or check the noVNC console. Time from deletion to working login is your VM restore RTO. Write the number down.
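A stopwatch on your phone works, but a small wrapper makes the timing repeatable. The sketch below is one way to do it; fmt_duration and time_step are names invented for this example, and the wrapper only measures the command it runs, not the time until you can log in.

```shell
# Sketch: time a recovery step and report it as "M min S sec".
fmt_duration() {
    # $1: elapsed whole seconds
    printf '%d min %d sec\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

time_step() {
    # $1: label for the report; remaining arguments: the command to run
    label=$1; shift
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    # Report on stderr so the command's own stdout stays clean
    echo "$label: $(fmt_duration $(( end - start )))" >&2
    return "$rc"
}
```

You would call it as time_step "VM restore" qmrestore pbs:backup/vm/105/2026-05-10T03:00:00Z 105 --storage local-zfs --unique 0, then add the boot-to-login time by hand.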
The --unique Flag Gotcha
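If you restore with --unique 1, Proxmox assigns new random MAC addresses to the VM's network interfaces. The VMID is not affected, because it always comes from the positional argument. With a new MAC, your VM will pull a different IP from DHCP and lose its original network identity. That behavior is correct for cloning and wrong for disaster recovery. Use --unique 0, which is also the default, for real recovery scenarios.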
If PBS is using a self-signed certificate and the restore fails with a fingerprint error:
pvesm set pbs --fingerprint AA:BB:CC:DD:EE:...
Get the fingerprint from the PBS web UI (Dashboard → Show Fingerprint) or by running proxmox-backup-manager cert info on the PBS host.
Part 2: How to Restore an LXC Container
The process mirrors VM restore but uses pct.
Destroy the test container:
pct stop 200
pct destroy 200
Restore it:
pct restore 200 pbs:backup/ct/200/2026-05-10T03:00:00Z \
--storage local-zfs \
--unprivileged 1
The --unprivileged 1 flag must match the original container's mode. Restoring an unprivileged container as privileged (or vice versa) will corrupt UID mappings inside the container and cause filesystem permission failures. Before destroying the original, take a screenshot of its configuration in the UI or copy the config:
cat /etc/pve/lxc/200.conf
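If you saved a copy of that config before destroying the container, you can read the flag back instead of relying on memory. A minimal sketch, assuming the usual key: value layout of the container config; ct_unprivileged_flag is a hypothetical helper name, and it stops at the first [snapshot] section so only the live configuration is considered.

```shell
# Sketch: print 1 or 0 depending on the "unprivileged:" line in a container config on stdin.
# A missing line is treated as 0 (privileged).
ct_unprivileged_flag() {
    awk '
        /^\[/            { exit }                          # snapshot sections start here; stop
        /^unprivileged:/ { print $2; found = 1; exit }
        END              { if (!found) print 0 }
    '
}
```

With a saved copy at, say, /root/200.conf.bak (a path chosen for illustration), the restore could pass --unprivileged "$(ct_unprivileged_flag < /root/200.conf.bak)".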
For a 5 GB Debian LXC, restore time is typically under 60 seconds on local NVMe. Start the container and confirm it is healthy:
pct start 200
pct exec 200 -- systemctl is-system-running
A result of running or degraded means the container is up. degraded typically means a non-critical service such as user@1000.service failed — harmless inside containers and not a sign of a failed restore.
Part 3: ZFS Pool Disk Replacement Walkthrough
This is the scenario homelab admins fear most: a physical disk fails and degrades the pool. If you are running a mirror or RAIDZ, you can recover without data loss. Here is the drill using sparse image files as stand-in disks, so no real disks are at risk.
Simulate a Degraded Pool
# Create a 4 GB mirror pool backed by two sparse image files (run as root)
truncate -s 4G /tmp/disk1.img /tmp/disk2.img
zpool create testpool mirror /tmp/disk1.img /tmp/disk2.img
zpool status testpool
You should see state: ONLINE with both vdevs listed. Now simulate a disk failure:
zpool offline testpool /tmp/disk1.img
zpool status testpool
The output now shows state: DEGRADED. In a real failure, the missing disk shows UNAVAIL or FAULTED without you doing anything — the drive just stops responding.
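For a scripted drill, or a monitoring check, you can pull the state out of that output rather than reading it by eye. A minimal sketch that parses zpool status text on stdin; pool_state is a name made up for this example.

```shell
# Sketch: print the value of the first "state:" line from `zpool status` output.
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}
```

Run it as zpool status testpool | pool_state. On a live system, zpool list -H -o health testpool reports the same value without any parsing.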
Replace the Failed Disk
In a real scenario you would have powered down, swapped the bad drive for a new one, and identified it with:
ls -la /dev/disk/by-id/ | grep -v part
Always use /dev/disk/by-id/ paths for zpool replace, never /dev/sdX — device names shift after reboots when drives are added or removed.
For the drill with image files:
truncate -s 4G /tmp/disk1-new.img
zpool replace testpool /tmp/disk1.img /tmp/disk1-new.img
Watch the resilver:
watch -n 5 zpool status testpool
Typical output during resilver:
scan: resilver in progress since Tue May 12 09:15:00 2026
1.80G scanned at 450M/s, 900M issued at 225M/s, 4.00G total
900M resilvered, 21.97% done, 00:00:14 to go
For a 1 TB NVMe mirror, resilver takes around 15–20 minutes. For a 4 TB HDD mirror, budget 6–12 hours depending on I/O load.
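Those figures follow from simple arithmetic: allocated data divided by sustained throughput. The sketch below does that calculation; estimate_resilver_hours is a hypothetical helper, and the throughput values in the usage note are illustrative assumptions, not measurements. ZFS only resilvers allocated blocks, so pass the used space, not the raw disk size.

```shell
# Sketch: rough resilver duration in hours.
# $1: allocated data in GB (decimal), $2: sustained throughput in MB/s
estimate_resilver_hours() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", (gb * 1000) / mbps / 3600 }'
}
```

A full 4 TB HDD mirror at an assumed 125 MB/s gives estimate_resilver_hours 4000 125, which prints 8.9. A full 1 TB NVMe mirror at an assumed 1000 MB/s prints 0.3, or roughly 17 minutes.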
Two things that matter during a real resilver:
- Reduce VM I/O load on the host. Heavy guest disk activity during resilver can slow it by 3–4x.
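- On Proxmox VE 9 with ZFS 2.3, you can prioritize the resilver by raising the minimum time ZFS spends on resilver I/O in each transaction group (default 3000 ms):
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
This module parameter is the modern knob; the old zfs_resilver_delay tunable was removed in ZFS 0.8, and there is no resilver_delay dataset or pool property to set. Raising the value trades slightly higher host I/O pressure for a faster completion time, and the change resets at reboot. Worth doing during a real failure event.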
Clean up the drill environment:
zpool destroy testpool
rm /tmp/disk1.img /tmp/disk2.img /tmp/disk1-new.img
Calculating Your Real RTO
With the drill complete, you have real numbers. Here is a sample RTO table from my Intel N100 homelab node with NVMe storage and local PBS on the same LAN:
| Scenario | Observed Time |
|---|---|
| 20 GB VM restore, NVMe-to-NVMe, local PBS | 2 min 40 sec |
| 5 GB LXC restore, NVMe-to-NVMe, local PBS | 48 sec |
| 20 GB VM restore, HDD-to-HDD, local PBS | 11 min 20 sec |
| 20 GB VM restore, NVMe-to-NVMe, remote PBS over 1 Gbps | 6 min 10 sec |
| 1 TB ZFS resilver, NVMe mirror | 18 min |
| 4 TB ZFS resilver, HDD mirror | ~9 hours |
Your numbers will differ. The point is that you now have your numbers, not theoretical estimates from a vendor datasheet. If your VM restore RTO is 45 minutes because PBS is sitting behind a slow VPN, that is a design problem to fix before an incident, not during one.
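One way to make those timings comparable across setups is to convert them into effective throughput. A minimal sketch, where restore_throughput is a hypothetical helper and treating the image size as GiB is an assumption; the result is relative to the provisioned image size, so it can exceed what actually crossed the wire.

```shell
# Sketch: effective restore throughput in MiB/s.
# $1: image size in GiB, $2: restore time in seconds
restore_throughput() {
    awk -v gib="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", (gib * 1024) / secs }'
}
```

The 2 min 40 sec local restore works out to restore_throughput 20 160, which prints 128.0. The 6 min 10 sec remote restore (370 seconds) prints 55.4, well below what a 1 Gbps link can carry, which suggests the network was not the only bottleneck in that run.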
Gotchas I Have Actually Hit
PBS storage config lost after a node rebuild. If the Proxmox host itself is destroyed and rebuilt from scratch, the PBS storage backend configuration in /etc/pve/storage.cfg is gone. Keep a copy of that file in a git repo or on a USB drive. It is a small file with a catastrophic impact when missing.
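Copying the file is trivial to automate. A minimal sketch, assuming you have some destination directory that lives off the host; backup_config is a hypothetical helper name and the destination path in the usage note is a placeholder.

```shell
# Sketch: copy a config file into a directory with a date suffix and print the new path.
# $1: file to copy, $2: destination directory (created if missing)
backup_config() {
    dest="$2/$(basename "$1").$(date +%Y%m%d)"
    mkdir -p "$2" && cp -p "$1" "$dest" && printf '%s\n' "$dest"
}
```

For example, backup_config /etc/pve/storage.cfg /mnt/usb/pve-config keeps one dated copy per day on a mounted USB drive.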
VMID conflict on restore. If you forgot to fully destroy the original VM before restoring, qmrestore exits with a conflict error. Always run qm list before restoring to confirm the VMID is free.
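That pre-check is easy to script. The sketch below reads qm list output on stdin and reports whether a VMID is taken; vmid_status is a name invented for this example, and it assumes the VMID / NAME / STATUS column order that qm list prints.

```shell
# Sketch: print the STATUS column for a VMID from `qm list` output, or "free" if absent.
vmid_status() {
    awk -v id="$1" '
        $1 == id { print $3; found = 1; exit }
        END      { if (!found) print "free" }
    '
}
```

Run qm list | vmid_status 105 and only start the restore when it prints free.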
Snapshot lock files. Occasionally a failed restore leaves a lock behind. If you see VM 105 is locked (backup) when trying to start the VM:
qm unlock 105
ZFS pool import after a full node loss. If the entire Proxmox host is gone and you are restoring onto new hardware, you first need to import the pool before PBS or Proxmox can see it:
zpool import -f your-pool-name
The -f flag tells ZFS to import a pool that was last active on a different system ID, which is exactly what a hardware replacement looks like.
Automating Partial DR Verification
Running a full manual drill quarterly is realistic. For the weeks between, automate what you can. PBS has a built-in verify job that checks chunk integrity:
proxmox-backup-manager verify-job create weekly-verify \
--store your-datastore \
--schedule "weekly" \
--ignore-verified 1 \
--outdated-after 7
Here weekly-verify is an example job ID. Once a week, this job re-verifies any snapshot whose last verification is more than 7 days old, catching silent corruption before you need the data. It is not a restore test: it confirms chunk hashes, not that the VM actually boots. It still catches the most common failure mode, which is quietly corrupted backup data.
For more automated testing, I run a cron job that restores a minimal "canary" Debian LXC to a scratch CTID weekly, pings its loopback, and destroys it. If the ping fails, an alert fires. Pair that pattern with the approach described in Automate Proxmox VE with Ansible Full VM Playbooks and the entire canary test becomes a single Ansible play you can trigger on demand.
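The canary pattern can be sketched in a few lines of shell. This is not the exact cron job described above, just one possible shape for it: canary_check, the run wrapper, the DRY_RUN variable, and the scratch CTID 9200 are all assumptions made for this example, and it expects ping to be installed inside the container.

```shell
# Sketch: restore a canary container to a scratch CTID, ping loopback, clean up.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

canary_check() {
    # $1: scratch CTID, $2: PBS volume ID of the canary backup, $3: target storage
    ctid=$1
    snapshot=$2
    storage=$3
    run pct restore "$ctid" "$snapshot" --storage "$storage" --unprivileged 1 || return 1
    run pct start "$ctid" || return 1
    run pct exec "$ctid" -- ping -c 1 127.0.0.1
    rc=$?
    # Always clean up the scratch container, even if the ping failed
    run pct stop "$ctid"
    run pct destroy "$ctid"
    return "$rc"
}
```

A weekly cron entry could call canary_check 9200 pbs:backup/ct/200/2026-05-10T03:00:00Z local-zfs and fire an alert on a non-zero exit code. Running it once with DRY_RUN=1 shows the five commands it would execute.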
Conclusion
After running this drill you will have real RTO numbers, a tested recovery procedure, and at least one surprise about your backup configuration you did not know before you started. The drill itself takes 30–60 minutes. That investment pays for itself the first time something actually breaks at 2 AM. The logical next step is ensuring those backups survive a physical site loss — configure PBS replication to an offsite node so your playbook still works even if the primary hardware is gone entirely, not just individual VMs.