Setting Up ZFS on Proxmox: Pools, Datasets, and Best Practices

Complete guide to setting up ZFS storage on Proxmox VE. Covers pool creation, mirror and RAIDZ configurations, datasets, snapshots, and performance tuning.


Why ZFS on Proxmox

Proxmox ships with ZFS support out of the box. No additional kernel modules, no DKMS builds that break on every kernel update, no third-party repos. You can select ZFS as your root filesystem during installation, and it just works. After years of running ext4-on-hardware-RAID and dealing with silent bit rot, failed rebuilds, and RAID controllers that cost more than the drives they manage, ZFS feels like the storage system that Linux always deserved.

The killer features for a hypervisor are atomic snapshots (instant, consistent VM backups without stopping the guest), built-in checksumming (every block is verified on read, so bit rot gets caught instead of silently corrupting your data), and native replication. I run a two-node Proxmox cluster with hourly ZFS replication between them, and failover takes minutes, not hours.

Pool Types and When to Use Them

ZFS pools are built from vdevs (virtual devices), and the vdev type determines your redundancy and performance characteristics. Here's the practical breakdown with real capacity numbers for 4TB drives.

Mirror (RAID1 equivalent)

Two or more drives with identical copies of the data. You lose 50% of raw capacity but get excellent read performance (reads can be served from any mirror member) and fast resilvering. Rebuild times are proportional to used space, not total drive capacity.

zpool create tank mirror /dev/sdb /dev/sdc

4 x 4TB drives as 2 mirror vdevs:

  • Raw: 16TB
  • Usable: ~7.3TB (2 mirrors, each ~3.64TB usable)
  • Can lose one drive per mirror pair
  • Rebuild time: ~2-3 hours for a 4TB drive

I use mirrors for my primary VM storage. The rebuild speed alone is worth the capacity trade-off. A RAIDZ2 rebuild on large drives can take 16+ hours, and your pool is degraded that entire time.

RAIDZ (RAID5 equivalent)

Single parity distributed across all drives. You can lose one drive. Good capacity efficiency but write performance suffers because every write touches all drives for parity calculation.

zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde

4 x 4TB drives in RAIDZ:

  • Raw: 16TB
  • Usable: ~10.9TB (3 drives of data, 1 of parity)
  • Can lose 1 drive
  • Rebuild time: ~8-16 hours for 4TB drives

RAIDZ2 (RAID6 equivalent)

Double parity. Can lose two drives simultaneously. This is what I recommend for most storage pools with 4+ drives. The capacity overhead is worth the safety margin, especially with large drives where rebuild times are measured in days.

zpool create bulk raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

6 x 4TB drives in RAIDZ2:

  • Raw: 24TB
  • Usable: ~14.5TB (4 drives of data, 2 of parity)
  • Can lose 2 drives simultaneously
  • Rebuild time: ~12-24 hours per drive

Stripe (RAID0 equivalent)

No redundancy. One drive dies, you lose everything. Never use this for anything you care about. That said, I keep a striped pool for scratch space -- temporary build artifacts, ISOs I'm going to delete anyway, that sort of thing.

zpool create scratch /dev/sdb /dev/sdc

2 x 4TB drives striped:

  • Raw: 8TB
  • Usable: ~7.3TB
  • Can lose 0 drives
  • Best sequential throughput
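
The usable numbers quoted for each layout come from two factors: parity overhead and the TB-versus-TiB unit gap. Here's a rough sketch of the math -- `usable_tib` is a hypothetical helper, it assumes 2-way mirrors, and it ignores RAIDZ allocation padding, so real-world numbers run slightly lower:

```shell
# Rough usable capacity for the layouts above. A marketing "4TB" drive is
# 4*10^12 bytes; ZFS tools report TiB (2^40 bytes), hence ~3.64T per drive.
usable_tib() {  # $1 = layout, $2 = drive count, $3 = drive size in TB
  case "$1" in
    mirror) data=$(( $2 / 2 )) ;;   # half the drives hold data (2-way mirrors)
    raidz)  data=$(( $2 - 1 )) ;;   # one drive's worth of parity
    raidz2) data=$(( $2 - 2 )) ;;   # two drives' worth of parity
    stripe) data=$2 ;;              # no redundancy, every drive holds data
  esac
  awk -v n="$data" -v tb="$3" 'BEGIN { printf "%.1f\n", n * tb * 1e12 / 1024^4 }'
}

usable_tib mirror 4 4   # -> 7.3
usable_tib raidz  4 4   # -> 10.9
usable_tib stripe 2 4   # -> 7.3
```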

Combining Vdevs

Real-world pools often combine multiple vdevs. A common pattern is striped mirrors (RAID10 equivalent):

zpool create fast \
  mirror /dev/sdb /dev/sdc \
  mirror /dev/sdd /dev/sde \
  mirror /dev/sdf /dev/sdg

This gives you six drives in three mirror pairs, striped together. You get the write performance of three drives, the read performance of six, and can lose one drive per pair. This is my go-to configuration for VM storage where performance matters.

Creating Pools from the CLI

Identifying Your Drives

First, figure out which drives you're working with. I always use /dev/disk/by-id/ paths instead of /dev/sdX because the sd lettering can change between reboots.

root@proxmox:~# ls -la /dev/disk/by-id/ | grep -v part
lrwxrwxrwx 1 root root  9 Mar  1 08:00 ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HXXXX -> ../../sdb
lrwxrwxrwx 1 root root  9 Mar  1 08:00 ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HYYYY -> ../../sdc
lrwxrwxrwx 1 root root  9 Mar  1 08:00 ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HZZZZ -> ../../sdd
lrwxrwxrwx 1 root root  9 Mar  1 08:00 ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HWWWW -> ../../sde
lrwxrwxrwx 1 root root  9 Mar  1 08:00 ata-Samsung_SSD_870_EVO_500GB_S5YNXXXXXXXX -> ../../sda

Creating a Mirror Pool with Proper Paths

zpool create -o ashift=12 \
  tank mirror \
  /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HXXXX \
  /dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HYYYY

The -o ashift=12 is critical for modern drives. It tells ZFS to use 4K sectors. If you forget this on a 4K-native drive (which is basically all drives made after 2014), performance will be abysmal. ZFS won't let you change ashift after pool creation without destroying and recreating the pool, so get it right the first time.
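
Since ashift is just the base-2 log of the sector size, the value is easy to sanity-check before you bake it in. A small sketch -- `sector_to_ashift` is a hypothetical helper; to see what a drive actually reports, run `lsblk -o NAME,PHY-SEC,LOG-SEC` on the host:

```shell
# ashift = log2(sector size): ashift=9 -> 512B, ashift=12 -> 4096B (4K)
sector_to_ashift() {
  s=$1 a=0
  while [ "$s" -gt 1 ]; do s=$(( s / 2 )); a=$(( a + 1 )); done
  echo "$a"
}
sector_to_ashift 512    # -> 9
sector_to_ashift 4096   # -> 12
```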

Verifying the Pool

root@proxmox:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME                                         STATE     READ WRITE CKSUM
        tank                                         ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HYYYY  ONLINE       0     0     0

errors: No known data errors

root@proxmox:~# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank  3.64T   124K  3.64T        -         -     0%     0%  1.00x    ONLINE  -

Adding ZFS Storage to Proxmox

Once the pool exists, you need to tell Proxmox about it. Create a dataset to hold guest volumes first (keeping them out of the pool root makes snapshots and replication easier to scope), then add the storage from the GUI under Datacenter > Storage > Add > ZFS, or from the CLI:

zfs create tank/data
pvesm add zfspool tank-storage --pool tank/data --content images,rootdir

The --content flag determines what Proxmox will store there. A zfspool storage supports exactly two content types:

  • images -- VM disk images (zvols)
  • rootdir -- LXC container root filesystems (datasets)

The other content types (vztmpl for container templates, iso for ISO images, backup for vzdump backups) are file-based, so for those you add a directory storage pointed at a dataset's mountpoint instead.
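
Behind the scenes, pvesm just writes an entry to /etc/pve/storage.cfg. The fragment below is illustrative of what tank-storage ends up looking like (exact lines vary by Proxmox version):

```
zfspool: tank-storage
        pool tank/data
        content images,rootdir
```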

For backups, I use a separate dataset with different compression settings, added as a directory storage since zfspool storages can't hold vzdump backups:

zfs create tank/backups
pvesm add dir tank-backups --path /tank/backups --content backup

Datasets vs Zvols

This trips people up. ZFS has two types of storage entities, and Proxmox uses both for different purposes.

Datasets

A dataset is a filesystem -- it has a mount point, supports files and directories, and can be accessed like any regular directory. Proxmox uses datasets for LXC container root filesystems.

root@proxmox:~# zfs list -t filesystem
NAME                         USED  AVAIL     REFER  MOUNTPOINT
tank                        1.23T  2.41T      256K  /tank
tank/data                   1.23T  2.41T      256K  /tank/data
tank/data/subvol-101-disk-0 8.32G  2.41T     8.32G  /tank/data/subvol-101-disk-0
tank/data/subvol-102-disk-0 4.17G  2.41T     4.17G  /tank/data/subvol-102-disk-0
tank/backups                 156G  2.41T      156G  /tank/backups

Notice the naming convention: subvol-{CTID}-disk-{N}. Proxmox manages these automatically.

Zvols

A zvol is a block device. It appears as /dev/zvol/tank/data/vm-100-disk-0 and can be presented as a raw disk to a VM. Proxmox uses zvols for VM disk images because VMs need block devices, not filesystems.

root@proxmox:~# zfs list -t volume
NAME                        USED  AVAIL     REFER  MOUNTPOINT
tank/data/vm-100-disk-0    32.4G  2.41T     32.1G  -
tank/data/vm-103-disk-0    64.2G  2.41T     51.8G  -

Zvols don't have mount points (hence the -). They're used through the block device layer.

When to Use Each

You don't get to choose, really -- Proxmox decides based on the guest type:

  • VM -> zvol (always)
  • LXC container -> dataset (always, when using ZFS storage)

But when you're creating datasets manually for non-Proxmox use (file shares, backup targets, application data), always use datasets. Zvols are only useful when something specifically needs a block device.

Compression: Just Enable It

zfs set compression=lz4 tank

Do this on the pool level and every dataset inherits it. LZ4 compression is so fast that it actually improves performance on most workloads -- the CPU compresses data faster than the drives can write uncompressed data. I've never seen a case where LZ4 compression hurt performance.

Check your compression ratio:

root@proxmox:~# zfs get compressratio tank
NAME  PROPERTY       VALUE  SOURCE
tank  compressratio  1.67x  -

1.67x means I'm using 40% less disk space than I would without compression. For VM disks and container filesystems, you'll typically see 1.3x-2.0x. Database-heavy workloads compress even better.
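
That 40% figure falls straight out of the ratio: the saved fraction is 1 - 1/ratio. A one-liner to convert -- `ratio_to_savings` is a hypothetical helper:

```shell
# Convert a ZFS compressratio into "percent of disk space saved"
ratio_to_savings() {
  awk -v r="$1" 'BEGIN { printf "%.0f%%\n", (1 - 1/r) * 100 }'
}
ratio_to_savings 1.67   # -> 40%
ratio_to_savings 1.3    # -> 23%
ratio_to_savings 2.0    # -> 50%
```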

For backup datasets specifically, you might want zstd instead of lz4 -- it compresses better at the cost of more CPU time, which is fine for backups that are write-once-read-rarely:

zfs set compression=zstd tank/backups

ARC Cache Sizing

This is the number one "gotcha" with ZFS on Proxmox. The ARC (Adaptive Replacement Cache) is ZFS's read cache, and by default it will consume up to 50% of system RAM. On a 64GB server, that's 32GB that your VMs can't use.

Check current ARC usage:

root@proxmox:~# cat /proc/spl/kstat/zfs/arcstats | grep "^size"
size                            4    17182638080

That's about 16GB. On my 64GB node, that's too much. I set a hard limit:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296

That's 4GB (4 * 1024^3 bytes). After creating or editing this file:

update-initramfs -u
reboot

My rule of thumb: give the ARC 1-2GB per TB of storage, up to about 25% of total RAM. For a 64GB node with 8TB of storage, 8GB of ARC is generous. The rest of the RAM should go to your VMs and containers.
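
That rule of thumb is easy to turn into the byte value zfs.conf wants. A sketch -- `arc_max_bytes` is a hypothetical helper that takes 1GiB of ARC per TiB of pool, capped at a quarter of RAM:

```shell
# zfs_arc_max per the rule of thumb above: 1GiB of ARC per TiB of pool,
# capped at 25% of system RAM, expressed in bytes for the module option.
arc_max_bytes() {  # $1 = pool size in TiB, $2 = system RAM in GiB
  per_tib=$(( $1 * 1024 * 1024 * 1024 ))
  cap=$(( $2 * 1024 * 1024 * 1024 / 4 ))
  if [ "$per_tib" -lt "$cap" ]; then echo "$per_tib"; else echo "$cap"; fi
}
arc_max_bytes 8 64    # 8TiB pool, 64GiB RAM -> 8589934592 (8GiB)
# echo "options zfs zfs_arc_max=$(arc_max_bytes 8 64)" > /etc/modprobe.d/zfs.conf
```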

You can verify the limit after reboot:

root@proxmox:~# cat /proc/spl/kstat/zfs/arcstats | grep "c_max"
c_max                           4    4294967296

Watch Out for This

I've seen people set zfs_arc_max too low (like 512MB) and then wonder why their ZFS performance is terrible. The ARC needs enough room to cache metadata at minimum. Going below 1GB on any pool of reasonable size will cause performance problems, especially with lots of small files or frequent random reads.

L2ARC and SLOG: When They Help, When They Don't

L2ARC (Level 2 ARC)

The L2ARC extends your read cache to a fast SSD. In theory, this gives you SSD-speed reads for data that doesn't fit in RAM. In practice, the L2ARC has significant overhead: it consumes ~70-80 bytes of RAM per cached block to store the index. A 500GB L2ARC device can consume 4-8GB of RAM just for its index.
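
You can estimate that index cost yourself. A sketch assuming ~70 bytes per cached record -- `l2arc_index_bytes` is a hypothetical helper, and the real driver is your average block size, since small blocks mean many more records:

```shell
# Approximate RAM cost of the L2ARC index (~70 bytes per cached record).
# The average block size determines how many records a device holds.
l2arc_index_bytes() {  # $1 = L2ARC size in GB, $2 = average block size in KB
  echo $(( $1 * 1000000000 / ($2 * 1024) * 70 ))
}
l2arc_index_bytes 500 8     # 8K blocks   -> ~4.3GB of RAM
l2arc_index_bytes 500 128   # 128K blocks -> ~267MB of RAM
```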

# Adding an L2ARC device
zpool add tank cache /dev/disk/by-id/ata-Samsung_SSD_870_EVO_500GB_S5YNXXXXXXXX

When L2ARC helps:

  • Large working sets that exceed RAM (databases, file servers)
  • Sequential read workloads where data is accessed repeatedly
  • You have plenty of spare RAM for the L2ARC index

When L2ARC is a waste:

  • Your working set fits in RAM (the ARC already handles it)
  • Write-heavy workloads (L2ARC only caches reads)
  • You're RAM-constrained (the index overhead makes things worse)

For most Proxmox homelab setups with 32-64GB of RAM, the L2ARC isn't worth it. That SSD is better used as a dedicated fast-storage tier for VM disks.

SLOG (Separate Log Device)

The SLOG is for synchronous writes. ZFS uses the ZIL (ZFS Intent Log) to ensure data integrity for sync writes -- it writes to the ZIL first, acknowledges the write to the application, then lazily writes to the pool. By default, the ZIL lives on the pool disks. A SLOG moves the ZIL to a fast device.

# Adding a SLOG device (use a high-endurance SSD or Optane)
zpool add tank log mirror \
  /dev/disk/by-id/nvme-Intel_Optane_XXXX1 \
  /dev/disk/by-id/nvme-Intel_Optane_XXXX2

Always mirror your SLOG. The ZIL is only read back on pool import after a crash, so a SLOG failure alone won't lose data -- but if the SLOG dies and the host crashes before the next transaction group commits, you lose the acknowledged synchronous writes in that window (a few seconds by default). Mirroring closes that window.

When SLOG helps:

  • NFS with sync writes (NFSv3 with sync exports, ESXi NFS datastores)
  • Databases with fsync/O_SYNC heavy workloads (PostgreSQL, MySQL with InnoDB)
  • iSCSI targets

When SLOG does nothing:

  • Async writes (most VM workloads, file copies)
  • When your pool is already on SSDs (the SSDs are fast enough)

Most homelab Proxmox users don't need a SLOG. VMs typically use async writes, and the pool disks handle the ZIL just fine.

Snapshot Management

Snapshots are arguably ZFS's best feature for a hypervisor. They're instant (literally -- a snapshot takes microseconds regardless of dataset size) and space-efficient (they only consume space as the original data changes).

Manual Snapshots

# Snapshot a dataset
zfs snapshot tank/data/subvol-101-disk-0@before-upgrade

# Snapshot a zvol (VM disk)
zfs snapshot tank/data/vm-100-disk-0@pre-kernel-update

# Snapshot everything under a dataset recursively
zfs snapshot -r tank/data@daily-2026-03-01
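
The part after @ is free-form, but date stamps are worth the discipline: they make snapshot names sort chronologically as plain strings, which simple retention tooling can rely on. A sketch -- `snapname` is a hypothetical helper:

```shell
# Date-stamped snapshot names sort chronologically as plain strings.
snapname() {  # $1 = dataset, $2 = prefix, $3 = date stamp
  echo "$1@$2-$3"
}
snapname tank/data daily 2026-03-01   # -> tank/data@daily-2026-03-01
# Live usage: zfs snapshot -r "$(snapname tank/data daily "$(date +%Y-%m-%d)")"
```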

Listing Snapshots

root@proxmox:~# zfs list -t snapshot -o name,used,refer,creation
NAME                                               USED  REFER  CREATION
tank/data/subvol-101-disk-0@before-upgrade        4.12G  8.32G  Sun Mar  1  9:15 2026
tank/data/vm-100-disk-0@pre-kernel-update          892M  32.1G  Sun Mar  1  9:20 2026
tank/data@daily-2026-03-01                            0B   256K  Sun Mar  1  0:00 2026

The USED column shows how much space the snapshot consumes -- this is data that has changed since the snapshot was taken. A fresh snapshot uses 0 bytes.

Rolling Back

# Roll back to a snapshot (destroys all changes since)
zfs rollback tank/data/subvol-101-disk-0@before-upgrade

Fair warning: you can only roll back to the most recent snapshot unless you use -r to destroy intermediate snapshots. This is destructive and irreversible.
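
Rollback isn't the only way back, though. Every dataset exposes its snapshots read-only under a hidden .zfs/snapshot directory, so a single file can be restored with plain cp instead of nuking everything since the snapshot. A sketch -- `snap_path` is a hypothetical helper and the file paths are illustrative:

```shell
# Build the path to a file inside a snapshot's hidden read-only tree.
snap_path() {  # $1 = dataset mountpoint, $2 = snapshot name, $3 = relative file
  echo "$1/.zfs/snapshot/$2/$3"
}
snap_path /tank/data/subvol-101-disk-0 before-upgrade etc/hosts
# -> /tank/data/subvol-101-disk-0/.zfs/snapshot/before-upgrade/etc/hosts
# Restore one file without touching the rest of the dataset:
# cp "$(snap_path /tank/data/subvol-101-disk-0 before-upgrade etc/hosts)" \
#    /tank/data/subvol-101-disk-0/etc/hosts
```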

Automated Snapshot Schedules

I use simple cron jobs -- the zfs-auto-snapshot package automates the same idea, but a custom script is short enough to audit:

# /etc/cron.d/zfs-snapshots
0 * * * * root zfs snapshot -r tank/data@hourly-$(date +\%Y\%m\%d\%H) 2>/dev/null
0 0 * * * root zfs snapshot -r tank/data@daily-$(date +\%Y\%m\%d) 2>/dev/null
0 0 * * 0 root zfs snapshot -r tank/data@weekly-$(date +\%Y\%m\%d) 2>/dev/null

And a cleanup job to prevent snapshot accumulation:

# /etc/cron.d/zfs-snapshot-cleanup
30 * * * * root /usr/local/bin/zfs-prune-snapshots.sh

And the script itself. Note the -s creation sort (oldest first), the -d 1 depth limit (list only snapshots of tank/data itself, not every child dataset mixed together), and zfs destroy -r (remove the same-named snapshots on child datasets that the recursive snapshot created):

#!/bin/bash
# /usr/local/bin/zfs-prune-snapshots.sh
# Keep 24 hourly, 30 daily, 12 weekly snapshots

# Delete hourly snapshots beyond the newest 24
zfs list -H -t snapshot -o name -s creation -d 1 tank/data \
  | grep "@hourly-" | head -n -24 | xargs -r -n1 zfs destroy -r

# Delete daily snapshots beyond the newest 30
zfs list -H -t snapshot -o name -s creation -d 1 tank/data \
  | grep "@daily-" | head -n -30 | xargs -r -n1 zfs destroy -r

# Delete weekly snapshots beyond the newest 12
zfs list -H -t snapshot -o name -s creation -d 1 tank/data \
  | grep "@weekly-" | head -n -12 | xargs -r -n1 zfs destroy -r
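
The keep-N logic hinges on head -n -N (GNU coreutils: print all but the last N lines), so you can dry-run the retention policy against a fake snapshot list before pointing it at real data:

```shell
# Three fake hourly snapshots with a keep-2 policy: only the oldest is
# printed, i.e. only the oldest would be fed to zfs destroy.
printf 'tank/data@hourly-%s\n' 2026030100 2026030101 2026030102 \
  | head -n -2
# -> tank/data@hourly-2026030100
```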

Scrub Schedules

Scrubs are ZFS's way of proactively checking data integrity. A scrub reads every block in the pool, verifies checksums, and repairs any corruption from redundant copies. Without scrubs, bit rot sits there silently until you try to read the corrupted block.

Proxmox sets up a monthly scrub by default via a systemd timer:

root@proxmox:~# systemctl status zfs-scrub-monthly@tank.timer
     Loaded: loaded (/lib/systemd/system/zfs-scrub-monthly@.timer; enabled)
     Active: active (waiting)
    Trigger: Sun 2026-04-05 00:00:00 UTC

For spinning drives, I run scrubs monthly. For SSDs, every two weeks is fine since scrubs complete much faster. You can trigger a scrub manually:

zpool scrub tank

And monitor its progress:

root@proxmox:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Sun Mar  1 00:00:01 2026
        2.31T scanned at 456M/s, 1.87T issued at 369M/s
        0B repaired, 80.95% done, 00:20:34 to go
config:

        NAME                                         STATE     READ WRITE CKSUM
        tank                                         ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HYYYY  ONLINE       0     0     0

errors: No known data errors

The 0B repaired is what you want to see. If it shows data being repaired, your drives are developing problems and you should plan replacements.

Monitoring Pool Health

Beyond scrubs, keep an eye on your pool's overall health. Here's what I check weekly:

# Pool status -- should say ONLINE with no errors
zpool status -v

# Space usage -- watch for pools above 80% full
zpool list

# I/O stats -- useful for diagnosing performance issues
zpool iostat -v tank 5 3
                                           capacity     operations     bandwidth
pool                                     alloc   free   read  write   read  write
---------------------------------------  -----  -----  -----  -----  -----  -----
tank                                     1.23T  2.41T     45     82  5.62M  12.3M
  mirror-0                               1.23T  2.41T     45     82  5.62M  12.3M
    ata-WDC_WD40EFAX-68JH4N1_WD-WXH..        -      -     23     41  2.81M  6.15M
    ata-WDC_WD40EFAX-68JH4N1_WD-WXH..        -      -     22     41  2.81M  6.15M

Critical Alert: Pool Capacity

ZFS performance degrades significantly above 80% capacity due to how copy-on-write allocates blocks. At 90%, you'll notice real slowdowns. At 95%, you're in emergency territory. I've seen pools at 98% that took 10x longer for basic operations.

Set up a simple monitoring script:

#!/bin/bash
# /usr/local/bin/zfs-capacity-check.sh
THRESHOLD=80

zpool list -H -o name,capacity | while read pool cap; do
  cap_num=${cap%\%}
  if [ "$cap_num" -gt "$THRESHOLD" ]; then
    echo "WARNING: Pool $pool is at ${cap} capacity" | \
      mail -s "ZFS Pool Warning: $pool" admin@example.com
  fi
done
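
The only mildly cryptic part is ${cap%\%}, which strips the trailing percent sign so the shell can compare numbers. You can exercise the parsing with a simulated zpool list line (the tank/87% values are made up):

```shell
# Feed the loop a fake `zpool list -H -o name,capacity` line to verify
# the percent-sign stripping and the threshold comparison.
printf 'tank\t87%%\n' | while read -r pool cap; do
  cap_num=${cap%\%}              # "87%" -> "87"
  if [ "$cap_num" -gt 80 ]; then
    echo "WARNING: Pool $pool is at ${cap} capacity"
  fi
done
# -> WARNING: Pool tank is at 87% capacity
```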

Replacing a Failed Drive

When a drive fails (not if -- when), ZFS makes the replacement process straightforward:

# Identify the failed drive
root@proxmox:~# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.

        NAME                                         STATE     READ WRITE CKSUM
        tank                                         DEGRADED     0     0     0
          mirror-0                                   DEGRADED     0     0     0
            ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HXXXX  ONLINE       0     0     0
            ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HYYYY  UNAVAIL      0     0     0

# Replace with a new drive
zpool replace tank \
  ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HYYYY \
  ata-WDC_WD40EFAX-68JH4N1_WD-WX72D90HNEWW

# Monitor resilver progress
zpool status tank

The resilver (rebuild) will run in the background. Your pool stays online and functional the entire time -- it's just degraded until the resilver completes. Don't panic. Don't reboot. Just let it finish.

Where to Go From Here

Once your pools are set up and healthy, the next steps are configuring Proxmox's backup system (vzdump) to use your ZFS storage, setting up ZFS replication between nodes for disaster recovery, and tuning recordsize for specific workloads (the 128K default is fine for general datasets, but databases benefit from 8K-16K matching their page size; note that VM zvols use volblocksize instead, which is fixed at creation time).

ZFS rewards you for investing time in understanding it. The pool I set up on my first Proxmox node three years ago is still running -- it's survived two drive replacements, a motherboard swap, and a Proxmox major version upgrade from 7.x to 8.x. The data has been silently checksummed, compressed, and snapshotted the entire time. That kind of reliability is why I won't run a hypervisor without it.


Written by

Proxmox Pulse

Sysadmin-driven guides for getting the most out of Proxmox VE in production and homelab environments.
