
ZFS is reliable and rock solid. Here's why, and how to test it by trying to smash it!
1. Understanding ZFS Architecture
1.1. Core Components
ZFS operates through three main layers:
- SPA (Storage Pool Allocator): Manages physical storage and presents unified address space.
- DMU (Data Management Unit): Handles transactions and ensures ACID properties.
- ARC (Adaptive Replacement Cache): Smart caching that adapts to workload patterns.
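A quick way to peek at each layer from the shell (a sketch, assuming a pool named tank; zdb and the kstat paths below are standard on ZFS on Linux / OpenZFS):
# SPA: dump the cached pool configuration (vdev tree, ashift, GUIDs)
zdb -C tank
# DMU: transaction kstats (assigned TXGs, delays, errors)
cat /proc/spl/kstat/zfs/dmu_tx
# ARC: first counters from the ARC kstat
head -20 /proc/spl/kstat/zfs/arcstats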
1.2. ARC Operation
The ARC uses four lists for intelligent caching:
- T1: Recently used blocks (accessed once)
- T2: Frequently used blocks (accessed multiple times)
- B1/B2: Ghost lists tracking evicted entries
Check your current ARC status:
# View ARC statistics
cat /proc/spl/kstat/zfs/arcstats
# Check ARC size limits
echo "ARC max: $(cat /sys/module/zfs/parameters/zfs_arc_max)"
echo "ARC min: $(cat /sys/module/zfs/parameters/zfs_arc_min)"
# arcstat: live ARC/L2ARC monitoring at a 5-second interval
arcstat -f read,hits,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size 5
# Detailed human-readable summary
arc_summary
1.3. L2ARC vs ARC
L2ARC extends ARC to SSDs but operates as a simple FIFO ring buffer, not adaptive replacement. It caches blocks that are about to be evicted from ARC.
# Monitor L2ARC if configured
zpool iostat -v 1
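One hedged way to see whether the L2ARC is actually earning its keep is to compute its hit ratio from arcstats (counter names assume a current OpenZFS release; values are cumulative since module load):
# L2ARC hit ratio from cumulative counters
awk '/^l2_hits / {h=$3} /^l2_misses / {m=$3} END {if (h+m > 0) printf "L2ARC hit ratio: %.1f%%\n", h*100/(h+m); else print "No L2ARC activity yet"}' /proc/spl/kstat/zfs/arcstats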
2. ZFS Data Ingestion for Proxmox/KVM
2.1. VM Storage Options
For Proxmox/KVM workloads, you have two main options:
ZVOLs (Block devices) - Manual Creation:
# Create zvol for VM disk - these WON'T appear in Proxmox UI
# Using default volblocksize (16K since OpenZFS v2.2)
zfs create -V 50G tank/vm-101-disk-0
# For specific workloads, adjust volblocksize as needed
zfs create -V 50G -o volblocksize=16K tank/vm-102-disk-0 # Database VMs
# Manual ZVOLs appear as block devices
ls -la /dev/zvol/tank/vm-*
# /dev/zvol/tank/vm-101-disk-0
# /dev/zvol/tank/vm-102-disk-0
Important: Manual ZVOLs created via command line won't show in Proxmox GUI storage management. You must manage them entirely via CLI or use Proxmox's ZFS storage integration.
Proxmox-managed ZFS storage:
# Add ZFS storage in Proxmox (this goes in /etc/pve/storage.cfg)
# Proxmox will manage ZVOLs automatically through the GUI
pvesm add zfspool local-zfs -pool tank -content images,rootdir
Raw disk images on datasets:
# Create dataset for VM images - these appear in Proxmox GUI
zfs create tank/vmimages
pvesm add dir vmstore -path /tank/vmimages -content images,iso,vztmpl
# VMs stored as files: /tank/vmimages/101/vm-101-disk-0.raw
2.2. Managing Manual ZVOLs with VMs
When you create ZVOLs manually, you need to configure VMs to use them:
# Create ZVOL for VM
zfs create -V 50G -o volblocksize=16K tank/vm-103-disk-0
# Add to VM configuration manually
echo "scsi0: /dev/zvol/tank/vm-103-disk-0,cache=none,discard=on" >> /etc/pve/qemu-server/103.conf
# Or use qm command
qm set 103 --scsi0 /dev/zvol/tank/vm-103-disk-0,cache=none,discard=on
# ZVOL won't appear in Proxmox GUI storage management
# Manage snapshots, clones, etc. via ZFS commands:
zfs snapshot tank/vm-103-disk-0@backup-$(date +%Y%m%d)
zfs clone tank/vm-103-disk-0@backup-20241201 tank/vm-104-disk-0
Recent testing suggests raw disk images can outperform ZVOLs by as much as 6x for small random I/O:
# Optimize dataset for VM images
zfs set recordsize=64K tank/vmimages
zfs set compression=lz4 tank/vmimages
zfs set xattr=sa tank/vmimages
zfs set logbias=throughput tank/vmimages
# For zvols, optimize at creation
zfs create -V 50G -o volblocksize=8K -o compression=lz4 tank/vm-disk
2.3. Data Flow in VM Workloads
VM I/O flows: Guest OS → QEMU → Host Kernel → ZFS
Key considerations:
- Async writes: Buffered in ARC, flushed every 5 seconds via TXG
- Sync writes: Must hit stable storage before acknowledgment
- Cache modes: use cache=none in QEMU so ZFS handles all caching
# Monitor TXG sync times
watch 'zpool iostat -v 1 1'
# Check for sync write pressure
zpool iostat -v 1
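Two knobs worth knowing about here (a sketch; the module parameter path assumes ZFS on Linux): the TXG flush interval behind the 5-second figure above, and the per-dataset sync property that controls how sync writes are honored.
# TXG flush interval in seconds (default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout
# Per-dataset sync policy: standard (default), always, or disabled
zfs get sync tank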
3. Creating Production Test Pools
3.1. Production Disk Pool Creation
WARNING: This will destroy all data on specified disks. Only use spare/test disks.
# List available disks
lsblk
fdisk -l
# Create various topologies for testing with real disks
# ENSURE these are spare disks - this destroys all data!
zpool create testmirror mirror /dev/sdc /dev/sdd
zpool create testraidz1 raidz /dev/sde /dev/sdf /dev/sdg
zpool create testraidz2 raidz2 /dev/sdh /dev/sdi /dev/sdj /dev/sdk
# Verify pool creation
zpool status
zpool list
3.2. Add Cache and Log Devices
# Add NVMe SSD as cache device (L2ARC)
zpool add testmirror cache /dev/nvme0n1
# Add a small, fast SSD as log device (SLOG) - only helps sync-heavy workloads
# SLOG should be mirrored in production, across two separate SSDs
# (two partitions of the same disk, as below, is fine for testing but gives no real redundancy)
zpool add testmirror log mirror /dev/sdb1 /dev/sdb2
# Verify topology
zpool status testmirror
4. Simulating ZFS Failures
4.1. Simulating Disk Failures
As with Ceph, there are hard and soft failure scenarios; ZFS handles both gracefully given proper redundancy.
Prerequisites:
- Redundant ZFS pool (mirror or raidz) with spare disks
- Root access to the system
- Test data with known checksums (a quick way to create some is sketched right after this list)
- CRITICAL: Ensure you're using test/spare disks only
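A minimal sketch for the test-data prerequisite, assuming the testmirror pool from section 3; it creates a testfs dataset that the later scenarios also reuse:
# Create a test dataset and files with recorded checksums
zfs create testmirror/testfs
for i in {1..5}; do
    dd if=/dev/urandom of=/testmirror/testfs/datafile$i bs=1M count=100
done
sha256sum /testmirror/testfs/datafile* > /root/zfs-test-checksums.txt
# Re-verify after each failure/recovery scenario:
sha256sum -c /root/zfs-test-checksums.txt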
4.2. Soft Failure Simulation (Clean Offline)
Procedure:
- Take real device offline cleanly:
# Offline physical device in mirror
zpool offline testmirror /dev/sdc
# Check degraded status
zpool status testmirror
Expected output:
pool: testmirror
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
config:
NAME STATE READ WRITE CKSUM
testmirror DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/dev/sdc OFFLINE 0 0 0
/dev/sdd ONLINE 0 0 0
- Bring device back online:
zpool online testmirror /dev/sdc
zpool status testmirror # Should show resilvering
4.3. Hard Failure Simulation (Physical Removal)
Procedure:
- Create test VM with active workload:
# Create ZVOL for test VM (using default 16K volblocksize)
zfs create -V 20G testmirror/vm-999-disk-0
# Create test VM configuration
qm create 999 --name zfs-test-vm --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0
qm set 999 --scsi0 /dev/zvol/testmirror/vm-999-disk-0,cache=none,discard=on
qm set 999 --cdrom local:iso/ubuntu-22.04-live-server-amd64.iso
qm set 999 --boot order=scsi0
# Start VM and install/configure it with test workload
qm start 999
# Once VM is running, create continuous I/O workload inside VM
# SSH into VM and run:
# nohup dd if=/dev/zero of=/home/test-large-file bs=1M count=1000 &
# nohup find / -type f -exec wc -l {} \; > /dev/null 2>&1 &
# Monitor VM status and ZFS pool
qm status 999 # Should show 'running'
zpool iostat testmirror 1 5 # Observe I/O activity
- Simulate physical disk failure during VM operation:
# Remove device from system while VM is running (simulates hardware failure)
echo 1 > /sys/block/sdc/device/delete
# Check pool status - should detect failure but continue operation
zpool status testmirror
# Verify VM continues running without interruption
qm status 999 # Should still show 'running'
# Check VM console for any I/O errors (there should be none)
qm terminal 999
- Monitor detection and recovery:
# Watch pool operating in degraded mode while VM continues
watch 'zpool status testmirror'
# Expected output shows DEGRADED state but ONLINE operation:
# pool: testmirror
# state: DEGRADED
# config:
# NAME STATE READ WRITE CKSUM
# testmirror DEGRADED 0 0 0
# mirror-0 DEGRADED 0 0 0
# /dev/sdc UNAVAIL 0 0 0 cannot open
# /dev/sdd ONLINE 0 0 0
# Verify VM performance during degraded operation
qm monitor 999
# (qm) info block
# Check for I/O statistics and verify no errors
# Test VM responsiveness
ping VM_IP # Should respond normally
ssh user@VM_IP "uptime" # Should show continued operation
# Monitor VM I/O during degraded state
zpool iostat -v testmirror 1 10 # Watch I/O patterns
Expected Behavior:
- VM continues running without interruption
- No I/O errors in VM guest OS
- Pool operates in DEGRADED state but remains functional
- Performance may be slightly reduced but still acceptable
- ZFS transparently handles the failure without VM awareness
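ZFS also records these faults in its event log, which is what ZED (section 5.3) consumes; inspecting it after the pull-the-disk test is a useful sanity check (output format varies by OpenZFS version):
# Show recent ZFS events, including the device removal/fault
zpool events -v | tail -50
# Clear the event backlog once reviewed
zpool events -c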
4.4. Disk Replacement Procedure
Procedure:
- Replace failed disk while VM continues operation:
# After physical disk replacement, rescan SCSI bus
echo "- - -" > /sys/class/scsi_host/host0/scan
# Replace failed device with new one
zpool replace testmirror /dev/sdc /dev/sdl
# Monitor resilver progress while VM remains active
watch 'zpool status -v testmirror'
# Expected output during resilver:
# pool: testmirror
# state: DEGRADED
# status: One or more devices is currently being resilvered.
# scan: resilver in progress since [timestamp]
# X.XXG scanned at X.XXG/s, X.XXG issued at X.XXG/s, X.XXG total
# X.X% done, X.XXh to go
# Verify VM performance during resilver
qm status 999 # Should remain 'running'
# Test VM I/O performance during resilver (expect some impact)
ssh user@VM_IP "dd if=/dev/zero of=/tmp/test bs=1M count=100 oflag=direct"
# Monitor resilver completion
while zpool status testmirror | grep -q "resilver"; do
echo "Resilver progress: $(zpool status testmirror | grep resilver)"
echo "VM status: $(qm status 999)"
sleep 60
done
echo "Resilver complete - verifying VM integrity"
qm status 999 # Should show 'running'
zpool status testmirror # Should show ONLINE
# Clean up test VM after successful test
qm stop 999
qm destroy 999
zfs destroy testmirror/vm-999-disk-0
Production Considerations:
- VMs experience minimal impact during single disk failure
- Resilver operations may cause 10-30% performance degradation
- Monitor VM response times during resilver operations
- Schedule resilvers during maintenance windows for critical systems
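If resilver impact matters, OpenZFS exposes tunables that trade resilver speed against foreground VM I/O; the parameter below exists in current releases, but names and defaults vary by version, so treat this as a sketch:
# Time (ms) spent on resilver work per TXG - larger means faster resilver, more VM impact (default ~3000)
cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
# Example only: reduce it during business hours, restore the default after the maintenance window
echo 1000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms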
4.5. L2ARC Cache Device Failure Simulation
Procedure: Test L2ARC device failure scenarios - cache failures should not affect pool integrity:
# Prerequisites: pool with an L2ARC cache device configured, plus the testmirror/testfs dataset created in section 4.1
zpool status testmirror # Verify cache device present
# Create workload to populate L2ARC
echo "Populating L2ARC cache..."
for i in {1..50}; do
dd if=/dev/urandom of=/testmirror/testfs/cachefile$i bs=1M count=20
done
# Verify L2ARC is populated
zpool iostat -v testmirror 1 5
cat /proc/spl/kstat/zfs/arcstats | grep l2_
# Test read performance with L2ARC active
echo "Testing read performance with L2ARC..."
fio --direct=1 --iodepth=32 --rw=randread --ioengine=libaio --bs=64k \
--size=500M --numjobs=2 --runtime=60 --group_reporting \
--filename=/testmirror/testfs/l2arc-test --name=L2ARC_Active_Test
# Simulate L2ARC device failure
echo "Simulating L2ARC device failure..."
if [[ -b /dev/nvme0n1 ]]; then
# Physical device failure simulation
echo 1 > /sys/block/nvme0n1/device/delete
else
# File-based cache failure simulation
rm /mnt/zfs-test/cache.img 2>/dev/null || true
fi
# Check pool status - should remain ONLINE with cache device UNAVAIL
zpool status testmirror
# Expected output shows:
# pool: testmirror
# state: ONLINE
# config:
# NAME STATE READ WRITE CKSUM
# testmirror ONLINE 0 0 0
# mirror-0 ONLINE 0 0 0
# /dev/sdc ONLINE 0 0 0
# /dev/sdd ONLINE 0 0 0
# cache
# /dev/nvme0n1 UNAVAIL 0 0 0
# Test read performance without L2ARC
echo "Testing read performance without L2ARC..."
fio --direct=1 --iodepth=32 --rw=randread --ioengine=libaio --bs=64k \
--size=500M --numjobs=2 --runtime=60 --group_reporting \
--filename=/testmirror/testfs/l2arc-test --name=L2ARC_Failed_Test
# Remove failed cache device from pool
zpool remove testmirror /dev/nvme0n1
# Verify pool operates normally without cache
zpool status testmirror # Should show no cache devices
zpool iostat -v testmirror
# Add replacement cache device if available
if [[ -b /dev/nvme1n1 ]]; then
zpool add testmirror cache /dev/nvme1n1
echo "Replacement cache device added"
fi
echo "L2ARC failure test complete - pool integrity maintained"
Expected Behavior:
- Pool remains ONLINE despite cache device failure
- No data loss occurs
- Read performance may degrade temporarily
- Cache device can be safely removed and replaced
- Critical: L2ARC failures never affect pool integrity
4.6. Multiple Device Failures
Procedure: Test RAIDZ resilience with real disks:
# RAIDZ1 can survive 1 disk failure
echo 1 > /sys/block/sde/device/delete # Pool remains online
# Simulate second failure - pool becomes unavailable
echo 1 > /sys/block/sdf/device/delete
zpool status testraidz1 # Will show FAULTED state
5. Production Monitoring
5.1. Critical Metrics to Monitor
# Pool health check
zpool status -v
# Capacity monitoring (alert at 80%)
zpool list
# I/O latency tracking (-l adds per-vdev latency columns)
zpool iostat -v -l 1
# ARC efficiency
cat /proc/spl/kstat/zfs/arcstats | grep -E "^hits|^misses"
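A rough overall ARC hit ratio can be derived from the same counters (cumulative since boot, so trend it rather than reading it once):
awk '/^hits / {h=$3} /^misses / {m=$3} END {printf "ARC hit ratio: %.1f%%\n", h*100/(h+m)}' /proc/spl/kstat/zfs/arcstats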
5.2. Automated Monitoring Script
#!/bin/bash
# /usr/local/bin/zfs-monitor.sh
# Check pool status
for pool in $(zpool list -H -o name); do
status=$(zpool list -H -o health $pool)
if [ "$status" != "ONLINE" ]; then
echo "ALERT: Pool $pool status: $status" | mail -s "ZFS Alert" [email protected]
fi
# Check capacity
cap=$(zpool list -H -o capacity $pool | tr -d '%')
if [ $cap -gt 80 ]; then
echo "WARNING: Pool $pool at ${cap}% capacity" | mail -s "ZFS Capacity" [email protected]
fi
done
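To run the script periodically, a simple cron entry is enough (the path and 5-minute interval are just examples):
# Make the script executable and schedule it every 5 minutes
chmod +x /usr/local/bin/zfs-monitor.sh
echo '*/5 * * * * root /usr/local/bin/zfs-monitor.sh' > /etc/cron.d/zfs-monitor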
5.3. ZED Configuration
If your mail command can actually deliver email to the outside world, just configure the address in zed.rc as needed and you should start receiving alerts.
Configure ZFS Event Daemon for automated responses:
# Edit ZED configuration
vim /etc/zfs/zed.d/zed.rc
# Key settings:
ZED_EMAIL_ADDR="admin@example.com"
ZED_EMAIL_PROG="mail"
ZED_NOTIFY_VERBOSE=1
ZED_SPARE_ON_IO_ERRORS=1
ZED_SPARE_ON_CHECKSUM_ERRORS=10
# Enable and start ZED
systemctl enable zfs-zed
systemctl start zfs-zed
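A hedged way to confirm the whole notification path works is to trigger a benign event, such as a scrub, and watch the ZED log; with ZED_NOTIFY_VERBOSE=1 you should also receive a scrub-finished email:
# Trigger a harmless event and watch ZED handle it
zpool scrub testmirror
journalctl -u zfs-zed -f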
6. Performance Tuning for VM Workloads
6.1. ARC Sizing for Virtualization Hosts
# Conservative: 25% of RAM (adjust based on VM memory requirements)
echo $(($(grep MemTotal /proc/meminfo | awk '{print $2}') * 1024 * 25 / 100)) > /sys/module/zfs/parameters/zfs_arc_max
# Check current ARC usage
grep "size" /proc/spl/kstat/zfs/arcstats
6.2. Recordsize Optimization
# For VM datasets (guest filesystem dependent)
zfs set recordsize=16K tank/databases # Database and general VMs
zfs set recordsize=1M tank/backups # Large sequential files
# Monitor record size effectiveness
zfs get recordsize,written,referenced tank/vmimages
7. Recovery Verification
7.1. VM-Based Data Integrity Testing
#!/bin/bash
# Create multiple test VMs for comprehensive failure testing
# Function to create test VM with known data
create_test_vm() {
local vmid=$1
local pool=$2
# Create ZVOL for VM (using default 16K volblocksize)
zfs create -V 10G ${pool}/vm-${vmid}-disk-0
# Create VM
qm create $vmid --name "zfs-test-vm-$vmid" --memory 1024 --cores 1 \
--net0 virtio,bridge=vmbr0 --onboot 0
qm set $vmid --scsi0 /dev/zvol/${pool}/vm-${vmid}-disk-0,cache=none,discard=on
qm set $vmid --cdrom local:iso/ubuntu-22.04-live-server-amd64.iso
echo "Created test VM $vmid on pool $pool"
}
# Create test VMs before failure scenarios
create_test_vm 990 testmirror
create_test_vm 991 testmirror
create_test_vm 992 testmirror
# Start VMs and create unique test data in each
for vmid in 990 991 992; do
qm start $vmid
# Wait for VM to boot and create test data
sleep 30
# SSH into each VM and create unique test files with known checksums
ssh user@VM_${vmid}_IP "
echo 'Creating test data in VM $vmid'
dd if=/dev/urandom of=/home/testfile-$vmid bs=1M count=100
md5sum /home/testfile-$vmid > /home/checksum-$vmid.txt
sync
"
done
# After failure scenarios, verify VM data integrity
verify_vm_integrity() {
local vmid=$1
echo "Verifying integrity of VM $vmid"
qm status $vmid
if [[ $(qm status $vmid) == *"running"* ]]; then
ssh user@VM_${vmid}_IP "
echo 'Verifying checksum in VM $vmid'
md5sum -c /home/checksum-$vmid.txt
if [ \$? -eq 0 ]; then
echo 'VM $vmid: Data integrity PASSED'
else
echo 'VM $vmid: Data integrity FAILED'
fi
"
else
echo "VM $vmid is not running - attempting to start"
qm start $vmid
sleep 30
ssh user@VM_${vmid}_IP "md5sum -c /home/checksum-$vmid.txt"
fi
}
# Verify all VMs after recovery
echo "=== VM Data Integrity Verification ==="
for vmid in 990 991 992; do
verify_vm_integrity $vmid
done
# Cleanup test VMs
cleanup_test_vms() {
for vmid in 990 991 992; do
qm stop $vmid 2>/dev/null
qm destroy $vmid
zfs destroy testmirror/vm-${vmid}-disk-0
done
}
# Uncomment to cleanup after testing
# cleanup_test_vms
7.2. Performance Impact Assessment - VM Workload (Cheat Sheet)
# Cheat Sheet - ZFS Performance
# This mimics real VM I/O patterns with 4K blocks and high queue depths
# Run these on the Proxmox host against the test pool's mountpoint
TESTDIR="/testmirror"
mkdir -p ${TESTDIR}
# Baseline tests before failure
echo "=== BASELINE PERFORMANCE TESTS ==="
# Random write test (VM boot, application startup)
fio --direct=1 --iodepth=128 --rw=randwrite --ioengine=libaio --bs=4k \
--size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-randwrite --name=Baseline_Random_Write
rm ${TESTDIR}/iotest-*; sync
# Random read test (database queries, file access)
fio --direct=1 --iodepth=128 --rw=randread --ioengine=libaio --bs=4k \
--size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-randread --name=Baseline_Random_Read
rm ${TESTDIR}/iotest-*; sync
# Mixed random I/O (typical VM workload - 70% read, 30% write)
fio --direct=1 --iodepth=128 --rw=randrw --rwmixread=70 --ioengine=libaio \
--bs=4k --size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-randmix --name=Baseline_Mixed_Random
rm ${TESTDIR}/iotest-*; sync
# Sequential write test (large file transfers, backups) - use a large block size for sequential patterns
fio --direct=1 --iodepth=128 --rw=write --ioengine=libaio --bs=1M \
--size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-seqwrite --name=Baseline_Sequential_Write
rm ${TESTDIR}/iotest-*; sync
# Sequential read test (streaming, large file access)
fio --direct=1 --iodepth=128 --rw=read --ioengine=libaio --bs=1M \
--size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-seqread --name=Baseline_Sequential_Read
rm ${TESTDIR}/iotest-*; sync
# Test during degraded operation (one disk offline)
echo "=== DEGRADED POOL PERFORMANCE TESTS ==="
zpool offline testmirror /dev/sdc
# Critical test: Random write performance during degraded operation
fio --direct=1 --iodepth=128 --rw=randwrite --ioengine=libaio --bs=4k \
--size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-degraded-randwrite --name=Degraded_Random_Write
rm ${TESTDIR}/iotest-*; sync
# Mixed workload during degraded state
fio --direct=1 --iodepth=128 --rw=randrw --rwmixread=70 --ioengine=libaio \
--bs=4k --size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-degraded-mixed --name=Degraded_Mixed_Random
rm ${TESTDIR}/iotest-*; sync
# Bring pool back online and test during resilver
echo "=== RESILVER PERFORMANCE TESTS ==="
zpool online testmirror /dev/sdc
# Test performance during resilver (most critical for production)
fio --direct=1 --iodepth=64 --rw=randrw --rwmixread=70 --ioengine=libaio \
--bs=4k --size=512M --numjobs=1 --runtime=120 --group_reporting \
--filename=${TESTDIR}/iotest-resilver-mixed --name=Resilver_Mixed_Random &
# Monitor resilver progress
while zpool status testmirror | grep -q "resilver"; do
echo "Resilver in progress: $(zpool status testmirror | grep resilver)"
sleep 30
done
wait # Wait for fio to complete
rm ${TESTDIR}/iotest-*; sync
# Final performance test after recovery
echo "=== POST-RECOVERY PERFORMANCE TESTS ==="
fio --direct=1 --iodepth=128 --rw=randrw --rwmixread=70 --ioengine=libaio \
--bs=4k --size=1G --numjobs=1 --runtime=300 --group_reporting \
--filename=${TESTDIR}/iotest-recovered-mixed --name=Recovered_Mixed_Random
rm ${TESTDIR}/iotest-*; sync
8. Production Notes
8.1. Disk Identification
CRITICAL: Always verify disk identification before testing:
# Use multiple methods to identify disks
lsblk -o NAME,SIZE,MODEL,SERIAL
smartctl -i /dev/sdc
ls -la /dev/disk/by-id/ | grep sdc
# Check if disk contains important data
blkid /dev/sdc
mount | grep sdc
lsof | grep sdc
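Because the hot-remove tests above can shuffle /dev/sdX letters after a rescan, it's safer to build test pools from persistent /dev/disk/by-id paths; the IDs below are placeholders:
# Prefer persistent by-id paths so device letter changes don't matter
ls -la /dev/disk/by-id/
# Placeholder IDs - substitute your own
zpool create testmirror mirror \
/dev/disk/by-id/ata-EXAMPLE_SERIAL_1 \
/dev/disk/by-id/ata-EXAMPLE_SERIAL_2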
8.2. Safe Testing Environment
# Create isolated test environment
# Use dedicated test disks - never production storage
# Label test pools clearly
zpool create -o comment="TEST POOL - SAFE TO DESTROY" testmirror mirror /dev/sdc /dev/sdd
This hands-on approach lets you thoroughly test ZFS failure scenarios using realistic Proxmox VM workloads without risking production data. The key is understanding that ZFS self-healing only works with proper redundancy - never test these scenarios on single-device pools in production.
The keys to a good night's sleep:
- VMs continue running transparently during single disk failures
- Resilver operations cause measurable but acceptable performance impact
- Data integrity is automatically maintained without VM awareness
- L2ARC failures have no impact on VM operation or data safety
Hard truth: ZFS can protect against drive failures, but not against operator errors.