Monitoring is the key to a happy consumer.
Monitoring with Grafana and Prometheus
ZFS Exporter Setup:
# Install zfs_exporter for Prometheus metrics
cd /opt
git clone https://github.com/pdf/zfs_exporter.git
cd zfs_exporter
go build  # build from source to get the latest metrics, or simply wget a release binary
# Create systemd service
cat > /etc/systemd/system/zfs-exporter.service << 'EOF'
[Unit]
Description=ZFS Prometheus Exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/zfs_exporter/zfs_exporter --web.listen-address=:9254
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now zfs-exporter
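Before pointing Prometheus at the exporter, it is worth confirming that it actually serves ZFS metrics. The following is a minimal sketch, not part of zfs_exporter: `list_zfs_metrics` is a hypothetical helper that reads Prometheus text-format output on stdin and prints the distinct metric names starting with `zfs_`. It also shows which metric names your exporter version really exposes, which may differ from the names used in the queries later in this post.

```shell
# list_zfs_metrics: hypothetical helper, not shipped with zfs_exporter.
# Reads Prometheus text exposition format on stdin and prints the
# distinct metric names that start with "zfs_", one per line.
list_zfs_metrics() {
  # Drop "# HELP" / "# TYPE" comment lines, keep zfs_ samples,
  # strip labels and values, then de-duplicate.
  grep -v '^#' | grep '^zfs_' | sed -E 's/[{ ].*$//' | sort -u
}
```

With the service running on the port configured above, usage would look like `curl -s http://localhost:9254/metrics | list_zfs_metrics`.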
Prometheus Configuration:
# Add under scrape_configs in prometheus.yml
  - job_name: 'zfs'
    static_configs:
      - targets: ['localhost:9254']
    scrape_interval: 30s
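After Prometheus reloads its configuration, the scrape job can be checked through the built-in `up` metric, which is 1 when the last scrape of a target succeeded. This is a rough sketch under assumptions: `target_is_up` is a hypothetical helper, and it does a plain text match on the compact JSON that the Prometheus HTTP API returns rather than parsing it properly.

```shell
# target_is_up: hypothetical helper. Reads a Prometheus /api/v1/query
# JSON response on stdin and succeeds if any sample has the value "1".
# This is a rough text match, not a JSON parser.
target_is_up() {
  grep -Eq '"value":\[[0-9.e+]+,"1"\]'
}
```

Assuming Prometheus listens on its default port 9090, a check could be run as `curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="zfs"}' | target_is_up && echo "zfs target is up"`.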
Critical Grafana Dashboard Panels:
# Pool Health Status
zfs_pool_state{pool="$pool"} != 0
# Pool Capacity Utilization (Alert at 80%)
(zfs_pool_allocated_bytes{pool="$pool"} / zfs_pool_size_bytes{pool="$pool"}) * 100
# ARC Hit Ratio (Alert below 85%)
rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100
# Checksum Errors (Alert on any)
increase(zfs_pool_checksum_errors_total{pool="$pool"}[5m])
# Fragmentation Percentage
zfs_pool_fragmentation_percent{pool="$pool"}
# Read/Write IOPS
rate(zfs_pool_read_ops_total{pool="$pool"}[5m])
rate(zfs_pool_write_ops_total{pool="$pool"}[5m])
# Read/Write Latency
zfs_pool_read_latency_seconds{pool="$pool"}
zfs_pool_write_latency_seconds{pool="$pool"}
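When a panel shows a surprising number, it helps to recompute the same ratio directly on the host. The sketch below applies the capacity and ARC hit-ratio formulas from the queries above to local data. Both function names are hypothetical, and the ARC figure here is a lifetime ratio since boot, so it will not match the 5-minute rate in the panel exactly.

```shell
# Hypothetical helpers for cross-checking dashboard numbers on the host.

# capacity_from_zpool_list: stdin is "name<TAB>allocated<TAB>size" in
# bytes, one pool per line; prints "name percent" with one decimal.
capacity_from_zpool_list() {
  awk -F'\t' '$3 > 0 { printf "%s %.1f\n", $1, ($2 / $3) * 100 }'
}

# arc_hit_ratio: stdin is arcstats-style text ("name type data" rows);
# prints the lifetime hit percentage, hits / (hits + misses) * 100.
arc_hit_ratio() {
  awk '$1 == "hits" { h = $3 } $1 == "misses" { m = $3 }
       END { if (h + m > 0) printf "%.2f\n", h / (h + m) * 100 }'
}
```

Typical usage would be `zpool list -Hp -o name,allocated,size | capacity_from_zpool_list` and, on Linux, `arc_hit_ratio < /proc/spl/kstat/zfs/arcstats`.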
Alerting Rules (alerts.yml):
groups:
  - name: zfs-alerts
    rules:
      - alert: ZFSPoolDegraded
        expr: zfs_pool_state != 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is degraded"
          description: "Pool state: {{ $value }}"
      - alert: ZFSPoolCapacityHigh
        expr: (zfs_pool_allocated_bytes / zfs_pool_size_bytes) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} capacity is {{ $value }}%"
      - alert: ZFSARCHitRateLow
        expr: (rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m]))) * 100 < 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ZFS ARC hit ratio is {{ $value }}%"
      - alert: ZFSChecksumErrors
        expr: increase(zfs_pool_checksum_errors_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "ZFS checksum errors detected on pool {{ $labels.pool }}"
      - alert: ZFSFragmentationHigh
        expr: zfs_pool_fragmentation_percent > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} fragmentation is {{ $value }}%"
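A rules file with a YAML or PromQL mistake is rejected when Prometheus loads it, so it is safer to validate it first. The sketch below wraps `promtool check rules`, which ships with Prometheus. The function name `check_alert_rules` and the default file name are assumptions for illustration.

```shell
# check_alert_rules: hypothetical wrapper around `promtool check rules`.
# Usage: check_alert_rules [path-to-rules-file]
check_alert_rules() {
  local rules="${1:-alerts.yml}"   # assumed default; pass your own path
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi
  # promtool exits non-zero if the file fails syntax or expression checks.
  promtool check rules "$rules"
}
```

Note that Prometheus only evaluates the rules if the file is listed under `rule_files:` in prometheus.yml, so a clean check here does not by itself mean the alerts are active.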
Reference Dashboards: