Monitor ZFS for Better Sleep

May 16, 2022

Monitoring is the key to happy consumers.

Monitoring with Grafana and Prometheus

ZFS Exporter Setup:

# Install zfs_exporter for Prometheus metrics
cd /opt
git clone https://github.com/pdf/zfs_exporter.git
cd zfs_exporter
go build # build from source to get the latest metrics, or simply wget a release binary instead

# Create systemd service
cat > /etc/systemd/system/zfs-exporter.service << 'EOF'
[Unit]
Description=ZFS Prometheus Exporter
After=network.target

[Service]
Type=simple
User=root
ExecStart=/opt/zfs_exporter/zfs_exporter --web.listen-address=:9254
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl enable --now zfs-exporter
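Before wiring up dashboards, it helps to confirm that the exporter is answering and to see which metric names your build actually exposes, since names can differ between exporters and versions. A minimal sketch, assuming the exporter listens on localhost:9254 as configured above; the helper name `zfs_metric_names` is made up for illustration:

```shell
#!/bin/sh
# List the distinct zfs_* metric names found in Prometheus text-format input.
# Reads stdin, so it works with curl output or a saved copy of /metrics.
zfs_metric_names() {
  # Drop "# HELP" / "# TYPE" comment lines, strip labels and values,
  # then de-duplicate the remaining metric names.
  grep -v '^#' | sed -n 's/^\(zfs_[a-zA-Z0-9_]*\).*/\1/p' | sort -u
}

# Usage on the monitored host (assumes the listen address set in the unit file):
#   systemctl is-active zfs-exporter
#   curl -s http://localhost:9254/metrics | zfs_metric_names
```

If a name used in the queries further down does not show up in this list, adjust the query to whatever your exporter reports.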

Prometheus Configuration:

# Add under scrape_configs: in prometheus.yml
- job_name: 'zfs'
  static_configs:
    - targets: ['localhost:9254']
  scrape_interval: 30s

Critical Grafana Dashboard Panels:

# Pool Health Status
zfs_pool_state{pool="$pool"} != 0

# Pool Capacity Utilization (Alert at 80%)
(zfs_pool_allocated_bytes{pool="$pool"} / zfs_pool_size_bytes{pool="$pool"}) * 100

# ARC Hit Ratio (Alert below 85%)
rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100

# Checksum Errors (Alert on any)
increase(zfs_pool_checksum_errors_total{pool="$pool"}[5m])

# Fragmentation Percentage
zfs_pool_fragmentation_percent{pool="$pool"}

# Read/Write IOPS
rate(zfs_pool_read_ops_total{pool="$pool"}[5m])
rate(zfs_pool_write_ops_total{pool="$pool"}[5m])

# Read/Write Latency
zfs_pool_read_latency_seconds{pool="$pool"}
zfs_pool_write_latency_seconds{pool="$pool"}
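To sanity-check the capacity and ARC panels, the same two ratios can be computed directly on the host and compared with what Grafana shows. This is a rough sketch, not part of the exporter: the function names are hypothetical, and the ARC part assumes a Linux host where OpenZFS publishes `/proc/spl/kstat/zfs/arcstats`.

```shell
#!/bin/sh
# Print "<pool> <percent used>" for each pool.
# Input: tab-separated name, size, allocated (in bytes), as printed by
#   zpool list -H -p -o name,size,allocated
pool_capacity_percent() {
  awk -F '\t' '$2 > 0 { printf "%s %.1f\n", $1, ($3 / $2) * 100 }'
}

# Print the ARC hit ratio in percent, counted since boot.
# Input: arcstats rows of the form "name type data"; only the exact
# "hits" and "misses" counters are used.
arc_hit_percent() {
  awk '$1 == "hits" { h = $3 } $1 == "misses" { m = $3 }
       END { if (h + m > 0) printf "%.1f\n", h / (h + m) * 100 }'
}

# Usage on a ZFS host:
#   zpool list -H -p -o name,size,allocated | pool_capacity_percent
#   arc_hit_percent < /proc/spl/kstat/zfs/arcstats
```

Note that the ARC figure here is a lifetime ratio since boot, while the panel uses a 5-minute rate, so the two will only roughly agree on a busy system.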

Alerting Rules (alerts.yml):

groups:
- name: zfs-alerts
  rules:
  - alert: ZFSPoolDegraded
    expr: zfs_pool_state != 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "ZFS pool {{ $labels.pool }} is degraded"
      description: "Pool state: {{ $value }}"

  - alert: ZFSPoolCapacityHigh
    expr: (zfs_pool_allocated_bytes / zfs_pool_size_bytes) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "ZFS pool {{ $labels.pool }} capacity is {{ $value }}%"

  - alert: ZFSARCHitRateLow
    expr: (rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m]))) * 100 < 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "ZFS ARC hit ratio is {{ $value }}%"

  - alert: ZFSChecksumErrors
    expr: increase(zfs_pool_checksum_errors_total[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "ZFS checksum errors detected on pool {{ $labels.pool }}"

  - alert: ZFSFragmentationHigh
    expr: zfs_pool_fragmentation_percent > 50
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "ZFS pool {{ $labels.pool }} fragmentation is {{ $value }}%"
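Alert rules are easy to get subtly wrong, so it can be worth exercising one with promtool's rule unit tests before relying on it. The sketch below writes a small test file for the ZFSPoolCapacityHigh rule. The file name `alerts_test.yml`, the pool name `tank`, and the sample values (90 of 100 bytes used) are illustrative assumptions; it also assumes `alerts.yml` sits in the same directory.

```shell
#!/bin/sh
# Write a promtool unit test that feeds a synthetic pool at 90% capacity
# into the rules from alerts.yml and expects ZFSPoolCapacityHigh to fire.
cat > alerts_test.yml << 'EOF'
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Constant samples for 10 minutes: 90 bytes allocated out of 100.
      - series: 'zfs_pool_allocated_bytes{pool="tank"}'
        values: '90+0x10'
      - series: 'zfs_pool_size_bytes{pool="tank"}'
        values: '100+0x10'
    alert_rule_test:
      # The rule has "for: 5m", so it should be firing well before 10m.
      - eval_time: 10m
        alertname: ZFSPoolCapacityHigh
        exp_alerts:
          - exp_labels:
              severity: warning
              pool: tank
            exp_annotations:
              summary: "ZFS pool tank capacity is 90%"
EOF

# Then, with promtool installed:
#   promtool check rules alerts.yml
#   promtool test rules alerts_test.yml
```

Prometheus only loads the rules if `alerts.yml` is listed under `rule_files:` in prometheus.yml.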

Reference Dashboards:


Written by Nicolas Julian, someone who tries to create. Chat with me on Twitter.