
TL;DR: Learn how to recover from a complete Kubernetes cluster lockout caused by misconfigured Cilium network policies. This guide covers etcd disaster recovery using --force-new-cluster, fixing lost quorum, and preventing similar incidents.
In this comprehensive guide, I'll show you how one Cilium network policy configuration mistake completely locked me out of my production Kubernetes cluster, and the exact step-by-step process I used to recover without any data loss. If you're running Kubernetes with Cilium CNI and working with network policies, this post will help you avoid and recover from similar disasters.
Background: Our Kubernetes Cluster Setup
I run a 3-node Kubernetes cluster with the following configuration:
- Deployment tool: Kubespray
- Infrastructure: VPS with public IPs only (no private networking)
- CNI: Cilium
- etcd: External etcd cluster (systemd service)
- Use case: Production workloads requiring strict network security
Because our cluster runs on VPS hosts with only public IPs exposed, network security through Cilium network policies is critical. My colleague was implementing security hardening and needed to restrict access to the node-exporter metrics endpoint (port 9100) using Cilium's clusterwide network policies.
What seemed like a simple security improvement turned into a complete cluster outage within seconds.
The Problem: Kubernetes Cluster Completely Locked Out
What Went Wrong With Cilium Network Policy
Here's the exact Cilium clusterwide network policy that caused the complete cluster lockout:
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: restrict-node-exporter-hostport-9100
spec:
  nodeSelector:
    matchLabels: {} # Applied to ALL nodes - this was the problem
  ingress:
    - fromCIDR:
        - 103.xxx.xxx.xxx/32
      toPorts:
        - ports:
            - port: "9100"
              protocol: TCP
    - fromEntities:
        - cluster
      toPorts:
        - ports:
            - port: "9100"
              protocol: TCP
Understanding Cilium Clusterwide Network Policy Behavior
The critical mistake: Cilium clusterwide network policies with an empty nodeSelector operate in whitelist mode, not blacklist mode. This is a common misunderstanding that causes many Kubernetes cluster outages.
By only specifying port 9100 in the ingress rules, we unintentionally created a default-deny policy for ALL other ports across the entire cluster, including:
- Port 6443 - Kubernetes API server (no kubectl access)
- Port 2379/2380 - etcd cluster communication (quorum lost)
- Port 10250 - kubelet API (node communication broken)
- Port 53 - DNS (service discovery failed)
This created a perfect deadlock:
- Cannot delete the network policy → need API server access
- Cannot restore API server → need etcd working
- Cannot fix etcd → need cluster communication (blocked by Cilium)
- Cannot disable Cilium → need etcd to delete DaemonSet
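At this stage, the quickest way to confirm that Cilium policy drops (rather than just a crashed control plane) were to blame is to ask the agent directly on the node. A minimal sketch, assuming nerdctl can see the kubelet-managed containers as in the commands later in this post, and that the in-container CLI (cilium, or cilium-dbg on newer releases) behaves as in current Cilium documentation:
# From the affected node, query the still-running cilium-agent container
$ nerdctl ps | grep cilium-agent | grep -v pause
$ nerdctl exec -it <cilium-container-id> cilium endpoint list
# The reserved:host endpoint showing ingress policy enforcement "Enabled"
# means the node itself is now in default-deny (whitelist) mode
$ nerdctl exec -it <cilium-container-id> cilium monitor --type drop
# Streams live drop events; traffic to 6443/2379/2380 showing up here
# as policy-denied confirms the lockout is caused by the policy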
Kubernetes Cluster Troubleshooting: Identifying the Issue
Symptom 1: API Server Unreachable
$ kubectl get nodes
Unable to connect to the server: dial tcp 103.xxx.xxx.xxx:6443: connect: connection refused
$ curl -k https://127.0.0.1:6443/healthz
curl: (7) Failed to connect to 127.0.0.1 port 6443: Connection refused
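"Connection refused" on its own doesn't distinguish between dropped packets and a dead process. Since kubectl was unavailable, the container runtime can be inspected directly; this sketch assumes the Kubespray/containerd setup described above, where nerdctl lists kubelet-managed containers:
# Check whether the kube-apiserver container is running at all
$ nerdctl ps -a | grep kube-apiserver | grep -v pause
# If it has exited, the tail of its log usually says why
$ nerdctl logs <kube-apiserver-container-id> 2>&1 | tail -n 50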
Symptom 2: etcd Lost Quorum - No Leader
$ ETCDCTL_API=3 etcdctl endpoint status -w table
+----------------+------------------+---------+---------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+----------------+------------------+---------+---------+-----------+------------+
| 127.0.0.1:2379 | 760112fd749b508a | 3.5.6 | 497 MB | false | 13384 |
+----------------+------------------+---------+---------+-----------+------------+
ERROR: etcdserver: no leader
The etcd cluster lost quorum because the nodes couldn't communicate on ports 2379/2380. Without a Raft leader, etcd refuses all write operations, meaning we couldn't have deleted the problematic network policy even with API access.
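Because etcd runs as a systemd service in this setup, its journal is the quickest way to confirm that this is a peer-connectivity problem rather than data corruption (assuming journald logging, as is the default for systemd units):
$ journalctl -u etcd --since "15 minutes ago" | tail -n 20
# Expect repeated failed elections and peer errors, e.g. "lost leader",
# election timeouts, or i/o timeouts dialing the other members' :2380
# peer URLs - network symptoms, not a damaged data directory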
Complete Kubernetes Disaster Recovery Process
Step 1: Stop Cilium Agent to Disable Network Policies
The first attempt was to disable Cilium network policy enforcement by stopping the agent directly:
# Find cilium-agent container using nerdctl
$ nerdctl ps | grep cilium-agent | grep -v pause
# Stop the cilium-agent container
$ nerdctl stop <cilium-container-id>
Result: Cilium stopped, but the API server had already crashed and etcd still had no leader; the damage was already done.
Step 2: Attempt Cilium Uninstall
$ cilium uninstall
🔥 Deleting agent DaemonSet...
🔥 Deleting operator Deployment...
✅ Cilium was successfully uninstalled.
Problem: this command goes through the Kubernetes API server and therefore needs a working etcd, creating a circular dependency. We needed a different approach.
Step 3: Backup etcd Before Recovery (Critical Step)
Always back up etcd before attempting any disaster recovery operations:
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup-$(date +%Y%m%d-%H%M%S).db \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
Output:
Snapshot saved at /root/etcd-backup-20251013-110002.db
This backup is your safety net. If the recovery process fails, you can restore from this snapshot.
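It's also worth confirming the snapshot file is actually readable before proceeding. etcdctl can print its hash, revision, key count, and size (on etcd 3.5+ the same check is also available as etcdutl snapshot status):
$ ETCDCTL_API=3 etcdctl snapshot status /root/etcd-backup-20251013-110002.db -w table
# A populated table (hash, revision, total keys, total size) means the
# backup is usable; an error here means take another snapshot first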
Step 4: etcd Disaster Recovery - Force New Cluster
This is the nuclear option for etcd disaster recovery. The --force-new-cluster flag forces a node to become a single-member cluster with itself as the leader.
What --force-new-cluster does:
- Removes all other members from the cluster configuration
- Keeps all existing data intact (does NOT delete data)
- Forces this node to become the Raft leader
- Allows the cluster to operate as single-node temporarily
# Stop etcd systemd service
$ systemctl stop etcd
# Start etcd with force-new-cluster flag
$ etcd --force-new-cluster \
--data-dir=/var/lib/etcd \
--name=etcd2 \
--advertise-client-urls=https://103.xxx.xxx.xxx:2379 \
--listen-client-urls=https://103.xxx.xxx.xxx:2379,https://127.0.0.1:2379 \
--cert-file=/etc/ssl/etcd/ssl/member-argocd-1.pem \
--key-file=/etc/ssl/etcd/ssl/member-argocd-1-key.pem \
--trusted-ca-file=/etc/ssl/etcd/ssl/ca.pem \
--client-cert-auth=true &
Success! etcd became leader:
{"level":"info","ts":"2025-10-13T11:28:12.212+0700","msg":"760112fd749b508a became leader at term 13385"}
{"level":"info","ts":"2025-10-13T11:28:12.215+0700","msg":"published local member to cluster through raft"}
{"level":"info","ts":"2025-10-13T11:28:12.218+0700","msg":"serving client traffic securely"}
Now etcd had a leader, and the Kubernetes API server automatically came back online!
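A quick sanity check at this point is to repeat the same probes that failed earlier:
$ curl -k https://127.0.0.1:6443/healthz
ok
$ kubectl get nodes
# Nodes may briefly show NotReady while kubelets and the CNI settle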
Step 5: Delete the Problematic Cilium Network Policy
With API access restored, we could finally delete the bad network policy:
$ kubectl get ciliumclusterwidenetworkpolicies
NAME                                    AGE
restrict-node-exporter-hostport-9100    94m
$ kubectl delete ciliumclusterwidenetworkpolicies/restrict-node-exporter-hostport-9100
ciliumclusterwidenetworkpolicy.cilium.io "restrict-node-exporter-hostport-9100" deleted
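With the policy gone (and the Cilium agent already stopped), nothing should be dropping cross-node traffic anymore. Before rebuilding the etcd cluster, a cheap connectivity check between the nodes doesn't hurt (placeholder IPs; assumes netcat is installed on the node):
# From node 1, confirm the other nodes' etcd peer port is reachable again
$ nc -zv <node-2-ip> 2380
$ nc -zv <node-3-ip> 2380
# A "succeeded"/"open" result means the members can talk once restarted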
Step 6: Restore etcd Cluster - Add Members Back
Running with single-node etcd is risky. We needed to restore the 3-node cluster for high availability.
Add first member back to etcd cluster:
ETCDCTL_API=3 etcdctl member add etcd1 --peer-urls=https://103.xxx.xxx.xxx:2380 \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
Output:
Member d25f65265e3477c7 added to cluster ed92624dcc0aa007
ETCD_NAME="etcd1"
ETCD_INITIAL_CLUSTER="etcd2=https://103.xxx.xxx.xxx:2380,etcd1=https://103.xxx.xxx.xxx:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
On the second node, rejoin the cluster:
$ ssh root@<node-2-ip>
# Stop etcd and backup old data
$ systemctl stop etcd
$ mv /var/lib/etcd /var/lib/etcd.old
# Update etcd configuration to join existing cluster
$ sed -i 's/ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE=existing/' /etc/etcd.env
$ sed -i 's/ETCD_INITIAL_CLUSTER=.*/ETCD_INITIAL_CLUSTER=etcd2=https:\/\/103.xxx.xxx.xxx:2380,etcd1=https:\/\/103.xxx.xxx.xxx:2380/' /etc/etcd.env
# Start etcd to rejoin cluster
$ systemctl start etcd
Repeat this process for the third node.
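For reference, after running the corresponding etcdctl member add on the healthy node, the relevant lines in /etc/etcd.env on a rejoining node end up looking roughly like this (illustrative values; Kubespray generates the full file, and the member names and IPs must match your cluster):
# /etc/etcd.env (excerpt) on the node rejoining as etcd3
ETCD_NAME=etcd3
ETCD_INITIAL_CLUSTER=etcd2=https://<node-1-ip>:2380,etcd1=https://<node-2-ip>:2380,etcd3=https://<node-3-ip>:2380
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://<node-3-ip>:2380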
Step 7: Verify Complete Kubernetes Cluster Recovery
Check etcd cluster members:
$ ETCDCTL_API=3 etcdctl member list -w table
+------------------+---------+-------+------------------------------+------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-------+------------------------------+------------------------------+------------+
| 760112fd749b508a | started | etcd2 | https://103.xxx.xxx.xxx:2380 | https://103.xxx.xxx.xxx:2379 | false |
| d25f65265e3477c7 | started | etcd1 | https://103.xxx.xxx.xxx:2380 | https://103.xxx.xxx.xxx:2379 | false |
| da70198d6c2536f2 | started | etcd3 | https://103.xxx.xxx.xxx:2380 | https://103.xxx.xxx.xxx:2379 | false |
+------------------+---------+-------+------------------------------+------------------------------+------------+
Verify etcd health and leader election:
$ ETCDCTL_API=3 etcdctl endpoint status -w table
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 103.xxx.xxx.xxx:2379 | 760112fd749b508a | 3.5.6 | 497 MB | true | false | 13388 | 2079555217 | 2079555217 | |
| 103.xxx.xxx.xxx:2379 | d25f65265e3477c7 | 3.5.6 | 497 MB | false | false | 13388 | 2079555218 | 2079555218 | |
| 103.xxx.xxx.xxx:2379 | da70198d6c2536f2 | 3.5.6 | 497 MB | false | false | 13388 | 2079555220 | 2079555220 | |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Perfect! All three etcd members are healthy, synchronized (same RAFT term), with proper leader election. Kubernetes cluster fully recovered with zero data loss!
The Correct Cilium Network Policy Configuration
Here's how to properly configure Cilium clusterwide network policy without locking yourself out:
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: restrict-node-exporter-hostport
spec:
  nodeSelector:
    matchLabels: {}
  ingress:
    # Allow node-exporter access from monitoring server only
    - fromCIDR:
        - 103.xxx.xxx.xxx/32
      toPorts:
        - ports:
            - port: "9100"
              protocol: TCP
    # CRITICAL: Allow all cluster-internal communication
    - fromEntities:
        - cluster
        - host
    # CRITICAL: Explicitly allow Kubernetes control plane ports
    - toPorts:
        - ports:
            - port: "6443" # Kubernetes API server
              protocol: TCP
            - port: "2379" # etcd client port
              protocol: TCP
            - port: "2380" # etcd peer port
              protocol: TCP
            - port: "10250" # kubelet API
              protocol: TCP
            - port: "53" # CoreDNS
              protocol: TCP
            - port: "53" # CoreDNS
              protocol: UDP
Key differences:
- Added fromEntities: [cluster, host] to allow all internal cluster communication
- Explicitly whitelisted all critical Kubernetes ports
- Maintained the port 9100 restriction to the specific monitoring IP
Kubernetes Network Policy Best Practices
1. Understand Cilium Policy Modes
Cilium clusterwide network policies with a nodeSelector operate in whitelist mode:
- Only explicitly allowed traffic is permitted
- All other traffic is denied by default
- Very different from traditional firewall deny rules
2. Always Whitelist Critical Kubernetes Ports
When using any network policies, ALWAYS allow:
- 6443 - Kubernetes API server (kubectl access)
- 2379/2380 - etcd cluster communication (critical for consensus)
- 10250 - kubelet API (node management)
- 53 - CoreDNS (service discovery)
3. Test Network Policies in Staging Environment
Never test Cilium clusterwide network policies directly in production (a quick pre-flight check is sketched after this list). Always:
- Create identical staging cluster
- Apply policies in staging first
- Verify all critical services remain accessible
- Monitor for 24 hours before production rollout
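As a pre-flight check, the manifest can be validated against the live API without persisting it, and drops can be watched while the policy soaks in staging. This sketch assumes Hubble is enabled and that the filename matches your manifest:
# Server-side validation only; nothing is persisted
$ kubectl apply --dry-run=server -f restrict-node-exporter-hostport.yaml
# After applying in staging, watch for unexpected policy drops
$ hubble observe --verdict DROPPED --last 100
# The critical ports staying reachable is the real signal
$ kubectl get nodes && kubectl get --raw /readyz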
4. Prefer Namespaced Network Policies
Instead of CiliumClusterwideNetworkPolicy
, use namespaced CiliumNetworkPolicy
when possible:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy # Namespaced - safer
metadata:
  name: restrict-node-exporter
  namespace: monitoring
spec:
  endpointSelector:
    matchLabels:
      app: node-exporter
  # Only affects pods in monitoring namespace
Namespaced policies reduce blast radius and prevent cluster-wide outages.
5. Implement Kubernetes Monitoring and Alerts
Set up monitoring for:
- etcd cluster health (alert on "no leader")
- etcd member count changes
- API server availability
- Network policy changes (audit log)
These alerts would have caught the issue within seconds instead of after the lockout occurred.
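As an illustration, if etcd's metrics endpoint is already scraped by Prometheus, an alert on lost leadership could look something like this (the rule and label names are examples, not taken from this cluster; etcd_server_has_leader is a metric etcd exposes itself):
groups:
  - name: etcd-health
    rules:
      - alert: EtcdNoLeader
        # etcd_server_has_leader is exported by etcd's own /metrics endpoint
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd member {{ $labels.instance }} has no leader"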
6. Document etcd Disaster Recovery Procedures
Every Kubernetes administrator should know:
- How to take etcd snapshots
- How to restore from etcd backup
- When and how to use etcd --force-new-cluster
- How to rejoin etcd members after recovery
Emergency Kubernetes Recovery Commands
Save these commands for disaster recovery situations:
etcd Snapshot Backup
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup-$(date +%Y%m%d-%H%M%S).db \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
etcd Health Check
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
etcd Force New Cluster (Last Resort)
systemctl stop etcd
etcd --force-new-cluster \
--data-dir=/var/lib/etcd \
--name=<node-name> \
--advertise-client-urls=https://<ip>:2379 \
--listen-client-urls=https://<ip>:2379,https://127.0.0.1:2379 \
[additional TLS flags] &
Add etcd Member
ETCDCTL_API=3 etcdctl member add <name> --peer-urls=https://<ip>:2380
Restore etcd from Snapshot
ETCDCTL_API=3 etcdctl snapshot restore <backup-file> \
--data-dir=/var/lib/etcd-restore
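Note that the restore writes a brand-new data directory rather than touching the live one. A rough single-node sketch of putting it into service in this systemd-based setup (a full multi-node rebuild also needs --name, --initial-cluster, and --initial-advertise-peer-urls on the restore command):
$ systemctl stop etcd
$ mv /var/lib/etcd /var/lib/etcd.broken
$ mv /var/lib/etcd-restore /var/lib/etcd
$ systemctl start etcd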
Frequently Asked Questions (FAQ)
Q: Will etcd --force-new-cluster delete my data?
No. The --force-new-cluster flag only modifies cluster membership configuration. All your Kubernetes resources, configurations, and data remain intact in the etcd data directory.
Q: Can I recover without --force-new-cluster?
If you have lost etcd quorum and cannot restore communication between the nodes, --force-new-cluster is essentially the only option. It's specifically designed for disaster recovery when the cluster cannot elect a leader.
Q: Should I use Cilium clusterwide policies at all?
Yes, but carefully. Clusterwide policies are powerful for enforcing security across your entire cluster, but they require thorough testing and understanding. For most use cases, namespaced policies are safer.
Q: What if I don't have an etcd backup?
The --force-new-cluster method works without a backup because it uses the existing data directory. However, always maintain regular etcd backups for other disaster scenarios like disk failure or data corruption.
Conclusion: Key Takeaways for Kubernetes Administrators
One misconfigured Cilium network policy brought down our entire production Kubernetes cluster. But with proper understanding of etcd operations and disaster recovery procedures, we recovered completely without any data loss.
Critical lessons:
- Cilium clusterwide policies with nodeSelector create whitelist mode - understand this before applying
- Always explicitly allow critical Kubernetes ports (6443, 2379, 2380, 10250, 53)
- Test network policies in staging environments, never directly in production
- Know how to use etcd --force-new-cluster before you need it
- Maintain regular etcd backups and document recovery procedures
- Implement monitoring and alerting for etcd cluster health
The etcd --force-new-cluster command saved us from rebuilding the entire cluster from scratch. Every Kubernetes administrator should understand this critical disaster recovery tool.