
TL;DR: Learn how to recover from a complete Kubernetes cluster lockout caused by misconfigured Cilium network policies. This guide covers etcd disaster recovery using --force-new-cluster, fixing lost quorum, and preventing similar incidents.
In this comprehensive guide, I'll show you how one Cilium network policy configuration mistake completely locked me out of my production Kubernetes cluster, and the exact step-by-step process I used to recover without any data loss. If you're running Kubernetes with Cilium CNI and working with network policies, this post will help you avoid and recover from similar disasters.
Background: Our Kubernetes Cluster Setup
I run a 3-node Kubernetes cluster with the following configuration:
- Deployment tool: Kubespray
- Infrastructure: VPS with public IPs only (no private networking)
- CNI: Cilium
- etcd: External etcd cluster (systemd service)
- Use case: Production workloads requiring strict network security
Because our cluster runs on VPS hosts with only public IPs exposed, network security through Cilium network policies is critical. My colleague was implementing security hardening and needed to restrict access to the node-exporter metrics endpoint (port 9100) using Cilium's clusterwide network policies.
What seemed like a simple security improvement turned into a complete cluster outage within seconds.
The Problem: Kubernetes Cluster Completely Locked Out
What Went Wrong With Cilium Network Policy
Here's the exact Cilium clusterwide network policy that caused the complete cluster lockout:
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: restrict-node-exporter-hostport-9100
spec:
  nodeSelector:
    matchLabels: {} # Applied to ALL nodes - this was the problem
  ingress:
    - fromCIDR:
        - 103.xxx.xxx.xxx/32
      toPorts:
        - ports:
            - port: "9100"
              protocol: TCP
    - fromEntities:
        - cluster
      toPorts:
        - ports:
            - port: "9100"
              protocol: TCP
Understanding Cilium Clusterwide Network Policy Behavior
The critical mistake: Cilium clusterwide network policies with an empty nodeSelector operate in whitelist mode, not blacklist mode. This is a common misunderstanding that causes many Kubernetes cluster outages.
By only specifying port 9100 in the ingress rules, we unintentionally created a default-deny policy for ALL other ports across the entire cluster, including:
- Port 6443 - Kubernetes API server (no kubectl access)
- Port 2379/2380 - etcd cluster communication (quorum lost)
- Port 10250 - kubelet API (node communication broken)
- Port 53 - DNS (service discovery failed)
This created a perfect deadlock:
- Cannot delete the network policy → need API server access
- Cannot restore API server → need etcd working
- Cannot fix etcd → need cluster communication (blocked by Cilium)
- Cannot disable Cilium → need etcd to delete DaemonSet
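At this stage, the quickest way to confirm that Cilium policy drops (rather than just a crashed control plane) were to blame is to ask the agent directly on the node. A minimal sketch, assuming nerdctl can see the kubelet-managed containers as in the commands later in this post, and that the in-container CLI (cilium, or cilium-dbg on newer releases) behaves as in current Cilium documentation:
# From the affected node, query the still-running cilium-agent container
$ nerdctl ps | grep cilium-agent | grep -v pause
$ nerdctl exec -it <cilium-container-id> cilium endpoint list
# The reserved:host endpoint showing ingress policy enforcement "Enabled"
# means the node itself is now in default-deny (whitelist) mode
$ nerdctl exec -it <cilium-container-id> cilium monitor --type drop
# Streams live drop events; traffic to 6443/2379/2380 showing up here
# as policy-denied confirms the lockout is caused by the policy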
Kubernetes Cluster Troubleshooting: Identifying the Issue
Symptom 1: API Server Unreachable
$ kubectl get nodes
Unable to connect to the server: dial tcp 103.xxx.xxx.xxx:6443: connect: connection refused
$ curl -k https://127.0.0.1:6443/healthz
curl: (7) Failed to connect to 127.0.0.1 port 6443: Connection refused
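"Connection refused" on its own doesn't distinguish between dropped packets and a dead process. Since kubectl was unavailable, the container runtime can be inspected directly; this sketch assumes the Kubespray/containerd setup described above, where nerdctl lists kubelet-managed containers:
# Check whether the kube-apiserver container is running at all
$ nerdctl ps -a | grep kube-apiserver | grep -v pause
# If it has exited, the tail of its log usually says why
$ nerdctl logs <kube-apiserver-container-id> 2>&1 | tail -n 50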
Symptom 2: etcd Lost Quorum - No Leader
$ ETCDCTL_API=3 etcdctl endpoint status -w table
+----------------+------------------+---------+---------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+----------------+------------------+---------+---------+-----------+------------+
| 127.0.0.1:2379 | 760112fd749b508a | 3.5.6 | 497 MB | false | 13384 |
+----------------+------------------+---------+---------+-----------+------------+
ERROR: etcdserver: no leader
The etcd cluster lost quorum because the nodes couldn't communicate on ports 2379/2380. Without a Raft leader, etcd refuses all write operations, meaning we couldn't have deleted the problematic network policy even with API access.
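Because etcd runs as a systemd service in this setup, its journal is the quickest way to confirm that this is a peer-connectivity problem rather than data corruption (assuming journald logging, as is the default for systemd units):
$ journalctl -u etcd --since "15 minutes ago" | tail -n 20
# Expect repeated failed elections and peer errors, e.g. "lost leader",
# election timeouts, or i/o timeouts dialing the other members' :2380
# peer URLs - network symptoms, not a damaged data directory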
Complete Kubernetes Disaster Recovery Process
Step 1: Stop Cilium Agent to Disable Network Policies
The first attempt was to disable Cilium network policy enforcement by stopping the agent directly:
# Find cilium-agent container using nerdctl
$ nerdctl ps | grep cilium-agent | grep -v pause
# Stop the cilium-agent container
$ nerdctl stop <cilium-container-id>
Result: Cilium stopped, but the API server had already crashed and etcd still had no leader; the damage was already done.
Step 2: Attempt Cilium Uninstall
$ cilium uninstall
🔥 Deleting agent DaemonSet...
🔥 Deleting operator Deployment...
✅ Cilium was successfully uninstalled.
Problem: this command goes through the Kubernetes API server and therefore needs a working etcd, creating a circular dependency. We needed a different approach.
Step 3: Backup etcd Before Recovery (Critical Step)
Always back up etcd before attempting any disaster recovery operations:
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup-$(date +%Y%m%d-%H%M%S).db \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
Output:
Snapshot saved at /root/etcd-backup-20251013-110002.db
This backup is your safety net. If the recovery process fails, you can restore from this snapshot.
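It's also worth confirming the snapshot file is actually readable before proceeding. etcdctl can print its hash, revision, key count, and size (on etcd 3.5+ the same check is also available as etcdutl snapshot status):
$ ETCDCTL_API=3 etcdctl snapshot status /root/etcd-backup-20251013-110002.db -w table
# A populated table (hash, revision, total keys, total size) means the
# backup is usable; an error here means take another snapshot first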
Step 4: etcd Disaster Recovery - Force New Cluster
This is the nuclear option for etcd disaster recovery. The --force-new-cluster flag forces a node to become a single-member cluster with itself as the leader.
What --force-new-cluster does:
- Removes all other members from the cluster configuration
- Keeps all existing data intact (does NOT delete data)
- Forces this node to become the Raft leader
- Allows the cluster to operate as single-node temporarily
# Stop etcd systemd service
$ systemctl stop etcd
# Start etcd with force-new-cluster flag
$ etcd --force-new-cluster \
--data-dir=/var/lib/etcd \
--name=etcd2 \
--advertise-client-urls=https://103.xxx.xxx.xxx:2379 \
--listen-client-urls=https://103.xxx.xxx.xxx:2379,https://127.0.0.1:2379 \
--cert-file=/etc/ssl/etcd/ssl/member-argocd-1.pem \
--key-file=/etc/ssl/etcd/ssl/member-argocd-1-key.pem \
--trusted-ca-file=/etc/ssl/etcd/ssl/ca.pem \
--client-cert-auth=true &
Success! etcd became leader:
{"level":"info","ts":"2025-10-13T11:28:12.212+0700","msg":"760112fd749b508a became leader at term 13385"}
{"level":"info","ts":"2025-10-13T11:28:12.215+0700","msg":"published local member to cluster through raft"}
{"level":"info","ts":"2025-10-13T11:28:12.218+0700","msg":"serving client traffic securely"}
Now etcd had a leader, and the Kubernetes API server automatically came back online!
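A quick sanity check at this point is to repeat the same probes that failed earlier:
$ curl -k https://127.0.0.1:6443/healthz
ok
$ kubectl get nodes
# Nodes may briefly show NotReady while kubelets and the CNI settle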
Step 5: Delete the Problematic Cilium Network Policy
With API access restored, we could finally delete the bad network policy:
$ kubectl get ciliumclusterwidenetworkpolicies
NAME                                    AGE
restrict-node-exporter-hostport-9100    94m
$ kubectl delete ciliumclusterwidenetworkpolicies/restrict-node-exporter-hostport-9100
ciliumclusterwidenetworkpolicy.cilium.io "restrict-node-exporter-hostport-9100" deleted
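With the policy gone (and the Cilium agent already stopped), nothing should be dropping cross-node traffic anymore. Before rebuilding the etcd cluster, a cheap connectivity check between the nodes doesn't hurt (placeholder IPs; assumes netcat is installed on the node):
# From node 1, confirm the other nodes' etcd peer port is reachable again
$ nc -zv <node-2-ip> 2380
$ nc -zv <node-3-ip> 2380
# A "succeeded"/"open" result means the members can talk once restarted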
Step 6: Restore etcd Cluster - Add Members Back
Running with single-node etcd is risky. We needed to restore the 3-node cluster for high availability.
Add first member back to etcd cluster:
ETCDCTL_API=3 etcdctl member add etcd1 --peer-urls=https://103.xxx.xxx.xxx:2380 \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
Output:
Member d25f65265e3477c7 added to cluster ed92624dcc0aa007
ETCD_NAME="etcd1"
ETCD_INITIAL_CLUSTER="etcd2=https://103.xxx.xxx.xxx:2380,etcd1=https://103.xxx.xxx.xxx:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
On the second node, rejoin the cluster:
$ ssh root@<node-2-ip>
# Stop etcd and backup old data
$ systemctl stop etcd
$ mv /var/lib/etcd /var/lib/etcd.old
# Update etcd configuration to join existing cluster
$ sed -i 's/ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE=existing/' /etc/etcd.env
$ sed -i 's/ETCD_INITIAL_CLUSTER=.*/ETCD_INITIAL_CLUSTER=etcd2=https:\/\/103.xxx.xxx.xxx:2380,etcd1=https:\/\/103.xxx.xxx.xxx:2380/' /etc/etcd.env
# Start etcd to rejoin cluster
$ systemctl start etcd
Repeat this process for the third node.
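For reference, after running the corresponding etcdctl member add on the healthy node, the relevant lines in /etc/etcd.env on a rejoining node end up looking roughly like this (illustrative values; Kubespray generates the full file, and the member names and IPs must match your cluster):
# /etc/etcd.env (excerpt) on the node rejoining as etcd3
ETCD_NAME=etcd3
ETCD_INITIAL_CLUSTER=etcd2=https://<node-1-ip>:2380,etcd1=https://<node-2-ip>:2380,etcd3=https://<node-3-ip>:2380
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://<node-3-ip>:2380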
Step 7: Verify Complete Kubernetes Cluster Recovery
Check etcd cluster members:
$ ETCDCTL_API=3 etcdctl member list -w table
+------------------+---------+-------+------------------------------+------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-------+------------------------------+------------------------------+------------+
| 760112fd749b508a | started | etcd2 | https://103.xxx.xxx.xxx:2380 | https://103.xxx.xxx.xxx:2379 | false |
| d25f65265e3477c7 | started | etcd1 | https://103.xxx.xxx.xxx:2380 | https://103.xxx.xxx.xxx:2379 | false |
| da70198d6c2536f2 | started | etcd3 | https://103.xxx.xxx.xxx:2380 | https://103.xxx.xxx.xxx:2379 | false |
+------------------+---------+-------+------------------------------+------------------------------+------------+
Verify etcd health and leader election:
$ ETCDCTL_API=3 etcdctl endpoint status -w table
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 103.xxx.xxx.xxx:2379 | 760112fd749b508a | 3.5.6 | 497 MB | true | false | 13388 | 2079555217 | 2079555217 | |
| 103.xxx.xxx.xxx:2379 | d25f65265e3477c7 | 3.5.6 | 497 MB | false | false | 13388 | 2079555218 | 2079555218 | |
| 103.xxx.xxx.xxx:2379 | da70198d6c2536f2 | 3.5.6 | 497 MB | false | false | 13388 | 2079555220 | 2079555220 | |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Perfect! All three etcd members are healthy, synchronized (same RAFT term), with proper leader election. Kubernetes cluster fully recovered with zero data loss!
The Correct Cilium Network Policy Configuration
Here's how to properly configure Cilium clusterwide network policy without locking yourself out:
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: restrict-node-exporter-hostport
spec:
  nodeSelector:
    matchLabels: {}
  ingress:
    # Allow node-exporter access from monitoring server only
    - fromCIDR:
        - 103.xxx.xxx.xxx/32
      toPorts:
        - ports:
            - port: "9100"
              protocol: TCP
    # CRITICAL: Allow all cluster-internal communication
    - fromEntities:
        - cluster
        - host
    # CRITICAL: Explicitly allow Kubernetes control plane ports
    - toPorts:
        - ports:
            - port: "6443" # Kubernetes API server
              protocol: TCP
            - port: "2379" # etcd client port
              protocol: TCP
            - port: "2380" # etcd peer port
              protocol: TCP
            - port: "10250" # kubelet API
              protocol: TCP
            - port: "53" # CoreDNS
              protocol: TCP
            - port: "53" # CoreDNS
              protocol: UDP
Key differences:
- Added fromEntities: [cluster, host] to allow all internal cluster communication
- Explicitly whitelisted all critical Kubernetes ports
- Maintained the port 9100 restriction to the specific monitoring IP
Kubernetes Network Policy Best Practices
1. Understand Cilium Policy Modes
Cilium clusterwide network policies with a nodeSelector operate in whitelist mode:
- Only explicitly allowed traffic is permitted
- All other traffic is denied by default
- Very different from traditional firewall deny rules
2. Always Whitelist Critical Kubernetes Ports
When using any network policies, ALWAYS allow:
- 6443 - Kubernetes API server (kubectl access)
- 2379/2380 - etcd cluster communication (critical for consensus)
- 10250 - kubelet API (node management)
- 53 - CoreDNS (service discovery)
3. Test Network Policies in Staging Environment
Never test Cilium clusterwide network policies directly in production (a quick pre-flight check is sketched after this list). Always:
- Create identical staging cluster
- Apply policies in staging first
- Verify all critical services remain accessible
- Monitor for 24 hours before production rollout
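As a pre-flight check, the manifest can be validated against the live API without persisting it, and drops can be watched while the policy soaks in staging. This sketch assumes Hubble is enabled and that the filename matches your manifest:
# Server-side validation only; nothing is persisted
$ kubectl apply --dry-run=server -f restrict-node-exporter-hostport.yaml
# After applying in staging, watch for unexpected policy drops
$ hubble observe --verdict DROPPED --last 100
# The critical ports staying reachable is the real signal
$ kubectl get nodes && kubectl get --raw /readyz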
4. Prefer Namespaced Network Policies
Instead of CiliumClusterwideNetworkPolicy
, use namespaced CiliumNetworkPolicy
when possible:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy # Namespaced - safer
metadata:
  name: restrict-node-exporter
  namespace: monitoring
spec:
  endpointSelector:
    matchLabels:
      app: node-exporter
  # Only affects pods in monitoring namespace
Namespaced policies reduce blast radius and prevent cluster-wide outages.
5. Implement Kubernetes Monitoring and Alerts
Set up monitoring for:
- etcd cluster health (alert on "no leader")
- etcd member count changes
- API server availability
- Network policy changes (audit log)
These alerts would have caught the issue within seconds instead of after the lockout occurred.
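As an illustration, if etcd's metrics endpoint is already scraped by Prometheus, an alert on lost leadership could look something like this (the rule and label names are examples, not taken from this cluster; etcd_server_has_leader is a metric etcd exposes itself):
groups:
  - name: etcd-health
    rules:
      - alert: EtcdNoLeader
        # etcd_server_has_leader is exported by etcd's own /metrics endpoint
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd member {{ $labels.instance }} has no leader"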
6. Document etcd Disaster Recovery Procedures
Every Kubernetes administrator should know:
- How to take etcd snapshots
- How to restore from etcd backup
- When and how to use etcd --force-new-cluster
- How to rejoin etcd members after recovery
Emergency Kubernetes Recovery Commands
Save these commands for disaster recovery situations:
etcd Snapshot Backup
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup-$(date +%Y%m%d-%H%M%S).db \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
etcd Health Check
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-argocd-1.pem \
--key=/etc/ssl/etcd/ssl/node-argocd-1-key.pem
etcd Force New Cluster (Last Resort)
systemctl stop etcd
etcd --force-new-cluster \
--data-dir=/var/lib/etcd \
--name=<node-name> \
--advertise-client-urls=https://<ip>:2379 \
--listen-client-urls=https://<ip>:2379,https://127.0.0.1:2379 \
[additional TLS flags] &
Add etcd Member
ETCDCTL_API=3 etcdctl member add <name> --peer-urls=https://<ip>:2380
Restore etcd from Snapshot
ETCDCTL_API=3 etcdctl snapshot restore <backup-file> \
--data-dir=/var/lib/etcd-restore
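Note that the restore writes a brand-new data directory rather than touching the live one. A rough single-node sketch of putting it into service in this systemd-based setup (a full multi-node rebuild also needs --name, --initial-cluster, and --initial-advertise-peer-urls on the restore command):
$ systemctl stop etcd
$ mv /var/lib/etcd /var/lib/etcd.broken
$ mv /var/lib/etcd-restore /var/lib/etcd
$ systemctl start etcd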
Frequently Asked Questions (FAQ)
Q: Will etcd --force-new-cluster delete my data?
No. The --force-new-cluster flag only modifies cluster membership configuration. All your Kubernetes resources, configurations, and data remain intact in the etcd data directory.
Q: Can I recover without --force-new-cluster?
If you have lost etcd quorum and cannot restore communication between the nodes, --force-new-cluster is essentially the only option. It's specifically designed for disaster recovery when the cluster cannot elect a leader.
Q: Should I use Cilium clusterwide policies at all?
Yes, but carefully. Clusterwide policies are powerful for enforcing security across your entire cluster, but they require thorough testing and understanding. For most use cases, namespaced policies are safer.
Q: What if I don't have an etcd backup?
The --force-new-cluster method works without a backup because it uses the existing data directory. However, always maintain regular etcd backups for other disaster scenarios like disk failure or data corruption.
Conclusion: Key Takeaways for Kubernetes Administrators
One misconfigured Cilium network policy brought down our entire production Kubernetes cluster. But with proper understanding of etcd operations and disaster recovery procedures, we recovered completely without any data loss.
Critical lessons:
- Cilium clusterwide policies with nodeSelector create whitelist mode - understand this before applying
- Always explicitly allow critical Kubernetes ports (6443, 2379, 2380, 10250, 53)
- Test network policies in staging environments, never directly in production
- Know how to use etcd --force-new-cluster before you need it
- Maintain regular etcd backups and document recovery procedures
- Implement monitoring and alerting for etcd cluster health
The etcd --force-new-cluster command saved us from rebuilding the entire cluster from scratch. Every Kubernetes administrator should understand this critical disaster recovery tool.