Understanding Socket Load Balancing in Cilium: A Deep Dive into How eBPF Is Used

In Kubernetes, Cilium uses a clever method for load balancing pod-to-service traffic, known as the socket-level load balancer. It is part of Cilium's kube-proxy replacement (KPR).
Load balancing is a critical component in modern Kubernetes clusters, ensuring efficient distribution of network traffic across multiple backend services. Let's explore how this complex system works under the hood.
How It Works
The socket-level load balancer uses BPF programs of type BPF_PROG_TYPE_CGROUP_SOCK and BPF_PROG_TYPE_CGROUP_SOCK_ADDR to intercept socket events such as the connect and sendmsg syscalls. When a socket tries to connect to a service IP, the program rewrites the destination address so that traffic goes directly to a backend pod. To follow the mechanism, it's essential to understand two key components:
Socket Cookies - Unique identifiers assigned to sockets
- Generated via bpf_get_socket_cookie
- Used for tracking connections across the load balancer
Load Balancing Maps (LRU hash map) - Store forwarding state
- Track NAT translations
- Enable efficient lookup and routing
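The connect-time rewrite described above can be modeled in a few lines of userspace Python. This is only a sketch of the idea, not Cilium's eBPF code: the SERVICES table, the SockAddr class, the on_connect function, and the random backend choice are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical service table: (ClusterIP, port) -> list of backend (IP, port).
# The addresses are the ones used in this article's example.
SERVICES = {
    ("100.0.0.1", 80): [("20.0.0.1", 80), ("20.0.0.2", 80)],
}


@dataclass
class SockAddr:
    """Stand-in for the destination address a connect hook may rewrite."""
    ip: str
    port: int


def on_connect(dst: SockAddr, rng: random.Random) -> bool:
    """Model of a connect hook: rewrite dst in place if it names a service.

    Returns True when a translation happened, False when dst was left alone.
    """
    backends = SERVICES.get((dst.ip, dst.port))
    if not backends:
        return False  # not a service address: pass through unchanged
    # Picking a backend at random is an assumption of this sketch.
    dst.ip, dst.port = rng.choice(backends)
    return True


addr = SockAddr("100.0.0.1", 80)
translated = on_connect(addr, random.Random(0))
print(translated, addr)
```

Because the address is rewritten before the kernel builds any packet, nothing on the wire ever carries the service IP in this model.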
Example Scenario
Consider a service A with the ClusterIP 100.0.0.1 and backend pods 20.0.0.1 and 20.0.0.2. When a pod executes curl some.service.svc, it connects to 100.0.0.1, but the load balancer redirects it to a backend pod, resulting in:
curl some.service.svc -v
* Trying 20.0.0.2:80...
* Connected to some.service.svc (20.0.0.2) port 80 (#0)
Even though DNS resolves some.service.svc to 100.0.0.1, the connection is made to 20.0.0.2, demonstrating client-side load balancing.
Handling TCP and UDP
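For TCP connections, this redirection happens once, during the connect syscall; because TCP is connection-oriented, every later packet on that socket goes to the selected backend without further translation. UDP, however, can use both connected and unconnected sockets. An unconnected socket names its destination on every sendmsg call, so each send must be translated, and reverse NAT (revNAT) is required to match reply traffic with the original requests.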
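The difference between connected and unconnected UDP sockets is visible from ordinary userspace code. The sketch below uses only Python's standard socket module on the loopback interface. It involves no Cilium or eBPF at all; it is only meant to show where the destination address enters the API in each case, which is where a connect or sendmsg hook would get its chance to translate.

```python
import socket

# A throwaway UDP endpoint on loopback; port 0 lets the kernel pick a port.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(2.0)
server_addr = server.getsockname()

# Unconnected socket: the destination is supplied on every send, so a
# sendmsg-style hook would have to translate each datagram separately.
unconnected = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
unconnected.settimeout(2.0)
unconnected.sendto(b"ping-1", server_addr)
data, peer = server.recvfrom(1024)
server.sendto(data, peer)  # echo the datagram back
# The reply's source address is reported to the application here.
reply, reply_src = unconnected.recvfrom(1024)

# Connected socket: connect() fixes the peer once; later sends carry no address.
connected = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
connected.settimeout(2.0)
connected.connect(server_addr)
connected.send(b"ping-2")
data2, peer2 = server.recvfrom(1024)
connected_peer = connected.getpeername()

print(reply, reply_src == server_addr, connected_peer == server_addr)

for s in (server, unconnected, connected):
    s.close()
```

In the unconnected case the application sees the source address of each reply, which is why a translated destination has to be translated back on receive.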
Reverse NAT and LRU Hashmaps
The revNAT mechanism ensures that UDP reply traffic appears to come from the service IP rather than from the backend pod. The reverse NAT information is stored in an LRU (Least Recently Used) hash map, which evicts the least recently used entries when the map is full.
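To make the eviction behavior concrete, here is a minimal strict-LRU map built on Python's OrderedDict. It is a teaching model only: the LRUMap class and its method names are hypothetical, and the kernel's LRU hash map may choose eviction victims differently from this strict ordering.

```python
from collections import OrderedDict


class LRUMap:
    """Tiny LRU map: evicts the least recently used key once over capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def update(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the oldest entry

    def lookup(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # a hit refreshes recency
        return self._data[key]

    def keys(self):
        return list(self._data)


rev_nat = LRUMap(capacity=2)
rev_nat.update((12345, "20.0.0.2", 80), ("100.0.0.1", 80))
rev_nat.update((67890, "20.0.0.2", 80), ("100.0.0.1", 80))
rev_nat.lookup((12345, "20.0.0.2", 80))  # 12345 becomes most recent
rev_nat.update((11111, "20.0.0.1", 80), ("100.0.0.1", 80))  # evicts 67890
print(rev_nat.keys())
```

The entry for cookie 67890 is the one dropped, because it was the only entry not touched since it was inserted.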
Load Balancing Flow
The complete request/response cycle, from the client's request to the delivered reply, is walked through step by step below.
Detailed Implementation Steps
Step 1: Pod Initiates Request
When a client pod wants to communicate with a service, it sends a request to the service's ClusterIP (e.g., 100.0.0.1:80).
Packet Values (Outgoing from Pod):
Source IP: 10.0.0.5 (client pod)
Source Port: 50000 (assigned by the kernel)
Destination IP: 100.0.0.1 (service ClusterIP)
Destination Port: 80 (service port)
Socket: A UDP socket is created with a unique socket cookie, say 12345, obtained via bpf_get_socket_cookie.
Step 2:
The sendmsg syscall (for unconnected UDP) triggers an eBPF program of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR. The program intercepts the request and performs several actions:
Checks if the destination IP is a service IP.
Selects a backend pod (e.g., 20.0.0.2:80) based on load balancing.
Rewrites the destination IP/port to the backend pod's IP/port.
Adds a reverse NAT entry to the cilium_lb4_reverse_sk BPF map to handle reply traffic.
Key: {cookie: 12345, address: 20.0.0.2, port: 80, pad: 0}
Value: {address: 100.0.0.1, port: 80, rev_nat_index: 1}
Packet Values (After Rewrite, Sent to Backend):
Source IP: 10.0.0.5 (unchanged)
Source Port: 50000 (unchanged)
Destination IP: 20.0.0.2 (backend pod)
Destination Port: 80 (backend port)
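The send-path actions in Step 2 can be sketched as a small Python model. The dictionaries below stand in for the service table and the cilium_lb4_reverse_sk map, with the key and value fields taken from the entry shown above; the function name on_sendmsg and the pick selector are hypothetical, and the real selection policy is not covered here.

```python
# Hypothetical in-memory stand-ins for the service table and the
# cilium_lb4_reverse_sk map.
SERVICES = {("100.0.0.1", 80): [("20.0.0.1", 80), ("20.0.0.2", 80)]}
reverse_sk = {}


def on_sendmsg(cookie, dst, pick=lambda backends: backends[-1]):
    """Model of the sendmsg hook: translate dst and record how to undo it.

    `pick` is a placeholder backend selector (here: always the last backend).
    Returns the destination the datagram is actually sent to.
    """
    backends = SERVICES.get(dst)
    if backends is None:
        return dst  # not a service address: leave untouched
    backend = pick(backends)
    # Key: (cookie, backend address, backend port); value: original service.
    reverse_sk[(cookie, backend[0], backend[1])] = {
        "address": dst[0],
        "port": dst[1],
        "rev_nat_index": 1,
    }
    return backend


sent_to = on_sendmsg(12345, ("100.0.0.1", 80))
print(sent_to, reverse_sk)
```

Keying the entry on the socket cookie as well as the backend address keeps translations made for different sockets from being confused with one another.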
Step 3: Backend Responds
The backend pod (20.0.0.2:80) sends a UDP response back to the client pod.
Packet Values (Reply from Backend):
Source IP: 20.0.0.2 (backend pod)
Source Port: 80 (backend port)
Destination IP: 10.0.0.5 (client pod)
Destination Port: 50000 (client's port)
Step 4:
The recvmsg syscall triggers an eBPF program that processes the incoming reply packet.
Actions performed by eBPF:
- The program looks up cilium_lb4_reverse_sk using the key {cookie: 12345, address: 20.0.0.2, port: 80}. It finds the value {address: 100.0.0.1, port: 80, rev_nat_index: 1} and rewrites the source IP/port to 100.0.0.1:80, ensuring the client sees the response as coming from the service.
Packet Values (Delivered to Pod):
Source IP: 100.0.0.1 (service ClusterIP, rewritten)
Source Port: 80 (service port, rewritten)
Destination IP: 10.0.0.5 (client pod, unchanged)
Destination Port: 50000 (client's port, unchanged)
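The receive-path lookup in Step 4 mirrors the send path. The sketch below assumes the reverse NAT entry already exists and models only the lookup and source rewrite; the dictionary and the on_recvmsg function are illustrative stand-ins, not Cilium's data structures.

```python
# Assumed state left behind by the send path (values from this article).
reverse_sk = {
    (12345, "20.0.0.2", 80): {
        "address": "100.0.0.1",
        "port": 80,
        "rev_nat_index": 1,
    },
}


def on_recvmsg(cookie, src):
    """Model of the recvmsg hook: map a backend source back to the service.

    Returns the source address the application should observe.
    """
    entry = reverse_sk.get((cookie, src[0], src[1]))
    if entry is None:
        return src  # no translation recorded for this socket and peer
    return (entry["address"], entry["port"])


seen = on_recvmsg(12345, ("20.0.0.2", 80))
print(seen)
```

A socket with a different cookie gets no match for the same backend address, so its reply source is left as it arrived.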
The following scenario can arise, which is why a cleanup Step 5 is needed:
Hash Collision Issue
If multiple sockets connect to the same service, they might hash to the same bucket in the reverse NAT map, causing performance degradation. Suppose another socket (cookie 67890) connects to the same service and hashes to the same bucket in cilium_lb4_reverse_sk (e.g., bucket 5, assuming hash(12345, IP, port) % 10 = 5 and hash(67890, IP, port) % 10 = 5).
The map state would then look like:
- Bucket 5 (hash maps are represented as buckets, each holding a linked list of {key: value} entries): [{12345, 20.0.0.2, 80}: {100.0.0.1, 80, 1}] --> [{67890, 20.0.0.2, 80}: {100.0.0.1, 80, 1}]
Problem:
Lookups for 12345 or 67890 require traversing the linked list in bucket 5, slowing down recvmsg operations.
With many stale entries, collisions become frequent, degrading performance.
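The cost of a shared bucket can be seen in a toy hash table with separate chaining. The ChainedMap class and the constant toy_hash function are contrived for this illustration so that both keys land in bucket 5; they are not how the kernel hashes map keys.

```python
class ChainedMap:
    """Hash table with separate chaining; counts entries visited per lookup."""

    def __init__(self, n_buckets, hash_fn):
        self.buckets = [[] for _ in range(n_buckets)]
        self.hash_fn = hash_fn

    def insert(self, key, value):
        index = self.hash_fn(key) % len(self.buckets)
        self.buckets[index].append((key, value))

    def lookup(self, key):
        """Return (value, entries_visited); value is None on a miss."""
        chain = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for visited, (k, v) in enumerate(chain, start=1):
            if k == key:
                return v, visited
        return None, len(chain)


def toy_hash(key):
    """Contrived hash that sends every key to bucket 5."""
    return 5


m = ChainedMap(10, toy_hash)
m.insert((12345, "20.0.0.2", 80), ("100.0.0.1", 80, 1))
m.insert((67890, "20.0.0.2", 80), ("100.0.0.1", 80, 1))
value, visited = m.lookup((67890, "20.0.0.2", 80))
print(value, visited)
```

Every stale entry left in a chain adds one more comparison to lookups for the entries behind it, which is the slowdown described above.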
Step 5:
The socket (12345) closes, triggering the eBPF program attached at cgroup/sock_release. The program retrieves the socket cookie (12345) and deletes the matching entry from the reverse NAT map, so stale entries do not accumulate.
Updated Map State:
- Bucket 5: [{67890, 20.0.0.2, 80}: {100.0.0.1, 80, 1}]
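The cleanup in Step 5 can be modeled as a release hook that removes whatever the closing socket left in the map. How the real program locates the keys to delete is not described in this article; scanning every key by cookie, as done below, is a simplification assumed for the sketch, and on_sock_release is a hypothetical name.

```python
# Assumed map contents before the socket with cookie 12345 closes.
reverse_sk = {
    (12345, "20.0.0.2", 80): ("100.0.0.1", 80, 1),
    (67890, "20.0.0.2", 80): ("100.0.0.1", 80, 1),
}


def on_sock_release(cookie):
    """Model of the release hook: drop every entry owned by this cookie.

    Scanning all keys is a simplification made for this sketch.
    Returns the number of entries removed.
    """
    stale = [key for key in reverse_sk if key[0] == cookie]
    for key in stale:
        del reverse_sk[key]
    return len(stale)


removed = on_sock_release(12345)
print(removed, list(reverse_sk))
```

After the call, only the entry for cookie 67890 remains, matching the updated map state.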
Conclusion
Cilium uses eBPF to implement a socket-level load balancer as part of its kube-proxy replacement. The approach redirects pod traffic to service endpoints at the socket layer, using socket cookies to identify sockets and LRU hash maps to store NAT translations. It intercepts socket events, translates TCP connections once at connect time, and relies on reverse NAT to make UDP replies appear to come from the service. Hash collisions among stale reverse NAT entries can slow lookups, which is why entries are cleaned up when a socket closes.



