Topics Discussed:
- VXLAN
- VXLAN Packet
- Flannel
- DEMO
- Packet Tracing in Kubernetes
The session starts by talking about VXLAN. Routing and bridge networking were
already covered in the first session [https://www.youtube.com/watch?v=EqCFIc-NxRg].
VXLAN adds a few extra components on top of those setups.
In this session too, a container and a veth pair are created. Here there are separate networks for the node and the container: Network A is the node network and Network B is the virtual network inside the nodes for the containers.
VXLAN is a packet encapsulation technique where each packet is encapsulated inside another packet and sent over the network. The encapsulated packet reaches the destination node by traversing the physical network, gets decapsulated there, and is then delivered to the container. We have to consider the Maximum Transmission Unit (MTU) when encapsulating packets.
VXLAN Packet
The figure shows a VXLAN packet. The actual packet is the "original inner Ethernet frame" (1500 bytes); the remaining headers are added during encapsulation. Another important aspect is the MTU, which is 1500 by default. If the interface inside the container also uses an MTU of 1500, the encapsulated VXLAN packet becomes 1550 bytes, which no longer fits into a single 1500-byte frame and therefore gets fragmented (sent as a 1500-byte frame plus a small fragment). To avoid this fragmentation, the MTU is set to 1450 on the interface inside the container and on the other internal interfaces, while the physical interface stays at 1500.
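The 50 bytes of overhead come from the outer headers that VXLAN adds around the inner frame:

outer Ethernet (14) + outer IP (20) + outer UDP (8) + VXLAN header (8) = 50 bytes

So an inner frame of at most 1450 bytes plus 50 bytes of overhead fits exactly into the physical MTU of 1500.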
The figure shows the MTU value setting.
Flannel
It is a plugin that follows CNI and can run as an individual process. It is mostly seen in Kubernetes as it is CNI compliant. CNI is a specification and a set of libraries for writing plugins that configure network interfaces in Linux containers. Flannel can be used for communication between nodes in Kubernetes.
VXLAN demo :
1. Create two containers.
2. Add a Veth Pair
3. Create bridge
4. Add Veth pair ends to bridge
5. Setup etcd
6. Start flannel.
7. Configure IPs and packet forwarding.
In the demo we will create two containers on two nodes, and a bridge will be created that is not connected to any external device. One end of the veth pair from each container will be connected to the bridge. The rest are the VXLAN components; after that we need a key-value store, etcd. Why is etcd needed? Because the flannel daemon runs on each node, and data like IP addresses, MAC addresses, etc. need to be coordinated in a central place for all the flannel daemons running on all the nodes. So we use etcd. Hence, in the demo, flannel is started after pointing it to etcd.
DEMO :
Connect to the two nodes via SSH. Install and set up etcd on node01. To set up etcd, first download the etcd binary and start it with nohup so that it keeps running in the background, specifying the IP addresses it should listen on. To verify, check the file /var/log/etcd.log, use the command ps -ef | grep etcd, or use the etcdctl commands (a rough sketch follows below).
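A minimal sketch of that setup, assuming node01's address is 192.168.56.11 and the etcd binary is in the current directory (the addresses and paths are placeholders, not necessarily the ones used in the session):
#nohup ./etcd --listen-client-urls http://192.168.56.11:2379,http://127.0.0.1:2379 --advertise-client-urls http://192.168.56.11:2379 > /var/log/etcd.log 2>&1 &
#ps -ef | grep etcd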
Note: All the commands used are available on GitHub:
https://github.com/ansilh/Talks/blob/main/devops-malayalam-08jun2023/CNF_Part_2.pdf
Now set up flannel on both nodes. First download the binary (it is a custom binary). Before starting flannel, create the network configuration in etcd with an etcdctl command. Then start flannel on each node, specifying the IP address it should listen on and which etcd it should point to. Flannel looks for its configuration in etcd at the key /coreos.com/network/config (a rough sketch follows below).
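A rough sketch of those two steps, assuming the overlay network 10.42.0.0/16 and etcd at 192.168.56.11:2379 (both placeholders; depending on the flannel and etcd versions, the key may have to be written with the etcd v2 API, i.e. etcdctl set, or the v3 API, i.e. etcdctl put):
#etcdctl set /coreos.com/network/config '{"Network":"10.42.0.0/16","Backend":{"Type":"vxlan"}}'
#nohup ./flanneld --etcd-endpoints=http://192.168.56.11:2379 --iface=<node-ip> > /var/log/flanneld.log 2>&1 &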
Now create the containers on both nodes in the same manner as in PART-1, using the busybox binary. Also create a veth pair on each node and attach one end to the container. On checking, we can see that the MTU of the interface inside the container is set to 1500, so to avoid fragmentation, set the MTU inside the container to 1450 on both nodes (a rough sketch of the container and veth creation is shown after the commands below).
#ip link set dev vethNS mtu 1450
#ip link set dev vethlocal mtu 1450
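For reference, a minimal sketch of the container/veth creation from PART-1, modelled here with ip netns and placeholder names con1, vethNS and vethlocal:
#ip netns add con1
#ip link add vethlocal type veth peer name vethNS
#ip link set vethNS netns con1
#ip netns exec con1 ip link set dev vethNS up
#ip link set dev vethlocal up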
Now IP addresses need to be assigned to the containers. When starting flannel we already specified the overall network in the configuration, so the entire flannel network falls under that subnet, but flannel automatically assigns a separate subnet to each node. That per-node subnet can be found in the file /var/run/flannel/subnet.env on each node. We need to derive the address to assign to the container from this subnet, which can be done with the command below.
#awk -F "=" '$1 ~ /^FLANNEL_SUBNET/{print $2}' /var/run/flannel/subnet.env | awk -F "." '{print $1"."$2"."$3"."10}'
Now assign the IP address derived from the command above to the respective container on each node. At this point we cannot ping the node1 container from the node2 container, because the container does not know how to send the packet; it will report that the network is unreachable. A gateway is needed, so a bridge is created using the same commands as in PART-1 (a sketch follows below). When assigning the IP address to the bridge, it should come from the flannel subnet on that node.
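A rough sketch under the same assumptions, with a bridge named br0 and node1's flannel subnet taken as 10.42.1.0/24 (placeholders):
#ip netns exec con1 ip addr add 10.42.1.10/24 dev vethNS
#ip link add br0 type bridge
#ip link set dev br0 up
#ip link set dev vethlocal master br0
#ip addr add 10.42.1.1/24 dev br0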
Now check by pinging the gateway. We still cannot ping the other container. Next, add a route / default gateway inside the container and enable packet forwarding on the node (sketched below); after that the ping succeeds.
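Continuing the same assumptions, the route and forwarding setup would look roughly like this:
#ip netns exec con1 ip route add default via 10.42.1.1
#sysctl -w net.ipv4.ip_forward=1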
There are some pieces working in the background that enable the packet to reach the destination container. On checking the node's interface list, we can see a flannel.1 interface that is automatically created by flannel. This interface is created for doing the encapsulation. flannel.1 is not a normal interface but a VXLAN interface with a VXLAN id, port and so on, which can be viewed in detail with the "ip -d link" command. If we look for the process behind that port we will not find a process id, yet the port is listening; this means the port is handled directly by the kernel. For routing, the ARP entries and the bridge forwarding database (FDB) entries are used.
A packet from the container first reaches vethlocal and then the bridge, from where it needs to be encapsulated. For that it needs the destination MAC address, which comes from the ARP table; the entries in the ARP table are maintained by the flannel daemon. The outgoing interface and the destination node's IP address are provided by the forwarding database. The packet is then encapsulated and sent out of the physical interface. All of this becomes clearer from the output of the commands below.
#ip neigh show dev flannel.1
#bridge fdb show
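For context, the entries flannel maintains are equivalent to something like the following, assuming the remote node's flannel.1 address is 10.42.2.0 with VTEP MAC aa:bb:cc:dd:ee:ff and the remote node's physical IP is 192.168.56.12 (all placeholders):
#ip neigh add 10.42.2.0 lladdr aa:bb:cc:dd:ee:ff dev flannel.1 nud permanent
#bridge fdb add aa:bb:cc:dd:ee:ff dev flannel.1 dst 192.168.56.12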
Packet Tracing in Kubernetes :
To see how packets are transferred between Kubernetes pods, first connect to a K8s cluster. Flannel is used as the CNI in that cluster and runs as a DaemonSet. If there is any network-related issue in your cluster, investigate it by describing the node. The MAC address we saw previously is added by flannel as an annotation on the node.
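For example, the flannel annotations on a node can be listed roughly like this (assuming the usual flannel.alpha.coreos.com/ annotation prefix used by flannel):
#kubectl describe node <node-name> | grep flannel.alpha.coreos.com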
Consider two pods created in the cluster. In the demo we trace a packet generated from one pod and destined for the other pod with the help of the ping command. For that, exec into one of the pods and ping the other pod; it returns results. To trace that packet, we convert the IP address of the pod into a hexadecimal number and then filter on that hexadecimal IP in the captured packets.
#VXLAN_SRC_DST=$(echo "10.42.1.7" | awk -F "." '{printf "0x%02X%02X%02X%02X\n", $1,$2,$3,$4}')
#tcpdump -n -vvvv -i enp0s3 "port 8472 and ( ip[62:4]==${VXLAN_SRC_DST} or ip[66:4]==${VXLAN_SRC_DST} )"
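The offsets can be derived from the outer packet layout (assuming a 20-byte outer IP header with no options): outer IP (20) + UDP (8) + VXLAN header (8) + inner Ethernet (14) = 50 bytes to the start of the inner IP header, so the inner source IP sits at offset 50 + 12 = 62 and the inner destination IP at 50 + 16 = 66. Port 8472 is the UDP port flannel uses for VXLAN.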
tcpdump is a user-space tool, while packets originally exist in the kernel, so to filter them with tcpdump they have to be copied to user space, which adds delay and overhead on the server. To avoid this, the Berkeley Packet Filter (BPF) is used, which lets the filtering happen inside the kernel.
For more info refer:
https://blog.cloudflare.com/bpf-the-forgotten-bytecode/