Automatic Linux source-based routing

The Problem

On a Linux server with multiple network interfaces configured on different subnets, you will often find that any interface that does not carry the default route fails to handle incoming connections from remote networks. Replies would leave through a different interface than the one the request arrived on, which makes the return path asymmetrical.

Example:

$ ip addr
...
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:fc:45:9d brd ff:ff:ff:ff:ff:ff
    inet 162.253.43.134/24 brd 162.253.43.255 scope global ens3
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fefc:459d/64 scope link
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:2b:e7:89 brd ff:ff:ff:ff:ff:ff
    inet 10.13.96.161/19 brd 10.13.127.255 scope global ens4
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe2b:e789/64 scope link
       valid_lft forever preferred_lft forever

$ ip route
default via 162.253.43.1 dev ens3
10.0.0.0/8 via 10.13.96.1 dev ens4
10.13.96.0/19 dev ens4  proto kernel  scope link  src 10.13.96.161
162.253.42.0/24 dev ens3  scope link
162.253.43.0/24 dev ens3  proto kernel  scope link  src 162.253.43.134
169.254.169.254 via 162.253.43.1 dev ens3

From my remote workstation I can ping 162.253.43.134 just fine.

$ ping -c1 162.253.43.134
PING 162.253.43.134 (162.253.43.134): 56 data bytes
64 bytes from 162.253.43.134: icmp_seq=0 ttl=60 time=5.825 ms

However, I can’t access 10.13.96.161.

$ ping 10.13.96.161
PING 10.13.96.161 (10.13.96.161): 56 data bytes
Request timeout for icmp_seq 0

Why not? Let’s check the incoming packet.

$ sudo tcpdump -n -i ens4 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
00:30:40.011105 IP 64.31.0.51 > 10.13.96.161: ICMP echo request, id 56759, seq 0, length 64

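The packet arrives on the private interface, but no reply is sent. The system has net.ipv4.conf.all.rp_filter enabled by default, so the kernel checks that the route back to the packet's source address uses the interface the packet arrived on. Here the route back to 64.31.0.51 is the default route via ens3 (public), not the receiving interface ens4, so the packet is simply dropped.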
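
To confirm that reverse-path filtering is in play, the current setting can be read from /proc. This is a minimal sketch, assuming the interface name ens4 from the example; the helper names rp_filter_mode, effective_rp_filter and read_rp_filter are hypothetical. Per the kernel's ip-sysctl documentation, 0 disables source validation, 1 is strict mode, 2 is loose mode, and the larger of the "all" and per-interface values is applied.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting reverse-path filtering.

# Map an rp_filter value to its documented meaning.
rp_filter_mode() {
    case "$1" in
        0) echo "off" ;;
        1) echo "strict" ;;
        2) echo "loose" ;;
        *) echo "unknown" ;;
    esac
}

# The kernel applies the larger of conf/all and conf/<iface>.
effective_rp_filter() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Read the setting for one interface, falling back to 0 if it is missing.
read_rp_filter() {
    cat "/proc/sys/net/ipv4/conf/$1/rp_filter" 2>/dev/null || echo 0
}

IFACE=${IFACE:-ens4}   # assumption: interface name from the example above
ALL=$(read_rp_filter all)
PER_IF=$(read_rp_filter "$IFACE")
echo "$IFACE rp_filter: $(rp_filter_mode "$(effective_rp_filter "$ALL" "$PER_IF")")"
```

A result of "strict" is consistent with the drop seen in the tcpdump output above.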

The Solution

To solve this problem of asymmetrical routing, we need to add a source-based routing rule to the system so it will route all return traffic sourced from the ens4 private subnet 10.13.96.0/19 back out through ens4.

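First, create a routing table for your secondary interface. The append has to run as root, and with sudo echo the redirection would still be performed by your unprivileged shell, so pipe through sudo tee instead:

$ echo '500     ens4' | sudo tee -a /etc/iproute2/rt_tables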
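
Appending the line twice would leave a duplicate entry in rt_tables. The following is a minimal sketch of an idempotent variant; the function name add_rt_table is hypothetical, and 500 is simply the table ID used above.

```shell
#!/bin/sh
# Hypothetical helper: append "<id> <name>" to an rt_tables file unless an
# entry with that table name already exists.
add_rt_table() {
    file=$1 id=$2 name=$3
    # Look for the name in the second field of any non-comment line.
    if awk -v n="$name" '$1 !~ /^#/ && $2 == n { found = 1 } END { exit !found }' "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s\t%s\n' "$id" "$name" >> "$file"
}

# Usage (as root): add_rt_table /etc/iproute2/rt_tables 500 ens4
```

Running the usage line a second time leaves the file unchanged.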

Then drop the following script in /opt/if-post-up-source-route:

#!/bin/sh -e
# This script requires that a routing table named $IFACE (e.g. bond0) exists in /etc/iproute2/rt_tables.
# $IFACE is set by ifupdown when the script runs as a post-up hook.
# The $IFACE routing table holds the subnet route and default route used for source routing.
# "ip route list table $IFACE" will list the routing table for this interface.

cidr_to_netmask() {
    # Number of args to shift, 255..255, first non-255 byte, zeroes
    set -- $(( 5 - ($1 / 8) )) 255 255 255 255 $(( (255 << (8 - ($1 % 8))) & 255 )) 0 0 0
    [ $1 -gt 1 ] && shift $1 || shift
    NETMASK=${1-0}.${2-0}.${3-0}.${4-0}
}

set_netinfo() {
    local IPCIDR=$(ip addr show $IFACE | grep "inet\b" | awk '{print $2}' | head -n1)
    IPADDR=$(echo "$IPCIDR" | cut -d/ -f1)
    cidr_to_netmask $(echo "$IPCIDR" | cut -d/ -f2)

    OLDIFS=$IFS
    IFS=.
    set -- $IPADDR
    local IPADDR1=$1
    local IPADDR2=$2
    local IPADDR3=$3
    local IPADDR4=$4

    set -- $NETMASK
    local NETMASK1=$1
    local NETMASK2=$2
    local NETMASK3=$3
    local NETMASK4=$4
    IFS=$OLDIFS
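    # Network address: bitwise AND of each address octet with the netmask octet
    NETWORK=$(printf "%d.%d.%d.%d\n" "$((IPADDR1 & NETMASK1))" "$((IPADDR2 & NETMASK2))" "$((IPADDR3 & NETMASK3))" "$((IPADDR4 & NETMASK4))")
    # Assumes the gateway is the first address in the subnet (network address + 1)
    GATEWAY=$(printf "%d.%d.%d.%d\n" "$((IPADDR1 & NETMASK1))" "$((IPADDR2 & NETMASK2))" "$((IPADDR3 & NETMASK3))" "$(((IPADDR4 & NETMASK4)+1))")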
}

set_netinfo
ip route flush table "$IFACE"
ip route add "$NETWORK/$NETMASK" dev "$IFACE" proto kernel scope link table "$IFACE"
ip route add default via "$GATEWAY" dev "$IFACE" table "$IFACE"
ip rule del lookup "$IFACE" || true
ip rule add from "$NETWORK/$NETMASK" lookup "$IFACE"

Make the script executable:

$ sudo chmod +x /opt/if-post-up-source-route

Then edit /etc/network/interfaces (or whichever file contains the ens4 configuration) and add a post-up /opt/if-post-up-source-route line to the interface stanza. Mine looks like:

auto ens4
iface ens4 inet dhcp
  post-up /opt/if-post-up-source-route

Restart the interface. As always when restarting network interfaces, make sure you have a working out-of-band management method such as IPMI in case the interface fails to come back up:

$ sudo ifdown ens4 && sudo ifup ens4

Test the result

Check the ens4 routing table:

$ ip route list table ens4
default via 10.13.96.1 dev ens4
10.13.96.0/19 dev ens4  proto kernel  scope link

Check the ip rule output for the ens4 source-based rule:

$ ip rule
0:	from all lookup local
32765:	from 10.13.96.0/19 lookup ens4
32766:	from all lookup main
32767:	from all lookup default
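
This check can also be scripted, for example in a monitoring probe. The sketch below parses ip rule output in the format shown above; the function name has_source_rule is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if "ip rule"-style text on stdin contains a
# rule that sends traffic from the given prefix to the given table.
has_source_rule() {
    awk -v p="$1" -v t="$2" '
        $2 == "from" && $3 == p && $4 == "lookup" && $5 == t { found = 1 }
        END { exit !found }'
}

# Usage: ip rule | has_source_rule 10.13.96.0/19 ens4 && echo "rule present"
```

The match is on the table name, so it assumes the name is registered in /etc/iproute2/rt_tables as described earlier.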

Test the source route:

$ ping -c1 10.13.96.161
PING 10.13.96.161 (10.13.96.161): 56 data bytes
64 bytes from 10.13.96.161: icmp_seq=0 ttl=60 time=5.511 ms

$ sudo tcpdump -n -i ens4 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
00:47:31.496909 IP 64.31.0.51 > 10.13.96.161: ICMP echo request, id 26040, seq 0, length 64
00:47:31.496969 IP 10.13.96.161 > 64.31.0.51: ICMP echo reply, id 26040, seq 0, length 64

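This configuration is persistent across reboots, because the post-up hook runs every time the interface comes up. If you have multiple interfaces that require source routing, simply repeat the steps above for each one, giving each its own table ID and interface name in /etc/iproute2/rt_tables.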