Saturday, March 26, 2011

Kernel Packet Traveling Diagram

Intro

On the LARTC mailing list, there was a long discussion about how a packet is handled by the kernel. Finally, there was a post by Leonardo Balliache that I copied onto this page. I hope this helps people to better understand how it all works. I added extra info about the IMQ device; I hope I didn't make any mistakes. All info, updates and corrections are welcome.

Kernel Packet Traveling Diagram

Network
                    -----------+-----------
                               |
                  +--------------------------+
          +-------+-------+        +---------+---------+
          |    IPCHAINS   |        |      IPTABLES     |
          |     INPUT     |        |     PREROUTING    |
          +-------+-------+        | +-------+-------+ |
                  |                | |   conntrack   | |
                  |                | +-------+-------+ |
                  |                | |    mangle     | | <- MARK WRITE  
                  |                | +-------+-------+ |
                  |                | |      IMQ      | |
                  |                | +-------+-------+ |
                  |                | |      nat      | | <- DEST REWRITE
                  |                | +-------+-------+ |     DNAT or REDIRECT or DE-MASQUERADE
                  |                +---------+---------+
                  +------------+-------------+
                               |
                       +-------+-------+
                       |      QOS      |
                       |    INGRESS    |
                       +-------+-------+
                               |
         packet is for +-------+-------+ packet is for
          this machine |     INPUT     | another address
        +--------------+    ROUTING    +--------------+
        |              |    + RPDB     |              |
        |              +---------------+              |
+-------+-------+                                     |
|   IPTABLES    |                                     |
|     INPUT     |                                     |
| +-----+-----+ |                                     |
| |   mangle  | |                                     |
| +-----+-----+ |                                     |
| |   filter  | |                                     |
| +-----+-----+ |                                     |
+-------+-------+                                     |
        |                               +---------------------------+
+-------+-------+                       |                           |
|     Local     |               +-------+-------+           +-------+-------+
|    Process    |               |    IPCHAINS   |           |    IPTABLES   |
+-------+-------+               |    FORWARD    |           |    FORWARD    |
        |                       +-------+-------+           | +-----+-----+ |
+-------+-------+                       |                   | |  mangle   | | <- MARK WRITE
|    OUTPUT     |                       |                   | +-----+-----+ |
|    ROUTING    |                       |                   | |  filter   | |
+-------+-------+                       |                   | +-----+-----+ |
        |                               |                   +-------+-------+
+-------+-------+                       |                           |
|    IPTABLES   |                       +---------------------------+
|     OUTPUT    |                                     |
| +-----------+ |                                     |
| | conntrack | |                                     |
| +-----+-----+ |                                     |
| |   mangle  | | <- MARK WRITE                       |
| +-----+-----+ |                                     |
| |    nat    | | <- DEST REWRITE                     |
| +-----+-----+ |     DNAT or REDIRECT                |
| |   filter  | |                                     |
| +-----+-----+ |                                     |
+-------+-------+                                     |
        |                                             |
        +----------------------+----------------------+
                               |
                  +------------+------------+
                  |                         |
          +-------+-------+       +---------+---------+
          |    IPCHAINS   |       |      IPTABLES     |
          |     OUTPUT    |       |    POSTROUTING    |
          +-------+-------+       | +-------+-------+ |
                  |               | |    mangle     | | <- MARK WRITE  
                  |               | +-------+-------+ |
                  |               | |      nat      | | <- SOURCE REWRITE
                  |               | +-------+-------+ |      SNAT or MASQUERADE
                  |               | |      IMQ      | |
                  |               | +-------+-------+ |
                  |               +---------+---------+
                  +------------+------------+
                               |
                        +------+------+
                        |     QOS     |
                        |    EGRESS   |
                        +------+------+
                               |
                    -----------+-----------
                            Network
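  • Names of firewall chains: PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING (bold in the original diagram)
  • Controlled by iptables/ipchains: the IPTABLES and IPCHAINS boxes (blue in the original diagram)
  • Controlled by ip/tc: the ROUTING and QOS boxes (red in the original diagram)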
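
The iptables side of the diagram can also be read as three ordered paths. The sketch below only restates the boxes above in Python; the names `PATHS` and `traverse` are made up for illustration and do not correspond to any kernel or iptables API.

```python
# Hypothetical model of the iptables side of the diagram above.
# Stage names and their order are copied from the diagram; nothing
# here talks to a real kernel.

PREROUTING = [("PREROUTING", t) for t in ("conntrack", "mangle", "IMQ", "nat")]
POSTROUTING = [("POSTROUTING", t) for t in ("mangle", "nat", "IMQ")]

PATHS = {
    # packet arrives from the network and is delivered to a local process
    "local_in": PREROUTING
    + [("QOS", "INGRESS"), ("ROUTING", "INPUT")]
    + [("INPUT", t) for t in ("mangle", "filter")],
    # packet arrives from the network and is sent on to another address
    "forward": PREROUTING
    + [("QOS", "INGRESS"), ("ROUTING", "INPUT")]
    + [("FORWARD", t) for t in ("mangle", "filter")]
    + POSTROUTING
    + [("QOS", "EGRESS")],
    # packet is generated by a local process
    "local_out": [("ROUTING", "OUTPUT")]
    + [("OUTPUT", t) for t in ("conntrack", "mangle", "nat", "filter")]
    + POSTROUTING
    + [("QOS", "EGRESS")],
}


def traverse(kind):
    """Return the stages a packet of the given kind passes, in order."""
    return [f"{chain}/{table}" for chain, table in PATHS[kind]]


if __name__ == "__main__":
    for kind in PATHS:
        print(kind, "->", " -> ".join(traverse(kind)))
```

Calling `traverse("forward")` lists the PREROUTING steps, ingress, input routing, the FORWARD chain, the POSTROUTING steps and egress, in that order. A forwarded packet never passes the INPUT or OUTPUT chains.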

My remarks on the diagram

  • Output routing : the local process selects a source address and a route. This route is attached to the packet and used later.
  • Postrouting : rerouting is also possible if netfilter changes parts of the packet such as the address or the TOS.
  • RPDB : routing policy database, controlled by ip. That's also the place where the kernel does source validation and nexthop decision.
  • IMQ : packets put in the imq device also travel through the "EGRESS" part of the diagram, so you can use htb/cbq to control the packets in the imq device.
  • ipchains : Yes, there is some ipchains code in kernel 2.4. If you load the ipchains module, you can't use iptables anymore. You can even load the ipfwadm module if you want ipfwadm support. So it's iptables, or ipchains, or ipfwadm, but no combination is possible.
  • mangle : since kernel 2.4.18, you have a mangle table in all 5 netfilter hooks.
  • IMQ on input comes before nat, so IMQ does not know the real IP address. Ingress comes after nat, so ingress knows the real IP address.
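
The last remark is easier to see with a concrete packet. This is a minimal sketch that assumes the PREROUTING order drawn in the diagram (mangle, IMQ, nat, then ingress). The function name and the addresses are invented for the example.

```python
# Hypothetical model: a packet is a dict, and each step from the diagram
# either records the destination it sees or rewrites it.

def prerouting_and_ingress(packet, dnat_to):
    """Walk mangle -> IMQ -> nat -> ingress in the order the diagram shows.

    Returns the destination address seen by IMQ and by the ingress qdisc.
    """
    seen = {}
    pkt = dict(packet)            # do not modify the caller's packet
    # mangle: nothing to do in this sketch
    seen["IMQ"] = pkt["dst"]      # IMQ runs before nat
    pkt["dst"] = dnat_to          # nat: DNAT rewrites the destination
    seen["INGRESS"] = pkt["dst"]  # ingress runs after nat
    return seen


if __name__ == "__main__":
    # 203.0.113.10 stands for the public address, 192.168.1.20 for the
    # real internal host; both are example values.
    print(prerouting_and_ingress({"dst": "203.0.113.10"}, "192.168.1.20"))
```

With these example values, IMQ sees `203.0.113.10` while ingress sees `192.168.1.20`. So a classifier attached to the imq device cannot match on the internal address on input.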

Leonardo notes

  • The input routing determines local/forward.
  • ip rule (routing policy database RPDB) is input routing, more correctly, part of the input routing.
  • The output routing is performed from "higher layer".
  • nexthop and output device are determined both from the input and the output routing.
  • The forwarding process is called at input routing by functions from specific places in the code. It executes after input routing and does not perform nexthop/outdev selection. It is the process of receiving and sending the same packet; in the context of all these hooks, it is the code that sends ICMP redirects (when demanded by input routing), decrements the IP TTL, performs dumb NAT and calls the filter chain. This code is used only for forwarded packets.
  • Sometimes the word "Forwarding", with a capital F, is used to refer to both the routing and the forwarding process.
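
To separate the two steps Leonardo describes, here is a rough sketch: input routing only decides local/forward, and the forwarding code then decrements the TTL and calls the filter chain. The function names are hypothetical, `filter_chain` is just a callable standing in for the FORWARD filter rules, and ICMP redirects and dumb NAT are left out.

```python
# Hypothetical split between input routing and the forwarding code.

def input_routing(dst, local_addresses):
    """Input routing: decide whether the packet is local or forwarded."""
    return "local" if dst in local_addresses else "forward"


def forward(packet, filter_chain):
    """Forwarding code: runs after input routing, forwarded packets only.

    No nexthop/outdev selection happens here; that was done by routing.
    """
    pkt = dict(packet)
    pkt["ttl"] -= 1                            # decrement the IP TTL
    return pkt if filter_chain(pkt) else None  # verdict of the filter chain


if __name__ == "__main__":
    local = {"192.168.1.1"}                    # example local address
    pkt = {"dst": "10.0.0.9", "ttl": 64}       # example packet
    if input_routing(pkt["dst"], local) == "forward":
        print(forward(pkt, lambda p: True))    # accept everything
```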

Updates

I removed conntrack from POSTROUTING. More info at http://iptables-tutorial.frozentux.net/iptables-tutorial.html#STATEMACHINE. See the last part of section 4 : "All connection tracking is handled in the PREROUTING chain, except locally generated packets which are handled in the OUTPUT chain. What this means is that iptables will do all recalculation of states and so on within the PREROUTING chain. If we send the initial packet in a stream, the state gets set to NEW within the OUTPUT chain, and when we receive a return packet, the state gets changed in the PREROUTING chain to ESTABLISHED, and so on. If the first packet is not originated by ourself, the NEW state is set within the PREROUTING chain of course. So, all state changes and calculations are done within the PREROUTING and OUTPUT chains of the nat table."
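
The quoted behaviour can be mimicked with a very small state table. This is only a sketch of where the state is assigned: the class name `ConnTrack` and the addresses are made up, and real conntrack also keys on protocol and ports and knows more states.

```python
# Hypothetical, much simplified connection tracker. It only shows that
# state is calculated in PREROUTING and OUTPUT, as the quoted text says.

class ConnTrack:
    def __init__(self):
        self.state = {}  # (src, dst) of the first packet -> state

    def see(self, hook, src, dst):
        """Record a packet seen at a hook and return its connection state."""
        if hook not in ("PREROUTING", "OUTPUT"):
            raise ValueError("state is only calculated in PREROUTING and OUTPUT")
        if (src, dst) in self.state:
            # another packet in the original direction: state unchanged
            return self.state[(src, dst)]
        if (dst, src) in self.state:
            # packet in the reply direction: connection becomes ESTABLISHED
            self.state[(dst, src)] = "ESTABLISHED"
            return "ESTABLISHED"
        # first packet of a connection
        self.state[(src, dst)] = "NEW"
        return "NEW"


if __name__ == "__main__":
    ct = ConnTrack()
    print(ct.see("OUTPUT", "10.0.0.1", "10.0.0.2"))      # we send first
    print(ct.see("PREROUTING", "10.0.0.2", "10.0.0.1"))  # reply comes back
```

In this run the locally generated packet gets NEW in OUTPUT, and the return packet changes the state to ESTABLISHED in PREROUTING.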

I received this email :
Since I've recently failed in a few iproute2 experiments, I have the following comment on the packet travelling guide: It is incomplete in respect to locally generated packets.
The traversal guide states that routing actually happens before the packet enters the netfilter OUTPUT queue. However, this is not all that happens in current kernels. If the packet is somehow modified while traversing the output queue (for example, by putting a fwmark on it), netfilter recognizes that the packet needs to be routed again and does so. So, there is possibly another 'OUTPUT ROUTING' rectangle after the netfilter OUTPUT chain.
However, if I'm not mistaken, there are some problems with that (as per my recent posting to the lartc and netdev lists, it seems that the source address is chosen before the netfilter OUTPUT is traversed, and the subsequently chosen other route's src attribute no longer affects the source address of the socket/packet).
Now, I also might be totally wrong, but so far nobody has been able to point out what exactly I'm misunderstanding...
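
A sketch of what this email describes, assuming it is right (the author himself is not sure): the route and the source address are chosen first, a fwmark set in OUTPUT triggers a second route lookup, but the source address from the first lookup is kept. The routing table, device names and addresses below are invented.

```python
# Hypothetical routing policy: fwmark -> route. Device names and
# addresses are example values only.
ROUTES = {
    0: {"dev": "eth0", "src": "192.0.2.1"},     # default route
    1: {"dev": "eth1", "src": "198.51.100.1"},  # route for marked packets
}


def route_lookup(mark):
    """Pick a route by fwmark, falling back to the default route."""
    return ROUTES.get(mark, ROUTES[0])


def send_local(set_mark=None):
    """Model a locally generated packet as described in the email."""
    pkt = {"mark": 0}
    # OUTPUT ROUTING: route and source address are chosen first
    route = route_lookup(pkt["mark"])
    pkt["src"] = route["src"]
    pkt["dev"] = route["dev"]
    # netfilter OUTPUT (mangle): may put a fwmark on the packet
    if set_mark is not None and set_mark != pkt["mark"]:
        pkt["mark"] = set_mark
        # the packet was modified, so it is routed again ...
        pkt["dev"] = route_lookup(pkt["mark"])["dev"]
        # ... but the source address chosen earlier is kept
    return pkt


if __name__ == "__main__":
    print(send_local())    # unmarked packet
    print(send_local(1))   # packet marked in OUTPUT
```

In this model the marked packet leaves via `eth1` but still carries `192.0.2.1` as source, which is the problem the email points at: the `src` attribute of the second route has no effect.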

More info

http://www.gnumonks.org/ftp/pub/doc/packet-journey-2.4.html
http://www.linuxvirtualserver.org/Joseph.Mack/HOWTO/LVS-HOWTO-19.html#ss19.21

Courtesy : http://www.docum.org/docum.org/kptd/