Traceroute

Written by hannes
Classified under: mirageosprotocol
Published: 2020-06-24 (last updated: 2021-11-19)

Traceroute

Is a diagnostic utility which displays the route and measures transit delays of packets across an Internet protocol (IP) network.

$ doas solo5-hvt --net:service=tap0 -- traceroute.hvt --ipv4=10.0.42.2/24 --ipv4-gateway=10.0.42.1 --host=198.167.222.207
            |      ___|
  __|  _ \  |  _ \ __ \
\__ \ (   | | (   |  ) |
____/\___/ _|\___/____/
Solo5: Bindings version v0.6.5
Solo5: Memory map: 512 MB addressable:
Solo5:   reserved @ (0x0 - 0xfffff)
Solo5:       text @ (0x100000 - 0x212fff)
Solo5:     rodata @ (0x213000 - 0x24bfff)
Solo5:       data @ (0x24c000 - 0x317fff)
Solo5:       heap >= 0x318000 < stack < 0x20000000
2020-06-22 15:41:25 -00:00: INF [netif] Plugging into service with mac 76:9b:36:e0:e5:74 mtu 1500
2020-06-22 15:41:25 -00:00: INF [ethernet] Connected Ethernet interface 76:9b:36:e0:e5:74
2020-06-22 15:41:25 -00:00: INF [ARP] Sending gratuitous ARP for 10.0.42.2 (76:9b:36:e0:e5:74)
2020-06-22 15:41:25 -00:00: INF [udp] UDP interface connected on 10.0.42.2
2020-06-22 15:41:25 -00:00: INF [application]  1  10.0.42.1  351us
2020-06-22 15:41:25 -00:00: INF [application]  2  192.168.42.1  1.417ms
2020-06-22 15:41:25 -00:00: INF [application]  3  192.168.178.1  1.921ms
2020-06-22 15:41:25 -00:00: INF [application]  4  88.72.96.1  16.716ms
2020-06-22 15:41:26 -00:00: INF [application]  5  *
2020-06-22 15:41:27 -00:00: INF [application]  6  92.79.215.112  16.794ms
2020-06-22 15:41:27 -00:00: INF [application]  7  145.254.2.215  21.305ms
2020-06-22 15:41:27 -00:00: INF [application]  8  145.254.2.217  22.05ms
2020-06-22 15:41:27 -00:00: INF [application]  9  195.89.99.1  21.088ms
2020-06-22 15:41:27 -00:00: INF [application] 10  62.115.9.133  20.105ms
2020-06-22 15:41:27 -00:00: INF [application] 11  213.155.135.82  30.861ms
2020-06-22 15:41:27 -00:00: INF [application] 12  80.91.246.200  30.716ms
2020-06-22 15:41:27 -00:00: INF [application] 13  80.91.253.163  28.315ms
2020-06-22 15:41:27 -00:00: INF [application] 14  62.115.145.27  30.436ms
2020-06-22 15:41:27 -00:00: INF [application] 15  80.67.4.239  42.826ms
2020-06-22 15:41:27 -00:00: INF [application] 16  80.67.10.147  47.213ms
2020-06-22 15:41:27 -00:00: INF [application] 17  198.167.222.207  48.598ms
Solo5: solo5_exit(0) called

This means with a traceroute utility you can investigate which route is taken to a destination host, and what the round trip time(s) on the path are. The sample output above is taken from a virtual machine on my laptop to the remote host 198.167.222.207. You can see there are 17 hops between us, with the first being my laptop with a tiny round trip time of 351us, the second and third are using private IP addresses, and are my home network. The round trip time of the fourth hop is much higher, this is the first hop on the other side of my DSL modem. You can see various hops on the public Internet: the packets pass from my Internet provider's backbone across some exchange points to the destination Internet provider somewhere in Sweden.

The implementation of traceroute relies mainly on the time-to-live (ttl) field (in IPv6 lingua it is "hop limit") of IP packets, which is meant to avoid route cycles that would infinitely forward IP packets in circles. Every router, when forwarding an IP packet, first checks that the ttl field is greater than zero, and then forwards the IP packet where the ttl is decreased by one. If the ttl field is zero, instead of forwarding, an ICMP time exceeded packet is sent back to the source.

Traceroute works by exploiting this mechanism: a series of IP packets with increasing ttls is sent to the destination. Since upfront the length of the path is unknown, it is a reactive system: first send an IP packet with a ttl of one, if a ICMP time exceeded packet is returned, send an IP packet with a ttl of two, etc. -- until an ICMP packet of type destination unreachable is received. Since some hosts do not reply with a time exceeded message, it is crucial for not getting stuck to use a timeout for each packet: when the timeout is reached, an IP packet with an increased ttl is sent and an unknown for the ttl is printed (see the fifth hop in the example above).

The packets send out are conventionally UDP packets without payload. From a development perspective, one question is how to correlate the ICMP packet with the sent UDP packet. Conveniently, ICMP packets contain the IP header and the first eight bytes of the next protocol - the UDP header containing source port, destination port, checksum, and payload length (each fields of size two bytes). This means when we record the outgoing ports together with the sent timestamp, and correlate the later received ICMP packet to the sent packet. Great.

But as a functional programmer, let's figure whether we can abolish the (globally shared) state. Since the ICMP packet contains the original IP header and the first eight bytes of the UDP header, this is where we will embed data. As described above, the data is the sent timestamp and the value of the ttl field. For the latter, we can arbitrarily restrict it to 31 (5 bits). For the timestamp, it is mainly a question about precision and maximum expected round trip time. Taking the source and destination port are 32 bits, using 5 for ttl, remaining are 27 bits (an unsigned value up to 134217727). Looking at the decimal representation, 1 second is likely too small, 13 seconds are sufficient for the round trip time measurement. This implies our precision is 100ns, by counting the digits.

Finally to the code. First we need forth and back conversions between ports and ttl, timestamp:

(* takes a time-to-live (int) and timestamp (int64, nanoseconda), encodes them
   into 16 bit source port and 16 bit destination port:
   - the timestamp precision is 100ns (thus, it is divided by 100)
   - use the bits 27-11 of the timestamp as source port
   - use the bits 11-0 as destination port, and 5 bits of the ttl
*)
let ports_of_ttl_ts ttl ts =
  let ts = Int64.div ts 100L in
  let src_port = 0xffff land (Int64.(to_int (shift_right ts 11)))
  and dst_port = 0xffe0 land (Int64.(to_int (shift_left ts 5))) lor (0x001f land ttl)
  in
  src_port, dst_port

(* inverse operation of ports_of_ttl_ts for the range (src_port and dst_port
   are 16 bit values) *)
let ttl_ts_of_ports src_port dst_port =
  let ttl = 0x001f land dst_port in
  let ts =
    let low = Int64.of_int (dst_port lsr 5)
    and high = Int64.(shift_left (of_int src_port) 11)
    in
    Int64.add low high
  in
  let ts = Int64.mul ts 100L in
  ttl, ts

They should be inverse over the range of valid input: ports are 16 bit numbers, ttl expected to be at most 31, ts a int64 expressed in nanoseconds.

Related is the function to print out one hop and round trip measurement:

(* write a log line of a hop: the number, IP address, and round trip time *)
let log_one now ttl sent ip =
  let now = Int64.(mul (logand (div now 100L) 0x7FFFFFFL) 100L) in
  let duration = Mtime.Span.of_uint64_ns (Int64.sub now sent) in
  Logs.info (fun m -> m "%2d  %a  %a" ttl Ipaddr.V4.pp ip Mtime.Span.pp duration)

The most logic is when a ICMP packet is received:

module Icmp = struct
  type t = {
    send : int -> unit Lwt.t ;
    log : int -> int64 -> Ipaddr.V4.t -> unit ;
    task_done : unit Lwt.u ;
  }

  let connect send log task_done =
    let t = { send ; log ; task_done } in
    Lwt.return t

  (* This is called for each received ICMP packet. *)
  let input t ~src ~dst buf =
    let open Icmpv4_packet in
    (* Decode the received buffer (the IP header has been cut off already). *)
    match Unmarshal.of_cstruct buf with
    | Error s ->
      Lwt.fail_with (Fmt.strf "ICMP: error parsing message from %a: %s" Ipaddr.V4.pp src s)
    | Ok (message, payload) ->
      let open Icmpv4_wire in
      (* There are two interesting cases: Time exceeded (-> send next packet),
         and Destination (port) unreachable (-> we reached the final host and can exit) *)
      match message.ty with
      | Time_exceeded ->
        (* Decode the payload, which should be an IPv4 header and a protocol header *)
        begin match Ipv4_packet.Unmarshal.header_of_cstruct payload with
          | Ok (pkt, off) when
              (* Ensure this packet matches our sent packet: the protocol is UDP
                 and the destination address is the host we're tracing *)
              pkt.Ipv4_packet.proto = Ipv4_packet.Marshal.protocol_to_int `UDP &&
              Ipaddr.V4.compare pkt.Ipv4_packet.dst (Key_gen.host ()) = 0 ->
            let src_port = Cstruct.BE.get_uint16 payload off
            and dst_port = Cstruct.BE.get_uint16 payload (off + 2)
            in
            (* Retrieve ttl and sent timestamp, encoded in the source port and
               destination port of the UDP packet we sent, and received back as
               ICMP payload. *)
            let ttl, sent = ttl_ts_of_ports src_port dst_port in
            (* Log this hop. *)
            t.log ttl sent src;
            (* Sent out the next UDP packet with an increased ttl. *)
            let ttl' = succ ttl in
            Logs.debug (fun m -> m "ICMP time exceeded from %a to %a, now sending with ttl %d"
                           Ipaddr.V4.pp src Ipaddr.V4.pp dst ttl');
            t.send ttl'
          | Ok (pkt, _) ->
            (* Some stray ICMP packet. *)
            Logs.debug (fun m -> m "unsolicited time exceeded from %a to %a (proto %X dst %a)"
                           Ipaddr.V4.pp src Ipaddr.V4.pp dst pkt.Ipv4_packet.proto Ipaddr.V4.pp pkt.Ipv4_packet.dst);
            Lwt.return_unit
          | Error e ->
            (* Decoding error. *)
            Logs.warn (fun m -> m "couldn't parse ICMP time exceeded payload (IPv4) (%a -> %a) %s"
                          Ipaddr.V4.pp src Ipaddr.V4.pp dst e);
            Lwt.return_unit
        end
      | Destination_unreachable when Ipaddr.V4.compare src (Key_gen.host ()) = 0 ->
        (* We reached the final host, and the destination port was not listened to *)
        begin match Ipv4_packet.Unmarshal.header_of_cstruct payload with
          | Ok (_, off) ->
            let src_port = Cstruct.BE.get_uint16 payload off
            and dst_port = Cstruct.BE.get_uint16 payload (off + 2)
            in
            (* Retrieve ttl and sent timestamp. *)
            let ttl, sent = ttl_ts_of_ports src_port dst_port in
            (* Log the final hop. *)
            t.log ttl sent src;
            (* Wakeup the waiter task to exit the unikernel. *)
            Lwt.wakeup t.task_done ();
            Lwt.return_unit
          | Error e ->
            (* Decoding error. *)
            Logs.warn (fun m -> m "couldn't parse ICMP unreachable payload (IPv4) (%a -> %a) %s"
                          Ipaddr.V4.pp src Ipaddr.V4.pp dst e);
            Lwt.return_unit
        end
      | ty ->
        Logs.debug (fun m -> m "ICMP unknown ty %s from %a to %a: %a"
                       (ty_to_string ty) Ipaddr.V4.pp src Ipaddr.V4.pp dst
                       Cstruct.hexdump_pp payload);
        Lwt.return_unit
end

Now, the remaining main unikernel is the module Main:

module Main (R : Mirage_random.S) (M : Mirage_clock.MCLOCK) (Time : Mirage_time.S) (N : Mirage_net.S) = struct
  module ETH = Ethernet.Make(N)
  module ARP = Arp.Make(ETH)(Time)
  module IPV4 = Static_ipv4.Make(R)(M)(ETH)(ARP)
  module UDP = Udp.Make(IPV4)(R)

  (* Global mutable state: the timeout task for a sent packet. *)
  let to_cancel = ref None

  (* Send a single packet with the given time to live. *)
  let rec send_udp udp ttl =
    (* This is called by the ICMP handler which successfully received a
       time exceeded, thus we cancel the timeout task. *)
    (match !to_cancel with
     | None -> ()
     | Some t -> Lwt.cancel t ; to_cancel := None);
    (* Our hop limit is 31 - 5 bit - should be sufficient for most networks. *)
    if ttl > 31 then
      Lwt.return_unit
    else
      (* Create a timeout task which:
         - sleeps for --timeout interval
         - logs an unknown hop
         - sends another packet with increased ttl
      *)
      let cancel =
        Lwt.catch (fun () ->
            Time.sleep_ns (Duration.of_ms (Key_gen.timeout ())) >>= fun () ->
            Logs.info (fun m -> m "%2d  *" ttl);
            send_udp udp (succ ttl))
          (function Lwt.Canceled -> Lwt.return_unit | exc -> Lwt.fail exc)
      in
      (* Assign this timeout task. *)
      to_cancel := Some cancel;
      (* Figure out which source and destination port to use, based on ttl
         and current timestamp. *)
      let src_port, dst_port = ports_of_ttl_ts ttl (M.elapsed_ns ()) in
      (* Send packet via UDP. *)
      UDP.write ~ttl ~src_port ~dst:(Key_gen.host ()) ~dst_port udp Cstruct.empty >>= function
      | Ok () -> Lwt.return_unit
      | Error e -> Lwt.fail_with (Fmt.strf "while sending udp frame %a" UDP.pp_error e)

  (* The main unikernel entry point. *)
  let start () () () net =
    let cidr = Key_gen.ipv4 ()
    and gateway = Key_gen.ipv4_gateway ()
    in
    let log_one = fun port ip -> log_one (M.elapsed_ns ()) port ip
    (* Create a task to wait on and a waiter to wakeup. *)
    and t, w = Lwt.task ()
    in
    (* Setup network stack: ethernet, ARP, IPv4, UDP, and ICMP. *)
    ETH.connect net >>= fun eth ->
    ARP.connect eth >>= fun arp ->
    IPV4.connect ~cidr ~gateway eth arp >>= fun ip ->
    UDP.connect ip >>= fun udp ->
    let send = send_udp udp in
    Icmp.connect send log_one w >>= fun icmp ->

    (* The callback cascade for an incoming network packet. *)
    let ethif_listener =
      ETH.input
        ~arpv4:(ARP.input arp)
        ~ipv4:(
          IPV4.input
            ~tcp:(fun ~src:_ ~dst:_ _ -> Lwt.return_unit)
            ~udp:(fun ~src:_ ~dst:_ _ -> Lwt.return_unit)
            ~default:(fun ~proto ~src ~dst buf ->
                match proto with
                | 1 -> Icmp.input icmp ~src ~dst buf
                | _ -> Lwt.return_unit)
            ip)
        ~ipv6:(fun _ -> Lwt.return_unit)
        eth
    in
    (* Start the callback in a separate asynchronous task. *)
    Lwt.async (fun () ->
        N.listen net ~header_size:Ethernet_wire.sizeof_ethernet ethif_listener >|= function
        | Ok () -> ()
        | Error e -> Logs.err (fun m -> m "netif error %a" N.pp_error e));
    (* Send the initial UDP packet with a ttl of 1. This entails the domino
       effect to receive ICMP packets, send out another UDP packet with ttl
       increased by one, etc. - until a destination unreachable is received,
       or the hop limit is reached. *)
    send 1 >>= fun () ->
    t
end

The configuration (config.ml) for this unikernel is as follows:

open Mirage

let host =
  let doc = Key.Arg.info ~doc:"The host to trace." ["host"] in
  Key.(create "host" Arg.(opt ipv4_address (Ipaddr.V4.of_string_exn "141.1.1.1") doc))

let timeout =
  let doc = Key.Arg.info ~doc:"Timeout (in millisecond)" ["timeout"] in
  Key.(create "timeout" Arg.(opt int 1000 doc))

let ipv4 =
  let doc = Key.Arg.info ~doc:"IPv4 address" ["ipv4"] in
  Key.(create "ipv4" Arg.(required ipv4 doc))

let ipv4_gateway =
  let doc = Key.Arg.info ~doc:"IPv4 gateway" ["ipv4-gateway"] in
  Key.(create "ipv4-gateway" Arg.(required ipv4_address doc))

let main =
  let packages = [
    package ~sublibs:["ipv4"; "udp"; "icmpv4"] "tcpip";
    package "ethernet";
    package "arp-mirage";
    package "mirage-protocols";
    package "mtime";
  ] in
  foreign
    ~keys:[Key.abstract ipv4 ; Key.abstract ipv4_gateway ; Key.abstract host ; Key.abstract timeout]
    ~packages
    "Unikernel.Main"
    (random @-> mclock @-> time @-> network @-> job)

let () =
  register "traceroute"
    [ main $ default_random $ default_monotonic_clock $ default_time $ default_network ]

And voila, that's all the code. If you copy it together (or download the two files from the GitHub repository), and have OCaml, opam, and mirage (>= 3.8.0) installed, you should be able to:

$ mirage configure -t hvt
$ make depend
$ make
$ solo5-hvt --net:service=tap0 -- traceroute.hvt ...
... get the output shown at top ...

Enhancements may be to use a different protocol (TCP? or any other protocol ID (may be used to encode more information), encode data into IPv4 ID, or the full 8 bytes of the upper protocol), encrypt/authenticate the data transmitted (and verify it has not been tampered with in the ICMP reply), improve error handling and recovery, send multiple packets for improved round trip time measurements, ...

If you develop enhancements you'd like to share, please sent a pull request to the git repository.

Motivation for this traceroute unikernel was while talking with Aaron and Paul, who contributed several patches to the IP stack which pass the ttl through.

If you want to support our work on MirageOS unikernels, please donate to robur. I'm interested in feedback, either via twitter, hannesm@mastodon.social or via eMail.