Reproducible MirageOS unikernel builds

Reproducible builds summit

I'm just back from the Reproducible builds summit 2019. In 2018, several people developing OCaml, opam, and MirageOS attended the Reproducible builds summit in Paris. The notes from last year on opam reproducibility and MirageOS reproducibility are online. After last year's workshop, Raja started developing the opam reproducibility builder orb, which I extended at and after this year's summit. This year there were hacking days before and after the facilitated summit, which allowed further interaction with participants, writing some code, and conducting experiments. I again had an exciting time at the summit and hacking days, thanks to our hosts, organisers, and all participants.

Goal

Stepping back a bit, let's first look at the goal of reproducible builds: when compiling source code multiple times, the produced binaries should be identical. It should be sufficient if the binaries are behaviourally equal, but this is pretty hard to check. It is much easier to check bit-wise identity of binaries, which also relaxes the burden on the checker -- checking for reproducibility is reduced to computing the hash of the binaries. Let's stick to the bit-wise identical binary definition, which also means software developers have to avoid non-determinism during compilation in their toolchains, dependent libraries, and developed code.
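As a minimal sketch of that check (an illustration only, not part of any tool mentioned here; the function names are made up for this example), comparing two build outputs comes down to hashing both files and comparing the digests:

```python
import hashlib


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 64 KiB chunks so large binaries need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bitwise_identical(first, second):
    """Two build outputs count as reproducible iff their digests match."""
    return sha256_of(first) == sha256_of(second)
```

The checker only needs the published digest and its own build output; it never has to reason about what the binary does.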

A checklist of potential things leading to non-determinism has been written up by the reproducible builds project. Examples include recording the build timestamp into the binary, and the ordering of code and embedded data. The reproducible builds project also developed disorderfs for testing reproducibility and diffoscope for comparing binaries with file-dependent readers, falling back to objdump and hexdump. A giant test infrastructure with lots of variations between the builds, mostly using Debian, has been set up over the years.
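To illustrate the ordering point, here is a small sketch (the function name and the output format are invented for this example) of a generator that embeds file contents into generated code. Emitting entries in sorted key order, rather than in whatever order the container or the filesystem happens to yield them, makes the output independent of insertion or directory order:

```python
def render_embedded(files):
    """Emit one line per embedded file, ordered by name.

    `files` maps a file name to its contents (bytes). Sorting the names
    removes any dependence on insertion order or directory listing order.
    """
    lines = []
    for name in sorted(files):
        lines.append("%s = %r" % (name, files[name]))
    return "\n".join(lines)
```

Two runs that discover the same files in a different order then produce byte-identical generated code.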

Reproducibility is a precondition for trustworthy binaries. See why does it matter. If there are no instructions on how to get from the published sources to the exact binary, why should anyone trust and use the binary which claims to be the result of the sources? It may as well contain different code, including a backdoor, bitcoin mining code, or code that outputs wrong results for specific inputs. Reproducibility does not imply the software is free of security issues or backdoors, but instead of an audit of the binary - which is tedious and rarely done - the source code can be audited. The toolchain (compiler, linker, ...) used for compilation still needs to be taken into account, i.e. trusted or audited to not be malicious. I will only ever publish binaries if they are reproducible.

My main interest at the summit was to enhance existing tooling and conduct some experiments about the reproducibility of MirageOS unikernels -- a unikernel is a statically linked ELF binary to be run as a Unix process or virtual machine. MirageOS heavily uses OCaml and opam, the OCaml package manager, and is an opam package itself. Thus, checking reproducibility of a MirageOS unikernel is the same problem as checking reproducibility of an opam package.

Reproducible builds with opam

Testing for reproducibility is achieved by taking the sources and compiling them twice independently. Afterwards the equality of the resulting binaries can be checked. In trivial projects, the sources are just a single file, or originate from a single tarball. In OCaml, opam uses a community repository to which OCaml developers publish their package releases, but it can also use custom repositories, and in addition pin packages to git remotes (url including branch or commit), or to a directory on the local filesystem. Manually tracking and updating all dependent packages of a MirageOS unikernel is not feasible: our hello-world compiled for hvt (kvm/BHyve) already has 79 opam dependencies, including the OCaml compiler which is distributed as an opam package. The unikernel serving this website depends on 175 opam packages.

Conceptually there should be two tools. The first is the initial builder, which takes the latest opam packages that do not conflict, and exports the exact package versions used during the build, as well as hashes of the binaries. The other tool is a rebuilder, which imports the export, conducts a build, and outputs the hashes of the produced binaries.

Opam has the concept of a switch, which is an environment where a package set is installed. Switches are independent of each other, and can already be exported and imported. Unfortunately the export is incomplete: if a package includes additional patches as part of the repository -- sometimes needed for fixing releases where the actual author or maintainer of a package responds slowly -- neither the package nor the patches end up in the export. Also, if a package is pinned to a git branch, the branch appears in the export, but what it points to may change over time by pushing more commits or even force-pushing to that branch. To solve the former issue, I propose in PR #4040 (under discussion and review), also developed during the summit, to embed the additional files as base64 encoded values in the opam file. To solve the latter issue, I modified the export mechanism to embed the git commit hash (PR #4055), and to avoid sources which come from a local directory or which do not have a checksum.

So the opam export contains the information required to gather the exact same sources and build instructions of the opam packages. If the opam repository were self-contained (i.e. did not depend on any other tools), this would be sufficient. But opam does not run in thin air: it requires some system utilities such as /bin/sh, sed, a GNU make, commonly git, a C compiler, a linker, and an assembler. Since opam is available on various operating systems, the plugin depext handles host system dependencies, e.g. if your opam package requires gmp to be installed, this requires slightly different names depending on host system or distribution, take a look at conf-gmp. This also means opam has rather good information about both the opam dependencies and the host system dependencies for each package. Please note that the host system packages used during compilation are not yet recorded (i.e. which gmp package was installed and used during the build, only that a gmp package has to be installed). The base utilities mentioned above (C compiler, linker, shell) are also not recorded yet.

Operating system information available in opam (such as architecture, distribution, version), which in some cases maps to exact base utilities, is recorded in the build-environment, a separate artifact. The environment variable SOURCE_DATE_EPOCH, used for communicating the same timestamp when software is required to record a timestamp into the resulting binary, is also captured in the build environment.
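For illustration, a tool that has to record a timestamp can honour SOURCE_DATE_EPOCH roughly like this. This is a sketch of the convention, not code taken from orb or opam, and the function names are invented for this example:

```python
import os
import time


def build_timestamp(environ=None):
    """Return the timestamp to embed: SOURCE_DATE_EPOCH if set, else now."""
    environ = os.environ if environ is None else environ
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        # The variable holds seconds since the Unix epoch as a decimal integer.
        return int(value)
    return int(time.time())


def format_timestamp(seconds):
    """Format in UTC so the local time zone cannot leak into the output."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(seconds))
```

With the variable set to the same value for the build and the rebuild, both embed the same timestamp.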

Additional environment variables may be captured or used by opam packages to produce different output. To avoid this, both the initial builder and the rebuilder are run with minimal environment variables: only PATH (normalised to a whitelist of /bin, /usr/bin, /usr/local/bin and /opt/bin) and HOME are defined. Missing information at the moment includes CPU features: some libraries (gmp?, nocrypto) emit different code depending on the CPU features available.
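The idea of a minimal environment can be sketched as follows. This is an illustration under the assumption that the build is started as a subprocess, not how orb is implemented; `minimal_environment` and `run_clean` are names made up for this example:

```python
import subprocess

# The PATH whitelist described above.
ALLOWED_PATH = ["/bin", "/usr/bin", "/usr/local/bin", "/opt/bin"]


def minimal_environment(home):
    """Only PATH (fixed whitelist) and HOME are passed on to the build."""
    return {"PATH": ":".join(ALLOWED_PATH), "HOME": home}


def run_clean(command, home):
    """Run `command` (a list of arguments) with the minimal environment.

    Passing `env=` replaces the inherited environment instead of extending
    it, so variables such as LANG or TZ do not reach the build.
    """
    result = subprocess.run(command, env=minimal_environment(home),
                            check=True, capture_output=True, text=True)
    return result.stdout
```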

Tooling

TL;DR: A build builds an opam package, and outputs .opam-switch, .build-hashes.N, and .build-environment.N. A rebuild uses these artifacts as input, builds the package and outputs another .build-hashes.M and .build-environment.M.

The command-line utility orb can be installed and used:

$ opam pin add orb git+https://github.com/hannesm/orb.git#active
$ orb build --twice --keep-build-dir --diffoscope <your-favourite-opam-package>

It provides two subcommands, build and rebuild. The build command takes a list of local opam --repos where to take opam packages from (defaults to default), a compiler (either a variant --compiler=4.09.0+flambda, a version --compiler=4.06.0, or a pin to a local development version --compiler-pin=~/ocaml), and optionally an existing switch --use-switch. It creates a switch, builds the packages, and emits the opam export, hashes of all files installed by these packages, and the build environment. The flag --keep-build retains the build products; opam's --keep-build-dir additionally retains temporary build products and generated source code. If --twice is provided, a rebuild (described next) is executed after the initial build.

The rebuild command takes a directory with the opam export and build environment to build the opam package. It first compares the build-environment with the host system, sets the SOURCE_DATE_EPOCH and switch location accordingly and executes the import. Once the build is finished, it compares the hashes of the resulting files with the previous run. On divergence, if build directories were kept in the previous build, and if diffoscope is available and --diffoscope was provided, diffoscope is run on the diverging files. If --keep-build-dir was provided as well, diff -ur can be used to compare the temporary build and sources, including build logs.
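The comparison step can be sketched like this. The sketch assumes the hashes of each run have already been loaded into a dictionary mapping an installed file path to its hex digest; it makes no assumption about the on-disk format of the .build-hashes files, and the function name is invented for this example:

```python
def diverging_files(previous, current):
    """Return the sorted paths whose hash differs between two runs.

    A path that exists in only one of the two runs also counts as diverging,
    because `dict.get` returns None for the run where it is missing.
    """
    paths = set(previous) | set(current)
    return sorted(p for p in paths if previous.get(p) != current.get(p))
```

An empty result means the rebuild reproduced the earlier build; otherwise the returned paths are the candidates to inspect more closely.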

The builds are run in parallel, as opam does by default. In my experiments, this parallelism did not lead to different binaries.

Results and discussion

All MirageOS unikernels I have deployed are reproducible \o/. Also, several binaries such as orb itself, opam, solo5-hvt, and all albatross utilities are reproducible.

The unikernels range from hello world and web servers (e.g. this blog, getting its data on startup via a git clone to memory) to authoritative DNS servers and a CalDAV server. They vary in size between 79 and 200 opam packages, resulting in ELF binaries of 2MB - 16MB (including debug symbols). The unikernel opam repository contains some reproducible unikernels used for testing. Some work-in-progress enhancements are needed to achieve this:

At the moment, the opam package of a MirageOS unikernel is automatically generated by mirage configure, but only used for tracking opam dependencies. I worked on mirage PR #1022 to extend the generated opam package with build and install instructions.

If a locale is set, ocamlgraph needs to be patched so that it does not emit a (locale-dependent) timestamp (see ocamlgraph issue #90, discussed below).

The OCaml program crunch embeds a subdirectory as OCaml code into a binary, which we use in MirageOS quite regularly for static assets, etc. This plays into reproducibility in several ways: on the one hand, it needs a timestamp for its last_modified functionality (and has adhered to the SOURCE_DATE_EPOCH spec since June 2018, thanks to Xavier Clerc). On the other hand, before version 3.2.0 (released Dec 14th) it used hashtables for storing the file contents, where iteration is not deterministic (the insertion is not sorted); this was fixed in PR #51 by using a Map instead.

Functoria, a tool used to configure MirageOS devices and their dependencies, can emit a list of opam packages which were required to build the unikernel. This uses opam list --required-by --installed --rec <pkgs>, which uses the cudf graph (thanks to Raja for the explanation) and drops some packages during the rebuild. PR #189 avoids this by not using the --rec argument, and instead manually computing the fixpoint.
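Computing such a fixpoint by hand can be sketched as below. This is not functoria's code; the dependency mapping is a hypothetical stand-in for the direct, non-recursive dependencies one would query per package:

```python
def required_packages(roots, depends):
    """Transitive closure of `roots` under the `depends` mapping.

    `depends` maps a package name to its direct dependencies. The set is
    grown round by round until a round adds nothing new (the fixpoint).
    """
    required = set(roots)
    while True:
        grown = set(required)
        for package in required:
            grown.update(depends.get(package, ()))
        if grown == required:
            # Sorted output keeps the emitted package list deterministic.
            return sorted(required)
        required = grown
```

For example, with the made-up mapping `{"mirage": ["functoria", "lwt"], "functoria": ["cmdliner"]}`, starting from `["mirage"]` yields all four package names.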

Certainly, the choice of environment variables, and whether to vary them (as Debian does) or to not define them (or normalise them) while building, is arguable. Since MirageOS supports neither time zones nor internationalisation, there is no need to solve this issue prematurely. On a related note, even with different locale settings, MirageOS unikernels are reproducible apart from an issue in ocamlgraph #90 embedding the output of date, which is different depending on LANG and locale (LC_*) settings.

Prior art in reproducible MirageOS unikernels is the mirage-qubes-firewall, which has been reproducible since early 2017. Its approach is different: it builds in a Docker container with the opam repository pinned to an exact git commit.

Further work

I only tested a certain subset of opam packages and MirageOS unikernels, mainly on a single machine (my laptop) running FreeBSD. Some tests were also conducted on a Debian system with the same result. The variations, apart from build time, were using a different user and different locale settings. I would be happy if others tested the reproducibility of their OCaml programs with the tools provided. There could as well be CI machines rebuilding opam packages and reporting results to a central repository. I'm pretty sure there are more reproducibility issues in the opam ecosystem. I developed a reproducible testing opam repository with opam packages that do not depend on OCaml, mainly for further tooling development.

As mentioned above, more environment, such as the CPU features, and external system packages, should be captured in the build environment.

When comparing OCaml libraries, some output files (cmt / cmti / cma / cmxa) are not deterministic, but contain minimal divergences whose root cause I was not able to spot. It would be great to fix this, likely in the OCaml compiler distribution. Since the final result, the binary I'm interested in, is not affected by non-identical intermediate build products, I hope someone (you?) is interested in improving on this side. OCaml bytecode output also seems to be non-deterministic. There is a discussion on the coq issue tracker which may be related.

In contrast to initial plans, I did not use the BUILD_PATH_PREFIX_MAP environment variable, which is implemented in OCaml by PR #1515 (and followups). The main reason is that something in the OCaml toolchain (I suspect the bytecode interpreter) needed absolute paths to find libraries, thus I'd need a symlink from the left-hand side to the current build directory, which was tedious. Also, my installed assembler does not respect the build path prefix map, and BUILD_PATH_PREFIX_MAP is not widely supported. See e.g. the Debian zarith package with different build paths and its effects on the binary.

I'm fine with recording the build path (switch location) in the build environment for now - it turns out to end up only once in MirageOS unikernels, likely introduced by the last linking step, which will hopefully soon be solved by llvm 9.0.

What was fun was to compare the unikernel when built on Linux with gcc against a build on FreeBSD with clang and lld - spoiler: they emit debug sections with different DWARF versions, so the diff is pretty big. Other fun differences were between OCaml compiler versions: the difference between minor versions (4.08.0 vs 4.08.1) is pretty small (~100kB as human-readable diff), while the difference between major versions (4.08.1 vs 4.09.0) is rather big (~900kB as human-readable diff).

An item on my list for the future is to distribute the opam export, build hashes and build environment artifacts in an authenticated way. I want to integrate this in-toto style into conex, my not-yet-deployed implementation of tuf for opam that needs further development and a test installation, hopefully in 2020.

If you want to support our work on MirageOS unikernels, please donate to robur. I'm interested in feedback, either via twitter, hannesm@mastodon.social or via eMail.