Mirroring the opam repository and all tarballs

Written by hannes
Classified under: mirageosdeploymentopam
Published: 2022-09-29 (last updated: 2023-11-20)

We at robur developed opam-mirror in the last month and run a public opam mirror at https://opam.robur.coop (updated hourly).

What is opam and why should I care?

Opam is the OCaml package manager (also used by other projects such as coq). It is a source based system: the so-called repository contains the metadata (url to source tarballs, build dependencies, author, homepage, development repository) of all packages. The main repository is hosted on GitHub as ocaml/opam-repository, where authors of OCaml software can contribute (as pull request) their latest releases.

When opening a pull request, automated systems attempt to build not only the newly released package on various platforms and OCaml versions, but also all reverse dependencies, and also with dependencies with the lowest allowed version numbers. That's crucial since neither semantic versioning has been adapted across the OCaml ecosystem (which is tricky, for example due to local opens any newly introduced binding will lead to a major version bump), neither do many people add upper bounds of dependencies when releasing a package (nobody is keen to state "my package will not work with cmdliner in version 1.2.0").

So, the opam-repository holds the metadata of lots of OCaml packages (around 4000 at the moment this article was written) with lots of versions (in total 25000) that have been released. It is used by the opam client to figure out which packages to install or upgrade (using a solver that takes the version bounds into consideration).

Of course, opam can use other repositories (overlays) or forks thereof. So nothing stops you from using any other opam repository. The url to the source code of each package may be a tarball, or a git repository or other version control systems.

The vast majority of opam packages released to the opam-repository include a link to the source tarball and a cryptographic hash of the tarball. This is crucial for security (under the assumption the opam-repository has been downloaded from a trustworthy source - check back later this year for updates on conex). At the moment, there are some weak spots in respect to security: md5 is still allowed, and the hash and the tarball are downloaded from the same server: anyone who is in control of that server can inject arbitrary malicious data. As outlined above, we're working on infrastructure which fixes the latter issue.

How does the opam client work?

Opam, after initialisation, downloads the index.tar.gz from https://opam.ocaml.org/index.tar.gz, and uses this as the local opam universe. An opam install cmdliner will resolve the dependencies, and download all required tarballs. The download is first tried from the cache, and if that failed, the URL in the package file is used. The download from the cache uses the base url, appends the archive-mirror, followed by the hash algorithm, the first two characters of the has of the tarball, and the hex encoded hash of the archive, i.e. for cmdliner 1.1.1 which specifies its sha512: https://opam.ocaml.org/cache/sha512/54/5478ad833da254b5587b3746e3a8493e66e867a081ac0f653a901cc8a7d944f66e4387592215ce25d939be76f281c4785702f54d4a74b1700bc8838a62255c9e.

How does the opam repository work?

According to DNS, opam.ocaml.org is a machine at amazon. It likely, apart from the website, uses opam admin index periodically to create the index tarball and the cache. There's an observable delay between a package merge in the opam-repository and when it shows up at opam.ocaml.org. Recently, there was a reported downtime.

Apart from being a single point of failure, if you're compiling a lot of opam projects (e.g. a continuous integration / continuous build system), it makes sense from a network usage (and thus sustainability perspective) to move the cache closer to where you need the source archives. We're also organising the MirageOS hack retreats in a northern African country with poor connectivity - so if you gather two dozen camels you better bring your opam repository cache with you to reduce the bandwidth usage (NB: this requires at the moment cooperation of all participants to configure their default opam repository accordingly).

Re-developing "opam admin create" as MirageOS unikernel

The need for a local opam cache at our reproducible build infrastructure and the retreats, we decided to develop opam-mirror as a MirageOS unikernel. Apart from a useful showcase using persistent storage (that won't fit into memory), and having fun while developing it, our aim was to reduce our time spent on system administration (the opam admin index is only one part of the story, it needs a Unix system and a webserver next to it - plus remote access for doing software updates - which has quite some attack surface.

Another reason for re-developing the functionality was that the opam code (what opam admin index actually does) is part of the opam source code, which totals to 50_000 lines of code -- looking up whether one or all checksums are verified before adding the tarball to the cache, was rather tricky.

In earlier years, we avoided persistent storage and block devices in MirageOS (by embedding it into the source code with crunch, or using a remote git repository), but recent development, e.g. of chamelon sparked some interest in actually using file systems and figuring out whether MirageOS is ready in that area. A month ago we started the opam-mirror project.

Opam-mirror takes a remote repository URL, and downloads all referenced archives. It serves as a cache and opam-repository - and does periodic updates from the remote repository. The idea is to validate all available checksums and store the tarballs only once, and store overlays (as maps) from the other hash algorithms.

Code development and improvements

Initially, our plan was to use ocaml-git for pulling the repository, chamelon for persistent storage, and httpaf as web server. With ocaml-tar recent support of gzip we should be all set, and done within a few days.

There is already a gap in the above plan: which http client to use - in the best case something similar to our http-lwt-client - in MirageOS: it should support HTTP 1.1 and HTTP 2, TLS (with certificate validation), and using happy-eyeballs to seemlessly support both IPv6 and legacy IPv4. Of course it should follow redirect, without that we won't get far in the current Internet.

On the path (over the last month), we fixed file descriptor leaks (memory leaks) in paf -- which is used as a runtime for httpaf and h2.

Then we ran into some trouble with chamelon (out of memory, some degraded peformance, it reporting out of disk space), and re-thought our demands for opam-mirror. Since the cache is only ever growing (new packages are released), there's no need to ever remove anything: it is append-only. Once we figured that out, we investigated what needs to be done in ocaml-tar (where tar is in fact a tape archive, and was initially designed as file format to be appended to) to support appending to an archive.

We also re-thought our bandwidth usage, and instead of cloning the git remote at startup, we developed git-kv which can dump and restore the git state.

Also, initially we computed all hashes of all tarballs, but with the size increasing (all archives are around 7.5GB) this lead to a major issue of startup time (around 5 minutes on a laptop), so we wanted to save and restore the maps as well.

Since neither git state nor the maps are suitable for tar's append-only semantics, and we didn't want to investigate yet another file system - such as fat may just work fine, but the code looks slightly bitrot, and the reported issues and non-activity doesn't make this package very trustworthy from our point of view. Instead, we developed mirage-block-partition to partition a block device into two. Then we just store the maps and the git state at the end - the end of a tar archive is 2 blocks of zeroes, so stuff at the far end aren't considered by any tooling. Extending the tar archive is also possible, only the maps and git state needs to be moved to the end (or recomputed). As file system, we developed oneffs which stores a single value on the block device.

We observed a high memory usage, since each requested archive was first read from the block device into memory, and then sent out. Thanks to Pierre Alains recent enhancements of the mirage-kv API, there is a get_partial, that we use to chunk-wise read the archive and send it via HTTP. Now, the memory usage is around 20MB (the git repository and the generated tarball are kept in memory).

What is next? Downloading and writing to the tar archive could be done chunk-wise as well; also dumping and restoring the git state is quite CPU intensive, we would like to improve that. Adding the TLS frontend (currently done on our site by our TLS termination proxy tlstunnel) similar to how unipi does it, including let's encrypt provisioning -- should be straightforward (drop us a note if you'd be interesting in that feature).

Conclusion

To conclude, we managed within a month to develop this opam-mirror cache from scratch. It has a reasonable footprint (CPU and memory-wise), is easy to maintain and easy to update - if you want to use it, we also provide reproducible binaries for solo5-hvt. You can use our opam mirror with opam repository set-url default https://opam.robur.coop (revert to the other with opam repository set-url default https://opam.ocaml.org) or use it as a backup with opam repository add robur --rank 2 https://opam.robur.coop.

Please reach out to us (at team AT robur DOT coop) if you have feedback and suggestions. We are a non-profit company, and rely on donations for doing our work - everyone can contribute.