Who maintains package X?

Written by hannes
Classified under: package signing, security
Published: 2017-02-16 (last updated: 2017-02-16)

A very important data point for conex, the new opam signing utility, is who is authorised for a given package. We could have written this down manually, or forced each author to create a pull request for their packages, but this would be a long and difficult process: the main opam repository has around 1500 unique packages and 350 contributors. Fortunately, it is a git repository with 5 years of history and over 6900 pull requests. Each opam file may also contain a maintainers entry, a list of strings (usually mail addresses).

The data sources we correlate are the maintainers entry in each opam file and who actually committed to the opam repository. This is inspired by some GitHub discussion.

GitHub id and email address

For simplicity, since conex uses any (unique) identifier for authors, and the opam repository is hosted on GitHub, we use a GitHub id as author identifier. Maintainer information is an email address, thus we need a mapping between them.

We wrote a shell script to find all PR merges, their GitHub id (in a brittle way: using the name of the git remote), and email address of the last commit. It also saves a diff of the PR for later. This results in 6922 PRs (opam repository version 38d908dcbc58d07467fbc00698083fa4cbd94f9d).

The metadata output is processed by github_mail: we ignore PRs from GitHub organisations (PR.ignore_github), PRs whose commits were picked from a different author (PR.ignore_pr, a manually maintained list), bad mail addresses, and Jeremy's mail address (it would otherwise be added to too many GitHub ids). The goal is to have a single GitHub id for each email address. 329 authors with 416 mail addresses are mapped.
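The filtering described above can be sketched as follows. This is only an illustration in Python, not the github_mail code itself: the function name, the tuple layout of the input, and the lowercasing of addresses are all assumptions made for the example.

```python
from collections import defaultdict

def map_mail_to_github(prs, ignore_ids=frozenset(), ignore_prs=frozenset(),
                       ignore_mails=frozenset()):
    """Build a mail -> GitHub id mapping from PR metadata.

    prs is an iterable of (pr_number, github_id, mail) tuples
    (a hypothetical layout chosen for this sketch).
    """
    candidates = defaultdict(set)
    for pr, github_id, mail in prs:
        # Skip organisations and PRs known to carry someone else's commits.
        if github_id in ignore_ids or pr in ignore_prs:
            continue
        mail = mail.strip().lower()  # assumption: compare case-insensitively
        # Skip obviously bad addresses and explicitly ignored ones.
        if "@" not in mail or mail in ignore_mails:
            continue
        candidates[mail].add(github_id)
    # Keep only addresses that point to exactly one GitHub id.
    mapping = {m: next(iter(ids)) for m, ids in candidates.items() if len(ids) == 1}
    # Everything else needs a manual look.
    ambiguous = {m: sorted(ids) for m, ids in candidates.items() if len(ids) > 1}
    return mapping, ambiguous
```

Addresses that end up in `ambiguous` are the ones where manual ignore lists, such as the ones mentioned above, become necessary.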

Maintainer in opam

As mentioned, lots of packages contain a maintainers entry. In maintainers we extract the mail addresses of the most recently released opam file. Some hardcoded matches cover teams which do not properly maintain the maintainers field (such as mirage and xapi-project ;). We're open to suggestions for extending this massaging as needed. Additionally, the contact at ocamlpro mail address was used for all packages before the maintainers entry was introduced (based on a discussion with Louis Gesbert). 132 packages have an empty maintainers entry.
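Since a maintainers entry is a list of free-form strings, the mail addresses have to be pulled out of them. A minimal sketch in Python, assuming the entries have already been read from the opam file into a list of strings; the regular expression is a deliberately simple approximation and the function name is hypothetical:

```python
import re

# Simplified address pattern; real-world addresses can be more exotic.
MAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def maintainer_mails(maintainers):
    """Return the distinct mail addresses found in a list of maintainer strings."""
    seen = []
    for entry in maintainers:
        for mail in MAIL.findall(entry):
            mail = mail.lower()  # assumption: compare case-insensitively
            if mail not in seen:
                seen.append(mail)
    return seen
```

An entry without any address (for example just a name) contributes nothing, so such packages end up with an empty result.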

Fitness

Combining these two data sources, we hoped to find a strict small set of whom to authorise for which package. Turns out some people use different mail addresses for git commits and opam maintainer entries, which is easily fixed.

While processing the full diffs of each PR (using the diff parser of conex mentioned above), ignoring the 44% done by janitors (a set created manually by looking at log data; please report if wrong), we categorise the modifications: authorised modification (the GitHub id is authorised for the package), modification by an author to a team-owned package (we propose to add this author to the team), modification of a package where no GitHub id is authorised, and unauthorised modification. We also ignore packages which are no longer in the opam repository.
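The four categories can be expressed as a small decision function. The following Python sketch is illustrative only: the data layout (a package-to-owners map in which an owner is either a GitHub id or a team name, plus a team-to-members map) and all names are assumptions, not the structures conex uses.

```python
def categorise(github_id, package, owners, teams, janitors=frozenset()):
    """Classify one modification of `package` by `github_id`.

    owners: package -> set of authorised ids (GitHub ids or team names)
    teams:  team name -> set of member GitHub ids
    """
    if github_id in janitors:
        return "janitor"  # ignored in the statistics
    authorised = owners.get(package, set())
    if not authorised:
        return "no-maintainer"
    if github_id in authorised:
        return "authorised"
    owning_teams = [t for t in authorised if t in teams]
    if any(github_id in teams[t] for t in owning_teams):
        return "authorised"  # authorised via team membership
    if owning_teams:
        return "team-owned"  # candidate for being added to the team
    return "unauthorised"
```

Counting the results over all modifications, for example with `collections.Counter`, gives per-category totals.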

2766 modifications were authorised, 418 were team-owned, 452 were to packages with no maintainer, and 570 unauthorised. This results in 125 unowned packages.

Out of the 452 modifications to packages with no maintainer, 75 form a global one-to-one relation between author and package, and are directly authorised.

Inference of team members is an overapproximation (everybody who committed changes to their packages). Additionally, the janitors are missing. We will have to fill these in manually.
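One possible reading of "global one-to-one" is: the package was only ever modified by a single author, and that author modified no other package. Under that assumption (which is an interpretation, not something the analysis code is confirmed to do), the relation can be computed like this in Python:

```python
from collections import defaultdict

def one_to_one(modifications):
    """modifications: iterable of (author, package) pairs.

    Returns package -> author for every pair in which the package has
    exactly one author and that author touched exactly one package.
    """
    authors_of = defaultdict(set)
    packages_of = defaultdict(set)
    for author, package in modifications:
        authors_of[package].add(author)
        packages_of[author].add(package)
    result = {}
    for package, authors in authors_of.items():
        if len(authors) != 1:
            continue
        (author,) = authors
        if len(packages_of[author]) == 1:
            result[package] = author
    return result
```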

alt-ergo -> OCamlPro-Iguernlala UnixJunkie backtracking bobot nobrowser
janestreet -> backtracking hannesm j0sh rgrinberg smondet
mirage -> MagnusS dbuenzli djs55 hannesm hnrgrgr jonludlam mato mor1 pgj pqwy pw374 rdicosmo rgrinberg ruhatch sg2342 talex5 yomimono
ocsigen -> balat benozol dbuenzli hhugo hnrgrgr jpdeplaix mfp pveber scjung slegrand45 smondet vasilisp
xapi-project -> dbuenzli djs55 euanh mcclurmc rdicosmo simonjbeaumont yomimono

Alternative approach: GitHub urls

An alternative approach (attempted earlier), working only for GitHub-hosted projects, is to authorise the user named in the user part of the GitHub repository URL. Results after filtering GitHub organisations are not yet satisfactory (but only 56 packages have no maintainer; see the output repo). This approach completely ignores the manually written maintainer field.
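Extracting the user part of a repository URL is straightforward. A small Python sketch using only the standard library; the function name is hypothetical, and which URL fields of an opam file get fed into it is left open here:

```python
from urllib.parse import urlparse

def github_owner(url):
    """Return the user (or organisation) part of a GitHub URL, or None."""
    parsed = urlparse(url)
    if parsed.hostname not in ("github.com", "www.github.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Expect at least /<user>/<repository>/...
    return parts[0] if len(parts) >= 2 else None
```

The result can be an organisation rather than a person, which is why organisations have to be filtered afterwards.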

Conclusion

Manually maintained metadata is easily out of date, and not very useful. But combining automatically created metadata with manually written metadata, plus some manual tweaking, leads to reasonable data.

The resulting authorised inference is available in this branch (output repo).