Who maintains package X?
Published: 2017-02-16 (last updated: 2017-02-16)
A very important data point for conex, the new opam signing utility, is who is authorised for a given package. We
could have written this manually down, or force each author to create a
pull request for their packages, but this would be a long process and not
easy: the main opam repository has around 1500 unique packages, and 350
contributors. Fortunately, it is a git repository with 5 years of history, and
over 6900 pull requests. Each opam file may also contain a
a list of strings (usually a mail address).
The data sources we correlate are the
maintainers entry in opam file, and who
actually committed in the opam repository. This is inspired by some GitHub
GitHub id and email address
For simplicity, since conex uses any (unique) identifier for authors, and the opam repository is hosted on GitHub, we use a GitHub id as author identifier. Maintainer information is an email address, thus we need a mapping between them.
We wrote a shell script to find all PR merges, their GitHub id (in a brittle way: using the name of the git remote), and email address of the last commit. It also saves a diff of the PR for later. This results in 6922 PRs (opam repository version 38d908dcbc58d07467fbc00698083fa4cbd94f9d).
The metadata output is processed by
we ignore PRs from GitHub organisations
PR.ignore_github, where commits
PR.ignore_pr are picked from a different author (manually), bad mail addresses,
and Jeremy's mail address (it is added to too many GitHub ids otherwise). The
goal is to have a for an email address a single GitHub id. 329 authors with 416 mail addresses are mapped.
Maintainer in opam
As mentioned, lots of packages contain a
maintainers entry. In
we extract the mail addresses of the most recently released opam
Some hardcoded matches are teams which do not properly maintain the maintainers
field (such as mirage and xapi-project ;). We're open for suggestions to extend
this massaging to the needs. Additionally, the contact at ocamlpro mail address
was used for all packages before the maintainers entry was introduced (based on
a discussion with Louis Gesbert). 132 packages with empty maintainers.
Combining these two data sources, we hoped to find a strict small set of whom to authorise for which package. Turns out some people use different mail addresses for git commits and opam maintainer entries, which are be easily fixed.
While processing the full diffs of each PR (using the diff parser of conex mentioned above), ignoring the 44% done by janitors (a manually created set by looking at log data, please report if wrong), we categorise the modifications: authorised modification (the GitHub id is authorised for the package), modification by an author to a team-owned package (propose to add this author to the team), modification of a package where no GitHub id is authorised, and unauthorised modification. We also ignore packages which are no longer in the opam repository.
2766 modifications were authorised, 418 were team-owned, 452 were to packages with no maintainer, and 570 unauthorised. This results in 125 unowned packages.
Out of the 452 modifications to packages with no maintainer, 75 are a global one-to-one author to package relation, and are directly authorised.
Inference of team members is an overapproximation (everybody who committed changes to their packages), additionally the janitors are missing. We will have to fill these manually.
alt-ergo -> OCamlPro-Iguernlala UnixJunkie backtracking bobot nobrowser janestreet -> backtracking hannesm j0sh rgrinberg smondet mirage -> MagnusS dbuenzli djs55 hannesm hnrgrgr jonludlam mato mor1 pgj pqwy pw374 rdicosmo rgrinberg ruhatch sg2342 talex5 yomimono ocsigen -> balat benozol dbuenzli hhugo hnrgrgr jpdeplaix mfp pveber scjung slegrand45 smondet vasilisp xapi-project -> dbuenzli djs55 euanh mcclurmc rdicosmo simonjbeaumont yomimono
Alternative approach: GitHub urls
An alternative approach (attempted earlier) working only for GitHub hosted projects, is to authorise the use of the user part of the GitHub repository URL. Results after filtering GitHub organisations are not yet satisfactory (but only 56 packages with no maintainer, output repo. This approach completely ignores the manually written maintainer field.
Manually maintained metadata is easily out of date, and not very useful. But combining automatically created metadata with manually, and some manual tweaking leads to reasonable data.
The resulting authorised inference is available [in this branch](output repo.