The script "generate.py" is used to parse all test source files for unit tests. These are then written into a "clar.suite" file, which can be included by the main test executable to make available all test suites and unit tests. Our current algorithm simply collects all test suites inside of a dict, iterates through its items and dumps them in a special format into the file. As the order is not guaranteed to be deterministic for Python dictionaries, this may result in arbitrarily ordered C structs. This obviously defeats the purpose of reproducible builds, where the same input should always result in the exact same output. Fix this issue by sorting the test suites by name before dumping them as structs. This enables reproducible builds for the libgit2_clar file.
By default, both ar(1) and ranlib(1) will insert additional information like timestamps into generated static archives and indices. As a consequence, generated static archives are not deterministic when created with default parameters. Both programs do support a deterministic mode, which will simply zero out non-deterministic information with `ar D` and `ranlib -D`. Unfortunately, CMake does not provide an easy knob to add these command line parameters. Instead, we have to redefine the complete command definitions stored in the variables CMAKE_C_ARCHIVE_CREATE, CMAKE_C_ARCHIVE_APPEND and CMAKE_C_ARCHIVE_FINISH. Introduce a new build option `ENABLE_REPRODUCIBLE_BUILDS`. This option is available on Unix-like systems with the exception of macOS, which does not have support for the required flags. If the option is enabled, we add those flags to the invocation of both `ar` and `ranlib` to enable deterministically building the static archive.
8294a2a to d630887
OPTION( ENABLE_WERROR "Enable compilation with -Werror" OFF )
IF (UNIX AND NOT APPLE)
	OPTION( ENABLE_REPRODUCIBLE_BUILDS "Enable reproducible builds" OFF )
ENDIF()
Why UNIX AND NOT APPLE? Are these GNU-only settings? If so, what about (say) FreeBSD? I wonder if there's a better way to detect GNUness...
Those options are not available on all implementations of "ar" and "ranlib", unfortunately. I think being GNU is not even sufficient here, as those options were introduced not that long ago. I even think Ubuntu 14.04 does not have the ability to build it like this.
So originally, I intended to implement this mode as the default, such that all builds are deterministic. But as I saw that it wasn't available on quite a lot of platforms, I simply made it an option such that the distributor can decide for himself if he needs reproducible builds or not. And in case he knows what a reproducible build is and what it is for, he probably also has enough knowledge to fix his toolchain.
So I bet that some BSDs have the ability to have reproducible builds. I'd at least expect OpenBSD to have them, given their focus on security. But as far as I know, this is not at all available on macOS. So yeah, we could probably come up with something which just tests if those tools support the required flags. But in the end, I don't think it would really gain us much, as it is only an option for experts who know what they are doing.
I see. That seems reasonable...
Reproducible builds have the aim of generating the exact same binary files for the same input files, thus giving an actual verifiable path from source code to binary code. So this is actually a security feature.
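One way to check the "same input, same binary" property is to build twice and compare content hashes of everything in the two build directories. The following is only a minimal sketch of that idea, not the script used for this pull request; the function names `tree_digests` and `differing_files` are hypothetical.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a stable order
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def differing_files(build_a, build_b):
    """Return the relative paths that are missing or differ between two builds."""
    a = tree_digests(build_a)
    b = tree_digests(build_b)
    return sorted(p for p in set(a) | set(b) if a.get(p) != b.get(p))
```

An empty result from `differing_files` means the two build trees are byte-for-byte identical; any path it returns is a candidate source of non-determinism.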
I've set out to make our build system fully deterministic in order to enable reproducible builds. Unfortunately, the expected epic journey was more of a small trip out of the door, as most stuff is already built in a deterministic way. There were only two small outliers to this.
The first one is our test suite. The "generate.py" script, which generates our test suite definitions, dumped the modules in a non-deterministic way. As such, our clar test suite was compiled with differently ordered structs and was thus not deterministic.
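The ordering fix can be illustrated with a small sketch. This is not the actual "generate.py" code: the function name `render_suites` and the output format are made up for illustration, and only the idea of iterating over the suites sorted by name comes from the change itself.

```python
def render_suites(suites):
    """Render suite definitions in a stable order, sorted by suite name.

    `suites` maps a suite name to a list of test names. The text format
    produced here is hypothetical, not the real clar.suite format.
    """
    lines = []
    # Sorting by key makes the output independent of dict iteration order.
    for name, tests in sorted(suites.items(), key=lambda item: item[0]):
        lines.append("suite %s" % name)
        for test in tests:
            lines.append("  test %s" % test)
    return "\n".join(lines) + "\n"
```

With this approach, two dicts that hold the same suites but were filled in a different order render to identical text, which is what keeps the generated structs in the same order from build to build.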
The second one was how we generate static libraries. The tools ar(1) and ranlib(1) both are non-deterministic by default because they include information like UID, GID and timestamps in the resulting static archive. This can be turned off by enabling the deterministic mode via a simple flag. While this sounds rather simple, I don't really like the solution for the CMake build system, as there is no simple way to just pass in additional flags to these commands. Instead, we have to override the complete commands as defined by three variables. We could hide this behind a simple build-time option "DETERMINISTIC_BUILD" or similar.
All in all, this leaves us with three files which are not reproducible in the build directory (assuming the path to the build directory does not change): two of them are log files and the third is the clar cache. The first two are non-deterministic by definition and should stay so, the third is too unimportant to care about. As it is a simple serialization of Python objects via pickle, there's also no easy fix here (I think, though I may be mistaken).
The script I've used to test: