DNF’s continuous integration (CI) has historically struggled in several respects: reliability, coverage, and results that were not publicly available. We recently migrated to GitHub Actions, which made the CI more reliable, improved the stability and coverage of our integration test suite, and made the results publicly available to contributors.
Let’s list the issues with the old CI setup:
- Not accessible to the public
- Slow (roughly 1 hour 30 minutes per run)
- Frequent stability issues
- Our Jenkins instance was being decommissioned, forcing us to act
What are the improvements?
Public results availability
The results are now publicly available, although their presentation could be described as “rudimentary”: so far we only have test logs. To find the failures, you need to search the logs yourself. Since the logs are some 40,000 lines long, the in-house trick is to search for the string “1 failed”. Because the Behave feature files are run one by one, this search will always find the failure.
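The “1 failed” search can also be scripted. The following is a minimal sketch, not part of our CI: the function name and the sample log lines are made up for illustration, and the sample lines only approximate the shape of a Behave summary line.

```python
def find_failures(log_lines, marker="1 failed"):
    """Return (line_number, line) pairs for every log line containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if marker in line
    ]


# Illustrative input only: made-up lines, not taken from a real CI log.
sample_log = [
    "Feature: install package",
    "1 feature passed, 0 failed, 0 skipped",
    "Feature: remove package",
    "0 features passed, 1 failed, 0 skipped",
]

for number, line in find_failures(sample_log):
    print(f"{number}: {line}")
```

For a downloaded log file, the same function accepts an open file object, since iterating over a file yields its lines.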
The speedup
The CI run now takes around 50 minutes. Part of the speedup is due to GitHub Actions being faster than our original test runner. The other (somewhat unrelated) part is using Copr batch builds (thanks Copr!).
Stability improvements
We are facing occasional failures related to the overlay filesystem in Podman containers, which stem from our nested container setup. Apart from this, we haven’t had any other issues. It is also easy to re-run the CI jobs in case of an infrastructure failure.
The implementation
The DNF stack CI is perhaps a bit unique in that we need to test our full stack (libcomps, librepo, libdnf, dnf, plugins) as a whole. We don’t use the “separate changes in the components and test + merge independently” workflow. It would bring too much overhead in our case. In addition, the integration test suite needs the full stack implementation to test.
Thus, we build our full stack from git on each CI run. To share the non-trivial business logic of the CI between all components of our stack, we have actions defined for building the packages and running the test suite in our CI repository (ci-dnf-stack). These actions are then used by all the stack components’ workflows.
We are using the rpm-gitoverlay tool for building packages of the full stack in Copr. It uses YAML configuration files called overlays to define which RPMs to build, how, and the dependencies between them. As an example, here’s the main stack overlay definition for CI.
As a side note, we are also trying to run DNF stack users’ test suites when possible to catch regressions early. That’s why we separate the Copr build job and our integration test suite run. In this way we can run more test suites on the built packages. Right now, we have the Ansible test suite there. You can see an example of the full picture in the dnf CI workflow.
Since we need the shared actions defined in ci-dnf-stack in each of our CI workflow files, perhaps an interesting implementation detail is that the first repository we clone in our CI workflows is ci-dnf-stack. Then we clone the component being tested from the PR branch into the gits/COMPONENT directory, so that we can build it using rpm-gitoverlay along with the rest of the stack.
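The checkout order can be sketched as a small planning function. This is only an illustration under stated assumptions: our workflows are GitHub Actions workflow files, not Python; the URLs below are placeholders; the function and parameter names are hypothetical; and the sketch assumes gits/COMPONENT is relative to the ci-dnf-stack checkout.

```python
import posixpath

# Placeholder URL (assumption); substitute the real ci-dnf-stack clone URL.
CI_REPO_URL = "https://example.invalid/ci-dnf-stack.git"


def checkout_plan(component, pr_clone_url, pr_branch, workdir="ci-dnf-stack"):
    """Return the two git commands in the order the checkouts happen.

    1. Clone ci-dnf-stack first, because it holds the shared actions.
    2. Clone the component under test from the PR branch into gits/COMPONENT.
    """
    component_dir = posixpath.join(workdir, "gits", component)
    return [
        ["git", "clone", CI_REPO_URL, workdir],
        ["git", "clone", "--branch", pr_branch, pr_clone_url, component_dir],
    ]


# Hypothetical PR against libdnf, for illustration only.
plan = checkout_plan(
    component="libdnf",
    pr_clone_url="https://example.invalid/contributor/libdnf.git",
    pr_branch="my-fix",
)
for command in plan:
    print(" ".join(command))
```

The function only builds the commands rather than running them, which keeps the ordering easy to inspect without network access.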
The security aspect
It is a well-known concern that one should be careful about running untrusted code in GitHub Actions workflows, as an attacker could submit a pull request with code that will, for example, steal the GitHub secrets.
We are running our CI on pull requests without reviewing them first (the exception being pull requests against ci-dnf-stack itself). The PR code is never executed in the GitHub Actions runner directly; it is packaged into a source RPM, sent to Copr to build, and then installed into an unprivileged Podman container. We believe the container isolation is secure enough to prevent an attacker from getting to the secrets on the host.
What remains to be done
An attentive reader may have noticed that the described implementation doesn’t deal with the situation where multiple related pull requests are submitted against different components of the stack. Indeed, the CI still doesn’t work for that case. What we need is a mechanism to link the related PRs, then build and test them together. This remains work for the future.
Other tools considered
We have considered other CI tools to base our CI on, namely Packit and Zuul.
Packit doesn’t support multi-component workflows at this moment, so it wasn’t a good fit for our needs.
Zuul does support multi-component workflows and can run CI on multiple related PRs. We attempted to integrate it, but the attempt ultimately wasn’t fruitful. At the point when we had to decide on a replacement for the Red Hat internal Jenkins setup, GitHub Actions seemed like the most feasible option in the given time frame.
In addition to the GitHub Actions documentation, Martin Pitt’s blog post about working on the Red Hat Installer team CI has been a great resource, as has Martin’s advice over email. Thanks, Martin!