Coverage-Driven Test Selection

This post is a part of The Third Annual C# Advent. Check out all the other blog posts there, or read my entries from the previous two years:

The idea of this year’s experiment is simple: use per-test coverage data flowing from CI to run only the part of your test suite that might have been affected by the changes you made. It’s not an original thought – check, for example, the paper Code coverage-based regression test selection and prioritization in WebKit for prior art and a further list of interesting references.

In practice, only teams with grave problems with either their project structure or their build orchestration might resort to measures of this kind – I guess that’s the reason why the idea didn’t catch on universally. Normally, you either limit the size of your repositories and the scope of your tests so that you can run all your verification quickly, or you use in- or out-of-repository tools to orchestrate your test execution based on project or repository structure.

This post has three parts.

  • First, I will talk a little bit about the PR loop, to provide some more context for the effort.
  • Second, I will check what coverage tooling we have at our disposal in the .NET world in 2019 and what I can use for my experiment.
  • Then, I will show a POC tool for coverage-driven test selection and analyze its properties.

PR loop(s)

Often, there are two slightly distinct checks connected with pull request loop.

  • The first one opens or closes the gate for merging. It needs to happen after every update of the PR.
  • The second one re-verifies that everything is still OK after the author has signaled their intent to merge. It almost always runs only once.

Although the motivation for such a separation is not clear at first, it follows from a set of practical constraints.

PRs get updated, and there is often an expensive human being waiting for the result of the checks, so there is a strong motivation to make the gate check quick. Artifacts are usually not needed in this part, so their creation can be omitted. Putting artifact creation entirely behind the merge would damage the correctness of master, though, because artifact creation logic is still logic that can be broken.

Often, there is a lot happening in master while the PR is open. Theoretically, we could invalidate the gate check after every merge to master, but that’s grossly uneconomical, both in terms of compute cost and the cost of humans waiting for re-runs. It’s almost certain that the changes in master didn’t break the gate. At this point, total correctness of the gate check is, again, conceded: given we have the pre-merge re-verification in place anyway, it’s not necessary to re-run it to keep master green.

Notice how merge trains in GitLab work. Notice how Azure DevOps offers an auto-complete button for developers who have finished “the real human work” and now just want the computers to wrap it up. Arguably, the industry is being pushed by practical concerns towards the two-stage PR model: one stage before the developer actually wants to merge, the other after the fact.

Coverage tools

You might have noticed that with .NET Core 3.0 templates, dotnet new xunit adds a coverlet dependency to your test project file automatically. I like both this behavior and the coverlet project a lot. Please go and check the source code of the tool – it’s just nice.

Can we use it to get the data we need for our experiment? Unfortunately, no, because coverlet cannot yet provide coverage data per test. And we need the bipartite graph of ((file, line), test) tuples to correlate with the set of lines affected by the current changeset.
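
Just to make the shape of that data concrete, here is how I think about it in code – a minimal sketch, with names of my own invention rather than anything coming from a coverage tool:

using System.Collections.Generic;
using System.Linq;

// One edge of the bipartite graph connects a source line to a test that executed it.
// A changed (file, line) pair then selects every test on the other side of its edges.
public sealed class CoverageMap
{
    // Key: (SUT file path, line number); value: fully qualified names of the tests
    // whose execution touched that line.
    private readonly Dictionary<(string File, int Line), HashSet<string>> _edges =
        new Dictionary<(string File, int Line), HashSet<string>>();

    public void AddEdge(string file, int line, string testName)
    {
        if (!_edges.TryGetValue((file, line), out var tests))
            _edges[(file, line)] = tests = new HashSet<string>();
        tests.Add(testName);
    }

    public IEnumerable<string> TestsTouching(string file, int line) =>
        _edges.TryGetValue((file, line), out var tests)
            ? (IEnumerable<string>)tests
            : Enumerable.Empty<string>();
}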

So, we will have to resort to good old OpenCover.
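
The essence of gathering per-test coverage with OpenCover looks roughly like this – a sketch only; the path and filters are made up to match the toy project below, and the interesting part is the -coverbytest switch, which (as far as I understand it) makes OpenCover record which test methods executed each sequence point:

using System.Diagnostics;

// Sketch: run `dotnet test` under OpenCover and ask it to attribute coverage
// to individual tests via -coverbytest.
var openCover = new ProcessStartInfo
{
    FileName = @"tools\OpenCover.Console.exe",         // illustrative path
    Arguments = string.Join(" ",
        "-target:dotnet",
        "-targetargs:test",
        @"-coverbytest:*\XUnitTestProject1.dll",        // assemblies containing the tests to track
        "-filter:\"+[ClassLibrary1]*\"",                // SUT assemblies to instrument
        "-register:user",
        "-output:coverage.xml"),
    UseShellExecute = false
};
Process.Start(openCover)?.WaitForExit();
// coverage.xml now lists the tests as TrackedMethods and, for every sequence point,
// references to the tracked methods that executed it – the bipartite graph from above.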

POC for coverage-driven test selection

Our tool will provide two commands.

  • cdts-measure will run OpenCover and save the coverage data, named by the current commit hash, into an Azure blob storage account for everyone to use. In real usage (haha), this command would be run by CI as a part of the merge check (the second PR check) to provide the data both to local runs of the team working on further changes and to the first PR checks.
  • cdts-test will take the closest master commit to the current commit (git merge-base <current> master) and ask blob storage for the coverage data previously provided by cdts-measure. Then it will ask git for the diff between that commit and the current state. Finally, for every changed SUT file, it will re-run only the tests that touched lines coinciding with the changes reported by git – a sketch of this selection step follows the list.
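
Stripped of the git and blob-storage plumbing, the core of the selection step could look like this – a sketch reusing the hypothetical CoverageMap from above and handing the selected tests to dotnet test --filter:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class TestSelector
{
    // changedLines: (file, line) pairs parsed from `git diff <merge-base>`;
    // coverage: the map downloaded from blob storage for that merge-base commit.
    public static void RunAffectedTests(CoverageMap coverage, IEnumerable<(string File, int Line)> changedLines)
    {
        var affectedTests = changedLines
            .SelectMany(change => coverage.TestsTouching(change.File, change.Line))
            .Distinct()
            .ToList();

        if (affectedTests.Count == 0)
        {
            Console.WriteLine("No covered line changed; nothing to run.");
            return;
        }

        // vstest filter syntax: `|` means OR, so only the selected tests execute.
        var filter = string.Join("|", affectedTests.Select(t => $"FullyQualifiedName={t}"));
        Process.Start("dotnet", $"test --filter \"{filter}\"")?.WaitForExit();
    }
}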

Let’s check how this works for a really simple project:

> cat ClassLibrary1/Class1.cs
namespace ClassLibrary1
{
    public static class Class1
    {
        public static string Method1()
        {
            return "A";
        }
    }
}
> cat ClassLibrary1/Class2.cs
namespace ClassLibrary1
{
    public static class Class2
    {
        public static string Method1()
        {
            return "B";
        }
    }
}
> cat XUnitTestProject1/UnitTest1.cs
using ClassLibrary1;
using Xunit;

namespace XUnitTestProject1
{
    public class UnitTest1
    {
        [Fact]
        public void Test1()
        {
            Assert.Equal("A", Class1.Method1());
        }

        [Fact]
        public void Test2()
        {
            Assert.Equal("B", Class2.Method1());
        }
    }
}

The tests pass in this situation:

> dotnet test
[...]
Total tests: 2
     Passed: 2

So after capturing the coverage data:

> cdts-measure
Measuring coverage of 061da68...
Uploading coverage of 061da68...

We can make this change:

diff --git a/ClassLibrary1/Class1.cs b/ClassLibrary1/Class1.cs
index dcaae02..f772aa0 100644
--- a/ClassLibrary1/Class1.cs
+++ b/ClassLibrary1/Class1.cs
@@ -4,7 +4,7 @@
     {
         public static string Method1()
         {
-            return "A";
+            return "C";
         }
     }
 }

This change breaks Test1:

> dotnet test
[...]
Total tests: 2
     Passed: 1
     Failed: 1

… but using dotnet test, you can see Test2 was still executed, even though, given the change and the coverage, there was no chance it could actually be affected. So we run cdts-test instead:

> cdts-test
Downloading coverage for 061da68...
[...]
Total tests: 1
     Passed: 0
     Failed: 1

Voilà, we are no longer executing the test we know won’t break.

Analysis

Correctness

We won’t get any false negatives (spurious failures), as we only run a subset of the tests we would run otherwise, so any failure we see would also appear in the full run. Can we get a false positive – a green result the full suite would contradict? Well, if we have non-determinism in our test suite, coverage can change between successive executions, so a test the old coverage marks as unaffected might actually be affected. Therefore, we still need to run the full suite sometimes.

The two-stage nature of the PR loop described at the start of the article helps us here. The full test suite can run as the merge check in the merge train, while the reduced suite can serve as quick feedback for the developer. We can have both the interactivity and the certainty.

Coverage data size

The biggest problem of the presented method is the size of the data structure we publish and require on every test run. It’s on the order of the size of the codebase itself, and even that holds only for sparse bipartite graphs (i.e. for codebases with high unit-ness). That might not seem too bad, as git operations run against a similar amount of data in the repo itself. The problem is that git works with diffs and we don’t.

There are several somewhat practical avenues to fix this:

  • Save the coverage to separate buckets, for example per SUT file. Given that most PRs change at most a few dozen files, the tool could query coverage data only for them, separately (sketched below).
  • Replace blob storage with an actual server keeping the coverage information in its own data structures and responding directly to questions such as “which tests touch files A, B, C for commit acb897d1?”
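
The first option is simple enough to sketch. Assuming Azure.Storage.Blobs and a blob-naming scheme I’m making up here ({commit}/{escaped file path}.json), cdts-test would download coverage only for the files the diff actually mentions:

using System;
using Azure.Storage.Blobs;

// Sketch of per-SUT-file coverage buckets: one small blob per source file, so
// cdts-test downloads data only for the files touched by the current diff.
// Container name and blob naming scheme are made up for illustration.
string connectionString = Environment.GetEnvironmentVariable("CDTS_STORAGE_CONNECTION") ?? "";
string mergeBaseCommit = "061da68";                      // from `git merge-base <current> master`
string[] changedFiles = { "ClassLibrary1/Class1.cs" };   // from `git diff --name-only <merge-base>`

var container = new BlobContainerClient(connectionString, "cdts-coverage");

foreach (var file in changedFiles)
{
    var blob = container.GetBlobClient($"{mergeBaseCommit}/{file.Replace('/', '_')}.json");
    if (!blob.Exists().Value)
        continue; // no test touched this file at the merge-base commit

    var json = blob.DownloadContent().Value.Content.ToString();
    // ...deserialize the per-line test lists and merge them into the in-memory CoverageMap.
}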

Conclusion

I will put the actual source code of the tool on my GitHub soon and update this article with a link – sorry about the delay.

I liked the exercise as it reinforces my long-standing point: coverage data are not an artifact of bad software management. They are a useful engineering tool. Our tooling should gather them and work with them more.
