Flaky tests can be incredibly frustrating. Ideally, every new test failure would be due to the latest changes that the developer made, and the developer could focus on debugging these failures. Unfortunately, some failures are not due to the latest changes but instead due to flaky tests. Flaky tests can either pass or fail even when run on the same code. Because their behavior is not directly related to the code, they can be very difficult to debug. The key problem with flaky tests is that developers cannot rely on the simple pass/fail outcome of test runs to decide how to proceed with their development: When a test fails, should the developer debug the failure or not? When a test fails, is it because of the developer’s latest change, or not? If a team considers a single failed test to break the entire build, how can they triage flaky tests?
Flaky tests plague companies large and small. Google reports that 1 in 7 of their tests has some level of flakiness associated with it. A quick search also turns up discussions of flaky tests at ThoughtWorks, at SemaphoreCI, at LucidChart, and on Martin Fowler’s blog.
Prior to our work, the most effective way to identify flaky tests was to repeatedly rerun failed tests. If some rerun passes, the test is definitely flaky; but if all reruns fail, the status is unknown. Rerunning failed tests is directly supported by many testing frameworks, including Android, Jenkins, Maven, Spring, and Google TAP. However, rerunning every failed test is extremely costly when organizations see hundreds to millions of test failures per day. Even Google, with its vast compute resources, does not rerun all (failing) tests on every commit but reruns only those suspected to be flaky, and only outside of peak test execution times.
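The rerun rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how any of the frameworks listed implement it; the function name `classify_failed_test`, the `run_test` callback, and the default of five reruns are all assumptions made for the example.

```python
from typing import Callable


def classify_failed_test(run_test: Callable[[], bool], max_reruns: int = 5) -> str:
    """Rerun a test that has already failed once on unchanged code.

    `run_test` is a hypothetical callback that executes the test and
    returns True on a pass and False on a failure.
    """
    for _ in range(max_reruns):
        if run_test():
            # A pass on the same code after a failure proves the test is flaky.
            return "flaky"
    # Every rerun failed: this may be a real bug, or a flaky test that
    # happened to fail each time, so the status stays unknown.
    return "unknown"


if __name__ == "__main__":
    # Simulated test that fails twice and then passes.
    outcomes = iter([False, False, True])
    print(classify_failed_test(lambda: next(outcomes)))  # prints "flaky"
```

Note the asymmetry: a single passing rerun settles the question, but no number of failing reruns does, which is part of why this approach becomes so expensive at scale.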
Our own empirical study found that how tests are rerun (e.g. immediately upon failure, or after some delay, in the same process, or in a new one) is more important than how many times each test is rerun.
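To make the difference between these rerun modes concrete, here is a minimal Python sketch of two of them: rerunning in the same process versus in a fresh one, each with an optional delay. The helper names and the `delay_s` parameter are hypothetical, and the sketch assumes that an in-process test signals failure by raising `AssertionError` and that an external test command signals failure with a non-zero exit code.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def rerun_in_process(test_fn: Callable[[], None], delay_s: float = 0.0) -> bool:
    """Rerun in the current process, so any in-memory state left behind
    by earlier tests is still present."""
    time.sleep(delay_s)
    try:
        test_fn()
        return True
    except AssertionError:
        return False


def rerun_in_new_process(command: Sequence[str], delay_s: float = 0.0) -> bool:
    """Rerun by launching a fresh process, so in-memory state starts clean.
    `command` is whatever command line runs the single test (an assumption)."""
    time.sleep(delay_s)
    return subprocess.run(list(command)).returncode == 0


if __name__ == "__main__":
    # Stand-in for a real test command: a child interpreter that exits with 0.
    ok = rerun_in_new_process([sys.executable, "-c", "raise SystemExit(0)"])
    print(ok)  # prints True
```

A test that fails because of state shared with other tests may keep failing under the first helper yet pass under the second, which is one way the choice of mode can matter more than the rerun count.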
Once flaky tests are identified, there are several strategies to help developers deal with them. Some developers automatically open tickets when they detect flaky tests, remove flaky test results from reporting, or rerun tests to confirm that they are flaky.
Our approach for finding flaky tests, DeFlaker, works without requiring any costly reruns, and is described here.