Flaky Visual Regression Tests, and what to do about them

November 16, 2023, by Samuel de Moura

Introduction

Visual regression testing is a type of automated testing that compares the appearance of a website or application over time. These tests provide an easy way to ensure that changes to the user interface don't accidentally break existing visuals or introduce new visual bugs. Usually, they work by keeping screenshots of the latest known-good version of the interface under test; if a code change causes it to change visually, the test fails.
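
To make the mechanism concrete, here's a minimal hand-rolled sketch using Capybara and ImageMagick's compare command. The paths, the fuzz value, and the spec itself are illustrative assumptions; in practice you'd usually reach for a dedicated visual regression tool rather than rolling your own:

# A system spec that compares a fresh screenshot against a stored baseline.
it 'matches the known-good dashboard screenshot' do
  visit dashboard_path
  page.save_screenshot('tmp/screenshots/dashboard.png')

  # ImageMagick's `compare` exits with 0 when the images match within the
  # given fuzz tolerance, which `system` reports as `true`.
  matches = system(
    'compare', '-metric', 'AE', '-fuzz', '2%',
    'spec/visual_baselines/dashboard.png',
    'tmp/screenshots/dashboard.png',
    'tmp/screenshots/dashboard_diff.png'
  )
  expect(matches).to be(true), 'dashboard no longer matches its baseline screenshot'
end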

Why write visual regression tests?

These kinds of tests are very good at catching problems that other testing approaches tend to miss. For example: a seemingly innocent CSS change in a shared component might cause an important UI element (such as a confirmation button) to be hidden behind another element, or some code change might cause an important asset not to render due to an unexpected change in routing logic or asset preprocessing.

Besides being great at catching unexpected issues, they also give you a lot of bang for your buck: for the average modern web application, simply loading a screen already exercises shared layouts, routing, assets, API response handling, and so on, all without requiring any assertion beyond 'the page has loaded'.

The bad

As with any other testing methodology, visual regression testing has its negatives. One of the main ones is that it's fairly prone to 'flakiness': that is, it might fail intermittently for no apparent reason.

At first, this leads to wasted time as developers chase down non-existent problems and grow frustrated by having their work stalled for seemingly no reason. Even worse: when it becomes a recurring problem, the team tends to lose trust in the test suite and become less confident when shipping changes or working within the codebase, inevitably resulting in a slower time-to-market. On top of that, developers might be tempted to disable certain tests or become less motivated to write new ones, which can lead to more issues over time.

Thankfully, there are a few things one can do to drastically reduce flakiness and ensure that visual regression testing remains a valuable tool for maintaining application stability and product quality. In this article, we'll go over a few of these.

What (not) to do

A common suggestion you might come across is to simply increase the detection threshold. This means that more pixels need to be different before the test fails. In practice, however, this usually proves ineffective, because it's basically impossible to find a correct value for the detection threshold without running into:

  • Missed regressions: If the threshold is high enough, actual style breakages will go undetected: a change might break the rounded corners on a component or the alignment of your custom-styled checkbox, yet the resulting pixel difference still falls under the threshold.

  • Incorrect regressions: To work around the false negatives, you might be tempted to stick to a low detection threshold. That sounds fine, but then you're back at square one: a minor difference in text rendering - due to a sub-pixel layout shift caused by some non-determinism at page load time, for example - will still result in a significant number of pixels differing across snapshots, and you'll have a failing spec even though everything looks exactly the same to the human eye.

In that case, what should you do? Well, it mostly depends on what kind of failures you're observing, so let's go through some common solutions to the most common causes.

Ensure a standardized environment when running the tests

Each operating system, web browser, and even browser engine version has its own little quirks that cause things to render differently. Don't bother trying to get consistent results between a snapshot taken on Safari on macOS and one taken on Firefox on Linux. Even Firefox against another version of Firefox on the same Linux distribution isn't a guaranteed match.

Instead, standardize the environment where snapshots are taken (and where the visual regression tests are run). Usually, this means using a consistent Docker image and perhaps pinning browser versions to avoid unnecessary surprises when upgrading system packages and application libraries.

(Note: You should still update the browser regularly, but there's a small chance that doing so will require regenerating all the test snapshots before your visreg tests pass again. Depending on the project, it may be worth pinning the browser and treating its upgrade as a slightly higher-risk change.)
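
As a sketch of what that standardization might look like, here's a minimal Dockerfile that installs the distribution's Chromium packages and then holds them at whatever version was installed. The base image and package names are assumptions that depend on which browser and driver you actually use:

FROM ruby:3.2-slim-bookworm

# Install the browser once, inside the image, so every run renders with the
# exact same engine.
RUN apt-get update && \
    apt-get install -y --no-install-recommends chromium chromium-driver && \
    rm -rf /var/lib/apt/lists/*

# Keep a routine `apt-get upgrade` from silently changing rendering later on:
RUN apt-mark hold chromium chromium-driver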

Blur your snapshots (or configure your diffing algorithm appropriately)

If your visreg tests seem to fail at random for completely mysterious reasons and the before/after images look identical to the naked eye, the culprit is usually something you don't really care about: browser engine updates changing font rendering, or fractional layout values and non-deterministic load order causing sub-pixel differences in element positioning, that kind of thing.

For these sorts of situations, simply blurring your snapshots with a radius of 1 or 2 pixels can be enough to drastically cut back on these false positives, or even eliminate them entirely. Depending on the tooling you're using, you might actually have to blur your screenshots manually (using ImageMagick for example), or there might be some configuration option so that this is done automatically before calculating the diffs. Take a quick look at the documentation of whichever visreg testing tool you're using to see what's available.
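
If you do end up blurring manually, a tiny helper that shells out to ImageMagick is enough. This is just a sketch, and the method name and paths are illustrative:

# Blur an image before it's handed to the diffing step.
def blur_for_diff(source_path, blurred_path)
  # '0x1' lets ImageMagick pick the radius and applies a sigma of 1 pixel;
  # bump the sigma to 2 for a stronger blur.
  system('convert', source_path, '-blur', '0x1', blurred_path) or
    raise "ImageMagick failed to blur #{source_path}"
end

blur_for_diff('spec/visual_baselines/dashboard.png', 'tmp/diff/baseline.png')
blur_for_diff('tmp/screenshots/dashboard.png', 'tmp/diff/current.png')
# ...then run your usual comparison on the two blurred images.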

Disable animations

I don't think I've ever seen a case where animations were important in a visual regression test. Instead, they usually cause flakiness because a snapshot might be taken too early/too late while some animation is playing. If this is your situation, I recommend just disabling them entirely.

Again, this depends on your project and tooling. If you're using Capybara, you can do:

# in rails_helper.rb
Capybara.disable_animation = true

And if whatever testing framework you're using doesn't have this feature built-in, you can always hack some CSS together and inject it during tests. Maybe something like:

*, *::before, *::after {
  -webkit-animation: none !important;
  -webkit-transition: none !important;
  -moz-animation: none !important;
  -moz-transition: none !important;
  animation: none !important;
  transition: none !important;
}
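
If you go that manual route, one way to inject the CSS from the test suite is a small helper that appends a style tag after each page load. This is only a sketch: the module, the hook, and the file path are assumptions, and you'll need to call the helper again after navigating to a new page:

# spec/support/disable_animations.rb
require 'json'

module DisableAnimations
  DISABLE_ANIMATIONS_CSS = <<~CSS.freeze
    *, *::before, *::after {
      animation: none !important;
      transition: none !important;
    }
  CSS

  # Call after `visit` (or any full page navigation) in a spec.
  def disable_animations!
    page.execute_script(<<~JS)
      const style = document.createElement('style');
      style.innerHTML = #{DISABLE_ANIMATIONS_CSS.to_json};
      document.head.appendChild(style);
    JS
  end
end

RSpec.configure do |config|
  config.include DisableAnimations, type: :system
end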

Watch out for anything timing-dependent

This category encompasses quite a few different situations, but they all cause the same problem in the end: flaky tests, because execution timing isn't guaranteed to be exactly the same on every run. For example:

UI elements are time-dependent

This one usually shows up in the form of DOM elements that depend on the date/time. For example, an item showing 'created X minutes ago' in the UI, where X depends on the time elapsed between the initial database seeding and the visual regression snapshot being taken.

This scenario is usually fixed by mocking the current time (and/or the passage of time). If you're using Ruby, for example, you can use Timecop.
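
Assuming Timecop and RSpec system specs, freezing the clock for every test could look like this (the chosen date is arbitrary):

# in rails_helper.rb
require 'timecop'

RSpec.configure do |config|
  config.around(:each, type: :system) do |example|
    # Freeze "now" so relative timestamps like "created 3 minutes ago"
    # render identically on every run.
    Timecop.freeze(Time.zone.local(2023, 11, 16, 12, 0)) { example.run }
  end
end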

Snapshots being taken with inconsistent timings

Modern web applications can be fairly complex. After the initial page load, a decent chunk of content might not show up in the DOM until some client-side JavaScript has been executed. Or a few network calls have been made. Or some CPU cycles have been spent crunching data returned by an API.

Many testing frameworks try to account for that behavior transparently: with Capybara or Cypress, for example, a test that asserts that a specific string exists in the DOM won't fail outright even if the string isn't there yet. Instead, the assertion is retried until either the string shows up or some time limit is reached.

This behavior usually prevents flakiness but can have the opposite effect when we forget about it. Here's a concrete example to illustrate: suppose you want to write a test to ensure that an important dialog modal is rendering correctly. To accomplish this, you assert that the modal's title is visible on the page, and rely on the screenshot that gets automatically taken at the end of the test to make sure that everything looks the way it's supposed to.

This works fine until some piece of dynamic content gets added to that modal: perhaps a drop-down now requires an API call to return some data before being populated, or a user table might have a new avatar column where the images load with unpredictable timing. Since you're only waiting on the title text, some of that inner modal content may or may not have loaded at that point, and now that test is flaky.

In these sorts of situations, a reliable workaround is to explicitly wait on whatever element is causing the flakiness before the snapshot is taken. With Capybara, that might look something like the sketch below (the route and selectors are illustrative):
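
it 'renders the users modal correctly' do
  visit users_path
  click_button 'Show users'

  expect(page).to have_css('.users-modal__title', text: 'Users')
  # Waiting only on the title isn't enough anymore: also wait for the five
  # avatar images that load with unpredictable timing, so the end-of-test
  # screenshot is taken after they've appeared.
  expect(page).to have_css('.users-modal__avatar img', count: 5)
end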

Allow retries (sparingly!)

Last, but not least: sometimes, it makes sense to just give up and throw in the towel. However! That can still be done in a careful, sparing way that's not going to harm your test suite in the long term.

Suppose there's an intermittent test failure that's caused by a bug that happens exclusively due to the way the test environment is set up. Perhaps a stateful WebSocket server isn't playing nice with the inter-test cleanup mechanism in your framework. If you've gotten to this point, there's a decent chance that spending any extra time on it is not a wise investment, and the usual thing to do is to disable the test and move on.

Instead, consider allowing retries, but only for that specific test. rspec-retry, although abandoned, is a great example of how this can be done: a test-level attribute or argument specifies how many times an example may be retried, and the test is re-run accordingly:

it 'should randomly succeed', retry: 3 do
  expect(rand(2)).to eq(1)
end
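
The setup itself is small; with rspec-retry it might look something like this (the logging options are there so retried failures stay visible in CI output instead of being silently papered over):

# in rails_helper.rb
require 'rspec/retry'

RSpec.configure do |config|
  # Print retry attempts and the exception that triggered them.
  config.verbose_retry = true
  config.display_try_failure_messages = true
end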

It's important to resist the urge to add retries across your entire test suite. Doing so drastically reduces its trustworthiness: whenever a test passes, you can no longer tell whether the feature actually works or whether it only works intermittently and the retries are covering that up. Used sparingly, however, retries are a great way to avoid spending too much engineering time on inconsequential failures without being forced to remove the problematic tests entirely.

So there you have it - several techniques for improving the reliability of your visual regression tests. I hope that the tips above can help you on your way towards creating a more stable and effective test suite for your applications.

Closing Remark

Could your team use some help with topics like this and others covered by ShakaCode's blog and open source? We specialize in optimizing Rails applications, especially those with advanced JavaScript frontends, like React. We can also help you optimize your CI processes with lower costs and faster, more reliable tests. Scraping web data and lowering infrastructure costs are two other areas of specialization. Feel free to reach out to ShakaCode's CEO, Justin Gordon, at justin@shakacode.com or schedule an appointment to discuss how ShakaCode can help your project!