Measuring the efficiency of test automation in software projects

Adrien Van Thong
Jun 29, 2017


Test automation is accepted as a best practice and seen as a must in nearly all software engineering projects. After all, automated tests can run around the clock without breaks and are the best way to catch regressions in the code base almost as soon as they are introduced. Unfortunately, due to time constraints not every test case can be automated, so we must choose wisely which ones to automate.

Once many test cases have been automated, it’s easy to fall into the trap of “set it and forget it” and move on to automating the next test case without looking back. Unfortunately, some of the tests that were automated will turn out to be “bad” tests. How do we find out which automated tests are “bad”, and what can we do about them?

What is a “bad” test?

Whenever an automated test fails, it needs to be triaged. An engineer’s time is needed to dig into the failure and determine why the test failed. For an automated test, it could be for any of the following reasons:

  • The software product has a bug and is doing the wrong thing.
  • The test code has a logic error in it and is doing the wrong thing.
  • The product behaviour evolved, but the test code was not updated to reflect it.

In the latter two cases, the failures are false positives. The product was doing the right thing but the automated test was not, and the engineer then needs to invest more time in fixing the test. This effort represents the upkeep overhead of maintaining each automated test suite to ensure it is testing the right thing and generating accurate results. Automated test cases with high upkeep overhead have low reliability.

If left unchecked, the number of automated tests generating false positives will accumulate and demand more and more of the team’s attention in triage. In the worst possible case, a team could spend its entire day triaging false positives instead of automating new tests or triaging actual product failures.

An ideal automation test base has minimal upkeep overhead and as few low reliability tests as possible. The best automated tests are the ones that need to be changed as little as possible once they’ve been written.

How to identify low reliability automated tests?

Depending on the size of the team and the number of automated tests, the test/automation engineers triaging the failures may have a good gut feeling for which test suites are low reliability, since they are frequently fixing the same suites. As the number of tests scales up, it becomes easier for low reliability test suites to fly under the “gut feeling” radar.

Since our quality team has hundreds of automated test suites and thousands of automated tests, we needed a good system to identify which suites were problematic. The solution we came up with was to have a custom field in our bug tracker (we use JIRA) representing the corresponding test suite name for each automation bug. Since we already had a project in JIRA which we used to track all bugs in the automation test base, the new field allowed us to craft queries to identify which test suites had the most automation bugs attached to them.
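As a rough illustration of the kind of query this enables, here is a minimal Python sketch. The project key `AUTO`, the custom field name `Test Suite`, and the `test_suite` dictionary key are all hypothetical names chosen for the example, and the issues are assumed to have already been fetched from the bug tracker into plain dictionaries.

```python
from collections import Counter


def build_suite_jql(project_key, suite_field="Test Suite"):
    """Build a JQL string selecting automation bugs that name a test suite.

    The project key and custom field name are assumptions for this sketch.
    """
    return (
        f'project = {project_key} AND "{suite_field}" is not EMPTY '
        "ORDER BY created DESC"
    )


def count_bugs_per_suite(issues, suite_key="test_suite"):
    """Return (suite, bug_count) pairs, most bug-prone suite first.

    `issues` is a list of dicts already retrieved from the tracker;
    issues without a suite name are ignored.
    """
    counts = Counter(
        issue[suite_key] for issue in issues if issue.get(suite_key)
    )
    return counts.most_common()


if __name__ == "__main__":
    print(build_suite_jql("AUTO"))
    sample = [
        {"key": "AUTO-1", "test_suite": "login_ui"},
        {"key": "AUTO-2", "test_suite": "login_ui"},
        {"key": "AUTO-3", "test_suite": "rest_users"},
        {"key": "AUTO-4", "test_suite": None},
    ]
    for suite, total in count_bugs_per_suite(sample):
        print(suite, total)
```

The suites at the top of the resulting list are the first candidates for a closer look.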

This also allowed us to build tools to use the JIRA APIs and JQL to map out automation test quality over time, identify which test suites were trending up, down, etc. We could also use this same method to determine which automation tests uncovered the most product bugs.
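One simple way to picture the trend analysis is to bucket automation bugs by suite and by month, then compare two months. The sketch below is illustrative only: the record layout (`test_suite`, `created`) and the function names are assumptions, and the issues are again assumed to have been fetched already.

```python
from collections import defaultdict
from datetime import date


def monthly_bug_counts(issues):
    """Group automation bugs into {suite: {"YYYY-MM": count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for issue in issues:
        month = issue["created"].strftime("%Y-%m")
        counts[issue["test_suite"]][month] += 1
    return {suite: dict(months) for suite, months in counts.items()}


def trend(months, earlier, later):
    """Compare two months for one suite: 'up', 'down' or 'flat'."""
    delta = months.get(later, 0) - months.get(earlier, 0)
    if delta > 0:
        return "up"
    if delta < 0:
        return "down"
    return "flat"


if __name__ == "__main__":
    sample = [
        {"test_suite": "login_ui", "created": date(2017, 5, 3)},
        {"test_suite": "login_ui", "created": date(2017, 6, 1)},
        {"test_suite": "login_ui", "created": date(2017, 6, 20)},
        {"test_suite": "rest_users", "created": date(2017, 5, 9)},
    ]
    by_suite = monthly_bug_counts(sample)
    for suite, months in by_suite.items():
        print(suite, trend(months, "2017-05", "2017-06"))
```

Comparing only two months is a deliberate simplification; a real report would likely look at a longer window before calling something a trend.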

Gleaning information from this data

Automated tests with high numbers of automation bugs associated with them represent the low reliability tests with high upkeep overhead. Since the entire point of automating tests is to save the time it would take to run them manually, an automated test has outlived its usefulness once it takes considerably more effort to maintain than it would to run manually.
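The threshold can be made concrete by comparing upkeep time against the manual effort the automation replaces over the same period. The sketch below is one possible way to express it; the 1.5 cut-off standing in for “considerably more effort” and all of the figures are assumptions for illustration, not measured values.

```python
def maintenance_ratio(upkeep_hours, manual_hours_per_run, runs):
    """Ratio of upkeep effort to the manual effort the automation replaces.

    All three inputs cover the same period (for example, one release).
    """
    manual_cost = manual_hours_per_run * runs
    if manual_cost <= 0:
        raise ValueError("manual cost must be positive")
    return upkeep_hours / manual_cost


def has_outlived_usefulness(upkeep_hours, manual_hours_per_run, runs,
                            threshold=1.5):
    """True when upkeep exceeds manual effort by more than `threshold` times.

    The default threshold is an arbitrary placeholder for this sketch.
    """
    ratio = maintenance_ratio(upkeep_hours, manual_hours_per_run, runs)
    return ratio > threshold


if __name__ == "__main__":
    # Quick manual check, costly upkeep: 12 h upkeep vs 0.25 h x 20 runs.
    print(has_outlived_usefulness(12, 0.25, 20))
    # Slow manual check, cheap upkeep: 1 h upkeep vs 2 h x 20 runs.
    print(has_outlived_usefulness(1, 2, 20))
```

The first call reports a test that costs more to maintain than to run by hand, while the second reports one that is clearly worth keeping automated.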

Automated test suites with few automation bugs represent the suites with the lowest overhead and highest reliability. These are the suites that will get us our highest return on investment, so we want to invest more effort in increasing coverage on those automation suites.

Automated test suites with high numbers of automation bugs represent the suites with the highest overhead and lowest reliability. Those suites need an intervention to keep them from monopolizing our attention.

Word of warning: not all bugs are equal. One difficult bug may outweigh ten simple bugs in overhead. This should be taken into consideration when measuring overhead.
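One way to account for this is to rank suites by the time spent on their automation bugs rather than by the raw bug count. The following is a minimal sketch; the `hours_spent` field is hypothetical and assumes the team logs effort against each bug.

```python
from collections import defaultdict


def overhead_by_suite(bugs):
    """Return {suite: (bug_count, total_hours)} for a list of bug records."""
    counts = defaultdict(int)
    hours = defaultdict(float)
    for bug in bugs:
        counts[bug["test_suite"]] += 1
        hours[bug["test_suite"]] += bug["hours_spent"]
    return {suite: (counts[suite], hours[suite]) for suite in counts}


def rank_by_hours(overhead):
    """Suite names ordered from most to least upkeep time."""
    return sorted(overhead, key=lambda suite: overhead[suite][1], reverse=True)


if __name__ == "__main__":
    # Ten simple half-hour fixes versus one difficult two-day fix.
    bugs = [{"test_suite": "login_ui", "hours_spent": 0.5} for _ in range(10)]
    bugs.append({"test_suite": "data_import", "hours_spent": 16.0})
    overhead = overhead_by_suite(bugs)
    print(overhead)
    print(rank_by_hours(overhead))
```

In this made-up data set the suite with a single bug comes out ahead of the suite with ten, which a plain count would have hidden.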

What to do with the low reliability tests?

The answer could depend on what is causing elevated automation bugs in these test suites.

It may simply be that the quality of the underlying test framework is very poor and the problem can be addressed with some refactoring.

It could also be that the feature being tested changes very frequently, in which case, perhaps it is better to hold off on automating that feature until the product code base stabilizes.

Another common reason is that the feature is extremely difficult to automate properly. This is a tricky category because it doesn’t necessarily mean the automated test is low-value. If the same test coverage is extremely difficult or time-consuming to run manually without automation, the extra overhead of maintaining the automated test may be worth it.

If a low reliability automated test case with low value cannot be salvaged, it should be removed from the automation framework and run manually instead.

Lessons learned

Since the product we test has an end-user facing web-based UI, this part of the product is very frequently being updated to meet ever-evolving web usability standards and popular design patterns. While excellent automation libraries exist out there to help make UI testing easy (e.g. Selenium), our team discovered the hard way that investing in automating the testing of this UI was simply not worth the effort. Since we were constantly up against a moving target, we were spending a significant amount of time trying to keep up with the changes, and to make matters worse these test cases were extremely quick and easy to run manually.

By comparison, the automated test cases exercising the product’s public REST APIs have been found to be our highest value automated test cases. Since public APIs should always be backwards compatible, we have to worry less about a moving target. These tests tend to be difficult to run manually but easy to automate, and once they’ve been automated they require very little overhead.

In conclusion, eliminating low-value automated tests, identifying test frameworks which need refactoring, and emphasizing the high-value, high-reliability suites can go a long way toward building an efficient automated test framework. It all starts with having the right data.