Why code review beats testing: evidence from decades of programming research

Posted on October 3, 2011

tl;dr If you want to ship high quality code, you should invest in more than one of formal code review, design inspection, testing, and quality assurance. Testing catches fewer bugs per hour than human inspection of code, but it might be catching different types of bugs.

Everyone wants to find bugs in their programs. But which methods are the most effective for finding bugs? I found this remarkable chart in chapter 20 of Steve McConnell’s Code Complete. Summarized here, the chart shows the typical percentage of bugs found using each bug detection technique. The range shows the high and low percentage across software organizations, with the center dot representing the average percentage of bugs found using that technique.

As McConnell notes, “The most interesting facts … are that the modal rates don’t rise above 75% for any one technique, and that the techniques average about 40 percent.” Especially noteworthy is the poor performance of testing compared to formal design review (human inspection). There are three pages of fascinating discussion that follow; I’ll try to summarize.

What does this mean for me?

No one approach to bug detection is adequate. Capers Jones – the researcher behind two of the papers McConnell cites – divides bug detection into four categories: formal design inspection, formal code inspection, formal quality assurance, and formal testing. The best bug detection rate, if you are only using one of the above four categories, is 68%. The average detection rate if you are using all four is 99%.
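One way to build intuition for why stacking techniques helps so much is a toy model. This is a minimal sketch, not Jones's method: it assumes each technique catches bugs independently of the others, and the per-technique rates are hypothetical placeholders rather than figures from the cited papers.

```python
from math import prod


def combined_detection_rate(rates):
    """Fraction of bugs caught by at least one technique.

    Assumes the techniques miss bugs independently of each other,
    which is a simplification and not a claim made by the cited studies.
    """
    return 1 - prod(1 - rate for rate in rates)


# Hypothetical rates for four techniques (not data from Jones or McConnell).
hypothetical_rates = [0.5, 0.5, 0.5, 0.5]

print(f"one technique:   {combined_detection_rate(hypothetical_rates[:1]):.1%}")
print(f"four techniques: {combined_detection_rate(hypothetical_rates):.1%}")
```

In practice techniques overlap in the bugs they catch, so real combined rates have to be measured rather than derived this way.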

But a less-effective method may be worth it, if it’s cheap enough

It’s well known that bugs found early on in program development are cheaper to fix; costs increase as you have to push changes to more users, someone else has to dig through code that they didn’t write to find the bug, etc. So while a high-volume beta test is highly effective at finding bugs, it may be more expensive to implement this (and you may develop a reputation for releasing highly buggy software).

Shull et al. (2002) estimate that non-severe defects take approximately 14 hours of debugging effort after release, but only 7.4 hours before release, meaning that non-critical bugs are roughly twice as expensive to fix after release, on average. However, the multiplier becomes much, much larger for severe bugs: according to several estimates, severe bugs are 100 times more expensive to fix after shipping than they are to fix before shipping.
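The arithmetic behind the "twice as expensive" figure can be checked directly. The sketch below uses only the two hour estimates quoted above; the function and constant names are made up for illustration.

```python
# Debugging effort for a non-severe defect, as quoted above from Shull et al. (2002).
HOURS_AFTER_RELEASE = 14.0
HOURS_BEFORE_RELEASE = 7.4


def cost_multiplier(after_hours, before_hours):
    """How many times more effort a fix takes after release than before."""
    return after_hours / before_hours


non_severe = cost_multiplier(HOURS_AFTER_RELEASE, HOURS_BEFORE_RELEASE)
print(f"non-severe multiplier: {non_severe:.2f}x")
```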

More generally, inspections are a cheaper method of finding bugs than testing; according to Basili and Selby (1987), code reading detected 80 percent more faults per hour than testing, even when testing programmers on code that contained zero comments. This went against the intuition of the professional programmers, which was that structural testing would be the most efficient method.

How did the researchers measure efficiency?

In each case the efficiency was calculated by taking the number of bug reports found through a specific bug detection technique, and then dividing by the total number of reported bugs.

The researchers conducted a number of different experiments to try and measure numbers accurately. Here are some examples:

Giving programmers a program with 15 known bugs, telling them to use a variety of techniques to find bugs, and observing how many they find (no one found more than 9, and the average was 5)
Giving programmers a specification, measuring how long it took them to write the code, and how many bugs existed in the code base
Formal inspections of processes at companies that produce millions of lines of code, like NASA.
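The efficiency calculation described above is a simple ratio. Here is a minimal sketch applied to the 15-known-bug experiment; the function name is an assumption for illustration, and only the counts (15 known bugs, an average of 5 found, a best of 9) come from the text.

```python
def detection_efficiency(bugs_found, total_bugs):
    """Share of all known bugs that a technique (or a person) found."""
    if total_bugs <= 0:
        raise ValueError("total_bugs must be positive")
    return bugs_found / total_bugs


TOTAL_KNOWN_BUGS = 15  # bugs seeded in the program, per the experiment above

average = detection_efficiency(5, TOTAL_KNOWN_BUGS)  # average programmer
best = detection_efficiency(9, TOTAL_KNOWN_BUGS)     # best programmer

print(f"average: {average:.0%}, best: {best:.0%}")
```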

In our company we have low bug rates, thanks to our adherence to software philosophy X.

Your favorite software philosophy probably advocates using several different methods listed above to help detect bugs. Pivotal Labs is known for pair programming, which some people rave about, and some people hate. Pair programming means that with little effort they’re getting informal code review, informal design review and personal desk-checking of code. Combine that with any kind of testing and they are going to catch a lot of bugs.

Before any code at Google gets checked in, one owner of the code base must review and approve the change (formal code review and design inspection). Google also enforces unit tests, as well as a suite of automated tests, fuzz tests, and end to end tests. In addition, everything gets dogfooded internally before a release to the public (high-volume beta test).

It’s likely that any company with a reputation for shipping high-quality code will have systems in place covering at least three of the categories mentioned above.

Conclusions

If you want to ship high quality code, you should invest in more than one of formal code review, design inspection, testing, and quality assurance. Testing catches fewer bugs per hour than human inspection of code, but it might be catching different types of bugs. You should definitely try to measure where you are finding bugs in your code and the percentage of bugs you are catching before release – 85% is poor, and 99% is exceptional.

Appendix

The chart was created using the jQuery plotting library flot. Here’s the raw JavaScript, and the CoffeeScript that generated the graph.

References

Steve McConnell’s Code Complete, which I’d recommend for anyone who’s interested in improving the quality of the code they write.
Capers Jones, “Software Defect-Removal Efficiency,” Computer, volume 29, issue 4. Unfortunately, the paper is gated.
Forrest Shull et al, “What We Have Learned About Fighting Software Defects,” 2002
Victor Basili and Richard Selby, “Comparing the Effectiveness of Software Testing Strategies,” IEEE Transactions on Software Engineering, volume 13, issue 12.


42 thoughts on “Why code review beats testing: evidence from decades of programming research”

Michael Favia October 3, 2011 at 6:02 pm

Thanks for the great article, quite informative. As if to prove your point, you have a “bug” in your article that I found after a quick code read. I assume that this:

It’s well known that it’s more expensive to find bugs early on in program development;

is meant to be:

It’s well known that it’s less expensive to find bugs early on in program development;

Thanks again otherwise.

1. kevin (Post author) October 3, 2011 at 6:54 pm
  
  Thanks Michael – it should be fixed now.
  
Duncan Bayne October 3, 2011 at 6:03 pm

“It’s well known that it’s more expensive to find bugs early on in program development;”

Surely you mean less expensive? Or more efficient?

1. kevin (Post author) October 3, 2011 at 6:53 pm
  
  Thanks – it should be fixed now.
  
Glenn Smith October 3, 2011 at 9:18 pm

While I agree finding early is cheaper and better, I do wonder about the costs of inspections, particularly doing them well; developers in my experience often think they are above doing code inspections and that the test organisation should find the issues. This is partly cultural and linked to having separate dev and test.

My concern on costs is that a lot of the studies that prove the benefits are from the 80s or 90s. Today CPU time is much cheaper, so I’d like to see data that still proves human inspection is better than mass automated regression and UT.

Also these studies don’t always clarify what they mean by inspection. Are they talking post coded Fagan style with four people or are they referring to peer programming where it’s inspected on the fly?

Don’t get me wrong, I’m not trying to disagree, but would love to see a study that answered these queries.

Steve Phillips October 3, 2011 at 10:51 pm

These statistics sound compelling, but you assume that today’s formal/automated testing (and code reviews, for that matter) is just as effective as it was in the ’80s. Today’s testing frameworks are no better than ~25 years ago, really? I don’t know one way or the other, but is that the case?

David Thielen October 3, 2011 at 11:05 pm

I’m not sure it works the same now as before, for two reasons. First, all the studies of efforts in the past were of projects where the program was written over 9 months – 3 years and then released all at once. That is very different from today where new features dribble out weekly.

Second, a lot of the controlled tests use a small self-contained program to review/test/etc. But very few programs match that model of a complete program that you can easily fully picture at once.

We have exhaustive unit tests. And informal design reviews as/when it makes sense. But we stopped doing formal code reviews because we weren’t finding anything from them.

What has been of substantial help is we release new features as soon as they are complete and tested. We release that as a beta and it is immediately used by the customers who find that new feature valuable. This provides a limited beta, but one where the beta testers are fully using the new functionality.

We also resolve issues found in the beta fast. We do a new public build about every third Friday. The rule is any Friday where there is even a single sev 1 or 2 bug fixed in the latest code. But even if there are none, about every 3 – 4 weeks we have a new feature and so need a new build to ship that.

What works for us works because of the developers we have (they’re very good), the type of app we have (the parts are very separable), and releasing each feature as it is complete. It won’t work for all.

But I do think one should be wary of using studies that don’t match one’s own environment.

AlexForbes October 4, 2011 at 4:44 am

Interesting article. Those reading may be interested in also taking a look at the largest code review study ever done, with Cisco using CodeCollaborator, within the free eBook, http://smartbear.com/best-kept-secrets-of-peer-code-review/

Also, an interesting assessment on pair programming vs. other methods of code review may be found here:
http://www.softwarequalityconnection.com/2011/04/does-pair-programming-obviate-the-need-for-code-review/
and here: http://smartbear.com/resources/cc/Episode_3_ProsAndConsOfFourKindsOfCodeReview.pdf

Good luck with all your software quality practices.

Alex

John October 4, 2011 at 5:03 am

Some sources say that most bugs come from missing code, code that should have been written but wasn’t rather than code that was written incorrectly. And such omissions would be easier to find in a review. They’d be almost impossible to find via unit testing because if you didn’t think to write the code, you probably didn’t think to write a unit test for it either.

Winfield October 4, 2011 at 6:04 am

That’s a nice infographic but I don’t think it has any bearing on reality.

Automated testing is the only way to verify software behaves the way you expect it, by codifying those expectations in tests. This is the way you will find 98% of the bugs in your application.

Code review and pair programming are great ways to prevent mistakes and find different kinds of bugs, where the design can be better, or something conceptually misses the mark. It’s an inherently different kind of testing that finds different kinds of bugs.

1. Sumana Harihareswara October 4, 2011 at 3:10 pm
  
  “This is the way you will find 98% of the bugs in your application.”
  
  Citation needed. :-)
  
Martin October 4, 2011 at 8:59 am

Interesting.

The great results you mention are obtained by following a “formal code inspection” process. The “informal code review process” that–in my experience–most organizations follow does not produce such good results and is in fact worse than unit testing.

On quick reading, the Google process seems more informal than formal (contrast with section 21.3 of Code Complete, for example).

It has been reported that the most effective code inspections are those where different reviewers are assigned responsibility for reviewing different aspects of the code, such as security, concurrency, algorithms, etc.

1. kevin (Post author) October 4, 2011 at 9:59 am
  
  Interesting – do you have data for the “different aspects” approach?
  
Martin October 4, 2011 at 11:17 am

Can’t put my finger on the reference right at the moment.

The underlying concept is “perspective based reading”, which is shown to be more effective than, e.g., checklists for inspection of requirements, designs, etc. (For example: http://www.tol.oulu.fi/projects/tarjous/perspective.pdf)

This was extended to “technical aspects” as perspectives instead of “roles”.

DAve Kirby October 4, 2011 at 2:13 pm

This misses the point that some testing practices such as TDD are for preventing bugs in the first place, not for detecting them after the fact. If you write tests first and then write the code to make the tests pass (perhaps running the tests many times while you write the code) you will find that the tests detect very few bugs for the simple reason that the bugs would have been detected so early in the development cycle that they would not even register as bugs.

I believe that the graph and metrics quoted in the article come from research done in the 1980s when the waterfall process dominated, so may not apply so well to modern agile practices.

Andrew Pennebaker October 4, 2011 at 3:34 pm

Which category does QuickCheck fall in?

http://www.haskell.org/haskellwiki/Introduction_to_QuickCheck

Johan Martinsson October 5, 2011 at 4:13 am

I can’t believe this post doesn’t even mention TDD. I’m not going to argue about its efficiency, but not mentioning it??

Edsger Dijkstra made a famous statement that we now associate with lean development methods:
“If you want more effective programmers, you will discover that they should not waste their time debugging, they should not introduce the bugs to start with.”

1. Bernhard M. Wiedemann October 6, 2011 at 6:04 am
  
  Dijkstra also wrote
  “Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence.”
  (likely because there is a nearly infinite number of possible inputs and code-paths)
  
  yet, automated testing is definitely useful.
  
Rakesh Waghela October 5, 2011 at 8:50 am

Did you ever play Holi? Your photo says so ;)

Kevlin Henney October 6, 2011 at 4:31 am

It’s important to have more than one technique focused on checking you’ve done the right thing in the right way. Like binocular vision, you get a better sense of the whole and see more than you might otherwise have done from a single point of view.

However, it is worth understanding what the quoted studies from the past actually report. The one thing they do not reveal is that testing is less effective than other means. What they reveal is that the sooner you find something, the cheaper it is.

So why do they seem to suggest testing is less effective? These results suffer from what we can call lifecycle bias: they are invariably measurements of a sequential process model where testing activities are placed at the backend of the development lifecycle, i.e., some distance in time from when code is written, and where requirements framing is strictly separated from implementation and other activities.

So first off, most of what is being measured is the effect of separation in time, and not the specific effectiveness of a particular activity. It turns out that when you reduce and remove these separations, such as in an iterative and incremental lifecycle, the effectiveness of all activities improves.

We see this in the case of code reviews, which, in most studies done on effectiveness of code reviews, are collocated in time with coding activities in the implementation phase.

Even more interesting is when activities such as unit testing are done along with the code, rather than some time after. The measurements in these cases do not show an increase in the number of bugs found: they show that bugs are less likely to be introduced in the first place. The goal is not to increase the number of bugs you find; the goal is to adopt practices that reduce the possibility of bugs.

So while multiple practices, code reviews included, are important, it is not quite for the reasons outlined in the article.

Karlo Smid October 6, 2011 at 11:59 pm

Hi!

I do not agree with your statement.
Here are my facts.

Code review is code checking. My question is: “Why do you find so many issues in the first place while doing code review?” Who is writing that code? If you find more issues during code review than in the testing phase, your developers do not care about code quality, or do not know much about code quality. Code reviews are OK if you are a mentor and need to train a new employee.

Also metrics in testing business are very dangerous.
Statement:
“You should definitely try to measure where you are finding bugs in your code and the percentage of bugs you are catching before release – 85% is poor, and 99% is exceptional.”

The basic mathematics is:
99% multiplied by X. My question is: What is X?

Regards, Karlo.

1. Karlo Smid October 10, 2011 at 3:24 am
  
  Reply to myself:
  Why is code review so important?
  In most organisations there is no requirements configuration management of “hidden requirements”. In order to reveal hidden requirements (every team member has a piece of those), we need to have code review sessions that bring those hidden requirements out to the rest of the team.
  There are also “common sense” requirements. Experienced developers introduce them to junior developers during those sessions.
  
Alex Forbes April 11, 2012 at 1:45 pm

Why code review?
This may help explain.

http://www.softwarequalityconnection.com/2012/03/4-reasons-developers-resist-code-review-and-why-they-shouldnt/

Sushilkumar N Gehi June 27, 2015 at 11:45 am

There is so much compelling literature on code review and inspection.

I am wondering if we have data points or metrics on the amount of time/resources/dollars spent on code review/inspection vis-à-vis testing.

On the projects where I have worked, the time/resources/dollars spent on testing activities were at least 5 times more than those spent on code review/inspection.


KEVIN BURKE