Why code review beats testing: evidence from decades of programming research

tl;dr If you want to ship high quality code, you should invest in more than one of formal code review, design inspection, testing, and quality assurance. Testing catches fewer bugs per hour than human inspection of code, but it might be catching different types of bugs.

Everyone wants to find bugs in their programs. But which methods are the most effective for finding bugs? I found this remarkable chart in chapter 20 of Steve McConnell’s Code Complete. Summarized here, the chart shows the typical percentage of bugs found using each bug detection technique. The range shows the high and low percentage across software organizations, with the center dot representing the average percentage of bugs found using that technique.

As McConnell notes, “The most interesting facts … are that the modal rates don’t rise above 75% for any one technique, and that the techniques average about 40 percent.” Especially noteworthy is the poor performance of testing compared to formal design review (human inspection). There are three pages of fascinating discussion that follow; I’ll try to summarize.

What does this mean for me?

No one approach to bug detection is adequate. Capers Jones – the researcher behind two of the papers McConnell cites – divides bug detection into four categories: formal design inspection, formal code inspection, formal quality assurance, and formal testing. The best bug detection rate, if you are only using one of the above four categories, is 68%. The average detection rate if you are using all four is 99%.
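
A rough way to see why layering techniques helps is sketched below. It assumes each technique misses bugs independently of the others, which is a simplification made here for illustration and not a claim from Jones or McConnell. The per-technique rates are hypothetical placeholders, not figures from the studies.

```python
from math import prod


def combined_detection_rate(rates):
    """Fraction of bugs caught by at least one technique.

    Assumes misses are independent across techniques (a simplification).
    """
    for rate in rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate must be between 0 and 1, got {rate}")
    # A bug escapes only if every technique misses it.
    escape_probability = prod(1.0 - rate for rate in rates)
    return 1.0 - escape_probability


# Hypothetical per-category detection rates, for illustration only.
hypothetical_rates = {
    "formal design inspection": 0.55,
    "formal code inspection": 0.60,
    "formal quality assurance": 0.40,
    "formal testing": 0.45,
}

combined = combined_detection_rate(hypothetical_rates.values())
print(f"Combined detection rate: {combined:.1%}")
```

Under these made-up inputs no single technique gets past 60%, yet the combination catches well over 90%. Real techniques overlap in what they catch, so measured numbers will differ.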

But a less-effective method may be worth it, if it’s cheap enough

It’s well known that bugs found early in program development are cheaper to fix; costs rise once you have to push changes to more users, or someone else has to dig through code they didn’t write to find the bug. So while a high-volume beta test is highly effective at finding bugs, it may also be an expensive way to find them (and you may develop a reputation for releasing highly buggy software).

Shull et al. (2002) estimate that non-severe defects take approximately 14 hours of debugging effort after release, but only 7.4 hours before release, meaning that non-critical bugs are roughly twice as expensive to fix after release, on average. However, the multiplier becomes much larger for severe bugs: according to several estimates, severe bugs are 100 times more expensive to fix after shipping than before.
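
The arithmetic behind the "twice as expensive" figure can be checked in a few lines. The two hour values come from the Shull et al. estimate quoted above; the function names and the count of ten escaped defects are hypothetical, chosen only for illustration.

```python
# Debugging effort for a non-severe defect, per the estimate quoted above.
HOURS_AFTER_RELEASE = 14.0
HOURS_BEFORE_RELEASE = 7.4


def cost_multiplier(after_hours, before_hours):
    """How many times more effort a fix takes after release than before."""
    if before_hours <= 0:
        raise ValueError("before_hours must be positive")
    return after_hours / before_hours


def extra_hours(escaped_defects,
                after_hours=HOURS_AFTER_RELEASE,
                before_hours=HOURS_BEFORE_RELEASE):
    """Additional effort spent because defects were found after release."""
    return escaped_defects * (after_hours - before_hours)


ratio = cost_multiplier(HOURS_AFTER_RELEASE, HOURS_BEFORE_RELEASE)
print(f"Post-release multiplier: {ratio:.2f}x")

# Hypothetical example: ten non-severe defects slip past release.
print(f"Extra effort for 10 escaped defects: {extra_hours(10):.1f} hours")
```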

More generally, inspections are a cheaper method of finding bugs than testing; according to Basili and Selby (1987), code reading detected 80 percent more faults per hour than testing, even when testing programmers on code that contained zero comments. This went against the intuition of the professional programmers, which was that structural testing would be the most efficient method.

How did the researchers measure efficiency?

In each case the efficiency was calculated by taking the number of bug reports found through a specific bug detection technique, and then dividing by the total number of reported bugs.

The researchers conducted a number of different experiments to try to measure these numbers accurately. Here are some examples:

  • Giving programmers a program with 15 known bugs, telling them to use a variety of techniques to find bugs, and observing how many they find (no one found more than 9, and the average was 5)
  • Giving programmers a specification, then measuring how long it took them to write the code and how many bugs the resulting code contained
  • Formal inspections of processes at organizations that produce millions of lines of code, like NASA
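
The calculation described above (bugs found by a technique divided by total known bugs) can be sketched using the figures from the first experiment: 15 known bugs, an average of 5 found, and a best result of 9. The function name is hypothetical.

```python
def detection_rate(bugs_found, total_known_bugs):
    """Share of the known bugs that a technique (or a person) uncovered."""
    if total_known_bugs <= 0:
        raise ValueError("total_known_bugs must be positive")
    if not 0 <= bugs_found <= total_known_bugs:
        raise ValueError("bugs_found must be between 0 and total_known_bugs")
    return bugs_found / total_known_bugs


KNOWN_BUGS = 15  # bugs known to be in the program given to participants

average_rate = detection_rate(5, KNOWN_BUGS)  # the average participant
best_rate = detection_rate(9, KNOWN_BUGS)     # the best participant

print(f"Average participant: {average_rate:.1%}")
print(f"Best participant:    {best_rate:.1%}")
```

Even the best participant left 6 of the 15 known bugs undiscovered.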

In our company we have low bug rates, thanks to our adherence to software philosophy X.

Your favorite software philosophy probably advocates using several different methods listed above to help detect bugs. Pivotal Labs is known for pair programming, which some people rave about, and some people hate. Pair programming means that with little effort they’re getting informal code review, informal design review and personal desk-checking of code. Combine that with any kind of testing and they are going to catch a lot of bugs.

Before any code at Google gets checked in, one owner of the code base must review and approve the change (formal code review and design inspection). Google also enforces unit tests, as well as a suite of automated tests, fuzz tests, and end to end tests. In addition, everything gets dogfooded internally before a release to the public (high-volume beta test).

It’s likely that any company with a reputation for shipping high-quality code has systems in place covering at least three of the four categories mentioned above.

Conclusions

If you want to ship high quality code, you should invest in more than one of formal code review, design inspection, testing, and quality assurance. Testing catches fewer bugs per hour than human inspection of code, but it might be catching different types of bugs. You should definitely try to measure where you are finding bugs in your code and the percentage of bugs you are catching before release – 85% is poor, and 99% is exceptional.
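
A minimal sketch of that measurement follows, assuming you keep a bug log that records how each bug was found and whether it was caught before release. The `BugRecord` type, its field names, and the sample entries are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BugRecord:
    bug_id: str
    found_by: str      # e.g. "code review", "unit test", "customer report"
    pre_release: bool  # True if the bug was caught before shipping


def pre_release_capture_rate(bugs):
    """Fraction of all recorded bugs that were caught before release."""
    bugs = list(bugs)
    if not bugs:
        raise ValueError("need at least one bug record")
    return sum(bug.pre_release for bug in bugs) / len(bugs)


def bugs_by_source(bugs):
    """Count how many bugs each detection method surfaced."""
    return Counter(bug.found_by for bug in bugs)


# Hypothetical log entries, for illustration only.
sample_log = [
    BugRecord("BUG-1", "code review", True),
    BugRecord("BUG-2", "code review", True),
    BugRecord("BUG-3", "unit test", True),
    BugRecord("BUG-4", "beta test", True),
    BugRecord("BUG-5", "customer report", False),
]

print(f"Caught before release: {pre_release_capture_rate(sample_log):.0%}")
for source, count in bugs_by_source(sample_log).most_common():
    print(f"{source}: {count}")
```

The rate only counts bugs that have been reported so far, so it tends to look better right after a release than it will a few months later.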

Appendix

The chart was created using the jQuery plotting library flot. Here’s the raw JavaScript, and the CoffeeScript that generated the graph.

References

39 thoughts on “Why code review beats testing: evidence from decades of programming research”

  1. Michael Favia

    Thanks for the great article, quite informative. As if to prove your point, you have a “bug” in your article that I found after a quick code read. I assume that this:

    It’s well known that it’s more expensive to find bugs early on in program development;

    is meant to be:

    It’s well known that it’s less expensive to find bugs early on in program development;

    Thanks again otherwise.

  2. Glenn Smith

    While I agree finding early is cheaper and better, I do wonder about the costs of inspections, particularly of doing them well; developers in my experience often think they are above doing code inspections and that the test organisation should find the issues. This is partly cultural and linked to having separate dev and test.

    My concern on costs is that a lot of the studies that prove the benefits are from the 80s or 90s. Today CPU time is much cheaper, so I’d like to see data that still proves human inspection is better than mass automated regression and UT.

    Also these studies don’t always clarify what they mean by inspection. Are they talking post coded Fagan style with four people or are they referring to peer programming where it’s inspected on the fly?

    Don’t get me wrong, I’m not trying to disagree, but would love to see a study that answered these queries.

  3. Steve Phillips

    These statistics sound compelling, but you assume that today’s formal/automated testing (and code reviews, for that matter) is just as effective as it was in the ’80s. Today’s testing frameworks are no better than ~25 years ago, really? I don’t know one way or the other, but is that the case?

  4. David Thielen

    I’m not sure it works the same now as before, for two reasons. First, all the studies of efforts in the past were of projects where the program was written over 9 months – 3 years and then released all at once. That is very different from today where new features dribble out weekly.

    Second, a lot of the controlled tests use a small self-contained program to review/test/etc. But very few programs match that model of a complete program that you can easily fully picture at once.

    We have exhaustive unit tests. And informal design reviews as/when it makes sense. But we stopped doing formal code reviews because we weren’t finding anything from them.

    What has been of substantial help is we release new features as soon as they are complete and tested. We release that as a beta and it is immediately used by the customers who find that new feature valuable. This provides a limited beta, but one where the beta testers are fully using the new functionality.

    We also resolve issues found in the beta fast. We do a new public build about every third Friday. The rule is any Friday where there is even a single sev 1 or 2 bug fixed in the latest code. But even if there are none, about every 3 – 4 weeks we have a new feature and so need a new build to ship that.

    What works for us works because of the developers we have (they’re very good), the type of app we have (the parts are very separable), and releasing each feature as it is complete. It won’t work for all.

    But I do think one should be wary of using studies that don’t match one’s own environment.

  6. AlexForbes

    Interesting article. Those reading may be interested in also taking a look at the largest code review study ever done, with Cisco using CodeCollaborator, within the free eBook, http://smartbear.com/best-kept-secrets-of-peer-code-review/

    Also, an interesting assessment on pair programming vs. other methods of code review may be found here:
    http://www.softwarequalityconnection.com/2011/04/does-pair-programming-obviate-the-need-for-code-review/
    and here: http://smartbear.com/resources/cc/Episode_3_ProsAndConsOfFourKindsOfCodeReview.pdf

    Good luck with all your software quality practices.

    Alex

  7. John

    Some sources say that most bugs come from missing code, code that should have been written but wasn’t rather than code that was written incorrectly. And such omissions would be easier to find in a review. They’d be almost impossible to find via unit testing because if you didn’t think to write the code, you probably didn’t think to write a unit test for it either.

  8. Winfield

    That’s a nice infographic but I don’t think it has any bearing on reality.

    Automated testing is the only way to verify software behaves the way you expect it, by codifying those expectations in tests. This is the way you will find 98% of the bugs in your application.

    Code review and pair programming are great ways to prevent mistakes and find different kinds of bugs, where the design can be better, or something conceptually misses the mark. It’s an inherently different kind of testing that finds different kinds of bugs.

  9. Martin

    Interesting.

    The great results you mention are obtained by following a “formal code inspection” process. The “informal code review process” that–in my experience–most organizations follow does not produce such good results and is in fact worse than unit testing.

    On quick reading, the Google process seems more informal than formal (contrast with section 21.3 of Code Complete, for example).

    It has been reported that the most effective code inspections are those where different reviewers are assigned responsibility for reviewing different aspects of the code, such as security, concurrency, algorithms, etc.

  10. DAve Kirby

    This misses the point that some testing practices such as TDD are for preventing bugs in the first place, not for detecting them after the fact. If you write tests first and then write the code to make the tests pass (perhaps running the tests many times while you write the code) you will find that the tests detect very few bugs for the simple reason that the bugs would have been detected so early in the development cycle that they would not even register as bugs.

    I believe that the graph and metrics quoted in the article come from research done in the 1980s when the waterfall process dominated, so may not apply so well to modern agile practices.

  11. Johan Martinsson

    I can’t believe this post doesn’t even mention TDD. I’m not going to argue about its efficiency, but not mentioning it??

    Edsger Dijkstra made a famous statement that we now associate with lean development methods:
    “If you want more effective programmers, you will discover that they should not waste their time debugging, they should not introduce the bugs to start with.”

    1. Bernhard M. Wiedemann

      Dijkstra also wrote
      “Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence.”
      (likely because there is a nearly infinite number of possible inputs and code-paths)

      yet, automated testing is definitely useful.

  12. Kevlin Henney

    It’s important to have more than one technique focused on checking you’ve done the right thing in the right way. Like binocular vision, you get a better sense of the whole and see more than you might otherwise have done from a single point of view.

    However, it is worth understanding what the quoted studies from the past actually report. The one thing they do not reveal is that testing is less effective than other means. What they reveal is that the sooner you find something, the cheaper it is.

    So why do they seem to suggest testing is less effective? These results suffer from what we can call lifecycle bias: they are invariably measurements of a sequential process model where testing activities are placed at the backend of the development lifecycle, i.e., some distance in time from when code is written, and where requirements framing is strictly separated from implementation and other activities.

    So first off, most of what is being measured is the effect of separation in time, and not the specific effectiveness of a particular activity. It turns out that when you reduce and remove these separations, such as in an iterative and incremental lifecycle, the effectiveness of all activities improves.

    We see this in the case of code reviews, which, in most studies done on effectiveness of code reviews, are collocated in time with coding activities in the implementation phase.

    Even more interesting is when activities such as unit testing are done along with the code, rather than some time after. The measurements in these cases do not show an increase in the number of bugs found: they show that bugs are less likely to be introduced in the first place. The goal is not to increase the number of bugs you find; the goal is to adopt practices that reduce the possibility of bugs.

    So while multiple practices, code reviews included, are important, it is not quite for the reasons outlined in the article.

  13. Karlo Smid

    Hi!

    I do not agree with your statement.
    Here are my facts.

    Code review is code checking. My question is “Why do you find so many issues in the first place while doing code review”? Who is writing that code? If you find more issues during code review than in the testing phase, your developers either do not care about code quality or do not know much about it. Code reviews are OK if you are a mentor and need to train a new employee.

    Also metrics in testing business are very dangerous.
    Statement:
    “You should definitely try to measure where you are finding bugs in your code and the percentage of bugs you are catching before release – 85% is poor, and 99% is exceptional.”

    Basic mathematics is:
    99% multiply X. My question is: What is X?

    Regards, Karlo.

    1. Karlo Smid

      Reply to myself:
      Why is code review so important?
      In most organisations there is no requirements configuration management of “hidden requirements”. To reveal hidden requirements (every team member has a piece of those), we need code review sessions that bring them out to the rest of the team.
      There are also “common sense” requirements. Experienced developers introduce them to junior developers during those sessions.

