Why code review beats testing: evidence from decades of programming research

tl;dr If you want to ship high quality code, you should invest in more than one of formal code review, design inspection, testing, and quality assurance. Testing catches fewer bugs per hour than human inspection of code, but it might be catching different types of bugs.

Everyone wants to find bugs in their programs. But which methods are the most effective for finding bugs? I found this remarkable chart in chapter 20 of Steve McConnell’s Code Complete. Summarized here, the chart shows the typical percentage of bugs found using each bug detection technique. The range shows the high and low percentage across software organizations, with the center dot representing the average percentage of bugs found using that technique.

As McConnell notes, “The most interesting facts … are that the modal rates don’t rise above 75% for any one technique, and that the techniques average about 40 percent.” Especially noteworthy is the poor performance of testing compared to formal design review (human inspection). There are three pages of fascinating discussion that follow; I’ll try to summarize.

What does this mean for me?

No one approach to bug detection is adequate. Capers Jones – the researcher behind two of the papers McConnell cites – divides bug detection into four categories: formal design inspection, formal code inspection, formal quality assurance, and formal testing. The best bug detection rate, if you are only using one of the above four categories, is 68%. The average detection rate if you are using all four is 99%.

But a less-effective method may be worth it, if it’s cheap enough

It’s well known that bugs found early on in program development are cheaper to fix; costs increase as you have to push changes to more users, someone else has to dig through code that they didn’t write to find the bug, etc. So while a high-volume beta test is highly effective at finding bugs, it may be more expensive to implement this (and you may develop a reputation for releasing highly buggy software).

Shull et al (2002) estimate that non-severe defects take approximately 14 hours of debugging effort after release, but only 7.4 hours before release, meaning that non-critical bugs are twice as expensive to fix after release, on average. However, the multiplier becomes much, much larger for severe bugs: according to several estimates severe bugs are 100 times more expensive to fix after shipping than they are to fix before shipping.

More generally, inspections are a cheaper method of finding bugs than testing; according to Basili and Selby (1987), code reading detected 80 percent more faults per hour than testing, even when testing programmers on code that contained zero comments. This went against the intuition of the professional programmers, which was that structural testing would be the most efficient method.

How did the researchers measure efficiency?

In each case the efficiency was calculated by taking the number of bug reports found through a specific bug detection technique, and then dividing by the total number of reported bugs.
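As a rough illustration of that ratio, here is a minimal Python sketch. The function name detection_rate is hypothetical, and the inputs are the figures from the 15-known-bug experiment described in this section (an average of 5 found, a best of 9); it is not code from any of the cited studies.

```python
def detection_rate(bugs_found, total_bugs):
    """Fraction of all known bugs that one technique (or person) caught."""
    if total_bugs <= 0:
        raise ValueError("total_bugs must be positive")
    return bugs_found / total_bugs


# Figures from the experiment with 15 known bugs.
average_rate = detection_rate(5, 15)
best_rate = detection_rate(9, 15)

print(f"average: {average_rate:.0%}, best: {best_rate:.0%}")
```

Tracking the same ratio for your own releases only requires counting where each bug was caught and how many were reported in total.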

The researchers conducted a number of different experiments to try and measure numbers accurately. Here are some examples:

  • Giving programmers a program with 15 known bugs, telling them to use a variety of techniques to find bugs, and observing how many they find (no one found more than 9, and the average was 5)
  • Giving programmers a specification, measuring how long it took them to write the code, and how many bugs existed in the code base
  • Formal inspections of processes at companies that produce millions of lines of code, like NASA.

In our company we have low bug rates, thanks to our adherence to software philosophy X.

Your favorite software philosophy probably advocates using several different methods listed above to help detect bugs. Pivotal Labs is known for pair programming, which some people rave about, and some people hate. Pair programming means that with little effort they’re getting informal code review, informal design review and personal desk-checking of code. Combine that with any kind of testing and they are going to catch a lot of bugs.

Before any code at Google gets checked in, one owner of the code base must review and approve the change (formal code review and design inspection). Google also enforces unit tests, as well as a suite of automated tests, fuzz tests, and end to end tests. In addition, everything gets dogfooded internally before a release to the public (high-volume beta test).

It’s likely that any company with a reputation for shipping high-quality code will have systems in place to test at least three of the categories mentioned above for software quality.

Conclusions

If you want to ship high quality code, you should invest in more than one of formal code review, design inspection, testing, and quality assurance. Testing catches fewer bugs per hour than human inspection of code, but it might be catching different types of bugs. You should definitely try to measure where you are finding bugs in your code and the percentage of bugs you are catching before release – 85% is poor, and 99% is exceptional.

Appendix

The chart was created using the jQuery plotting library flot. Here’s the raw Javascript, and the CoffeeScript that generated the graph.

Chrome extension to drop jQuery into any site

Chrome extensions are sandboxed. This means that, while they can access the page, they can’t interact with JavaScript variables that are live on the page or in other Chrome extensions. All in all this is probably a good thing, because the global namespace gets polluted easily, and there are probably security reasons for the restriction as well.

However, this makes debugging Chrome extensions a pain in the neck, because none of the variables that are live in your extension are live in the page itself. This means that you can’t really debug them in the console. Lots of times all I want to do is inspect some objects on the page with jQuery and test that I’m selecting the right object on the page. I have jQuery in my extension, but I can’t run jQuery on the page, because the page doesn’t load it.

So I wrote a Chrome extension that’ll drop jQuery into any page you are testing, so you can inspect things from the console. It’s pretty simple, but it will save me from writing console.log statements into my Chrome content script, tabbing over to chrome://extensions, clicking “Reload”, refreshing the page, and then clicking around to trigger my behavior. You want to enable it while you are debugging your app, and then disable it when you’re done; otherwise you’ll take a performance hit from the extra external request. It’s about three lines of JavaScript, but they are a good three lines.

You can grab the extension here.

How long does it take for shuffled poker chips to return to their original state?

I like to shuffle poker chips while I work. It gives me something to do while pages are loading or while I'm thinking that doesn't involve checking Twitter or Hacker News.

I got curious how long it took for chip stacks of various heights to return to their original state. However, right now I can only shuffle stacks of 18 chips (9 red, 9 blue) before they tumble out of my hand and onto the floor. So I wrote a short program to simulate chip shuffles and determine how many shuffles it would take to get the stacks back to their original state.

Here are the results:

Number of Red Chips    Total Shuffles
1                      2
2                      4
3                      3
4                      6
5                      10
6                      12
7                      4
8                      8
9                      18
10                     6
11                     11
12                     20
13                     18
14                     28
15                     5
16                     10
17                     12
18                     36
19                     12
20                     20

That's a pretty interesting pattern. I'm sure there's a bunch of cool math behind it; I'll look it up soon and then update this post if I find any. Note that if the number of shuffles is divisible by two, the chips reverse order in the stack (ex. go from all red on the bottom to all red on the top) before returning to their original state.

Update: I found the pattern behind the poker chip shuffle sequence, on a site that lets you plug in any integer sequence and try to find the pattern that matches. You can find the pattern here. Glad I found that page, because I'm not sure I would have guessed the pattern. Here is more detail:

In other words, least m such that 2n+1 divides 2^m-1. Number of riffle shuffles of 2n+2 cards required to return a deck to initial state. A riffle shuffle replaces a list s(1), s(2), ..., s(m) by s(1), s((m/2)+1), s(2), s((m/2)+2), ... a(1) = 2 because a riffle shuffle of [1, 2, 3, 4] requires 2 iterations [1, 2, 3, 4] -> [1, 3, 2, 4] -> [1, 2, 3, 4] to restore the original order.

Here's the example code, in Haskell:

To output numbers, save that code as a file called chips.hs and run:
$ ghci
GHCi, version 6.10.4: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer ... linking ... done.
Loading package base ... linking ... done.
Prelude> :l chips.hs
[1 of 1] Compiling Main             ( chips.hs, interpreted )
Ok, modules loaded: Main.
*Main> zip [1..20] (map getNumberOfShuffles [1..20])
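
This is not the Haskell program referenced above; it's a separate sketch in Python with hypothetical function names. It assumes a perfect shuffle that splits a stack of 2n chips into two halves and interleaves them, starting with a chip from the second half. Under that assumption, simulating the shuffle and applying the "least m such that 2n+1 divides 2^m-1" rule should agree.

```python
def in_shuffle(stack):
    """Interleave the two halves, starting with the second half."""
    half = len(stack) // 2
    first, second = stack[:half], stack[half:]
    shuffled = []
    for s, f in zip(second, first):
        shuffled.extend([s, f])
    return shuffled


def shuffles_by_simulation(n):
    """Shuffle n red + n blue chips until the starting colors come back."""
    start = ["R"] * n + ["B"] * n
    stack = in_shuffle(start)
    count = 1
    while stack != start:
        stack = in_shuffle(stack)
        count += 1
    return count


def shuffles_by_formula(n):
    """Least m such that 2n+1 divides 2^m - 1."""
    modulus = 2 * n + 1
    m, power = 1, 2 % modulus
    while power != 1:
        power = (power * 2) % modulus
        m += 1
    return m


for n in range(1, 21):
    print(n, shuffles_by_simulation(n), shuffles_by_formula(n))
```

Both columns should reproduce the table of results above for 1 to 20 red chips.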

How to import a CSV file to Google Calendar

I had a lot of trouble with this recently, so I hope this helps. Google is pretty strict about the CSV files it imports.

First, open Excel. In the first row of your file, add the following headers:

Subject, Start Date, Start Time, End Date, End Time, All Day Event, Reminder On/Off, Reminder Date, Reminder Time, Meeting Organizer, Description, Location, Private

Every event (every row in Excel) needs to have a Subject, Start Date, and Start Time. The other headers are optional and you can mix and match them as much as you please. Here is a complete sample I uploaded today.

Subject, Start Date, Start Time, End Date, End Time, Private, All Day Event, Location
Ramsay Shield, 7/26/2008, 7:00 PM, , 9:00 PM, FALSE, , TBC
Training, 7/27/2008, 12:00 PM, , , TRUE, FALSE,

Make sure you format your cells properly

Highlight every time under Start Time, End Time, and Reminder Time, click “Format Cells,” then “Time,” then “1:30 PM.”

Also, highlight every date under Start Date, End Date, and Reminder Date, then click “Format Cells,” then “Date,” then “3/14/2001.”

You need to do both of those things for Google to recognize your file.

For Private, All Day Event, and Reminder On/Off, type True or False in the relevant cell. If you leave them blank, Google Calendar will use the default settings for these events (for privacy, it will make the event private or public depending on whether the calendar is private or public). If there isn’t any info after Start Date & Start Time, Google assumes the event is one hour long. Also, make sure that the only data on the spreadsheet is events – otherwise Google will return an error.

Once you’re done, click File –> Save As, then click down at the bottom where it says “Microsoft Office Excel Workbook” for file type, and change it to CSV (Comma Separated Value) format. Then you’re ready to upload to Google!
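
If your events already live somewhere other than a spreadsheet, you could also generate the file with a script. Here is a minimal Python sketch, assuming a hypothetical list of event dictionaries with "subject", "start", "end", and "location" keys. The column names and the date and time formats are copied from the sample above; the helper names are my own, and I haven't confirmed which other formats Google accepts.

```python
import csv
import io
from datetime import datetime

# A subset of the headers listed above; Subject, Start Date, and Start Time
# are the required ones.
HEADERS = ["Subject", "Start Date", "Start Time", "End Date", "End Time",
           "All Day Event", "Location", "Private"]


def format_date(moment):
    """Format as M/D/YYYY, matching the sample rows (e.g. 7/26/2008)."""
    return f"{moment.month}/{moment.day}/{moment.year}"


def format_time(moment):
    """Format as H:MM AM/PM, matching the sample rows (e.g. 7:00 PM)."""
    hour = moment.hour % 12 or 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return f"{hour}:{moment.minute:02d} {suffix}"


def events_to_csv(events):
    """Return CSV text for a list of event dicts (hypothetical schema)."""
    buffer = io.StringIO()
    # restval="" leaves optional columns blank when an event omits them.
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, restval="")
    writer.writeheader()
    for event in events:
        row = {
            "Subject": event["subject"],
            "Start Date": format_date(event["start"]),
            "Start Time": format_time(event["start"]),
            "Location": event.get("location", ""),
        }
        if "end" in event:
            row["End Date"] = format_date(event["end"])
            row["End Time"] = format_time(event["end"])
        writer.writerow(row)
    return buffer.getvalue()


sample = [{
    "subject": "Ramsay Shield",
    "start": datetime(2008, 7, 26, 19, 0),
    "end": datetime(2008, 7, 26, 21, 0),
    "location": "TBC",
}]
print(events_to_csv(sample))
```

Write the returned text to a .csv file and upload it the same way as a file saved from Excel.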

Note: if you upload the same CSV file twice, Google will create duplicates of your events rather than replace the first set. There’s no “Remove duplicates” button either – you have to do it manually, or erase the whole calendar and start again.

Tools I use to get the job done

Every day I look for new tools that'll help me get my work done faster. Here's some of the stuff I use to get the job done.

My [only] machine is a 13" MacBook Pro. I have a 120GB solid state drive, which is the single best upgrade you can make to any machine - it makes your laptop so much faster. I also have 8GB of RAM. Don't buy RAM or an SSD from Apple; buy the parts from Newegg and then install them yourself using the guides on iFixit. With the new unibody laptops you'll need a Phillips #00 and a Torx T5 screwdriver, but you can get these from your local hardware store for less than $10.

I also have a 2TB external drive with backups and my music library, and a Samsung scanner. I use AudioTechnica noise-canceling headphones to stay focused in busy environments.

I have a pretty elaborate set of dotfiles that I use to configure the terminal, vim, and other applications. I keep these in a Mercurial repository so I can get my configuration set up really quickly on a new machine.

MacVim

iTerm 2 and bash are fantastic for a terminal, and MacVim is a great GUI editor for vim - I use it to write blog posts and code. I use Adium for chat and Quicksilver to open and close applications and control iTunes. Skitch is an excellent tool for screenshots, because you can share photos with one click - I use it at least once a day. I use Colloquy for IRC (it's chat for programming people). Jumpcut helps keep track of things I've pasted to the Clipboard and TextExpander expands snippets I use often, like my email address and our home wireless password.

I'm obsessed with maximizing screen space. I keep the Dock minimized, (a) because I really never use it and (b) because it takes up about 50-100 pixels at the bottom of the screen. Divvy resizes/moves windows (to quickly make windows full screen, or left half/right half) and Witch switches between open windows (instead of applications) quickly. At home I have two ASUS 24" monitors; one connects through the DisplayPort and the other one connects with a Diamond BVU adapter.

I use Chrome for casual web browsing and Firefox + Firebug for web development. I use 1Password to generate strong passwords. I use Webfaction for web hosting; it's the best combination of price and features (full SSH, multiple websites, etc) I've been able to find. I might upgrade to Linode soon however, for faster sites and more space on the server. I use rsync to copy my 1Password Anywhere file to the Webfaction server, giving me access to my passwords from any web browser. I also use Shell in a Box to get SSH access to Webfaction from any browser.

I really wish there were better single site browser apps for Mac. Apparently Chrome and Firefox are both working on tools for this. I have single site apps for Google Voice, Calendar, and Gmail, configured with Prism. This helps me pull them up quickly, especially when I have 40+ tabs open in Chrome.

Some stuff I do that's atypical: I type using the Dvorak keyboard layout, to the endless annoyance of people trying to borrow my laptop. I also use Scroll Reverser to have the scroll direction mirror that of mobile devices; you move your fingers up to scroll down (apparently this is the default in Mac OS X Lion).

I have bank accounts with ING Direct and Ally Bank. In my investment account on Sharebuilder I'm invested 50% in VT (a world market fund), 30% in PCRDX (a commodity futures index fund), and 20% in RWO (a global real estate fund).

Test if your server is up, using Ruby

How do you know if your website is up, right now? Maybe you are browsing it and got an error, or a friend texted to tell you it was down. Otherwise your website might be down and you’d have no idea. Here is a simple script that will help you learn faster whether your website is up or down. I wrote it in Ruby to start learning Ruby syntax.

require 'uri'
require 'net/http'
require 'net/smtp'
def send_email(to, opts={})
    # Send an email. In Ruby, the ||= operator means 'only assign the
    # new value if the value has not been set yet'
    opts[:server]      ||= 'localhost'
    opts[:port]        ||= 25
    opts[:ehlo_domain] ||= 'localhost'
    opts[:username]    ||= 'username'
    opts[:password]    ||= 'password'
    opts[:from]        ||= 'youremail@domain.com'
    opts[:from_alias]  ||= 'First Last'
    opts[:subject]     ||= 'Test email'
    opts[:body]        ||= 'This is a test!'
    # the following says 'create a new string called msg containing everything
    # from here to the END_OF_MESSAGE marker'
    # then, we fill in the email with all of the values we filled in above. Note
    # that there has to be a newline between the subject and the body of the
    # message.
    # This is a plaintext message - to send HTML emails, we would need a more
    # complex template.
    msg = <<END_OF_MESSAGE
From: #{opts[:from_alias]} <#{opts[:from]}>
To: <#{to}>
Subject: #{opts[:subject]}

#{opts[:body]}
END_OF_MESSAGE
    Net::SMTP.start(opts[:server], opts[:port], opts[:ehlo_domain],
                    opts[:username], opts[:password], :plain) do |smtp|
        smtp.send_message msg, opts[:from], to
    end
end
# test for main method
if __FILE__ == $0
    # the domain we want to check
    url = 'http://testdomain.com'
    my_email = "myemail@gmail.com"
    # If the server is unreachable, get_response raises an exception instead
    # of returning a response, so treat any error as a failed check too.
    begin
        code = Net::HTTP.get_response(URI.parse(url)).code
    rescue StandardError => e
        code = "none (#{e.class})"
    end
    if code != "200"
        # we have a big problem, server is down or other
        opts = {
            :server     => 'smtp.yoursmtpserver.com',
            :port       => 587,   # should be 25 or 587
            :ehlo_domain => 'localhost',      #this doesn't matter too much
            :username   => 'yoursmtpusername',
            :password   => 'yoursmtppassword',
            :from_alias => 'testdomain.com Admin',
            :subject    => 'The website is down',
            :body       => "Hey,\nan attempt to reach the homepage returned " +
            "the status code " + code + ". \n\nYou should check it out."
        }
        send_email my_email, opts
    end
end

That will send an HTTP request to your website. If the request was successful, you’ll get an HTTP response with the status line ‘200 OK’. If we don’t get a 200, something is definitely wrong with the site. We may get a 200 and still have problems with the database or something else, but this is meant to be a simple test. Other well-known codes are 404 (Not Found), 500 (Internal Server Error), and 418 (I’m a teapot).

Once you’ve got the script in place, save it as “is_server_up.rb” somewhere on a computer which has lots of uptime (your school’s CS server, a remote SSH server, or otherwise). Your laptop is a bad choice because when you close your laptop, or are not connected to the Internet, the test will not be able to run. The computer hosting your website is also a bad choice because if that computer goes down, your website will be down but the script will not run.

On a Linux or Mac machine, type

$ crontab -e

which should open up a Terminal text editor. Then type:

0-59/2 * * * * /usr/local/bin/ruby /path/to/your/file/is_server_up.rb

where /usr/local/bin/ruby is the path to the Ruby executable on that machine. If you don’t know where it is, try typing “which ruby” in the Terminal. If nothing comes up, you probably need to install Ruby.

The 0-59/2 bit says to run the script every 2 minutes. You might be worried that fetching your homepage every 2 minutes will generate an extra 720 hits to your site every day and affect your Analytics stats. However, Analytics only measures visits from browsers that execute JavaScript, and this Ruby script does not execute any JavaScript, so you are safe.

Note that if you have an important site, or one which generates revenue, you probably should not use this tool – there are better tools like Pingdom that will let you know right away whether your site is down.

How to find a Pearson’s correlation and ordinary least squares in Python

I’m going to try to blog more technical stuff here, as well as simple recipes. I’m working out the best way to present it – whether to bundle everything together or separate out the technical posts.

Today we’re going to use Python to find a simple correlation, and then fit a straight line to the data.

First you want to install the SciPy and NumPy libraries – they have a lot of cool functions for Python. On Mac, if you have MacPorts installed, this is trivial:

$ sudo port install py27-numpy py27-scipy

Then you find a Pearson’s correlation as follows:

# The independent variable
>>> x = [1, -2, 2, 3, 1]
# The dependent variable
>>> y = [7.5, -3.5, 14.5, 19, 6.6]
# Find a correlation
>>> from scipy.stats import pearsonr
# First value is the r-value, 2nd is the p-value
>>> pearsonr(x,y)
(0.98139984935586166, 0.0030366388199721478)
# To find the best-fit line, use the numpy library
>>> import numpy
>>> from numpy.linalg import lstsq
# Put the x variable in the correct format.
>>> A = numpy.vstack([x, numpy.ones(len(x))]).T
>>> A
array([[ 1.,  1.],
       [-2.,  1.],
       [ 2.,  1.],
       [ 3.,  1.],
       [ 1.,  1.]])
>>> lstsq(A, y)
(array([ 4.5 ,  4.32]), array([ 10.848]), 2, array([ 4.53897844,  1.84327826]))
# 4.5 is the slope, 4.32 is the y-intercept. 10.848 is the sum of the squared
# residuals. 2 is the rank. The last array holds the singular values.

For more details, including how to plot the correlation, see the NumPy documentation here.
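
If you want to see where those numbers come from, here is a short sketch that recomputes them from the textbook formulas using only NumPy. The variable names are my own, and this is meant as a cross-check under the same x and y as above, not a replacement for pearsonr or lstsq.

```python
import numpy as np

x = np.array([1, -2, 2, 3, 1], dtype=float)
y = np.array([7.5, -3.5, 14.5, 19, 6.6])

# Deviations from the mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: covariance scaled by both standard deviations.
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Ordinary least squares for a straight line y = slope * x + intercept.
slope = (dx * dy).sum() / (dx ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Sum of squared residuals, the quantity least squares minimizes.
residuals = y - (slope * x + intercept)
sum_squared_residuals = (residuals ** 2).sum()

print(r, slope, intercept, sum_squared_residuals)
```

The slope, intercept, and sum of squared residuals should match the first two values returned by lstsq, and r should match the first value from pearsonr.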

Presentation on JIT compilers (and the unfortunate sex life of the banana)

Posting’s been kind of light, so I’ll link to two presentations I’ve done for class recently.

I’m currently taking a course on compilers, which is the best course I’ve taken in college. Very quickly, your computer can only understand a very limited set of instructions: add 01110011 to 11001011, load eight zeroes and ones from memory (the cache, or random-access memory (RAM)), overwrite a set of zeroes and ones, etc. However, if you had to write entire programs using only this instruction set, it would be hard for programmers to keep track of variable state, to write error-free code and to find bugs when they are made (Notable exception: Roller Coaster Tycoon was written entirely in x86 assembly language, which explains (a) why there has never been a Mac version released, and (b) why RCT3 was so much worse – it was written in a higher-level language by a different company).

So what programmers did was invent abstractions – control flow tools like if-then-else branching, while and for loops, as well as data structures like arrays and dictionaries. These allow programmers to write in a more natural language, but programs must be translated – “compiled” – down to machine language. Because this is done automatically, and many people use the same compiler (so everyone benefits from errors caught by one person), generated machine code is usually free of errors.

I gave a talk in class on a special type of compiler, called just-in-time (JIT). Normally you write your program, compile it, and then run the machine language. But a JIT compiler is writing and compiling new machine language as the program is running. Here are the slides:

There are a few reasons this can be quicker than compiling beforehand (“static” compilation). First, you have more information about the values of variables at runtime that you can use to eliminate redundancy in your code. Second, in the case of loops, where you are following the same code paths hundreds or thousands of times in a row, you can see which paths are being used the most often, and rewrite your code so that the most common code path is executed in a straight line. This is important because modern processors will try to pre-fetch and execute several instructions at a time, and if you keep jumping from place to place in your code, the processor can’t execute instructions in advance.

I also wrote a paper (and gave a talk) on some of the threats facing the Cavendish banana, which will probably go extinct soon. I’ve been fortunate to write papers this year on some fun topics; last semester I was able to cite Robin Hanson, Tyler Cowen, Eliezer Yudkowsky and Alan Turing in one philosophy paper, which was pretty fun. That paper’s here.

The Unfortunate Sex Life of the Banana

Everything I know about debugging

If you write programs, a large portion of your time is spent testing your code, and finding and removing bugs in your program. Not being able to understand a bug is one of the more frustrating experiences you can have as a programmer. I’ve been on the lookout for ways to reduce the amount of time I spend debugging, but I really haven’t seen any good articles on the theory of debugging. Here’s everything I know about debugging. If you have additions, feel free to leave a comment.

Write better code

The best defense is a good offense – if you write your code in a clear and concise way, and write by making incremental changes which are immediately tested, you are less likely to run into time-sucking bugs later on. Some people recommend test-driven development and code review as ways to cut down on the number of bugs that make their way into programs. Remember, the earlier you can catch bugs, the less expensive they are to fix.

Try simple fixes for 2-3 minutes

Some bugs are simple and can be fixed within minutes. I would recommend spending 2 or 3 minutes trying a few simple fixes before bringing out the big guns. Yes, this is “cargo cult” programming but maybe 50% of the bugs you find have trivial fixes, and it might not be cost-effective to bring out the big guns right at the start.

Set up a reduced test case

Try to set up the simplest possible version of your code that produces the error. In HTML and CSS, this is known as a reduced test case. This often involves deleting large swathes of your code, so I would recommend either copy/pasting to another page, or committing the current version of your code, then making changes, and reverting to the original version once you’ve identified the problem. If you have a unit test suite, this would be a good time to add a new test case to the suite.

Use print statements or a step-through debugger

If setting up a reduced test case is not feasible (multiple dependencies or complex code), use print statements to determine the values of variables before and after all method calls in your program that could be causing the bug. This is tedious, but systematic, and preferred to cargo cult programming, where you make changes at random and hope that they solve your problem. Another approach is to use a step-through debugger – stop the program at a point where you know the output is correct, and then step through the program line by line, monitoring the information in the program. In Python, the pdb module does this. In Java, the DrJava program is looked down upon for being mostly for students, but I’ve found it extremely useful for viewing the values of variables at points in the program. Most programming languages should have a step-through debugger.
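
To make the two approaches concrete, here is a small Python sketch. The average function and its trace format are hypothetical; the point is only to show where print statements go and where you would pause with pdb.

```python
def average(values):
    """Return the mean of a non-empty list, tracing state as it goes."""
    total = 0
    for value in values:
        total += value
        # Print the state after each step so a wrong value is easy to spot.
        print(f"after adding {value!r}: total={total!r}")
    # To step through instead, uncomment the next line and run the script;
    # execution pauses here and you can inspect total and values.
    # import pdb; pdb.set_trace()
    return total / len(values)


print(average([1, 2, 3]))
```

Once the bug is found, delete the print statements, or turn the ones that proved useful into log messages.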

Use assert statements

Alternately, you can use “assert” statements to assert that a variable should have a specific value at a point in your code. This will stop code from executing immediately at the point where a variable has an incorrect value. This has the added benefit of making sure you understand your code.
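
Here is a minimal sketch of that idea in Python, using a hypothetical apply_discount function and assuming a non-negative price. Note that Python skips assert statements when run with the -O flag, so they are a debugging aid, not input validation.

```python
def apply_discount(price, percent):
    """Return price reduced by percent, checking assumptions along the way."""
    # Fail immediately if the caller passed a value we did not expect.
    assert 0 <= percent <= 100, f"percent out of range: {percent!r}"
    discounted = price * (100 - percent) / 100
    # A discount should never produce a negative price or raise the price.
    assert 0 <= discounted <= price, f"unexpected result: {discounted!r}"
    return discounted


print(apply_discount(200, 25))
```

Calling apply_discount(200, 150) stops with an AssertionError at the first check, which points at the bad input rather than at some later symptom.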

Concede defeat

If you’re still at a loss, consider ways in which you can “go around” the bug, by rewriting sections of your code in a way you understand, or trying something else entirely to solve your goals. Note: I struggle with this because not knowing the cause of a bug drives me up a wall, but you might have some success with it. Also, stepping away from your computer and thinking about the problem in more abstract terms can help – I’ve had some luck with just taking a shower and thinking through the problem.

Ask the Internet for help

If you’re still at a loss, try copying and pasting your error message into Google, or posting on StackOverflow. Be explicit about what you expected to see and what actually happened, and post the full text of any error messages you got. You can also try the IRC channel for the programming language, but I’ve had less success with this – there might be someone nice who’s willing to help you out, but more often than not your message will be ignored.

About me

My name is Kevin Burke and I’m a senior at Claremont McKenna College. I started programming in January of 2010. Read more about me, or follow me on Twitter here.

Disable ESPN Autoplay

I wrote a Google Chrome extension that stops videos and ads from playing automatically on ESPN.com. This is another example of scratching my own itch; most people can enable this feature by clicking Autostart Off on any ESPN video, but I clear out my cookies every time I close Chrome, so that tool doesn’t work for me. Also, this is more important for me than other people because I’ll click to open three stories at once, and if the videos all begin playing at the same time, it gets extremely annoying.

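Anyway, here’s the extension, all six lines of it in JavaScript:

// Content scripts can't see the page's own variables (like espn), so inject
// a <script> tag that runs in the page itself.
jQuery("#videoPlayer").ready(function(){
    var script = document.createElement("script");
    script.type = "text/javascript";
    // Poll every 100ms until the ad starts playing, then pause the video.
    script.text = "function check_if_ready(){if (espn.video.player.adPlaying){espn.video.pause();} else{setTimeout(check_if_ready, 100);}}check_if_ready();";
    document.body.appendChild(script);
});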

Unfortunately getting to that point took a while; I tried a few other things before I hit on that solution. It’s not perfect but it gets the job done. In the future it would be nice to skip the ads entirely, or auto-play only the video in the tab I’m currently watching. You can download the extension here or improve the source code here.