Real Life Go Benchmarking

I've been following the commits to the Go project for some time now. Occasionally someone will post a commit with benchmarks showing how much the commit improves performance along some axis or another. In this commit, they've increased the performance of division by 7 (a notoriously tricky number to divide by) by about 40% on machines running the ARM instruction set. I'm not 100% sure what the commit does, but it switches around the instructions that get generated when you do long division on ARM in a way that makes things faster.

Anyway, I wanted to learn how to do benchmarks, and practice making something faster. Some motivation came along when Uber released their go-zap logging library, with a set of benchmarks showing my friend Alan's logging library as the worst performer. So I thought it would be a good candidate for benchmarking.

Fortunately Alan has already included a set of benchmarks in the test suite. You can run them by cloning the project and then calling the following:

go test -run=^$ -bench=.

You need to pass -run=^$ to exclude all tests in the test suite, otherwise all of the tests will run and also all of the benchmarks. We only care about the benchmarks, so -run=^$ filters out every argument.

We get some output!

BenchmarkStreamNoCtx-4                   	  300000	      5735 ns/op
BenchmarkDiscard-4                       	 2000000	       856 ns/op
BenchmarkCallerFileHandler-4             	 1000000	      1980 ns/op
BenchmarkCallerFuncHandler-4             	 1000000	      1864 ns/op
BenchmarkLogfmtNoCtx-4                   	  500000	      3866 ns/op
BenchmarkJsonNoCtx-4                     	 1000000	      1816 ns/op
BenchmarkMultiLevelFilter-4              	 2000000	      1015 ns/op
BenchmarkDescendant1-4                   	 2000000	       857 ns/op
BenchmarkDescendant2-4                   	 2000000	       872 ns/op
BenchmarkDescendant4-4                   	 2000000	      1029 ns/op
BenchmarkDescendant8-4                   	 1000000	      1018 ns/op
BenchmarkLog15AddingFields-4             	   50000	     29492 ns/op
BenchmarkLog15WithAccumulatedContext-4   	   50000	     33599 ns/op
BenchmarkLog15WithoutFields-4            	  200000	      9417 ns/op
PASS
ok  	github.com/inconshreveable/log15	30.083s

The name on the left is the benchmark name. The number (4) is the number of CPU's used for the benchmark. The number in the middle is the number of test runs; to get a good benchmark, you want to run a thing as many times as feasible, then divide the total runtime by the number of test runs. Otherwise you run into problems like coordinated omission and weighting the extreme outliers too much, or failing to weight them at all.

To get a "good" benchmark, you want to try to isolate the running code from anything else running on the machine. For example, say you run the tip benchmarks while a Youtube video is playing, make a change, and then run the benchmarks while nothing is playing. Video players require a lot of CPU/RAM to play videos, and all other things being equal, the benchmark is going to be worse with Youtube running. There are a lot of ways to accidentally bias the results as well, for example you might get bored with the tip benchmarks and browse around, then make a change and observe the console intensely to see the new results. You're biasing the results simply by not switching tabs or screens!

If you have access to a Linux machine with nothing else running on it, that's going to be your best bet for benchmarking. Otherwise, shut down as many other programs as are feasible on your main machine before starting any benchmark tests.

Running a benchmark multiple times can also be a good way to compensate for environmental effects. Russ Cox's benchstat program is very useful for this; it gathers many runs of a benchmark, and tells you whether the results are statistically significant. Run your benchmark with the -count flag to run it multiple times in a row:

go test -count=20 -run=^$ -bench=Log15AddingFields | tee -a master-benchmark

Do the same for your change, writing to a different file (change-benchmark), then run benchstat:

benchstat master-benchmark change-benchmark

You'll get really nice looking output. This is generally the output used by the Go core development team to print benchmark results.

$ benchstat benchmarks/tip benchmarks/early-time-exit
name                 old time/op  new time/op  delta
StreamNoCtx-4        3.60µs ± 6%  3.17µs ± 1%  -12.02%  (p=0.001 n=8+6)
Discard-4             837ns ± 1%   804ns ± 3%   -3.94%  (p=0.001 n=7+6)
CallerFileHandler-4  1.94µs ± 2%  1.88µs ± 0%   -3.01%  (p=0.003 n=7+5)
CallerFuncHandler-4  1.72µs ± 3%  1.65µs ± 1%   -3.98%  (p=0.001 n=7+6)
LogfmtNoCtx-4        2.39µs ± 2%  2.04µs ± 1%  -14.69%  (p=0.001 n=8+6)
JsonNoCtx-4          1.68µs ± 1%  1.66µs ± 0%   -1.08%  (p=0.001 n=7+6)
MultiLevelFilter-4    880ns ± 2%   832ns ± 0%   -5.44%  (p=0.001 n=7+6)
Descendant1-4         855ns ± 3%   818ns ± 1%   -4.28%  (p=0.000 n=8+6)
Descendant2-4         872ns ± 3%   830ns ± 2%   -4.87%  (p=0.001 n=7+6)
Descendant4-4         934ns ± 1%   893ns ± 1%   -4.41%  (p=0.001 n=7+6)
Descendant8-4         990ns ± 2%   958ns ± 2%   -3.29%  (p=0.002 n=7+6)

OK! So now we have a framework for measuring whether a change helps or hurts.

How can I make my code faster?

Good question! There's no one answer for this. In general, there are three strategies:

Figure out a way to do less work than you did before. Avoid doing an expensive computation where it's not necessary, exit early from functions, &c.
Do the same work, but in a faster way; use a different algorithm, or use a different function, that's faster.
Do the same work, but parallelize it across multiple CPU's, or threads.

Each technique is useful in different places, and it can be hard to predict where you'll be able to extract performance improvements. It is useful to know how expensive various operations are, so you can evaluate the costs of a given operation.

It's also easy to spend a lot of time "optimizing" something only to realize it's not the bottleneck in your program. If you optimize something that takes 5% of the code's execution time, the best overall speedup you can get is 5%, even if you get the code to run instantaneously. Go's test framework has tools for figuring out where your code is spending the majority of its time. To get the best use out of them, focus on profiling the code execution for a single benchmark. Here, I'm profiling the StreamNoCtx benchmark in the log15 library.

$ go test -cpuprofile=cpu.out -benchmem -memprofile=mem.out -run=^$ -bench=StreamNoCtx -v
BenchmarkStreamNoCtx-4   	  500000	      3502 ns/op	     440 B/op	      12 allocs/op

This will generate 3 files: cpu.out, mem.out, and log15.test. These are binary files. You want to use the pprof tool to evaluate them. First let's look at the CPU profile; I've started it and then run top10 to get the top 10 functions.

$ go tool pprof log15.test cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top5
560ms of 1540ms total (36.36%)
Showing top 5 nodes out of 112 (cum >= 60ms)
      flat  flat%   sum%        cum   cum%
     180ms 11.69% 11.69%      400ms 25.97%  runtime.mallocgc
     160ms 10.39% 22.08%      160ms 10.39%  runtime.mach_semaphore_signal
     100ms  6.49% 28.57%      260ms 16.88%  github.com/inconshreveable/log15.escapeString
      60ms  3.90% 32.47%       70ms  4.55%  bytes.(*Buffer).WriteByte
      60ms  3.90% 36.36%       60ms  3.90%  runtime.stringiter2

The top 5 functions are responsible for 36% of the program's runtime. flat is how much time is spent inside of a function, cum shows how much time is spent in a function, and also in any code called by a function. Of these 5, runtime.stringiter2, runtime.mallocgc, and runtime.mach_semaphore_signal are not good candidates for optimization. They are very specialized pieces of code, and they're part of the Go runtime, so changes need to pass all tests and get approved by the Go core development team. We could potentially figure out how to call them less often though - mallocgc indicates we are creating lots of objects. Maybe we could figure out how to create fewer objects.

The likeliest candidate for improvement is the escapeString function in our own codebase. The list function is super useful for checking this.

(pprof) list escapeString
ROUTINE ======================== github.com/inconshreveable/log15.escapeString in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
      30ms      330ms (flat, cum) 23.40% of Total
      10ms       10ms    225:func escapeString(s string) string {
         .          .    226:	needQuotes := false
         .       90ms    227:	e := bytes.Buffer{}
         .          .    228:	e.WriteByte('"')
      10ms       50ms    229:	for _, r := range s {
         .          .    230:		if r <= ' ' || r == '=' || r == '"' {
         .          .    231:			needQuotes = true
         .          .    232:		}
         .          .    233:
         .          .    234:		switch r {
         .          .    235:		case '\\', '"':
         .          .    236:			e.WriteByte('\\')
         .          .    237:			e.WriteByte(byte(r))
         .          .    238:		case '\n':
         .          .    239:			e.WriteByte('\\')
         .          .    240:			e.WriteByte('n')
         .          .    241:		case '\r':
         .          .    242:			e.WriteByte('\\')
         .          .    243:			e.WriteByte('r')
         .          .    244:		case '\t':
         .          .    245:			e.WriteByte('\\')
         .          .    246:			e.WriteByte('t')
         .          .    247:		default:
         .      100ms    248:			e.WriteRune(r)
         .          .    249:		}
         .          .    250:	}
         .       10ms    251:	e.WriteByte('"')
         .          .    252:	start, stop := 0, e.Len()
         .          .    253:	if !needQuotes {
         .          .    254:		start, stop = 1, stop-1
         .          .    255:	}
      10ms       70ms    256:	return string(e.Bytes()[start:stop])

The basic idea here is to write a string, but add a backslash before double quotes, newlines, and tab characters. Some ideas for improvement:

We create a bytes.Buffer{} at the beginning of the function. We could keep a Pool of buffers, and retrieve one, so we don't need to allocate memory for a buffer each time we escape a string.
If a string doesn't contain a double quote, newline, tab, or carriage return, it doesn't need to be escaped. Maybe we can avoid creating the Buffer entirely for that case, if we can find a fast way to check whether a string has those characters in it.
If we call WriteByte twice in a row, we could probably replace it with a WriteString(), which will use a copy to move two bytes, instead of growing the array twice.
We call e.Bytes() and then cast the result to a string. Maybe we can figure out how to call e.String() directly.

You can then look at how much time is being spent in each area to get an idea of how much any given idea will help your benchmarks. For example, replacing WriteByte with WriteString probably won't save much time; you're at most saving the time it takes to write every quote and newline, and most strings are made up of letters and numbers and space characters instead. (It doesn't show up at all in the benchmark but that's because the benchmark writes the phrase "test message" over and over again, and that doesn't have any escape-able characters).

That's the CPU benchmark; how much time the CPU spends running each function in the codebase. There's also the memory profile; how much memory each function allocates. Let's look at that! We have to pass one of these flags to pprof to get it to show memory information.

Sample value selection option (for heap profiles):
  -inuse_space      Display in-use memory size
  -inuse_objects    Display in-use object counts
  -alloc_space      Display allocated memory size
  -alloc_objects    Display allocated object counts

Let's pass one. Notice in this case, the top 5 functions allocate 96% of the objects:

go tool pprof -alloc_objects log15.test mem.out
(pprof) top5
6612331 of 6896359 total (95.88%)
Showing top 5 nodes out of 18 (cum >= 376843)
      flat  flat%   sum%        cum   cum%
   2146370 31.12% 31.12%    6612331 95.88%  github.com/inconshreveable/log15.LogfmtFormat.func1
   1631482 23.66% 54.78%    1631482 23.66%  github.com/inconshreveable/log15.escapeString
   1540119 22.33% 77.11%    4465961 64.76%  github.com/inconshreveable/log15.logfmt
    917517 13.30% 90.42%    1294360 18.77%  github.com/inconshreveable/log15.formatShared
    376843  5.46% 95.88%     376843  5.46%  time.Time.Format

Let's look at a function:

ROUTINE ======================== github.com/inconshreveable/log15.logfmt in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go
   1540119    4465961 (flat, cum) 64.76% of Total
         .          .     97:		if i != 0 {
         .          .     98:			buf.WriteByte(' ')
         .          .     99:		}
         .          .    100:
         .          .    101:		k, ok := ctx[i].(string)
         .    2925842    102:		v := formatLogfmtValue(ctx[i+1])
         .          .    103:		if !ok {
         .          .    104:			k, v = errorKey, formatLogfmtValue(k)
         .          .    105:		}
         .          .    106:
         .          .    107:		// XXX: we should probably check that all of your key bytes aren't invalid
         .          .    108:		if color > 0 {
         .          .    109:			fmt.Fprintf(buf, "\x1b[%dm%s\x1b[0m=%s", color, k, v)
         .          .    110:		} else {
   1540119    1540119    111:			fmt.Fprintf(buf, "%s=%s", k, v)
         .          .    112:		}
         .          .    113:	}

In the common case on line 111 (when color = 0), we're calling fmt.Fprintf to write the key, then an equals sign, then the value. Fprintf also has to allocate memory for its own byte buffer, then parse the format string, then interpolate the two variables. It might be faster, and avoid allocations, to just call buf.WriteString(k), then write the equals sign, then call buf.WriteString(v). But you'd want to benchmark it first to double check!

Conclusion

Using a combination of these techniques, I was able to improve the performance of log15 by about 30% for some common code paths, and reduce memory allocations as well. I was not expecting this, but I also found a way to speed up JSON encoding in the Go standard library by about 20%!

Want someone to benchmark/improve performance in your company's application, or teach your team more about benchmarking? I am available for consulting; email me and let's set something up!

Liked what you read? I am available for hire.

2 thoughts on “Real Life Go Benchmarking”

Bobby Jacob July 4, 2016 at 11:18 am

Great post. Not sure if this is how they did it, but here’s an article on how to speed up division by constants:

http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html

Reply ↓
Karl Keefer July 8, 2016 at 7:54 pm

Thanks for writing this up! Very interesting and helpful. Off to write some benchmarks of my own!

Reply ↓

KEVIN BURKE

Real Life Go Benchmarking

How can I make my code faster?

Conclusion

2 thoughts on “Real Life Go Benchmarking”

Leave a Reply Cancel reply