GopherCon 2019 - Optimizing Go Code without a blindfold

By Colin King for the GopherCon 2019 Liveblog on July 26, 2019


Presenter: Daniel Martí

Liveblogger: Colin King


Illustration by Sketch Post

Overview

Making code faster is exciting, and benchmarks in Go make that easy to do! Not really. Optimizing a program can be complex, and doing it properly requires care. This session demonstrates techniques and tools that are a must for any performance aficionado.

Go Benchmarks

Do you know if your code is slow? Do you want to know if it can go faster? If so, the first tool you may want to reach for is benchmarking.

Go's standard testing package includes tooling for benchmarking your code, measuring both its running time and its memory allocations. Let's look at a simple example:

copyList code
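A minimal sketch of such a function, assuming it copies a slice of strings one append at a time. Only the name copyList comes from the talk; the element type and the body are assumptions made for illustration.

```go
package main

import "fmt"

// copyList returns a copy of in, growing the result one append at a time.
// The body is an assumed stand-in, not the code shown in the talk.
func copyList(in []string) []string {
	var out []string
	for _, s := range in {
		out = append(out, s) // may reallocate and copy as out grows
	}
	return out
}

func main() {
	src := []string{"a", "b", "c"}
	dst := copyList(src)
	dst[0] = "changed" // the copy has its own backing array
	fmt.Println(src, dst)
}
```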

You could then benchmark this utility function like so:

Benchmark Code
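A benchmark for a function like this might look roughly as follows. This is a sketch under assumptions: the copyList body, the input size of 1000, and the sink variable are all hypothetical. A benchmark function normally lives in a _test.go file and runs via go test -bench=. -benchmem; here testing.Benchmark drives it from main so the example is self-contained.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the call.
var sink []string

// copyList is an assumed implementation that appends element by element.
func copyList(in []string) []string {
	var out []string
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

// BenchmarkCopyList would usually sit in a _test.go file.
func BenchmarkCopyList(b *testing.B) {
	in := make([]string, 1000) // input size is an arbitrary assumption
	b.ReportAllocs()
	b.ResetTimer() // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		sink = copyList(in)
	}
}

func main() {
	res := testing.Benchmark(BenchmarkCopyList)
	fmt.Println(res.String(), res.MemString())
}
```

Storing the result in a package-level variable is a common precaution against the compiler optimizing the measured call away.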

From the test below, we can see that this simple function takes about 5,000 nanoseconds per call, along with quite a few allocations:

Example Benchmark Output

Can we do better? To find out, we can reach for pprof to identify which lines are responsible:

Example Pprof Output

From above, we can see that our benchmark is spending the majority of its time performing append operations! Thankfully, there's an easy fix by pre-allocating the slice and setting each element:

Optimized copyList Code
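The pre-allocating version could be sketched like this, next to the append-based one for comparison. Both function names and bodies are assumptions for illustration; testing.AllocsPerRun is used only to show the difference in allocation counts.

```go
package main

import (
	"fmt"
	"testing"
)

var sink []string

// copyListAppend grows the result one append at a time (assumed original).
func copyListAppend(in []string) []string {
	var out []string
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

// copyListPrealloc allocates the result once and sets each element.
func copyListPrealloc(in []string) []string {
	out := make([]string, len(in))
	for i, s := range in {
		out[i] = s
	}
	return out
}

func main() {
	in := make([]string, 1000)
	a := testing.AllocsPerRun(100, func() { sink = copyListAppend(in) })
	p := testing.AllocsPerRun(100, func() { sink = copyListPrealloc(in) })
	fmt.Printf("append: %.0f allocs/op, preallocated: %.0f allocs/op\n", a, p)
}
```

For a plain copy like this, the built-in copy function would do the same job as the loop in copyListPrealloc.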

To understand how much we improved on our benchmark test, we can introduce another tool -- benchcmp -- which displays the performance changes between two executions of the same benchmark test:

benchcmp output

🎉Woohoo!! 🎉 That's a massive performance improvement.

But, can you trust your benchmarks?

While the example in the previous section was simple, the production code we want to benchmark tends to be more complex. Let's look at a more resource-intensive example by checking out the benchmark tests used by the standard library's encoding/json package. Specifically, we'll look at the BenchmarkCodeDecoder test.

If you run this test a few times, you'll notice that the performance measurement varies with each run:

json CodeDecoder

json CodeDecoder comparison

While a +/-3% difference may not seem like it's worth losing sleep over, packages like encoding/json regularly see minor speed improvements -- in fact, three of the last four performance improvements for encoding/json were under 4%. With this much noise, how can we reliably identify if a change has meaningfully impacted the performance of a package?

For that, we're going to need some statistics.

Benchmarks <3 Statistics

To understand the true performance characteristics of a package, we'll want to look at multiple executions of our benchmarks, so we can compute the mean and variance. For that, we can use the benchstat command:

json benchstat
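To make "mean and variance" concrete, here is a rough sketch that summarizes a handful of timings as a mean plus a relative spread. The sample values are made up, summarize is a hypothetical helper, and benchstat's own calculation (for example, how it treats outliers) is not reproduced here.

```go
package main

import (
	"fmt"
	"math"
)

// summarize returns the mean of samples and the largest deviation from
// that mean, expressed as a percentage of the mean.
func summarize(samples []float64) (mean, spreadPct float64) {
	for _, s := range samples {
		mean += s
	}
	mean /= float64(len(samples))

	maxDev := 0.0
	for _, s := range samples {
		if d := math.Abs(s - mean); d > maxDev {
			maxDev = d
		}
	}
	return mean, 100 * maxDev / mean
}

func main() {
	runs := []float64{13.2, 13.5, 13.8} // made-up ms/op values
	mean, spread := summarize(runs)
	fmt.Printf("%.1fms ± %.0f%%\n", mean, spread)
}
```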

A variance of +/-3% is still rather high. To reduce it, we're going to look at what else your CPU is up to. Ideally, the machine is otherwise idle, with CPU usage as close to 0% as possible, since your benchmarks will happily use as much CPU as they can get. Realistically, though, you are probably at work with Slack, an editor, and 20+ Chrome tabs open. (Fun fact: animated emojis used to be a huge CPU hog in Slack.)

Closing these resource-hungry apps was enough to drop our variance down to +/-1%:

json benchstat

However, that's not where our CPU problems stop. If you run a benchmark a few times in a row, you might notice a major drop in performance, for no apparent reason:

json benchstat throttling

In this case, the cause is CPU throttling: the benchmarks above pegged CPU usage at 100% for too long. Thankfully, there's another tool -- perflock -- that we can use to keep our benchmarks from pushing the CPU hard enough to throttle:

perflock

In the example above, we cap the CPU at 70% (perflock's governor setting, which limits the CPU's clock frequency rather than its utilization), which is enough to keep the computer from throttling.

While benchstat can be used to analyze multiple benchmark executions to compute the mean and variance, it can also be used to understand how a change you made impacted the performance of your code. Let's look at an example:

benchstat comparison

As you can see above, our new code ran, on average, in 13.4ms vs. 13.5ms. ...Woohoo? Thankfully, benchstat provides some statistical context at the bottom to help us judge whether this improvement is statistically significant. Usually, instead of a ~, we'd see a +/-X.XX% change between the two sets of benchmark executions. In this case, however, benchstat decided that the delta was not significant because of the high p-value (0.247). What's a p-value? Think of it as the probability, assuming our change had no impact on the performance of our code, of seeing a difference at least as large as the one we just observed. Lower p-values are better: the lower the p-value, the less likely it is that the measured improvement is just noise.

Now for a surprise -- that was actually two runs of the same code with no changes! A p-value of 0.247 is actually quite bad, and in general you'll want a p-value of <= 0.05 for an improvement to be statistically significant.
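One way to build intuition for p-values is an exact permutation test: pool both sets of timings, try every way of splitting them back into two groups, and count how often the difference in means is at least as large as the observed one. This is only an illustrative sketch with made-up timings and hypothetical helper names; it is not benchstat's method (benchstat applies its own significance test, and its numbers will differ).

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// permutationPValue returns the fraction of relabelings of the pooled
// samples whose absolute difference in means is at least the observed one.
// It enumerates all splits, so keep len(a)+len(b) small (about 20 or less).
func permutationPValue(a, b []float64) float64 {
	pooled := append(append([]float64{}, a...), b...)
	observed := math.Abs(mean(a) - mean(b))
	total := mean(pooled) * float64(len(pooled))

	n := len(pooled)
	extreme, splits := 0, 0
	for mask := uint(0); mask < 1<<uint(n); mask++ {
		if bits.OnesCount(mask) != len(a) {
			continue // only consider splits with the original group sizes
		}
		sumA := 0.0
		for i := 0; i < n; i++ {
			if mask&(1<<uint(i)) != 0 {
				sumA += pooled[i]
			}
		}
		meanA := sumA / float64(len(a))
		meanB := (total - sumA) / float64(len(b))
		splits++
		if math.Abs(meanA-meanB) >= observed-1e-9 { // tolerate rounding error
			extreme++
		}
	}
	return float64(extreme) / float64(splits)
}

func main() {
	// Made-up ms/op timings for two runs of a benchmark.
	before := []float64{13.5, 13.4, 13.7, 13.5, 13.3}
	after := []float64{13.4, 13.6, 13.3, 13.5, 13.2}
	fmt.Printf("p = %.3f\n", permutationPValue(before, after))
}
```

When the two groups overlap heavily, many relabelings look just as "improved" as the real one, so the p-value is high and the difference should not be trusted.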

Benchmarking Recap

With that, you can now write benchmark tests and use benchstat to understand the performance of your code, then use pprof to identify potential changes, and test them in a reliable environment with perflock + benchstat!

Detour: Compiler Optimizations

You may have heard of Moore's Law -- that computing power (well, transistor count) will double every 18 months -- but what about Proebsting's Law? It postulates that advances in compilers double computing power every 18 years. Not nearly as exciting, but nonetheless shows that compilers can give us significant performance improvements! And given that we now know a good deal about benchmarking our code, let's spend some time talking about some of the ways the Go compiler helps us out, too.

Before we dive into compiler tricks, consider giving the cmd/compile doc a read to get a high-level introduction to the Go compiler.

Function Inlining

You can pass arguments to the compiler with the -gcflags flag of go build. For example, passing the -m flag twice makes the compiler print its inlining decisions, including which functions it could not inline and why:

compiler inlining

You can use this on your own code to identify functions that you could simplify in order for the compiler to be able to inline them!
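As a small sketch to experiment with, the program below has one tiny function and one that does more work. Both functions are hypothetical. Building it with go build -gcflags='-m -m' should print the compiler's inlining decision for each; which functions are actually inlined depends on the compiler version, so treat the comments as expectations rather than guarantees.

```go
package main

import "fmt"

// add is tiny, so it is a likely candidate for inlining.
func add(a, b int) int {
	return a + b
}

// describe does more work (a loop plus a formatting call), so it is more
// likely to exceed the compiler's inlining budget.
func describe(xs []int) string {
	total := 0
	for _, x := range xs {
		total = add(total, x)
	}
	return fmt.Sprintf("sum=%d n=%d", total, len(xs))
}

func main() {
	fmt.Println(describe([]int{1, 2, 3}))
}
```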

Heap Allocations

You can also use this compiler flag to find situations where expressions escape to the heap, requiring memory allocations. If you have a hot function that is generating a lot of allocations, this may help you identify why:

compiler heap
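A minimal sketch of the difference, using hypothetical types and functions: returning a pointer to a local value forces that value onto the heap, while passing and returning plain values does not. Building with go build -gcflags=-m should report the escape for newPoint; testing.AllocsPerRun is used here to observe the allocations at run time.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

var (
	sink  *point
	total int
)

// newPoint returns a pointer to a local value. The value outlives the
// call, so it escapes to the heap.
//
//go:noinline
func newPoint(x, y int) *point {
	p := point{x, y}
	return &p
}

// sumPoint works on a copy of the value, so nothing needs to escape.
func sumPoint(p point) int {
	return p.x + p.y
}

func main() {
	heap := testing.AllocsPerRun(1000, func() { sink = newPoint(1, 2) })
	stack := testing.AllocsPerRun(1000, func() { total += sumPoint(point{1, 2}) })
	fmt.Printf("newPoint: %.0f allocs/op, sumPoint: %.0f allocs/op\n", heap, stack)
}
```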

Bounds Checks

Whenever you index into a slice, the Go compiler generates a bounds check. The compiler also runs an optimization pass, called bounds check elimination (BCE), that removes these checks when it can prove that an indexing operation is safe. You can turn on debugging for the prover, with a configurable debug level, to see whether or not these checks can be removed from your code:

compiler bounds check
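A common way to help the compiler here is to touch the highest index first, so a single check covers the accesses that follow. The sketch below uses two hypothetical functions to show the pattern; building with go build -gcflags=-d=ssa/check_bce/debug=1 lists the bounds checks that remain, and the exact results depend on the compiler version.

```go
package main

import "fmt"

// sum4 reads four elements in ascending order. Each access may need its
// own bounds check, because earlier checks say nothing about later indexes.
func sum4(b []byte) int {
	return int(b[0]) + int(b[1]) + int(b[2]) + int(b[3])
}

// sum4Hint checks the highest index up front. After that, the compiler can
// prove that indexes 0 through 3 are in range.
func sum4Hint(b []byte) int {
	_ = b[3] // early bounds check; panics here if b is too short
	return int(b[0]) + int(b[1]) + int(b[2]) + int(b[3])
}

func main() {
	data := []byte{1, 2, 3, 4}
	fmt.Println(sum4(data), sum4Hint(data))
}
```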

Clearing Maps

Before Go 1.11, the cheapest way to clear a map was to replace it with a newly allocated map, which is not particularly efficient. As of Go 1.11, you can range over the keys in a map and delete each one; the compiler recognizes this loop and optimizes it, which saves you a fresh map allocation!

compiler clearing maps
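The loop in question looks like this. The helper name clearMap is an assumption for illustration; the range-and-delete form is what matters.

```go
package main

import "fmt"

// clearMap removes every entry from m in place, so the existing map (and
// its allocated storage) can be reused.
func clearMap(m map[string]int) {
	for k := range m {
		delete(m, k)
	}
}

func main() {
	m := map[string]int{"a": 1, "b": 2}
	clearMap(m)
	m["c"] = 3 // the same map is still usable after clearing
	fmt.Println(len(m), m)
}
```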

Checking String Length

Similarly, the efficient way to count the runes in a string previously involved iterating over the string and counting as you went. However, you can now simplify that to taking the length of a rune slice conversion, which the compiler optimizes:

compiler string length
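Both forms side by side, as a sketch with hypothetical function names. Note that len(s) on the string itself counts bytes, not runes, which is why the conversion is needed for non-ASCII text.

```go
package main

import "fmt"

// runeCountLoop counts runes by ranging over the string.
func runeCountLoop(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

// runeCountSlice expresses the same count as the length of a rune slice.
func runeCountSlice(s string) int {
	return len([]rune(s))
}

func main() {
	s := "h\u00e9llo, \u4e16\u754c"
	fmt.Println(len(s), runeCountLoop(s), runeCountSlice(s))
}
```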

With all that said, consider giving the compiler a chance by writing code that it can optimize!

One More Thing: SSA Analysis


If you read through the Go compiler doc above, you'll know that the Go compiler uses an SSA (Static Single Assignment) representation of your source code to perform optimization passes. In their words: "In this phase, the AST is converted into Static Single Assignment (SSA) form, a lower-level intermediate representation with specific properties that make it easier to implement optimizations and to eventually generate machine code from it."

You can actually investigate the SSA output of a function in your package with a special environment variable, GOSSAFUNC, for go build:

compiler ssa command

This generates the SSA output for the simple HelloWorld program above, which I've included here.
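To try this yourself, a program along these lines is enough. It is an assumed stand-in for the one shown in the talk. Running GOSSAFUNC=HelloWorld go build in its directory makes the compiler write an ssa.html file showing that function after each compiler pass.

```go
package main

import "fmt"

// HelloWorld is the function named in GOSSAFUNC; the value of the
// environment variable must match the function name.
func HelloWorld() string {
	return "Hello, world!"
}

func main() {
	fmt.Println(HelloWorld())
}
```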

You can read more about the internals of the Go compiler's SSA phase with their introductory doc here.


You can see the full slides from Daniel's talk here.