GopherCon 2018 - Allocator Wrestling

Presenter: Eben Freeman

Liveblogger: Beyang Liu

Summary

A whirlwind tour of the Go memory allocator and garbage collector, with tools and tips on how to optimize.

  • The allocator and garbage collector are pretty ingenious!
  • Single allocations are fast but not free
  • The garbage collector can stop individual goroutines, even though STW pauses are very short
  • A combination of tools is essential to understand what's going on:
    • benchmark with GC off
    • use the CPU profiler to find hot allocations
    • use the memory profiler to understand allocation counts/bytes
    • use the execution tracer to understand GC patterns
    • use escape analysis to understand why allocations happen

Eben Freeman (@_emfree_), an engineer at Honeycomb, talks about pitfalls and optimizations related to the memory allocator in Go.

Go is a managed-memory language, which means most of the time you don't have to worry about manual memory management, because the runtime does a lot of work for you. However, dynamic memory allocation is not free, and a program's allocation patterns can substantially affect its performance. The performance implications, however, are not obvious from language syntax. This talk covers techniques and tools that you can use to detect and address allocator bottlenecks.

A motivating example

The architecture of the storage / query service at Honeycomb:

image

The design/performance goals are:

  • subsecond / single-digit second query latency
  • hundreds of millions / billions of rows per query
  • flexible queries, on-the-fly schema changes

Using the techniques described in this talk, they were able to obtain speedups of up to 2.5x on some key benchmarks:

image

Outline of this talk

  1. Allocator internals: How do the allocator and garbage collector work?
  2. Tools: Tools for understanding allocation patterns
  3. What can we change? Strategies for improving memory efficiency

Allocator internals

Memory layout

Depending on their lifetimes, objects are allocated either on the goroutine's stack or on the heap. The Go compiler decides which objects go where, but the general rule is that objects that outlive their containing function calls need to go on the heap.
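
For example (a minimal illustration, not from the talk), a value whose address is returned outlives its call frame and must be heap-allocated, while a value used only locally can stay on the stack:

// onStack never leaks v, so the compiler keeps it on the goroutine's stack.
func onStack() int {
  v := 42
  return v
}

// onHeap returns a pointer to v, so v outlives the call and the compiler
// moves it to the heap ("v escapes to heap").
func onHeap() *int {
  v := 42
  return &v
}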

image

So what actually is in the heap? How did it get there? And who's keeping track of it all?

To answer these questions, we should think about what design goals the authors of the heap memory allocator probably had in mind:

  • Goal: Efficiently satisfy allocations of a given size, but avoid fragmentation
    • Solution: allocate like-sized objects in blocks
  • Goal: Avoid locking in the common case
    • Solution: maintain per-CPU caches
  • Goal: Efficiently reclaim freeable memory
    • Solution: use bitmaps for metadata, run GC concurrently

The heap is divided into two levels of structure: arenas and spans:

image

A global mheap struct keeps track of both of them:

type mheap struct {
  arenas [1 << 22]*heapArena       // Covering map of arena frames
  free []mSpanList                 // Lists of fully free spans
  central []mcentral               // Lists of in-use spans
  // plus much more
}

Arenas are coarse sections of the available address space (64MB on 64-bit archs). We allocate OS memory in units of arenas. For each arena, there's metadata in a heapArena struct:

type mheap struct {
  arenas [1 << 22]*heapArena    // Covering map of arena frames
  // ...
}
 
type heapArena struct {
  // page-granularity map to spans
  spans [pagesPerArena]*mspan
  // pointer/scalar bitmap (2 bits/word)
  bitmap [heapArenaBitmapBytes]byte
}

Most heap objects live in a span: a run of pages containing objects of a fixed size class:

image

type mspan struct {
  startAddr  uintptr
  npages     uintptr
  spanclass  spanClass
 
  // allocated/free bitmap
  allocBits *gcBits
  // ...
}

There are ~70 different span size classes.

Each P has an mcache holding a span of each size class. Ideally, allocations can be satisfied directly out of the mcache (so they're fast). In Go, a P is a scheduling context. Generally there's one P per core and at most one goroutine running per P at a time:

image

To allocate, we find the first free object in our cached mspan, then return its address:

image

Let's say we need 96 bytes. First we'd look in the mcache for the mspan with 96-byte objects. After allocation, the picture would look like this:

image

This design means that "most" memory allocations are fast and require no locking! There are just 3 quick steps:

  1. Find a cached span with the right size (mcache.mspan[sizeClass])
  2. Find the next free object in the span
  3. If necessary, update the heap bitmap
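
As a rough, self-contained sketch of the idea (a toy model, not the runtime's code): each span is a block of fixed-size slots plus an allocation bitmap, and allocation is just "find the first free slot and mark it used":

// Toy model of a span: fixed-size slots plus an allocation bitmap.
// Illustrative only; the real code lives in runtime.mallocgc and friends.
type toySpan struct {
  elemSize  int
  allocBits []bool // which slots are in use
}

// alloc mirrors the three steps above: find the next free object in the
// cached span and update its metadata. No locking is needed, because each P
// owns its own cached span.
func (s *toySpan) alloc() (offset int, ok bool) {
  for i, used := range s.allocBits {
    if !used {
      s.allocBits[i] = true
      return i * s.elemSize, true
    }
  }
  return 0, false // span full: the real allocator refills the mcache here
}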

So now we have covered 2 of our 3 design goals for the allocator: (1) Efficiently satisfy allocations of a given size, but avoid fragmentation, (2) Avoid locking in the common case. But what about (3) Efficiently reclaim memory? In other words, garbage collection?

Garbage collection

We have to find and reclaim objects once they're no longer referenced.

Go uses a tricolor concurrent mark-sweep garbage collector, which sounds intimidating, but isn't. "Mark-sweep" means the garbage collector is divided into a mark phase, where objects are marked for collection, and a sweep phase, where memory is actually freed.

In a tricolor garbage collector, objects are marked white, grey, or black. Initially, all objects are white. We start by marking goroutine stacks and globals. When we reach an object, we mark it grey:

image

When an object's referents are all marked, we mark it black:

image

At the end, objects are either white or black:

image

White objects can then be swept and freed.
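
A toy, single-threaded version of the mark phase makes the grey/black bookkeeping concrete (illustrative only; the real collector works on raw memory, uses bitmaps, and runs concurrently):

type object struct {
  refs []*object // outgoing pointers
}

// mark returns the set of reachable (black) objects. Anything not in the
// returned set is still white and can be swept.
func mark(roots []*object) map[*object]bool {
  black := map[*object]bool{}
  grey := append([]*object{}, roots...) // reached but not yet scanned
  for len(grey) > 0 {
    obj := grey[len(grey)-1]
    grey = grey[:len(grey)-1]
    if black[obj] {
      continue
    }
    black[obj] = true
    // Shade referents grey; objects never shaded stay white.
    grey = append(grey, obj.refs...)
  }
  return black
}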

Simple, right? Well, it leaves some questions open:

  • How do we know what an object's referents are?
  • How do we actually mark an object?

Go uses bitmaps for such metadata.

Pointer identification

Say we have a struct type like:

type Row struct {
  index int
  data []interface{}
}

In memory, that looks like:

image

How does the garbage collector know what other objects it points to? I.e., which of its fields are pointers?

Remember that this heap object is actually inside an arena. The arena's bitmap tells us which of its words are pointers:

image

Mark state

Similarly, mark state is kept in a span's gcMark bits:

image

Once we're done marking, unmarked bits correspond to free slots, so we can just swap that mark state into the alloc bits:

image

This means sweeping in Go is generally super fast (while marking adds more overhead and is what we have to be concerned about).

The garbage collector is concurrent... with a twist. If we're not careful with the implementation, the application can do sneaky stuff to thwart the garbage collector. For example, take the following code:

type S struct {
  p *int
}
 
func f(s *S) *int {
  r := s.p
  s.p = nil
  return r
}

After the first line of the function executes, we could end up with a state like this:

image

And at the return statement, the GC state could look like this:

image

And then the return value would be a live pointer to memory that the garbage collector could free!

To prevent this from occurring, the compiler translates pointer writes into potential calls to the write barrier. Very roughly:

image
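
Conceptually, every pointer write goes through a check like the following while marking is active (a hand-written sketch of the idea; the real barrier is emitted by the compiler and implemented in the runtime):

var markingActive bool // stands in for the runtime's writeBarrier.enabled flag

// writePointer is what a statement like `dst.p = ptr` conceptually becomes.
func writePointer(dst **int, ptr *int) {
  if markingActive {
    shade(ptr) // ensure the GC sees ptr even if the old object graph hid it
  }
  *dst = ptr
}

func shade(p *int) {
  // In the runtime, this marks the object grey so the marker will scan it.
}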

There are many ingenious optimizations in the Go compiler that let this process happen concurrently, but we won't get into those here.

We can, however, begin to understand how marking might be a problem as far as processor overhead is concerned.

When GC is marking, the write barrier is turned on.

During marking, 25% of GOMAXPROCS are dedicated to background marking. But additional goroutines can be forced into mark assist:

image

Why the need for mark assist? A rapidly allocating goroutine can outrun the background markers. So when allocating during marking, a goroutine gets charged for each allocation. If it's in debt, it has to do mark work before continuing:

func mallocgc(size uintptr, ...) unsafe.Pointer {
  // ...
  assistG.gcAssistBytes -= int64(size)
  if assistG.gcAssistBytes < 0 {
    // This goroutine is in debt. Assist the GC to
    // fix this before allocating. This must happen
    // before disabling preemption.
    gcAssistAlloc(assistG)
  }
  // ...
}

Summary

We've learned quite a bit about the runtime:

  • The runtime allocates data in spans to avoid fragmentation
  • Per-CPU caches speed up allocation, but the allocator still has to do some bookkeeping
  • GC is concurrent, but write barriers and mark assists can slow a program
  • GC work is proportional to scannable heap (allocating a big buffer of scalars is much cheaper than allocating a big buffer of pointers, because you have to follow the pointers)

Tools

It seems like dynamic memory allocation has some cost. Does this mean that reducing allocations will always improve performance? Well, it depends. The builtin memory profiler can tell us where we're allocating, but can't answer the causal question, "Will reducing allocations make a difference?" To do that, we can use three other tools:

  • crude experimenting
  • sampling profiling with pprof
  • go tool trace

Crude experimenting

If you want to see how much overhead the allocator and garbage collector add, you can turn them off with the following runtime flags:

  • GOGC=off : Disables garbage collector
  • GODEBUG=sbrk=1 : Replaces the entire allocator with a simple persistent allocator that gets big blocks from the OS and hands out successive slices of them as you allocate
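
For example, if your workload can be expressed as a Go benchmark, a quick comparison looks like this (the benchmark name here is illustrative):

go test -bench=Query -benchtime=10s .

GOGC=off go test -bench=Query -benchtime=10s .

GODEBUG=sbrk=1 go test -bench=Query -benchtime=10s .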

image

This might seem kind of stupid, but it's a cheap way to establish a rough estimate of possible speedup potential. Maybe 30% speedup isn't worthwhile -- or maybe it's a compelling opportunity!

Problems with this approach:

  • not viable in production, need synthetic benchmarks
  • not reliable for programs with large heaps: the persistent allocator may not perform well for those either

However, it is a worthwhile first check.

Profiling

A pprof CPU profile can often show time spent in runtime.mallocgc.

Tips:

  • Use the flamegraph viewer in the pprof web UI
  • If pprof isn't enabled in your binary, you can use Linux perf too
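
For example, if your service exposes net/http/pprof on localhost:6060 (as in the trace example below), you can grab a 10-second CPU profile and open it in the web UI with:

go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=10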

image image

This gives us line-level attribution of where we're doing CPU-expensive allocations, and can help clue us into the underlying cause of GC/allocator overhead.

Problems with this approach:

  • Program might not be CPU-bound
  • Allocation might not be on critical path
  • Time in background marking (gcBgMarkWorker) can mislead (time spent here doesn't necessarily mean you have net slowdown)

go tool trace

The execution tracer might be the best tool at our disposal for understanding the impact of allocation in detail. It captures very granular runtime events over a short time window:

curl localhost:6060/debug/pprof/trace?seconds=5 > trace.out

You can visualize the output in a web UI:

go tool trace trace.out

...though it can be a little dense:

image

The top-level blue bar shows when GC is running, and the lower bars show what's happening internally.

Remember, the top-level GC bar doesn't mean the program is blocked, but what happens within GC is interesting!

image

image

The minimum mutator utilization curve can also show you if a service is GC-bound at different quartiles. ("Mutator" here means "not GC".)

Summary

Together, benchmarks with the allocator off, CPU profiles, and execution traces give us a sense of:

  • whether allocation / GC are affecting performance
  • which call sites are spending a lot of time allocating
  • how throughput changes during GC.

What can we change?

If we've concluded that allocations are a source of inefficiency, what can we do? A few high-level strategies:

  • Limit pointers
  • Allocate in batches
  • Try to recycle objects (e.g., sync.Pool)

What about just tuning GOGC?

  • Absolutely helps with throughput. However...
  • If we want to optimize for throughput, GOGC doesn't express the real goal: "use all available memory, but no more"
  • Live heap size is generally but not always small
  • A high GOGC makes avoiding OOMs harder

Okay, now onto things we can change in code to improve allocation/GC behavior...

Limit pointers

The Go compiler can be enticed to tell you why a variable is heap-allocated:

go build -gcflags="-m -m"

but its output is a bit unwieldy.
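
For a function that returns the address of a local variable (like the onHeap sketch earlier), the output includes lines like these (file name and positions are illustrative, and the exact wording varies by Go version):

./rows.go:12:9: &v escapes to heap

./rows.go:10:6: moved to heap: v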

https://github.com/loov/view-annotated-file helps digest it:

image

Sometimes, spurious heap allocations are easily avoided!

image

image

In addition to avoiding spurious allocations, avoiding structs with inner pointers helps the garbage collector:

image

^ Why this discrepancy? Because the sneaky time.Time struct conceals a nefarious pointer:

type Time struct {
  wall uint64
  ext  int64
  loc  *Location
}
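
One common workaround (an illustrative pattern, not necessarily what the talk used) is to keep timestamps as plain integers in hot, long-lived structures and convert to time.Time only at the edges:

// Record keeps its timestamp as Unix nanoseconds, so the struct contains no
// pointers and the GC never needs to scan it.
type Record struct {
  TimestampNanos int64
  Value          float64
}

func (r Record) Timestamp() time.Time {
  return time.Unix(0, r.TimestampNanos) // convert back only when needed
}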

Slab allocation

Even though the fast path in the allocator is very optimized, we still need to do some work on every allocation:

  • prevent ourselves from being preempted
  • check if we need to assist GC
  • compute the next free slot in the mcache
  • set heap bitmap bits
  • etc.

That ends up being ~20ns per allocation, if you need a back-of-the-envelope figure. In some cases, we can amortize that overhead by doing fewer, bigger allocations:

// Allocate individual interface{} slices out of one big buffer
const blockSize = 4096 // slab size; callers must request at most this many elements

type SlicePool struct {
  bigBuf []interface{}
}

func (s *SlicePool) GetSlice(size int) []interface{} {
  if size >= len(s.bigBuf) {
    // current slab is exhausted: grab a fresh one
    s.bigBuf = make([]interface{}, blockSize)
  }
  res := s.bigBuf[:size]
  s.bigBuf = s.bigBuf[size:]
  return res
}

The danger is that any live reference will keep the whole slab alive:

image

Also, these aren't safe for concurrent use. They're best for a few heavily-allocating goroutines.

Recycling objects

Consider the following storage engine architecture:

image

This architecture

  • is simple, easy to reason about
  • maximizes available parallelism
  • generates tons of garbage

Optimization: explicitly recycle allocated blocks of memory:

image

var buf RowBuffer
select {
case buf = <-recycled:
default:
  buf = NewRowBuffer()
}

A more sophisticated version would use sync.Pool, which

  • maintains slices of recycled objects, sharded by CPU (with runtime support)
  • allows lock-free get/put in the fast path
  • caveat: cleared on every GC

Danger: must be very careful to zero or overwrite recycled memory.
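
A minimal sync.Pool version of the same idea might look like this (RowBuffer and its rows field are placeholders):

var bufPool = sync.Pool{
  New: func() interface{} { return new(RowBuffer) },
}

func getBuffer() *RowBuffer {
  buf := bufPool.Get().(*RowBuffer)
  buf.rows = buf.rows[:0] // reset: recycled memory still holds old contents
  return buf
}

func putBuffer(buf *RowBuffer) {
  bufPool.Put(buf)
}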

In review

Below is a chart of different benchmarks after successive rounds of optimization (your mileage may vary):

image

In summary,

  • The allocator and garbage collector are pretty ingenious!
  • Single allocations are fast but not free
  • The garbage collector can stop individual goroutines, even though STW pauses are very short
  • A combination of tools is essential to understand what's going on:
    • benchmark with GC off
    • use the CPU profiler to find hot allocations
    • use the memory profiler to understand allocation counts/bytes
    • use the execution tracer to understand GC patterns
    • use escape analysis to understand why allocations happen

Acknowledgements and further reading

Special credit to Ian Wilkes and many others at Honeycomb: the true masterminds

Suggested further reading:

  • Allocation efficiency in high-performance Go services, by Achille Roussel and Rick Branson: https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/
  • Go 1.5 concurrent garbage collector pacing, by Austin Clements: https://golang.org/s/go15gcpacing
  • So You Wanna Go Fast, by Tyler Treat: https://bravenewgeek.com/so-you-wanna-go-fast/

About the author

Beyang Liu is the CTO and co-founder of Sourcegraph. Beyang studied Computer Science at Stanford, where he published research in probabilistic graphical models and computer vision at the Stanford AI Lab. You can chat with Beyang on Twitter @beyang or our community Discord.
