GopherCon 2019 - Two Go programs, three different profiling techniques, in 50 minutes
Anton Velikanov for the GopherCon 2019 Liveblog
Go, a relatively young statically typed, compiled language, is known for producing efficient programs. But writing a program in Go is not, by itself, enough to get good performance.
An all-live-code session
Dave gave a live demo showing 3 different types of profiling:
- CPU Profiling
- Memory Profiling
- Tracing
Dave started by walking through the following example program, which counts the number of words in a given text file.
The algorithm works as follows: read the file one byte at a time, checking for EOF and for whitespace, so that a word is counted each time a run of letters ends at a space.
Let's build and run this program:
We have some test input: moby.txt, 1.2 MB of fine prose.
time ./words moby.txt → 181275 words, taking about 2 seconds on average.
Let’s compare that to unix’s wc -w
time wc -w moby.txt → 215829 words in 0.002 seconds.
We have to make our program run faster.
First thing we need to do is to profile this program.
Technique 1 - CPU Profiling
We add profiling support and run the program again. With pprof enabled the program runs a bit slower, but the benefit is that we now have a profile to inspect:
go tool pprof -http=:8080 cpu.pprof
Here we see call stacks: pprof samples the call stacks of active goroutines, so we can see which goroutines do the most work.
Here is what our program is actually doing:
The biggest contributor: syscall.Syscall. Why is it consuming nearly all of the time in our program? Too many syscalls. It's not that a syscall is slow, but that all the reading operations are unbuffered, so we issue one syscall per byte.
We also don't know how long the underlying goroutine spends in each syscall, only whether the syscall returns fast enough.
The fix: we buffer the reads.
The times are very small now.
The CPU profile doesn't show anything useful anymore - it's just background noise.
Technique 2 - Memory Profiling
Let's switch from CPU profiling to memory profiling.
time ./words moby.txt → real 0.044s, sys 0.006s
We enable the memory profile with defer profile.Start(profile.MemProfile, profile.MemProfileRate(1), …). With a profile rate of 1, we can see every allocation that happens in this program. The profile shows that all the allocations come from readbyte, called from main: because readbyte takes an io.Reader interface, the compiler doesn't know what concrete type will be passed in, so it can't prove the one-byte buffer stays on the stack - every call allocates it on the heap.
Let's reuse the buffer instead of allocating a new one on every call.
The allocations are tiny now!
We reduced the allocations of this program. go tool pprof -http=:8080 mem.pprof
We are only twice as slow as wc
Technique 3 - Tracing
An example that produces a Mandelbrot image
The algorithm takes 1.6s to generate a 1 MB image. Let's see if we can improve it.
The CPU profile shows main.paint at 1.10 seconds, with main.fillPixel using nearly all of the CPU time. But it's hard to improve the algorithm itself.
Let's use the trace tool to find out more.
From the profile and the trace we can see:
- main.fillPixel was on the CPU the most whenever pprof captured the stack.
- The program is single threaded while composing the image: the trace shows only one CPU in use, and not many goroutines running simultaneously.
The idea, then: compute every pixel in parallel.
Using more than one CPU
We saw from the previous trace that the program is running sequentially and not taking advantage of the other CPUs on this machine.
Mandelbrot generation is known as embarrassingly parallel: each pixel is independent of every other, so they could all be computed in parallel. So, let's try that.
Now the algorithm runs in multiple chunks, but the trace shows gaps.
Batching up work
Using one goroutine per pixel was too fine grained. There wasn't enough work to justify the cost of each goroutine.
Instead, let’s try processing one row per goroutine.
This looks like a good improvement: we almost halved the runtime of the program. Let's look at the trace.
We still have 1000 goroutines, but each one now has 1000 times more work to do.
The time to compute the image is 200 ms and the time to write the image is also 200ms which is good!
In the main goroutine we queue up the work using a buffered channel.
-mode worker -workers=8 (the number of CPUs on Dave's machine)
The trace looks similar, with one difference: a fixed number of goroutines now.
Does it make much difference in practice - a fixed number of goroutines, or one goroutine per unit of work? It depends on the use case; the two strategies deliver pretty much the same result here.
We boosted the performance of the code nearly 4x!
For this algorithm, more CPUs is better. With 1000 cores we could compute every row at the same time; with 1000000 cores, every pixel in parallel. But we now spend 50% of the time writing the image, and that's a sequential operation.
The speedup is limited by the serial part of the application, which cannot be parallelized.
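That limit is Amdahl's law: with a parallel fraction p and n processors, speedup = 1 / ((1 - p) + p / n). A quick check of the 50%-serial case from the demo:

```go
package main

import "fmt"

// speedup applies Amdahl's law: p is the fraction of the program
// that can run in parallel, n is the number of processors.
func speedup(p, n float64) float64 {
	return 1 / ((1 - p) + p/n)
}

func main() {
	// If 50% of the runtime is the sequential image write, no number
	// of cores can make the whole program more than 2x faster.
	for _, n := range []float64{1, 8, 1000, 1e6} {
		fmt.Printf("p=0.5, n=%g: speedup %.3f\n", n, speedup(0.5, n))
	}
}
```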
Think about your own programs! How much of the work can be done in parallel?
The key is to be able to do effectively all of the work in parallel.