GopherCon 2018 - Performance Tuning Workshop Notes


These are some notes from my experiences at the GopherCon 2018 Performance Tuning Workshop, run by Dave Cheney and Francesc Campoy. I don’t expect these will be laid out in any particularly useful way; I am mostly taking them so I can remember some of the bits I found most useful in the future.

The contents of this post are licensed under as the basis of the content is this workshop repository:

Mechanical Sympathy

  • Important to know how a thing works to work well with it.
  • Don’t need to be a writer/designer/maintainer of the language itself, but having a cursory understanding helps

The Case for Performance Tuning


  • Huge benefits of Moore’s law (doubling of # of transistors on a chip every 18 months for ~50 years) might be slowing down

    • If computers were still going to get faster at the same rate, maybe we don’t need to worry so much
    • Performance now increasing at maybe ~3% per year, meaning doubling every ~20 years
    • CPU frequencies have been basically static for a while, largely due to heat management
  • The trend has been to increase the number of cores, rather than increasing the performance of a single core

    • Parallelism does not get you nearly as far as chip performance; often you can’t parallelize enough to effectively use more cores (after a point)
  • Most improvements have been architectural

    • e.g., Out of Order Execution, Speculative Execution
    • Optimization has tended towards doing things in bulk well, but one-offs not as well

Memory Land

  • Memory capacities are basically unbounded still, but access time is not improving nearly as fast

    • Getting data out of main memory is slow
    • Caches are faster; speed degrades as you go down the cache levels, and cache sizes are limited (especially L1/L2, whose size is limited by speed-of-light concerns and the need for stable access times for CPU instruction scheduling)


  • For increased performance, will have to write for discrete units, not just CPU
  • Compiled languages will enable better working with CPU branch prediction (vs. interpreted)
  • Need languages that allow efficient reasoning/working with memory (not hiding it behind a VM or excessive pointer chasing)


Benchmarking

  • To improve performance, you need to be able to measure performance

    • Benchmark on idle machines (not shared, no web browsing, etc)
    • Disable power saving and other performance-hurting factors
    • Watch out for VMs and shared cloud hosting (noisy)
  • Run benchmarks multiple times (both before and after) to get consistent results

Testing Package Tools

    func BenchmarkFoo(b *testing.B) {
        for i := 0; i < b.N; i++ { // the loop body is the code being measured
        }
    }

  • By default, benchmarks are not run by go test; add -bench flag to run them

    • Go will run the regular tests before benchmarks; can skip them with a -run pattern that matches nothing (e.g. -run=^$)
    • Can pick core counts w/ -cpu=1,2,4,8, e.g. (actually adjusting GOMAXPROCS)
  • Reports b.N and ns/op (ns per loop body)

    • b.N starts small and is increased until the benchmark runs for about 1s (tries to get the most runs in that time)
    • open issue to set b.N instead of determining
  • Can increase -benchtime to get a high-enough number of iterations (M * 1k) to be valid/stable

  • Can increase -count to get more runs

  • benchstat, a tool from Russ Cox (installed with go get), can tell you how stable a set of benchmark results is (with -count, e.g.)

    • tee to a file, and then benchstat the file to get stats
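
To make the b.N and ns/op reporting concrete, here is a minimal sketch that runs a benchmark from a plain program via testing.Benchmark, which picks b.N the same way go test -bench does. joinWords and the Sink variable are made-up names for illustration, not something from the workshop.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Sink keeps the result live so the call is not optimized away.
var Sink string

// joinWords is a hypothetical function under test.
func joinWords(words []string) string {
	return strings.Join(words, " ")
}

// benchmarkJoin has the same shape as a BenchmarkXxx function in a _test.go file.
func benchmarkJoin(b *testing.B) {
	words := []string{"mechanical", "sympathy", "matters"}
	for i := 0; i < b.N; i++ {
		Sink = joinWords(words)
	}
}

func main() {
	// testing.Benchmark grows b.N until the run is long enough to be timed.
	res := testing.Benchmark(benchmarkJoin)
	fmt.Printf("b.N = %d, %d ns/op\n", res.N, res.NsPerOp())
}
```

In a real package the function would be named BenchmarkJoin, live in a _test.go file, and be run with go test -bench.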

Getting Solid Before/After Samples

  • Compile test binaries w/ go test -c && mv foo.test foo.old-version

    • Pass flags to the compiled test binary as -test.foo instead of -foo (e.g. -test.bench=. rather than -bench=.)
  • Have tests in place so you don’t break things while chasing optimizations

  • Can pass both out files to benchstat to get comparative stats (including p-value and number of non-outliers)

Benchmark Setups

    func BenchmarkFoo(b *testing.B) {
        for i := 0; i < b.N; i++ { // code under test; one-time setup belongs above the loop
        }
    }


Benchmark Allocations

Use b.ReportAllocs() in the benchmark function, or force on with -benchmem
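
A rough sketch of how setup and allocation reporting can fit together: b.ResetTimer() discards whatever time and allocations the setup cost, and b.ReportAllocs() turns on allocation stats for this one benchmark. makeInput, sortedCopy and Sink are hypothetical names made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"testing"
)

// Sink keeps each result live so the work is not optimized away.
var Sink []int

// makeInput stands in for expensive one-time setup (hypothetical helper).
func makeInput(n int) []int {
	in := make([]int, n)
	for i := range in {
		in[i] = n - i
	}
	return in
}

// sortedCopy is the hypothetical code under test; it allocates a new slice on every call.
func sortedCopy(in []int) []int {
	out := make([]int, len(in))
	copy(out, in)
	sort.Ints(out)
	return out
}

func benchmarkSortedCopy(b *testing.B) {
	in := makeInput(10000) // setup that should not be timed
	b.ReportAllocs()
	b.ResetTimer() // drop the time and allocations spent in setup
	for i := 0; i < b.N; i++ {
		Sink = sortedCopy(in)
	}
}

func main() {
	res := testing.Benchmark(benchmarkSortedCopy)
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		res.NsPerOp(), res.AllocedBytesPerOp(), res.AllocsPerOp())
}
```

If the setup has to happen on every iteration instead, b.StopTimer() and b.StartTimer() around it inside the loop is the other option.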

Beware Compiler Optimizations

  • (popcount example – see examples in original repo)

  • Compiler: “You can’t prove I didn’t do it” – inlined away (no actual function call, just the loop)

    • math/bits package has the built-in instructions (when available)
  • To fix: make sure the compiler can’t prove the result is unused (store it in a local variable inside the loop, then after the loop assign that value to an exported package-level variable)

    • an unexported package variable could still be optimized away in the future, if the compiler starts supporting that
    • don’t want to just do //go:noinline; want the inlining performance benefits – just fix the benchmark
  • -gcflags "-S" to see assembly (can be useful to see if things got optimized away)

  • Also beware of only-constants in benchmarks
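
A minimal sketch of the fix described above, using a hand-rolled popcount as a stand-in for the workshop's example (the real one is in the repo). The result goes into a local inside the loop and is published to an exported package variable once, after the loop. Result and benchmarkPopcount are names chosen for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// Result is exported so the compiler cannot prove the stored value is never read.
var Result int

// popcount counts the set bits in x.
func popcount(x uint64) int {
	n := 0
	for x != 0 {
		x &= x - 1 // clear the lowest set bit
		n++
	}
	return n
}

func benchmarkPopcount(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = popcount(uint64(i)) // vary the input so it is not a constant
	}
	Result = r // publish once, after the loop
}

func main() {
	res := testing.Benchmark(benchmarkPopcount)
	fmt.Printf("%d ns/op\n", res.NsPerOp())
}
```

Feeding the loop counter in as the argument also covers the only-constants warning, since the compiler can no longer fold the call into a precomputed value.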


Profiling

  • Benchmarks help give you an idea of where you are, but not which parts could or should be optimized

  • pprof

    • beware sampling issues
    • runtime/pprof package
    • go tool pprof (not recommended – drop-in replacement exists)
  • pprof profiling methods

    • cpu
    • memory (two modes)
    • blocking
    • mutex contention

CPU Profiling

  • Records a stack trace from the Go runtime 100 times per second (every 10ms)

  • Benchmark w/ -cpuprofile=cpu.pb.gz

    • Has a speed cost (not unexpected)
  • Examine w/ go tool pprof cpu.pb.gz

    • top inside is not that useful, b/c the runtime and things like that show up even though we can’t change anything there
    • we really want the cumulative view (top -cum)
    • can just do web (opens a graph)
  • pprof drop-in replacement (installed with go get)

    • pprof ... instead of go tool pprof ...
    • pprof -http=:6060 ... to open a web server w/ flame graphs and other cool stuff
    • code view w/ timing on a per-line basis
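
Outside of go test, the same kind of CPU profile can be written with the runtime/pprof package. This is only a sketch: busyWork is a made-up CPU-bound function and profileCPU is a hypothetical helper, not something from the workshop.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// busyWork is a hypothetical CPU-bound function that gives the profiler something to sample.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileCPU runs fn while a CPU profile is being written to path.
func profileCPU(path string, fn func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	// Deferred calls run last-in first-out, so the profile is flushed before f is closed.
	defer pprof.StopCPUProfile()
	fn()
	return nil
}

func main() {
	err := profileCPU("cpu.pb.gz", func() { busyWork(200000000) })
	if err != nil {
		log.Fatal(err)
	}
	log.Println("wrote cpu.pb.gz; inspect it with: go tool pprof cpu.pb.gz")
}
```

With a 10ms sampling interval, the work has to run for a while before the profile contains anything useful, which is why the loop count is large.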

Memory Profiling

  • For heap allocations only

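  • Samples allocations rather than recording every one (the workshop quoted 1 in every 1000; the rate can be tweaked via runtime.MemProfileRate, which is expressed in bytes allocated per sample)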

    • tracks code paths that led to allocations, not what is using the memory allocated
  • -alloc_objects vs. -inuse_objects to detect different things (or no flag)

    • no flag gives memory size
    • the other flags give counts
  • Apparently it can be used to detect leaks, but not quite straightforward
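
A small sketch of writing a heap profile by hand with runtime/pprof, assuming a made-up workload (allocate) and an exported Retained variable that keeps the memory in use so it shows up in the in-use views.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// Retained keeps allocations reachable so they count as in-use memory.
var Retained [][]byte

// allocate is a hypothetical workload: n heap allocations of size bytes each.
func allocate(n, size int) {
	for i := 0; i < n; i++ {
		Retained = append(Retained, make([]byte, size))
	}
}

// writeHeapProfile dumps the current heap profile to path.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // bring the in-use statistics up to date first
	return pprof.WriteHeapProfile(f)
}

func main() {
	allocate(1000, 64*1024)
	if err := writeHeapProfile("mem.pb.gz"); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote mem.pb.gz; try: go tool pprof -alloc_objects mem.pb.gz")
}
```

The profile points at allocate as the code path that allocated, which matches the note above: it shows where memory was allocated, not who is holding on to it.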

Block Profiling

  • Similar to CPU profile but records time spent waiting on a shared resource

  • Possibly useful for identifying concurrency bottlenecks

    • channel waits
    • mutex waits (can also do mutex profiling)

Mutex Profiling

  • Like block profiling for mutexes specifically
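
Both profiles are off by default and have to be switched on in the runtime before they record anything. The sketch below assumes a deliberately contended mutex as the workload; contend and dumpProfile are hypothetical helpers.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend makes several goroutines fight over one mutex (hypothetical workload).
func contend(workers int) int {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		count int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 50; i++ {
				mu.Lock()
				count++
				time.Sleep(10 * time.Microsecond) // hold the lock so others wait
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return count
}

// dumpProfile writes the named runtime profile ("block" or "mutex") to path.
func dumpProfile(name, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.Lookup(name).WriteTo(f, 0)
}

func main() {
	runtime.SetBlockProfileRate(1)     // record every blocking event
	runtime.SetMutexProfileFraction(1) // record every contention event
	contend(8)
	for _, name := range []string{"block", "mutex"} {
		if err := dumpProfile(name, name+".pb.gz"); err != nil {
			log.Fatal(err)
		}
	}
}
```

A rate of 1 records everything, which is fine for a toy but probably too expensive to leave on in a real service.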

General Tips

  • Profile one thing at a time (more will hurt performance too much, probably)

Profiling Non-Benchmarks

  • Option 1: calling defer profile.Start().Stop() (worked pretty well in testing)

  • Option 2: (see repo)

  • Too much time in syscalls means you should probably be buffering to avoid doing syscalls as often

  • "net/http/pprof"

    • can pass the url to the debug endpoints to the pprof tools instead of a file
    • only has a performance hit when you run pprof against the url
  • go-wrk to generate load
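
For the net/http/pprof route, importing the package for its side effects is enough to register the /debug/pprof/ handlers on http.DefaultServeMux. The sketch below uses an httptest server so that it starts and exits on its own; a real service would presumably call http.ListenAndServe instead. hello, newServer and statusOf are made-up names.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
)

// hello is a hypothetical application handler.
func hello(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "hello")
}

// newServer serves the default mux, which already carries the pprof endpoints.
func newServer() *httptest.Server {
	return httptest.NewServer(http.DefaultServeMux)
}

// statusOf returns the HTTP status code for a GET of url.
func statusOf(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	http.HandleFunc("/", hello)
	srv := newServer()
	defer srv.Close()

	code, err := statusOf(srv.URL + "/debug/pprof/")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("GET /debug/pprof/ -> %d", code)
}
```

With a long-running server, the pprof tools can be pointed at the server's /debug/pprof/profile URL in place of a file.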

Execution Tracer

  • Can’t catch really fast or rare events with pprof

  • Instead of asking the runtime for data, configure the runtime to log/report data

    • Performance hit is not nearly as predictable as pprof
    • start with trace.Start, defer trace.Stop(), usually in main, but can do in a specific function
  • go tool trace trace.out (needs Chrome)

    • w / s zoom in/out
    • a / d move left/right
    • ? for help
    • Can see “mark assist” things that slow down goroutines as they help the garbage collector
    • Can see when goroutines get tagged for GC assist (only those that ask for a malloc can get co-opted)
    • STW (“stop the world”: stop everything so we can free memory)
    • SWEEP (do the memory cleanup)
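
A minimal sketch of the trace.Start / trace.Stop pattern, wrapped around a specific function instead of all of main. fanOut is a made-up workload that starts a few goroutines so the trace has something to show; traceTo is a hypothetical helper.

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"sync"
)

// fanOut is a hypothetical workload: each worker counts the odd numbers below 100000.
func fanOut(workers int) int {
	var wg sync.WaitGroup
	results := make([]int, workers)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				results[w] += i & 1
			}
		}(w)
	}
	wg.Wait()
	total := 0
	for _, r := range results {
		total += r
	}
	return total
}

// traceTo records an execution trace of fn into path.
func traceTo(path string, fn func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		return err
	}
	defer trace.Stop() // runs before f.Close, so the trace is flushed first
	fn()
	return nil
}

func main() {
	if err := traceTo("trace.out", func() { fanOut(4) }); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote trace.out; view it with: go tool trace trace.out")
}
```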

Compiler Optimizations

  • 3 Big ones: escape analysis, inlining, dead code elimination
  • Go compiler was built from the Plan9 compiler

Escape Analysis

  • Can return pointers out of a function (and the compiler will allocate on the heap)

  • But can take things that are usually thought of as on the heap and put them on the stack instead

    • e.g., slice that is created and used entirely within a function
    • can’t use make/new to determine heap/stack
    • ...interface{} can make things escape
  • Can make the compiler tell you about it w/ -gcflags=-m, or -gcflags="-m -m" for even more info

  • There is a threshold of stack growth vs. heap allocation where in-function things might escape anyhow

  • const vs var sizes matter for escape analysis
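
A small sketch of the two directions escape analysis can go, measured with testing.AllocsPerRun. The names (point, newPoint, sumLocal, allocsOf, Sink) are made up for illustration. newPoint returns a pointer that ends up in a package variable, so its value has to live on the heap; sumLocal uses make with a constant size and never lets the slice leave the function.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y, z int }

// Sink is exported package state; anything stored here must live on the heap.
var Sink *point

// newPoint returns a pointer to a local, so the value escapes.
func newPoint() *point {
	p := point{1, 2, 3}
	return &p
}

// sumLocal uses make with a constant size and keeps the slice inside the
// function, so the backing array can stay on the stack.
func sumLocal() int {
	s := make([]int, 8)
	for i := range s {
		s[i] = i
	}
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

// allocsOf reports the average number of heap allocations per call of fn.
func allocsOf(fn func()) float64 {
	return testing.AllocsPerRun(1000, fn)
}

func main() {
	fmt.Println("newPoint allocs/call:", allocsOf(func() { Sink = newPoint() }))
	fmt.Println("sumLocal allocs/call:", allocsOf(func() { _ = sumLocal() }))
}
```

The first line is expected to report one allocation per call and the second none. Building with -gcflags=-m should show the compiler's reasoning for each.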


Inlining

  • Only for leaf functions

    • Leaf functions are functions that do not call other functions
  • Only for “small” functions – ratio of preamble overhead to work done matters

    • e.g., getters, setters
    • “can inline” vs. “inlining call to” in the -gcflags=-m output
  • go1.11 seems to be able to chain inlines, and can inline bits that might panic

  • can control inlining

    • -gcflags=-l disables
    • -gcflags="-l -l" more aggressive, maybe bigger binaries (note: NOT the same as no -l (?))
    • -gcflags="-l -l -l" more aggressive, maybe bigger binaries again, might be buggy
    • -gcflags="-l=4" in go1.11 will enable experimental mid stack inlining (bigger and slower right now)
  • things like const vs. var won’t affect inlining like they do for escape analysis

  • but exported (public) package variables will
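
A tiny sketch for poking at inlining decisions: a getter and a setter of the kind the notes call out, plus a caller that is kept out of line with //go:noinline so the call sites are easy to find. All names here (counter, N, Inc, countTo) are made up for the example.

```go
package main

import "fmt"

type counter struct{ n int }

// N is a tiny leaf getter.
func (c *counter) N() int { return c.n }

// Inc is a tiny leaf setter.
func (c *counter) Inc() { c.n++ }

// countTo is kept out of line on purpose, so the calls inside it are easy to
// spot in the compiler's inlining report.
//
//go:noinline
func countTo(limit int) int {
	c := &counter{}
	for c.N() < limit {
		c.Inc()
	}
	return c.N()
}

func main() {
	fmt.Println(countTo(10))
}
```

Building this with go build -gcflags=-m should print the compiler's inlining decisions for N and Inc and for their call sites inside countTo; the exact wording varies between Go versions.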

Dead Code Elimination

  • Can use some conditional compilation to set consts to different values so that sometimes the compiler can optimize away code (e.g., debug stuff)
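
A minimal sketch of the const trick, assuming a single file for brevity. In a real project the debug constant would more likely be defined in two files selected by build tags (a hypothetical layout, not something from the workshop). logf and sum are made-up names.

```go
package main

import "fmt"

// debug is a constant, so the compiler can evaluate `if debug` at compile time
// and drop the branch entirely when it is false.
const debug = false

// logf only prints when debug is true.
func logf(format string, args ...interface{}) {
	if debug {
		fmt.Printf(format, args...)
	}
}

// sum adds up xs, logging each step in debug builds.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		logf("adding %d\n", x)
		total += x
	}
	return total
}

func main() {
	fmt.Println(sum([]int{1, 2, 3}))
}
```

If debug were a var instead, the compiler could not assume its value and the check would have to stay in the compiled code.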

Tips and Tricks

  • unicode.IsSpace(b) for detecting all kinds of spaces

  • “concurrency of gophers”

  • reduce allocations

    • a la Read([]byte) (int, error) vs. Read() ([]byte, error)
  • avoid []byte to string conversions

    • conversion creates a copy
    • pick one representation and stick with it
  • there are compiler optimizations for doing v, ok := m[string(key)] (and that specifically, not when doing the conversion before map lookup)

  • string concatenation is complicated

    • initial benchmark was appending to a properly sized []byte > + concatenation > Sprintf > Fprintf
  • preallocate slices if the length is known

  • goroutines can seem free, but there are memory concerns (each one has a stack of at least 2K)

    • make sure there is a stop condition so the memory can get cleaned up
  • async filesystem IO is not widely available

    • filesystem IO will consume an OS thread while running
  • use io.Reader / io.Writer instead of passing around large []byte when possible

  • Timeouts

    • use SetReadDeadline / SetWriteDeadline / SetDeadline for network IO
    • use worker pool channels
  • defer-ing a very small function can be expensive relative to the work it does

  • avoid finalizers (non-deterministic, since connected to GC)

  • avoid, or at least minimize, cgo

  • TODO: find the video about the Go execution tracer (uses the mandelbrot example from the workshop)
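
Two of the tips above are easy to check with testing.AllocsPerRun: preallocating a slice whose length is known, and doing the []byte to string conversion inside the map index expression. This is only a sketch; grow, prealloc, lookup, allocsOf and Sink are names made up for the example.

```go
package main

import (
	"fmt"
	"testing"
)

// Sink is exported so results stay live and have to be heap allocated.
var Sink []int

// grow appends without a capacity hint, so the backing array is reallocated as it grows.
func grow(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// prealloc sizes the slice once, because the final length is known up front.
func prealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// lookup converts inside the index expression, the form the compiler
// special-cases to avoid copying key into a new string.
func lookup(m map[string]int, key []byte) (int, bool) {
	v, ok := m[string(key)]
	return v, ok
}

// allocsOf reports the average number of heap allocations per call of fn.
func allocsOf(fn func()) float64 {
	return testing.AllocsPerRun(100, fn)
}

func main() {
	m := map[string]int{"gopher": 1}
	key := []byte("gopher")
	fmt.Println("grow allocs/call:    ", allocsOf(func() { Sink = grow(1000) }))
	fmt.Println("prealloc allocs/call:", allocsOf(func() { Sink = prealloc(1000) }))
	fmt.Println("lookup allocs/call:  ", allocsOf(func() { lookup(m, key) }))
}
```

The preallocated version should need a single allocation, the growing version several, and the map lookup none.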