go / runtime

The Go runtime is compiled into every Go binary. It provides the memory allocator, garbage collector, goroutine scheduler, and everything that makes the go statement, channels, and defer work.

This article is my compressed notes on the Internals for Interns four-part series, which I verified against the Go 1.26 source. It ends with practical implications for day-to-day Go programming.

Bootstrap

Before func main() runs, the runtime:

  1. Creates g0 (the goroutine the runtime uses for scheduling and other housekeeping) and m0 (the first OS thread)
  2. Sets up Thread-Local Storage so each thread knows which goroutine it's running
  3. Detects CPU features (AES for map hashing, etc.)
  4. Initializes the stack pool, memory allocator, and type/interface tables
  5. Creates P structs (one per GOMAXPROCS, defaulting to CPU count)
  6. Spawns the sysmon background thread
  7. Creates a goroutine for runtime.main, which calls your main.main

Memory allocator

The allocator sits between your program and the OS. It grabs large arenas (64MB on 64-bit systems) via mmap and subdivides them into 8KB pages, which are grouped into spans.

Each span holds objects of a single size class. There are 67 size classes from 8 bytes to 32KB. A 50-byte allocation rounds up to 64 bytes and fills a slot in a span of 64-byte slots. Objects larger than 32KB are allocated directly from the heap.

Allocation uses a three-level cache:

  1. mcache (per-P, no locks) — the fast path
  2. mcentral (per-size-class, shared) — refills mcache
  3. mheap (global page allocator) — refills mcentral

Most allocations hit the mcache and need no locks.

Scheduler

The scheduler multiplexes goroutines onto OS threads using three structures (the GMP model):

  1. G — a goroutine: its stack, program counter, and state
  2. M — an OS thread ("machine") that executes code
  3. P — a logical processor holding a run queue; there are GOMAXPROCS of them

An M must acquire a P to run Go code. When an M blocks on a syscall, the P detaches and moves to a free M, keeping work moving.

Each P has a 256-slot local run queue plus a runnext fast slot. Idle Ps steal work from other Ps' queues. The sysmon thread runs in the background to preempt long-running goroutines and retake Ps stuck in syscalls.

Garbage collector

Go uses a concurrent, non-moving, tri-color mark-and-sweep collector. Collection runs in four phases:

  1. Sweep termination: brief stop-the-world to finish prior sweeps and enable the write barrier
  2. Mark: concurrent; traces the object graph using ~25% CPU
  3. Mark termination: brief stop-the-world to disable write barrier and swap bitmaps
  4. Sweep: concurrent; frees unmarked objects

The write barrier intercepts pointer writes during marking so the GC doesn't miss reachable objects that your code is moving around. Go uses a hybrid Yuasa-Dijkstra barrier that shades both old and new pointer targets.

Recent releases add the Green Tea GC (introduced behind GOEXPERIMENT=greenteagc in Go 1.25), which improves mark-phase locality by batching objects within the same span before scanning them.

So what

Knowing these internals changes how I write Go in a few concrete ways.

Reduce heap allocations

Every heap allocation is work the GC must trace. The compiler's escape analysis decides what goes on the heap. Check its decisions:

go build -gcflags='-m' ./...

Common ways to keep things off the heap:

Prefer values over pointers in collections

The GC scans every pointer in a live object. A []User where User has no pointer fields is cheaper for the GC than a []*User, because the collector doesn't need to chase each element.

When structs are small, pass and store them by value.

Tune GC for your workload

The GC triggers based on heap growth. Two knobs control it (see the GC guide):

import "runtime/debug"

// Set at startup, or via environment variables
// GOGC=200 GOMEMLIMIT=512MiB ./myapp
debug.SetGCPercent(200)
debug.SetMemoryLimit(512 << 20)

Don't fear goroutines, but don't ignore them

Goroutines are cheap (~2KB stack, recycled when done) but each live goroutine is a GC root the collector must scan. Millions of goroutines are fine if they're short-lived. Millions of long-lived, blocked goroutines holding pointers add GC pressure.

Use the profiling tools

go test -bench=. -benchmem       # allocations per op
go test -cpuprofile cpu.prof     # CPU profile
go test -memprofile mem.prof     # memory profile
go tool pprof cpu.prof           # analyze a profile
GODEBUG=gctrace=1 ./myapp        # log each GC cycle
go test -trace trace.out         # record a runtime trace
go tool trace trace.out          # scheduler/GC timeline

-benchmem is especially useful: if allocs/op drops to zero, you've moved everything to the stack.

GOMAXPROCS rarely needs changing

It defaults to the number of CPU cores, which is right for most workloads. Raising it beyond that rarely helps: GOMAXPROCS only controls how many Ps exist, and thus how many goroutines can run Go code truly in parallel. The runtime already creates more Ms than Ps when needed to handle blocking syscalls.

Channel patterns get scheduler help

When a goroutine sends on a channel and the receiver is ready, the scheduler uses the runnext slot to run the receiver immediately on the same P. This means tight producer-consumer pairs on channels have low scheduling latency by design.
