Scope: This post covers modeling the OpenRTB 2.6 bid request spec in Go, profiling encoding/json with pprof, and benchmarking three parsing approaches. It does not cover bid evaluation logic, auction mechanics, or production deployment. All code is available on GitHub.
Where Go fits in an ad stack
The auction core is C++. Around it lives a services layer — validation, pacing, reporting pipelines, ad ops APIs — written in Go at most companies. A bid request arrives at a Go validation service, which checks schema compliance, applies publisher-level rules, and attaches fraud scores before forwarding to the C++ core.
At this validation tier you might handle 50k–200k requests per second. JSON parsing isn't a microsecond concern, but it isn't free either. At 50k RPS, a 7.7µs parse time consumes 385 ms of CPU time every second (about 38% of one core) just for deserialization. Every allocation creates work for the garbage collector, and GC pauses, even the short ones Go produces, affect tail latencies at this throughput.
This post is about understanding that cost and reducing it.
The OpenRTB 2.6 data model
The IAB Tech Lab's OpenRTB specification defines the JSON wire format for programmatic ad transactions. A bid request describes one ad opportunity — who's available to see an ad, on what surface, with what constraints.
The spec is designed to be universal — display, video, CTV, mobile, and native formats in a single schema. The consequence: most fields are optional. A banner request on desktop web doesn't carry video.protocols. A mobile app request doesn't need site.page. Out of roughly 200 fields across the full spec, a realistic bid request might populate 20–30.
The optional field problem
In C++, the natural tool here is std::optional<T>. Go has no language-level equivalent. The idiomatic approach for optional JSON fields is pointer types: *int, *float64, *string. A nil pointer means the field was absent in the JSON; a non-nil pointer means it was present, even if the value is zero.
This distinction is not cosmetic. Consider bidfloor:
BidFloor *float64 `json:"bidfloor,omitempty"`
A nil BidFloor means no floor was set — bid anything. A non-nil BidFloor with value 0.0 means the floor is explicitly zero — a meaningful instruction to the auction. If you used float64 instead of *float64, both cases would decode as 0.0 and you'd silently lose the distinction.
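The difference is easy to check with a small program. This is a minimal sketch: `impPtr`, `impVal`, and `describeFloor` are hypothetical names made up for the illustration, and each struct carries only the one field under discussion.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// impPtr models bidfloor as a pointer, so absence is observable.
type impPtr struct {
	BidFloor *float64 `json:"bidfloor,omitempty"`
}

// impVal models bidfloor as a plain value, so absence collapses to 0.
type impVal struct {
	BidFloor float64 `json:"bidfloor,omitempty"`
}

// describeFloor decodes the same fragment with both models and reports
// what each one can tell about the floor.
func describeFloor(raw string) (present bool, ptrValue, plainValue float64, err error) {
	var p impPtr
	if err = json.Unmarshal([]byte(raw), &p); err != nil {
		return false, 0, 0, err
	}
	var v impVal
	if err = json.Unmarshal([]byte(raw), &v); err != nil {
		return false, 0, 0, err
	}
	if p.BidFloor != nil {
		present, ptrValue = true, *p.BidFloor
	}
	return present, ptrValue, v.BidFloor, nil
}

func main() {
	for _, raw := range []string{`{}`, `{"bidfloor":0}`, `{"bidfloor":1.25}`} {
		present, ptrValue, plainValue, err := describeFloor(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%-18s pointer: present=%-5t value=%.2f | plain value=%.2f\n",
			raw, present, ptrValue, plainValue)
	}
}
```

For the first two inputs the plain-value model reports the same 0.00, while the pointer model still tells "absent" apart from "explicitly zero".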
Every optional scalar field in the spec gets a pointer type. Every optional nested object gets a pointer to struct. Slices ([]Imp, []string) are naturally nil when absent.
One more detail: every OpenRTB object has an ext field for non-standard extensions — buyer/seller-specific data not in the base spec. Model it as json.RawMessage:
Ext json.RawMessage `json:"ext,omitempty"`
This preserves the raw bytes without decoding them, so extension data passes through intact even if your service doesn't understand it.
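A short round trip shows the pass-through behavior. This is a sketch under stated assumptions: the `site` struct is trimmed down for the example, and the `vendor_x` extension payload is invented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// site is a trimmed-down, illustrative object with an ext field.
type site struct {
	Domain string          `json:"domain,omitempty"`
	Ext    json.RawMessage `json:"ext,omitempty"`
}

// roundTrip decodes a site object and re-encodes it, as a pass-through
// service would before forwarding the request downstream.
func roundTrip(in []byte) ([]byte, error) {
	var s site
	if err := json.Unmarshal(in, &s); err != nil {
		return nil, fmt.Errorf("decode: %w", err)
	}
	return json.Marshal(&s)
}

func main() {
	// The ext payload is made up; the service has no struct for it.
	in := []byte(`{"domain":"example.com","ext":{"vendor_x":{"score":0.93,"tags":["a","b"]}}}`)
	out, err := roundTrip(in)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(out))
	fmt.Println("unchanged:", string(out) == string(in))
}
```

One caveat worth knowing: when re-encoding, encoding/json compacts whitespace inside the raw message and applies its usual HTML escaping, so the bytes are semantically preserved rather than guaranteed byte-identical for every input.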
Baseline: encoding/json
The stdlib implementation is a direct mapping from spec to struct:
// Spec: OpenRTB 2.6, §3.2.1
type BidRequest struct {
	ID     string          `json:"id"`
	Imp    []Imp           `json:"imp"`
	Site   *Site           `json:"site,omitempty"`
	App    *App            `json:"app,omitempty"`
	User   *User           `json:"user,omitempty"`
	Device *Device         `json:"device,omitempty"`
	AT     *int            `json:"at,omitempty"`
	TMax   *int            `json:"tmax,omitempty"`
	BCat   []string        `json:"bcat,omitempty"`
	BAdv   []string        `json:"badv,omitempty"`
	Ext    json.RawMessage `json:"ext,omitempty"`
}
The struct tags (json:"...") map Go's exported field names to the spec's lowercase JSON keys. For unmarshaling, encoding/json does case-insensitive key matching, so "id" in JSON would match an ID struct field even without a tag. But the tag ensures exact key mapping in both directions: without it, marshaling would emit "ID" instead of "id", producing non-compliant output. The omitempty modifier skips the field during marshaling if it is empty: a nil pointer, a nil or zero-length slice, an empty string, or a zero number. A non-nil pointer to a zero value is still emitted, which is what keeps an explicit bidfloor of 0 on the wire.
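Both effects of the tags can be observed directly. The sketch below uses two hypothetical two-field structs, `tagged` and `untagged`, rather than the full BidRequest.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagged follows the spec's lowercase keys; untagged relies on Go defaults.
type tagged struct {
	ID   string `json:"id"`
	TMax *int   `json:"tmax,omitempty"`
}

type untagged struct {
	ID   string
	TMax *int
}

// encode marshals v and returns the JSON as a string.
func encode(v any) (string, error) {
	b, err := json.Marshal(v)
	return string(b), err
}

func main() {
	// Decoding: the untagged struct still accepts the lowercase key "id".
	var u untagged
	if err := json.Unmarshal([]byte(`{"id":"req-1"}`), &u); err != nil {
		fmt.Println(err)
		return
	}

	zero := 0
	withTags, _ := encode(tagged{ID: u.ID})
	withoutTags, _ := encode(untagged{ID: u.ID})
	zeroTMax, _ := encode(tagged{ID: u.ID, TMax: &zero})

	fmt.Println(withTags)    // {"id":"req-1"}: nil TMax is omitted
	fmt.Println(withoutTags) // {"ID":"req-1","TMax":null}: Go names leak out
	fmt.Println(zeroTMax)    // {"id":"req-1","tmax":0}: pointer to 0 is kept
}
```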
Parsing is a thin wrapper around json.Unmarshal:
// Parse decodes a raw OpenRTB bid request with encoding/json.
func Parse(data []byte) (*BidRequest, error) {
	var br BidRequest
	if err := json.Unmarshal(data, &br); err != nil {
		return nil, fmt.Errorf("parse: %w", err)
	}
	return &br, nil
}
Baseline benchmark on Apple M2 Pro, Go 1.26.2, input is a banner bid request with site, publisher, device, and user populated — 804 bytes of JSON:
goos: darwin
goarch: arm64
pkg: github.com/nsmkhn/rtbench/openrtb
cpu: Apple M2 Pro
BenchmarkParse_StdlibJSON 7,685 ns/op 1,840 B/op 49 allocs/op
7.7 microseconds. 49 heap allocations per parse.
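The benchmark function itself is not shown in this post, so here is a rough sketch of the general shape such a benchmark takes. Everything in it is an assumption made for illustration: the cut-down `bidRequest` struct, the tiny `sample` payload (not the 804-byte request measured above), and the helper names. It uses `testing.Benchmark` so that it runs as an ordinary program; in a real suite the body of `benchmarkParse` would live in a `Benchmark*` function in a `_test.go` file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// bidRequest is a cut-down stand-in for the full BidRequest struct.
type bidRequest struct {
	ID   string `json:"id"`
	TMax *int   `json:"tmax,omitempty"`
	Site *struct {
		Domain string `json:"domain,omitempty"`
	} `json:"site,omitempty"`
}

// sample is a small illustrative payload, not the request benchmarked in this post.
var sample = []byte(`{"id":"req-1","tmax":120,"site":{"domain":"example.com"}}`)

// parse mirrors the thin json.Unmarshal wrapper shown earlier.
func parse(data []byte) (*bidRequest, error) {
	var br bidRequest
	if err := json.Unmarshal(data, &br); err != nil {
		return nil, fmt.Errorf("parse: %w", err)
	}
	return &br, nil
}

// benchmarkParse has the shape of a regular Benchmark* function.
func benchmarkParse(b *testing.B) {
	b.ReportAllocs() // same effect as passing -benchmem
	for i := 0; i < b.N; i++ {
		if _, err := parse(sample); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark runs the function outside `go test`.
	res := testing.Benchmark(benchmarkParse)
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		res.NsPerOp(), res.AllocedBytesPerOp(), res.AllocsPerOp())
}
```

The numbers this prints will not match the table in this post, since the payload and struct are much smaller.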
What pprof reveals
Before reaching for a faster library, the right move is to profile and understand where the time goes. For benchmarks, the simplest path is the built-in -cpuprofile flag:
go test -bench=BenchmarkParse_StdlibJSON -benchmem -cpuprofile=cpu.prof ./openrtb/
go tool pprof -top cpu.prof
For a standalone profiling pass (useful when you want to control iteration count or compare implementations outside the test harness), Go's runtime/pprof package works directly:
f, err := os.Create("cpu.prof")
if err != nil {
	log.Fatalf("could not create cpu.prof: %v", err)
}
defer f.Close()

if err := pprof.StartCPUProfile(f); err != nil {
	log.Fatalf("could not start CPU profile: %v", err)
}
defer pprof.StopCPUProfile()

// Parse the same payload repeatedly so the profile collects enough samples.
for range 200_000 {
	openrtb.Parse(data)
}
The -top output tells the story clearly:
0.09s 8.41% encoding/json.stateInString
0.09s 8.41% runtime.madvise
0.07s 6.54% encoding/json.checkValid
0.06s 5.61% encoding/json.(*decodeState).scanWhile
0.02s 1.87% encoding/json.indirect (cum: 0.17s, 15.89%)
encoding/json.checkValid at 6.54% — stdlib's JSON decoder makes two passes over the input: a validation pass to verify the JSON is well-formed, then a decode pass. You pay the cost of reading every byte twice on every call.
encoding/json.indirect (1.87% flat, 15.89% cumulative): this is the reflection machinery that follows pointer chains. The flat time is small, but the cumulative time tells the real story: once you include everything that indirect calls, it accounts for nearly 16% of total parse time. For every optional field (*Site, *int, *float64), stdlib calls indirect to decide whether to allocate a new value or dereference an existing pointer. With 30+ pointer fields across the object graph, this overhead compounds.
String processing is scattered across many functions: stateInString (8.41%), scanWhile (5.61%), rescanLiteral (3.74%), and unquoteBytes (3.74%) are all part of the JSON state machine. None dominates alone, but these four add up to about 21.5% of the CPU. This fragmentation across many state transitions is the cost of the generic, reflection-based approach.
runtime.madvise at 8.41%: memory management overhead from the 200k × 49 allocations = 9.8 million heap objects created during the profile. The Go runtime is repeatedly negotiating memory page ownership with the kernel as the heap grows and shrinks. This level of madvise activity is specific to the benchmark's high allocation churn; in a steady-state service with a warmed heap, this cost drops significantly.
The profile confirms that stdlib's cost is structural, not incidental. Two full passes over the input and runtime reflection on every field are not things you can tune away at the call site.
json-iterator: cached reflection
github.com/json-iterator/go is a drop-in stdlib replacement that caches reflection metadata aggressively. The first time it encounters a *BidRequest, it builds an internal type descriptor covering every field and how to populate it. Subsequent parses reuse that descriptor instead of re-deriving it through the reflection system.
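import jsoniter "github.com/json-iterator/go"

var jsonIterator = jsoniter.ConfigCompatibleWithStandardLibrary

// parseJSONIterator is an illustrative wrapper name; it mirrors Parse above.
func parseJSONIterator(data []byte) (*BidRequest, error) {
	var br BidRequest
	if err := jsonIterator.Unmarshal(data, &br); err != nil {
		return nil, err
	}
	return &br, nil
}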
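ConfigCompatibleWithStandardLibrary is the preset intended to mirror encoding/json's behavior, including edge cases such as sorted map keys and HTML escaping on output, and it ignores unknown fields the same way. Compatibility is close rather than exact (error message text, for example, can differ), so run your existing tests against it. In code, the only change is the import and the package-level config value.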
BenchmarkParse_JsonIterator 2,185 ns/op 1,456 B/op 48 allocs/op
3.5× faster than stdlib. This is likely near the practical ceiling for a drop-in, reflection-driven decoder: caching the metadata removes the repeated type analysis, but the decoder still allocates almost as often as stdlib (48 vs 49 allocs/op).
goccy/go-json: no reflection
github.com/goccy/go-json is a JSON decoder written from scratch. The first time it sees a type, it analyzes that type and builds a specialized decoder for it, then caches and reuses that decoder: no per-field reflection on the hot path, no separate validation pass, no indirect overhead.
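import gojson "github.com/goccy/go-json"

// parseGoJSON is an illustrative wrapper name; it mirrors Parse above.
func parseGoJSON(data []byte) (*BidRequest, error) {
	var br BidRequest
	if err := gojson.Unmarshal(data, &br); err != nil {
		return nil, err
	}
	return &br, nil
}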
BenchmarkParse_GoJson 1,590 ns/op 1,963 B/op 23 allocs/op
4.8× faster than stdlib. Allocations drop from 49 to 23 — a 53% reduction.
The bytes-per-op is higher than json-iterator's (1,963 vs 1,456), and that is not free: Go's collector paces its cycles on heap growth in bytes, so more bytes per parse means cycles start somewhat sooner. The tradeoff still arguably favors go-json at this payload size: every allocation carries a fixed cost in the allocator, and roughly half as many heap objects means fewer objects for the runtime to track.
The -top profile output confirms what the speedup suggests. Profiling 200k iterations, stdlib took 1.54 seconds total and goccy/go-json took 385 ms: 4× faster to complete the same work. The top of the go-json profile:
30ms 12.50% github.com/goccy/go-json/internal/decoder.(*stringDecoder).decodeByte
20ms 8.33% github.com/goccy/go-json/internal/decoder.decodeKeyByBitmapUint16
20ms 8.33% github.com/goccy/go-json/internal/decoder.decodeKeyByBitmapUint8
20ms 8.33% runtime.madvise
10ms 4.17% github.com/goccy/go-json.unmarshal (cum: 100ms, 41.67%)
No checkValid. No indirect. No reflection machinery. Time is spent almost entirely in type-specialized decoding code (decodeKeyByBitmapUint* are direct key lookups in the JSON object, stringDecoder.decodeByte is direct string processing). The decoder does essentially what you'd write by hand: read the key, match it against known field names, assign the value.
A benchmark contamination story
During this work I ran an intermediate experiment with github.com/mailru/easyjson — a code-generation approach that runs go generate to produce UnmarshalJSON methods on all your types. When those methods are present, encoding/json finds that your types implement json.Unmarshaler and calls the generated code instead of using reflection.
The result was unexpected: the "stdlib" benchmark got meaningfully faster after running the generator, and json-iterator regressed on allocations. The benchmarks were no longer measuring what their names implied.
This is a subtle trap. Any library that adds UnmarshalJSON or MarshalJSON methods to your types changes the behavior of every JSON benchmark in the same package. Baseline numbers must be captured before introducing such a library — or in a separate binary that doesn't link it. (For what it's worth, easyjson's generated code was comparable to goccy/go-json in throughput, but the contamination issue made it unsuitable for fair comparison in the same benchmark suite.)
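The mechanism can be reproduced with the standard library alone. In this sketch the `generated` type and its hand-written `UnmarshalJSON` are hypothetical stand-ins for what a code generator would attach (this is not real easyjson output), and `implementsUnmarshaler` is a made-up helper that a baseline benchmark could use as a guard.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plain has no custom methods, so encoding/json decodes it via reflection.
type plain struct {
	ID string `json:"id"`
}

// generated stands in for a type after a generator has attached UnmarshalJSON.
type generated struct {
	ID        string `json:"id"`
	viaCustom bool   // set when UnmarshalJSON ran; unexported, so ignored by JSON
}

// UnmarshalJSON is a hypothetical placeholder for generated decoding code.
func (g *generated) UnmarshalJSON(data []byte) error {
	g.viaCustom = true
	// fields has the same layout but no methods, which avoids recursing
	// back into this UnmarshalJSON.
	type fields struct {
		ID string `json:"id"`
	}
	var f fields
	if err := json.Unmarshal(data, &f); err != nil {
		return err
	}
	g.ID = f.ID
	return nil
}

// implementsUnmarshaler reports whether v would bypass the reflection path.
func implementsUnmarshaler(v any) bool {
	_, ok := v.(json.Unmarshaler)
	return ok
}

// decodedViaCustom reports whether json.Unmarshal dispatched to UnmarshalJSON.
func decodedViaCustom(data []byte) (bool, error) {
	var g generated
	if err := json.Unmarshal(data, &g); err != nil {
		return false, err
	}
	return g.viaCustom, nil
}

func main() {
	custom, err := decodedViaCustom([]byte(`{"id":"req-1"}`))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("plain implements json.Unmarshaler:    ", implementsUnmarshaler(&plain{}))
	fmt.Println("generated implements json.Unmarshaler:", implementsUnmarshaler(&generated{}))
	fmt.Println("json.Unmarshal used the custom method:", custom)
}
```

A check like `implementsUnmarshaler(&BidRequest{})` at the top of a baseline benchmark is one possible way to fail fast if generated methods have crept into the package.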
Results
Benchmarks run on Apple M2 Pro, Go 1.26.2, -bench=. -benchmem -count=5 (median of 5 runs reported). Input: a banner bid request with site, publisher, device, and user fully populated — 804 bytes of JSON.
| Implementation | ns/op | B/op | allocs/op | vs baseline | RPS/core @ 50% CPU |
|---|---|---|---|---|---|
| encoding/json | 7,685 | 1,840 | 49 | 1× | ~65k |
| json-iterator | 2,185 | 1,456 | 48 | 3.5× | ~229k |
| goccy/go-json | 1,590 | 1,963 | 23 | 4.8× | ~314k |
The RPS column is 500ms ÷ ns/op — how many parses fit in half a second per core, leaving the other half for validation logic, network I/O, and everything else.
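The same arithmetic, written out as a small calculator. `rpsPerCore` and `cpuShare` are hypothetical helper names; the ns/op inputs are the measured values from the table.

```go
package main

import "fmt"

// rpsPerCore returns how many parses fit into the given share of one
// core-second, e.g. budget=0.5 for the 500 ms used in the table.
func rpsPerCore(nsPerOp, budget float64) float64 {
	return budget * 1e9 / nsPerOp
}

// cpuShare returns the fraction of one core spent parsing at a given rate.
func cpuShare(rps, nsPerOp float64) float64 {
	return rps * nsPerOp / 1e9
}

func main() {
	results := []struct {
		name    string
		nsPerOp float64
	}{
		{"encoding/json", 7685},
		{"json-iterator", 2185},
		{"goccy/go-json", 1590},
	}
	for _, r := range results {
		fmt.Printf("%-14s %8.0f parses/s at 50%% of a core, %5.1f%% of a core at 50k RPS\n",
			r.name, rpsPerCore(r.nsPerOp, 0.5), 100*cpuShare(50_000, r.nsPerOp))
	}
}
```

Changing the budget argument is a quick way to see how much headroom each library leaves if parsing is allowed less (or more) than half a core.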
In adtech, mean latency is rarely the constraint; P99 is. Fewer and smaller allocations mean less allocator and garbage-collector work per request, and that work is what tends to inflate tail latencies. For production tail behavior, the allocation columns (allocs/op and B/op) deserve at least as much attention as the ns/op column.
A note on library stability: json-iterator is the more battle-tested choice — it's been in production at scale for years and its behavior is well-understood. goccy/go-json is faster but community-driven and moves quickly; pin your dependency and test after upgrades.
Production considerations
Concurrency is the real multiplier
The benchmarks above are single-goroutine. A real validation service handles many requests concurrently. Running the same benchmark with 10 goroutines in parallel:
| Benchmark | ns/op | B/op | allocs/op |
|---|---|---|---|
| goccy/go-json single-goroutine | 1,590 | 1,963 | 23 |
| goccy/go-json parallel (10 goroutines) | 911 | 1,965 | 23 |
go-json scales cleanly under concurrent load: each goroutine's allocations are independent, so there's no contention on the parser itself. Per-op time drops 43% (1,590 ns to 911 ns), which is about 1.75× the throughput, and that compounds on top of the 4.8× library improvement.
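For reference, a parallel benchmark in Go is usually written with `b.RunParallel`. The sketch below is an assumption about the general shape, not the benchmark used for the table: it swaps in encoding/json to stay dependency-free, uses a made-up payload, and relies on RunParallel's default of GOMAXPROCS goroutines, which may or may not equal the 10 used above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// bidRequest is a cut-down stand-in for the full BidRequest struct.
type bidRequest struct {
	ID  string `json:"id"`
	Imp []struct {
		ID       string   `json:"id"`
		BidFloor *float64 `json:"bidfloor,omitempty"`
	} `json:"imp"`
}

// sample is a small illustrative payload.
var sample = []byte(`{"id":"req-1","imp":[{"id":"1","bidfloor":0.5}]}`)

func parse(data []byte) (*bidRequest, error) {
	var br bidRequest
	if err := json.Unmarshal(data, &br); err != nil {
		return nil, err
	}
	return &br, nil
}

// benchmarkParseParallel spreads b.N iterations across goroutines.
func benchmarkParseParallel(b *testing.B) {
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if _, err := parse(sample); err != nil {
				b.Error(err) // Fatal must not be called from worker goroutines
				return
			}
		}
	})
}

// runParallelBenchmark executes the parallel benchmark outside `go test`.
func runParallelBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(benchmarkParseParallel)
}

func main() {
	res := runParallelBenchmark()
	fmt.Printf("parallel: %d ns/op, %d allocs/op over %d iterations\n",
		res.NsPerOp(), res.AllocsPerOp(), res.N)
}
```

Note that ns/op in a parallel benchmark is wall time divided by total iterations, so it reflects aggregate throughput rather than the latency of a single parse.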
Parse only what you need
The full OpenRTB 2.6 spec has ~200 fields. A validation service typically reads 20–30 of them: request ID, impression IDs, bid floors, domain, device identifiers for fraud checks. Unmarshaling the remaining fields into Go structs wastes both CPU and memory. Define a purpose-built struct with only the fields your service reads — the JSON decoder ignores keys it has no destination for, and you stop paying to allocate structs you never inspect.
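As a rough illustration, a validation-only view of the request might look like the sketch below. The `slimRequest` type and its field selection are hypothetical; which fields a real service needs depends on its rules. The sample payload is invented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// slimRequest is a hypothetical validation-only view of a bid request.
// Keys without a matching field (banner, publisher, geo, user, ...) are
// skipped by the decoder instead of being materialized.
type slimRequest struct {
	ID  string `json:"id"`
	Imp []struct {
		ID       string   `json:"id"`
		BidFloor *float64 `json:"bidfloor,omitempty"`
	} `json:"imp"`
	Site *struct {
		Domain string `json:"domain,omitempty"`
	} `json:"site,omitempty"`
	Device *struct {
		IP string `json:"ip,omitempty"`
		UA string `json:"ua,omitempty"`
	} `json:"device,omitempty"`
}

// parseSlim decodes only the fields declared on slimRequest.
func parseSlim(data []byte) (*slimRequest, error) {
	var r slimRequest
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, fmt.Errorf("parse slim: %w", err)
	}
	return &r, nil
}

func main() {
	data := []byte(`{"id":"req-1","tmax":120,` +
		`"imp":[{"id":"1","bidfloor":0.5,"banner":{"w":300,"h":250}}],` +
		`"site":{"domain":"example.com","page":"https://example.com/a","publisher":{"id":"pub-9"}},` +
		`"device":{"ip":"203.0.113.7","ua":"Mozilla/5.0","geo":{"country":"USA"}},` +
		`"user":{"id":"u-1"}}`)

	r, err := parseSlim(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(r.ID, len(r.Imp), *r.Imp[0].BidFloor, r.Site.Domain, r.Device.IP)
}
```

One design consequence: a slim struct cannot be re-marshaled to reproduce the request, because the skipped keys are gone. A service using this approach would need to forward the original bytes downstream rather than its decoded view.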
sync.Pool won't save you here — on its own
The natural instinct at this allocation count is to pool the BidRequest struct — reuse the top-level object, avoid the allocation. In practice it saves only 1 of 23 allocations. The nested structs (Site, Device, User) are populated from JSON into newly allocated values on every parse regardless. Pooling the outermost struct while the inner graph allocates freely is a marginal win — unless you rethink the allocation model entirely. That's where Part 2 picks up.
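To make the limitation concrete, here is a minimal sketch of top-level pooling. It uses encoding/json and a three-field `bidRequest` as stand-ins, and the names `requestPool`, `parsePooled`, and `release` are invented for the example. The reset before decoding is what prevents stale data from a previous request, and it is also why the nested values get allocated again.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

type site struct {
	Domain string `json:"domain,omitempty"`
}

// bidRequest is a cut-down stand-in for the full BidRequest struct.
type bidRequest struct {
	ID   string `json:"id"`
	TMax *int   `json:"tmax,omitempty"`
	Site *site  `json:"site,omitempty"`
}

// requestPool recycles only the top-level struct.
var requestPool = sync.Pool{New: func() any { return new(bidRequest) }}

// parsePooled decodes into a pooled struct. The caller must call release
// once it is done with the result.
func parsePooled(data []byte) (*bidRequest, error) {
	br := requestPool.Get().(*bidRequest)
	// Reset so fields from a previous request cannot leak into this one.
	// This drops the old nested pointers, so Site and TMax are allocated
	// again during decoding.
	*br = bidRequest{}
	if err := json.Unmarshal(data, br); err != nil {
		requestPool.Put(br)
		return nil, fmt.Errorf("parse: %w", err)
	}
	return br, nil
}

// release returns a request to the pool; it must not be used afterwards.
func release(br *bidRequest) {
	requestPool.Put(br)
}

func main() {
	first, err := parsePooled([]byte(`{"id":"a","tmax":100,"site":{"domain":"example.com"}}`))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(first.ID, first.Site.Domain, *first.TMax)
	release(first)

	// Whether or not the pool hands back the same struct, no stale Site
	// or TMax from the first request is visible here.
	second, err := parsePooled([]byte(`{"id":"b"}`))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(second.ID, second.Site == nil, second.TMax == nil)
	release(second)
}
```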
json.RawMessage for extension fields
Model every ext field as json.RawMessage. Extension data from exchanges passes through your validation layer intact to downstream systems, even for extensions your service doesn't understand. The alternative — dropping unknown fields — is silent data loss.
Conclusion
JSON parsing is a significant cost at this scale. encoding/json isn't the right default for high-throughput paths: the two-pass design and runtime reflection are structural costs visible directly in the profiler. goccy/go-json is a near-transparent swap that delivers 4.8× throughput and cuts allocations per parse roughly in half (49 to 23), with the profiler output to explain why.
Coming from C++, the thing that surprised me most wasn't the performance characteristics. It was the tooling. go test -bench and go tool pprof are first-class, zero-configuration, and genuinely informative.
For new high-throughput OpenRTB paths: default to goccy/go-json, profile first, and reach for json-iterator if you need a more conservative dependency.