Scope: This post builds on Part 1's benchmarks. It implements a hand-written decoder for the same OpenRTB types and optimizes it until it beats goccy/go-json. All code is on GitHub.
The question
The go-json profile from Part 1 was instructive:
30ms 12.50% github.com/goccy/go-json/internal/decoder.(*stringDecoder).decodeByte
20ms 8.33% github.com/goccy/go-json/internal/decoder.decodeKeyByBitmapUint16
20ms 8.33% github.com/goccy/go-json/internal/decoder.decodeKeyByBitmapUint8
No reflection. No two-pass validation. Just string processing and key dispatch — exactly what you'd write by hand. The question is whether a hand-written decoder can match it, and what you learn about performance by trying.
This isn't a production recommendation. It's an exercise in understanding exactly what the library is doing.
Building the lexer
A JSON lexer for OpenRTB doesn't need to be general-purpose. Bid request keys are short, known strings. Values are strings, integers, floats, and nested objects. There's no arbitrary nesting depth to manage.
The lexer advances a cursor through a []byte and exposes the pieces the decoder needs:
type lexer struct {
	input []byte
	pos   int
}
Key operations: readKey scans the key string, readNumberBytes returns raw digit bytes, readStringVal returns a decoded string, scanRaw skips an entire JSON value without decoding (used for ext fields), readSep consumes , or the closing } / ].
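To make the skip operation concrete, here is a minimal sketch of what a scanRaw-style helper might look like. The name skipValue and its signature are hypothetical, not the post's actual code. It assumes minified, well-formed JSON and that pos points at the first byte of a value.

```go
package main

import (
	"errors"
	"fmt"
)

// skipValue is a hypothetical stand-in for scanRaw. It returns the index just
// past the JSON value starting at input[pos], without decoding anything.
// Assumes minified, well-formed input.
func skipValue(input []byte, pos int) (int, error) {
	depth := 0
	for i := pos; i < len(input); i++ {
		switch input[i] {
		case '"':
			// Skip the string body so brackets inside strings are not counted.
			i++
			for i < len(input) && input[i] != '"' {
				if input[i] == '\\' {
					i++ // skip the escaped byte
				}
				i++
			}
			if i >= len(input) {
				return 0, errors.New("unterminated string")
			}
			if depth == 0 {
				return i + 1, nil // the value was a bare string
			}
		case '{', '[':
			depth++
		case '}', ']':
			if depth == 0 {
				return i, nil // scalar ended by the parent's closing bracket
			}
			depth--
			if depth == 0 {
				return i + 1, nil // closed the container we started in
			}
		case ',':
			if depth == 0 {
				return i, nil // scalar ended by a separator
			}
		}
	}
	if depth == 0 && pos < len(input) {
		return len(input), nil // top-level scalar running to end of input
	}
	return 0, errors.New("unexpected end of input")
}

func main() {
	in := []byte(`{"ext":{"a":"}","b":[1,2]},"id":"x"}`)
	end, err := skipValue(in, 7) // index 7 is the '{' that opens the ext value
	fmt.Println(string(in[7:end]), err)
}
```

The only state is a depth counter and the cursor, which is why skipping an ext object costs no allocations regardless of its contents.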
The zero-copy string path
The hot case for readStringVal — no backslash escapes — avoids any allocation:
func (l *lexer) scanString() ([]byte, error) {
	l.pos++ // skip opening '"'
	start := l.pos
	for {
		n := indexStopByte(l.input[l.pos:])
		// ...
		if l.input[l.pos+n] == '"' {
			val := l.input[start : l.pos+n] // slice into input, no copy
			l.pos += n + 1
			return val, nil
		}
		// handle escape sequence...
	}
}
The returned []byte is a slice into the original input buffer. Converting it to a string without copying:
func bytesToString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}
unsafe.String (Go 1.20+) constructs a string header pointing at the same memory. That is valid only as long as the caller doesn't modify input while the string is live; ParseFast documents this contract. In practice this means: if you need to log a field, pass the BidRequest to another goroutine, or otherwise hold it past the point where the input buffer could be reused (e.g. returned to a sync.Pool), copy the strings first. The zero-copy path is safe for the common case of parse → use → release within a single request handler.
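A small demonstration of the hazard, as a sketch: the helper views is hypothetical, and the buffer overwrite deliberately breaks the contract to show what a reused buffer does to a zero-copy string. strings.Clone is one way to take the private copy.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

func bytesToString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// views returns a zero-copy alias of buf plus an independent copy.
// (Hypothetical helper, for illustration only.)
func views(buf []byte) (alias, owned string) {
	alias = bytesToString(buf)   // shares memory with buf
	owned = strings.Clone(alias) // private copy, safe to keep
	return alias, owned
}

func main() {
	buf := []byte("req-1")
	alias, owned := views(buf)

	// Simulate the input buffer being refilled by the next request.
	// This is exactly what the ParseFast contract forbids while alias is live.
	copy(buf, "req-2")

	fmt.Println("alias:", alias) // now reflects the overwritten buffer
	fmt.Println("owned:", owned) // still the original value
}
```

The aliased string silently changes under the reader, which is why anything that outlives the handler needs the clone.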
The decoder
The decoder is a loop over JSON keys with a switch dispatching to field assignments:
key, err := l.readKey()
// ...
switch string(key) {
case "id":
	br.ID, err = l.readStringVal()
case "imp":
	br.Imp, err = decodeImpSlice(l, nil, nil)
case "site":
	br.Site = new(Site)
	err = decodeSite(l, br.Site)
// ...
default:
	_, err = l.scanRaw() // skip unknown fields
}
This is the hand-written equivalent of decodeKeyByBitmapUint* from the go-json profile. The generated version uses a precomputed bitmap to dispatch on the first bytes of the key — faster on very large structs — but the logical operation is the same.
Optional pointer fields follow a fixed pattern:
case "at":
	val, _ = l.readNumberBytes()
	n, _ = parseIntBytes(val)
	br.AT = &n // n escapes to the heap
The escape is unavoidable: &n outlives the stack frame and the compiler allocates n on the heap. This is the direct cost of *int in the type definition — one allocation per optional integer field.
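The cost is easy to measure in isolation. The sketch below uses a hypothetical two-field Request type (not the real BidRequest) and testing.AllocsPerRun to compare a pointer field against an inline field.

```go
package main

import (
	"fmt"
	"testing"
)

// Request is a hypothetical stand-in for BidRequest.
type Request struct {
	AT   *int // optional: nil means absent
	TMax int  // stored inline
}

var sink *Request // keeps the request reachable so stores are not optimized away

func setOptional(r *Request) {
	n := 2
	r.AT = &n // n escapes: one heap allocation per call
}

func setInline(r *Request) {
	r.TMax = 120 // plain store, no allocation
}

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func(*Request)) float64 {
	r := &Request{}
	sink = r
	return testing.AllocsPerRun(1000, func() { f(r) })
}

func main() {
	fmt.Printf("pointer field: %.0f allocs/op\n", allocsPerCall(setOptional))
	fmt.Printf("inline field:  %.0f allocs/op\n", allocsPerCall(setInline))
}
```

Running with go build -gcflags=-m should also report the variable as moved to the heap, which is a quick way to confirm an escape without a benchmark.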
All sub-decoders accept a *T instead of allocating internally. decodeSite(l *lexer, site *Site) fills into whatever memory the caller provides, which is what makes the arena optimization in the final section possible.
Profiling the first working version
With the full decoder working, the first stop is the profiler:
go test -bench=BenchmarkParse_HandWritten -cpuprofile=cpu.prof ./openrtb/
go tool pprof -top cpu.prof
strconv.Atoi at 9%
Every *int field — at, tmax, secure, pos, devicetype, yob — went through strconv.Atoi(string(val)). That conversion allocates a temporary string. strconv.Atoi also handles overflow detection and non-numeric input that a valid bid request will never send.
Replacing it with an inline loop:
func parseIntBytes(b []byte) (int, error) {
	neg := b[0] == '-'
	if neg {
		b = b[1:]
	}
	n := 0
	for _, c := range b {
		n = n*10 + int(c-'0')
	}
	if neg {
		return -n, nil
	}
	return n, nil
}
No allocation, no string conversion, handles only what the spec can actually send. strconv.Atoi disappeared from the profile entirely. The tradeoff: no overflow check and no digit validation. A malformed 20-digit value wraps silently, a non-digit byte produces a garbage result, and an empty slice panics at b[0]. That's acceptable here because this decoder is not a general-purpose JSON parser; input validation is the caller's responsibility.
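For comparison, here is a sketch of what a stricter variant could look like if the input were not trusted. parseIntChecked and errBadInt are hypothetical names, not part of the post's code. It still avoids the string conversion and only allocates nothing on the success path.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var errBadInt = errors.New("invalid or out-of-range integer")

// parseIntChecked is a hypothetical stricter sibling of parseIntBytes:
// it rejects empty input, non-digit bytes, and values that overflow int.
// (It also rejects math.MinInt, which is fine for bid request fields.)
func parseIntChecked(b []byte) (int, error) {
	neg := len(b) > 0 && b[0] == '-'
	if neg {
		b = b[1:]
	}
	if len(b) == 0 {
		return 0, errBadInt
	}
	n := 0
	for _, c := range b {
		if c < '0' || c > '9' {
			return 0, errBadInt
		}
		d := int(c - '0')
		// n*10 + d would exceed MaxInt exactly when n > (MaxInt-d)/10.
		if n > (math.MaxInt-d)/10 {
			return 0, errBadInt
		}
		n = n*10 + d
	}
	if neg {
		return -n, nil
	}
	return n, nil
}

func main() {
	fmt.Println(parseIntChecked([]byte("-42")))
	fmt.Println(parseIntChecked([]byte("99999999999999999999")))
}
```

The extra cost is two comparisons per digit, so the choice between the two versions is about trust in the input rather than about allocations.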
That left string scanning dominating the profile. That's the next target.
indexStopByte: arm64 NEON
scanString calls indexStopByte(remaining) to find the next " or \ in bulk, advancing past as many bytes as possible in one call. The benchmark machine is an M2 Pro, so the fast path targets arm64 NEON. indexstop_arm64.go declares the function stub (//go:build arm64, //go:noescape) with the implementation in assembly; indexstop_generic.go (//go:build !arm64) provides a scalar byte loop as the fallback. Everything else in the package calls indexStopByte either way. The x86 equivalent would use SSE2 (_mm_cmpeq_epi8), but that's outside the scope of this post.
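The scalar fallback is short enough to sketch in full. This is a plausible shape for indexstop_generic.go rather than the post's exact file; in particular, returning -1 when no stop byte is found is an assumption, since the not-found handling is elided in the scanString excerpt above.

```go
package main

import "fmt"

// indexStopByte returns the index of the first '"' or '\' in b.
// Returning -1 when neither occurs is an assumed convention.
// In the layout described above, this version would live in a file
// restricted to non-arm64 builds.
func indexStopByte(b []byte) int {
	for i, c := range b {
		if c == '"' || c == '\\' {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexStopByte([]byte(`Mozilla/5.0","ip"`)))
	fmt.Println(indexStopByte([]byte(`no stop bytes here`)))
}
```

Because both implementations share one signature, the rest of the package never needs to know which one was compiled in.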
On arm64, NEON processes 16 bytes per iteration:
TEXT ·indexStopByte(SB), NOSPLIT, $0-32
	MOVD b_base+0(FP), R0 // R0 = &b[0]
	MOVD b_len+8(FP), R1  // R1 = len(b)
	MOVD $0, R2           // R2 = current index
	CBZ R1, notfound
	// Only set up NEON registers if we have >= 16 bytes to scan.
	CMP $16, R1
	BLT tail
	MOVD $0x22, R3
	VDUP R3, V0.B16 // V0 = ['"' x16]
	MOVD $0x5C, R3
	VDUP R3, V1.B16 // V1 = ['\' x16]
loop16:
	// Load 16 bytes; R0 is post-incremented.
	VLD1.P 16(R0), [V2.B16]
	// V3[i] = 0xFF if V2[i] == '"', else 0x00
	VCMEQ V0.B16, V2.B16, V3.B16
	// V4[i] = 0xFF if V2[i] == '\', else 0x00
	VCMEQ V1.B16, V2.B16, V4.B16
	// V5[i] = 0xFF if V2[i] is a stop byte.
	VORR V3.B16, V4.B16, V5.B16
	// Check whether any lane matched: OR the two 64-bit halves into one GP register.
	VMOV V5.D[0], R3
	VMOV V5.D[1], R4
	ORR R4, R3, R3
	CBNZ R3, found16
	ADD $16, R2, R2
	SUB $16, R1, R1
	CMP $16, R1
	BGE loop16
VLD1.P loads 16 bytes and post-increments the pointer in one instruction. VCMEQ compares every lane simultaneously against the target byte, producing a mask vector where matched lanes are 0xFF and others are 0x00. VORR merges the two masks. Folding the result into a single GP register — VMOV V5.D[0] extracts the low 64-bit half, VMOV V5.D[1] the high — lets CBNZ branch on any match across all 16 lanes.
One dead end worth documenting: the obvious approach is a horizontal max reduction (UMAXV in ARM's own mnemonics, tried here as VMAXV), which collapses a vector to its maximum lane value in one instruction. The result lands in a SIMD scalar register rather than a GP register, so it would still need one move before the branch. Go's arm64 assembler doesn't recognize VMAXV. The two-VMOV + ORR pattern achieves the same result: if any byte in V5 is 0xFF, the OR of the two 64-bit halves is nonzero.
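The fold is easier to see in plain Go. The sketch below models one 16-byte iteration. The names foldMask and firstStopLane are hypothetical. The lane lookup via trailing zeros is only one way the found16 path could locate the match, since that part of the assembly isn't shown here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// foldMask models VCMEQ + VORR + the two VMOVs for one 16-byte block:
// it builds the per-lane mask and returns its low and high 64-bit halves.
func foldMask(block *[16]byte) (lo, hi uint64) {
	var mask [16]byte // plays the role of V5
	for i, c := range block {
		if c == '"' || c == '\\' {
			mask[i] = 0xFF // matched lanes become 0xFF
		}
	}
	lo = binary.LittleEndian.Uint64(mask[:8]) // like VMOV V5.D[0], R3
	hi = binary.LittleEndian.Uint64(mask[8:]) // like VMOV V5.D[1], R4
	return lo, hi
}

// firstStopLane returns the index of the first matching lane, or -1.
func firstStopLane(block *[16]byte) int {
	lo, hi := foldMask(block)
	if lo|hi == 0 { // ORR + CBNZ: nonzero means some lane matched
		return -1
	}
	if lo != 0 {
		return bits.TrailingZeros64(lo) / 8
	}
	return 8 + bits.TrailingZeros64(hi)/8
}

func main() {
	var blk [16]byte
	copy(blk[:], `Mozilla/5.0"},"x`)
	fmt.Println(firstStopLane(&blk))
}
```

Lane 0 sits in the least significant byte of the low half, so counting trailing zero bits and dividing by 8 yields the lane index directly.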
With a 109-byte User-Agent string in the test fixture, indexStopByte processes it in 6 NEON iterations instead of 109 scalar comparisons.
Arena allocation
After NEON, BenchmarkParse_HandWritten was tied with BenchmarkParse_GoJson. BenchmarkParse_GoJsonPool still had an edge: it reuses the top-level BidRequest via sync.Pool. Part 1 noted that pooling the outer struct while the inner graph allocates freely is a marginal win. The arena takes this further.
The idea: pack BidRequest and all commonly-allocated sub-objects into one struct, then pool that.
type Arena struct {
	BidRequest
	site   Site
	app    App
	device Device
	user   User
	impBuf [8]Imp
}
One pool entry covers the whole object graph. BidRequest, Site, App, Device, and User are no longer separate heap objects; they're laid out contiguously inside Arena. For requests with up to 8 impressions (the realistic case), the []Imp backing array also lives in the arena.
Reset on reuse
When an arena comes out of the pool, only the fields that will be written need zeroing:
arena.BidRequest = BidRequest{}
arena.site = Site{}
arena.app = App{}
arena.device = Device{}
arena.user = User{}
// impBuf is reset implicitly: decoder receives arena.impBuf[:0]
The imp slice uses arena.impBuf as its backing array. decodeImpSlice(l, arena.impBuf[:0], nil) appends into it, keeping the allocation in the arena for requests with ≤8 impressions.
The public API is ParseFastArena(data []byte) (*BidRequest, error) and ReleaseArena(br *BidRequest). Pool management is hidden from callers — the arena is obtained and returned internally.
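A stripped-down sketch of the pooling mechanics, using hypothetical simplified types and helper names (acquire, release, fill, inArena) rather than the real OpenRTB structs. It takes the arena pointer directly on release, because how the real ReleaseArena maps a *BidRequest back to its arena isn't shown in this post.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified, hypothetical stand-ins for the OpenRTB types.
type Imp struct{ ID string }
type Site struct{ Domain string }
type BidRequest struct {
	ID   string
	Site *Site
	Imp  []Imp
}

type Arena struct {
	BidRequest
	site   Site
	impBuf [8]Imp
}

var arenaPool = sync.Pool{New: func() any { return new(Arena) }}

// acquire fetches an arena and resets the parts a decode will write.
func acquire() *Arena {
	a := arenaPool.Get().(*Arena)
	a.BidRequest = BidRequest{}
	a.site = Site{}
	a.Imp = a.impBuf[:0] // reuse the embedded backing array
	return a
}

func release(a *Arena) { arenaPool.Put(a) }

// fill stands in for the decoder: it points Site at arena memory
// and appends n impressions.
func fill(a *Arena, n int) {
	a.Site = &a.site
	for i := 0; i < n; i++ {
		a.Imp = append(a.Imp, Imp{ID: fmt.Sprint(i)})
	}
}

// inArena reports whether the Imp slice is still backed by impBuf.
func inArena(a *Arena) bool {
	return len(a.Imp) > 0 && &a.Imp[0] == &a.impBuf[0]
}

func main() {
	a := acquire()
	fill(a, 8)
	fmt.Println("8 imps in arena:", inArena(a))
	release(a)

	b := acquire()
	fill(b, 9) // the ninth append outgrows impBuf and moves to the heap
	fmt.Println("9 imps in arena:", inArena(b))
	release(b)
}
```

The fixed-size array trades a little memory per arena for a predictable fast path, and requests with more than 8 impressions still work because append simply reallocates.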
Results
Apple M2 Pro, Go 1.26.2, -bench=. -benchmem -count=5, input: 804-byte banner bid request with site, publisher, device, and user fully populated.
| Implementation | ns/op | B/op | allocs/op |
|---|---|---|---|
| encoding/json | 7,685 | 1,840 | 49 |
| json-iterator | 2,185 | 1,456 | 48 |
| goccy/go-json | 1,590 | 1,963 | 23 |
| goccy/go-json + pool | 1,514 | 1,804 | 22 |
| ParseFast | 1,360 | 968 | 19 |
| ParseFastArena | 1,245 | 392 | 14 |
Parallel throughput (10 goroutines):
| Implementation | ns/op |
|---|---|
| goccy/go-json + pool | 876 |
| ParseFastArena | 348 |
The arena takes 18% less time per op than GoJsonPool single-threaded (1,245 vs 1,514 ns) and is 2.5× faster in parallel (348 vs 876 ns). The parallel gap is wider because the arena dramatically reduces allocation pressure: fewer objects means fewer GC cycles, fewer write barriers, and less inter-goroutine coordination through the heap.
The 14 remaining allocations are all tied to the type definitions: one Publisher (pointer in Site), one Banner (pointer in Imp), seven *int / *float64 optional scalar fields, and the bcat string slice. Eliminating them would require changing the OpenRTB types.
What the exercise reveals
Looking at the go-json profile from Part 1 now has a concrete meaning. decodeKeyByBitmapUint* is the switch string(key) dispatch. stringDecoder.decodeByte is scanString. The madvise pressure falls when allocation count falls.
The profile-driven approach is the real takeaway. parseIntBytes and minified test data together cut more time than the NEON path did. The assembly was the last few percent. Profile first, then decide what complexity is worth it.