Speeding up Go's builtin JSON encoder up to 55% for large arrays of objects

Published on March 3, 2022 by Phil Eaton

gojson

I was looking into some of octosql's benchmarks the other day and noticed a large chunk of time in DataStation/dsq is spent in encoding JSON objects. JSON is an intermediate format in DataStation and it's pretty inefficient. But the reason it's used is because almost every scripting language supported by DataStation has a builtin library for reading/writing JSON.

All code for these benchmarks are available on Github.

The resulting JSON encoder library is also available on Github.

Useful datasets

I threw together a quick CLI for generating fake data, fakegen. And generated two datasets: one with 20 columns and 1M rows, and one with 1K columns and 10K rows.

$ mkdir -p ~/tmp/benchmarks && cd ~/tmp/benchmarks
$ go install github.com/multiprocessio/fakegen@latest
$ fakegen --rows 100000 --cols 20 > long.json
$ fakegen --rows 10000 --cols 1000 > wide.json
$ ls -lah *.json
-rw-r--r-- 1 phil phil 1.2G Mar  3 15:42 long.json
-rw-r--r-- 1 phil phil 1.6G Mar  3 15:44 wide.json
$ wc *.json
wc *.json
   1999999  114514109 1214486728 long.json
     19999  213613856 1666306735 wide.json

A benchmark program

Then I started looking into what Go's JSON encoder is actually spending time doing.

First I wrote a program that reads and decodes a JSON file, picks an encoder (just the standard library encoder for now), and encodes the JSON object back into another file. I used pkg/profile to simplify the process of hooking into pprof so that I could get a CPU profile of execution.

$ go mod init main
$ cat main.go
package main

import (
        "encoding/json"
        "os"

        "github.com/pkg/profile"
)

func stdlibEncoder(out *os.File, obj interface{}) error {
        encoder := json.NewEncoder(out)
        return encoder.Encode(obj)
}

func main() {
        var in string
        encoderArg := "stdlib"
        encoder := stdlibEncoder

        for i, arg := range os.Args {
                if arg == "--in" {
                        in = os.Args[i+1]
                        i += 1
                        continue
                }

                if arg == "--encoder" {
                        encoderArg = os.Args[i+1]
                        switch encoderArg {
                        case "stdlib":
                                encoder = stdlibEncoder
                        default:
                                panic("Unknown encoder: " + encoderArg)
                        }
                        i += 1
                        continue
                }
        }

        fr, err := os.Open(in + ".json")
        if err != nil {
                panic(err)
        }
        defer fr.Close()

        decoder := json.NewDecoder(fr)
        var o interface{}
        err = decoder.Decode(&o)
        if err != nil {
                panic(err)
        }

        fw, err := os.OpenFile(in+"-"+encoderArg+".json", os.O_TRUNC|os.O_WRONLY|os.O_CREATE, os.ModePerm)
        if err != nil {
                panic(err)
        }
        defer fw.Close()

        p := profile.Start()
        defer p.Stop()
        err = encoder(fw, o)
        if err != nil {
                panic(err)
        }
}

Compile and run it:

$ go mod tidy
$ go build -o main main.go
$ ./main --in long
2022/03/03 15:49:00 profile: cpu profiling enabled, /tmp/profile2956118756/cpu.pprof
2022/03/03 15:49:08 profile: cpu profiling disabled, /tmp/profile2956118756/cpu.pprof

Examining pprof results

Now we can run go tool pprof against this profile to see where we're spending the most time:

$ go tool pprof -top /tmp/profile2956118756/cpu.pprof
File: main
Type: cpu
Time: Mar 3, 2022 at 3:49pm (UTC)
Duration: 8.15s, Total samples = 9.66s (118.54%)
Showing nodes accounting for 8.75s, 90.58% of 9.66s total
Dropped 95 nodes (cum <= 0.05s)
      flat  flat%   sum%        cum   cum%
     1.06s 10.97% 10.97%      1.55s 16.05%  encoding/json.(*encodeState).string
     0.57s  5.90% 16.87%      1.85s 19.15%  runtime.scanobject
     0.55s  5.69% 22.57%      0.55s  5.69%  runtime.memmove
     0.51s  5.28% 27.85%      0.51s  5.28%  cmpbody
     0.42s  4.35% 32.19%      1.53s 15.84%  runtime.mallocgc
     0.33s  3.42% 35.61%      0.33s  3.42%  runtime.pageIndexOf (inline)
     0.25s  2.59% 38.20%      0.40s  4.14%  runtime.findObject
     0.24s  2.48% 40.68%      0.33s  3.42%  runtime.heapBitsSetType
     0.21s  2.17% 42.86%      0.71s  7.35%  runtime.greyobject
     0.21s  2.17% 45.03%      0.33s  3.42%  runtime.mapiternext
     0.21s  2.17% 47.20%      0.85s  8.80%  sort.insertionSort_func
     0.20s  2.07% 49.28%      0.20s  2.07%  strconv.ryuDigits32
     0.19s  1.97% 51.24%      7.59s 78.57%  encoding/json.mapEncoder.encode
     0.19s  1.97% 53.21%      0.74s  7.66%  sort.doPivot_func
     0.18s  1.86% 55.07%      0.70s  7.25%  encoding/json.mapEncoder.encode.func1
     0.17s  1.76% 56.83%      0.17s  1.76%  runtime.nextFreeFast (inline)
     0.16s  1.66% 58.49%      0.48s  4.97%  runtime.typedmemmove
     0.14s  1.45% 59.94%      0.14s  1.45%  runtime.memclrNoHeapPointers
     0.13s  1.35% 61.28%      7.66s 79.30%  encoding/json.(*encodeState).reflectValue
     0.13s  1.35% 62.63%      0.35s  3.62%  strconv.ryuDigits
     0.12s  1.24% 63.87%      0.12s  1.24%  bytes.(*Buffer).tryGrowByReslice (partial-inline)
     0.11s  1.14% 65.01%      0.43s  4.45%  bytes.(*Buffer).WriteString
     0.10s  1.04% 66.05%      0.50s  5.18%  internal/reflectlite.Swapper.func9
     0.10s  1.04% 67.08%      0.39s  4.04%  internal/reflectlite.typedmemmove
     0.10s  1.04% 68.12%      0.10s  1.04%  runtime.procyield
     0.10s  1.04% 69.15%      0.56s  5.80%  strconv.genericFtoa
     0.09s  0.93% 70.08%      0.49s  5.07%  reflect.(*MapIter).Next
     0.09s  0.93% 71.01%      0.09s  0.93%  runtime.heapBits.bits (inline)
     0.09s  0.93% 71.95%      0.37s  3.83%  runtime.mapaccess2
     0.08s  0.83% 72.77%      0.13s  1.35%  bytes.(*Buffer).WriteByte
     0.08s  0.83% 73.60%      1.12s 11.59%  reflect.copyVal
     0.08s  0.83% 74.43%      0.09s  0.93%  runtime.heapBitsForAddr (inline)
     0.07s  0.72% 75.16%      0.57s  5.90%  encoding/json.valueEncoder
     0.07s  0.72% 75.88%      0.56s  5.80%  reflect.(*MapIter).Key
     0.07s  0.72% 76.60%      0.07s  0.72%  runtime.(*mspan).divideByElemSize (inline)
     0.07s  0.72% 77.33%      0.07s  0.72%  runtime.add (partial-inline)
     0.07s  0.72% 78.05%      0.07s  0.72%  runtime.heapBits.next (inline)
     0.07s  0.72% 78.78%      0.07s  0.72%  runtime.memhash64
     0.06s  0.62% 79.40%      0.69s  7.14%  encoding/json.floatEncoder.encode
     0.06s  0.62% 80.02%      0.80s  8.28%  reflect.unsafe_New
     0.06s  0.62% 80.64%      0.10s  1.04%  runtime.(*bmap).overflow (inline)
     0.06s  0.62% 81.26%      0.09s  0.93%  runtime.(*gcBits).bitp (inline)
     0.06s  0.62% 81.88%      0.18s  1.86%  runtime.nilinterhash
     0.06s  0.62% 82.51%      0.07s  0.72%  runtime.spanOf (inline)
     0.06s  0.62% 83.13%      0.18s  1.86%  sort.medianOfThree_func
     0.05s  0.52% 83.64%      1.16s 12.01%  encoding/json.stringEncoder
     0.05s  0.52% 84.16%      0.53s  5.49%  reflect.typedmemmove (partial-inline)
     0.05s  0.52% 84.68%      0.05s  0.52%  runtime.(*gcWork).putFast (inline)
     0.05s  0.52% 85.20%      0.05s  0.52%  runtime.acquirem (inline)
     0.05s  0.52% 85.71%      1.86s 19.25%  runtime.gcDrain
     0.05s  0.52% 86.23%      0.08s  0.83%  runtime.nilinterequal
     0.05s  0.52% 86.75%      1.71s 17.70%  sort.quickSort_func
     0.04s  0.41% 87.16%      0.11s  1.14%  runtime.typehash
     0.04s  0.41% 87.58%      0.39s  4.04%  strconv.ryuFtoaShortest
     0.04s  0.41% 87.99%      0.46s  4.76%  sync.(*Map).Load
     0.03s  0.31% 88.30%      7.66s 79.30%  encoding/json.interfaceEncoder
     0.03s  0.31% 88.61%      0.70s  7.25%  reflect.(*MapIter).Value
     0.03s  0.31% 88.92%      0.09s  0.93%  reflect.Value.Elem
     0.03s  0.31% 89.23%      0.06s  0.62%  strconv.formatDigits
     0.02s  0.21% 89.44%      0.05s  0.52%  bytes.(*Buffer).Write
     0.02s  0.21% 89.65%      0.48s  4.97%  encoding/json.typeEncoder
     0.02s  0.21% 89.86%      0.23s  2.38%  runtime.(*mheap).alloc
     0.02s  0.21% 90.06%      0.11s  1.14%  runtime.(*mheap).allocSpan
     0.01s   0.1% 90.17%      0.08s  0.83%  reflect.mapiterinit
     0.01s   0.1% 90.27%      0.31s  3.21%  reflect.mapiternext
     0.01s   0.1% 90.37%      0.29s  3.00%  runtime.(*mcache).nextFree
     0.01s   0.1% 90.48%      0.15s  1.55%  runtime.newobject
     0.01s   0.1% 90.58%      0.08s  0.83%  runtime.sweepone

Roughly 8.2 seconds. Now let's also run against the wide JSON dataset and profile that result.

$ ./main --in wide
2022/03/03 15:50:30 profile: cpu profiling enabled, /tmp/profile800187419/cpu.pprof
2022/03/03 15:50:36 profile: cpu profiling disabled, /tmp/profile800187419/cpu.pprof
$ go tool pprof -top /tmp/profile800187419/cpu.pprof
File: main
Type: cpu
Time: Mar 3, 2022 at 3:50pm (UTC)
Duration: 6.36s, Total samples = 7.11s (111.88%)
Showing nodes accounting for 6.67s, 93.81% of 7.11s total
Dropped 61 nodes (cum <= 0.04s)
      flat  flat%   sum%        cum   cum%
     1.28s 18.00% 18.00%      1.75s 24.61%  encoding/json.(*encodeState).string
     0.77s 10.83% 28.83%      0.77s 10.83%  cmpbody
     0.41s  5.77% 34.60%      0.41s  5.77%  runtime.memmove
     0.34s  4.78% 39.38%      1.15s 16.17%  encoding/json.mapEncoder.encode.func1
     0.32s  4.50% 43.88%      1.05s 14.77%  runtime.scanobject
     0.29s  4.08% 47.96%      1.53s 21.52%  sort.doPivot_func
     0.26s  3.66% 51.62%      0.68s  9.56%  runtime.mallocgc
     0.26s  3.66% 55.27%      0.26s  3.66%  runtime.procyield
     0.17s  2.39% 57.67%      0.38s  5.34%  runtime.greyobject
     0.17s  2.39% 60.06%      0.17s  2.39%  runtime.pageIndexOf (inline)
     0.15s  2.11% 62.17%      0.22s  3.09%  runtime.findObject
     0.15s  2.11% 64.28%      0.28s  3.94%  runtime.typedmemmove
     0.13s  1.83% 66.10%      5.80s 81.58%  encoding/json.mapEncoder.encode
     0.12s  1.69% 67.79%      0.48s  6.75%  sort.insertionSort_func
     0.10s  1.41% 69.20%      0.29s  4.08%  internal/reflectlite.typedmemmove
     0.10s  1.41% 70.60%      0.10s  1.41%  runtime.memclrNoHeapPointers
     0.09s  1.27% 71.87%      0.45s  6.33%  internal/reflectlite.Swapper.func9
     0.09s  1.27% 73.14%      0.36s  5.06%  reflect.(*MapIter).Key
     0.09s  1.27% 74.40%      0.09s  1.27%  runtime.heapBits.bits (inline)
     0.09s  1.27% 75.67%      0.12s  1.69%  runtime.heapBitsSetType
     0.08s  1.13% 76.79%      0.12s  1.69%  bytes.(*Buffer).WriteByte
     0.08s  1.13% 77.92%      0.08s  1.13%  runtime.nextFreeFast (inline)
     0.07s  0.98% 78.90%      0.17s  2.39%  runtime.mapaccess2
     0.06s  0.84% 79.75%      0.11s  1.55%  runtime.mapiternext
     0.06s  0.84% 80.59%      0.06s  0.84%  strconv.ryuDigits32
     0.06s  0.84% 81.43%      0.25s  3.52%  sync.(*Map).Load
     0.05s   0.7% 82.14%      0.41s  5.77%  bytes.(*Buffer).WriteString
     0.05s   0.7% 82.84%      0.07s  0.98%  internal/reflectlite.arrayAt (inline)
     0.05s   0.7% 83.54%      0.18s  2.53%  reflect.(*MapIter).Next
     0.05s   0.7% 84.25%      0.05s   0.7%  runtime.add (partial-inline)
     0.05s   0.7% 84.95%      0.05s   0.7%  runtime.memhash64
     0.05s   0.7% 85.65%      0.05s   0.7%  runtime.spanOf (inline)
     0.05s   0.7% 86.36%      0.13s  1.83%  strconv.ryuDigits
     0.04s  0.56% 86.92%      0.04s  0.56%  bytes.(*Buffer).tryGrowByReslice (inline)
     0.04s  0.56% 87.48%      5.80s 81.58%  encoding/json.(*encodeState).reflectValue
     0.04s  0.56% 88.05%      0.04s  0.56%  reflect.Value.IsNil (inline)
     0.04s  0.56% 88.61%      0.04s  0.56%  runtime.cmpstring
     0.04s  0.56% 89.17%      0.12s  1.69%  sort.medianOfThree_func
     0.04s  0.56% 89.73%      2.09s 29.40%  sort.quickSort_func
     0.04s  0.56% 90.30%      0.21s  2.95%  strconv.genericFtoa
     0.03s  0.42% 90.72%      0.28s  3.94%  encoding/json.typeEncoder
     0.03s  0.42% 91.14%      0.36s  5.06%  reflect.(*MapIter).Value
     0.03s  0.42% 91.56%      1.26s 17.72%  runtime.gcDrain
     0.03s  0.42% 91.98%      0.04s  0.56%  runtime.heapBits.next (inline)
     0.02s  0.28% 92.26%      0.26s  3.66%  encoding/json.floatEncoder.encode
     0.02s  0.28% 92.55%      5.80s 81.58%  encoding/json.interfaceEncoder
     0.02s  0.28% 92.83%      0.06s  0.84%  reflect.Value.Elem
     0.02s  0.28% 93.11%      0.60s  8.44%  reflect.copyVal
     0.01s  0.14% 93.25%      0.12s  1.69%  reflect.mapiternext
     0.01s  0.14% 93.39%      0.29s  4.08%  reflect.typedmemmove
     0.01s  0.14% 93.53%      0.04s  0.56%  reflect.unpackEface (inline)
     0.01s  0.14% 93.67%      0.28s  3.94%  runtime.suspendG
     0.01s  0.14% 93.81%      0.14s  1.97%  strconv.ryuFtoaShortest

Roughly 6.4 seconds.

Sorting object keys

Now one thing we notice is that it spends a good chunk of time in sort functions. In fact, it is hardcoded in Go's JSON implementation to require sorting of object keys. The most common data in DataStation is arrays of objects representing rows of data. This JSON is an intermediate representation so there's no value in DataStation/dsq to having keys sorted.

Could we improve performance if we wrote a specialization of the builtin JSON library that skips sorting JSON object keys if the overall object to be written is an array of objects? We'll only care about the top-level object keys. If there are nested objects we won't bother about that. Within nested objects we'll just use the existing Go JSON encoder. Having a fallback and internally using the Go JSON encoder makes this a pretty safe and simple approach.

Let's try a basic implementation.

$ cp main.go nosort.go
$ diff -u main.go nosort.go
--- main.go     2022-03-03 14:25:01.530812750 +0000
+++ nosort.go   2022-03-03 18:58:35.227829357 +0000
@@ -2,11 +2,94 @@

 import (
        "encoding/json"
+       "log"
        "os"
+       "strconv"

        "github.com/pkg/profile"
 )

+func nosortEncoder(out *os.File, obj interface{}) error {
+       a, ok := obj.([]interface{})
+       // Fall back to normal encoder
+       if !ok {
+               log.Println("Falling back to stdlib")
+               return stdlibEncoder(out, obj)
+       }
+
+       _, err := out.Write([]byte("["))
+       if err != nil {
+               return err
+       }
+
+       for i, row := range a {
+               // Write a comma before the current object
+               if i > 0 {
+                       _, err = out.Write([]byte(",\n"))
+                       if err != nil {
+                               return err
+                       }
+               }
+
+               r, ok := row.(map[string]interface{})
+               if !ok {
+                       log.Println("Falling back to stdlib")
+                       bs, err := json.Marshal(row)
+                       if err != nil {
+                               return err
+                       }
+
+                       _, err = out.Write(bs)
+                       if err != nil {
+                               return err
+                       }
+
+                       continue
+               }
+
+               _, err := out.Write([]byte("{"))
+               if err != nil {
+                       return err
+               }
+
+               j := -1
+               for col, val := range r {
+                       j += 1
+
+                       // Write a comma before the current key-value
+                       if j > 0 {
+                               _, err = out.Write([]byte(","))
+                               if err != nil {
+                                       return err
+                               }
+                       }
+
+                       _, err = out.Write([]byte(strconv.QuoteToASCII(col) + ":"))
+                       if err != nil {
+                               return err
+                       }
+
+                       bs, err := json.Marshal(val)
+                       if err != nil {
+                               return err
+                       }
+
+                       _, err = out.Write(bs)
+                       if err != nil {
+                               return err
+                       }
+               }
+
+               _, err = out.Write([]byte("}"))
+               if err != nil {
+                       return err
+               }
+       }
+
+       _, err = out.Write([]byte("]"))
+       return err
+}
+
 func stdlibEncoder(out *os.File, obj interface{}) error {
        encoder := json.NewEncoder(out)
        return encoder.Encode(obj)
@@ -29,6 +112,8 @@
                        switch encoderArg {
                        case "stdlib":
                                encoder = stdlibEncoder
+                       case "nosort":
+                               encoder = nosortEncoder
                        default:
                                panic("Unknown encoder: " + encoderArg)
                        }

Very simple code that just does some type checking and mostly spends time writing JSON wrapper syntax, with calls to Go's builtin JSON library inside of objects. The only funky thing in there you may notice is the strconv.QuoteToAscii call. This merely quotes and escapes nested quotes. This is necessary since escaped nested quotes are valid within a JSON object key.

Let's build and run, passing the new encoder name to main.

$ go build -o main nosort.go
$ ./main --in wide --encoder nosort
2022/03/03 15:53:51 profile: cpu profiling enabled, /tmp/profile1940788787/cpu.pprof
2022/03/03 15:54:40 profile: cpu profiling disabled, /tmp/profile1940788787/cpu.pprof

Roughly 49 seconds. Woah. That's way slower than the builtin JSON library. But let's dig in with pprof to understand why. Since we should be doing exactly what the Go library does but not sorting, it shouldn't be possible that we're slower.

$ go tool pprof -top /tmp/profile1940788787/cpu.pprof
File: main
Type: cpu
Time: Mar 3, 2022 at 3:53pm (UTC)
Duration: 48.86s, Total samples = 47.64s (97.51%)
Showing nodes accounting for 45.44s, 95.38% of 47.64s total
Dropped 87 nodes (cum <= 0.24s)
      flat  flat%   sum%        cum   cum%
    39.58s 83.08% 83.08%     41.57s 87.26%  syscall.Syscall
     1.60s  3.36% 86.44%      1.72s  3.61%  encoding/json.(*encodeState).string
     0.62s  1.30% 87.74%      0.71s  1.49%  runtime.casgstatus
     0.51s  1.07% 88.81%      0.97s  2.04%  strconv.appendQuotedWith
     0.40s  0.84% 89.65%      0.93s  1.95%  runtime.reentersyscall
     0.34s  0.71% 90.37%      0.46s  0.97%  strconv.appendEscapedRune
     0.29s  0.61% 90.97%      0.48s  1.01%  runtime.exitsyscallfast
     0.27s  0.57% 91.54%      1.04s  2.18%  runtime.exitsyscall
     0.26s  0.55% 92.09%      0.26s  0.55%  runtime.memmove
     0.23s  0.48% 92.57%      0.42s  0.88%  runtime.mallocgc
     0.22s  0.46% 93.03%      0.31s  0.65%  runtime.mapaccess2
     0.21s  0.44% 93.47%     42.35s 88.90%  internal/poll.(*FD).Write
     0.14s  0.29% 93.77%      0.28s  0.59%  runtime.concatstrings
     0.13s  0.27% 94.04%     47.63s   100%  main.nosortEncoder
     0.08s  0.17% 94.21%      0.49s  1.03%  sync.(*Map).Load
     0.07s  0.15% 94.35%     17.57s 36.88%  encoding/json.(*Encoder).Encode
     0.07s  0.15% 94.50%     42.42s 89.04%  os.(*File).write (inline)
     0.06s  0.13% 94.63%     42.49s 89.19%  os.(*File).Write
     0.05s   0.1% 94.73%      0.26s  0.55%  internal/poll.(*FD).writeUnlock
     0.05s   0.1% 94.84%      0.24s   0.5%  strconv.genericFtoa
     0.05s   0.1% 94.94%     41.62s 87.36%  syscall.write
     0.04s 0.084% 95.03%      2.65s  5.56%  encoding/json.(*encodeState).reflectValue
     0.03s 0.063% 95.09%      2.72s  5.71%  encoding/json.(*encodeState).marshal
     0.03s 0.063% 95.15%      0.31s  0.65%  runtime.concatstring2
     0.02s 0.042% 95.19%      0.30s  0.63%  encoding/json.floatEncoder.encode
     0.02s 0.042% 95.24%     41.64s 87.41%  internal/poll.ignoringEINTRIO (inline)
     0.02s 0.042% 95.28%      0.95s  1.99%  runtime.entersyscall
     0.02s 0.042% 95.32%      0.44s  0.92%  runtime.makeslice
     0.01s 0.021% 95.34%      1.81s  3.80%  encoding/json.stringEncoder
     0.01s 0.021% 95.36%      0.50s  1.05%  encoding/json.typeEncoder
     0.01s 0.021% 95.38%      1.48s  3.11%  strconv.quoteWith (inline)
         0     0% 95.38%      0.50s  1.05%  encoding/json.valueEncoder
         0     0% 95.38%     47.63s   100%  main.main
         0     0% 95.38%     47.63s   100%  runtime.main
         0     0% 95.38%      0.24s   0.5%  strconv.AppendFloat (inline)
         0     0% 95.38%      1.48s  3.11%  strconv.QuoteToASCII (inline)
         0     0% 95.38%     41.62s 87.36%  syscall.Write (inline)

Buffered I/O

Ok so in this case we spend a huge amount of time in the write syscall. The traditional way to get around this is to used buffered IO so you're not actually calling the write syscall all the time. Let's give that a shot.

$ cp nosort.go bufio.go
$ diff -u nosort.go bufio.go
--- nosort.go   2022-03-03 18:58:35.227829357 +0000
+++ bufio.go    2022-03-03 19:02:03.913590177 +0000
@@ -1,6 +1,7 @@
 package main

 import (
+       "bufio"
        "encoding/json"
        "log"
        "os"
@@ -17,7 +18,9 @@
                return stdlibEncoder(out, obj)
        }

-       _, err := out.Write([]byte("["))
+       bo := bufio.NewWriter(out)
+       defer bo.Flush()
+       _, err := bo.Write([]byte("["))
        if err != nil {
                return err
        }
@@ -25,7 +28,7 @@
        for i, row := range a {
                // Write a comma before the current object
                if i > 0 {
-                       _, err = out.Write([]byte(",\n"))
+                       _, err = bo.Write([]byte(",\n"))
                        if err != nil {
                                return err
                        }
@@ -39,15 +42,14 @@
                                return err
                        }

-                       _, err = out.Write(bs)
+                       _, err = bo.Write(bs)
                        if err != nil {
                                return err
                        }
-
                        continue
                }

-               _, err := out.Write([]byte("{"))
+               _, err := bo.Write([]byte("{"))
                if err != nil {
                        return err
                }
@@ -58,13 +60,13 @@

                        // Write a comma before the current key-value
                        if j > 0 {
-                               _, err = out.Write([]byte(","))
+                               _, err = bo.Write([]byte(","))
                                if err != nil {
                                        return err
                                }
                        }

-                       _, err = out.Write([]byte(strconv.QuoteToASCII(col) + ":"))
+                       _, err = bo.Write([]byte(strconv.QuoteToASCII(col) + ":"))
                        if err != nil {
                                return err
                        }
@@ -74,19 +76,19 @@
                                return err
                        }

-                       _, err = out.Write(bs)
+                       _, err = bo.Write(bs)
                        if err != nil {
                                return err
                        }
                }

-               _, err = out.Write([]byte("}"))
+               _, err = bo.Write([]byte("}"))
                if err != nil {
                        return err
                }
        }

-       _, err = out.Write([]byte("]"))
+       _, err = bo.Write([]byte("]"))
        return err
 }

Build it and run it:

$ go build -o main bufio.go
$ ./main --in wide --encoder nosort
2022/03/03 19:11:12 profile: cpu profiling enabled, /tmp/profile1195717494/cpu.pprof
2022/03/03 19:11:19 profile: cpu profiling disabled, /tmp/profile1195717494/cpu.pprof

Roughly 7 seconds. Not bad down from 49 seconds! But let's see where we're spending time now.

$ go tool pprof -top /tmp/profile1195717494/cpu.pprof
File: main
Type: cpu
Time: Mar 3, 2022 at 7:11pm (UTC)
Duration: 6.41s, Total samples = 6.26s (97.60%)
Showing nodes accounting for 5.79s, 92.49% of 6.26s total
Dropped 47 nodes (cum <= 0.03s)
      flat  flat%   sum%        cum   cum%
     1.69s 27.00% 27.00%      1.88s 30.03%  encoding/json.(*encodeState).string
     0.64s 10.22% 37.22%      0.71s 11.34%  syscall.Syscall
     0.44s  7.03% 44.25%      0.76s 12.14%  strconv.appendQuotedWith
     0.43s  6.87% 51.12%      0.43s  6.87%  runtime.memmove
     0.29s  4.63% 55.75%      0.69s 11.02%  runtime.mallocgc
     0.29s  4.63% 60.38%      0.29s  4.63%  runtime.memclrNoHeapPointers
     0.24s  3.83% 64.22%      0.31s  4.95%  strconv.appendEscapedRune
     0.16s  2.56% 66.77%      0.16s  2.56%  runtime.nextFreeFast (inline)
     0.14s  2.24% 69.01%      0.97s 15.50%  bufio.(*Writer).Write
     0.11s  1.76% 70.77%      0.15s  2.40%  runtime.mapaccess2
     0.10s  1.60% 72.36%      0.19s  3.04%  runtime.mapiternext
     0.09s  1.44% 73.80%      0.16s  2.56%  runtime.concatstrings
     0.08s  1.28% 75.08%      0.08s  1.28%  reflect.Value.String
     0.07s  1.12% 76.20%      0.07s  1.12%  strconv.IsPrint
     0.06s  0.96% 77.16%      3.38s 53.99%  encoding/json.Marshal
     0.06s  0.96% 78.12%      6.26s   100%  main.nosortEncoder
     0.06s  0.96% 79.07%      0.31s  4.95%  runtime.rawbyteslice
     0.06s  0.96% 80.03%      0.10s  1.60%  sync.(*Pool).Put
     0.06s  0.96% 80.99%      0.09s  1.44%  sync.(*Pool).pin
     0.05s   0.8% 81.79%      2.49s 39.78%  encoding/json.(*encodeState).marshal
     0.05s   0.8% 82.59%      0.06s  0.96%  runtime.(*bmap).overflow (inline)
     0.05s   0.8% 83.39%      0.55s  8.79%  runtime.growslice
     0.05s   0.8% 84.19%      0.06s  0.96%  runtime.reentersyscall
     0.05s   0.8% 84.98%      0.05s   0.8%  strconv.ryuDigits32
     0.04s  0.64% 85.62%      0.04s  0.64%  bytes.(*Buffer).tryGrowByReslice (inline)
     0.04s  0.64% 86.26%      0.04s  0.64%  runtime.acquirem (inline)
     0.04s  0.64% 86.90%      0.20s  3.19%  runtime.concatstring2
     0.04s  0.64% 87.54%      0.21s  3.35%  sync.(*Map).Load
     0.03s  0.48% 88.02%      0.12s  1.92%  encoding/json.newEncodeState
     0.03s  0.48% 88.50%      0.22s  3.51%  runtime.makeslice
     0.03s  0.48% 88.98%      0.38s  6.07%  runtime.stringtoslicebyte
     0.03s  0.48% 89.46%      0.09s  1.44%  strconv.ryuDigits
     0.02s  0.32% 89.78%      1.98s 31.63%  encoding/json.stringEncoder
     0.02s  0.32% 90.10%      0.25s  3.99%  encoding/json.valueEncoder
     0.02s  0.32% 90.42%      0.05s   0.8%  runtime.slicebytetostring
     0.02s  0.32% 90.73%      0.15s  2.40%  strconv.genericFtoa
     0.02s  0.32% 91.05%      1.05s 16.77%  strconv.quoteWith (inline)
     0.02s  0.32% 91.37%      0.08s  1.28%  sync.(*Pool).Get
     0.01s  0.16% 91.53%      0.74s 11.82%  bufio.(*Writer).Flush
     0.01s  0.16% 91.69%      0.04s  0.64%  bytes.(*Buffer).WriteByte
     0.01s  0.16% 91.85%      0.15s  2.40%  bytes.(*Buffer).WriteString
     0.01s  0.16% 92.01%      0.73s 11.66%  os.(*File).Write
     0.01s  0.16% 92.17%      0.13s  2.08%  runtime.(*mcache).refill
     0.01s  0.16% 92.33%      0.07s  1.12%  runtime.(*mheap).allocSpan
     0.01s  0.16% 92.49%      0.10s  1.60%  strconv.ryuFtoaShortest
         0     0% 92.49%      2.41s 38.50%  encoding/json.(*encodeState).reflectValue
         0     0% 92.49%      0.18s  2.88%  encoding/json.floatEncoder.encode
         0     0% 92.49%      0.21s  3.35%  encoding/json.typeEncoder
         0     0% 92.49%      0.72s 11.50%  internal/poll.(*FD).Write
         0     0% 92.49%      0.71s 11.34%  internal/poll.ignoringEINTRIO (inline)
         0     0% 92.49%      6.26s   100%  main.main
         0     0% 92.49%      0.72s 11.50%  os.(*File).write (inline)
         0     0% 92.49%      0.14s  2.24%  runtime.(*mcache).nextFree
         0     0% 92.49%      0.11s  1.76%  runtime.(*mcentral).cacheSpan
         0     0% 92.49%      0.10s  1.60%  runtime.(*mcentral).grow
         0     0% 92.49%      0.10s  1.60%  runtime.(*mheap).alloc
         0     0% 92.49%      0.07s  1.12%  runtime.(*mheap).alloc.func1
         0     0% 92.49%      0.06s  0.96%  runtime.entersyscall
         0     0% 92.49%      6.26s   100%  runtime.main
         0     0% 92.49%      0.07s  1.12%  runtime.systemstack
         0     0% 92.49%      0.15s  2.40%  strconv.AppendFloat (inline)
         0     0% 92.49%      1.05s 16.77%  strconv.QuoteToASCII (inline)
         0     0% 92.49%      0.71s 11.34%  syscall.Write (inline)
         0     0% 92.49%      0.71s 11.34%  syscall.write

Infinite buffer

Surprisingly, we're still spending a big chunk of time in a write syscall. If we look at the source code for bufio, we can see that the default size if 4096. So in a big file like this we'll still be calling the write syscall a lot.

Now bufio.Writer has an odd interface. We can only specify an exact size for the internal buffer. We could raise this to be the max size of integers but then that would be a huge chunk of memory we always allocate. That doesn't really make sense.

We could instead use a bytes.Buffer that is allowed to grow. Then we only write to the file once it's full. A lazier approach would be to ignore it getting full (which will just cause an error to be returned eventually and the whole encoder to fail) and just copy from the buffer to the file once at the end. Let's give that a shot.

$ cp bufio.go buffer.go
$ diff -u bufio.go buffer.go
--- bufio.go    2022-03-03 19:02:03.913590177 +0000
+++ buffer.go   2022-03-03 19:03:38.007564957 +0000
@@ -1,7 +1,7 @@
 package main
 
 import (
-       "bufio"
+       "bytes"
        "encoding/json"
        "log"
        "os"
@@ -18,8 +18,7 @@
                return stdlibEncoder(out, obj)
        }
 
-       bo := bufio.NewWriter(out)
-       defer bo.Flush()
+       bo := bytes.NewBuffer(nil)
        _, err := bo.Write([]byte("["))
        if err != nil {
                return err
@@ -89,6 +88,14 @@
        }
 
        _, err = bo.Write([]byte("]"))
+
+       for bo.Len() > 0 {
+               _, err := bo.WriteTo(out)
+               if err != nil {
+                       return err
+               }
+       }
+
        return err
 }

And run it.

$ go build -o main buffer.go
$ ./main --in wide --encoder nosort
2022/03/03 19:12:58 profile: cpu profiling enabled, /tmp/profile3980759756/cpu.pprof
2022/03/03 19:13:04 profile: cpu profiling disabled, /tmp/profile3980759756/cpu.pprof
$ go tool pprof -top /tmp/profile3980759756/cpu.pprof
File: main
Type: cpu
Time: Mar 3, 2022 at 7:12pm (UTC)
Duration: 5.77s, Total samples = 6.57s (113.88%)
Showing nodes accounting for 5.93s, 90.26% of 6.57s total
Dropped 78 nodes (cum <= 0.03s)
      flat  flat%   sum%        cum   cum%
     1.34s 20.40% 20.40%      1.52s 23.14%  encoding/json.(*encodeState).string
     0.70s 10.65% 31.05%      0.70s 10.65%  runtime.memmove
     0.50s  7.61% 38.66%      0.74s 11.26%  strconv.appendQuotedWith
     0.32s  4.87% 43.53%      1.11s 16.89%  runtime.scanobject
     0.29s  4.41% 47.95%      0.29s  4.41%  runtime.memclrNoHeapPointers
     0.25s  3.81% 51.75%      0.65s  9.89%  runtime.mallocgc
     0.20s  3.04% 54.79%      0.71s 10.81%  bytes.(*Buffer).Write
     0.19s  2.89% 57.69%      0.23s  3.50%  strconv.appendEscapedRune
     0.17s  2.59% 60.27%      0.38s  5.78%  runtime.greyobject
     0.16s  2.44% 62.71%      0.31s  4.72%  runtime.findObject
     0.16s  2.44% 65.14%      0.16s  2.44%  runtime.pageIndexOf (inline)
     0.14s  2.13% 67.28%      0.21s  3.20%  runtime.mapaccess2
     0.10s  1.52% 68.80%      5.27s 80.21%  main.nosortEncoder
     0.10s  1.52% 70.32%      0.16s  2.44%  runtime.concatstrings
     0.10s  1.52% 71.84%      0.10s  1.52%  runtime.nextFreeFast (inline)
     0.09s  1.37% 73.21%      0.09s  1.37%  runtime.procyield
     0.08s  1.22% 74.43%      0.09s  1.37%  runtime.spanOf (inline)
     0.08s  1.22% 75.65%      0.15s  2.28%  strconv.genericFtoa
     0.08s  1.22% 76.86%      0.11s  1.67%  sync.(*Pool).Put
     0.07s  1.07% 77.93%      0.07s  1.07%  reflect.Value.String
     0.07s  1.07% 79.00%      0.56s  8.52%  runtime.growslice
     0.07s  1.07% 80.06%      0.08s  1.22%  runtime.mapiternext
     0.06s  0.91% 80.97%      0.06s  0.91%  runtime.(*mspan).divideByElemSize (inline)
     0.06s  0.91% 81.89%      0.10s  1.52%  runtime.slicebytetostring
     0.05s  0.76% 82.65%      0.07s  1.07%  sync.(*Pool).Get
     0.04s  0.61% 83.26%      3.07s 46.73%  encoding/json.Marshal
     0.04s  0.61% 83.87%      0.05s  0.76%  runtime.heapBits.next (inline)
     0.04s  0.61% 84.47%      0.04s  0.61%  strconv.IsPrint
     0.04s  0.61% 85.08%      0.28s  4.26%  sync.(*Map).Load
     0.03s  0.46% 85.54%      0.15s  2.28%  bytes.(*Buffer).WriteString
     0.03s  0.46% 86.00%      2.17s 33.03%  encoding/json.(*encodeState).reflectValue
     0.03s  0.46% 86.45%      0.31s  4.72%  encoding/json.typeEncoder
     0.03s  0.46% 86.91%      0.05s  0.76%  runtime.stringtoslicebyte
     0.03s  0.46% 87.37%      1.12s 17.05%  strconv.quoteWith (inline)
     0.02s   0.3% 87.67%      0.10s  1.52%  encoding/json.newEncodeState
     0.02s   0.3% 87.98%      1.20s 18.26%  runtime.gcDrain
     0.02s   0.3% 88.28%      0.06s  0.91%  runtime.sweepone
     0.02s   0.3% 88.58%      0.05s  0.76%  strconv.ryuDigits
     0.02s   0.3% 88.89%      0.04s  0.61%  sync.(*Pool).pin
     0.01s  0.15% 89.04%      2.24s 34.09%  encoding/json.(*encodeState).marshal
     0.01s  0.15% 89.19%      1.60s 24.35%  encoding/json.stringEncoder
     0.01s  0.15% 89.35%      0.16s  2.44%  runtime.(*mcache).refill
     0.01s  0.15% 89.50%      0.12s  1.83%  runtime.(*mcentral).grow
     0.01s  0.15% 89.65%      0.04s  0.61%  runtime.(*sweepLocked).sweep
     0.01s  0.15% 89.80%      0.17s  2.59%  runtime.concatstring2
     0.01s  0.15% 89.95%      0.31s  4.72%  runtime.makeslice
     0.01s  0.15% 90.11%      0.05s  0.76%  runtime.nilinterhash
     0.01s  0.15% 90.26%      0.04s  0.61%  runtime.typehash
         0     0% 90.26%      0.37s  5.63%  bytes.(*Buffer).grow
         0     0% 90.26%      0.06s  0.91%  bytes.makeSlice
         0     0% 90.26%      0.21s  3.20%  encoding/json.floatEncoder.encode
         0     0% 90.26%      0.33s  5.02%  encoding/json.valueEncoder
         0     0% 90.26%      5.27s 80.21%  main.main
         0     0% 90.26%      0.16s  2.44%  runtime.(*mcache).nextFree
         0     0% 90.26%      0.15s  2.28%  runtime.(*mcentral).cacheSpan
         0     0% 90.26%      0.10s  1.52%  runtime.(*mheap).alloc
         0     0% 90.26%      0.06s  0.91%  runtime.(*mspan).objIndex
         0     0% 90.26%      0.06s  0.91%  runtime.bgsweep
         0     0% 90.26%      1.20s 18.26%  runtime.gcBgMarkWorker
         0     0% 90.26%      1.20s 18.26%  runtime.gcBgMarkWorker.func2
         0     0% 90.26%      0.04s  0.61%  runtime.heapBits.initSpan
         0     0% 90.26%      5.27s 80.21%  runtime.main
         0     0% 90.26%      0.09s  1.37%  runtime.markroot
         0     0% 90.26%      0.09s  1.37%  runtime.markroot.func1
         0     0% 90.26%      0.09s  1.37%  runtime.suspendG
         0     0% 90.26%      1.28s 19.48%  runtime.systemstack
         0     0% 90.26%      0.15s  2.28%  strconv.AppendFloat (inline)
         0     0% 90.26%      1.12s 17.05%  strconv.QuoteToASCII (inline)
         0     0% 90.26%      0.05s  0.76%  strconv.ryuFtoaShortest

Down another half second-ish. Nice! And hey! Syscall is no longer in there.

Column caching

Now we're getting to the end of useful changes we can make but I notice strconv.appendQuotedWith and strconv.appendEscapedRune. We may be able to shave off a little bit by caching the columns rather than escaping all columns again each time for every row. Let's try it.

$ cp buffer.go cache-columns.go
$ diff -u buffer.go cache-columns.go
--- buffer.go   2022-03-03 19:02:13.628485570 +0000
+++ cache-columns.go    2022-03-03 19:01:54.121695243 +0000
@@ -24,6 +24,8 @@
                return err
        }

+       quotedColumns := map[string][]byte{}
+
        for i, row := range a {
                // Write a comma before the current object
                if i > 0 {
@@ -65,7 +67,12 @@
                                }
                        }

-                       _, err = bo.Write([]byte(strconv.QuoteToASCII(col) + ":"))
+                       quoted := quotedColumns[col]
+                       if quoted == nil {
+                               quoted = []byte(strconv.QuoteToASCII(col) + ":")
+                               quotedColumns[col] = quoted
+                       }
+                       _, err = bo.Write(quoted)
                        if err != nil {
                                return err
                        }
@@ -75,7 +82,7 @@
                                return err
                        }

-                       _, err = out.Write(bs)
+                       _, err = bo.Write(bs)
                        if err != nil {
                                return err
                        }

And run it.

$ go build -o main.go cache-columns.go
$ ./main --in wide --encoder nosort
2022/03/03 19:14:08 profile: cpu profiling enabled, /tmp/profile2651087124/cpu.pprof
2022/03/03 19:14:13 profile: cpu profiling disabled, /tmp/profile2651087124/cpu.pprof
$ go tool pprof -top  /tmp/profile2651087124/cpu.pprof
File: main
Type: cpu
Time: Mar 3, 2022 at 7:14pm (UTC)
Duration: 4.98s, Total samples = 5.79s (116.28%)
Showing nodes accounting for 5.28s, 91.19% of 5.79s total
Dropped 69 nodes (cum <= 0.03s)
      flat  flat%   sum%        cum   cum%
     1.35s 23.32% 23.32%      1.47s 25.39%  encoding/json.(*encodeState).string
     0.49s  8.46% 31.78%      0.49s  8.46%  runtime.memmove
     0.35s  6.04% 37.82%      1.09s 18.83%  runtime.scanobject
     0.23s  3.97% 41.80%      0.23s  3.97%  runtime.memclrNoHeapPointers
     0.22s  3.80% 45.60%      0.22s  3.80%  runtime.procyield
     0.18s  3.11% 48.70%      0.18s  3.11%  runtime.pageIndexOf (inline)
     0.17s  2.94% 51.64%      0.42s  7.25%  runtime.mapaccess1_faststr
     0.14s  2.42% 54.06%      0.14s  2.42%  aeshashbody
     0.14s  2.42% 56.48%      0.27s  4.66%  runtime.findObject
     0.14s  2.42% 58.89%      0.24s  4.15%  runtime.mapiternext
     0.13s  2.25% 61.14%      0.57s  9.84%  bytes.(*Buffer).Write
     0.11s  1.90% 63.04%      0.11s  1.90%  reflect.Value.String
     0.10s  1.73% 64.77%      4.34s 74.96%  main.nosortEncoder
     0.09s  1.55% 66.32%      0.09s  1.55%  memeqbody
     0.09s  1.55% 67.88%      0.34s  5.87%  runtime.greyobject
     0.09s  1.55% 69.43%      0.09s  1.55%  strconv.ryuDigits32
     0.09s  1.55% 70.98%      0.25s  4.32%  sync.(*Map).Load
     0.08s  1.38% 72.37%      0.15s  2.59%  runtime.mapaccess2
     0.07s  1.21% 73.58%      0.07s  1.21%  reflect.Value.Type
     0.07s  1.21% 74.78%      0.07s  1.21%  runtime.heapBits.bits
     0.06s  1.04% 75.82%      2.25s 38.86%  encoding/json.(*encodeState).marshal
     0.06s  1.04% 76.86%      2.15s 37.13%  encoding/json.(*encodeState).reflectValue
     0.06s  1.04% 77.89%      0.08s  1.38%  runtime.(*bmap).overflow (inline)
     0.06s  1.04% 78.93%      0.33s  5.70%  runtime.mallocgc
     0.05s  0.86% 79.79%      0.13s  2.25%  encoding/json.newEncodeState
     0.05s  0.86% 80.66%      0.05s  0.86%  runtime.(*mspan).divideByElemSize (inline)
     0.05s  0.86% 81.52%      0.05s  0.86%  runtime.nextFreeFast (inline)
     0.05s  0.86% 82.38%      0.06s  1.04%  runtime.spanOf (inline)
     0.05s  0.86% 83.25%      0.10s  1.73%  sync.(*Pool).Put
     0.04s  0.69% 83.94%      0.10s  1.73%  bytes.(*Buffer).WriteString
     0.04s  0.69% 84.63%      0.04s  0.69%  runtime.heapBits.next (inline)
     0.04s  0.69% 85.32%      0.04s  0.69%  runtime.memhash64
     0.04s  0.69% 86.01%      0.05s  0.86%  sync.runtime_procPin
     0.03s  0.52% 86.53%      0.03s  0.52%  bytes.(*Buffer).tryGrowByReslice (partial-inline)
     0.03s  0.52% 87.05%      0.03s  0.52%  runtime.(*gcBits).bitp (inline)
     0.03s  0.52% 87.56%      0.05s  0.86%  runtime.(*mheap).allocSpan
     0.03s  0.52% 88.08%      0.03s  0.52%  runtime.add (inline)
     0.03s  0.52% 88.60%      0.03s  0.52%  runtime.markBits.isMarked (inline)
     0.03s  0.52% 89.12%      0.10s  1.73%  sync.(*Pool).pin
     0.02s  0.35% 89.46%      0.18s  3.11%  encoding/json.floatEncoder.encode
     0.02s  0.35% 89.81%      0.43s  7.43%  runtime.growslice
     0.02s  0.35% 90.16%      0.08s  1.38%  sync.(*Pool).Get
     0.01s  0.17% 90.33%      3.01s 51.99%  encoding/json.Marshal
     0.01s  0.17% 90.50%      0.26s  4.49%  encoding/json.typeEncoder
     0.01s  0.17% 90.67%      0.03s  0.52%  runtime.(*spanSet).pop
     0.01s  0.17% 90.85%      0.05s  0.86%  runtime.nilinterhash
     0.01s  0.17% 91.02%      0.14s  2.42%  strconv.genericFtoa
     0.01s  0.17% 91.19%      0.11s  1.90%  strconv.ryuFtoaShortest
         0     0% 91.19%      0.25s  4.32%  bytes.(*Buffer).grow
         0     0% 91.19%      0.06s  1.04%  bytes.makeSlice
         0     0% 91.19%      1.63s 28.15%  encoding/json.stringEncoder
         0     0% 91.19%      0.28s  4.84%  encoding/json.valueEncoder
         0     0% 91.19%      4.34s 74.96%  main.main
         0     0% 91.19%      0.03s  0.52%  runtime.(*mcache).allocLarge
         0     0% 91.19%      0.11s  1.90%  runtime.(*mcache).nextFree
         0     0% 91.19%      0.09s  1.55%  runtime.(*mcache).refill
         0     0% 91.19%      0.09s  1.55%  runtime.(*mcentral).cacheSpan
         0     0% 91.19%      0.08s  1.38%  runtime.(*mcentral).grow
         0     0% 91.19%      0.07s  1.21%  runtime.(*mheap).alloc
         0     0% 91.19%      0.05s  0.86%  runtime.(*mheap).alloc.func1
         0     0% 91.19%      0.03s  0.52%  runtime.(*mheap).freeSpan
         0     0% 91.19%      0.03s  0.52%  runtime.(*mheap).freeSpan.func1
         0     0% 91.19%      0.03s  0.52%  runtime.(*mspan).markBitsForIndex (inline)
         0     0% 91.19%      0.05s  0.86%  runtime.(*mspan).objIndex (inline)
         0     0% 91.19%      0.05s  0.86%  runtime.(*sweepLocked).sweep
         0     0% 91.19%      0.08s  1.38%  runtime.bgsweep
         0     0% 91.19%      1.33s 22.97%  runtime.gcBgMarkWorker
         0     0% 91.19%      1.33s 22.97%  runtime.gcBgMarkWorker.func2
         0     0% 91.19%      1.33s 22.97%  runtime.gcDrain
         0     0% 91.19%      0.03s  0.52%  runtime.goschedImpl
         0     0% 91.19%      0.03s  0.52%  runtime.gosched_m
         0     0% 91.19%      0.04s  0.69%  runtime.heapBits.initSpan
         0     0% 91.19%      4.34s 74.96%  runtime.main
         0     0% 91.19%      0.06s  1.04%  runtime.makeslice
         0     0% 91.19%      0.24s  4.15%  runtime.markroot
         0     0% 91.19%      0.24s  4.15%  runtime.markroot.func1
         0     0% 91.19%      0.03s  0.52%  runtime.mcall
         0     0% 91.19%      0.03s  0.52%  runtime.memclrNoHeapPointersChunked
         0     0% 91.19%      0.03s  0.52%  runtime.schedule
         0     0% 91.19%      0.24s  4.15%  runtime.suspendG
         0     0% 91.19%      0.08s  1.38%  runtime.sweepone
         0     0% 91.19%      1.42s 24.53%  runtime.systemstack
         0     0% 91.19%      0.04s  0.69%  runtime.typehash
         0     0% 91.19%      0.14s  2.42%  strconv.AppendFloat (inline)
         0     0% 91.19%      0.10s  1.73%  strconv.ryuDigits

Not bad! This is about as far as I can figure out how to take this without making massive new changes. So let's call it a day on this nosort implementation.

goccy/go-json

Now I wonder how this implementation compares to other existing libraries that improve on the Go standard libraries JSON encoder.

Let's add goccy/go-json which bills itself as the fastest encoder. Let's drop pkg/profile and rely solely on timings taken before and after the encode function is called. And let's beef up the benchmark script a bit more to be able to run multiple iterations and multiple kinds of encoders.

$ cp cache-columns.go goccy.go
$ diff -u cache-columns.go goccy.go
--- cache-columns.go    2022-03-03 19:01:54.121695243 +0000
+++ goccy.go    2022-03-03 18:51:15.265545778 +0000
@@ -3,11 +3,15 @@
 import (
        "bytes"
        "encoding/json"
+       "fmt"
        "log"
        "os"
+       "runtime"
        "strconv"
+       "strings"
+       "time"

-       "github.com/pkg/profile"
+       goccy_json "github.com/goccy/go-json"
 )

 func nosortEncoder(out *os.File, obj interface{}) error {
@@ -111,10 +115,16 @@
        return encoder.Encode(obj)
 }

+func goccy_jsonEncoder(out *os.File, obj interface{}) error {
+       encoder := goccy_json.NewEncoder(out)
+       return encoder.Encode(obj)
+}
+
 func main() {
        var in string
-       encoderArg := "stdlib"
-       encoder := stdlibEncoder
+       nTimes := 1
+       encoders := []func(*os.File, interface{}) error{stdlibEncoder}
+       encoderArgs := []string{"stdlib"}

        for i, arg := range os.Args {
                if arg == "--in" {
@@ -123,15 +133,29 @@
                        continue
                }

-               if arg == "--encoder" {
-                       encoderArg = os.Args[i+1]
-                       switch encoderArg {
-                       case "stdlib":
-                               encoder = stdlibEncoder
-                       case "nosort":
-                               encoder = nosortEncoder
-                       default:
-                               panic("Unknown encoder: " + encoderArg)
+               if arg == "--ntimes" {
+                       var err error
+                       nTimes, err = strconv.Atoi(os.Args[i+1])
+                       if err != nil {
+                               panic(err)
+                       }
+
+                       i += 1
+                       continue
+               }
+
+               if arg == "--encoders" {
+                       encoderArgs = strings.Split(os.Args[i+1], ",")
+                       encoders = nil
+                       for _, a := range encoderArgs {
+                               switch a {
+                               case "stdlib":
+                                       encoders = append(encoders, stdlibEncoder)
+                               case "nosort":
+                                       encoders = append(encoders, nosortEncoder)
+                               case "goccy_json":
+                                       encoders = append(encoders, goccy_jsonEncoder)
+                               }
                        }
                        i += 1
                        continue
@@ -151,16 +175,30 @@
                panic(err)
        }

-       fw, err := os.OpenFile(in+"-"+encoderArg+".json", os.O_TRUNC|os.O_WRONLY|os.O_CREATE, os.ModePerm)
-       if err != nil {
-               panic(err)
-       }
-       defer fw.Close()
+       fmt.Println("sample,encoder,time")
+       for i, encoder := range encoders {
+               encoderArg := encoderArgs[i]

-       p := profile.Start()
-       defer p.Stop()
-       err = encoder(fw, o)
-       if err != nil {
-               panic(err)
+               for i := 0; i < nTimes; i++ {
+                       fw, err := os.OpenFile(in+"-"+encoderArg+".json", os.O_TRUNC|os.O_WRONLY|os.O_CREATE, os.ModePerm)
+                       if err != nil {
+                               panic(err)
+                       }
+
+                       t1 := time.Now()
+                       err = encoder(fw, o)
+                       t2 := time.Now()
+                       if err != nil {
+                               panic(err)
+                       }
+
+                       fmt.Printf("%s,%s,%s\n", in, encoderArg, t2.Sub(t1))
+                       runtime.GC()
+
+                       err = fw.Close()
+                       if err != nil{
+                               panic(err)
+                       }
+               }
        }
 }

One important thing to call out here is runtime.GC() which forces the GC to run in-between tests rather than during them. This helps make the GC more predictable and less likely to influence timings during encoding. Without this you'd see a massive growth in time taken ever 3/4 runs.

We also can't defer closing the file since we're closing it in a loop and there's no defer-like mechanism in Go for scheduling a thing to run at the end of the (loop) block.

Let's try it out.

$ go build -o main goccy.go
$ ./main --in wide --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
wide,nosort,4.956871093s
wide,nosort,5.139428155s
wide,nosort,5.093279538s
wide,nosort,5.387918932s
wide,nosort,5.134292666s
wide,goccy_json,4.955825312s
wide,goccy_json,5.029368983s
wide,goccy_json,3.728623564s
wide,goccy_json,5.130309986s
wide,goccy_json,5.073473831s
wide,stdlib,6.788736114s
wide,stdlib,6.766644459s
wide,stdlib,6.194967849s
wide,stdlib,6.234464379s
wide,stdlib,6.809717451s

Woah! This was a genuine surprise. Most other benchmarks I did showed goccy/go-json as being much faster than this nosort implementation. So let's try some other datasets.

$ ./main --in long --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
long,nosort,5.800225163s
long,nosort,5.989665541s
long,nosort,6.004557873s
long,nosort,6.072411619s
long,nosort,5.899225226s
long,goccy_json,5.755024859s
long,goccy_json,5.434443782s
long,goccy_json,5.499432986s
long,goccy_json,5.411100532s
long,goccy_json,5.351709453s
long,stdlib,8.102073085s
long,stdlib,8.297254786s
long,stdlib,8.107278916s
long,stdlib,8.131379033s
long,stdlib,8.154066185s

So what we can see already is that this nosort implementation seems to perform pretty well with large datasets, shaving at most 10-15% time off of the standard library implementation. But it isn't necessarily the fastest library out there.

Let's try it out against a medium-sized dataset:

$ fakegen --rows 100000 --cols 10 > med.json
$./main --in med --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
med,nosort,786.904739ms
med,nosort,801.232731ms
med,nosort,743.066301ms
med,nosort,700.651086ms
med,nosort,736.939572ms
med,goccy_json,527.869523ms
med,goccy_json,500.694309ms
med,goccy_json,529.841332ms
med,goccy_json,521.776871ms
med,goccy_json,520.731622ms
med,stdlib,862.493399ms
med,stdlib,840.786104ms
med,stdlib,714.077849ms
med,stdlib,862.17802ms
med,stdlib,854.657331ms

And a small dataset:

$ fakegen --rows 10000 --cols 10> small.json
./main --in small --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
small,nosort,40.682862ms
small,nosort,38.091065ms
small,nosort,37.631626ms
small,nosort,37.592715ms
small,nosort,37.357291ms
small,goccy_json,34.784181ms
small,goccy_json,23.023397ms
small,goccy_json,67.313957ms
small,goccy_json,24.348643ms
small,goccy_json,66.196051ms
small,stdlib,47.655781ms
small,stdlib,43.749318ms
small,stdlib,46.656674ms
small,stdlib,46.858409ms
small,stdlib,47.482409ms

And finally let's use the dataset that octosql uses. It is a csv but we can use dsq to convert it to JSON.

$ curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-04.csv -o taxi.csv
$ dsq taxi.csv > taxi.json
$ ls -lah taxi.json
-rw-r--r-- 1 phil phil 877M Mar  3 16:40 taxi.json
$ wc taxi.json
  2171186   6513561 919169646 taxi.json
$ ./main --in taxi --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
taxi,nosort,7.318133704s
taxi,nosort,6.865036526s
taxi,nosort,6.924191823s
taxi,nosort,6.972792412s
taxi,nosort,6.912377632s
taxi,goccy_json,7.504571929s
taxi,goccy_json,7.481574296s
taxi,goccy_json,7.520564746s
taxi,goccy_json,7.543578814s
taxi,goccy_json,7.561023391s
taxi,stdlib,12.401270896s
taxi,stdlib,12.328176325s
taxi,stdlib,12.383455251s
taxi,stdlib,12.316689475s
taxi,stdlib,12.374483769s

Not bad!

Composability

Now the cool thing about this nosort implementation is that we can swap out the underlying JSON encoder library. For example, instead of using the standard library's encoder under the hood we could use goccy:

$ cp goccy.go goccy_nosort.go
$ diff -u goccy.go goccy_nosort.go
--- goccy.go    2022-03-03 18:51:15.265545778 +0000
+++ goccy_nosort.go     2022-03-03 19:25:44.119269651 +0000
@@ -14,7 +14,7 @@
        goccy_json "github.com/goccy/go-json"
 )

-func nosortEncoder(out *os.File, obj interface{}) error {
+func nosortEncoder(out *os.File, obj interface{}, marshalFn func(o interface{}) ([]byte, error)) error {
        a, ok := obj.([]interface{})
        // Fall back to normal encoder
        if !ok {
@@ -42,7 +42,7 @@
                r, ok := row.(map[string]interface{})
                if !ok {
                        log.Println("Falling back to stdlib")
-                       bs, err := json.Marshal(row)
+                       bs, err := marshalFn(row)
                        if err != nil {
                                return err
                        }
@@ -81,7 +81,7 @@
                                return err
                        }

-                       bs, err := json.Marshal(val)
+                       bs, err := marshalFn(val)
                        if err != nil {
                                return err
                        }
@@ -152,7 +152,13 @@
                                case "stdlib":
                                        encoders = append(encoders, stdlibEncoder)
                                case "nosort":
-                                       encoders = append(encoders, nosortEncoder)
+                                       encoders = append(encoders, func(out *os.File, o interface{}) error {
+                                               return nosortEncoder(out, o, json.Marshal)
+                                       })
+                               case "nosort_goccy":
+                                       encoders = append(encoders, func(out *os.File, o interface{}) error {
+                                               return nosortEncoder(out, o, goccy_json.Marshal)
+                                       })
                                case "goccy_json":
                                        encoders = append(encoders, goccy_jsonEncoder)
                                }

Let's try it out against the small dataset:

$ ./main --in small --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
small,nosort,41.760112ms
small,nosort,37.335286ms
small,nosort,39.368679ms
small,nosort,38.287757ms
small,nosort,38.271166ms
small,nosort_goccy,27.599888ms
small,nosort_goccy,28.923579ms
small,nosort_goccy,28.135659ms
small,nosort_goccy,28.628808ms
small,nosort_goccy,28.463473ms
small,goccy_json,31.275921ms
small,goccy_json,67.421084ms
small,goccy_json,33.94129ms
small,goccy_json,31.408809ms
small,goccy_json,31.221386ms
small,stdlib,47.177755ms
small,stdlib,46.994301ms
small,stdlib,43.017482ms
small,stdlib,47.505036ms
small,stdlib,46.647982ms

The mid-sized dataset:

$ ./main --in med --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
med,nosort,782.66375ms
med,nosort,737.980525ms
med,nosort,721.492919ms
med,nosort,724.356263ms
med,nosort,736.395523ms
med,nosort_goccy,498.971623ms
med,nosort_goccy,502.579077ms
med,nosort_goccy,489.708421ms
med,nosort_goccy,526.048878ms
med,nosort_goccy,512.748401ms
med,goccy_json,532.466736ms
med,goccy_json,534.507837ms
med,goccy_json,543.230595ms
med,goccy_json,533.324592ms
med,goccy_json,530.899822ms
med,stdlib,863.344366ms
med,stdlib,697.312404ms
med,stdlib,868.055438ms
med,stdlib,868.654749ms
med,stdlib,897.503696ms

The long dataset:

$ ./main --in long --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
long,nosort,5.916875793s
long,nosort,6.039521791s
long,nosort,5.935543183s
long,nosort,6.036550077s
long,nosort,6.011574246s
long,nosort_goccy,4.813237568s
long,nosort_goccy,4.761463935s
long,nosort_goccy,4.752175775s
long,nosort_goccy,4.798997185s
long,nosort_goccy,4.695403759s
long,goccy_json,5.800176224s
long,goccy_json,5.409573733s
long,goccy_json,5.36512912s
long,goccy_json,5.418035519s
long,goccy_json,5.411432234s
long,stdlib,8.105650617s
long,stdlib,8.024426906s
long,stdlib,8.038723257s
long,stdlib,8.089056904s
long,stdlib,8.046626953s

The wide dataset:

$ ./main --in wide --encoders nosort,goccy_json,stdlib --ntimes 5
sample,encoder,time
wide,nosort,4.966890211s
wide,nosort,5.10385764s
wide,nosort,5.209565609s
wide,nosort,5.05580971s
wide,nosort,5.128548152s
wide,nosort_goccy,3.807389446s
wide,nosort_goccy,3.946759042s
wide,nosort_goccy,3.786206428s
wide,nosort_goccy,3.857447258s
wide,nosort_goccy,3.719183236s
wide,goccy_json,4.982229773s
wide,goccy_json,5.056007992s
wide,goccy_json,4.98831937s
wide,goccy_json,4.9149848s
wide,goccy_json,4.89111462s
wide,stdlib,6.818520461s
wide,stdlib,6.79955319s
wide,stdlib,6.785183651s
wide,stdlib,6.713526793s
wide,stdlib,6.741904743s

And the taxi dataset:

./main --in taxi --encoders nosort,nosort_goccy,goccy_json,stdlib --ntimes 5
sample,encoder,time
taxi,nosort,7.441821343s
taxi,nosort,7.16027916s
taxi,nosort,7.114210802s
taxi,nosort,6.970729169s
taxi,nosort,6.976680741s
taxi,nosort_goccy,5.306546435s
taxi,nosort_goccy,5.314004524s
taxi,nosort_goccy,5.21857126s
taxi,nosort_goccy,5.254188054s
taxi,nosort_goccy,5.161521902s
taxi,goccy_json,7.484639889s
taxi,goccy_json,6.2012459s
taxi,goccy_json,7.671579687s
taxi,goccy_json,7.594733731s
taxi,goccy_json,7.514168117s
taxi,stdlib,12.551869909s
taxi,stdlib,12.294643142s
taxi,stdlib,11.52487468s
taxi,stdlib,11.410906914s
taxi,stdlib,12.228855201s

That's about a 55% speed improvement on the standard library encoder. That's pretty good!

Validate

Finally, let's pass through the nosort generated JSON and have it be encoded with the standard libary JSON encoder. Then if there is no diff between that result and the stdlib encoded JSON, we'll know that we have emitted correct and valid JSON.

$ ./main --in taxi-nosort
sample,encoder,time
taxi-nosort,stdlib,13.00740141s
$ diff taxi-nosort-stdlib.json taxi-stdlib.json
$ echo $?
0

Okee doke. :)

Fin

What's neat is how much we can improve on the standard library without reimplementing the hairiest parts of JSON encoding like actual value reflection.

If you want to run these benchmarks for yourself, the code is all on Github.

I'm also working on breaking this out as a standalone library. You can check it out on Github.

Caveat

The results here are surprisingly good. Surprisingly good in these situations normally means you messed up. So, Internet, do your thing and correct me where I'm wrong!

Machine specs

I am running these benchmarks on a dedicated bare metal instance, OVH Rise-1.

RAM: 64 GB DDR4 ECC 2,133 MHz
Disk: 2x450 GB SSD NVMe in Soft RAID
Processor: Intel Xeon E3-1230v6 - 4c/8t - 3.5 GHz/3.9 GHz

New blog post! On the slight modification we make to DataStation's JSON encoder to improve performance when writing large arrays of objects.https://t.co/wkwP7aW6HB pic.twitter.com/y0wNDV3Bma
— Multiprocess Labs (@multiprocessio) March 3, 2022

With questions, criticism or ideas, email or Tweet me.

Also, check out DataStation and dsq.