May 9, 2022: DataStation 0.10.0 is released! Read more
Blog

Analyzing large JSON files via partial JSON parsing

Published on by

Multiprocess's shape library allows you to get a schema for arbitrary data. It does some sampling on its own internally. But what happens when you're trying to analyze the schema of a JSON file that can't be loaded into memory? The naive approach will fail once your JSON file is greater than 4Mb large because Node.js strings can only be that large.

import fs from 'fs';

const f = fs.readFileSync('myfile.json');
JSON.parse(f.toString()); // Fails once myfile.json > 4Mb

Interestingly, Python has no such low limit on string size. Which goes to explain yet another reason why it's the goto for data scientists. But (this part of) DataStation is written in JavaScript/Node.js.

One option is to switch from the builtin JSON.parse function to a streaming parser. Another option is to sample the JSON file using partial JSON parsing. This post explores how DataStation uses partial JSON parsing to solve this problem.

Finding a valid subset of JSON

In an arbitrary JSON object, you can't just select the first N bytes of a file or even the first N lines of a file and get a meaningful string. Even if you know your JSON object is an array of some kind of object you can't select N lines because there could be lines within a JSON object.

// Annoying JSON examples
$ cat notanarray.json
{
  "a": [
    "my value1",
  ], "b": "some more stuff"
}
$ cat arrayofobjects.json
[
  {
    "a": "some value"
  },
  { "b": "some more stuff" }
]

But JSON grammar is pretty simple. There are only a few kind of state delimiters: "", [] and {}. If we could keep track of opening markers as we read the first N bytes of a JSON file, we could add all the appropriate closing markers once we hit the Nth byte. Then we'd have a well-formed subset of the JSON file that is just over N bytes long (since the closing markers get added after the Nth byte).

Of course there's the caveat that this subset may or may not be meaningful since who knows if the first M parts of a JSON file are representative of the rest of the JSON in the file. But for now we'll ignore the possibility of unrepresentative sampling.

Two additional restrictions we'll add is that if we hit N bytes and are still in a string, we'll keep reading until the end of a string. And similarly if we're in the middle of an array or object we won't stop reading until we reach the end of that last array or object. This let's us deal with the case of stopping after an object key but before the value.

After we hit the last object or array in the first N bytes we pop off the opening marker stack and add the corresponding closing marker to create a valid JSON string. Now we have a string we can pass to JSON.parse. That's a little wasteful as an algorithm but it saves some code for now. And thankfully the performance of this algorithm is not related to the size of the input once the input is beyond N bytes long.

Here's the algorithm in full (link to source):

import fs from 'fs';
import { preview } from 'preview';
import { shape } from 'shape';
import { NoResultError } from '../shared/errors';

export function parsePartialJSONFile(
  file: string,
  maxBytesToRead: number = 100_000
) {
  let fd: number;
  try {
    fd = fs.openSync(file, 'r');
  } catch (e) {
    throw new NoResultError();
  }

  const { size } = fs.statSync(file);

  if (size < maxBytesToRead) {
    const f = fs.readFileSync(file).toString();
    const value = JSON.parse(f);
    return {
      size,
      value,
      arrayCount: Array.isArray(value) ? value.length : null,
      shape: shape(value),
      preview: preview(value),
      skipWrite: true,
      contentType: 'application/json',
    };
  }

  try {
    let done = false;
    let f = '';
    const incomplete = [];
    let inString = false;

    while (!done) {
      const bufferSize = 1024;
      const b = Buffer.alloc(bufferSize);
      const bytesRead = fs.readSync(fd, b);

      // To be able to iterate over code points
      let bs = Array.from(b.toString());
      outer: for (let i = 0; i < bs.length; i++) {
        const c = bs[i];
        if (c !== '"' && inString) {
          continue;
        }

        switch (c) {
          case '"':
            const previous =
              i + bs.length === 0
                ? ''
                : i > 0
                ? bs[i - 1]
                : f.charAt(f.length - 1);
            const isEscaped = previous === '\\';
            if (!isEscaped) {
              inString = !inString;
            }
            break;
          case '{':
          case '[':
            incomplete.push(c);
            break;
          case ']':
          case '}':
            if (f.length + bufferSize >= maxBytesToRead) {
              bs = bs.slice(0, i);
              // Need to not count additional openings after this
              done = true;
              break outer;
            }

            // Otherwise, pop it
            incomplete.pop();
            break;
        }
      }

      f += bs.join('');
      if (bytesRead < bufferSize) {
        break;
      }
    }

    while (incomplete.length) {
      if (incomplete.pop() === '{') {
        f += '}';
      } else {
        f += ']';
      }
    }

    const value = JSON.parse(f);

    let arrayCount = null;
    if (Array.isArray(value)) {
      let averageRowSize = 0;
      const n = 100;
      for (const row of value.slice(0, n)) {
        averageRowSize += JSON.stringify(row).length;
      }

      arrayCount = Math.ceil(size / (averageRowSize / n));
    }

    return {
      size,
      value,
      shape: shape(value),
      preview: preview(value),
      arrayCount,
      skipWrite: true,
      contentType: 'application/json',
    };
  } finally {
    fs.closeSync(fd);
  }
}

It is similar to the common interview question about counting and validating parenthesis in a string. You'll notice that this implementation doesn't handle malformed JSON at all. For the moment that's ok since DataStation internally guarantees that the JSON that goes into this algorithm will be well-formed.

Thanks to this algorithm, DataStation can easily analyze the schema of files as large as 600Mb. It's possible it can handle larger I just haven't tested larger files yet.

Share

With questions, criticism or ideas, email or Tweet me.

Also, check out DataStation and dsq.