What's new in dsq? March 2022 update

Published on March 23, 2022 by Phil Eaton

dsqgosqltutorial

dsq is an open-source commandline tool for running SQL queries against many different file types. It is built on top of DataStation, an app that lets you query files, databases, APIs; script data in Python, JavaScript, Ruby,etc.; and build graphs, dashboards and exports. Everything you can do in dsq you can do in DataStation, but DataStation can do quite a bit more.

The introduction post from January goes into more basic details about how you can use dsq. That post covers dsq before its first release.

There have been 10 releases since then and I've been able to lift almost all initial restrictions. So I wanted to give you an update now that the 0.9.0 release is out.

Nested objects

You can now query nested objects with dsq. Given the following file user_address.json:

$ cat user_addresses.json
[
  {"name": "Agarrah", "location": {"city": "Toronto", "address": { "number": 1002 }}},
  {"name": "Minoara", "location": {"city": "Mexico City", "address": { "number": 19 }}},
  {"name": "Fontoon", "location": {"city": "New London", "address": { "number": 12 }}}
]

You can query nested fields like so:

$ dsq user_addresses.json 'SELECT name, "location.city" FROM {}' | jq
[
  {
    "name": "Agarrah",
    "location.city": "Toronto"
  },
  {
    "name": "Minoara",
    "location.city": "Mexico City"
  },
  {
    "location.city": "New London",
    "name": "Fontoon"
  }
]

Data within a nested array

DataStation defaults to assuming it's working with an array of objects. But some systems store their array data within an object. For example you might have an API response like this:

$ cat api-results.json
{
  "data": {
    "data": [
      {"id": 1, "name": "Corah"},
      {"id": 3, "name": "Minh"}
    ]
  },
  "total": 2
}

You can now query this in dsq with:

$ dsq api-results.json 'SELECT * FROM {0, "data.data"} ORDER BY id DESC' | jq
[
  {
    "id": "3",
	"name": "Minh"
  },
  {
    "id": "1",
	"name": "Corah"
  }
]

Similarly if you have an Excel file with multiple sheets you can use that same path syntax. Swap out the path string for the quoted name of your sheet:

$ dsq data.xlsx 'SELECT COUNT(1) FROM {0, "Sheet2"}'

Array data

You can now access array data within an object. DataStation transforms arrays into a JSON string. Since SQLite has builtin JSON operators you can act on these columns as structured data.

So if you have data like this in fields.json:

[
  {"field1": [1]},
  {"field1": [2]},
]

You can request the entire field:

$ dsq fields.json "SELECT field1 FROM {}" | jq
[
  {
    "field1": "[1]"
  },
  {
    "field1": "[2]",
  }
]

Or you can get the first value in the array using SQL JSON operators.

$ dsq fields.json "SELECT field1->0 FROM {}" | jq
[
  {
    "field1->0": "1"
  },
  {
    "field1->0": "2"
  }
]

New file types

Since the first blog post you can now query newline-delimited JSON, concatenated JSON, Apache ORC, and ODS (OpenOffice Sheets) files.

Apache ORC support was contributed by a first-time OSS developer.

Performance improvements

Thanks to switching to goccy/go-json and switching to prepared statements for insertions into SQLite, dsq is now twice as fast on OctoSQL's benchmark (in my own testing; the OctoSQL benchmark is not up-to-date using newer dsq releases). It's still pretty slow but 1) DataStation/dsq does a lot more than the other tools, 2) it's the youngest among those tools, and 3) performance is not my greatest concern right now.

Package managers

You can now grab dsq on Homebrew, on AUR and even as a Babel plugin for Emacs.

Try it out, become a contributor

Sound interesting? Try out DataStation and dsq! There is a ton of relatively easy tasks to pick up if you're trying to break into OSS. Join the Discord #dev channel and say hi!

A few months after the first blog post announcing dsq here's an update on the last 10 releases and some major new featureshttps://t.co/ImZ3rdMB5X
— Multiprocess Labs (@multiprocessio) March 23, 2022

With questions, criticism or ideas, email or Tweet me.

Also, check out DataStation and dsq.