May 9, 2022: DataStation 0.10.0 is released! Read more
Blog

The world of PostgreSQL wire compatibility

Published on by

A wire protocol is the format for interactions between a database server and its clients. It encompasses authentication, sending queries, receiving responses, and so on. It is a description of the exact bytes sent and received by servers and clients. It does NOT encompass the actual query language itself, let alone database semantics.

Every server-client database has a wire protocol. (SQLite is an embedded database, not a server-client database, so it does not have a wire protocol.) Wire protocols are important to both database developers and database driver developers.

Proprietary databases like Oracle and IBM Db2 find value in developing their own drivers. They don't tend to publish their wire protocol. Among other reasons, it's just not necessary since they develop their own drivers. Proprietary vendors tend to prefer a monolithic driver like ODBC/JDBC rather than native drivers since it's easier to develop one driver for all languages than one for each language. If someone wants to develop a native driver for a database without a published wire protocol they'll need to reverse-engineer the wire protocol. An example of this is the go-ora Oracle client for Go.

Worth noting, among proprietary vendors, SQL Server actually does the opposite. Their wire protocol has been publicly documented for over a decade. And more native drivers exist for SQL Server than for other proprietary vendors with their own wire protocol.

In contrast, the wire protocols for open-source databases tend to be documented since it's just another part of an open-source system. Due to their popularity and age, PostgreSQL and MySQL wire protocols are particularly well documented. Native drivers exist for both databases in every major language.

For this reason, a large number of newer database vendors just reuse the PostgreSQL or MySQL wire protocol. This means that every existing driver (and SQL GUI like DataStation) will work with them without any modification to driver code. This doesn't mean that any PostgreSQL or MySQL query will work with them since, as mentioned previously, query language and database semantics are independent of the wire protocol.

In Practice

Let's write a simple program that can handle query text and read data from disk. If it implements the PostgreSQL wire protocol protocol, it would work with any existing PostgreSQL client. That would be neat!

Someone already wrote a small library implementing the PostgreSQL server protocol in Python2. I ported it to Python3 for the sake of this tutorial. So we can just grab that Python3 fork's code and edit the query method with some code that handles the query text and reads data from disk, and we'd be done!

To reiterate, the wire protocol doesn't specify anything about the query language. So let's support a single made-up query: SELECT FROM x where x can be a valid table name:

$ diff original.py pgserver.py
0a1
> import json
189,190c190,202
<     fields = [IntField('a'), IntField('b')]
<     rows = [[1, 2], [3, 4], [5, 6]]
---
>     sql = sql.decode().upper()
>     prefix = 'SELECT FROM '
>     if not sql.startswith(prefix):
>       raise Exception('Bad query.')
>
>     table = sql.strip()[len(prefix):-2].strip().lower()
>     try:
>       with open('data/'+table+'.json') as f:
>         d = json.load(f)
>         fields = [IntField(k) for k in d[0].keys()]
>         rows = [[row[field.name] for field in fields] for row in d]
>     except Exception as e:
>       raise Exception('Invalid table.') from e

Now let's create a data directory where all the table JSON files will be stored. And let's create a few tables:

$ cat data/users.json
[
    {"id": 0, "age": 100},
    {"id": 1, "age": 45},
    {"id": 2, "age": 52}
]
$ cat data/users_organization.json
[
    {"user_id": 0, "organization_id": 0},
    {"user_id": 1, "organization_id": 1},
    {"user_id": 2, "organization_id": 0}
]

Run it:

$ python pgserver.py
server running, try: $ psql -h localhost -p 55432

And in a separate terminal, open psql as suggested (enter any text when prompted for a password, this code doesn't do anything with it), and run some queries:

$ psql -h localhost -p 55432
Password for user phil:
psql (13.4, server 0.0.0)
Type "help" for help.

phil=> select from users;
 id | age
----+-----
  0 | 100
  1 |  45
  2 |  52
(3 rows)

phil=> select from users_organization;
 user_id | organization_id
---------+-----------------
       0 |               0
       1 |               1
       2 |               0
(3 rows)

That is really neat! Incidentally a similar technique can be used to help in testing a la pgmock.

Vendors

But let's get back to the real world, dig into some of the databases that implement the PostgreSQL wire protocol, and see where else they diverge.

And although this table mentions compatibility, don't take that to be a judgement call. Just because a database implements the PostgreSQL wire protocol does not mean it intends to be a drop-in replacement for PostgreSQL.

Name Category Source Implementation language Parser Language compatibility ACID? Scaling strategy
PostgreSQL (the control group) Relational Open C YACC N/A Yes Limited
QuestDB Time series Open Java Custom hand-written Generally compatible* No Enterprise only, strategy unclear
CockroachDB Relational Available Go YACC grammar adapted from PostgreSQL Unlikely to fall far behind PostgreSQL due to sharing a grammar specification Yes Distributed using Raft
CrateDB Document Open Java Custom ANTLR grammar Generally compatible* No Shared nothing, eventually consistent
YugabyteDB Relational Open C, C++ PostgreSQL's parser as a library Unlikely to fall far behind PostgreSQL due to using the PostgreSQL parser Yes Sharding on primary key
Timescale Relational Open C Timescale is a PostgreSQL extension, so it uses PostgreSQL's parser Cannot fall behind PostgreSQL Yes Sharding on user-chosen column
immudb Document Open Go Custom YACC grammar Somewhat compatible* Yes No
Materialize Streaming Available Rust Custom hand-written Generally compatible* No Experimental: Enterprise only, data-parallel
RavenDB Document Open C# Custom hand-written parser Not compatible Yes Distributed using Raft
Cloud Spanner Relational Proprietary ** ** Generally compatible* Yes **
Aurora Relational Proprietary ** ** Generally compatible* Yes **
Yellowbrick Analytics Proprietary ** ** Generally compatible* ** ** **

* Not reusing PostgreSQL's parser does make it somewhat more likely for there to be differences in complex SQL queries compared to PostgreSQL
** Unknown because I didn't take the time to look into it. Feel free to message me and I'll fill it in.

And a special shoutout to ClickHouse that according to its config example can pretend to be PostgreSQL and, separately, MySQL. But since this isn't documented nor can I find the PostgreSQL parser code (but I can find the MySQL parser code) I won't include it in this list.

I'm sure I missed some. Message me with other PostgreSQL wire-compatible databases I should include here!

Syntax and engine support

A parser is only the start of compatibility though. Just because a parser supports some syntax doesn't mean that the engine supports it too. But you can be more sure that if a parser doesn't support some syntax, the engine can't support it either (unless the project keeps dead code around, which isn't common).

And an implementation that falls behind on parser support is likely to fall behind in other areas of compatibility. This is not to say that one vendor is better or worse. Excellent engineering can almost always overcome these types of obstacles. But there are always implications to technical design decisions.

So just looking at parser support is not enough to talk about PostgreSQL query or semantic compatibility. But it is a starting point.

Wire protocol edge cases, another time

Lastly, I said this post was about wire protocols but then I ended up focusing on the query language (which as I mentioned is separate from the wire protocol). The wire protocol does more than just wrap queries though. It includes more complex interactions like COPY-ing data or asynchronous requests or cancelling running requests.

In this post I did not evaluate the level of completeness of each database's implementation of the wire protocol. I only looked into what they'd do with basic queries. More could be done on evaluating the edge cases of the wire protocol in each implementation, in another post.

Let me know about your ideas, questions, corrections!

Share

With questions, criticism or ideas, email or Tweet me.

Also, check out DataStation and dsq.