Postgres JSON, Developer Productivity, and The MongoDB Advantage
Updated 11-Nov-2021: Snowflake has the same fundamental challenge in dealing
with rich shapes as Postgres.
Original post 15-Dec-2017
Adding support for a JSON data type does not make Postgres, Snowflake, or Oracle a document
store. While Postgres allows more expressive querying and manipulation of JSON
data than almost any other relational database, it suffers from three
constraints:
- SQL syntax and, indeed, the relational model as a whole are
designed to work with single, scalar values which carry the same type from row
to row, not rich shapes like JSON that can contain substructures and arrays and
different elements from row to row. And the extensions to SQL utilized by
Postgres to manipulate JSON are proprietary, unknown to most SQL developers,
and not supported or even recognized by most 3rd party SQL tools.
- Nearly all traditional Postgres interaction with applications -- written in almost every
language -- works via O/JDBC drivers, and these drivers cannot properly and
precisely convert the JSON into a useful native type (such as a Map or List in
Java or a Dictionary or List in Python).
- Perhaps
most importantly, pure JSON itself has a smaller set of basic types than SQL or
BSON. In pure JSON one cannot distinguish between a 32 bit integer, a 64 bit
integer, a floating point number, or an arbitrary precision decimal. It also
has no native dates or binary data. At best, dates are carried as ISO 8601
strings; at worst, homegrown, brittle, and unsupported hacks thereof.
Let's explore just the fundamental issues involved in how Postgres and
Snowflake handle JSON and why MongoDB does not experience them.
Scalars Are Tough Enough...
Consider the following piece of JSON that we might
insert into a Postgres table foo with a json-typed column named content:
{"props": {"a": 12, "fn": [10,20,30]}, "cd":"2017-11-22T15:20:34.326Z"}
Diving into the array with the Postgres-proprietary arrow operators initially
looks promising -- although Postgres JSON arrays are zero-based, not one-based
like O/JDBC, which is a source of confusion from the start for seasoned SQL
developers:
select content->'props'->'fn'->0 from foo;
?column?
----------
10
We will make this point only once:
-
The expression above is not standard SQL.
Every RDBMS handles access and operators on JSON differently; for example,
Snowflake uses dot notation e.g. select content.props.fn. Thus, the
investments you make to deal with JSON in one vendor RDBMS are not portable
to any other.
A different usage story emerges when you try to actually use the output. Suppose
we wish to divide what we believe is an integer by 2; we expect 5 but instead
we are given 10 again:
select content->'props'->'fn'->0 / 2 from foo;
?column?
----------
10
There are two factors in play here:
- The expression 0/2 has higher precedence
than the single arrow operator, so the query is still evaluating ->0.
This is very easy to miss because
no errors or warnings are produced: 0/2 is silently evaluated (integer
division yields 0) and used as the array index, which is a source of bugs and
tedious investigation.
- The arrow operator in Postgres always returns "json" as a type,
so that a chain of arrows (a "path") will work. To actually get the data
we want, we must use
the double arrow operator (->>) to turn the terminal component of the path from
"json" into a "text" type, followed by an explicit cast to the type we seek,
like an integer. Note that the double arrow operator converts only
to text which is why we need the cast:
select (content->'props'->'fn'->>0)::int / 2 from foo;
?column?
----------
5
Of course, the problem is the developer had to know to cast to an integer.
What if the intended type was actually a float or a decimal? There is no way to
tell exactly what numeric type is being carried in the array. The same problem
applies to the cd field. It "looks" like a date to the eye but there is
nothing to precisely tell you it is a date; the developer must realize this and
manually cast it to ::timestamp to perform the desired datetime operations upon
it. And none of this special JSON casting is well supported by any tools,
commercial or otherwise. Developers accustomed to having the schema provide a
very large set of types (smallint, bigint, float, decimal, date, datetime, etc.) and to using the typed field methods of O/JDBC drivers suddenly find
themselves without these essential tools. This is the first hit on developer
productivity in Postgres: the difficulty involved in managing basic types in
pure JSON.
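To make that concrete, here is a hedged sketch of the casting a developer must
remember to write by hand, using the same execSQL() convenience wrapper that
appears in the Java example further below:

// Nothing in the json type system says fn[0] is an int or that cd is a
// datetime; the developer must know this out of band and cast accordingly.
ResultSet rs = execSQL(
    "select (content->'props'->'fn'->>0)::int as n, " +
    "       (content->>'cd')::timestamp as cd from foo");
while (rs.next()) {
    int n = rs.getInt("n");                        // works only because WE chose ::int
    java.sql.Timestamp cd = rs.getTimestamp("cd"); // works only because WE chose ::timestamp
}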
...but Rich Shapes Are Even Harder
One of the reasons to use JSON in the first
place is to carry dynamic fields and work with a set of them at once. Suppose
we wish to extract the props field of content as a complete rich shape:
select content->'props' from foo;
?column?
---------------------------------
{"a": 12, "fn": [10,20,30]}
This "looks" good on the command line -- but all we are seeing is a toString()
representation of the substructure in the props field; the returned type is
just JSON, which is a string, complete with whitespace and commas.
The challenge is even more obvious in the O/JDBC
drivers. Below is an example in Java:
// execSQL() is a convenience wrapper that returns a java.sql.ResultSet;
// PGobject is org.postgresql.util.PGobject.
ResultSet rs = execSQL("select content->'props' from foo");
while (rs.next()) {
    Object obj = rs.getObject(1);
    System.out.println(obj.getClass() + ": " + obj.toString());
    if (obj instanceof PGobject) {
        PGobject pgo = (PGobject) obj;
        System.out.println("  PG type: " + pgo.getType());
        System.out.println("  PG val: " + pgo.getValue());
        System.out.println("  PG val class: " + pgo.getValue().getClass());
    }
}
Upon execution we see this:
class org.postgresql.util.PGobject: {"a":12, "fn": [10, 20, 30]}
PG type: json
PG val: {"a": 12, "fn": [10, 20, 30]}
PG val class: class java.lang.String
First, the object returned isn't a standard java.sql.* type or a java.lang.*
type; it is a custom Postgres type (org.postgresql.util.PGobject).
But much more importantly, the getValue() method returns exactly what JSON is: a
string. The getType() method affirms this ("json") but that is no help. No
parsing is performed; we have to manually parse the JSON in our application
(a representative sketch follows the list below). This
introduces a whole set of issues around parsing, including:
- Which JSON parser
should be used and what is the process around keeping it current in the
software inventory (since parsing JSON from the database is certainly a
critical capability that must be well-managed)?
- If a single vendor cannot
provide a JSON parser for a set of languages, then what is the strategy to
ensure precision and consistency of JSON data across languages?
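For instance, if Jackson happened to be the parser chosen (an assumption; any
JSON library has the same limitations), the application-side code might look
like this:

import com.fasterxml.jackson.databind.ObjectMapper;

// pgo.getValue() is just a String; the application must parse it itself.
ObjectMapper mapper = new ObjectMapper();
java.util.Map<String, Object> props = mapper.readValue(pgo.getValue(), Map.class);

Object fn0 = ((java.util.List<?>) props.get("fn")).get(0);
// fn0 is whatever the parser guessed: Integer here, perhaps Long or Double
// for other rows -- and a date would still be just a String.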
And in the end, this still does not answer the fundamental type fidelity
problem: fn is an array of what, exactly? Even if we make the terrible assumption
that 10.0 is a float and 10 is a 64-bit int and eliminate 32-bit ints
altogether, we are still exposed to the problems of floating-point numbers in
financial calculations. What if fn were a vector of penny-precise fees? Even if
somehow all these conversions could be catalogued, how would you ensure that
every single data access point honors them? Postgres has no capability of
performing JSON fieldname-type validation; it can only validate raw JSON
syntax. And storing values as strings doesn't help because what might be gained
from a consistent fromString() framework is offset by removing the ability
to easily perform numeric comparisons and functions like grouping in the
database. In 2017 no one should be hobbling a data design by turning numbers
and dates into strings.
The problem remains if we try to use the double arrow operator:
ResultSet rs = execSQL("select content->>'props' from foo");
...
class java.lang.String: {"a":12, "fn": [10, 20, 30]}
The returned object is once again, ultimately, just a string we must parse
ourselves. Postgres JSON delivers a one-two negative productivity punch: lower
scalar type fidelity (e.g. no dates) and, ironically, no capability to turn
the array and object types into something useful.
...and Polymorphism is Effectively Impossible
Assume fn is not just an array of numbers but may also contain
strings and other types. Let's add another row to Postgres; note that because
the column accepts only pure JSON, the date must travel as a plain string and
every key must be quoted:
{"props": {"a": 5, "fn": ["mix", 7.0, "2017-11-22", true, {"x": 3.0}]}}
And reissue our query to get the zeroth element from fn:
select (content->'props'->'fn'->>0)::int from foo;
ERROR: invalid input syntax for integer: "mix"
The error is a consequence of basic SQL and relational modeling: columns have
defined types that must be honored row after row. Narrowing Postgres JSON
queries to individual scalars and then force-casting them to one type dooms the
polymorphic data design principle that is so effective in allowing MongoDB to
use a single collection to represent many (sometimes dozens) of individual
tables in Postgres. And in order to do anything useful in the Postgres engine
like type-correct filtering, we must get to individual scalars. Most of the
JSON processing functions in Postgres have two important caveats:
- They only
operate on the topmost fields in the JSON record. This limitation is largely
due to the inability of SQL to deal with the results of recursion.
- Any
functions that extract single scalars return them as type text -- which puts us
right back at the beginning: not knowing the basic type! A brief illustration
appears below.
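As a quick illustration of both caveats (json_each_text() is a standard
Postgres function that explodes only the top level of a document, and every
value it returns is text):

ResultSet rs = execSQL(
    "select key, value, pg_typeof(value) from foo, json_each_text(content)");
while (rs.next()) {
    // For the first row, the props substructure is NOT exploded and everything
    // comes back as text:
    //   props   {"a": 12, "fn": [10,20,30]}   text
    //   cd      2017-11-22T15:20:34.326Z      text
    System.out.println(rs.getString(1) + " " + rs.getString(2) + " " + rs.getString(3));
}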
In summary, using JSON columns and JSON operators in Postgres is scarcely
better than storing JSON in a BLOB.
MongoDB: JSON done right through BSON
MongoDB was designed from the start to
work with rich shapes and a variety of important fundamental types, including
dates and decimal types. Many examples in MongoDB use Javascript because it is
a great way to easily expose both the JSON-like structure of a document as well
as the broad set of native scalar types. Let's insert the same pieces of data
again from the MongoDB CLI using Javascript, only this time using more types at
our disposal:
> db.foo.insert([
    {"props": {"a": NumberInt("12"),
               "fn": [NumberDecimal("10"), NumberInt("20"), NumberLong("30")] },
     "cd": new ISODate("2017-11-22T15:20:34.326Z") },
    {"props": {"a": NumberInt("5"),
               "fn": ["mix", 7.0, new ISODate("2017-11-22"), true, {x:3.0}] } }
  ]);
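The same typed insert can also be expressed directly in Java with the standard
driver's BSON classes; here is a sketch assuming myCollection is a
MongoCollection<BsonDocument> handle on db.foo (the same handle is used in the
query example that follows):

import org.bson.*;
import org.bson.types.Decimal128;
import java.util.Arrays;

BsonDocument d1 = new BsonDocument("props",
        new BsonDocument("a", new BsonInt32(12))
            .append("fn", new BsonArray(Arrays.asList(
                new BsonDecimal128(Decimal128.parse("10")),
                new BsonInt32(20),
                new BsonInt64(30)))))
    .append("cd", new BsonDateTime(
        java.time.Instant.parse("2017-11-22T15:20:34.326Z").toEpochMilli()));

BsonDocument d2 = new BsonDocument("props",
        new BsonDocument("a", new BsonInt32(5))
            .append("fn", new BsonArray(Arrays.asList(
                new BsonString("mix"),
                new BsonDouble(7.0),
                new BsonDateTime(java.time.Instant.parse("2017-11-22T00:00:00Z").toEpochMilli()),
                new BsonBoolean(true),
                new BsonDocument("x", new BsonDouble(3.0))))));

myCollection.insertMany(Arrays.asList(d1, d2));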
MongoDB supports drivers in more than 10 languages. Here is a sample in Java
that is the equivalent of the JDBC example:
// myCollection is a com.mongodb.client.MongoCollection<BsonDocument> for db.foo
MongoCursor<BsonDocument> cc = myCollection.find()
        .projection(new BsonDocument("props", new BsonInt32(1)))
        .iterator();
while (cc.hasNext()) {
    BsonDocument mm = cc.next();   // BsonDocument implements java.util.Map
    walkMap(mm); // a representative java.util.Map-based recursive object walker
}
_id: org.bson.BsonObjectId: BsonObjectId{value=5a160154fee383d2d2e746da}
props: org.bson.BsonDocument
a: org.bson.BsonInt32: BsonInt32{value=12}
fn: org.bson.BsonArray
0: org.bson.BsonDecimal128: BsonDecimal128{value=10}
1: org.bson.BsonInt32: BsonInt32{value=20}
2: org.bson.BsonInt64: BsonInt64{value=30}
_id: org.bson.BsonObjectId: BsonObjectId{value=5a160154fee383d2d2e8a55}
props: org.bson.BsonDocument
a: org.bson.BsonInt32: BsonInt32{value=5}
fn: org.bson.BsonArray
0: org.bson.BsonString: BsonString{value='mix'}
1: org.bson.BsonDouble: BsonDouble{value=7.0}
2: org.bson.BsonDateTime: BsonDateTime{value=1511308800000}
3: org.bson.BsonBoolean: BsonBoolean{value=true}
4: org.bson.BsonDocument
x: org.bson.BsonDouble: BsonDouble{value=3.0}
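For completeness, a minimal sketch of what such a walker could look like
(illustrative only, not the exact code that produced the output above):

static void walkMap(Map<String, BsonValue> mm) {
    walk(mm, "");
}

private static void walk(Map<String, BsonValue> mm, String indent) {
    for (Map.Entry<String, BsonValue> e : mm.entrySet()) {
        printOne(indent, e.getKey(), e.getValue());
    }
}

private static void printOne(String indent, String label, BsonValue v) {
    if (v instanceof BsonDocument) {
        System.out.println(indent + label + ": " + v.getClass().getName());
        walk((BsonDocument) v, indent + "  ");   // BsonDocument is a Map
    } else if (v instanceof BsonArray) {
        System.out.println(indent + label + ": " + v.getClass().getName());
        int i = 0;
        for (BsonValue item : (BsonArray) v) {
            printOne(indent + "  ", String.valueOf(i++), item);
        }
    } else {
        System.out.println(indent + label + ": " + v.getClass().getName() + ": " + v);
    }
}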
Unlike the objects vended by the Postgres JDBC driver, each and every data
element in the material returned by MongoDB has a specific type, including
useful Map-based Document objects (props itself and item 4 in the
array). There is no parsing, no ambiguity, no misinterpretation, and no wasting
time converting data in and out of the database.
Data extracted in Java, adjusted, saved, then extracted later in another
language like Python enjoys 100% driver-guaranteed "roundtripability" (a small
sketch follows this list):
- Dates
will not be turned into strings or worse, as strings, accidentally changed from
YYYYMMDD to YYYYDDMM.
- Integers will not be silently promoted into floating
point numbers, and floating point numbers will not get fractional components
accidentally truncated to yield an integer.
- Penny-precise decimal
representations critical for financial services applications remain
penny-precise.
- Binary data (another fundamental type in MongoDB not appearing
in JSON) like SHA2 digital signatures, encrypted data, audio and video clips,
etc. moves to and from the client seamlessly without concerns about encoding,
UTF string management, etc.
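A small hedged sketch of the roundtrip point, reusing the myCollection handle
from above (the field name fee is purely illustrative):

import org.bson.BsonDecimal128;
import org.bson.BsonDocument;
import org.bson.types.Decimal128;
import java.math.BigDecimal;

// Write a penny-precise fee...
myCollection.insertOne(new BsonDocument("fee",
        new BsonDecimal128(new Decimal128(new BigDecimal("0.07")))));

// ...and read it back: still an exact Decimal128, never silently converted to
// a binary floating-point double on the way in or out.
BsonDocument back = myCollection.find(
        new BsonDocument("fee",
            new BsonDecimal128(new Decimal128(new BigDecimal("0.07"))))).first();
BigDecimal fee = back.getDecimal128("fee").getValue().bigDecimalValue();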
How This All Impacts Productivity
This is only the tip of the Developer Productivity iceberg that is JSON in
Postgres. At the heart of it, Postgres, SQL, O/JDBC, and relational modeling are
simply not designed to accommodate key/value objects and array objects as
first-class native types. More advanced and useful needs like array manipulation and processing,
sophisticated facet grouping, functional operators (e.g. map, reduce, and filter),
and graph-capable self- and other-collection lookups are either not possible or
extremely difficult to code, test, and robustly deploy in SQL.
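For example, filtering the fn array inside the engine is a one-stage
aggregation in MongoDB; a hedged Java sketch using the same myCollection
handle (the output field bigFn and variable v are illustrative names):

import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.Arrays;
import java.util.List;

// Project a new field bigFn holding only the elements of props.fn greater than 15.
List<Bson> pipeline = Arrays.asList(
    new Document("$project",
        new Document("bigFn",
            new Document("$filter",
                new Document("input", "$props.fn")
                    .append("as", "v")
                    .append("cond", new Document("$gt", Arrays.asList("$$v", 15)))))));

myCollection.aggregate(pipeline).forEach(doc -> System.out.println(doc));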
Let's assume that "Developer Productivity" may be defined as a simple
percentage
time developing revenue features
--------------------------------
total time
where total time includes
debugging, testing, framework development, 3rd party tool installation,
stack version conflicts and resolution, etc., etc.
- Activities such as
designing and writing an auction algorithm and a visual representation
of asset distribution are revenue features. This is where you want to spend all your time.
- Non-revenue activities include installing Tomcat, configuring an ORM, base64-wrapping all potential
byte arrays, resolving three open-source dependencies on different versions
of Apache Commons, and dealing with two security scan issues in code that Maven dragged in that
you didn't know about. You want to spend zero time on these activities.
Productivity is
negatively impacted by non-revenue activities because
time and bug remediation effort is required to accommodate them. It
is not possible to simply dismiss them because doing so means something else must step in to provide the capabilities. For example, you cannot simply set
aside the fact that JSON does not support dates as a native type; you must do
something, somewhere in your stack, to accommodate this so that a real date,
e.g. java.util.Date, is used by your application. Letting the application
itself handle the problem is a non-scalable architecture and dooms
the system to bugs and reconciliation errors as individual applications make
decisions about how to deal with the data.
Modern programming teams typically seek object-oriented capabilities in
their data access layers, usually in the form of Object-Relational Mapper (ORM)
software. It is widely recognized and accepted that ORMs quickly generate negative
productivity impact as data designs evolve past the most basic, and ultimately
become blockers to innovation. On the surface, using Postgres JSON might
appear to attenuate this problem a bit by moving the really rich
structure from an n-way JOIN to a piece of JSON, but in reality you are
trading one problem for another.
In summary, although developers have long enjoyed a broad set of useful scalar
types that can be precisely and easily moved from Postgres into rectangular
ResultSets, this ecosystem does not extend to Documents and rich shapes. Only
MongoDB brings such capabilities to the Document space.