Postgres JSON, Developer Productivity, and The MongoDB Advantage
Updated 11-Nov-2021: Snowflake has the same fundamental challenge in dealing
with rich shapes as Postgres.
Original post 15-Dec-2017
Adding support for a JSON data type does not make Postgres, Snowflake, or Oracle a document
store. While Postgres allows more expressive querying and manipulation of JSON
data than almost any other relational database, it suffers from three
constraints:
- SQL syntax and, indeed, the relational model as a whole are
designed to work with single, scalar values which carry the same type from row
to row, not rich shapes like JSON that can contain substructures and arrays and
different elements from row to row. And the extensions to SQL utilized by
Postgres to manipulate JSON are proprietary, unknown to most SQL developers,
and not supported or even recognized by most 3rd party SQL tools.
- Nearly all traditional Postgres interaction with applications -- written in almost every
language -- works via O/JDBC drivers, and these drivers cannot properly and
precisely convert the JSON into a useful native type (such as a Map or List in
Java or a Dictionary or List in Python).
- Perhaps
most importantly, pure JSON itself has a smaller set of basic types than SQL or
BSON. In pure JSON one cannot distinguish between a 32 bit integer, a 64 bit
integer, a floating point number, or an arbitrary precision decimal. It also
has no native dates or binary data. At best, dates are carried as ISO 8601
strings; at worst, homegrown, brittle, and unsupported hacks thereof.
Let's explore just the fundamental issues involved in how Postgres and
Snowflake handle JSON and why MongoDB does not experience them.
Scalars Are Tough Enough...
Consider the following piece of JSON that we might
insert into a Postgres table foo with a json-typed column named content:
{"props": {"a": 12, "fn": [10,20,30]}, "cd":"2017-11-22T15:20:34.326Z"}
Diving into the array with the Postgres-proprietary arrow operators initially
looks promising -- although Postgres JSON arrays are zero-based, not one-based
like O/JDBC, which is a source of confusion from the start for seasoned SQL
developers:
select content->'props'->'fn'->0 from foo;
?column?
----------
10
We will make this point only once:
-
The expression above is not standard SQL.
Every RDBMS handles access and operators on JSON differently; for example,
Snowflake uses dot notation e.g. select content.props.fn. Thus, the
investments you make to deal with JSON in one vendor RDBMS are not portable
to any other.
A different usage story emerges when you try to actually use the output. Suppose
we wish to divide what we believe is an integer by 2; we expect 5 but instead
we are given 10 again:
select content->'props'->'fn'->0 / 2 from foo;
?column?
----------
10
There are two factors in play here:
- The expression 0/2 has higher precedence
than the single arrow operator, so the query is still evaluating ->0.
This is very easy to miss because
no errors or warnings are produced: 0/2 is silently evaluated (integer
division yields 0) and used as the array index, which is a source of bugs and
tedious investigation.
- The arrow operator in Postgres always returns "json" as a type,
so that a chain of arrows (a "path") will work. To actually get the data
we want, we must use
the double arrow operator (->>) to turn the terminal component of the path from
"json" into a "text" type, followed by an explicit cast to the type we seek,
like an integer. Note that the double arrow operator converts only
to text which is why we need the cast:
select (content->'props'->'fn'->>0)::int / 2 from foo;
?column?
----------
5
Of course, the problem is the developer had to know to cast to an integer.
What if the intended type was actually a float or a decimal? There is no way to
tell exactly what numeric type is being carried in the array. The same problem
applies to the cd field. It "looks" like a date to the eye but there is
nothing to precisely tell you it is a date; the developer must realize this and
manually cast it to ::timestamp to perform the desired datetime operations upon
it. And none of this special JSON casting is well supported by any tools,
commercial or otherwise. Developers accustomed to having the schema provide a
very large set of types (smallint, bigint, float, decimal, date, datetime, etc.) and to using the typed field methods of O/JDBC drivers suddenly find
themselves without these essential tools. This is the first hit on developer
productivity in Postgres: the difficulty involved in managing basic types in
pure JSON.
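To make that concrete, here is a hedged sketch of the casting a developer must
remember to write by hand, using the same execSQL() convenience wrapper that
appears in the Java example further below:

// Nothing in the json type system says fn[0] is an int or that cd is a
// datetime; the developer must know this out of band and cast accordingly.
ResultSet rs = execSQL(
    "select (content->'props'->'fn'->>0)::int as n, " +
    "       (content->>'cd')::timestamp as cd from foo");
while (rs.next()) {
    int n = rs.getInt("n");                        // works only because WE chose ::int
    java.sql.Timestamp cd = rs.getTimestamp("cd"); // works only because WE chose ::timestamp
}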
...but Rich Shapes Are Even Harder
One of the reasons to use JSON in the first
place is to carry dynamic fields and work with a set of them at once. Suppose
we wish to extract the props field of content as a complete rich shape:
select content->'props' from foo;
?column?
---------------------------------
{"a": 12, "fn": [10,20,30]}
This "looks" good on the command line -- but all we are seeing is a toString()
representation of the substructure in the props field; the returned type is
just JSON, which is a string, complete with whitespace and commas.
The challenge is even more obvious in the O/JDBC
drivers. Below is an example in Java:
// execSQL() is a convenience wrapper that returns a java.sql.ResultSet;
// PGobject is org.postgresql.util.PGobject.
ResultSet rs = execSQL("select content->'props' from foo");
while (rs.next()) {
    Object obj = rs.getObject(1);
    System.out.println(obj.getClass() + ": " + obj.toString());
    if (obj instanceof PGobject) {
        PGobject pgo = (PGobject) obj;
        System.out.println("  PG type: " + pgo.getType());
        System.out.println("  PG val: " + pgo.getValue());
        System.out.println("  PG val class: " + pgo.getValue().getClass());
    }
}
Upon execution we see this:
class org.postgresql.util.PGobject: {"a":12, "fn": [10, 20, 30]}
PG type: json
PG val: {"a": 12, "fn": [10, 20, 30]}
PG val class: class java.lang.String
First, the object returned isn't a standard java.sql.* type or a java.lang.*
type; it is a custom Postgres type (org.postgresql.util.PGobject).
But much more importantly, the getValue() method returns exactly what JSON is: a
string. The getType() method affirms this ("json") but that is no help. No
parsing is performed; we have to manually parse the JSON in our application
(a representative sketch follows the list below). This
introduces a whole set of issues around parsing, including:
- Which JSON parser
should be used and what is the process around keeping it current in the
software inventory (since parsing JSON from the database is certainly a
critical capability that must be well-managed)?
- If a single vendor cannot
provide a JSON parser for a set of languages, then what is the strategy to
ensure precision and consistency of JSON data across languages?
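For instance, if Jackson happened to be the parser chosen (an assumption; any
JSON library has the same limitations), the application-side code might look
like this:

import com.fasterxml.jackson.databind.ObjectMapper;

// pgo.getValue() is just a String; the application must parse it itself.
ObjectMapper mapper = new ObjectMapper();
java.util.Map<String, Object> props = mapper.readValue(pgo.getValue(), Map.class);

Object fn0 = ((java.util.List<?>) props.get("fn")).get(0);
// fn0 is whatever the parser guessed: Integer here, perhaps Long or Double
// for other rows -- and a date would still be just a String.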
And in the end, this still does not answer the fundamental type fidelity
problem: fn is an array of what, exactly? Even if we make the terrible assumption
that 10.0 is a float and 10 is a 64-bit int and eliminate 32-bit ints
altogether, we are still exposed to the problems of floating-point numbers in
financial calculations. What if fn were a vector of penny-precise fees? Even if
somehow all these conversions could be catalogued, how would you ensure that
every single data access point honors them? Postgres has no capability of
performing JSON fieldname-type validation; it can only validate raw JSON
syntax. And storing values as strings doesn't help because what might be gained
from a consistent fromString() framework is offset by removing the ability
to easily perform numeric comparisons and functions like grouping in the
database. In 2017 no one should be hobbling a data design by turning numbers
and dates into strings.
The problem remains if we try to use the double arrow operator:
ResultSet rs = execSQL("select content->>'props' from foo");
...
class java.lang.String: {"a":12, "fn": [10, 20, 30]}
The returned object is once again, ultimately, just a string we must parse
ourselves. Postgres JSON delivers a one-two negative productivity punch: lower
scalar type fidelity (e.g. no dates) and, ironically, no capability to turn
the array and object types into something useful.
...and Polymorphism is Effectively Impossible
Assume fn is not just an array of numbers but may also contain
strings and other types. Let's add another row to Postgres; note that because
the column accepts only pure JSON, the date must travel as a plain string and
every key must be quoted:
{"props": {"a": 5, "fn": ["mix", 7.0, "2017-11-22", true, {"x": 3.0}]}}
And reissue our query to get the zeroth element from fn:
select (content->'props'->'fn'->>0)::int from foo;
ERROR: invalid input syntax for integer: "mix"
The error is a consequence of basic SQL and relational modeling: columns have
defined types that must be honored row after row. Narrowing Postgres JSON
queries to individual scalars and then force-casting them to one type dooms the
polymorphic data design principle that is so effective in allowing MongoDB to
use a single collection to represent many (sometimes dozens) of individual
tables in Postgres. And in order to do anything useful in the Postgres engine
like type-correct filtering, we must get to individual scalars. Most of the
JSON processing functions in Postgres have two important caveats:
- They only
operate on the topmost fields in the JSON record. This limitation is largely
due to the inability of SQL to deal with the results of recursion.
- Any
functions that extract single scalars return them as type text -- which puts us
right back at the beginning: not knowing the basic type! A brief illustration
appears below.
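As a quick illustration of both caveats (json_each_text() is a standard
Postgres function that explodes only the top level of a document, and every
value it returns is text):

ResultSet rs = execSQL(
    "select key, value, pg_typeof(value) from foo, json_each_text(content)");
while (rs.next()) {
    // For the first row, the props substructure is NOT exploded and everything
    // comes back as text:
    //   props   {"a": 12, "fn": [10,20,30]}   text
    //   cd      2017-11-22T15:20:34.326Z      text
    System.out.println(rs.getString(1) + " " + rs.getString(2) + " " + rs.getString(3));
}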
In summary, using JSON columns and JSON operators in Postgres is scarcely
better than storing JSON in a BLOB.
MongoDB: JSON done right through BSON
MongoDB was designed from the start to
work with rich shapes and a variety of important fundamental types, including
dates and decimal types. Many examples in MongoDB use Javascript because it is
a great way to easily expose both the JSON-like structure of a document as well
as the broad set of native scalar types. Let's insert the same pieces of data
again from the MongoDB CLI using Javascript, only this time using more types at
our disposal:
> db.foo.insert([
    {"props": {"a": NumberInt("12"),
               "fn": [NumberDecimal("10"), NumberInt("20"), NumberLong("30")] },
     "cd": new ISODate("2017-11-22T15:20:34.326Z") },
    {"props": {"a": NumberInt("5"),
               "fn": ["mix", 7.0, new ISODate("2017-11-22"), true, {x:3.0}] } }
  ]);
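The same typed insert can also be expressed directly in Java with the standard
driver's BSON classes; here is a sketch assuming myCollection is a
MongoCollection<BsonDocument> handle on db.foo (the same handle is used in the
query example that follows):

import org.bson.*;
import org.bson.types.Decimal128;
import java.util.Arrays;

BsonDocument d1 = new BsonDocument("props",
        new BsonDocument("a", new BsonInt32(12))
            .append("fn", new BsonArray(Arrays.asList(
                new BsonDecimal128(Decimal128.parse("10")),
                new BsonInt32(20),
                new BsonInt64(30)))))
    .append("cd", new BsonDateTime(
        java.time.Instant.parse("2017-11-22T15:20:34.326Z").toEpochMilli()));

BsonDocument d2 = new BsonDocument("props",
        new BsonDocument("a", new BsonInt32(5))
            .append("fn", new BsonArray(Arrays.asList(
                new BsonString("mix"),
                new BsonDouble(7.0),
                new BsonDateTime(java.time.Instant.parse("2017-11-22T00:00:00Z").toEpochMilli()),
                new BsonBoolean(true),
                new BsonDocument("x", new BsonDouble(3.0))))));

myCollection.insertMany(Arrays.asList(d1, d2));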
MongoDB supports drivers in more than 10 languages. Here is a sample in Java
that is the equivalent of the JDBC example:
// myCollection is a com.mongodb.client.MongoCollection<BsonDocument> for db.foo
MongoCursor<BsonDocument> cc = myCollection.find()
        .projection(new BsonDocument("props", new BsonInt32(1)))
        .iterator();
while (cc.hasNext()) {
    BsonDocument mm = cc.next();   // BsonDocument implements java.util.Map
    walkMap(mm); // a representative java.util.Map-based recursive object walker
}
_id: org.bson.BsonObjectId: BsonObjectId{value=5a160154fee383d2d2e746da}
props: org.bson.BsonDocument
a: org.bson.BsonInt32: BsonInt32{value=12}
fn: org.bson.BsonArray
0: org.bson.BsonDecimal128: BsonDecimal128{value=10}
1: org.bson.BsonInt32: BsonInt32{value=20}
2: org.bson.BsonInt64: BsonInt64{value=30}
_id: org.bson.BsonObjectId: BsonObjectId{value=5a160154fee383d2d2e8a55}
props: org.bson.BsonDocument
a: org.bson.BsonInt32: BsonInt32{value=5}
fn: org.bson.BsonArray
0: org.bson.BsonString: BsonString{value='mix'}
1: org.bson.BsonDouble: BsonDouble{value=7.0}
2: org.bson.BsonDateTime: BsonDateTime{value=1511308800000}
3: org.bson.BsonBoolean: BsonBoolean{value=true}
4: org.bson.BsonDocument
x: org.bson.BsonDouble: BsonDouble{value=3.0}
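For completeness, a minimal sketch of what such a walker could look like
(illustrative only, not the exact code that produced the output above):

static void walkMap(Map<String, BsonValue> mm) {
    walk(mm, "");
}

private static void walk(Map<String, BsonValue> mm, String indent) {
    for (Map.Entry<String, BsonValue> e : mm.entrySet()) {
        printOne(indent, e.getKey(), e.getValue());
    }
}

private static void printOne(String indent, String label, BsonValue v) {
    if (v instanceof BsonDocument) {
        System.out.println(indent + label + ": " + v.getClass().getName());
        walk((BsonDocument) v, indent + "  ");   // BsonDocument is a Map
    } else if (v instanceof BsonArray) {
        System.out.println(indent + label + ": " + v.getClass().getName());
        int i = 0;
        for (BsonValue item : (BsonArray) v) {
            printOne(indent + "  ", String.valueOf(i++), item);
        }
    } else {
        System.out.println(indent + label + ": " + v.getClass().getName() + ": " + v);
    }
}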
Unlike the objects vended by the Postgres JDBC driver, each and every data
element in the material returned by MongoDB has a specific type, including
useful Map-based Document objects (props itself and item 4 in the
array). There is no parsing, no ambiguity, no misinterpretation, and no wasting
time converting data in and out of the database.
Data extracted in Java, adjusted, saved, then extracted later in another
language like Python enjoys 100% driver-guaranteed "roundtripability" (a small
sketch follows this list):
- Dates
will not be turned into strings or worse, as strings, accidentally changed from
YYYYMMDD to YYYYDDMM.
- Integers will not be silently promoted into floating
point numbers, and floating point numbers will not get fractional components
accidentally truncated to yield an integer.
- Penny-precise decimal
representations critical for financial services applications remain
penny-precise.
- Binary data (another fundamental type in MongoDB not appearing
in JSON) like SHA2 digital signatures, encrypted data, audio and video clips,
etc. moves to and from the client seamlessly without concerns about encoding,
UTF string management, etc.
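A small hedged sketch of the roundtrip point, reusing the myCollection handle
from above (the field name fee is purely illustrative):

import org.bson.BsonDecimal128;
import org.bson.BsonDocument;
import org.bson.types.Decimal128;
import java.math.BigDecimal;

// Write a penny-precise fee...
myCollection.insertOne(new BsonDocument("fee",
        new BsonDecimal128(new Decimal128(new BigDecimal("0.07")))));

// ...and read it back: still an exact Decimal128, never silently converted to
// a binary floating-point double on the way in or out.
BsonDocument back = myCollection.find(
        new BsonDocument("fee",
            new BsonDecimal128(new Decimal128(new BigDecimal("0.07"))))).first();
BigDecimal fee = back.getDecimal128("fee").getValue().bigDecimalValue();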
How This All Impacts Productivity
This is only the tip of the Developer Productivity iceberg that is JSON in
Postgres. At the heart of it, Postgres, SQL, O/JDBC, and relational modeling are
simply not designed to accommodate key/value objects and array objects as
first-class native types. More advanced and useful needs like array manipulation and processing,
sophisticated facet grouping, functional operators (e.g. map, reduce, and filter),
and graph-capable self- and other-collection lookups are either not possible or
extremely difficult to code, test, and robustly deploy in SQL.
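For example, filtering the fn array inside the engine is a one-stage
aggregation in MongoDB; a hedged Java sketch using the same myCollection
handle (the output field bigFn and variable v are illustrative names):

import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.Arrays;
import java.util.List;

// Project a new field bigFn holding only the elements of props.fn greater than 15.
List<Bson> pipeline = Arrays.asList(
    new Document("$project",
        new Document("bigFn",
            new Document("$filter",
                new Document("input", "$props.fn")
                    .append("as", "v")
                    .append("cond", new Document("$gt", Arrays.asList("$$v", 15)))))));

myCollection.aggregate(pipeline).forEach(doc -> System.out.println(doc));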
Let's assume that "Developer Productivity" may be defined as a simple
percentage
time developing revenue features
--------------------------------
total time
where total time includes
debugging, testing, framework development, 3rd party tool installation,
stack version conflicts and resolution, etc., etc.
- Activities such as
designing and writing an auction algorithm and a visual representation
of asset distribution are revenue features. This is where you want to spend all your time.
- Non-revenue activities include installing Tomcat, configuring an ORM, base64-wrapping all potential
byte arrays, resolving three open-source dependencies on different versions
of Apache Commons, and dealing with two security scan issues in code that Maven dragged in that
you didn't know about. You want to spend zero time on these activities.
Productivity is
negatively impacted by non-revenue activities because
time and bug remediation effort is required to accommodate them. It
is not possible to simply dismiss them because doing so means something else must step in to provide the capabilities. For example, you cannot simply set
aside the fact that JSON does not support dates as a native type; you must do
something, somewhere in your stack, to accommodate this so that a real date,
e.g. java.util.Date, is used by your application. Letting the application
itself handle the problem is a non-scalable architecture and dooms
the system to bugs and reconciliation errors as individual applications make
decisions about how to deal with the data.
Modern programming teams typically seek object-oriented capabilities in
their data access layers, usually in the form of Object-Relational Mapper (ORM)
software. It is widely recognized and accepted that ORMs quickly generate negative
productivity impact as data designs evolve past the most basic, and ultimately
become blockers to innovation. On the surface, using Postgres JSON might
appear to attenuate this problem a bit by moving the really rich
structure from an n-way JOIN to a piece of JSON, but in reality you are
trading one problem for another.
In summary, although developers have long enjoyed a broad set of useful scalar
types that can be precisely and easily moved from Postgres into rectangular
ResultSets, this ecosystem does not extend to Documents and rich shapes. Only
MongoDB brings such capabilities to the Document space.