MongoDB: It's More Than Just JSON
Updated 1-Oct-2015. At the time of original writing, support
for an arbitrary-precision decimal type was only on the roadmap.
With the release of MongoDB 3.4, decimal128 (analogous to Java's
BigDecimal) was formally introduced into the type system of MongoDB
with support in all drivers and utilities like mongoimport.
JSON, short for JavaScript Object Notation, is a very good specification for
creating so-called rich shapes of data in a plain ASCII format. A rich
shape can be substantially more complex than traditional tables of data which
are "flat" rectangles. The rich shape can contain substructures a.k.a. nested
structures. It can contain lists of simple data like a list of favorite
colors or lists of complex substructures. It is as detailed as necessary to
well-represent the data it is carrying.
JSON originally started as a data specification localized to Javascript but has
evolved into a standard for encoding rich data shapes and both generators and
parsers are available in many languages. JSON has a number of advantages over
CSV and XML, the other most common data markups:
- JSON can carry rich structure natively and CSV cannot. This is arguably the biggest advantage of JSON over CSV.
- JSON is self-descriptive: every field carries a name and value. This avoids
"comma hell" that is a common problem in CSV.
- It is easy to add fields of any complexity to JSON at any time. This avoids
the brittleness in parsing CSV that leads to the "new data means new feed" problem.
- JSON has native syntactic support for arrays. This is a major advantage over
XML, where conventions must be followed to indicate to the consumer that enclosed
tags represent entries in an array.
- JSON has no attributes or "out-of-nesting" character content that can
complicate XML parsing.
- Conceptually, JSON was designed to carry structured data, whereas XML is
a repurposing of a markup language intended to be in-line with unstructured text.
These features make JSON an excellent choice for big file feeds, messaging,
and other cross-language, cross-platform, easily introspectible, easily
interoperable environments.
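To make the first point above concrete, here is a small Python sketch contrasting the two formats. The field names and values are purely illustrative, not from any real feed:

```python
import csv
import io
import json

# A rich shape: a nested substructure plus a list, neither of which
# fits a flat CSV rectangle without inventing conventions.
person = {
    "owner": {"fname": "buzz", "lname": "moschetti"},
    "favoriteColors": ["red", "green"],
}

# JSON carries the nesting natively and round-trips losslessly:
doc = json.dumps(person)
assert json.loads(doc) == person

# CSV forces a lossy, convention-laden flattening: dotted column names
# for the substructure and a delimiter-within-a-delimiter for the list.
buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=["owner.fname", "owner.lname", "favoriteColors"])
w.writeheader()
w.writerow({
    "owner.fname": person["owner"]["fname"],
    "owner.lname": person["owner"]["lname"],
    "favoriteColors": ";".join(person["favoriteColors"]),
})
# The consumer must now be told about the dot and semicolon conventions
# to reconstruct the original shape; the JSON consumer needs no such briefing.
```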
BUT: Once inside a program written in a particular
language, the last thing you want to be manipulating is JSON.
- JSON has no native support for Dates.
- JSON cannot identify a number as floating point (beyond the m.n surface
form of the literal) or as a 32-bit or 64-bit integer.
- JSON is a String.
The last point is probably the most important. The ability to carry rich shape
in a convenient, ASCII-friendly externalized form in a feed or a message is
great -- but inside a program, it is far less useful.
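These limitations are easy to demonstrate with a stock JSON parser. A small Python sketch, using a hypothetical wire payload:

```python
import json

# Hypothetical wire data illustrating the limitations above.
wire = '{"bigInt": 742675423656352, "ratio": 3.0, "hired": "2015-02-03T00:00:00Z"}'
doc = json.loads(wire)

# Numbers: the parser can only infer float-ness from the m.n surface
# form; there is no way to declare 32-bit vs. 64-bit integer width.
assert isinstance(doc["bigInt"], int)
assert isinstance(doc["ratio"], float)

# Dates: they arrive as plain strings; converting them to real date
# objects is left entirely to the consumer.
assert isinstance(doc["hired"], str)
```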
The goal is
to capture JSON at the edge of your program and parse it from the
big String that it is into an actual assembly of hashmaps, lists, and scalar
objects. Similarly, in creating data, you want your program to work with maps,
lists, and scalars as long as possible; only at the last moment, just prior
to writing bytes on a socket, should it be converted into a single big String
in JSON format, possibly even with whitespace for readability.
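In Python, for example, this edge-parsing approach looks like the following sketch. The field names are illustrative:

```python
import json

# Hypothetical inbound message, arriving as one big String.
wire = '{"owner": {"fname": "buzz"}, "tags": ["mongodb", "python"]}'

# Parse once, at the edge: the String becomes real dicts and lists.
data = json.loads(wire)

# In the middle of the program, work only with native structures:
data["owner"]["lname"] = "moschetti"
data["tags"].append("pymongo")

# Stringify once, just before the bytes hit the socket; json.dumps
# handles quoting and escaping, and indent adds readable whitespace.
out = json.dumps(data, indent=2)
```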
This design approach is not easy out of the box. You need a parser in each language
that you will be using. Encoders have to be careful about quoting strings and
escaping special characters. And there is
the unanswered issue of unsupported types. But we'll get to that in a moment.
MongoDB is actually a Rich Shape database
There are many, many references to MongoDB as a "JSON document database."
There are two twists here:
- With the exception of the mongoexport and mongoimport utilities, a developer
rarely sees any JSON in standard stringified form.
- Internally, MongoDB stores content in BSON, which among other things
adds Date and other very important basic types to the basic JSON type set.
Truly, the developer rarely uses BSON directly either.
The API to MongoDB -- insert, find, update, etc. -- involves rich shapes in the
idiom appropriate for the driver language, not JSON as a String:
// Suppose we wish to save this slightly complex shape:
// {
//   owner: { fname: "buzz", lname: "moschetti" },
//   operator: { fname: "steve", lname: "roberts" },
//   createDate: now
// }
// In Java (legacy 2.x driver) this would be:
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import java.util.Date;

DBObject m1 = new BasicDBObject();
{
    DBObject m2 = new BasicDBObject();
    m2.put("fname", "buzz");
    m2.put("lname", "moschetti");
    m1.put("owner", m2);      // ah HA!  Map goes into another map!
}
{
    DBObject m2 = new BasicDBObject();
    m2.put("fname", "steve");
    m2.put("lname", "roberts");
    m1.put("operator", m2);   // ah HA!  Map goes into another map!
}
m1.put("createDate", new Date());  // ah HA!  Actual real Date object!

coll.insert(m1);  // insert the Map into collection coll
Note that we do not have to encode or "stringify" the content going into MongoDB. Although
we can describe & comment (like above), log, debug, or otherwise emit the
data in m1 as JSON, in practice the actual method calls involve
full-featured objects like BasicDBObject and the interfaces
Map and List. The other important thing to note is that
MongoDB APIs support several more native types than strict JSON including
the critically important Date and binary objects.
Retrieving the data involves the same set of objects:
DBObject bdo = new BasicDBObject();

// Very simple lookup: owner.fname = "buzz"
bdo.put("owner.fname", "buzz");

DBCursor c = coll.find(bdo);
while (c.hasNext()) {
    DBObject result = c.next();
    Date d = (Date) result.get("createDate");  // actual Date
}
Again, JSON as a string does not appear in the API.
Containers (Maps and Lists) are presented directly as containers, and
scalar types -- especially floating-point numbers and Dates -- are presented in the
objects appropriate for the language, without parsing.
The rich shape paradigm in MongoDB makes the APIs for scripting languages
even easier and more intuitive to use because popular languages like Python,
Perl, and JavaScript natively support construction of rich shapes:
import datetime

for n in range(0, 5):
    content = {"author": "Mike",
               "idx": n,
               "text": "My first blog post!",
               "tags": ["mongodb", "python", "pymongo"],
               "date": datetime.datetime.utcnow()
               }
    coll.insert(content)

for c in coll.find({"$or": [{"idx": {"$gt": 2}}, {"idx": 0}]}):
    print([s.upper() for s in sorted(c.keys())])
The developer is freed from the tedium of constructing a String with
whitespace, special grammar, and escaped quotes, and can use native
structures like dictionaries and arrays to construct the material for the APIs.
So why does MongoDB say it is a JSON document database?
JSON is a very good way to "conceptualize" and visualize the rich shapes.
Calling MongoDB a JSON database helps people to quickly recognize that it
stores and queries shapes much richer than a flat rectangle. Keeping the
JSON paradigm in mind is a good way to approach modeling data in MongoDB.
What about representing unsupported types in JSON?
The best way to do this involves conventions that will not break standard JSON
parsers. MongoDB has taken the approach of using type metadata, and
this is probably the most portable and extensible way to explicitly identify
type. In this scheme, the value is replaced by a substructure containing
a stringified form of the value, married to a type name with a leading dollar
sign as sugar. The substructure is not strictly a single name:string pair,
as evidenced by the $binary type:
{
    "hireDate": {"$date": "2015-02-03T00:00:00.000Z"},  // ISO 8601
    "bigInt": {"$numberLong": "742675423656352"},
    "image": {"$binary": "aGVsbG8=", "$type": "00"},    // base64 encoded byte[]
    "notDate": {"date": "and figs"}                     // No leading $ means not type metadata
}
The stringified forms prevent a standard JSON parser from doing anything
special with the material; recognizing and converting the types is a
post-processing step after the parse. In Java, post-processing the example
above would yield:
java.util.Map
    hireDate:  java.util.Date
    bigInt:    java.lang.Long
    image:     byte[]
    notDate:   java.util.Map
        date:  java.lang.String
All the MongoDB drivers provide utilities that recognize type metadata and will
post-process the material into the desired target types. It is relatively
straightforward to roll your own as well, simply by "walking" the Map or
dictionary structure returned by any JSON parser and looking for the special
dollar-sign sugar. No additional parsing or quote decoding is necessary;
the JSON parser has already done that.
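Such a roll-your-own walker might look like the following Python sketch. The function name is invented, and the base64 payload is an illustrative string (the bytes of "hello"), not real image data:

```python
import base64
import datetime
import json

def decode_type_metadata(node):
    """Recursively walk parsed JSON, converting the dollar-sign type
    metadata ($date, $numberLong, $binary) into native Python types.
    Substructures without $-prefixed names are left as plain dicts."""
    if isinstance(node, dict):
        if "$date" in node:
            return datetime.datetime.strptime(node["$date"], "%Y-%m-%dT%H:%M:%S.%fZ")
        if "$numberLong" in node:
            return int(node["$numberLong"])
        if "$binary" in node:
            return base64.b64decode(node["$binary"])
        return {k: decode_type_metadata(v) for k, v in node.items()}
    if isinstance(node, list):
        return [decode_type_metadata(v) for v in node]
    return node

raw = json.loads("""{
  "hireDate": {"$date": "2015-02-03T00:00:00.000Z"},
  "bigInt": {"$numberLong": "742675423656352"},
  "image": {"$binary": "aGVsbG8=", "$type": "00"},
  "notDate": {"date": "and figs"}
}""")
doc = decode_type_metadata(raw)
# doc["hireDate"] is now a real datetime, doc["bigInt"] a real int,
# doc["image"] real bytes, and doc["notDate"] an untouched dict.
```

Note that no quote decoding happens here; the JSON parser already did it, and the walker only pattern-matches on the dollar-sign names.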
The ability to move dates and binary content in a self-descriptive and
lossless way is a tremendous step up in improving the robustness and day-2
modification of inter-system communications, although there is typically
some initial discomfort with the expanded size of files using this scheme.
Like this? Dislike this? Let me know
Site copyright © 2013-2024 Buzz Moschetti. All rights reserved