Dynamic Aggregation of Unspecified Fields in MongoDB

2-Oct-2017

MongoDB is document data model database and as such requires no formal schema for each document in a collection. If you are unfamiliar with these terms, just consider a document like a "row" but with rich structure and a collection as a table. Here is an example of a document as it might appear in the MongoDB command line interface:

    {
      "_id" : ObjectId("59d2465474d1e2d00ebebee9"),
      "device":"D1",
      "created": ISODate("2017-10-02T13:59:48.194Z"),
      "participants": [ "A1","A8","A45" ],
      "data": {
        "color":"red",
        "maxAmps":10,
        "maxVolts":240
      }
    }

As you can see, instead of simple scalar columns, the document contains both native arrays (the participants field) and substructure (the data field).

There are a number of design benefits of a schema-free document model but for this article we will focus on two:

The capability to "compartmentalize" or "namespace" data within substructures.
The capability of having different fields in the substructures from document to document, commonly called polymorphism

To be clear and fair up front, the latter requires that the consuming software be nimble enough to deal with such variations but very often the tradeoff is a vastly simpler and far more agile physical database design. Substructure and polymorphism together can often reduce a 10 or 20 (or more) table design in traditional RDBMS to a single collection in MongoDB. But the capabilities of MongoDB 3.4.4 and higher go well beyond this. It is also possible to perform aggregation functions that adapt dynamically to the different fields presented in each document.

Setting Up The Data and The Question

Let's start with a typical set of data that is gathered from devices and exploits a polymorphic design. Apols for the length but for this example it is important to use more than one or two documents.

    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebef4"),
      "device" : "D1",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "color" : "red",  "maxAmps" : 10, "maxVolts" : 240 }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebef5"),
      "device" : "D2",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "color" : "green", "maxAmps" : 10, "maxVolts" : 240 }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebef6"),
      "device" : "D3",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "weight" : 12, "maxAmps" : 10, "maxVolts" : 240 }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebef7"),
      "device" : "D3",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "weight" : 12, "maxAmps" : 10, "maxVolts" : 120 }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebef8"),
      "device" : "D4",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "weight" : 12 }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebef9"),
      "device" : "D5",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "maxAmps" : 10, "temp" : 77 }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebefa"),
      "device" : "D6",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "maxAmps" : 10, "temp" : 90, "fish" : "cod" }
    }

Note that the data is separated into device independent fields like device and ts (the timestamp) and device dependent fields in the data field. Although at first it may appear that distinct field names that appear in the data may come from a well-bounded set, we cannot assume this, and in fact we do not want to. We want to be able to adapt very quickly to new inbound device data, e.g.:

    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebefb"),
      "device" : "D7",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "OS" : "Linux", "language" : "Java" }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebefc"),
      "device" : "D8",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "OS" : "Linux", "language" : "Python" }
    }
    {
      "_id" : ObjectId("59d24a3c74d1e2d00ebebefd"),
      "device" : "D9",
      "ts" : ISODate("2017-10-02T14:16:28.028Z"),
      "data" : { "OS" : "Windows", "language" : "Python" }
    }

We wish to answer the question: What are the unique values for all data?
And yes, the question is deliberately "open scoped." It is not "What are the unique values for OS, language, maxVolts, maxAmps, etc." It is "What are the unique values for any field that MIGHT appear in the data substructure?"

The Traditional Approach

Setting aside the open scope for a moment, you might answer this question as follows:

    db.foo.aggregate([
      {$group: {_id: null,
          "distinctColor":{$addToSet: "$data.color"},
          "distinctMaxAmps":{$addToSet: "$data.maxAmps"},
          "distinctMaxVolts":{$addToSet: "$data.maxVolts"}  }}
    ]);

    {
      "_id" : null,
      "distinctColor" : [ "green", "red" ],
      "distinctMaxAmps" : [ 10 ],
      "distinctMaxVolts" : [ 120, 240 ]
    }

Clearly this approach is on the road to Awfulville. Every new field requires a new expression in the $group function.

The Better Way: $objectToArray and $arrayToObject

MongoDB 3.4.4 introduced the $objectToArray and $arrayToObject operators. Here is a simple example of what they do:

    // This is the data field:
    > db.foo.aggregate([{$project: {_id:0, z: "$data"}} ]);
    { "z" : { "x" : "A", "y" : 1 } }

    // How do we dynamically fetch keys x and y from "data"?  Use $objectToArray!
    > db.foo.aggregate([{$project: {_id:0, z: {$objectToArray: "$data"}}} ]);
    { "z" : [ { "k" : "x", "v" : "A" }, { "k" : "y", "v" : 1 } ] }

    // ... and going in the other direction.  
    > db.foo.aggregate([{$project: {_id:0, z: "$data2"}} ]);
    { "z" : [ [ "crop", "corn" ], [ "amount", 113 ] ] }

    // $arrayToObject supports two input forms:
    // array of arrays:  [ [fname1,val1], [fname2,val2], ... ]
    // array of k-v objects:  [ {k:"fname1",v:val1}, {k:"fname2",v:val2}, etc. ]
    > db.foo.aggregate([{$project: {_id:0, z: {$arrayToObject: "$data2"}}} ]);
    { "z" : { "crop" : "corn", "amount" : 113 } }

In essence, the two functions give you the capability to fluidly interchange and manipulate the LHS (left hand side) and RHS (right hand side) of the key:value construct.

Let's begin to answer our question by first "breaking down" the fields inside the data substructure.

    db.foo.aggregate([
        {$project: {x: {$objectToArray: "$data"}}}
    ]);

    {
      "_id" : ObjectId("59d253b074d1e2d00ebebf2a"),
      "x" : [
        { "k" : "color", "v" : "red" },
        { "k" : "maxAmps", "v" : 10 },
        { "k" : "maxVolts", "v" : 240 }
      ]
    }
    {
      "_id" : ObjectId("59d253b074d1e2d00ebebf2a"),
      "x" : [
        { "k" : "color", "v" : "green" },
        { "k" : "maxAmps", "v" : 10 },
        { "k" : "maxVolts", "v" : 240 }
      ]
    }
    // etc.....

Two things have happened:

We have plucked just the data field from the document using $project and left behind the data-independent fields like ts because they are not important for this exercise.
The key names and values are now all RHS with k and v becoming the LHS for each, respectively. The full power of MongoDB query language can now be brought to bear on the RHS values.

Next, we $unwind the x array in preparation for grouping:

    db.foo.aggregate([
        {$project: {x: {$objectToArray: "$data"}}}
        ,{$unwind: "$x"}
    ]);

    {
        "_id" : ObjectId("59d253b074d1e2d00ebebf2a"),
        "x" : { "k" : "color", "v" : "red" }
    }
    {
        "_id" : ObjectId("59d253b074d1e2d00ebebf2a"),
        "x" : { "k" : "maxAmps", "v" : 10 }
    }
    {
       "_id" : ObjectId("59d253b074d1e2d00ebebf2a"),
        "x" : { "k" : "maxVolts", "v" : 240 }
    }
    {
        "_id" : ObjectId("59d2549f74d1e2d00ebebf35"),
        "x" : { "k" : "color", "v" : "green" }
    }
    // etc.....

Now let's group by k and use the $addToSet operator:

    db.foo.aggregate([
        {$project: {x: {$objectToArray: "$data"}}}
        ,{$unwind: "$x"}
        ,{$group: {_id: "$x.k", distinctVals: {$addToSet: "$x.v"}}}
    ]);

    { "_id" : "language", "distinctVals" : [ "Python", "Java" ] }
    { "_id" : "OS", "distinctVals" : [ "Windows", "Linux" ] }
    { "_id" : "temp", "distinctVals" : [ 90, 77 ] }
    { "_id" : "weight", "distinctVals" : [ 12 ] }
    { "_id" : "maxVolts", "distinctVals" : [ 120, 240 ] }
    { "_id" : "fish", "distinctVals" : [ "cod" ] }
    { "_id" : "maxAmps", "distinctVals" : [ 10 ] }
    { "_id" : "color", "distinctVals" : [ "green", "red" ] }

This is starting to look very good. In fact, some might say we're done and might argue that this is the most flexible and dynamic form of the answer. But let's keep going to explore all the nuances of LHS/RHS manipulation.

We can use either form of $arrayToObject to turn the RHS of each doc into an LHS:

// For k-v, must have two fields k and v only (no _id or anything else)
,{$project: {_id:0, k: "$_id", v:"$distinctVals"}}  //

// Now take the resulting {k: "fname", v: value}, wrap it in an array
// (note the [] wrapping $$CURRENT) and pass to $arrayToObject:
,{$project: {xx2: {$arrayToObject: [ [ "$$CURRENT"] ] }}}


// For array of array, perhaps a little simpler because not using $$CURRENT:
,{$project: { _id:0, xx: [ [ "$_id", "$distinctVals" ] ] }}
,{$project: {xx2: {$arrayToObject: "$xx"}}}

Let's go with array of arrays and for extra flair, we'll add the prefix distinct_ to each field:

    db.foo.aggregate([
    {$project: {x: {$objectToArray: "$data"}}}
    ,{$unwind: "$x"}
    ,{$group: {_id: "$x.k", distinctVals: {$addToSet: "$x.v"}}}
    ,{$project: { _id:0, xx: [ [ {$concat: [ "distinct_", "$_id"]}, "$distinctVals" ] ] }}
    ]);
    
    { "xx" : [ [ "distinct_language", [ "Python", "Java" ] ] ] }
    { "xx" : [ [ "distinct_OS", [ "Windows", "Linux" ] ] ] }
    { "xx" : [ [ "distinct_temp", [ 90, 77 ] ] ] }
    { "xx" : [ [ "distinct_weight", [ 12 ] ] ] }
    { "xx" : [ [ "distinct_maxVolts", [ 120, 240 ] ] ] }
    { "xx" : [ [ "distinct_fish", [ "cod" ] ] ] }
    { "xx" : [ [ "distinct_maxAmps", [ 10 ] ] ] }
    { "xx" : [ [ "distinct_color", [ "green", "red" ] ] ] }

Finally, convert back to an object. Also, we will use the $replaceRoot operator to eliminate the xx field altogether and "lift" the subfields into the parent doc:

    // Note that typically we'd merge the last $project into $replaceRoot for
    // efficiency and a shorter pipeline, e.g.
    //     ,{$replaceRoot: { newRoot: {$arrayToObject: "$xx"}}}
    // but it's expanded out below so you can see what's going on.

    db.foo.aggregate([
      {$project: {x: {$objectToArray: "$data"}}}
      ,{$unwind: "$x"}
      ,{$group: {_id: "$x.k", distinctVals: {$addToSet: "$x.v"}}}
      ,{$project: { _id:0, xx: [ [ {$concat: [ "distinct_", "$_id"]}, "$distinctVals" ] ] }}
      ,{$project: {xx2: {$arrayToObject: "$xx"}}}
      ,{$replaceRoot: { newRoot: "$xx2"}}
			]);

    { "distinct_language" : [ "Python", "Java" ] }
    { "distinct_OS" : [ "Windows", "Linux" ] }
    { "distinct_temp" : [ 90, 77 ] }
    { "distinct_weight" : [ 12 ] }
    { "distinct_maxVolts" : [ 120, 240 ] }
    { "distinct_fish" : [ "cod" ] }
    { "distinct_maxAmps" : [ 10 ] }
    { "distinct_color" : [ "green", "red" ] }

Complete!

If you understood how $$CURRENT was used in the example above, then you should also realize that you don't necessarily need to start with a substructure (although it's always a really good idea to separate truly dynamic fields into a substructure). If your data looked like this:

    { 
      "_id" : ObjectId("59d2465474d1e2d00ebebee9"),
      "name":"buzz",
      "city":"NYC",
      "favoriteCuisine":"Italian"
    }
    { 
      "_id" : ObjectId("59d2465474d1e2d00ebebeea"),
      "name":"buzz",
      "city":"Miami",
      "favoriteCuisine":"Cuban"
    }

Then you can start the ball rolling using $$CURRENT -- but be careful: In most circumstances, you don't want to capture _id as one of the input fields. It is ALWAYS present on docs fetched from the database (so no dynamic behavior) and it is unique (so not very exciting from a grouping perspective). Best to filter it out:

    db.foo.aggregate([
      {$project: {x: {$objectToArray: "$$CURRENT"}}}
      ,{$unwind: "$x"}
      ,{$match: {"x.k": {$ne: "_id"}}}
      // etc....

    // Note!  Holding back _id will NOT work:
    //     {$project: {_id:0, x: {$objectToArray: "$$CURRENT"}}}
    // because $objectToArray will have already operated upon $$CURRENT and
    // {k:"_id", v:ObjectId("...")} is now "buried" in the output array.

Like this? Dislike this? Let me know