5-Sep-2018
I have come across several unanswered questions on Stack Overflow about how to compute moving averages in MongoDB. In the process of answering them, I discovered some irritating consequences of using very large boundary arrays with the $bucket stage. I decided to capture the alternative, successful approach in more detail here. Consider a collection of daily observations:
{"n":1, "val": 4.23, "date": new ISODate("2017-12-31") }
{"n":2, "val": 4.23, "date": new ISODate("2018-01-01") }
{"n":3, "val": 8.91, "date": new ISODate("2018-01-02") }
{"n":4, "val": 0.03, "date": new ISODate("2018-01-03") }
... assume n increases by 1 and date by one day until March 1:
{"n":61, "val": 6.44, "date": new ISODate("2018-03-01") }
// Make a bucket array of dates:
var arr = [];
["2018-01-01","2018-01-15","2018-02-15","2018-02-28"].forEach(function(d) {
    arr.push(new ISODate(d));
});
db.collection.aggregate([
    {$bucket: {
        groupBy: "$date",
        boundaries: arr,
        default: "Other",
        output: {
            "count": {$sum: 1},
            "avg": {$avg: "$val"}
        }
    }}
]);
{ "_id" : "Other", "count" : 3, "avg" : 63.435 }
{ "_id" : ISODate("2018-01-01T00:00:00Z"), "count" : 14, "avg" : 11.435 }
{ "_id" : ISODate("2018-01-15T00:00:00Z"), "count" : 31, "avg" : 33.95322580645161 }
{ "_id" : ISODate("2018-02-15T00:00:00Z"), "count" : 13, "avg" : 55.89153846153847 }
// Let 1970-01-01 be the low date and 9999-01-01 be the high date. We are not
// dealing with archaeological date manipulation (e.g. hundreds of millions of
// years or more) here, and such use cases likely do not benefit from the
// datetime type anyway.
var arr = [];
["1970-01-01","2018-01-01","2018-01-15","2018-02-15","2018-02-28","9999-01-01"].forEach(function(d) {
    arr.push(new ISODate(d));
});
// Rerunning the same $bucket pipeline now captures everything:
{ "_id" : ISODate("1970-01-01T00:00:00Z"), "count" : 1, "avg" : 5.5 }
{ "_id" : ISODate("2018-01-01T00:00:00Z"), "count" : 14, "avg" : 12.435 }
{ "_id" : ISODate("2018-01-15T00:00:00Z"), "count" : 31, "avg" : 34.91677419354839 }
{ "_id" : ISODate("2018-02-15T00:00:00Z"), "count" : 13, "avg" : 56.97846153846154 }
{ "_id" : ISODate("2018-02-28T00:00:00Z"), "count" : 2, "avg" : 64.435 }
The same approach can be used for numbers:
var arr = [10, 20, 30, 40];
db.collection.aggregate([
    {$bucket: {
        groupBy: "$n",   // grouping by n now, not date!
        boundaries: arr,
        default: "Other",
        output: {
            "count": {$sum: 1},
            "avg": {$avg: "$val"}
        }
    }}
]);
{ "_id" : 10, "count" : 10, "avg" : 18.435 }
{ "_id" : 20, "count" : 10, "avg" : 28.435000000000002 }
{ "_id" : 30, "count" : 10, "avg" : 38.434999999999995 }
{ "_id" : "Other", "count" : 31, "avg" : 41.24354838709677 }
Numbers have obvious low and high sentinels, too: -Infinity and Infinity:
var arr = [-Infinity, 10, 20, 30, 40, Infinity];
{ "_id" : -Infinity, "count" : 9, "avg" : 8.997777777777777 }
{ "_id" : 10, "count" : 10, "avg" : 18.435 }
{ "_id" : 20, "count" : 10, "avg" : 28.435000000000002 }
{ "_id" : 30, "count" : 10, "avg" : 38.434999999999995 }
{ "_id" : 40, "count" : 22, "avg" : 54.434999999999995 }
The irritation with very large arrays appears when the buckets get fine-grained. Suppose that instead of one observation per day we have one per second:
{ "n" : 0, "val" : 107.05756543841737, "date" : ISODate("2018-01-01T00:00:00Z") }
{ "n" : 1, "val" : 112.15474746119355, "date" : ISODate("2018-01-01T00:00:01Z") }
{ "n" : 2, "val" : 114.9389559965861, "date" : ISODate("2018-01-01T00:00:02Z") }
{ "n" : 3, "val" : 100.96869281305015, "date" : ISODate("2018-01-01T00:00:03Z") }
{ "n" : 4, "val" : 119.4575361398694, "date" : ISODate("2018-01-01T00:00:04Z") }
{ "n" : 5, "val" : 117.81224531915713, "date" : ISODate("2018-01-01T00:00:05Z") }
{ "n" : 6, "val" : 101.8992107912693, "date" : ISODate("2018-01-01T00:00:06Z") }
{ "n" : 7, "val" : 118.00930576370166, "date" : ISODate("2018-01-01T00:00:07Z") }
{ "n" : 8, "val" : 111.10144964537338, "date" : ISODate("2018-01-01T00:00:08Z") }
{ "n" : 9, "val" : 109.83909498986932, "date" : ISODate("2018-01-01T00:00:09Z") }
{ "n" : 10, "val" : 108.16311721339869, "date" : ISODate("2018-01-01T00:00:10Z") }
{ "n" : 11, "val" : 109.8721409530211, "date" : ISODate("2018-01-01T00:00:11Z") }
...
// Bucketing one day into 5-second frames requires 86400/5 = 17,280 boundary dates:
var arr = [];
var sv = new ISODate("2018-01-01").getTime();
var endv = new ISODate("2018-01-02").getTime();
var incr = (1000 * 5); // 5000 ms increment
while(endv > sv) {
    arr.push(new Date(sv));
    sv += incr;
}
// Running $bucket over this 17,280-element boundaries array is where the
// very-large-array irritation shows up. Plan B: pull the raw docs with no
// $bucket at all and do the slotting client-side:
var buckets = {};
var c = db.collection.aggregate([]); // no $bucket!
c.forEach(function(r) {
    var x = findSlotLinear(arr, r['date']);
    if(buckets[x] == undefined) {
        buckets[x] = {lb: arr[x], ub: arr[x+1], n: 0, v: 0, a: 0};
    }
    var zz = buckets[x];
    zz['n']++;                  // count of observations in the slot
    zz['v'] += r['val'];        // running sum
    zz['a'] = zz['v']/zz['n'];  // running average
});
// Walk the boundaries until val is below the next one; that index is the
// slot. Like $bucket, this assumes val falls within [arr[0], arr[max]).
function findSlotLinear(arr, val) {
    var max = arr.length - 1;
    for(var k = 0; k < max; k++) {
        if(val < arr[k+1]) {
            return k;
        }
    }
}
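With 17,280 sorted boundaries, the linear scan does a lot of needless comparisons; a binary search gets each lookup down to about 14. A minimal sketch (findSlotBinary is my name for a hypothetical drop-in replacement, carrying the same in-range assumption as the linear version):

function findSlotBinary(arr, val) {
    // Find the largest k such that arr[k] <= val; slot k spans [arr[k], arr[k+1]).
    var lo = 0, hi = arr.length - 1;
    while(lo < hi) {
        var mid = Math.floor((lo + hi + 1) / 2);
        if(val >= arr[mid]) {
            lo = mid;       // val is at or past this boundary
        } else {
            hi = mid - 1;   // val is before this boundary
        }
    }
    return lo;
}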
But clearly, we'd rather aggregate in the database than write client-side code.
The trick is to stop thinking in dates and start thinking in offsets. Converting each date to milliseconds and subtracting the start date yields a diff that, divided by 1000, counts seconds from the start:

doc n in ms     start in ms     diff   diff/1000
1514764800000   1514764800000      0       0
1514764801000   1514764800000   1000       1
1514764802000   1514764800000   2000       2
1514764803000   1514764800000   3000       3
1514764804000   1514764800000   4000       4
1514764805000   1514764800000   5000       5
1514764806000   1514764800000   6000       6
...
Dividing by the bucket size (5 seconds) and taking the floor collapses each run of 5 seconds into a single integer slot:

doc n in ms     start in ms     diff   /1000   div/5   floor
1514764800000   1514764800000      0       0     0       0
1514764801000   1514764800000   1000       1     0.2     0
1514764802000   1514764800000   2000       2     0.4     0
1514764803000   1514764800000   3000       3     0.6     0
1514764804000   1514764800000   4000       4     0.8     0
1514764805000   1514764800000   5000       5     1       1
1514764806000   1514764800000   6000       6     1.2     1
...
secondsBucket = 5;
startDate = new ISODate("2018-01-01");
db.collection.aggregate([
    // Perform any or no $matching here to constrain the bucketing, e.g.
    // {$match: {"sym": sym, "date": {$gte: startDate, $lt: new Date(endv)}}},

    // The Juice! Get the floor of the diff / seconds to yield a "slot".
    // Also note the use of $subtract on dates:
    {$addFields: {"slot": {$floor: {$divide: [
        {$divide: [{$subtract: ["$date", startDate]}, 1000.0]},
        secondsBucket
    ]}}}}

    // Now just group on the slot. We throw in n for the count but
    // it is not really necessary...
    ,{$group: {_id: "$slot", n: {$sum: 1}, avg: {$avg: "$val"}}}

    // Get it in 0-n order. Not vital but certainly useful:
    ,{$sort: {_id: 1}}
]);
{ "_id" : 0, "n" : 5, "avg" : 110.91549956982333 }
{ "_id" : 1, "n" : 5, "avg" : 111.72435237482561 }
{ "_id" : 2, "n" : 5, "avg" : 109.16745278376398 }
...
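The slots are easy to turn back into timestamps if needed, because $add of a date and a number of milliseconds yields a date. A minimal sketch of an extra stage (my addition, not part of the pipeline above):

    // Hypothetical extra stage: recover each bucket's start time from its slot.
    ,{$addFields: {bucketStart: {$add: [startDate, {$multiply: ["$_id", secondsBucket * 1000]}]}}}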
A similar approach uses modulo ($mod) to snap each timestamp to the start of its frame instead of computing a slot index; a sketch follows.
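This is a minimal sketch, assuming the same secondsBucket and the date-minus-epoch trick for getting milliseconds:

db.collection.aggregate([
    // Subtracting the epoch date from a date yields milliseconds as a number:
    {$addFields: {ms: {$subtract: ["$date", new Date(0)]}}}

    // ms minus (ms mod frame-size-in-ms) is the frame start; adding it back
    // onto the epoch turns it into a date again:
    ,{$addFields: {frameStart: {$add: [new Date(0),
        {$subtract: ["$ms", {$mod: ["$ms", secondsBucket * 1000]}]}]}}}

    ,{$group: {_id: "$frameStart", n: {$sum: 1}, avg: {$avg: "$val"}}}
    ,{$sort: {_id: 1}}
]);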
The array operators in MongoDB allow for simple, powerful manipulation of subsets of arrays, which is what we want from a moving average. Consider the following daily close data:
{ "val" : 10, "date" : ISODate("2018-01-01T00:00:00Z") }
{ "val" : 11.43, "date" : ISODate("2018-01-02T00:00:00Z") }
{ "val" : 12.52, "date" : ISODate("2018-01-03T00:00:00Z") }
{ "val" : 12.99, "date" : ISODate("2018-01-04T00:00:00Z") }
{ "val" : 12.72, "date" : ISODate("2018-01-05T00:00:00Z") }
{ "val" : 11.79, "date" : ISODate("2018-01-06T00:00:00Z") }
{ "val" : 10.42, "date" : ISODate("2018-01-07T00:00:00Z") }
{ "val" : 8.94, "date" : ISODate("2018-01-08T00:00:00Z") }
{ "val" : 7.72, "date" : ISODate("2018-01-09T00:00:00Z") }
{ "val" : 7.06, "date" : ISODate("2018-01-10T00:00:00Z") }
{ "val" : 7.12, "date" : ISODate("2018-01-11T00:00:00Z") }
{ "val" : 7.88, "date" : ISODate("2018-01-12T00:00:00Z") }
Below is the solution, but note: our data is set up so that each doc is one actual day apart; each doc, as an observation (or data point), therefore corresponds to a day. The solution is general, however: it calculates moving averages over a given number of observations in an input set, not days specifically. If your data has multiple data points per day, you will first have to $group it into the daily "atom" that feeds the moving-average technique shown below; a sketch of that pre-grouping follows.
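A hedged sketch of that pre-grouping (assuming intraday docs carry the same sym, val, and date fields; averaging the intraday points into one per-day atom is just one reasonable choice):

db.collection.aggregate([
    // One doc per symbol per calendar day. $avg is one choice of daily
    // atom; $last (e.g. the closing value, after a $sort) is another.
    {$group: {
        _id: {sym: "$sym", day: {$dateToString: {format: "%Y-%m-%d", date: "$date"}}},
        val: {$avg: "$val"},
        date: {$min: "$date"}
    }}
]);

With one observation per day in hand, here is the solution itself: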
// Control the size of the moving average frame:
datapts = 4;

db.collection.aggregate([
    // Filter down to what you want. This can be anything or nothing at all:
    {$match: {"sym": "S1"}}

    // Ensure dates are going earliest to latest:
    ,{$sort: {date: 1}}

    // Turn the docs into a single doc with a big vector of observations, e.g.
    //   {sym: "A", d: d1, val: 10}
    //   {sym: "A", d: d2, val: 11}
    //   {sym: "A", d: d3, val: 13}
    // becomes
    //   {_id: "A", prx: [ {v:10,d:d1}, {v:11,d:d2}, {v:13,d:d3} ] }
    //
    // This sets us up to take advantage of the array processing functions!
    ,{$group: {_id: "$sym", prx: {$push: {v: "$val", d: "$date"}}}}

    // Nice additional info. Note the use of dot notation on the array to get
    // just the scalar date at elem 0, not the object {v:val, d:date}:
    ,{$addFields: {frameSize: datapts, startDate: {$arrayElemAt: ["$prx.d", 0]}}}

    // The Juice! Basically, use $map to start at index 0 and keep
    // slicing out subsets, calcing the average, and emitting that number.
    //
    // Note that we only run the vector to (len(vector) - (datapts - 1)).
    // Also, for extra info, we add the as-of date, which is the tail date
    // of the segment.
    //
    // Again we take advantage of dot notation to turn the vector of
    // objects {v:val, d:date} into two vectors of simple scalars [v1,v2,...]
    // and [d1,d2,...] with $prx.v and $prx.d. Also, rather than create another
    // vector next to prx in the doc, we overwrite the existing one (which
    // we don't need at the end anyway) by using $addFields with the same name (prx).
    ,{$addFields: {"prx": {$map: {
        input: {$range: [0, {$subtract: [{$size: "$prx"}, (datapts - 1)]}]},
        as: "z",
        in: {
            avg: {$avg: {$slice: ["$prx.v", "$$z", datapts]}},
            d: {$arrayElemAt: ["$prx.d", {$add: ["$$z", (datapts - 1)]}]}
        }
    }}}}
]);
{
"_id" : "S1",
"prx" : [
{
"avg" : 11.738793632512115,
"d" : ISODate("2018-01-04T00:00:00Z")
},
{
"avg" : 12.420766702631376,
"d" : ISODate("2018-01-05T00:00:00Z")
},
{
"avg" : 12.510051656756191,
"d" : ISODate("2018-01-06T00:00:00Z")
},
...
{
"avg" : 12.534681008446219,
"d" : ISODate("2018-01-31T00:00:00Z")
},
{
"avg" : 12.08669329998585,
"d" : ISODate("2018-02-01T00:00:00Z")
}
],
"frameSize" : 4,
"startDate" : ISODate("2018-01-01T00:00:00Z")
}
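The result is one doc per symbol with the whole moving-average series embedded in prx. If one document per moving-average point is preferable, appending a $unwind (my addition, not part of the pipeline above) flattens it out:

    // Appended to the pipeline above: one output doc per moving-average point.
    ,{$unwind: "$prx"}
    ,{$project: {_id: 0, sym: "$_id", avg: "$prx.avg", d: "$prx.d"}}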