Quick tips for using pandas with MongoDB

17-Oct-2016      Revised 7-Jan-2017

This article covers two topics at a high level:

  1. How to most effectively move array-based data from MongoDB into pandas
  2. Important and useful ways to exploit the MongoDB query and aggregation language to scope the data sent to pandas to just what you need for your tasks.

For Starters: What is pandas?    


From their docs:
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data: tabular data with heterogeneously-typed columns, ordered and unordered time series, arbitrary matrix data with row and column labels, and other forms of observational or statistical data sets.

How is pandas different from R?

There is plenty of material on the internet on this topic; the high-level point is that pandas brings R-style dataframes and data analysis idioms into the general-purpose Python ecosystem.

How do I get data from MongoDB into pandas?

This is the "classic" 5-Liner that will produce a dataframe. find() with no arguments is essentially select * from myCollection; all the fields, no filtering. The Python built-in function list() is used to crank the cursor (an iterable) returned by find() until it is exhausted, collecting each document into a list. This is fine for smaller datasets, but see the "Final Tip" below for hazards and remedies:
  import pandas as pd
  from pymongo import MongoClient
  client = MongoClient()
  db = client.sourceDatabase
  df = pd.DataFrame(list(db.myCollection.find()))
pandas is very flexible in its ingestion of data, and the MongoDB Python driver yields rich types (i.e. real dates, real arrays, etc., not string representations or opaque blobs), so pretty much anything in myCollection will be slurped into the dataframe. In action, the 5-Liner run with this data:
  db.myCollection.insert({"a":1, "b":2, "c":3})
  db.myCollection.insert({"a":4, "b":5, "c":6})
would yield this output:
                          _id    a    b    c
  0  5804e627119c0d2c13c0a7e1  1.0  2.0  3.0
  1  5804e627119c0d2c13c0a7e2  4.0  5.0  6.0
Notice how the special MongoDB unique document identifier _id is also brought into the frame. There are simple techniques to hold that out from the returned dataset; we'll explore that in just a bit. As mentioned before, pandas will properly slurp even complex data, including mixed types in arrays (field f in the second insert in the following example):
  import datetime    # needed for the datetime values below

  db.myCollection.insert({"a":"hello", "b": datetime.datetime.now(), "c": {"d":"AA","e":"BB"}, "f": [2,3,4]})
  db.myCollection.insert({"a":"goodbye", "c": {"e":"WW"}, "f": [0, "grimble", datetime.datetime.now(), 7.7 ]})

                          _id        a                       b  \
  0  5804e758119c0d2c13c0a7e5    hello 2016-10-17 14:59:36.029   
  1  5804e758119c0d2c13c0a7e6  goodbye                     NaT

                              c                                                f  
  0  {u'e': u'BB', u'd': u'AA'}                                  [2.0, 3.0, 4.0]  
  1               {u'e': u'WW'}  [0.0, grimble, 2016-10-17 14:59:36.107000, 7.7]  
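
One simple technique to hold _id out of the result is a projection: the second argument to find() tells the server which fields to include or exclude. A minimal sketch (the same 5-Liner, minus the _id column):

  # The {"_id": 0} projection suppresses _id on the server side; everything
  # else still flows into the dataframe
  df = pd.DataFrame(list(db.myCollection.find({}, {"_id": 0})))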

Although the 5-Liner highlights the ease of integration, it sidesteps an important issue: how best to deal with arrays in MongoDB.

Some background on pandas and dataframe creation

Backing up just a little bit, most use cases of pandas center around creating matrixes of numbers, often physically supplied as rows and columns of numbers. Arrays of arrays are truly the easiest to construct and nicely drive dynamic dataframe creation (i.e. the lengths of the arrays drive the frame size), although, as with all integer-offset addressing, larger datasets or those with missing values can get a bit cumbersome to deal with (just like CSV files):
  values = [
      [ 4, 5, 6, 7],
      [ 7, 8, 9, 10],
      [ 10, 11, 12]
  ]
  print pd.DataFrame(values)

      0   1   2     3
  0   4   5   6   7.0
  1   7   8   9  10.0
  2  10  11  12   NaN
Arrays of key:value structures of simple scalars make field management easier and in fact are the main representation of data as vended by traditional RDBMS platforms (think of iterating a ResultSet cursor):
  values = [
      { "a":4, "b":5, "c":6, "d":7},
      { "a":7, "b":8, "c":9, "d":10},
      { "b":11, "c":12 }
  ]
  print pd.DataFrame(values)

       a   b   c     d
  0  4.0   5   6   7.0
  1  7.0   8   9  10.0
  2  NaN  11  12   NaN
Combining key:value and arrays, however, yields an interesting result:
  values = [
      { "a": [ 4,5,6,7]},
      { "a": [ 8,9,10] },
      { "a": [ 11, 12] }
  ]
  print pd.DataFrame(values)
                a
  0  [4, 5, 6, 7]
  1    [8, 9, 10]
  2      [11, 12]
The arrays do not dynamically drive the number of columns. Instead, only a single column exists, and it is of type array! This is typically not the structure we seek.
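
If the documents are already in hand as a Python list, one way to recover dynamic columns is to pull the arrays out before constructing the frame; a small sketch using the values list above:

  # Extract just the arrays so that row lengths once again drive the columns
  print pd.DataFrame([v["a"] for v in values])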

In MongoDB it is often very advantageous to store arrays of values. But the 5-Liner approach suffers from the same problem:

  data = [
      { "a": [ 4,5,6,7]},
      { "a": [ 8,9,10] },
      { "a": [ 11, 12] }
  ]
  db.myCollection.insert(data)

  print pd.DataFrame(list(db.myCollection.find()))

                          _id                     a
  0  58056115119c0d2c13c0a7e9  [4.0, 5.0, 6.0, 7.0]
  1  58056115119c0d2c13c0a7ea      [8.0, 9.0, 10.0]
  2  58056115119c0d2c13c0a7eb          [11.0, 12.0]

Optimizing consumption of arrays from MongoDB

The technique for maximizing the utility of moving data from MongoDB arrays to dynamic columns in pandas is simple: instead of calling list() on find(), iterate the cursor and build at least one list for values (this will be a list of arrays), optionally one for column labels, and optionally one for an index. Starting simple (and without list comprehensions):
  values = []
  for cc in db.myCollection.find():
      values.append(cc['a'])

  print pd.DataFrame(values)

        0     1     2    3
  0   4.0   5.0   6.0  7.0
  1   8.0   9.0  10.0  NaN
  2  11.0  12.0   NaN  NaN
Note how it is no longer just column "a"; the columns are dynamically driven by the size of array a. Also, because we are explicitly building the values array (again, a list of arrays), the _id does not get dragged in. pandas automatically pads the shorter arrays with NaN.

Each document will typically have some kind of label associated with the value array. Let's add that as field n:

  data = [
      { "a": [ 4,5,6,7], "n": "foo"},
      { "a": [ 8,9,10] , "n": "bar"},
      { "a": [ 11, 12] , "n": "baz"}
  ]
  db.myCollection.insert(data)

  values = []
  seriesLbls = []
  for cc in db.myCollection.find():
      values.append(cc['a'])
      seriesLbls.append(cc['n'])

  print pd.DataFrame(values, index=seriesLbls)

          0     1     2    3
  foo   4.0   5.0   6.0  7.0
  bar   8.0   9.0  10.0  NaN
  baz  11.0  12.0   NaN  NaN
Note that 0,1,2 have been replaced by foo, bar, and baz. It should be clear that if desired, cc['_id'] could be used instead of cc['n'] to create the index. Building multiple arrays in the for loop is somewhat clearer than getting fancy with list comprehensions.
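
For comparison, a comprehension-based version might look like the sketch below; the cursor has to be materialized first because it can only be iterated once, which is part of why the explicit loop reads more clearly:

  # Materialize the cursor once, then build the two lists by comprehension
  docs = list(db.myCollection.find())
  values = [doc['a'] for doc in docs]
  seriesLbls = [doc['n'] for doc in docs]
  df = pd.DataFrame(values, index=seriesLbls)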

Datetimes also flow well from MongoDB into pandas for use as an index. Let's add a date to each record and use that instead of the name:

  data = [
      { "a": [ 4,5,6,7], "n": "foo", "d": datetime.datetime(2016,7,1) },
      { "a": [ 8,9,10] , "n": "bar", "d": datetime.datetime(2016,8,1) },
      { "a": [ 11, 12] , "n": "baz", "d": datetime.datetime(2016,9,1) }
  ]
  db.myCollection.insert(data)

  values = []
  dates = []
  for cc in db.myCollection.find():
      values.append(cc['a'])
      dates.append(cc['d'])

  didx = pd.DatetimeIndex(dates)  # not strictly necessary
  print pd.DataFrame(values, index=didx)

                 0     1     2    3
  2016-07-01   4.0   5.0   6.0  7.0
  2016-08-01   8.0   9.0  10.0  NaN
  2016-09-01  11.0  12.0   NaN  NaN
The dataframe constructor will accept a regular array of datetime objects as the value of index, but we show explicit construction of a DatetimeIndex object here for completeness.
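
For instance, the following sketch produces the same frame from the raw list:

  # pandas will build the index directly from the datetime objects
  df = pd.DataFrame(values, index=dates)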

Finally, there are a few ways to craft column labels to make the dataframe nice and complete. Here we simply look for the longest array to drive the generation of column names C0 through Cn (via a list comprehension):

  values = []
  seriesLbls = []
  max = 0
  for cc in db.myCollection.find():
      if len(cc['a']) > max:
          max = len(cc['a'])
      values.append(cc['a'])
      seriesLbls.append(cc['n'])
  
  df = pd.DataFrame(values, index=seriesLbls, columns=[ "C"+str(n) for n in range(max)])
  
  print df
  print df.T
  
         C0    C1    C2   C3
  foo   4.0   5.0   6.0  7.0
  bar   8.0   9.0  10.0  NaN
  baz  11.0  12.0   NaN  NaN
  
      foo   bar   baz
  C0  4.0   8.0  11.0
  C1  5.0   9.0  12.0
  C2  6.0  10.0   NaN
  C3  7.0   NaN   NaN
We show both the nominal and transposed (df.T) dataframes here.

Matrixes

Outside of the dataframe, matrixes are often implemented as an array of arrays. This implementation translates easily to MongoDB. Here, we are storing two documents, each with a 3x3 matrix. This is a slightly different data design from the examples above because the matrix is a single field in one document instead of each row being carried as an array in 2 or more documents:
  values = [
        { "name":"A", "v": [ [1,2,3], [4,5,6], [7,8,9] ] }
        ,{ "name":"B", "v": [ [2,3,4], [5,6,4], [8,9,10] ] }
  ]

  db.myCollection.drop()
  db.myCollection.insert(values)
As with regular vectors, just slurping the MongoDB doc into the frame will yield a representation that you probably don't want.
  print pd.DataFrame(list(db.myCollection.find()))
  
                          _id name                                       v
  0  587115a8ed58db9467d94d34    A       [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
  1  587115a8ed58db9467d94d35    B      [[2, 3, 4], [5, 6, 4], [8, 9, 10]]
Fortunately, pandas works well with regular arrays, so iteratively appending simple arrays will yield the 3x3 structure we seek. In the following example, we use the find_one method to return just the one document with name "B":
  values = []
  item = db.myCollection.find_one({"name":"B"})
  for row in item['v']:   # item['v'] is array of array, so row is array
      values.append(row)
  
  print pd.DataFrame(values)

     0  1   2
  0  2  3   4
  1  5  6   4
  2  8  9  10
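
In fact, because item['v'] is already an array of arrays, the intermediate loop is not strictly necessary; the field can be handed straight to the constructor:

  # item['v'] is already a list of lists -- exactly what DataFrame expects
  item = db.myCollection.find_one({"name":"B"})
  df = pd.DataFrame(item['v'])    # same 3x3 frame as above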

Don't forget about the power of MongoDB!

Very often developers and analysts adopt an approach of "dragging everything" out of the database (MongoDB or otherwise) and loading up a very big dataframe -- even if they only need a small subset of the data. But clearly it is highly desirable not to move unnecessary data across the network and not to create larger dataframes than necessary.

The examples above focused on efficiently moving arrays from MongoDB into dataframes but did not explore filtering and aggregation. Below is an example of how the powerful MongoDB aggregation framework can be brought into the picture. These filtering commands are executed in an index-optimized fashion on the server side, dramatically reducing the amount of material populating the dataframe and improving performance all around:

  values = []
  seriesLbls = []
  max = 0

  for cc in db.myCollection.aggregate([
  #  First, only get things after 2016-07-15.  Typically this would be on an 
  #  indexed field and will rapidly cut down the material to a fraction
  #  of what might be stored in the DB.  "First cut" filtering on dates, 
  #  portfolios, owners, compute run ids, etc. is a very important and useful 
  #  capability.
  #  
  {"$match": {"d": {"$gt": datetime.datetime(2016,7,15)}}}
  
  #  Next, compute the population standard deviation of the array.  MongoDB
  #  offers powerful array handling functions ($addFields requires MongoDB 3.4
  #  or later):
  ,{"$addFields": {"sdev": {"$stdDevPop": "$a"}}}
  
  #  Next, only permit those items where stdDevPop is <.75 to come through:
  ,{"$match": {"sdev": {"$lt": 0.75}}}
  ]):
      if len(cc['a']) > max:
          max = len(cc['a'])
      values.append(cc['a'])
      seriesLbls.append(cc['n'])
  
  df = pd.DataFrame(values, index=seriesLbls, columns=[ "C"+str(n) for n in range(max)])
  
  print df
  
       C0  C1
  baz  11  12
As a check, if you comment out the last match, you will see that two items will flow through to the dataframe. The third is still not present because it was filtered out by the first $match expression on date.
  
  #  Next, only permit those items where stdDevPop is <.75 to come through:
  # ,{"$match": {"sdev": {"$lt": 0.75}}}   # commented out
  
       C0  C1    C2
  bar   8   9  10.0
  baz  11  12   NaN
Sometimes you will not have the data organized into an array but instead will find the values spread out across documents. Rather than change the pandas / dataframe logic, use MongoDB to construct the array. To demonstrate, we will take the data above and break it out into individual docs, using the scalar field v to carry each element of the (original) array. Then we will use the $group stage to bring it back together, build the a array, and pass the results to the same code logic as before:
  data = [
        { "v": 4, "n": "foo", "d": datetime.datetime(2016,7,1) },
        { "v": 5, "n": "foo", "d": datetime.datetime(2016,7,1) },
        { "v": 6, "n": "foo", "d": datetime.datetime(2016,7,1) },
        { "v": 7, "n": "foo", "d": datetime.datetime(2016,7,1) },
        { "v": 8, "n": "bar", "d": datetime.datetime(2016,8,1) },
        { "v": 9, "n": "bar", "d": datetime.datetime(2016,8,1) },
        { "v": 10, "n": "bar", "d": datetime.datetime(2016,8,1) },
        { "v": 11, "n": "baz", "d": datetime.datetime(2016,9,1) },
        { "v": 12, "n": "baz", "d": datetime.datetime(2016,9,1) }
  ]
  db.myCollection.insert(data)

  values = []
  seriesLbls = []
  max = 0
  for cc in db.myCollection.aggregate([
  {"$group": {"_id": "$n", "a": {"$push": "$v"}}}
  ]):
      print cc
      if len(cc['a']) > max:
          max = len(cc['a'])
      values.append(cc['a'])
      seriesLbls.append(cc['_id'])

  print pd.DataFrame(values, index=seriesLbls, columns=[ "C"+str(n) for n in range(max)])

       C0  C1    C2   C3
  baz  11  12   NaN  NaN
  bar   8   9  10.0  NaN
  foo   4   5   6.0  7.0
The $group stage above creates the a array from the individual values v in each doc, grouped by n (carried as the group _id).

Of course, we can add the date filtering from the previous example before the $group stage to reduce the amount of material passed to it, and then apply the $stdDevPop filter afterward:

  for cc in db.myCollection.aggregate([
  {"$match": {"d": {"$gt": datetime.datetime(2016,7,15)}}}
  ,{"$group": {"_id": "$n", "a": {"$push": "$v"}}}
  ,{"$addFields": {"sdev": {"$stdDevPop": "$a"}}}
  ,{"$match": {"sdev": {"$lt": 0.75}}}
  ]):
  ...
In general, the strategy should be to filter first, then manipulate.
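
The aggregation framework is not the only way to do this; even plain find() accepts both a query and a projection, so simple scoping can be pushed to the server. A minimal sketch against the array-shaped documents used earlier (fields a, n, and d):

  values = []
  seriesLbls = []
  # Filter on the (ideally indexed) date field and project away everything
  # except the fields actually needed to build the frame
  for cc in db.myCollection.find(
          {"d": {"$gt": datetime.datetime(2016,7,15)}},
          {"_id": 0, "a": 1, "n": 1}):
      values.append(cc['a'])
      seriesLbls.append(cc['n'])
  df = pd.DataFrame(values, index=seriesLbls)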

A Final Tip When Dealing With Very Large Datasets

The simplicity of the 5-Liner hides a potential problem when millions or billions of documents are pulled from the database. The list() call materializes an enormous data structure, only to pass it as a temporary argument to the DataFrame constructor -- which then builds essentially the same enormous structure all over again:
  df = pd.DataFrame(list(db.myCollection.find()))
There are a number of approaches to avoiding this problem, but most center around creating small frames from small sets of input records, then concatenating the frames at the end. Here is an example from stackoverflow:
  def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame
    This is a balance between memory and efficiency
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
      records.append(record)
      if i % chunk_size == chunk_size - 1:
        frames.append(pd.DataFrame(records))
        records = []
    if records:
      frames.append(pd.DataFrame(records))
    return pd.concat(frames)

  df = iterator2dataframes(db.myCollection.find(), 10000)
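
The chunked loader composes naturally with the server-side scoping described earlier; for example, filtering and projecting before chunking keeps both the network traffic and the per-chunk frames small (field names as in the earlier examples):

  # Only matching docs and only the needed fields ever leave the server,
  # and the client never holds one giant intermediate list
  df = iterator2dataframes(
      db.myCollection.find({"d": {"$gt": datetime.datetime(2016,7,15)}},
                           {"_id": 0, "a": 1, "n": 1}),
      10000)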
