Quick tips for using pandas with MongoDB
This article covers two topics at a high level:
- How to most effectively move array-based data from MongoDB into pandas
- Important and useful ways to exploit the MongoDB query and aggregation language to
scope the data sent to pandas to just what you need to perform your tasks.
For Starters: What is pandas?
From their docs:
pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, real world data analysis in Python. Additionally, it has the
broader goal of becoming the most powerful and flexible open source data
analysis / manipulation tool available in any language. It is already well on
its way toward this goal.
pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
How is pandas different from R?
There is plenty of material on the internet on this topic, but here are
some high-level bullets:
- Both pandas and R are used for statistical and numeric analysis, including
capabilities to import data and plot it.
- Both pandas and R feature a 2-dimensional space called a dataframe
as a foundational object.
- R is its own environment with its own language and ecosystem of modules.
pandas is a module you import into python, so the language is python and the ecosystem is python.
- R has a somewhat broader native set of statistical functions.
pandas benefits from the Python object-oriented environment, and many more "peer" modules and functions can easily be integrated.
- Both python and R have drivers/APIs to MongoDB
How do I get data from MongoDB into pandas?
This is the "classic" 5-liner that will produce a dataframe. find()
with no arguments is essentially select * from myCollection: all
the fields, no filtering. The Python built-in function list() is used
to exhaust the cursor (an iterable) returned by find(), appending each
document to a list object. This is fine
for smaller datasets, but see the "Final Tip" below for hazards and remedies:
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.sourceDatabase
df = pd.DataFrame(list(db.myCollection.find()))
pandas is very flexible in terms of its ingestion of data, and the MongoDB
Python driver will yield rich types (i.e. real dates, real arrays, etc.,
not string representations or opaque blobs) so pretty much
anything that is in myCollection will be slurped into the
dataframe. So in action, the 5-Liner run with this data:
db.myCollection.insert({"a":1, "b":2, "c":3})
db.myCollection.insert({"a":4, "b":5, "c":6})
would yield this output:
_id a b c
0 5804e627119c0d2c13c0a7e1 1.0 2.0 3.0
1 5804e627119c0d2c13c0a7e2 4.0 5.0 6.0
Notice how the special MongoDB unique document identifier _id
is also brought into the frame. There are simple techniques to hold that out
from the returned dataset; we'll explore that in just a bit.
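One quick technique, sketched here, is to pass a projection as the second argument to find() so that MongoDB suppresses _id on the server side (the later examples take a different approach):
# A sketch: the projection {"_id": 0} tells MongoDB to exclude _id,
# so the dataframe will contain only fields a, b, and c.
df = pd.DataFrame(list(db.myCollection.find({}, {"_id": 0})))
print df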
As mentioned before, pandas will properly
slurp even complex data, including mixed types in arrays (field f
in the second insert in the following example):
db.myCollection.insert({"a":"hello", "b": datetime.datetime.now(), "c": {"d":"AA","e":"BB"}, "f": [2,3,4]})
db.myCollection.insert({"a":"goodbye", "c": {"e":"WW"}, "f": [0, "grimble", datetime.datetime.now(), 7.7 ]})
_id a b \
0 5804e758119c0d2c13c0a7e5 hello 2016-10-17 14:59:36.029
1 5804e758119c0d2c13c0a7e6 goodbye
c f
0 {u'e': u'BB', u'd': u'AA'} [2.0, 3.0, 4.0]
1 {u'e': u'WW'} [0.0, grimble, 2016-10-17 14:59:36.107000, 7.7]
Although the 5-Liner highlights the ease of integration, it sidesteps
an important issue: how best to deal with arrays in MongoDB.
Some background on pandas and dataframe creation
Backing up just a little bit, most use cases of pandas center around
creating matrices of numbers, often
physically supplied as rows of columns of numbers. Arrays of arrays are
truly the easiest to construct and nicely drive dynamic dataframe creation
(i.e. the lengths of the arrays drive the frame size),
although,
like all integer-offset based addressing, larger datasets or those with missing
values can get a bit cumbersome to deal with (just like CSV files):
values = [
[ 4, 5, 6, 7],
[ 7, 8, 9, 10],
[ 10, 11, 12]
]
print pd.DataFrame(values)
0 1 2 3
0 4 5 6 7.0
1 7 8 9 10.0
2 10 11 12 NaN
Arrays of key:value structures of simple scalars make
field management
easier and in fact are the main representation of data as vended by
traditional RDBMS platforms (think of a cursor over a ResultSet):
values = [
{ "a":4, "b":5, "c":6, "d":7},
{ "a":7, "b":8, "c":9, "d":10},
{ "b":11, "c":12 }
]
print pd.DataFrame(values)
a b c d
0 4.0 5 6 7.0
1 7.0 8 9 10.0
2 NaN 11 12 NaN
Combining key:value and arrays, however, yields an interesting result:
values = [
{ "a": [ 4,5,6,7]},
{ "a": [ 8,9,10] },
{ "a": [ 11, 12] }
]
print pd.DataFrame(values)
a
0 [4, 5, 6, 7]
1 [8, 9, 10]
2 [11, 12]
The arrays do not dynamically drive the number of columns. Instead only
a single column exists of type array! This is typically not the structure
we seek.
In MongoDB it is often very advantageous to store arrays of
values. But the 5-Liner approach suffers from the same problem:
data = [
{ "a": [ 4,5,6,7]},
{ "a": [ 8,9,10] },
{ "a": [ 11, 12] }
]
db.myCollection.insert(data)
print pd.DataFrame(list(db.myCollection.find()))
_id a
0 58056115119c0d2c13c0a7e9 [4.0, 5.0, 6.0, 7.0]
1 58056115119c0d2c13c0a7ea [8.0, 9.0, 10.0]
2 58056115119c0d2c13c0a7eb [11.0, 12.0]
Optimizing consumption of arrays from MongoDB
The technique for maximizing the utility of moving data from MongoDB arrays
to dynamic columns in pandas is simple: instead of calling list()
on find(), iterate the cursor and build at least one list
for values (this will be a list of arrays), optionally one for column labels,
and optionally one for an index. Starting simple (and without list
comprehensions):
values = []
for cc in db.myCollection.find():
    values.append(cc['a'])

print pd.DataFrame(values)
0 1 2 3
0 4.0 5.0 6.0 7.0
1 8.0 9.0 10.0 NaN
2 11.0 12.0 NaN NaN
Note how it is no longer just column "a"; the columns are dynamically driven
by the size of array a. Also, because we are
explicitly building the values
array (again, a list of arrays), the _id does not get dragged in.
pandas will automatically assign NaN to pad the shorter arrays.
Each document will typically have some kind of label associated with the
value array. Let's add that as field n:
data = [
{ "a": [ 4,5,6,7], "n": "foo"},
{ "a": [ 8,9,10] , "n": "bar"},
{ "a": [ 11, 12] , "n": "baz"}
]
db.myCollection.insert(data)
values = []
seriesLbls = []
for cc in db.myCollection.find():
    values.append(cc['a'])
    seriesLbls.append(cc['n'])

print pd.DataFrame(values, index=seriesLbls)
0 1 2 3
foo 4.0 5.0 6.0 7.0
bar 8.0 9.0 10.0 NaN
baz 11.0 12.0 NaN NaN
Note that 0,1,2 have been replaced by foo, bar, and baz. It should be
clear that if desired, cc['_id'] could be used instead of
cc['n'] to create the index. Building multiple arrays in the
for loop is somewhat clearer than getting fancy with list
comprehensions.
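For comparison, here is a sketch of the list-comprehension style against the same collection; it produces the same frame but is arguably harder to read:
# A sketch of the "fancy" alternative: one pass over the cursor building
# (value-array, label) pairs, then splitting them into two lists.
pairs = [(cc['a'], cc['n']) for cc in db.myCollection.find()]
values = [p[0] for p in pairs]
seriesLbls = [p[1] for p in pairs]
print pd.DataFrame(values, index=seriesLbls)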
Datetimes also flow well from MongoDB into pandas for use as an index.
Let's add a date to each record and use that instead of the name:
data = [
{ "a": [ 4,5,6,7], "n": "foo", "d": datetime.datetime(2016,7,1) },
{ "a": [ 8,9,10] , "n": "bar", "d": datetime.datetime(2016,8,1) },
{ "a": [ 11, 12] , "n": "baz", "d": datetime.datetime(2016,9,1) }
]
db.myCollection.insert(data)
values = []
dates = []
for cc in db.myCollection.find():
    values.append(cc['a'])
    dates.append(cc['d'])

didx = pd.DatetimeIndex(dates) # not strictly necessary
print pd.DataFrame(values, index=didx)
0 1 2 3
2016-07-01 4.0 5.0 6.0 7.0
2016-08-01 8.0 9.0 10.0 NaN
2016-09-01 11.0 12.0 NaN NaN
The dataframe constructor will accept a regular array of datetime objects
as the value of index but we show explicit construction of a
DatetimeIndex object here for completeness.
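For example, this shorter form (a sketch reusing the values and dates lists built above) should yield an equivalent frame, letting pandas infer the index type:
# Sketch: pass the raw list of datetime objects directly as the index.
print pd.DataFrame(values, index=dates)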
Finally, there are a few ways to craft column labels to make the dataframe
nice and complete. Here we simply look for the longest array to
drive the generation of column names C0 through Cn (via a list
comprehension):
values = []
seriesLbls = []
max = 0
for cc in db.myCollection.find():
    if len(cc['a']) > max:
        max = len(cc['a'])
    values.append(cc['a'])
    seriesLbls.append(cc['n'])

df = pd.DataFrame(values, index=seriesLbls, columns=[ "C"+str(n) for n in range(max)])
print df
print df.T
C0 C1 C2 C3
foo 4.0 5.0 6.0 7.0
bar 8.0 9.0 10.0 NaN
baz 11.0 12.0 NaN NaN
foo bar baz
C0 4.0 8.0 11.0
C1 5.0 9.0 12.0
C2 6.0 10.0 NaN
C3 7.0 NaN NaN
We show both the nominal and transposed (df.T) dataframes here.
Matrices
Outside of the dataframe, matrices are often implemented as an array of arrays.
This representation translates easily to MongoDB. Here, we are storing two
documents, each with a 3x3 matrix. This is a slightly different data design
than the examples above because the matrix is a single field in one document
instead of each row being carried as an array in 2 or more documents:
values = [
{ "name":"A", "v": [ [1,2,3], [4,5,6], [7,8,9] ] }
,{ "name":"B", "v": [ [2,3,4], [5,6,4], [8,9,10] ] }
]
db.myCollection.drop()
db.myCollection.insert(values)
As with regular vectors, just slurping the MongoDB doc into the frame will yield
a representation that you probably don't want.
print pd.DataFrame(list(db.myCollection.find()))
_id name v
0 587115a8ed58db9467d94d34 A [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
1 587115a8ed58db9467d94d35 B [[2, 3, 4], [5, 6, 4], [8, 9, 10]]
Fortunately, pandas works well with regular arrays, so iteratively appending
simple arrays will yield the 3x3 structure we seek. In the following example,
we use the find_one method to return just the one document with name
"B":
values = []
item = db.myCollection.find_one({"name":"B"})
for row in item['v']: # item['v'] is array of array, so row is array
    values.append(row)

print pd.DataFrame(values)
0 1 2
0 2 3 4
1 5 6 4
2 8 9 10
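Since item['v'] is already a list of lists, it can also be handed straight to the constructor, as in the "arrays of arrays" example earlier; a sketch assuming the same document shape:
# Sketch: a list of lists needs no explicit loop.
item = db.myCollection.find_one({"name":"B"})
print pd.DataFrame(item['v'])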
Don't forget about the power of MongoDB!
Very often developers and analysts adopt an approach of "dragging everything"
out of the database (MongoDB or otherwise) and loading up a very big dataframe
-- even if they only need a small subset of the data. Typically this
is because:
- The filtering or aggregation functions need to work on arrays and the pandas
environment offers capabilities not present in the data provider
- The pandas environment is more performant or more convenient
- All of the above
But clearly it is highly
desirable not to move unnecessary data across the network and create
large dataframes if possible.
The examples above focused on efficiently moving arrays from MongoDB into dataframes
but did not explore filtering and aggregation. Below is an example of how
the powerful MongoDB aggregation framework can be brought into the picture.
These filtering
commands are executed in an index-optimized fashion on the server side,
dramatically reducing the amount of material populating the dataframe and improving
performance all around:
values = []
seriesLbls = []
max = 0
for cc in db.myCollection.aggregate([
# First, only get things after 2016-07-15. Typically this would be on an
# indexed field and will rapidly cut down the material to a fraction
# of what might be stored in the DB. "First cut" filtering on dates,
# portfolios, owners, compute run ids, etc. is a very important and useful
# capability.
#
{"$match": {"d": {"$gt": datetime.datetime(2016,7,15)}}}
# Next, compute stdDevPop of the array. MongoDB offers powerful array
# handling functions:
,{"$addFields": {"sdev": {"$stdDevPop": "$a"}}}
# Next, only permit those items where stdDevPop is <.75 to come through:
,{"$match": {"sdev": {"$lt": 0.75}}}
]):
    if len(cc['a']) > max:
        max = len(cc['a'])
    values.append(cc['a'])
    seriesLbls.append(cc['n'])

df = pd.DataFrame(values, index=seriesLbls, columns=[ "C"+str(n) for n in range(max)])
print df
C0 C1
baz 11 12
As a check, if you comment out the last match, you will see that two items
will flow through to the dataframe. The third is still not present because
it was filtered out by the first $match expression on date.
# Next, only permit those items where stdDevPop is <.75 to come through:
# ,{"$match": {"sdev": {"$lt": 0.75}}} # commented out
C0 C1 C2
bar 8 9 10.0
baz 11 12 NaN
Sometimes you will not have data organized into an array but instead will
find the values spread out across documents. Rather than change the
pandas / dataframe logic, use MongoDB to construct the array. We will take
the data above and break it out into individual docs, using the
scalar field v to carry each element of the (original) array to
demonstrate this situation. Then, we will use the $group stage
to bring it back together, build the a array, and pass the results
to the same code logic as before:
data = [
{ "v": 4, "n": "foo", "d": datetime.datetime(2016,7,1) },
{ "v": 5, "n": "foo", "d": datetime.datetime(2016,7,1) },
{ "v": 6, "n": "foo", "d": datetime.datetime(2016,7,1) },
{ "v": 7, "n": "foo", "d": datetime.datetime(2016,7,1) },
{ "v": 8, "n": "bar", "d": datetime.datetime(2016,8,1) },
{ "v": 9, "n": "bar", "d": datetime.datetime(2016,8,1) },
{ "v": 10, "n": "bar", "d": datetime.datetime(2016,8,1) },
{ "v": 11, "n": "baz", "d": datetime.datetime(2016,9,1) },
{ "v": 12, "n": "baz", "d": datetime.datetime(2016,9,1) }
]
db.myCollection.insert(data)
values = []
seriesLbls = []
max = 0
for cc in db.myCollection.aggregate([
{"$group": {"_id": "$n", "a": {"$push": "$v"}}}
]):
    print cc
    if len(cc['a']) > max:
        max = len(cc['a'])
    values.append(cc['a'])
    seriesLbls.append(cc['_id'])

print pd.DataFrame(values, index=seriesLbls, columns=[ "C"+str(n) for n in range(max)])
C0 C1 C2 C3
baz 11 12 NaN NaN
bar 8 9 10.0 NaN
foo 4 5 6.0 7.0
The $group stage above creates the a array by pushing the individual
values v from each doc, grouped by n (which becomes the _id of each output doc).
Of course,
we can add in the date filtering from the previous example before
the $group operator to reduce the amount of material being
passed to $group, and then apply the $stdDevPop computation and filter afterward:
for cc in db.myCollection.aggregate([
{"$match": {"d": {"$gt": datetime.datetime(2016,7,15)}}}
,{"$group": {"_id": "$n", "a": {"$push": "$v"}}}
,{"$addFields": {"sdev": {"$stdDevPop": "$a"}}}
,{"$match": {"sdev": {"$lt": 0.75}}}
]):
...
In general, the strategy should be to filter first, then manipulate.
A Final Tip When Dealing With Very Large Datasets
The simplicity of the 5-Liner hides a potential problem when millions
or billions of documents are pulled from the database. The list()
function will cause an enormous data structure to be created, only to
be passed as a temporary argument to the DataFrame constructor -- which then
builds essentially the same enormous data structure:
df = pd.DataFrame(list(db.myCollection.find()))
There are a number of approaches to avoiding this problem, but most
center around creating small frames from small sets of input records,
then concatenating the frames at the end. Here is an example from
stackoverflow:
def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)

df = iterator2dataframes(db.myCollection.find(), 10000)
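A server-side projection can also be combined with the chunking approach so each small frame carries only the fields of interest; a sketch (the field names here are illustrative):
# Sketch: a projection keeps per-document size down before chunking.
cursor = db.myCollection.find({}, {"_id": 0, "n": 1, "v": 1})
df = iterator2dataframes(cursor, 10000)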