Hurricanes, GeoJSON, and MongoDB

2-Oct-2018 Like this? Dislike this? Let me know

I thought I would explore the latest hurricane data available from NOAA. This is a comprehensive rant that covers a number of things:

  1. The HURDAT2 hurricane data format
  2. Processing this data into something highly useful inside MongoDB, exploiting the geospatial capabilities of MongoDB
  3. Exploring the subtle differences and uses of the GeoJSON FeatureCollection vs. GeometryCollection, using both the latest (2016) GeoJSON spec RFC 7946 and MongoDB 4.0 docs

The HURDAT2 format

The HURDAT2 format and data for all Atlantic hurricanes from 1851 to 2017 can be found here. The Atlantic and Pacific formats are in fact structurally the same; only some of the prefixes and codes vary. In the examples here, I will be using the Atlantic dataset hurdat2-1851-2017-050118.txt. Data from earlier hurricanes is obviously far less granular and rich than for more recent events, especially those from 2004 onward, when quadrant wind radii values and additional event data like landfall time began to be recorded.

As with most public datasets, HURDAT2 is designed for "maximum ingestibility": it is a fixed-width but also comma-separated file. Below is a quick sample:

AL122005,            KATRINA,     34,
20050823, 1800,  , TD, 23.1N,  75.1W,  30, 1008,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
20050824, 0600,  , TD, 23.8N,  76.2W,  30, 1007,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
20050824, 1200,  , TS, 24.5N,  76.5W,  35, 1006,   60,   60,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
20050825, 0600,  , TS, 26.1N,  78.4W,  50,  997,   60,   60,    0,    0,   15,    0,    0,    0,    0,    0,    0,    0,
20050825, 1800,  , TS, 26.2N,  79.6W,  60,  988,   70,   70,   50,   60,   25,   25,   20,   20,    0,    0,    0,    0,
20050826, 0000,  , HU, 25.9N,  80.3W,  70,  983,   70,   70,   50,   40,   20,   20,   20,   20,   10,   10,   10,   10,
...
The reader program assumes the file starts with a header row. The docs are very clear on the components and codes, but of critical importance is the third field, which specifies the number of data rows to follow. In the example above, we show only 6 of Katrina's 34 rows. Multiple storm events are stored in one file by appending additional header and data rows; parsing simply ends when the read of the next header yields EOF.
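To make that loop concrete, here is a minimal reader sketch in Python (a sketch only; the converter described below does far more scrubbing and unit conversion):

import csv

#  Yield (storm_id, name, rows) tuples, one per storm, from a HURDAT2 file.
#  The loop over headers ends naturally when the next read yields EOF.
def read_hurdat2(path):
    with open(path) as f:
        rdr = csv.reader(f)
        for header in rdr:
            # third header field says how many data rows follow
            storm_id, name, nrows = header[0].strip(), header[1].strip(), int(header[2])
            rows = [[fld.strip() for fld in next(rdr)] for _ in range(nrows)]
            yield storm_id, name, rows

for storm_id, name, rows in read_hurdat2('hurdat2-1851-2017-050118.txt'):
    print(storm_id, name, len(rows))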

This data can be easily slurped into practically any database and even some geo visualizers -- but can we do even more? Certainly!

  1. There is no eye speed or bearing information, but we can compute both (acknowledging some loss of accuracy) by comparing two adjacent timestamped entries; a sketch of this appears just below.
  2. Date and time are broken up to ease ingestion but any modern database will support a true datetime type that will enable better -- and easier -- queries.
  3. The wind radii values are "nice" but not terribly useful for a geo-enabled database (MongoDB or PostGIS or even esri/ArcGIS for that matter). Here's how they work (again, the docs have even more on this): for each of three wind speed thresholds (34, 50, and 64 knots), a row carries four values giving the maximum extent, in nautical miles, of winds at or above that speed in each of the four quadrants (NE, SE, SW, NW) around the center.
So how do we turn 4 scalars into a GeoJSON ring?
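First, though, the aside promised in item 1: a minimal Python sketch (function and argument names invented for illustration) that derives eye speed and bearing from two adjacent fixes using the standard haversine and forward-azimuth formulas:

import math

#  Approximate eye speed (km/h) and bearing (degrees true) between two fixes.
#  lat/lon are signed decimal degrees; t1, t2 are datetime objects.
def speed_and_bearing(lat1, lon1, t1, lat2, lon2, t2):
    R = 6371.0                                     # mean earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    # haversine great-circle distance
    a = math.sin(dlat/2)**2 + math.cos(p1) * math.cos(p2) * math.sin(dlon/2)**2
    dist_km = 2 * R * math.asin(math.sqrt(a))
    # forward azimuth from fix 1 to fix 2
    y = math.sin(dlon) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlon)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    hours = (t2 - t1).total_seconds() / 3600.0
    return dist_km / hours, bearing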

Converting Wind Radii into GeoJSON

The Model

There are many models one could choose to turn 4 points, each guaranteed to exist in a different quadrant, into some sort of a polygon or smoothed shape. Many weather-domain-specific inputs such as viscosity and Coriolis effect could go into the model.

We're not going to do that here.

Instead, we will take a very simple approach. Consider this wind radii diagram and values:

             |
             a...
             |   ..
           . f     NE
         NW  |       .
        .    |       .
 -------e----+----c--b-------
             |    .
             |  SE
             d.
             |
             |
NE = 11, SE = 6, SW = 0, NW = 8
We observe that the four max extents by themselves would yield a crude, lopsided quadrilateral, so we "smooth" it out by creating an 8-sided polygon, which seems to be the right number of sides to make a good shape. Rules (a sketch implementing them follows this list):
  1. Each "diagonal" value e.g. bearing 45 for NE and 135 for SE will use the given max extent value.
  2. Each "crosshair" value e.g. N and S will be the average of the neighboring values. In the example above, this means that the E point in the polygon will have a value of (b + c)/2 or (11+6)/2 = 8.5
  3. Where no max extent exists for a quadrant, it will be fudged by taking the neighboring max extents, averaging them, then applying an attenuation factor.
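Here is a minimal Python sketch of these rules. The attenuation factor and the crude nautical-mile-to-degrees conversion are assumptions for illustration; the real loader handles units more carefully:

import math

ATTENUATION = 0.4   # assumed attenuation factor; the text does not fix a value

#  Turn the 4 quadrant max-extent radii (nautical miles) into a closed,
#  8-sided ring of [lon, lat] points per the rules above.  A sketch only.
def wind_ring(center_lon, center_lat, ne, se, sw, nw):
    diag = [ne, se, sw, nw]                        # bearings 45, 135, 225, 315
    # Rule 3: a zero quadrant borrows the average of its neighbors, attenuated
    fudged = [r if r > 0 else ATTENUATION * (diag[i-1] + diag[(i+1) % 4]) / 2.0
              for i, r in enumerate(diag)]
    spokes = []
    for i in range(4):
        # Rule 2: crosshair point (N, E, S, W) is the average of its neighbors
        spokes.append((i * 90.0, (fudged[i-1] + fudged[i]) / 2.0))
        # Rule 1: diagonal point uses the given max extent
        spokes.append((i * 90.0 + 45.0, fudged[i]))
    pts = []
    for bearing, r_nm in spokes:
        rad = math.radians(bearing)
        d = r_nm / 60.0                            # crude nm -> degrees of arc
        pts.append([center_lon + d * math.sin(rad) / math.cos(math.radians(center_lat)),
                    center_lat + d * math.cos(rad)])
    pts.reverse()                                  # RFC 7946 wants counterclockwise exteriors
    pts.append(pts[0])                             # close the GeoJSON ring
    return pts

Running the example values through these rules gives E = (11+6)/2 = 8.5 and a fudged SW of 0.4 × (6+8)/2 = 2.8.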

The Implementation

The implementation is here. It will read a HURDAT2 file, pre-process all the material (including nasty miles/meters/km/nautical-mile conversions and scrubbing), build the "wind rings," and load it into MongoDB.

We take advantage of "holes" in the GeoJSON spec in our modeling of the concentric rings. For example, in the most complex case, where a 64 knot ring exists, 3 polygons are created as a MultiPolygon set. Holes give us a single z dimension for rendering purposes; in other words, there is no effective overlay of the 64 knot ring on top of the 50, etc. They "nest" inside each other. Due to the rule that inner holes cannot touch or share a side with the parent, we have to get a little creative and ensure that the bigger ring is always slightly bigger than the ring it is trying to logically contain. This can lead to potential "dark spots" where the outer edge of the inner ring is sufficiently far from the inner edge of the outer ring that no points will satisfy a lookup. We acknowledge this limitation at this time.
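Schematically, the nesting looks like this (coordinates elided; each hole is the slightly enlarged copy of the next ring in):

	"type" : "MultiPolygon",
	"coordinates" : [
	    [ [ ...34 knot outer ring... ], [ ...hole: slightly enlarged 50 knot ring... ] ],
	    [ [ ...50 knot outer ring... ], [ ...hole: slightly enlarged 64 knot ring... ] ],
	    [ [ ...64 knot ring, no hole... ] ]
	]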

A safer but less interesting option is to create rings with no holes, layer them, and use software like GeoTools to perform the diff/union operations.
In either event, it is important to recognize that if you use $geoIntersects to find windRings that touch a given point, you must still do a bit of work to determine exactly which ring is involved. This is essentially identical to matching elements in an array of ordinary scalars in MongoDB: if any element matches, the whole doc is returned, and you do not immediately know which elements in the array matched. The standard approach is the double-match pattern:

  1. $match to find the right docs on an index-optimized basis. This is very important because the first pass at the data should substantially reduce the search space for subsequent operators in the pipeline.
  2. $unwind the coordinates field of the MultiPolygon to create individual polygons
  3. $match again to find the exact ring matches -- which might be more than one!
Here is a representative example:
/*
Approach with MultiPolygon

So in unwinding the coordinates array of the MultiPolygon, we "break"
the MultiPolygon definition because we now have single Polygons; that is,
there is one less level of nested arrays.  MongoDB will
silently ignore the improperly constructed MultiPolygon and nothing will match.

So...  we have to be clever and change the name of the type to Polygon!
*/

var pt = [ -72.5, 40.5 ];

var theMatch = {$match: {loc: {$geoIntersects:
    {$geometry: {type: "Point", coordinates: pt }}}
}};

// collname is the name of your wind ring collection
db[collname].aggregate([
  theMatch                               // first pass, index optimized
 ,{$unwind: "$loc.coordinates"}          // MultiPolygon -> one doc per polygon
 ,{$addFields: {"loc.type":"Polygon"}}   // ah HA!  Overwrite the type field!
 ,theMatch                               // second pass: exact ring matches
]);

FeatureCollection vs. GeometryCollection

The converter code creates, among many other things, a GeometryCollection object and places it into a field called windRings:
	"windRings" : {
		"type" : "GeometryCollection",
		"geometries" : [
			{
				"type" : "Point",
				"coordinates" : [
					-76.8,
					16.7
				]
			},
			{
				"type" : "MultiPolygon",
				"coordinates" : [
					[
		    ...
We use GeometryCollection because ... we have to. MongoDB does not recognize FeatureCollection as an indexable GeoJSON type. But this isn't bad as we'll see in just a bit.

It's good to have the flexibility of putting the center point and a MultiPolygon in one indexable field (a geometries array with 2 elements). Should our needs change, especially around nesting and holes, we could easily switch this to a GeometryCollection with a Point and up to three discrete Polygons, for a total of 4 elements in the array. All will be indexed.
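Creating that index is a one-liner; here is a pymongo sketch (the database and collection names are assumed for illustration):

from pymongo import MongoClient, GEOSPHERE

#  "test.hurricanes" is an assumed namespace for illustration
coll = MongoClient()['test']['hurricanes']

#  One 2dsphere index covers the Point and the MultiPolygon (or discrete
#  Polygons) inside the windRings GeometryCollection:
coll.create_index([('windRings', GEOSPHERE)])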

The web is filled with commentary on the benefits and differences of FeatureCollection vs. GeometryCollection but I believe it comes down to context, history, and tool compatibility.

Before there was GeoJSON and MongoDB, there was Esri and ArcGIS and shapefiles. A standard unto itself, almost all academic and government geodata since the 1980s (yep) has used these formats. But owing to their age, shapefiles and the data environment in ArcGIS are not friendly to rich, flexible data. Cross-vendor data integration was also in a very different state 30 years ago. So shapefiles had to carry both geo data and other data, like properties for visualization and simple management purposes. This created a paradigm where the geo file and data structures were the boss, and any other auxiliary data had to be tucked into a standard slot in those structures.

FeatureCollection very much parallels this design, largely for compatibility. It is a single object that contains an array of Feature objects.
Each Feature MUST contain these fields:

  1. type, which must have the value Feature (case sensitive!)
  2. geometry, which contains the same substructures as a GeometryCollection
  3. properties, which must be a dictionary of name:value pairs. The dictionary can be empty but it must be present:
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-94.8, 28.0]},
    "properties": {} }
  ]
}
The irritating part of this spec is that properties must be present. Now, in truth, as long as the values in the properties object are simple scalars like numbers and strings, many tools can take these simple name:value pairs and display and even filter or graph them. If you are very lucky, these tools might even be able to interpret an ISO 8601 date string and do something intelligent in the date/time space. The likelihood of such tools being able to intelligently deal with arrays of simple scalars drops off considerably, however, and almost none are designed to work with complex rich structures (arrays of arrays of objects, etc.).

With modern programming languages, the MongoDB document model, and a fully GeoJSON-oriented solution, you no longer have to "trap" auxiliary data with the geo data. In the windRings example above, note how no properties are required. We are free to design any structure we want, including, if we so desire, an additional field called properties that may help describe the contents of the geodata.

But the real power emerges when richly structured data is stored "alongside" the geo data. The document model and MongoDB make this very easy and performant to do, in a way that simply was not possible using the 30-year-old ArcGIS-era design parameters:
        "windRingModel": {
		"version": 3,
		"params": [
		    {"name":"R34", "speed": {v:34, m:"knots"}, bias: [ 0.4, 0.23, 0.12] },
		    {"name":"R50", "speed": {v:50, m:"knots"}, bias: [ 0.8, 0.40, 0.22] }
		 ]
	"windRings" : {
		"type" : "GeometryCollection",
		"geometries" : [
			{
				"type" : "Point",
				"coordinates" : [
					-76.8,
					16.7
				]
			},
			{
				"type" : "MultiPolygon",
				"coordinates" : [
					[
		    ...
In the example above, all sorts of rich information including arrays can be held "alongside" the main windRing geo data carrier.

And in The End ... You Have Not Lost Anything!

Most important: The underlying shapes (Point, Polygon, LineString, etc.) are the same for both FeatureCollection and GeometryCollection so the skeleton is the same.

It is very likely that one might create a FeatureCollection for the purpose of driving a visualization tool outside of the software and/or MongoDB GeoJSON manipulation space, such as geojson.io. In this case, depending on simplicity, clarity, security of data, and other factors, there might be multiple sets of properties you want to expose with the geo data. This suggests an approach where a program starts with the GeometryCollection and other data in the document and assembles Features for the FeatureCollection, the primary task being setting up the required properties field as appropriate for the target consumer and/or rendering technology. Here is an example:

import json

#  Get a doc from MongoDB (coll is a pymongo Collection, params is your query):
item = coll.find_one(params)

fwrap = {}
fwrap['type'] = "FeatureCollection"

fcoll = []

#  This is the modest amount of bespoke code.  Basically, you are just
#  iterating through the GeometryCollection and assembling the aux data.
#  Note that the GeoJSON shape in GeometryCollection is lifted straight
#  into the 'geometry' field of the Feature object

wrm = item['windRingModel']['params']
for i, shape in enumerate(item['windRings']['geometries']):
    onef = {"type": "Feature"}
    onef['geometry'] = shape   # Easy! set the whole shape at once!
    if i < len(wrm):
        # carry the matching ring model data alongside the shape
        onef['properties'] = {"name": wrm[i]['name'], "speed": wrm[i]['speed']}
    else:
        onef['properties'] = {}   # properties is required, even if empty
    fcoll.append(onef)

fwrap['features'] = fcoll

print(json.dumps(fwrap))

# The output is copy-and-pasteable in geojson.io

Going in the other direction, from FeatureCollection to something directly consumable by MongoDB is even easier; just use the very popular JSON processor jq:

    "Unwrap" the features array and emit each shape as CR-delimited JSON.
    Important to use the -c option here to make it CR-delimited:
    $ jq -c '.features[] | del(.type)' featurecollection.json > geojson.json

    Now simply import!  Use of skipRow is not required but helpful when a very small
    number of shapes have bad geometries and stop the whole load.
    $ mongoimport --uri connectionString -d test -c myCollection --parseGrace skipRow geojson.json
A complete example, mktrack.py, is on the GitHub site with the loader implementation.

Like this? Dislike this? Let me know

