Hurricanes, GeoJSON, and MongoDB
I thought I would explore the latest hurricane data available from NOAA.
This is a comprehensive rant that covers a number of things:
- The HURDAT2 hurricane data format
- Processing this data into something highly useful inside MongoDB, exploiting
its geospatial capabilities
- Exploring the subtle differences and uses of the GeoJSON FeatureCollection
vs. GeometryCollection, using both
the latest (2016) GeoJSON spec,
RFC 7946,
and the
MongoDB 4.0 docs
The HURDAT2 format
The HURDAT2 format and data for all Atlantic hurricanes from 1851 to
2017 can be found
here.
The Atlantic and Pacific formats are in fact structurally the same; only some
of the prefixes and codes vary. In our examples here, I will be using
the Atlantic dataset
hurdat2-1851-2017-050118.txt.
Data from earlier hurricanes is obviously far less granular and rich than
that from more recent events, especially from 2004 onward, when quadrant wind radii
values and additional event data like landfall time were recorded.
As with most public datasets, HURDAT2 is designed for "maximum ingestibility"
and is a fixed-width but also comma-separated file. Below is a quick sample:
AL122005, KATRINA, 34,
20050823, 1800, , TD, 23.1N, 75.1W, 30, 1008, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
20050824, 0600, , TD, 23.8N, 76.2W, 30, 1007, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
20050824, 1200, , TS, 24.5N, 76.5W, 35, 1006, 60, 60, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
20050825, 0600, , TS, 26.1N, 78.4W, 50, 997, 60, 60, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0,
20050825, 1800, , TS, 26.2N, 79.6W, 60, 988, 70, 70, 50, 60, 25, 25, 20, 20, 0, 0, 0, 0,
20050826, 0000, , HU, 25.9N, 80.3W, 70, 983, 70, 70, 50, 40, 20, 20, 20, 20, 10, 10, 10, 10,
...
The reader program assumes the file starts with a header row. The docs are
very clear on the components and codes, but of critical importance is the third
field, which specifies the number of data rows to follow. In the example
above, we show only 6 of the 34 rows. Multiple storm events can be stored in one
file by adding additional header and data rows.
Reading ends simply when an attempt to read the next header yields EOF.
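To make this concrete, here is a minimal reader sketch in Python. It is an
illustration under assumptions, not the actual loader code: the function name
read_hurdat2 and the use of the csv module are mine.

import csv

def read_hurdat2(path):
    # Hypothetical minimal reader: yields (header, rows) per storm.
    # The third header field says how many data rows follow; the outer
    # loop ends naturally when the read of the next header yields EOF.
    with open(path) as f:
        rdr = csv.reader(f)
        for header in rdr:
            header = [h.strip() for h in header]
            nrows = int(header[2])        # e.g. 34 for KATRINA above
            rows = [[c.strip() for c in next(rdr)] for _ in range(nrows)]
            yield header, rows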
This data can be easily slurped into practically any database and even some
geo visualizers -- but can we do even more? Certainly!
- There is no eye speed or bearing information, but we can compute both
(acknowledging some loss of accuracy) by comparing two adjacent timestamp
entries; see the sketch just after this list.
- Date and time are broken up to ease ingestion but any modern database
will support a true datetime type that will enable better -- and easier --
queries.
- The wind radii values are "nice" but not terribly useful for a geo-enabled
database (MongoDB or PostGIS or even Esri/ArcGIS for that matter).
Here's how they work (again, the docs have even more on this):
- Each data line has 3 sets of wind speed radii with 4 quadrant values (NE, SE, SW, and NW, in that order) each
- The values are not wind speed; they are maximum distance in
nautical miles from the eye where a certain speed of wind can be experienced.
- The speeds for each radii set are "fixed": set 1 is 34 knots, set 2 is
50 knots, and set 3 is 64 knots.
- A young system with winds <34 knots will have no wind radii.
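As promised in the first bullet above, here is a minimal sketch of computing
eye speed and bearing from two adjacent track entries. The haversine formula is
one reasonable choice; the function name and the use of Python's datetime for
combining the split date/time columns are my assumptions, not the loader's
actual code.

import math
from datetime import datetime

EARTH_RADIUS_NM = 3440.065   # mean Earth radius in nautical miles

def speed_and_bearing(lat1, lon1, t1, lat2, lon2, t2):
    # Haversine great-circle distance between two adjacent track points
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2)**2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2)**2
    dist_nm = 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))

    # Initial bearing, degrees clockwise from true north, from point 1 to point 2
    y = math.sin(dlam) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlam)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360

    hours = (t2 - t1).total_seconds() / 3600.0
    return dist_nm / hours, bearing      # knots, degrees true

# The split date and time columns combine into a true datetime easily:
t1 = datetime.strptime("20050823" + "1800", "%Y%m%d%H%M")
t2 = datetime.strptime("20050824" + "0600", "%Y%m%d%H%M")
print(speed_and_bearing(23.1, -75.1, t1, 23.8, -76.2, t2))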
In case you're wondering, maximum wind speed and minimum pressure at the
center of the storm are in data columns 7 and 8, respectively, above.
Here's a visualization:
|
|
|
|
|
----------C----------
|
|
|
|
|
C has winds up to 34 knots
- As column 7 grows beyond 34 knots, the 34 knot "ring" kicks in. This
makes sense because a 40 knot wind in the center doesn't just drop to zero. The
outer edges of the storm have slower but still dangerous winds. Remember, this
is not a real ring! The radii set has 4 distance values, one for each compass
quadrant:
|
|
|
|
+---3
--------| C |--------
+---+
|
|
|
|
C has winds >=34
Important:
We show this as a square
for simplicity, but it is not required that all four quadrant values
be the same; in fact, sometimes one quadrant, e.g. NE, can have quite an
extent while all the other quadrants are 0.
- As the system grows, the center speed picks up and the 34 knot ring expands:
|
|
|
+-----3
| |
-------| C |--------
| |
+-----+
|
|
|
C is now at, say, 45 knots. The 34 knot ring gets bigger.
- At >= 50 knots, we introduce a new, second ring.
Intuitively, the slower wind speed radius must be larger than the higher speed
radius in any given quadrant.
|
|
+---------3
| |
| +--5 |
-----| |C | |--------
| +--+ |
| |
+---------+
|
|
C is at 50 knots, so we create a new inner ring for 50 knot winds.
- Above 64 knots, we introduce a third ring. The 34 knot and the 50
knot rings continue to expand out.
- Again, these rings in HURDAT2 are conceptual. They are described
only by 4 point values, one in each compass quadrant, and these are simply
the maximum extent in nautical miles where a wind speed is experienced.
So how do we turn 4 scalars into a GeoJSON ring?
Converting Wind Radii into GeoJSON
The Model
There are many models one could choose to turn 4 points, each guaranteed to
exist in a different quadrant, into some sort of a polygon or smoothed
shape. Many weather-domain-specific inputs such as viscosity and Coriolis
effect could go into the model.
We're not going to do that here.
Instead, we will take a very simple approach. Consider this wind radii diagram
and values:
|
a...
| ..
. f NE
NW | .
. | .
-------e----+----c--b-------
| .
| SE
d.
|
|
NE = 11, SE = 6, SW = 0, NW = 8
We observe that:
- The quadrant "wedges" have some pretty scary discontinuities
at the border!
- Physically, how could SW be 0 and NW = 8?
The wind just doesn't ... stop.
So we "smooth" it out by creating an 8-sided polygon, which
seems to be the right number of sides for a good shape. Rules:
- Each "diagonal" value e.g. bearing 45 for NE and 135 for SE will
use the given max extent value.
- Each "crosshair" value e.g. N and S will be the average
of the neighboring values. In the example above, this means that
the E point in the polygon will have a value of
(b + c)/2 or (11+6)/2 = 8.5
- Where no max extent exists for a quadrant, it will be fudged
by taking the neighboring max extents, averaging them, then
applying an attenuation factor.
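Here is a minimal sketch of those rules. The function name, the flat-earth
degree conversion, and the default attenuation factor are my assumptions; the
real converter also handles unit conversions and scrubbing.

import math

NM_PER_DEG_LAT = 60.0   # rough nautical miles per degree of latitude

def wind_ring(center_lon, center_lat, ne, se, sw, nw, attenuation=0.5):
    # Fudge an empty quadrant from the average of its neighbors, attenuated.
    # (The attenuation value of 0.5 is an assumption for this sketch.)
    orig = [ne, se, sw, nw]
    quads = list(orig)
    for i, q in enumerate(orig):
        if q == 0:
            quads[i] = attenuation * (orig[i - 1] + orig[(i + 1) % 4]) / 2.0
    ne, se, sw, nw = quads

    # 8 spokes: (bearing in degrees clockwise from north, extent in NM).
    # Diagonals use the quadrant values directly; crosshairs average their
    # neighbors, e.g. E = (NE + SE)/2 = (11 + 6)/2 = 8.5 in the example above.
    spokes = [(0, (nw + ne) / 2), (45, ne), (90, (ne + se) / 2), (135, se),
              (180, (se + sw) / 2), (225, sw), (270, (sw + nw) / 2), (315, nw)]

    ring = []
    for bearing, r in spokes:
        b = math.radians(bearing)
        dlat = (r * math.cos(b)) / NM_PER_DEG_LAT
        dlon = (r * math.sin(b)) / (NM_PER_DEG_LAT * math.cos(math.radians(center_lat)))
        ring.append([center_lon + dlon, center_lat + dlat])
    ring.append(list(ring[0]))   # GeoJSON rings must close on themselves
    return ring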
The Implementation
The implementation is here.
It will read a HURDAT2 file,
pre-process all the material (including nasty miles/meters/KM/nautical
mile conversions and scrubbing), build the "wind rings," and load it
into MongoDB. We take advantage of "holes" in the GeoJSON spec in
our modeling of the concentric rings. For example, in the most
complex case where a 64 knot ring exists, then 3 polygons are created
as a MultiPolygon set:
- A 64 knot ring with no hole
- A larger 50 knot ring with the 64 knot ring as its hole
- An even larger 34 knot ring with the 50 knot ring as its hole
Holes give us a single z dimension for rendering purposes.
In other words, there is no effective overlay of the 64 knot ring
on top of the 50, etc. They "nest" inside each other. Due to
the rule that inner holes cannot touch or share a side with the
parent, we have to get a little creative and ensure that the
bigger ring is always slightly bigger than the ring it is trying to
logically contain. This could lead to potential "dark spots" where
the outer edge of the inner ring is sufficiently far from the inner
edge of the outer ring that no points will satisfy a lookup. We
acknowledge this limitation for now.
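Structurally, the assembly of that MultiPolygon might look like this sketch,
assuming three already-built closed rings (the slight outward scaling that
keeps a hole from touching its parent is omitted here):

def nested_wind_rings(ring34, ring50, ring64):
    # Each polygon is [outer_ring, hole1, ...]. The slower, larger ring
    # carries the next faster, smaller ring as its hole, so the three
    # polygons nest with no effective overlay.
    return {"type": "MultiPolygon",
            "coordinates": [[ring64],           # 64 knot ring, no hole
                            [ring50, ring64],   # 50 knot ring, 64 knot hole
                            [ring34, ring50]]}  # 34 knot ring, 50 knot hole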
A safer but less interesting option
is to create rings with no holes, layer them, and use software like
GeoTools
to perform the diff/union operations.
In either event, it is important to recognize that if you use
$geoIntersects to find windRings that touch a given point, you
must still do a bit of work to determine exactly which ring is
involved. This is essentially identical to matching material in an array
of ordinary scalars in MongoDB: if any element matches, the whole doc is
returned and you do not immediately know which elements in the array exactly
match. The standard approach is the double match pattern:
- $match to find the right docs on an index-optimized basis. This is very important because the first pass at the data should substantially reduce
the search space for subsequent operators in the pipeline.
- $unwind the coordinates field of the MultiPolygon to create individual polygons
- $match again to find the exact ring matches -- which might be
more than one!
Here is a representative example:
/*
Approach with MultiPolygon:
In unwinding the coordinates array of the MultiPolygon, we "break"
the MultiPolygon definition because we now have single Polygons; that is,
there is one less level of nested arrays. MongoDB will
silently ignore the improperly constructed MultiPolygon and nothing will match.
So... we have to be clever and change the name of the type to Polygon!
*/
var pt = [ -72.5, 40.5 ];
var theMatch = {$match: {loc: {$geoIntersects:
    {$geometry: {type: "Point", coordinates: pt}}}
}};
db[collname].aggregate([
    theMatch                               // first pass, index optimized
    ,{$unwind: "$loc.coordinates"}         // MultiPolygon -> per-polygon coordinates
    ,{$addFields: {"loc.type":"Polygon"}}  // ah HA! Overwrite the type field!
    ,theMatch                              // second pass: exact ring matches
]);
FeatureCollection vs. GeometryCollection
The converter code creates, among many other things, a GeometryCollection
object and places it into a field called windRings:
"windRings" : {
"type" : "GeometryCollection",
"geometries" : [
{
"type" : "Point",
"coordinates" : [
-76.8,
16.7
]
},
{
"type" : "MultiPolygon",
"coordinates" : [
[
...
We use GeometryCollection because ... we have to. MongoDB does not
recognize FeatureCollection as an indexable GeoJSON type. But this
isn't bad as we'll see in just a bit.
It's good to have
the flexibility of putting the center point and a MultiPolygon in
one indexable field (a geometries array with 2 elements). Should our needs change,
especially around nesting and holes, we could easily switch this to a
GeometryCollection with a Point and up to three discrete
Polygons for a total of 4 elements in the array. All will be
indexed.
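For example, with pymongo (the database and collection names here are
assumptions):

from pymongo import MongoClient, GEOSPHERE

coll = MongoClient()["test"]["hurricanes"]

# One 2dsphere index covers every shape inside the GeometryCollection:
# the center Point and the MultiPolygon wind rings alike.
coll.create_index([("windRings", GEOSPHERE)])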
The web is filled with commentary on the benefits and differences of
FeatureCollection vs. GeometryCollection but I believe it
comes down to context, history, and tool compatibility.
Before there was GeoJSON and MongoDB, there were Esri and ArcGIS and shapefiles.
Standards unto themselves, almost all academic and government geodata since the 1980s
(yep) has used them. But owing to their age, shapefiles and the
data environment in ArcGIS are not friendly to rich, flexible data. Cross-vendor
data integration was also in a very different state 30 years ago. So
shapefiles had to carry both geo data and other data like properties
for visualization and simple management purposes. This created a paradigm
where the geo file and data structures were the boss and any other auxiliary
data had to be tucked into a standard slot in those structures.
FeatureCollection very much parallels this design for the purposes of
some amount of compatibility. It is a single object that contains an array
of Feature objects.
Each Feature MUST contain these fields:
- type, which must have the value Feature (case sensitive!)
- geometry, which contains the same substructures as a
GeometryCollection
- properties, which must be a dictionary of name:value pairs.
The dictionary can be empty but it must be present:
{"type": "FeatureCollection",
"features": [
{"type": "Feature",
"geometry": {"type": "Point", "coordinates": [-94.8, 28.0]},
"properties": {} }
]
}
The irritating part of this spec is that properties must be
present. Now, in truth, as long as the values in the properties object
are simple scalars like numbers and strings, many tools can take these simple
name:value pairs and
display and even filter or graph them. If you are very lucky, these tools
might even be able to interpret an ISO8601 datestring and do something intelligent
in the date/time space. The likelihood of such tools being
able to intelligently deal with arrays of simple scalars drops off
considerably, however, and almost none are designed to work with complex rich
structures (arrays of arrays of objects, etc.).
With modern programming languages, the MongoDB document model, and a fully
GeoJSON-oriented solution, you no longer have to "trap" auxiliary data with
the geo data. In the windRings example above, note how no properties
are required. We are free to design any structure we want -- including,
if we so desire, adding an additional field called properties that
may help describe the contents of the geodata.
But the real power emerges when richly structured data is stored "alongside"
the geo data. The document model and MongoDB make this very
easy and performant to do in a way that simply was not possible using the 30
year old ARCgis era design parameters:
"windRingModel": {
"version": 3,
"params": [
{"name":"R34", "speed": {v:34, m:"knots"}, bias: [ 0.4, 0.23, 0.12] },
{"name":"R50", "speed": {v:50, m:"knots"}, bias: [ 0.8, 0.40, 0.22] }
]
"windRings" : {
"type" : "GeometryCollection",
"geometries" : [
{
"type" : "Point",
"coordinates" : [
-76.8,
16.7
]
},
{
"type" : "MultiPolygon",
"coordinates" : [
[
...
In the example above, all sorts of rich information including arrays can be held
"alongside" the main windRing geo data carrier.
And in The End ... You Have Not Lost Anything!
Most important: The underlying shapes (Point, Polygon,
LineString, etc.) are the same for both FeatureCollection
and GeometryCollection so the skeleton is the same.
It is likely that you will create a FeatureCollection for the
purpose of driving a visualization tool outside of your software and/or the
MongoDB GeoJSON manipulation space, such as geojson.io. In this
case, depending on simplicity, clarity, data security, and other factors, there
might be multiple sets of properties you want to expose with the geo
data. This suggests an approach where a program starts with the GeometryCollection
and other data in the document and assembles features for the
FeatureCollection, with the primary task being setting up the required
properties field as appropriate for the target consumer and/or rendering
technology. Here is an example:
import json

# Get a doc from MongoDB (coll is an open pymongo collection):
item = coll.find_one(params)

fwrap = {}
fwrap['type'] = "FeatureCollection"
fcoll = []

# This is the modest amount of bespoke code. Basically, you are just
# iterating through the GeometryCollection and assembling the aux data.
# Note that the GeoJSON shape in GeometryCollection is lifted straight
# into the 'geometry' field of the Feature object.
wrm = item['windRingModel']['params']
n = len(item['windRings']['geometries'])
for i in range(0, n):
    onef = {"type": "Feature"}
    onef['geometry'] = item['windRings']['geometries'][i]  # Easy! Set the whole shape at once!
    onef['properties'] = {"speed": wrm[i]['speed'], "name": wrm[i]['name']}
    fcoll.append(onef)
fwrap['features'] = fcoll

print(json.dumps(fwrap))
# The output is copy-and-pasteable into geojson.io
Going in the other direction, from FeatureCollection to something directly
consumable by MongoDB, is even easier: just use the very popular JSON processor jq
to "unwrap" the features array and emit each shape as newline-delimited JSON.
It is important to use the -c option here to get one complete document per line:
$ jq -c '.features[] | del(.type)' featurecollection.json > geojson.json
Now simply import! Use of --parseGrace skipRow is not required but is helpful
when a very small number of shapes have bad geometries that would otherwise stop the whole load.
$ mongoimport --uri connectionString -d test -c myCollection --parseGrace skipRow geojson.json
A complete example mktrack.py is on the
github site
with the loader implementation.
Like this? Dislike this? Let me know