Sat 16 April 2016
Working with geospatial vector data typically involves manipulating collections of features - points, lines and polygons with attributes. You might change their geometries, alter their properties or both. This is nothing new. Tools like this have been around since the first days of GIS.
Notice the essential role of many of these operations: taking vector data as input, doing some work and producing vector data as output.
While conceptually very simple, this logic often gets siloed, tied too closely to our specific implementations, formats, and systems.
The following is my take on the best practices for designing and building
your own vector processing modules using modern Python. The goals here are not primarily
performance but interoperability and composability.
GeoJSON guides the way
Using GeoJSON-like Feature mappings as a representation of simple features buys us a ton of interoperability.
It's not only a standard, it's the only one that fully specifies how to represent a feature - geometry plus attributes together - as a Python data structure.
Other standards specify file formats or data structures for geometries only.
Most Python modules that deal with geospatial data can speak GeoJSON-like data.
And if they don't, the data structure is easy to construct manually.
Let's take a look at our humble Feature (a minimal example; the properties are up to you):

```python
{'type': 'Feature',
 'properties': {'name': 'Example'},
 'geometry': {'type': 'Point',
              'coordinates': [-120.0, 42.0]}}
```
The geometry, the geographic component, is just nested iterables of lon, lat locations - you can represent points, lines, polygons or multi-part geometries. The properties dictionary holds non-geographic information about the feature, analogous to the "attribute table" in many GIS.
A quick note on the term "GeoJSON-like Feature mapping"... GeoJSON is a text serialization format. When we take GeoJSON and translate it into a Python data structure, it is no longer GeoJSON but a Python dictionary (mapping) which follows the semantics of a GeoJSON Feature. From here on out, I'll just refer to this GeoJSON-like Python data structure as a feature. If you're writing functions that work with vector data, they should accept and return features.
That's the convention, the golden rule of writing Python vector processing functions:
Functions should take features as inputs and yield or return features.
In other words, features in, features out. That's it. It's really that simple, and the simplicity buys you a great deal of potential.
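A minimal sketch of the convention (the function name and the added property are invented for illustration):

```python
def tag_processed(features):
    """Take an iterable of features, yield new features.

    A toy example of the features in, features out convention:
    copy each feature and add a 'processed' property.
    """
    for feature in features:
        new_feature = feature.copy()
        # Build a new properties dict rather than mutating the input
        new_feature['properties'] = dict(feature['properties'], processed=True)
        yield new_feature
```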
The IO Sandwich
Functions which fit this convention will not read or write to anything outside of locally-scoped variables.
Does your function need to read from a file or write to the network in addition to processing features? Resist the urge. Why should one function be responsible for multiple tasks? We're striving for functions that do one thing: process vector features.
All the data your function needs should be passed in as arguments. Note that this is very different from passing in a file path and doing the reading and writing of data inside your function:

```python
features = read_features("/path/to/shapefile.shp")
new_features = process_features(features)
```
You might be concerned about memory. But don't worry, well-behaved Python libraries can use generators to load the data as needed.
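For instance, a reader built on a generator touches only one feature at a time. Here's a sketch assuming a newline-delimited GeoJSON file, one Feature per line (`read_features` is a hypothetical helper, not a library function):

```python
import json

def read_features(path):
    """Lazily yield features from a newline-delimited GeoJSON file.

    Only one line is parsed at a time, so memory use stays flat
    no matter how large the file is.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```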
Another way to picture it is that your application should build an IO sandwich with all of the reading and writing happening outside of your processing function.
Read Shapefile into Features --> process_features --> Write Features to Shapefile
That way anyone can use the same function with different inputs and outputs:
Read Web Service into Features --> process_features --> Write Features to PostGIS
Processing functions should not care where their input features come from or where the output features are going.
As long as process_features takes and returns features, any number of combinations are possible.
This not only decouples IO but also lets us compose processes together:
Read Features --> process1 --> process2 --> process3 --> Write Features
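Because each step is a generator, composition is just nesting calls; nothing executes until the result is iterated. A sketch with two invented steps:

```python
def set_name(features):
    """Hypothetical step 1: add a 'name' property to each feature."""
    for f in features:
        nf = f.copy()
        nf['properties'] = dict(f['properties'], name='processed')
        yield nf

def round_coords(features):
    """Hypothetical step 2: round point coordinates to 2 decimal places."""
    for f in features:
        nf = f.copy()
        nf['geometry'] = {
            'type': f['geometry']['type'],
            'coordinates': [round(c, 2) for c in f['geometry']['coordinates']]}
        yield nf

def pipeline(features):
    # Compose: features flow through step 1, then step 2, one at a time
    return round_coords(set_name(features))
```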
When possible, you should strive for pure functions; avoid mutating data and return a clean copy.
Unless you have specific reason, leave the original feature intact except for the thing your function is expected to manipulate. For instance, if your function just alters the geometry, don't drop or change existing properties.
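For example, a geometry-only transform might look like this sketch (the eastward shift is a stand-in for a real operation like reprojection):

```python
def shift_east(feature, degrees=1.0):
    """Return a new feature with the point geometry shifted east.

    The input feature's properties pass through untouched, and the
    input feature itself is never mutated.
    """
    lon, lat = feature['geometry']['coordinates']
    new_feature = feature.copy()
    new_feature['geometry'] = {'type': 'Point',
                               'coordinates': [lon + degrees, lat]}
    return new_feature
```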
There are some cases where it makes sense to collect your features into a collection and return the entire thing at once. This will generally occur if the features are not independent. In many cases though, your features will largely be independent and can be processed one-by-one. For these situations, it makes sense to use a generator (i.e. yield each feature instead of returning a list of them).
Finally, you should aim to make your features serializable. You should be able to json.dump() the output features. The properties member should not contain nested dicts, which might confuse GIS formats that require a flat structure. And if possible, avoid extending the JSON with extra elements outside of the GeoJSON specification.
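A cheap way to enforce serializability is a guard step at the end of a pipeline (a hypothetical helper, not part of any library):

```python
import json

def ensure_serializable(features):
    """Yield features, verifying each one round-trips through JSON.

    json.dumps raises TypeError on anything a GeoJSON consumer
    couldn't handle, so problems surface early.
    """
    for feature in features:
        json.dumps(feature)  # raises TypeError if not serializable
        yield feature
```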
In this simple example, we'll write a single vector processing function that buffers a geometry by a specified distance.
Taking points as input, for example, and buffering them by 10 units.
Here is the core processing function, which follows the features in, features out convention:

```python
from shapely.geometry import shape

def buffer(features, buffer=1.0):
    """Buffer each feature by the specified units"""
    for feature in features:
        # Convert to a shapely geometry to operate on it
        geom = shape(feature['geometry'])
        geom = geom.buffer(buffer)
        # Copy the feature so we don't mutate the input
        new_feature = feature.copy()
        new_feature['geometry'] = geom.__geo_interface__
        yield new_feature
```
Then we could use it in our IO sandwich by reading features from a shapefile and outputting the features as GeoJSON on stdout. Here's what our Python interface looks like:

```python
import fiona  # for input
import json  # for output

from process import buffer

with fiona.open("data/points.shp") as src:
    for feature in buffer(src, 10.0):
        print(json.dumps(feature))
```
So the Python interface is looking good. What if we wanted to use it in a command line interface? Luckily, click and cligj have the input covered. The @cligj.features_in_arg decorator reads in an iterable of features from a file, a FeatureCollection or a stream of Features.

```python
import json

import click
import cligj

from process import buffer

@click.command()
@cligj.features_in_arg
@click.argument('distance', type=float)
def buffer_cmd(features, distance):
    """Buffer features by the given distance, writing them to stdout"""
    for feature in buffer(features, distance):
        click.echo(json.dumps(feature))

if __name__ == "__main__":
    buffer_cmd()
```
We can then use it between fio cat and fio collect to process Features in a memory-efficient stream:
```
$ fio cat data/points.shp | python buffer_cmd.py 10 | fio collect > points_buffer.geojson
```
What about an HTTP interface? Flask provides us with a lightweight framework to turn our function into a web service:

```python
import json

from flask import Flask, request, Response

from process import buffer

app = Flask(__name__)

@app.route('/buffer/<distance>', methods=['POST'])
def buffer_service(distance):
    collection = request.get_json(force=True)
    distance = float(distance)
    new_features = list(buffer(collection['features'], distance))
    collection['features'] = new_features
    return Response(json.dumps(collection), mimetype='application/json')

if __name__ == '__main__':
    app.run()
```
This gives us a buffer web service to which you can post GeoJSON FeatureCollections and get back a buffered collection:
```
$ fio dump data/points.shp | \
    curl -X POST -d @- http://localhost:5000/buffer/10.0 > points_buffered.geojson
```
Writing your vector processing code to follow these simple conventions
enables great flexibility. You can use your code in a Python application,
a command line interface, an HTTP web service - all based on the same core processing functions.
Assuming you can write some glue code to express input and output as GeoJSON features,
this will work with any vector data source and is not constrained to a single context.
You can use this with any data, anywhere that supports Python. That's a pretty powerful concept,
all made possible by the simple convention of features in, features out.