perrygeo

About | Articles | CV

Processing vector features in Python

Sat 16 April 2016

Working with geospatial vector data typically involves manipulating collections of features - points, lines and polygons with attributes. You might change their geometries, alter their properties or both. This is nothing new. Tools like this have been around since the first days of GIS. Notice the essential role of many of these operations: taking vector data as input, doing some work and producing vector data as output. While conceptually very simple, this logic often gets siloed, tied too closely to our specific implementions, formats, and systems.

The following is my take on the best practices for designing and building your own vector processing modules using modern Python. The goals here are not primarily performance but interoperability and composability.

GeoJSON guides the way

Using GeoJSON-like Feature mappings as a representation of simple features buys us a ton of interoperability. It's not only a standard but the only one that can be translated to fully represent a feature as a python data structure. Other standards specify file formats or data structures for geometries only. Most Python modules that deal with geospatial data can speak GeoJSON-like data. And if they don't, the data structure is easy to construct manually. Let's take a look at our humble Feature

{
    'type': 'Feature',
    'properties': {
        'name': 'Example'},
    'geometry': {
        'type': 'Point',
        'coordinates': [-120.0, 42.0]}}

The geometry, the geographic component, is just iterables of lon, lat locations - you can represent points, lines, polygons or multis. The properties dictionary holds non-geographic information about the features, analagous to the "attribute table" in many GIS.

A quick note on the term "GeoJSON-like Feature mapping"... GeoJSON is a text serialization format. When we take GeoJSON and translate it into a python data structure it is no longer GeoJSON but a python dictionary (mapping) which follows the semantics of a GeoJSON Feature. From here on out, I'll just refer to this GeoJSON-like python data structure as a feature. If you're writing functions that work with vector data, they should accept and return features.

That's the convention, the golden rule of writing Python vector processing functions

Functions should take features as inputs and yield or return features.

In other words, features in, features out. That's it. It's really that simple, and the simplicity buys you a great deal of potential.

The IO Sandwich

Functions which fit this convention will not read or write to anything outside of locally-scoped variables. Does your function need to read from a file or write to the network in addition to processing features? Why should one function be responsible for doing multiple tasks? We're striving for functions that do one thing - process vector features.

All the data your function needs should be passed in as arguments. Note that this is very different than passing in a file path and doing the reading and writing of data within your function:

# BAD
process_features("/path/to/shapefile.shp", output="out.shp")

# GOOD
features = read_features("/path/to/shapefile.shp")
new_features = process_features(features)
write_features(new_features, output="out.shp")

You might be concerned about memory. But don't worry, well-behaved Python libraries can use generators to load the data as needed.

Another way to picture it is that your application should build an IO sandwich with all of the reading and writing happening outside of your processing function.

Read Shapefile into Features -->  process_features  --> Write Features to Shapefile

That way anyone can use the same function with different inputs and outputs

Read Web Service into Features -->  process_features  --> Write Features to PostGIS

Processing functions should not care where their input features come from or where the output features are going. As long as process_features takes and returns features, any number of combinations are possible.

This not only decouples IO but allows us to compose processes together

Read Features --> process1 --> process2 --> process3 --> Write Features

Other guidelines

When possible, you should strive for pure functions; avoid mutating data and return a clean copy.

Unless you have specific reason, leave the original feature intact except for the thing your function is expected to manipulate. For instance, if your function just alters the geometry, don't drop or change existing properties.

There are some cases where it makes sense to collect your features into a collection and return the entire thing at once. This will generally occur if the features are not independent. In many cases though, your features will largely be independent and can be processed one-by-one. For these situations, it makes sense to use a generator (i.e. yield feature instead of return features).

Finally, you should aim to make your features serializable. You should be able to json.dump() the output features. The properties member should not contain nested dicts which might confuse some GIS formats which require a flat structure. And if possible, avoid extending the json with extra elements outside of properties.

An Example

In this simple example, we'll write a single vector processing function that buffers a geometry by a specified distance. Taking an input of points, for example:

and buffering them by 10 units.

Here is the core processing function which follows the features in, features out convention

from shapely.geometry import shape

def buffer(features, buffer=1.0):
    """Buffer a feature by specified units
    """
    for feature in features:
        geom = shape(feature['geometry'])   # Convert to shapely geometry to operate on it
        geom = geom.buffer(buffer)          # Buffer
        new_feature = feature.copy()
        new_feature['geometry'] = geom.__geo_interface__
        yield new_feature

Then we could use it in our IO sandwich by reading features from a shapefile and outputing the Features to GeoJSON on stdout. Here's what our Python interface looks like

import fiona # for input
import json  # for output

from process import buffer

with fiona.open("data/points.shp") as src:
    for feature in buffer(src, 10.0):
        print(json.dumps(feature))

So the python interface is looking good. What if we wanted to use it in a command line interface? Well luckily click and cligj has got the input covered. The @cligj.features_in_arg reads in an iterable of features from a file, a FeatureCollection or stream of Features.

import click
import cligj
import json

from process import buffer

@click.command()
@click.argument("distance", type=float)
@cligj.features_in_arg
def buffer_cmd(features, distance):
    for feature in buffer(features, distance):
        click.echo(json.dumps(feature))

if __name__ == "__main__":
    buffer_cmd()

Which we can then use between fio cat and fio collect to process Features in a memory-efficient stream.

$ fio cat data/points.shp | python buffer_cmd.py 10 | fio collect > points_buffer.geojson

What about an HTTP interface? Flask provides us with a lightweight framework to turn our function into a web service:

import json
from flask import Flask, request, Response
from process import buffer

app = Flask(__name__)

@app.route('/buffer/<distance>', methods=['POST'])
def index(distance):
    collection = request.get_json(force=True)
    distance = float(distance)

    new_features = list(
        buffer(collection['features'], distance))

    collection['features'] = new_features

    return Response(
        response=json.dumps(collection),
        status=200, mimetype="application/json")

if __name__ == '__main__':
    app.run(debug=True)

Which gives us a buffer web service to which you can post GeoJSON FeatureCollections and get back a buffered collection:

$ fio dump data/points.shp | \
    curl -X POST -d @- http://localhost:5000/buffer/10.0 > points_buffered.geojson

Conclusion

Writing your vector processing code to follow these simple conventions enables great flexibility. You can use your code in a Python application, a command line interface, an HTTP web service - all based on the same core processing functions. Assuming you can write some glue code to express input and output as GeoJSON features, this will work with any vector data source and is not constrained to a single context. You can use this with any data, anywhere that supports Python. That's a pretty powerful concept, all made possible by the simple convention of features in, features out.