# Zonal statistics: histograms as user-defined aggregate functions

## Introduction

Zonal statistics allow you to summarize raster datasets based on vector geometries
by aggregating all pixels associated with each vector feature, typically to a single scalar value. For example, you might want the *mean* elevation of each country against an SRTM Digital Elevation
Model (DEM). This is easily accomplished in python using `rasterstats`

:

```
from rasterstats import zonal_stats
stats = zonal_stats('countries.shp', 'elevation.tif', stats="mean", copy_properties=True)
for s in stats:
print s['name'], s['mean']
```

Which would give us output similar to below, with the mean elevation (meters) for each country:

```
Afghanistan 1826.38
Netherlands 8.78
Nepal 2142.28
Zimbabwe 980.85
```

## Zonal Histograms

Using the built-in aggregate functions in `rasterstats`

can reveal a lot about
about the underlying raster dataset (see statistics for full list). Most of the time the standard descriptive statistics
like min, max, mean, median, etc. can tell us everything we need to know.

But what if we want to retain more information about the underlying distribution of values? Instead of simply stating

Afghanistan is, on average, 1826.38 meters above sea level

supposed we wanted to see how much of the country is in high vs low elevation areas. We could bin the elevations into meaningful ranges (say 0-200 meters, 200 to 400 meters, etc) and create a histogram of pixel counts to show the shape of the underlying distribution. In this case, the aggregate function does not return a scalar value but a dictionary with each bin as a key.

```
>> stats['elevation_histogram']
{'0 to 500m': ...,
'500 to 1000m': ...,
'1000 to 3000m':...,
'3000 to 5000m':...,
'5000m+'
}
```

That's the goal, now how do we accomplish this?

## User-defined aggregate functions

Because a histogram might need to specify a number of arguments to customize the results, it's not
feasible for `rasterstats`

to define a generic histogram function. However, as of version 0.6, we
have the ability to create custom, user-defined aggregate functions such as the zonal histogram
idea described above.

First, we have to write our function. The first and only argument is a masked numpy array
and will typically be handled by `numpy`

functions. The function's return value will be added to the stats output for each feature. The returned
value does *not* need to be a scalar, it can be any valid python value (though it's probably
best to stick with dicts, lists and other simple data structures that are easily
serializable).

```
import numpy as np
import itertools
def elevation_histogram(x):
bin_edges = [0, 400, 1000, 3000, 5000, 10000]
hist, _ = np.histogram(x, bins=bin_edges)
data = {}
for upper, lower, value in itertools.izip(bin_edges, bin_edges[1:], hist):
key = "{} to {}m".format(upper, lower)
data[key] = value
return data
```

And then add our custom `elevation_histogram`

function to our `zonal_stats`

call
using the `add_stats`

keyword argument:

```
stats = zonal_stats('countries.shp', 'elevation.tif', copy_properties=True,
add_stats={'elevation_histogram': elevation_histogram})
for s in stats:
print s['name'], s['mean'], s['elevation_histogram']
```

which gives us output similar to the following which gives you raw pixel counts for each of the elevation bins (formatted for readability)

```
Afghanistan 1826.38 {
'3000 to 5000m': 1099730,
'0 to 400m': 1754317,
'1000 to 3000m': 2884917,
'5000 to 10000m': 83158,
'400 to 1000m': 1907790}
```

The only caveat with using this technique is that nested dictionaries and other non-scalar values might cause difficulty when trying to serialize this data structure to other formats. For example, most GIS formats don't support hierarchical properties (nested dictionaries) so you might have to flatten the data before writing to e.g. PostGIS or an ESRI shapefile.

With the ability to write user-defined aggregate functions, I can keep the core
of `rasterstats`

light while allowing for the possibility of complex aggregate analysis
that might be needed in the future. Good stuff.