Sat 28 November 2020
In my last post I compared two approaches for calculating zonal statistics:
- A Python approach using the rasterstats library
- A SQL approach using PostGIS rasters.
I came away happy that I could express zonal stats in SQL, but wasn't happy with the performance; an 87x slowdown compared to the equivalent Python code. When in doubt though, it's user error! Readers of this blog (thanks Stefan Jäger and Pierre Racine!) suggested some performance enhancements: tiling and spatial indexes.
Additionally, I wasn't happy with the setup of the last experiment; while PostGIS and Rasterio both interact with the underlying GDAL C API, in my experiment they were using GDAL libraries of different origins. And I'm skeptical that my synthetic vector data was representative of all workloads. A common case for zonal statistics is aggregating a raster by (non-overlapping) administrative boundaries. The nature of the datasets can have a significant impact; best to go with something more realistic.
Time for a reboot...
I used my docker-postgres image to easily recreate an environment where everything is built from source against the same shared libraries.
To run a PostgreSQL server from a Docker container (no messy install required) with local data volumes mounted in:

git clone https://github.com/perrygeo/docker-postgres.git

This will download a pre-built image from Docker Hub so you can try it out without messing with your system. It then launches the PostgreSQL server process, with your local data and log directories mounted as container volumes.
In order to run Python code from the same container, we can exec into it to get shell access:
docker exec -ti postgres-server /bin/bash
From here we can run our Python-based command line tools (Rasterio):
$ rio --version
$ rio --gdal-version
Connecting to the server with psql, I can use the built-in version commands to show what we're working with:
PostgreSQL 13.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
GDAL="GDAL 3.2.0, released 2020/10/26"
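For reference, those strings come from PostgreSQL's built-in reporting functions; a minimal sketch of the queries in psql (`version()` and `postgis_full_version()` are the standard functions, assuming PostGIS is installed in the database):

```sql
-- Report the PostgreSQL build string
SELECT version();

-- Report PostGIS plus the library versions it was built against (GEOS, PROJ, GDAL, ...)
SELECT postgis_full_version();
```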
Since the Rasterio library is running in the container, linked to exact same GDAL, GEOS and PROJ libraries as PostGIS, we can be assured of a more consistent environment.
For our raster dataset, we'll use the historic climate data provided by the WorldClim project. For our experiment we'll use the historic average monthly temperature rasters.
The result is a dozen monthly GeoTIFF files representing the historic average temperature for the month - we'll use wc2.1_2.5m_tavg_07.tif, the average July temperature. Each raster is a 4320 x 8640 grid with global coverage in WGS84 coordinates.
And use Rasterio to inspect the shape of the raster grid:

rio info wc2.1_2.5m_tavg_07.tif | jq -c .shape

which prints [4320, 8640] to stdout, confirming the raster grid shape.
For our vector dataset, we're using the Natural Earth Admin 0 countries dataset with 241 multipolygons, one for each nation.
Check the number of features using Fiona:

$ fio info ne_50m_admin_0_countries.shp | jq .count
241
Overlaying the admin polygons on top of the stylized temperature raster gives a good picture of the question we're trying to answer:
What is the historical average temperature of each country in the month of July?
Zonal Stats using rasterstats
from rasterstats import zonal_stats
stats = zonal_stats(
    "ne_50m_admin_0_countries.shp",
    "wc2.1_2.5m_tavg_07.tif",
    stats=["sum", "mean", "count", "std", "min", "max"],
)
The time to complete this script was 6.67 seconds (fastest of 3 runs).
Zonal Stats using PostGIS
To test the performance of the database, we need to get the data in:
Load the raster data
In part 1, I imported my raster data using a rather naive raster2pgsql command. This time, we add a few more options to tune performance.
raster2pgsql -Y -d -t 256x256 -N '-3.4e+38' -I -C -M -n "path" \
wc2.1_2.5m_tavg_07.tif tavg_07 | psql
-t 256x256 is a key parameter. By cutting the raster into 256-pixel square tiles, the resulting raster table contains multiple rows, one per tile. With a spatial index on the tiles, and SQL rewritten to take advantage of the index and to aggregate across tiles, zonal stats can be made much more efficient inside PostgreSQL.
-I indicates that a spatial index of the raster tiles should be built after import. The spatial index, along with a spatial query that can take advantage of it, can quickly select the subset of tiles that overlap your features of interest.
The other parameters to note:
- -Y uses COPY for more efficient data transfer.
- -d drops the table if it already exists (useful for testing, but be careful in production).
- -N defines a nodata value directly at the CLI.
- -n creates a path column to store the filename.
- -C applies constraints to ensure valid raster alignment, etc.
- -M runs VACUUM ANALYZE on the table as a final step.
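As a quick sanity check after import (assuming the tavg_07 table name from the command above): cutting the 4320 x 8640 grid into 256x256 tiles should yield ceil(4320/256) * ceil(8640/256) = 17 * 34 = 578 rows.

```sql
-- Expect 578 rows: one per 256x256 tile of the 4320x8640 grid
SELECT count(*) FROM tavg_07;
```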
Load the vector data
Using a standard shp2pgsql command with -I to build an index:
shp2pgsql -g geometry -I -s 4326 ne_50m_admin_0_countries.shp countries | psql
Run the query
Now we have two tables loaded, tavg_07 and countries, and can ask our question in SQL:
SELECT
    (ST_SummaryStatsAgg(ST_Clip(raster.rast, countries.geometry, true), 1, true)).*,
    countries.name AS name,
    countries.geometry AS geometry,
    count(1) AS n_tiles
FROM tavg_07 AS raster
INNER JOIN countries ON ST_Intersects(raster.rast, countries.geometry)
GROUP BY countries.name, countries.geometry;
I added the GROUP BY to aggregate across tiles; otherwise we'd get multiple rows per country. And on the SELECT side, PostGIS provides an ST_SummaryStatsAgg function (the aggregate variant of ST_SummaryStats) to sum across tiles.
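As a usage sketch (the table and column names follow the data loaded above; the ranking itself is my addition, not part of the original query), you can wrap the aggregate in a subquery and sort countries by their July mean:

```sql
-- Hypothetical follow-up: the five warmest countries in July
SELECT name, (stats).mean AS tavg_mean
FROM (
    SELECT countries.name,
           ST_SummaryStatsAgg(ST_Clip(raster.rast, countries.geometry, true), 1, true) AS stats
    FROM tavg_07 AS raster
    INNER JOIN countries ON ST_Intersects(raster.rast, countries.geometry)
    GROUP BY countries.name
) AS zonal
ORDER BY tavg_mean DESC
LIMIT 5;
```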
Here's the resulting map data rendered via DBeaver. The count is the number of raster pixels intersecting the feature, while n_tiles is the number of raster tiles. The mean is probably what we're interested in; the average temperature.
Here's the bottom line on performance: PostGIS can perform this query in 6.1 seconds, marginally faster even than the Python rasterstats version (6.67s). It could be that recent improvements in the geospatial stack account for some of this effect, but tiling clearly matters to performance.
Effect of tile size
The chosen value of -t determines how much data fits into each tile. There's an unavoidable inverse relationship between the size of a row and the number of rows/tiles; not surprisingly, we find a tradeoff between those two constraints.
Smaller tiles with a spatial index mean more efficient queries, at the expense of pre-chopping the raster into many tiles. Depending on the nature of your analysis, you'll want to adjust accordingly. The optimal tile size is likely to depend on hardware, the tiling patterns of the original data, and the usage patterns you expect.
For this dataset, somewhere around 256x256 appears to be an optimal size. It would make a good default providing the benefits of tiling without as much import overhead as smaller tiles.
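To make that inverse relationship concrete, here's a rough sketch of how tile size drives row count for this 4320 x 8640 raster (pure Python, no database required; the tile sizes are illustrative, not benchmarks from this post):

```python
import math

height, width = 4320, 8640  # raster grid dimensions from `rio info` above

# Halving the tile edge roughly quadruples the number of rows/tiles
for tile in (64, 128, 256, 512, 1024):
    n_tiles = math.ceil(height / tile) * math.ceil(width / tile)
    print(f"{tile}x{tile}: {n_tiles} tiles")
```

At 256x256 that works out to 578 tiles; dropping to 64x64 inflates the table to over 9,000 rows, which is where import overhead starts to bite.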
Surprisingly, the untiled version still performs OK relative to the Python code. The query on an untiled raster is "only" 7.5x slower than the Python code, not as bad as the 80x performance hit I found in part 1. While this factor seems highly dependent on the data at hand, the conclusion doesn't change - tiling matters.
Use raster2pgsql -t 256x256 -I to tile your PostGIS rasters. Combined with aggregate functions and spatial indexes, you get similar zonal stats query functionality and performance from PostGIS as you would with equivalent single-threaded Python/GDAL approaches.
There's still much to be explored regarding optimal tiling, parallel aggregates, out-of-band rasters, and the impact of source raster data file layout on performance. More to come in part 3...