<h1>Getting started with application configuration in Rust</h1>
<p>Matthew T. Perry, 2022-12-03, perrygeo.com</p>
<p>Sooner or later, that application you're writing will need to be configured.
At the very least, you'll need a way to adjust inputs without editing source code. Wouldn't it be nice to have a reasonable configuration system from the start?</p>
<p>The best way to configure your app depends on
the environment in which the software runs
and the requirements of the project, both of which will change over time.
Ideally, we'd start with a system
flexible enough to pull our configuration from a number of input sources:</p>
<ul>
<li><strong>Command Line Interface</strong> for interactive development with standard flags, clear usage and error handling</li>
<li><strong><code>.env</code></strong> files for declarative configuration, either development or production</li>
<li><strong>Environment variables</strong> for containers and many production settings</li>
<li>Reasonable <strong>defaults</strong> if nothing is provided by the user. And if there is no obvious default, mark it clearly as a mandatory argument.</li>
</ul>
<p>For a language that is often referred to
as a low-level "systems" language, Rust allows for some very ergonomic
abstractions. We can implement a type-safe configuration system
with a minimal amount of imperative code, letting third-party crates handle the mechanical details. Let's walk through a new project...</p>
<h2>Project setup</h2>
<p>In this example, we'll create a Rust project using the <code>clap</code> and <code>dotenv</code> crates.</p>
<div class="highlight"><pre><span></span><code>cargo new myapp
<span class="nb">cd</span> myapp
cargo add clap --features derive,env
cargo add dotenv
</code></pre></div>
<p>Your <code>Cargo.toml</code> file should look something like</p>
<div class="highlight"><pre><span></span><code><span class="k">[dependencies]</span><span class="w"></span>
<span class="n">clap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">version</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"4.0.29"</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="s">"derive"</span><span class="p">,</span><span class="w"> </span><span class="s">"env"</span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="n">dotenv</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"0.15.0"</span><span class="w"></span>
</code></pre></div>
<h2>Creating the configuration struct</h2>
<p>Let's build it up from scratch, starting with a plain <code>struct</code> defining all values we need to configure the app.</p>
<p>In our <code>src/main.rs</code></p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">ipaddr</span>: <span class="nb">String</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">port</span>: <span class="kt">i32</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">database_url</span>: <span class="nb">String</span><span class="p">,</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Let's pause for a second to consider types. In Rust, types can help us out by providing powerful correctness guarantees.</p>
<p>Is <code>ipaddr</code> really a String? The type system should enforce a valid IPv4 address instead of a free-form string.
Likewise, let's make the <code>port</code> an unsigned 16-bit integer to stay within
the range of valid port numbers.</p>
<div class="highlight"><pre><span></span><code><span class="k">use</span><span class="w"> </span><span class="n">std</span>::<span class="n">net</span>::<span class="n">Ipv4Addr</span><span class="p">;</span><span class="w"></span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">ipaddr</span>: <span class="nc">Ipv4Addr</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">port</span>: <span class="kt">u16</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">database_url</span>: <span class="nb">String</span><span class="p">,</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
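<p>Why do these types matter? clap converts and validates each field from its raw string input via the field's <code>FromStr</code> implementation. Here is a std-only sketch (no clap involved) of the guarantees these types buy us:</p>

```rust
use std::net::Ipv4Addr;

fn main() {
    // A well-formed address parses into the dedicated type...
    let ok: Result<Ipv4Addr, _> = "0.0.0.0".parse();
    assert!(ok.is_ok());

    // ...while a malformed one is rejected before it reaches our app logic.
    let bad: Result<Ipv4Addr, _> = "255.255.255.999".parse();
    assert!(bad.is_err());

    // u16 bounds-checks the port: 999999 cannot even be represented.
    assert!("3000".parse::<u16>().is_ok());
    assert!("999999".parse::<u16>().is_err());

    println!("type-level validation works");
}
```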
<h2>Clap annotations</h2>
<p>Next, we use the <a href="https://docs.rs/clap/latest/clap/index.html"><code>clap</code></a> crate and add annotations to our struct.</p>
<p>This turns our declarative struct into a powerful command line interface,
with error handling, default values and type conversion.</p>
<div class="highlight"><pre><span></span><code><span class="k">use</span><span class="w"> </span><span class="n">clap</span>::<span class="n">Parser</span><span class="p">;</span><span class="w"></span>
<span class="k">use</span><span class="w"> </span><span class="n">std</span>::<span class="n">net</span>::<span class="n">Ipv4Addr</span><span class="p">;</span><span class="w"></span>
<span class="cp">#[derive(Parser, Debug)]</span><span class="w"></span>
<span class="cp">#[command(author, version, about)]</span><span class="w"></span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="cp">#[arg(short, long, default_value = </span><span class="s">"0.0.0.0"</span><span class="cp">)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">ipaddr</span>: <span class="nc">Ipv4Addr</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="cp">#[arg(short, long, default_value_t = 3000)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">port</span>: <span class="kt">u16</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="cp">#[arg(short, long)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">database_url</span>: <span class="nb">String</span><span class="p">,</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>The author, version and about text are derived from the contents of our <code>Cargo.toml</code> file.</p>
<p>Note that the <code>database_url</code> does not use a default value.</p>
<h2>Self documentation</h2>
<p>We can add doc comments (<code>///</code>) to the struct and to its members.
This serves the purpose of both documenting the code and exposing
friendly command line usage and error messages.</p>
<div class="highlight"><pre><span></span><code><span class="k">use</span><span class="w"> </span><span class="n">clap</span>::<span class="n">Parser</span><span class="p">;</span><span class="w"></span>
<span class="k">use</span><span class="w"> </span><span class="n">std</span>::<span class="n">net</span>::<span class="n">Ipv4Addr</span><span class="p">;</span><span class="w"></span>
<span class="sd">/// My Awesome Application</span>
<span class="cp">#[derive(Parser, Debug)]</span><span class="w"></span>
<span class="cp">#[command(author, version, about)]</span><span class="w"></span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// IPv4 address</span>
<span class="w"> </span><span class="cp">#[arg(short, long, default_value = </span><span class="s">"0.0.0.0"</span><span class="cp">)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">ipaddr</span>: <span class="nc">Ipv4Addr</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// Port number</span>
<span class="w"> </span><span class="cp">#[arg(short, long, default_value_t = 3000)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">port</span>: <span class="kt">u16</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// Database connection string</span>
<span class="w"> </span><span class="cp">#[arg(short, long)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">database_url</span>: <span class="nb">String</span><span class="p">,</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<h2>Environment handling</h2>
<p>Clap can handle env vars explicitly by adding the <code>env(...)</code> annotation
to each configuration item. Here, we explicitly define each variable name
using an <code>APP_*</code> prefix, all upper case, as a convention:</p>
<div class="highlight"><pre><span></span><code><span class="k">use</span><span class="w"> </span><span class="n">clap</span>::<span class="n">Parser</span><span class="p">;</span><span class="w"></span>
<span class="k">use</span><span class="w"> </span><span class="n">dotenv</span>::<span class="n">dotenv</span><span class="p">;</span><span class="w"></span>
<span class="k">use</span><span class="w"> </span><span class="n">std</span>::<span class="n">net</span>::<span class="n">Ipv4Addr</span><span class="p">;</span><span class="w"></span>
<span class="sd">/// My Awesome Application</span>
<span class="cp">#[derive(Parser, Debug)]</span><span class="w"></span>
<span class="cp">#[command(author, version, about)]</span><span class="w"></span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// IPv4 address</span>
<span class="w"> </span><span class="cp">#[arg(short, long, env(</span><span class="s">"APP_IPADDR"</span><span class="cp">), default_value = </span><span class="s">"0.0.0.0"</span><span class="cp">)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">ipaddr</span>: <span class="nc">Ipv4Addr</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// Port number</span>
<span class="w"> </span><span class="cp">#[arg(short, long, env(</span><span class="s">"APP_PORT"</span><span class="cp">), default_value_t = 3000)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">port</span>: <span class="kt">u16</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// Database connection string</span>
<span class="w"> </span><span class="cp">#[arg(short, long, env(</span><span class="s">"APP_DATABASE_URL"</span><span class="cp">))]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">database_url</span>: <span class="nb">String</span><span class="p">,</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<h2>Constructor</h2>
<p>Since we want to (optionally) populate our environment using a <code>.env</code> file,
we have to set up the environment before invoking the Clap parser. To do this,
we'll implement a <code>from_env_and_args</code> constructor method for our <code>Config</code> struct.</p>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span> <span class="nf">from_env_and_args</span><span class="p">()</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="n">dotenv</span><span class="p">().</span><span class="n">ok</span><span class="p">();</span><span class="w"></span>
<span class="w"> </span><span class="bp">Self</span>::<span class="n">parse</span><span class="p">()</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>With four potential inputs, how do we reason about which takes precedence?
To determine the config value, the effective order is as follows; the <em>first one</em> wins:</p>
<ol>
<li>Command line interface argument</li>
<li>Environment variable</li>
<li>File (<code>.env</code>) — <code>dotenv</code> only sets variables that aren't already present in the environment</li>
<li>Default value</li>
</ol>
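<p>The position of the <code>.env</code> file in that order follows from how <code>dotenv</code> works: it sets a variable only if the real environment doesn't already define it. A small std-only sketch that mimics this resolution order (<code>load_dotenv_line</code> and <code>resolve_port</code> are illustrative helpers, not the crates' APIs):</p>

```rust
use std::env;

// Stand-in for what dotenv does with one line of a .env file:
// set the variable only if the environment doesn't already have it.
fn load_dotenv_line(key: &str, value: &str) {
    if env::var_os(key).is_none() {
        env::set_var(key, value);
    }
}

// CLI argument first, then the environment (which dotenv may have
// populated from .env), then the default.
fn resolve_port(cli_arg: Option<u16>) -> u16 {
    cli_arg
        .or_else(|| env::var("APP_PORT").ok()?.parse().ok())
        .unwrap_or(3000)
}

fn main() {
    // The shell exported APP_PORT=5000 and .env says APP_PORT=4000:
    env::set_var("APP_PORT", "5000");
    load_dotenv_line("APP_PORT", "4000");
    assert_eq!(resolve_port(None), 5000); // env var wins over .env
    assert_eq!(resolve_port(Some(6000)), 6000); // CLI argument wins over everything

    // With no real env var, the .env value is used instead of the default.
    env::remove_var("APP_PORT");
    load_dotenv_line("APP_PORT", "4000");
    assert_eq!(resolve_port(None), 4000);
    println!("precedence checks passed");
}
```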
<h2>Main</h2>
<p>Finally, we write our main function to construct the <code>Config</code> at runtime.</p>
<div class="highlight"><pre><span></span><code><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">cfg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Config</span>::<span class="n">from_env_and_args</span><span class="p">();</span><span class="w"></span>
<span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"Starting HTTP server on {}:{}"</span><span class="p">,</span><span class="w"> </span><span class="n">cfg</span><span class="p">.</span><span class="n">ipaddr</span><span class="p">,</span><span class="w"> </span><span class="n">cfg</span><span class="p">.</span><span class="n">port</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"Connecting to {}"</span><span class="p">,</span><span class="w"> </span><span class="n">cfg</span><span class="p">.</span><span class="n">database_url</span><span class="p">);</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Presumably, your application will do something more interesting here!</p>
<h2>Result</h2>
<div class="highlight"><pre><span></span><code>$ cargo build
...
$ ./target/debug/myapp --help
My Awesome Application
Usage: myapp <span class="o">[</span>OPTIONS<span class="o">]</span> --database-url <DATABASE_URL>
Options:
-i, --ipaddr <IPADDR> IPv4 address <span class="o">[</span>env: <span class="nv">APP_IPADDR</span><span class="o">=]</span> <span class="o">[</span>default: <span class="m">0</span>.0.0.0<span class="o">]</span>
-p, --port <PORT> Port number <span class="o">[</span>env: <span class="nv">APP_PORT</span><span class="o">=]</span> <span class="o">[</span>default: <span class="m">3000</span><span class="o">]</span>
-d, --database-url <DATABASE_URL> Database connection string <span class="o">[</span>env: <span class="nv">APP_DATABASE_URL</span><span class="o">=]</span>
-h, --help Print <span class="nb">help</span> information
-V, --version Print version information
</code></pre></div>
<p>In this case, we see that <code>database_url</code> is undefined in the environment and has no default, but
is required by the application. If we try to run it now, the app exits with a status
code of <code>2</code> and we get a human-readable message that we are missing the database URL:</p>
<div class="highlight"><pre><span></span><code>$ ./target/debug/myapp
error: The following required arguments were not provided:
--database-url <DATABASE_URL>
Usage: myapp --database-url <DATABASE_URL>
For more information try <span class="s1">'--help'</span>
</code></pre></div>
<p>To provide it, we have three options, depending on our operational needs.</p>
<p>First, we can use the command line for interactive testing:</p>
<div class="highlight"><pre><span></span><code>./target/debug/myapp --database-url postgres://postgres@localhost:5432/postgres
</code></pre></div>
<p>Or, an environment variable for production settings:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">APP_DATABASE_URL</span><span class="o">=</span><span class="s2">"postgres://postgres@localhost:5432/postgres"</span>
./target/debug/myapp
</code></pre></div>
<p>Or finally, use a <code>.env</code> file for declarative environment setup (in prod or dev):</p>
<div class="highlight"><pre><span></span><code><span class="nb">echo</span> <span class="s1">'APP_DATABASE_URL=postgres://postgres@localhost:5432/postgres'</span> >> .env
./target/debug/myapp
</code></pre></div>
<p>Whichever way we configure the required <code>database_url</code>, we get the same result.</p>
<div class="highlight"><pre><span></span><code>$ ./target/debug/myapp
Starting HTTP server on <span class="m">0</span>.0.0.0:3000
Connecting to postgres://postgres@localhost:5432/postgres
</code></pre></div>
<p>Error handling is intuitive from the command line.
Let's see what happens when we provide an invalid IP address and port number.</p>
<div class="highlight"><pre><span></span><code>$ ./target/debug/myapp --ipaddr <span class="m">255</span>.255.255.999
error: Invalid value <span class="s1">'255.255.255.999'</span> <span class="k">for</span> <span class="s1">'--ipaddr <IPADDR>'</span>: invalid IPv4 address syntax
For more information try <span class="s1">'--help'</span>
$ ./target/debug/myapp --port <span class="m">999999</span>
error: Invalid value <span class="s1">'999999'</span> <span class="k">for</span> <span class="s1">'--port <PORT>'</span>: <span class="m">999999</span> is not <span class="k">in</span> <span class="m">0</span>..<span class="o">=</span><span class="m">65535</span>
For more information try <span class="s1">'--help'</span>
</code></pre></div>
<p>Voilà. A simple, declarative, type-safe abstraction with minimal code.
We get operational flexibility and confidence in the validity of the inputs
without writing imperative code to handle the details of each scenario.</p>
<p>This can serve as a starter template suitable for most backend server or command line applications. Here it is, all in one place:</p>
<div class="highlight"><pre><span></span><code><span class="k">use</span><span class="w"> </span><span class="n">clap</span>::<span class="n">Parser</span><span class="p">;</span><span class="w"></span>
<span class="k">use</span><span class="w"> </span><span class="n">dotenv</span>::<span class="n">dotenv</span><span class="p">;</span><span class="w"></span>
<span class="k">use</span><span class="w"> </span><span class="n">std</span>::<span class="n">net</span>::<span class="n">Ipv4Addr</span><span class="p">;</span><span class="w"></span>
<span class="sd">/// My Awesome Application</span>
<span class="cp">#[derive(Parser, Debug)]</span><span class="w"></span>
<span class="cp">#[command(author, version, about)]</span><span class="w"></span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// IPv4 address</span>
<span class="w"> </span><span class="cp">#[arg(short, long, env(</span><span class="s">"APP_IPADDR"</span><span class="cp">), default_value = </span><span class="s">"0.0.0.0"</span><span class="cp">)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">ipaddr</span>: <span class="nc">Ipv4Addr</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// Port number</span>
<span class="w"> </span><span class="cp">#[arg(short, long, env(</span><span class="s">"APP_PORT"</span><span class="cp">), default_value_t = 3000)]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">port</span>: <span class="kt">u16</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="sd">/// Database connection string</span>
<span class="w"> </span><span class="cp">#[arg(short, long, env(</span><span class="s">"APP_DATABASE_URL"</span><span class="cp">))]</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">database_url</span>: <span class="nb">String</span><span class="p">,</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
<span class="k">impl</span><span class="w"> </span><span class="n">Config</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span> <span class="nf">from_env_and_args</span><span class="p">()</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="n">dotenv</span><span class="p">().</span><span class="n">ok</span><span class="p">();</span><span class="w"></span>
<span class="w"> </span><span class="bp">Self</span>::<span class="n">parse</span><span class="p">()</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">cfg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Config</span>::<span class="n">from_env_and_args</span><span class="p">();</span><span class="w"></span>
<span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"Starting HTTP server on {}:{}"</span><span class="p">,</span><span class="w"> </span><span class="n">cfg</span><span class="p">.</span><span class="n">ipaddr</span><span class="p">,</span><span class="w"> </span><span class="n">cfg</span><span class="p">.</span><span class="n">port</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"Connecting to {}"</span><span class="p">,</span><span class="w"> </span><span class="n">cfg</span><span class="p">.</span><span class="n">database_url</span><span class="p">);</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Check out the <a href="https://docs.rs/clap/latest/clap/index.html">clap docs</a> for more examples
of how you can extend this approach.</p>
<p>I think this interface shows that we don't need to compromise between ergonomics and type safety, or between speed and correctness. It's a great example of Rust's potential as a higher-level application language.</p>
<h1>Don't install PostgreSQL - Using containers for local development</h1>
<p>Matthew T. Perry, 2022-02-11</p>
<p>So you need a database for an application you're developing. You've looked around and decided that PostgreSQL fits the bill. Excellent choice! Now it's time to start coding. How do you get postgres running locally to develop and test against it?</p>
<p>The typical suggestion for many web application frameworks is to install PostgreSQL to your system using your chosen package manager (<code>brew install postgresql</code> or <code>apt install postgresql</code>), configure it to work for your application (maybe tweaking some settings in <code>/etc/postgresql/</code> as the root user), start a background process with your system supervisor of choice (<code>sudo systemctl start postgresql</code>), hook it up to your app, and you're off to the races.</p>
<p>But what happens when you're working on a project that needs a different major version of postgresql, with different extensions or entirely different settings? I often found myself in a scenario where my system was full of cruft, having been reworked many times over to swap out different postgresql instances. Additionally, there is only a single data directory per version (<code>/var/lib/postgresql/&lt;version&gt;/main</code> on Debian-based systems), so if you need the data to persist for more than a single project, you have to manage backup and restore each time you switch contexts.</p>
<p>A traditional system install just doesn't cut it. We need a way to run many different postgres instances, independent of each other with isolated data, settings and software versions. We can use Docker containers to run postgresql in a more flexible way that allows for greater experimentation, data stability, and greatly improved ease of use.</p>
<h2>Running postgres in Docker, the naive approach</h2>
<p>There's no real secret to running Docker containers. We know that <a href="https://hub.docker.com/_/postgres/">postgresql docker images</a> exist and we should be able to run them like any other.</p>
<div class="highlight"><pre><span></span><code><span class="o">$</span><span class="w"> </span><span class="n">docker</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="n">postgres</span><span class="p">:</span><span class="mf">14.1</span><span class="w"></span>
<span class="n">Unable</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">find</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="s1">'postgres:14.1'</span><span class="w"> </span><span class="n">locally</span><span class="w"></span>
<span class="mf">14.1</span><span class="p">:</span><span class="w"> </span><span class="n">Pulling</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">library</span><span class="o">/</span><span class="n">postgres</span><span class="w"></span>
<span class="o">...</span><span class="w"></span>
<span class="n">Status</span><span class="p">:</span><span class="w"> </span><span class="n">Downloaded</span><span class="w"> </span><span class="n">newer</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">postgres</span><span class="p">:</span><span class="mf">14.1</span><span class="w"></span>
<span class="n">Error</span><span class="p">:</span><span class="w"> </span><span class="n">Database</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">uninitialized</span><span class="w"> </span><span class="ow">and</span><span class="w"> </span><span class="n">superuser</span><span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="n">specified</span><span class="o">.</span><span class="w"></span>
<span class="w"> </span><span class="n">You</span><span class="w"> </span><span class="n">must</span><span class="w"> </span><span class="n">specify</span><span class="w"> </span><span class="n">POSTGRES_PASSWORD</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="n">non</span><span class="o">-</span><span class="n">empty</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">the</span><span class="w"></span>
<span class="w"> </span><span class="n">superuser</span><span class="o">.</span><span class="w"> </span><span class="n">For</span><span class="w"> </span><span class="n">example</span><span class="p">,</span><span class="w"> </span><span class="s2">"-e POSTGRES_PASSWORD=password"</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="s2">"docker run"</span><span class="o">.</span><span class="w"></span>
<span class="w"> </span><span class="n">You</span><span class="w"> </span><span class="n">may</span><span class="w"> </span><span class="n">also</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="s2">"POSTGRES_HOST_AUTH_METHOD=trust"</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">allow</span><span class="w"> </span><span class="n">all</span><span class="w"></span>
<span class="w"> </span><span class="n">connections</span><span class="w"> </span><span class="n">without</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="n">password</span><span class="o">.</span><span class="w"> </span><span class="n">This</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="o">*</span><span class="ow">not</span><span class="o">*</span><span class="w"> </span><span class="n">recommended</span><span class="o">.</span><span class="w"></span>
<span class="w"> </span><span class="n">See</span><span class="w"> </span><span class="n">PostgreSQL</span><span class="w"> </span><span class="n">documentation</span><span class="w"> </span><span class="n">about</span><span class="w"> </span><span class="s2">"trust"</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">www</span><span class="o">.</span><span class="n">postgresql</span><span class="o">.</span><span class="n">org</span><span class="o">/</span><span class="n">docs</span><span class="o">/</span><span class="n">current</span><span class="o">/</span><span class="n">auth</span><span class="o">-</span><span class="n">trust</span><span class="o">.</span><span class="n">html</span><span class="w"></span>
</code></pre></div>
<p>Ah, clearly there are a few tricks specific to running postgres in a container. If we set a postgres password, we can get a running postgres instance.</p>
<div class="highlight"><pre><span></span><code>$ docker run -e <span class="nv">POSTGRES_PASSWORD</span><span class="o">=</span>password postgres:14.1
...
<span class="m">2022</span>-02-03 <span class="m">18</span>:23:38.823 UTC <span class="o">[</span><span class="m">1</span><span class="o">]</span> LOG: database system is ready to accept connections
</code></pre></div>
<p>The container startup script will initialize your database, create users and start the process, listening for connections. But where is it listening? We can't yet connect to it. And where is the data? We can't see any data anywhere on our host system. Everything is, well, contained within the running Docker container.</p>
<p>To make this workflow viable for local development, we'd like</p>
<ul>
<li>An open TCP port on the host system so we can connect to it.</li>
<li>The data to live on the host system, not in the container's overlay filesystem.</li>
<li>To give postgres access to files from the host system so that we can import datasets.</li>
<li>Settings to live on the host system so that we can adjust them and optionally check them into source control.</li>
</ul>
<p>Of course, the official <a href="https://hub.docker.com/_/postgres/">PostgreSQL Docker documentation</a> covers these exact scenarios, showing us how we can use <em>port forwarding</em> and <em>volume mounts</em>.</p>
<h2>An alternative to system-wide PostgreSQL installs</h2>
<p>Here is my opinionated take on how to set up an ergonomic postgres environment for local development.</p>
<p>First, create a <code>database</code> directory in your project to hold all things postgres.</p>
<p>Then create <code>database/postgresql.conf</code> to specify the postgres settings. The example below is a subset of the full postgres config: the settings I typically need to adjust when doing any serious performance-sensitive development.</p>
<div class="highlight"><pre><span></span><code><span class="c"># PostgreSQL configuration file</span>
<span class="c"># See https://github</span><span class="nt">.</span><span class="c">com/postgres/postgres/blob/master/src/backend/utils/misc/postgresql</span><span class="nt">.</span><span class="c">conf</span><span class="nt">.</span><span class="c">sample</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c"># CONNECTIONS AND AUTHENTICATION</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c">listen_addresses = '*'</span>
<span class="c">port = 5432 # (change requires restart)</span>
<span class="c">max_connections = 100 # (change requires restart)</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c"># RESOURCE USAGE (except WAL)</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c">shared_buffers = 2048MB # min 128kB</span>
<span class="c">work_mem = 40MB # min 64kB</span>
<span class="c">maintenance_work_mem = 640MB # min 1MB</span>
<span class="c">dynamic_shared_memory_type = posix # the default is the first option</span>
<span class="c">max_parallel_workers_per_gather = 6 # taken from max_parallel_workers</span>
<span class="c">max_parallel_workers = 12 # maximum number of max_worker_processes that</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c"># WRITE</span><span class="nb">-</span><span class="c">AHEAD LOG</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c">checkpoint_timeout = 40min # range 30s</span><span class="nb">-</span><span class="c">1d</span>
<span class="c">max_wal_size = 1GB</span>
<span class="c">min_wal_size = 80MB</span>
<span class="c">checkpoint_completion_target = 0</span><span class="nt">.</span><span class="c">75 # checkpoint target duration</span><span class="nt">,</span><span class="c"> 0</span><span class="nt">.</span><span class="c">0 </span><span class="nb">-</span><span class="c"> 1</span><span class="nt">.</span><span class="c">0</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c"># REPORTING AND LOGGING</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c">logging_collector = off</span>
<span class="c">log_autovacuum_min_duration = 0</span>
<span class="c">log_checkpoints = on</span>
<span class="c">log_connections = on</span>
<span class="c">log_disconnections = on</span>
<span class="c">log_error_verbosity = default</span>
<span class="c">log_min_duration_statement = 20ms</span>
<span class="c">log_lock_waits = on</span>
<span class="c">log_temp_files = 0</span>
<span class="c">log_timezone = 'UTC'</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c"># AUTOVACUUM</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c">autovacuum_vacuum_scale_factor = 0</span><span class="nt">.</span><span class="c">02 # fraction of table size before vacuum</span>
<span class="c">autovacuum_analyze_scale_factor = 0</span><span class="nt">.</span><span class="c">01 # fraction of table size before analyze</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c"># CLIENT CONNECTION DEFAULTS</span>
<span class="c">#</span><span class="nb">------------------------------------------------------------------------------</span><span class="c"></span>
<span class="c">datestyle = 'iso</span><span class="nt">,</span><span class="c"> mdy'</span>
<span class="c">timezone = 'UTC'</span>
<span class="c">lc_messages = 'C</span><span class="nt">.</span><span class="c">UTF</span><span class="nb">-</span><span class="c">8'</span>
<span class="c">lc_monetary = 'C</span><span class="nt">.</span><span class="c">UTF</span><span class="nb">-</span><span class="c">8'</span>
<span class="c">lc_numeric = 'C</span><span class="nt">.</span><span class="c">UTF</span><span class="nb">-</span><span class="c">8'</span>
<span class="c">lc_time = 'C</span><span class="nt">.</span><span class="c">UTF</span><span class="nb">-</span><span class="c">8'</span>
<span class="c">default_text_search_config = 'pg_catalog</span><span class="nt">.</span><span class="c">english'</span>
<span class="c">shared_preload_libraries = 'pg_stat_statements'</span>
</code></pre></div>
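<p>The memory settings above happen to line up with widely-repeated rules of thumb for an 8GB machine: <code>shared_buffers</code> at roughly 25% of RAM and <code>work_mem</code> at roughly 0.5%. As a hedged sketch (the percentages are conventional advice, not something this config prescribes), you can compute starting values like this:</p>

```python
# Rule-of-thumb starting points for postgres memory settings, given
# total RAM in MB. These are conventional defaults to tune from, not
# values the official docs mandate.
def suggest_memory_settings(ram_mb: int) -> dict:
    return {
        "shared_buffers": f"{ram_mb // 4}MB",      # ~25% of RAM
        "work_mem": f"{max(4, ram_mb // 200)}MB",  # per-sort/hash budget, ~0.5%
    }

print(suggest_memory_settings(8192))
# prints {'shared_buffers': '2048MB', 'work_mem': '40MB'}
```

<p>For 8192MB of RAM this reproduces the <code>shared_buffers = 2048MB</code> and <code>work_mem = 40MB</code> used above.</p>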
<p>Create a <code>database/pg_hba.conf</code> to control access to the database. You might need to adjust this to experiment with different networking setups, different users, etc. Usually the defaults here are fine.</p>
<div class="highlight"><pre><span></span><code>#<span class="w"> </span><span class="nv">PostgreSQL</span><span class="w"> </span><span class="nv">Client</span><span class="w"> </span><span class="nv">Authentication</span><span class="w"> </span><span class="nv">Configuration</span><span class="w"> </span><span class="nv">File</span><span class="w"></span>
#<span class="w"> </span><span class="o">===================================================</span><span class="w"></span>
#<span class="w"> </span><span class="nv">TYPE</span><span class="w"> </span><span class="nv">DATABASE</span><span class="w"> </span><span class="nv">USER</span><span class="w"> </span><span class="nv">CIDR</span><span class="o">-</span><span class="nv">ADDRESS</span><span class="w"> </span><span class="nv">METHOD</span><span class="w"></span>
#<span class="w"> </span><span class="nv">Database</span><span class="w"> </span><span class="nv">administrative</span><span class="w"> </span><span class="nv">login</span><span class="w"> </span><span class="nv">by</span><span class="w"> </span><span class="nv">UNIX</span><span class="w"> </span><span class="nv">sockets</span><span class="w"></span>
#<span class="w"> </span><span class="s2">"local"</span><span class="w"> </span><span class="nv">is</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">Unix</span><span class="w"> </span><span class="nv">domain</span><span class="w"> </span><span class="nv">socket</span><span class="w"> </span><span class="nv">connections</span><span class="w"> </span><span class="nv">only</span><span class="w"></span>
<span class="nv">local</span><span class="w"> </span><span class="nv">all</span><span class="w"> </span><span class="nv">postgres</span><span class="w"> </span><span class="nv">ident</span><span class="w"></span>
<span class="nv">local</span><span class="w"> </span><span class="nv">all</span><span class="w"> </span><span class="nv">all</span><span class="w"> </span><span class="nv">ident</span><span class="w"></span>
#<span class="w"> </span><span class="nv">IPv4</span><span class="w"> </span><span class="nv">local</span><span class="w"> </span><span class="nv">connections</span>:<span class="w"></span>
<span class="nv">host</span><span class="w"> </span><span class="nv">all</span><span class="w"> </span><span class="nv">all</span><span class="w"> </span><span class="mi">172</span>.<span class="mi">17</span>.<span class="mi">0</span>.<span class="mi">0</span><span class="o">/</span><span class="mi">16</span><span class="w"> </span><span class="nv">md5</span><span class="w"></span>
#<span class="w"> </span><span class="nv">IPv6</span><span class="w"> </span><span class="nv">local</span><span class="w"> </span><span class="nv">connections</span>:<span class="w"></span>
<span class="nv">host</span><span class="w"> </span><span class="nv">all</span><span class="w"> </span><span class="nv">all</span><span class="w"> </span>::<span class="mi">1</span><span class="o">/</span><span class="mi">128</span><span class="w"> </span><span class="nv">md5</span><span class="w"></span>
</code></pre></div>
<p>Make two subdirectories to hold the data: <code>database/mnt_data</code> to hold data you intend to import/export and <code>database/pgdata</code> to hold the actual database.</p>
<div class="highlight"><pre><span></span><code>$ mkdir mnt_data
$ mkdir pgdata
</code></pre></div>
<p>You probably don't want to check your datasets or database into source control. Create a <code>database/.gitignore</code> to ignore them:</p>
<div class="highlight"><pre><span></span><code># .gitignore
pgdata
mnt_data
</code></pre></div>
<p>Finally, create a <code>run-postgres.sh</code> script to launch the docker container with everything hooked up.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># run-postgres.sh</span>
<span class="nb">set</span> -e
<span class="nv">HOST_PORT</span><span class="o">=</span><span class="m">5432</span>
<span class="nv">NAME</span><span class="o">=</span>postgres-dev
<span class="nv">DOCKER_REPO</span><span class="o">=</span>postgres
<span class="nv">TAG</span><span class="o">=</span><span class="m">14</span>.1
docker run --rm --name <span class="nv">$NAME</span> <span class="se">\</span>
--volume <span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>/pgdata:/var/lib/pgsql/data <span class="se">\</span>
--volume <span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>/mnt_data:/mnt/data <span class="se">\</span>
--volume <span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>/pg_hba.conf:/etc/postgresql/pg_hba.conf <span class="se">\</span>
--volume <span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>/postgresql.conf:/etc/postgresql/postgresql.conf <span class="se">\</span>
-e <span class="nv">POSTGRES_PASSWORD</span><span class="o">=</span>password <span class="se">\</span>
-e <span class="nv">POSTGRES_USER</span><span class="o">=</span>postgres <span class="se">\</span>
-e <span class="nv">PGDATA</span><span class="o">=</span>/var/lib/pgsql/data/pgdata14 <span class="se">\</span>
-e <span class="nv">POSTGRES_INITDB_ARGS</span><span class="o">=</span><span class="s2">"--data-checksums --encoding=UTF8"</span> <span class="se">\</span>
-e <span class="nv">POSTGRES_DB</span><span class="o">=</span>db <span class="se">\</span>
-p <span class="si">${</span><span class="nv">HOST_PORT</span><span class="si">}</span>:5432 <span class="se">\</span>
<span class="si">${</span><span class="nv">DOCKER_REPO</span><span class="si">}</span>:<span class="si">${</span><span class="nv">TAG</span><span class="si">}</span> <span class="se">\</span>
postgres <span class="se">\</span>
-c <span class="s1">'config_file=/etc/postgresql/postgresql.conf'</span> <span class="se">\</span>
-c <span class="s1">'hba_file=/etc/postgresql/pg_hba.conf'</span>
</code></pre></div>
<p>Note the <code>HOST_PORT</code> variable. If you've already got another database running on 5432, this won't work. This is where you need to get a bit creative and tune the process to your needs. What I typically do is use port <strong>6</strong>432 and increment by one for every project so they don't conflict. This lets you run all of your databases at the same time on one machine. The only downside is you need to remember which port maps to which database!</p>
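<p>One small tweak (my suggestion, not part of the original script) is to make <code>HOST_PORT</code> an environment-variable override with a default, so each project can pick its port without editing the script:</p>

```shell
# Use the caller's HOST_PORT if set, otherwise default to 5432.
HOST_PORT="${HOST_PORT:-5432}"
echo "forwarding host port ${HOST_PORT} -> container port 5432"
```

<p>Then <code>HOST_PORT=6432 ./run-postgres.sh</code> runs this project's database on 6432 while the script's default stays 5432.</p>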
<h2>Running it</h2>
<div class="highlight"><pre><span></span><code>$ ./run-postgres.sh
...
<span class="m">2022</span>-02-03 <span class="m">19</span>:13:09.673 UTC <span class="o">[</span><span class="m">1</span><span class="o">]</span> LOG: starting PostgreSQL <span class="m">14</span>.1 <span class="o">(</span>Debian <span class="m">14</span>.1-1.pgdg110+1<span class="o">)</span> on x86_64-pc-linux-gnu, compiled by gcc <span class="o">(</span>Debian <span class="m">10</span>.2.1-6<span class="o">)</span> <span class="m">10</span>.2.1 <span class="m">20210110</span>, <span class="m">64</span>-bit
<span class="m">2022</span>-02-03 <span class="m">19</span>:13:09.673 UTC <span class="o">[</span><span class="m">1</span><span class="o">]</span> LOG: listening on IPv4 address <span class="s2">"0.0.0.0"</span>, port <span class="m">5432</span>
<span class="m">2022</span>-02-03 <span class="m">19</span>:13:09.673 UTC <span class="o">[</span><span class="m">1</span><span class="o">]</span> LOG: listening on IPv6 address <span class="s2">"::"</span>, port <span class="m">5432</span>
<span class="m">2022</span>-02-03 <span class="m">19</span>:13:09.677 UTC <span class="o">[</span><span class="m">1</span><span class="o">]</span> LOG: listening on Unix socket <span class="s2">"/var/run/postgresql/.s.PGSQL.5432"</span>
<span class="m">2022</span>-02-03 <span class="m">19</span>:13:09.685 UTC <span class="o">[</span><span class="m">26</span><span class="o">]</span> LOG: database system was shut down at <span class="m">2021</span>-11-13 <span class="m">21</span>:34:06 UTC
<span class="m">2022</span>-02-03 <span class="m">19</span>:13:09.700 UTC <span class="o">[</span><span class="m">1</span><span class="o">]</span> LOG: database system is ready to accept connections
</code></pre></div>
<p>Using this setup, the logs are sent directly to <code>stdout</code> so you'll see everything in the terminal. The ports and paths in the logs are <em>inside</em> the container, so don't get fooled trying to find them on your host system.</p>
<p>To connect, we use the host port defined in the script (shown here as 6432, following the per-project port convention above):</p>
<div class="highlight"><pre><span></span><code>$ psql postgres://postgres:password@localhost:6432/postgres
</code></pre></div>
<p>You can put data in <code>mnt_data</code> from the host system, which will be exposed to postgresql as the <code>/mnt/data</code> directory inside the container. For example, load it with psql using <code>COPY data FROM '/mnt/data/my.csv' WITH CSV HEADER;</code>. Likewise, any data dumps or exports <em>from</em> postgres can be output to this directory, immediately accessible to the host system.</p>
<p>To stop the server, use Ctrl-C. The data will persist in your <code>pgdata</code> directory. Resist the temptation to touch any files therein, as they are managed internally by postgres. But you can move the directory as a whole around the filesystem or to another machine. It's not quite as convenient as a process-less, single-file SQLite database, but it's close.</p>
<p>Because the <code>pgdata</code> directory is created by postgres, which provides strong guarantees that the on-disk
data format will be consistent within a major version, we can even use a different image altogether to access the same underlying dataset. This can be very handy for switching between vanilla postgres and postgis,
or for testing different versions of extensions, etc. As long as the image follows the basic rules of the postgres container behavior and uses the same major version, it should just work.</p>
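<p>Postgres records the cluster's major version in a <code>PG_VERSION</code> file at the top of the data directory. A small guard in the launch script (a sketch of my own, not part of the original script) can catch a mismatch before the container even starts:</p>

```shell
# Refuse to launch if the on-disk cluster was initialized by a
# different major version than the image tag we're about to run.
TAG=14.1
PGDATA_DIR=pgdata/pgdata14
want="${TAG%%.*}"   # major version from the image tag, e.g. "14"
if [ -f "$PGDATA_DIR/PG_VERSION" ]; then
    have="$(cat "$PGDATA_DIR/PG_VERSION")"
    if [ "$have" != "$want" ]; then
        echo "pgdata is from PostgreSQL $have, but image is $want" >&2
        exit 1
    fi
fi
echo "major version check ok: $want"
```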
<h2>What about in production?</h2>
<p>Installing postgresql on a VM or bare-metal server is still viable, especially if automated with configuration tools like Ansible or Chef. But there are other options.</p>
<p>If your project is all-in on containers in production, consider checking out some of the Kubernetes operators for postgres.
You can use the exact same container image in production that you test on locally,
albeit with some additional operational concerns around availability
and stateful data. Operator software like
<a href="https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes/">Crunchy PostgreSQL for Kubernetes</a> and <a href="https://www.kubegres.io">Kubegres</a> can be configured for load balancing, high-availability, backups, monitoring, etc. which can ease the operational burden should your database require such things.</p>
<p>Of course, there is always the cloud-hosted option. I've used postgresql on both GCP Cloud SQL and AWS RDS and, while you give up some control of the environment and are no longer able to run the exact same database locally as you do in prod, the ease of administering these hosted databases might be worth it.</p>
<h2>Conclusion</h2>
<p>Docker containers provide a robust way to run postgres in local development, with very few compromises. A container-based workflow makes it easier to maintain multiple parallel databases, and to move data freely between systems. For my money,
there's no need to <code>apt install</code> postgres again.</p>
<h1>Zonal Stats with PostGIS Rasters, part 2</h1>
<p>2020-11-28, Matthew T. Perry</p>
<p>In my <a href="https://www.perrygeo.com/zonal-stats-with-postgis-rasters.html">last post</a> I compared two approaches for calculating zonal statistics:</p>
<ul>
<li>A Python approach using the rasterstats library</li>
<li>A SQL approach using PostGIS rasters.</li>
</ul>
<p>I came away happy that I could express zonal stats in SQL, but wasn't happy with the performance; an 87x slowdown compared to the equivalent Python code. When in doubt though, it's user error! I received some good suggestions from readers of this blog (Thanks Stefan Jäger and Pierre Racine!) who suggested some performance enhancements from <strong>tiling</strong> and <strong>spatial indexes</strong>.</p>
<p>Additionally, I wasn't happy with the setup of the last experiment; while PostGIS and Rasterio both interact with the underlying GDAL C API,
in my experiment they were using GDAL libraries of different origins. And I'm skeptical that my synthetic vector data was representative of all workloads. A common case for zonal statistics is aggregating a raster by (non-overlapping) administrative boundaries. The nature of the datasets can have a significant impact; best to go with something more realistic.</p>
<p>Time for a reboot...</p>
<h2>Reproducible containers</h2>
<p>I used my <a href="https://github.com/perrygeo/docker-postgres"><code>docker-postgres</code> image</a> to easily recreate an environment where everything is built from source against the same shared libraries.</p>
<p>To run a postgresql server from a docker container (no messy install required), with
local data volumes mounted in <code>./pgdata</code>:</p>
<div class="highlight"><pre><span></span><code>git clone https://github.com/perrygeo/docker-postgres.git
<span class="nb">cd</span> docker-postgres
./run-postgres.sh
</code></pre></div>
<p>This will download a pre-built image <a href="https://hub.docker.com/r/perrygeo/postgres/tags?page=1&ordering=last_updated">from Dockerhub</a> so you can try it out without messing with your system. It then launches
the Postgresql server process, with your local <code>pgdata</code>, <code>mnt_data</code> and <code>log</code> directories mounted as container
volumes.</p>
<p>In order to run Python code from the same container, we can <code>exec</code> into it to get shell access:</p>
<div class="highlight"><pre><span></span><code>docker <span class="nb">exec</span> -ti postgres-server /bin/bash
</code></pre></div>
<p>From here we can run our Python-based command line tools (Rasterio)</p>
<div class="highlight"><pre><span></span><code>$ rio --version
<span class="m">1</span>.1.8
$ rio --gdal-version
<span class="m">3</span>.2.0
</code></pre></div>
<p>Connecting to the server with <code>psql</code>, I can use the built-in version commands to show what we're working with</p>
<p><code>SELECT version();</code> </p>
<div class="highlight"><pre><span></span><code>PostgreSQL 13.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
</code></pre></div>
<p><code>SELECT postgis_full_version();</code></p>
<div class="highlight"><pre><span></span><code>POSTGIS="3.1.0alpha3 b2221ee"
PGSQL="130"
GEOS="3.9.0dev-CAPI-1.14.0"
PROJ="7.2.0"
GDAL="GDAL 3.2.0, released 2020/10/26"
LIBXML="2.9.4"
LIBJSON="0.12.1"
LIBPROTOBUF="1.3.3"
WAGYU="0.5.0 (Internal)"
RASTER
</code></pre></div>
<p>Since the Rasterio library is running in the container, linked to the exact same GDAL, GEOS and PROJ libraries as PostGIS, we can be assured of a more consistent environment.</p>
<h2>Raster dataset</h2>
<p>For our raster dataset, we'll use the historic climate data provided by the <a href="https://worldclim.org/data/worldclim21.html">WorldClim</a> project, specifically the historic average monthly temperature rasters.</p>
<div class="highlight"><pre><span></span><code>wget http://biogeo.ucdavis.edu/data/worldclim/v2.1/base/wc2.1_2.5m_tavg.zip
unzip wc2.1_2.5m_tavg.zip
</code></pre></div>
<p>The result is a dozen monthly GeoTIFF files representing the historic average temperature for the month - we'll use <code>wc2.1_2.5m_tavg_07.tif</code>, the average July temperature. Each raster is a 4320 x 8640 grid with global coverage in WGS84 coordinates.</p>
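<p>As a quick sanity check, that grid shape follows directly from the resolution: a 2.5 arc-minute grid has 60/2.5 = 24 cells per degree, and a global WGS84 extent spans 360 degrees of longitude and 180 of latitude.</p>

```python
# A 2.5 arc-minute global WGS84 grid: 24 cells per degree.
cells_per_degree = 60 / 2.5
width = int(360 * cells_per_degree)   # columns, spanning longitude
height = int(180 * cells_per_degree)  # rows, spanning latitude
print(height, width)  # prints "4320 8640"
```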
<p>We can use Rasterio to inspect the shape of the raster grid:</p>
<div class="highlight"><pre><span></span><code>rio info wc2.1_2.5m_tavg_07.tif <span class="p">|</span> jq -c .shape
</code></pre></div>
<p>which prints to stdout, confirming the raster grid shape:</p>
<div class="highlight"><pre><span></span><code>[4320,8640]
</code></pre></div>
<h2>Vector data</h2>
<p>For our vector dataset, we're using the <a href="https://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-0-countries-2/">Natural Earth Admin</a>
dataset with 241 multipolygons, one for each nation.</p>
<div class="highlight"><pre><span></span><code>wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/50m/cultural/ne_50m_admin_0_countries.zip
unzip ne_50m_admin_0_countries.zip
</code></pre></div>
<p>Check the number of features using Fiona</p>
<div class="highlight"><pre><span></span><code>$ fio info ne_50m_admin_0_countries.shp <span class="p">|</span> jq .count
<span class="m">241</span>
</code></pre></div>
<p>Overlaying the admin polygons on top of the stylized temperature raster gives a good picture of the question we're trying to answer:</p>
<blockquote>
<p>What is the historical average temperature of each country in the month of July?</p>
</blockquote>
<p><img src="assets/img/worldclim-avg-temp-ne-admin.png" width="800px"></p>
<h2>Zonal Stats using <code>python-rasterstats</code></h2>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">rasterstats</span> <span class="kn">import</span> <span class="n">zonal_stats</span>
<span class="n">stats</span> <span class="o">=</span> <span class="n">zonal_stats</span><span class="p">(</span>
<span class="n">vector</span><span class="o">=</span><span class="s2">"ne_50m_admin_0_countries.shp"</span><span class="p">,</span>
<span class="n">raster</span><span class="o">=</span><span class="s2">"wc2.1_2.5m_tavg_07.tif"</span><span class="p">,</span>
<span class="n">stats</span><span class="o">=</span><span class="p">[</span><span class="s2">"sum"</span><span class="p">,</span> <span class="s2">"mean"</span><span class="p">,</span> <span class="s2">"count"</span><span class="p">,</span> <span class="s2">"std"</span><span class="p">,</span> <span class="s2">"min"</span><span class="p">,</span> <span class="s2">"max"</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div>
<p>The time to complete this script was <strong>6.67 seconds</strong> (fastest of 3 runs).</p>
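<p>For reference, a "fastest of 3 runs" figure like this can be collected with <code>timeit.repeat</code>. The workload below is a stand-in placeholder; in the real benchmark it would be the <code>zonal_stats</code> call above:</p>

```python
import timeit

def workload():
    # Placeholder for the zonal_stats(...) call being benchmarked.
    return sum(i * i for i in range(100_000))

# number=1 runs the workload once per trial; repeat=3 gives three trials.
# Conventionally you report the minimum: the least-noisy measurement.
trials = timeit.repeat(workload, number=1, repeat=3)
print(f"fastest of 3 runs: {min(trials):.4f} s")
```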
<h2>Zonal Stats using <code>postgis_raster</code></h2>
<p>To test the performance of the database, we need to get the data in:</p>
<h3>Load the raster data</h3>
<p>In part 1, I imported my raster data using a rather naive <code>raster2pgsql</code> command. This
time, we add a few more options to tune performance.</p>
<div class="highlight"><pre><span></span><code>raster2pgsql -Y -d -t 256x256 -N <span class="s1">'-3.4e+38'</span> -I -C -M -n <span class="s2">"path"</span> <span class="se">\</span>
wc2.1_2.5m_tavg_07.tif tavg_07 <span class="p">|</span> psql
</code></pre></div>
<p>The <code>-t 256x256</code> is a key parameter. By cutting the raster into 256-pixel square tiles,
the resulting raster table contains multiple rows, one per tile. With a spatial index on the tiles, and with the SQL rewritten to take advantage of the index and to aggregate across tiles, zonal stats can be made much more efficient inside PostgreSQL.</p>
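<p>The tile count grows quickly as the tile size shrinks. For the 4320 x 8640 grid used here, we can compute how many rows each <code>-t</code> choice produces (edge tiles may be smaller than the requested size, so the count rounds up):</p>

```python
import math

def tile_count(height, width, tile):
    """Rows produced by raster2pgsql -t <tile>x<tile> for a height x width grid."""
    return math.ceil(height / tile) * math.ceil(width / tile)

for t in (64, 256, 1024):
    print(f"{t}x{t}: {tile_count(4320, 8640, t)} tiles")
```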
<p>The <code>-I</code> indicates that a spatial index of the raster tiles should be built after import. The spatial index, along with a spatial query that can take advantage of it, can quickly select the subset of tiles that overlap your features of interest. </p>
<p>The other parameters to note:</p>
<ul>
<li><code>-Y</code> uses COPY for more efficient transfer.</li>
<li><code>-d</code> deletes the table if it already exists (useful for testing, but be careful in production).</li>
<li><code>-N</code> defines a nodata value directly at the CLI.</li>
<li><code>-n</code> creates a <code>path</code> column to store the filename.</li>
<li><code>-C</code> applies constraints to ensure valid raster alignment, etc.</li>
<li><code>-M</code> runs <code>VACUUM ANALYZE</code> on the table as a final step.</li>
</ul>
<h3>Load the vector data</h3>
<p>Use the standard <code>shp2pgsql</code>, with <code>-I</code> to build an index:</p>
<div class="highlight"><pre><span></span><code>shp2pgsql -g geometry -I -s <span class="m">4326</span> ne_50m_admin_0_countries.shp countries <span class="p">|</span> psql
</code></pre></div>
<h3>Run the query</h3>
<p>Now we have two tables loaded, <code>countries</code> and <code>tavg_07</code>, and can ask our question in SQL:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"></span>
<span class="w"> </span><span class="p">(</span><span class="n">ST_SummaryStatsAgg</span><span class="p">(</span><span class="n">ST_Clip</span><span class="p">(</span><span class="n">raster</span><span class="p">.</span><span class="n">rast</span><span class="p">,</span><span class="w"> </span><span class="n">countries</span><span class="p">.</span><span class="n">geometry</span><span class="p">,</span><span class="w"> </span><span class="k">true</span><span class="p">),</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="k">true</span><span class="p">)).</span><span class="o">*</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">countries</span><span class="p">.</span><span class="n">name</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">countries</span><span class="p">.</span><span class="n">geometry</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">geometry</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">n_tiles</span><span class="w"></span>
<span class="k">FROM</span><span class="w"></span>
<span class="w"> </span><span class="n">tavg_07</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">raster</span><span class="w"></span>
<span class="k">INNER</span><span class="w"> </span><span class="k">join</span><span class="w"> </span><span class="n">countries</span><span class="w"> </span><span class="k">on</span><span class="w"></span>
<span class="w"> </span><span class="n">ST_INTERSECTS</span><span class="p">(</span><span class="n">countries</span><span class="p">.</span><span class="n">geometry</span><span class="p">,</span><span class="w"> </span><span class="n">raster</span><span class="p">.</span><span class="n">rast</span><span class="p">)</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"></span>
<span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">geometry</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>I added the <code>GROUP BY</code> to aggregate across tiles; otherwise we'd get multiple rows per country. And on the SELECT side, PostGIS provides the <code>ST_SummaryStatsAgg</code> function (the aggregate variant of <code>ST_SummaryStats</code>) to combine the statistics across tiles.</p>
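<p>The reason an aggregate variant is needed at all: per-tile summaries can be merged into exact global statistics as long as each tile reports its count, sum, and sum of squares. A small sketch of that idea (this mirrors the concept, not the actual PostGIS internals):</p>

```python
import math

def merge_tile_stats(tiles):
    """Merge per-tile (count, sum, sum_of_squares) into global summary stats."""
    n = sum(t[0] for t in tiles)
    s = sum(t[1] for t in tiles)
    ss = sum(t[2] for t in tiles)
    mean = s / n
    # population standard deviation, via E[x^2] - E[x]^2
    stddev = math.sqrt(ss / n - mean * mean)
    return {"count": n, "sum": s, "mean": mean, "stddev": stddev}

# Three hypothetical tiles clipped to one country:
tile_values = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
tiles = [(len(v), sum(v), sum(x * x for x in v)) for v in tile_values]
print(merge_tile_stats(tiles))
```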
<p>Here's the resulting map data rendered via DBeaver. The <code>count</code> is the number of raster <em>pixels</em> intersecting the feature, while <code>n_tiles</code> is the number of raster <em>tiles</em>. The <code>mean</code> is probably what we're interested in: the average temperature.</p>
<p><img src="assets/img/dbeaver-zonal-results.png"></p>
<p>Here's the bottom line on performance: <strong>PostGIS can perform this query in 6.1s</strong>, even marginally faster than the Python rasterstats version. It could be that the latest improvements in the geospatial stack account for some of this effect, but tiling clearly matters to performance.</p>
<h2>Effect of tile size</h2>
<p>The chosen value of <code>-t</code> determines how much data fits into each tile. There's an unavoidable inverse relationship between the size of a row and the number of rows/tiles. Not surprisingly, we find a tradeoff between those two constraints.</p>
<table>
<thead>
<tr>
<th>tilesize</th>
<th>query (s)</th>
<th>raster2pgsql <br> import (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>64x64</td>
<td>5.9</td>
<td>58.7</td>
</tr>
<tr>
<td>256x256</td>
<td>6.6</td>
<td>15.8</td>
</tr>
<tr>
<td>1024x1024</td>
<td>8.5</td>
<td>7.3</td>
</tr>
<tr>
<td>untiled</td>
<td>49.2</td>
<td>5.2</td>
</tr>
</tbody>
</table>
<p>Smaller tiles with a spatial index mean more efficient queries, at the expense of pre-chopping the raster into many tiles. Depending on the nature of your analysis, you'll want to adjust accordingly. The optimal tilesize is likely to depend on hardware, the tiling patterns of the original data, and the usage patterns you expect.</p>
<p>For this dataset, somewhere around 256x256 appears to be an optimal size. It would make a good default, providing the benefits of tiling without as much import overhead as smaller tiles.</p>
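<p>The row-count side of this tradeoff is simple arithmetic. Here's a quick sketch; the 3600x3600 raster dimensions and the <code>tile_count</code> helper are hypothetical, for illustration only, not taken from the benchmark above:</p>

```python
import math


def tile_count(width, height, tilesize):
    """Number of tiles (i.e. database rows) produced by chopping a
    width x height raster into tilesize x tilesize blocks."""
    return math.ceil(width / tilesize) * math.ceil(height / tilesize)


# Hypothetical 3600x3600 raster, roughly a 1x1 degree tile at 1 arcsecond
for ts in (64, 256, 1024):
    print(f"{ts}x{ts}: {tile_count(3600, 3600, ts)} rows")
```

<p>Halving the tile edge roughly quadruples the row count, which is consistent with the import times in the table above.</p>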
<p>Surprisingly, the <strong>untiled version</strong> still performs <em>ok</em> relative to the Python code. The query on an untiled raster is "only" 7.5x slower than the Python code, not as bad as the 80x performance hit I found in part 1. While this factor seems highly dependent on the data at hand, the conclusion doesn't change - tiling matters.</p>
<h2>Conclusion</h2>
<p>Use <code>raster2pgsql -t 256x256 -I</code> to tile your PostGIS rasters. Combined with aggregate functions and spatial indexes, you get similar zonal stats query functionality and performance from PostGIS as you would with equivalent single-threaded Python/GDAL approaches.</p>
<p>There's still much to be explored regarding optimal tiling, parallel aggregates, out-of-band rasters, and the impact of source raster data file layout on performance. More to come in part 3... </p>Zonal Stats with PostGIS Rasters2018-12-31T00:00:00-07:002018-12-31T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2018-12-31:/zonal-stats-with-postgis-rasters.html<p><strong>Zonal statistics</strong> is a technique to summarize the values of a raster dataset
overlapped by a set of vector geometries.
The analysis can answer queries such as
"<em>Average elevation of each nation park</em>" or "<em>Maximum temperature by state</em>".</p>
<p>My goal in this article is to demonstrate a PostGIS implementation of …</p><p><strong>Zonal statistics</strong> is a technique to summarize the values of a raster dataset
overlapped by a set of vector geometries.
The analysis can answer queries such as
"<em>Average elevation of each nation park</em>" or "<em>Maximum temperature by state</em>".</p>
<p>My goal in this article is to demonstrate a PostGIS implementation of zonal stats and
compare the results and runtime performance to a reference Python implementation.</p>
<ul>
<li>Python with the <a href="https://github.com/perrygeo/python-rasterstats"><code>rasterstats</code> library</a> using GeoTIFF and GeoJSON files.</li>
<li>SQL queries using <a href="https://postgis.net/docs/RT_reference.html">PostGIS raster</a> and vector tables.</li>
</ul>
<h2>The Dataset</h2>
<p>For the raster data, let's use the <a href="https://www.eorc.jaxa.jp/ALOS/en/aw3d30/index.htm">ALOS Global Digital Surface Model</a> (from the Japan Aerospace Exploration Agency ©JAXA). I picked a 1°x1° tile with 1 arcsecond resolution (roughly 30 meters) in <code>GeoTIFF</code> format.</p>
<p>Next, generate 100 random circular polygon features covering the extent of the raster.
The following Python script shows how to do so with the Rasterio and Shapely libs.</p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">rasterio</span>
<span class="kn">from</span> <span class="nn">shapely.geometry</span> <span class="kn">import</span> <span class="n">Point</span>
<span class="k">def</span> <span class="nf">random_features_for_raster</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">steps</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
<span class="k">with</span> <span class="n">rasterio</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="k">as</span> <span class="n">src</span><span class="p">:</span>
<span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">y2</span> <span class="o">=</span> <span class="n">src</span><span class="o">.</span><span class="n">bounds</span>
<span class="n">xs</span> <span class="o">=</span> <span class="p">[</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">)]</span>
<span class="n">ys</span> <span class="o">=</span> <span class="p">[</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="n">y2</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">ys</span><span class="p">)):</span>
<span class="n">buffdist</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">0.002</span><span class="p">,</span> <span class="mf">0.04</span><span class="p">)</span>
<span class="n">shape</span> <span class="o">=</span> <span class="n">Point</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">buffer</span><span class="p">(</span><span class="n">buffdist</span><span class="p">)</span>
<span class="k">yield</span> <span class="p">{</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="s2">"Feature"</span><span class="p">,</span>
<span class="s2">"properties"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"name"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)},</span>
<span class="s2">"geometry"</span><span class="p">:</span> <span class="n">shape</span><span class="o">.</span><span class="n">__geo_interface__</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="k">for</span> <span class="n">feat</span> <span class="ow">in</span> <span class="n">random_features_for_raster</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span>
<span class="nb">print</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">feat</span><span class="p">))</span>
</code></pre></div>
<p>Piping the features through <code>fio collect</code> gives us a valid GeoJSON collection with 100 polygon features.</p>
<div class="highlight"><pre><span></span><code><span class="n">python</span><span class="w"> </span><span class="n">make</span><span class="o">-</span><span class="n">random</span><span class="o">-</span><span class="n">features</span><span class="p">.</span><span class="n">py</span><span class="w"> </span><span class="n">N035W106_AVE_DSM</span><span class="p">.</span><span class="n">tif</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">fio</span><span class="w"> </span><span class="n">collect</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">regions</span><span class="p">.</span><span class="n">geojson</span><span class="w"></span>
</code></pre></div>
<p>Visualizing the data in QGIS shows what we're working with.
The goal is to find basic summary statistics for elevation in each of the regions:</p>
<p><img width=500 src="assets/img/20181231_data.jpg"></p>
<h2>Python with <code>rasterstats</code></h2>
<p>Using the <code>zonal_stats</code> Python function allows you to express the processing at a high level.</p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/usr/bin/env</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">rasterstats</span> <span class="kn">import</span> <span class="n">zonal_stats</span>
<span class="n">features</span> <span class="o">=</span> <span class="n">zonal_stats</span><span class="p">(</span>
<span class="s2">"regions.geojson"</span>
<span class="s2">"N035W106_AVE_DSM.tif"</span>
<span class="n">stats</span><span class="o">=</span><span class="p">[</span><span class="s2">"count"</span><span class="p">,</span> <span class="s2">"sum"</span><span class="p">,</span> <span class="s2">"mean"</span><span class="p">,</span> <span class="s2">"std"</span><span class="p">,</span> <span class="s2">"min"</span><span class="p">,</span> <span class="s2">"max"</span><span class="p">],</span>
<span class="n">prefix</span><span class="o">=</span><span class="s2">"dem_"</span><span class="p">,</span>
<span class="n">geojson_out</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"regions_with_elevation.geojson"</span><span class="p">,</span> <span class="s2">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">dst</span><span class="p">:</span>
<span class="n">collection</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="s2">"FeatureCollection"</span><span class="p">,</span>
<span class="s2">"features"</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span><span class="n">features</span><span class="p">)}</span>
<span class="n">dst</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">collection</span><span class="p">))</span>
</code></pre></div>
<p>Running this script takes about <strong>2.4 seconds</strong> and
creates a new GeoJSON file, <code>regions_with_elevation.geojson</code>, with the following attributes, as viewed in QGIS:</p>
<p><img width=500 src="assets/img/20181231_attrs.jpg"></p>
<p>And the resulting features can be mapped, in this case using the <code>dem_mean</code> field to show the average elevation of each
region:</p>
<p><img width=500 src="assets/img/20181231_elev.jpg"></p>
<h2>PostGIS</h2>
<p>Instead of working with GeoTIFF rasters and GeoJSON files, we can perform the same analysis on PostGIS tables using SQL.</p>
<h3>Loading the data</h3>
<p>To create a raster table named <code>dem</code> from the GeoTIFF:</p>
<div class="highlight"><pre><span></span><code>raster2pgsql N035W106_AVE_DSM.tif dem | psql <connection info>
</code></pre></div>
<p>For some rasters, it might be necessary to explicitly set the nodata value:</p>
<div class="highlight"><pre><span></span><code><span class="k">UPDATE</span><span class="w"> </span><span class="n">dem</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">rast</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ST_SetBandNoDataValue</span><span class="p">(</span><span class="n">rast</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mi">32768</span><span class="p">);</span><span class="w"></span>
</code></pre></div>
<p>To create a vector table named <code>regions</code> from the GeoJSON file (see the <a href="https://www.gdal.org/drv_pg.html"><code>ogr2ogr</code> docs</a> for details on the connection info):</p>
<div class="highlight"><pre><span></span><code><span class="n">ogr2ogr</span><span class="w"> </span><span class="o">-</span><span class="n">f</span><span class="w"> </span><span class="n">PostgreSQL</span><span class="w"> </span><span class="nl">PG:</span><span class="s">"<connection info>"</span><span class="w"> </span><span class="n">regions</span><span class="p">.</span><span class="n">geojson</span><span class="w"></span>
</code></pre></div>
<h3>Zonal Statistics in SQL</h3>
<p>Now we can express our zonal stats analysis as a SQL statement.</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"></span>
<span class="w"> </span><span class="c1">-- provides: count | sum | mean | stddev | min | max</span>
<span class="w"> </span><span class="p">(</span><span class="n">ST_SummaryStats</span><span class="p">(</span><span class="n">ST_Clip</span><span class="p">(</span><span class="n">dem</span><span class="p">.</span><span class="n">rast</span><span class="p">,</span><span class="w"> </span><span class="n">regions</span><span class="p">.</span><span class="n">wkb_geometry</span><span class="p">,</span><span class="w"> </span><span class="k">TRUE</span><span class="p">))).</span><span class="o">*</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">regions</span><span class="p">.</span><span class="n">name</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">regions</span><span class="p">.</span><span class="n">wkb_geometry</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">geometry</span><span class="w"></span>
<span class="k">INTO</span><span class="w"></span>
<span class="w"> </span><span class="n">regions_with_elevation</span><span class="w"></span>
<span class="k">FROM</span><span class="w"></span>
<span class="w"> </span><span class="n">dem</span><span class="p">,</span><span class="w"> </span><span class="n">regions</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>Let's break that down a bit:</p>
<ul>
<li><code>FROM dem, regions</code> does a full cross join of the 100 regions against the single raster row.</li>
<li>The <code>ST_Clip</code> function clips each raster to the precise geometry of each feature.</li>
<li>The <code>ST_SummaryStats</code> function summarizes each clipped raster, producing count, sum, mean, standard deviation, min and max columns.</li>
<li><code>INTO regions_with_elevation</code> creates a new table with the results.</li>
</ul>
<p>Conceptually, this approach is similar to the internal process used by <code>rasterstats</code>.</p>
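<p>As a rough illustration of that internal process, here is a simplified numpy sketch of the clip-and-summarize step. It assumes the feature geometry has already been rasterized into a boolean mask (the part both tools delegate to GDAL's rasterization), and <code>summary_stats</code> is a hypothetical name, not a real API:</p>

```python
import numpy as np


def summary_stats(raster, mask, nodata=None):
    """Summarize the raster cells covered by a boolean zone mask,
    mirroring what ST_Clip + ST_SummaryStats produce."""
    values = raster[mask]  # "clip": keep only pixels inside the zone
    if nodata is not None:
        values = values[values != nodata]  # drop nodata pixels
    return {
        "count": int(values.size),
        "sum": float(values.sum()),
        "mean": float(values.mean()),
        "stddev": float(values.std()),
        "min": float(values.min()),
        "max": float(values.max()),
    }


# Toy 4x4 "DEM"; the zone mask covers its left half
dem = np.arange(16, dtype=float).reshape(4, 4)
zone = np.zeros((4, 4), dtype=bool)
zone[:, :2] = True
stats = summary_stats(dem, zone)  # count=8, mean=6.5
```

<p>Both implementations boil down to this: rasterize the geometry, mask the pixel values, and reduce.</p>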
<div class="highlight"><pre><span></span><code><span class="n">database</span><span class="o">=</span><span class="p">#</span><span class="w"> </span><span class="n">SELECT</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">regions_with_elevation</span><span class="p">;</span><span class="w"></span>
<span class="n">name</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">count</span><span class="w"></span>
<span class="o">-----+------+------+------------------+-------</span><span class="w"></span>
<span class="p">...</span><span class="w"></span>
<span class="mh">32</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2104</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2196</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">2141.13257847212</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">6977</span><span class="w"></span>
<span class="mh">33</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2296</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2667</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">2429.01510429154</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">4171</span><span class="w"></span>
<span class="mh">34</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">1784</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">1917</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">1852.97140948564</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">7485</span><span class="w"></span>
<span class="mh">35</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2033</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2144</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">2083.38765260393</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">51768</span><span class="w"></span>
<span class="mh">36</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">1796</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">1843</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">1828.69792802617</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">917</span><span class="w"></span>
<span class="mh">37</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2072</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2206</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">2122.1204719764</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">8475</span><span class="w"></span>
<span class="mh">38</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2117</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2214</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">2152.05270513076</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">5009</span><span class="w"></span>
<span class="mh">39</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">1915</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">2071</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mf">2040.61622890496</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="mh">15762</span><span class="w"></span>
<span class="p">...</span><span class="w"></span>
</code></pre></div>
<p>Compared to the attribute table screenshot above, the results are identical for all columns.
That isn't too surprising given that both approaches use GDAL's rasterization API under the hood.</p>
<p>Performance is a different story. The zonal stats query took <strong>81.90 seconds</strong>, roughly 34x slower than the Python code for
the equivalent result.</p>
<h2>Thoughts</h2>
<p>In terms of the expressiveness of the two approaches, I can see the appeal of both Python code and SQL queries.
Of course, this comes down to personal preference, depending on your background and familiarity with each environment.
The Python API hides the implementation details and is more flexible, with more statistics options and rasterization strategies.
But the SQL approach covers the common use case in a declarative query; it exposes the implementation
details yet remains very readable.</p>
<p>The performance impact is significant enough to be a deal breaker for PostGIS.
I haven't delved into the issues too closely;
there might be some obvious ways to optimize this query, but I haven't found any as of this writing.
PostGIS experts, please get in touch if you find any speedups that I could consider here!</p>
<p>Performance combined with the additional overhead of managing postgres instances and data imports
tells me that running zonal stats in PostGIS will not be a great option unless you're already running PostgreSQL.
If your application is already committed to postgres and you want to integrate zonal stats tightly into
your data management strategy, it could be a viable approach.
For example, you could create a <code>TRIGGER</code> or an asynchronous worker via <code>LISTEN/NOTIFY</code> to ensure zonal statistics are run each time a new feature is inserted into your vector table.</p>
<p>For most other zonal stats use cases, using <code>rasterstats</code> against local files or in-memory Python data will be faster with less data management overhead.</p>Processing vector features in Python2016-04-16T00:00:00-06:002016-04-16T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2016-04-16:/processing-vector-features-in-python.html<p>Working with geospatial vector data typically involves manipulating collections of <strong>features</strong> - points, lines and polygons with attributes. You might change their geometries, alter their properties or both. This is nothing new. Tools like this have been around since the first days of GIS.
Notice the essential role of many of …</p><p>Working with geospatial vector data typically involves manipulating collections of <strong>features</strong> - points, lines and polygons with attributes. You might change their geometries, alter their properties or both. This is nothing new. Tools like this have been around since the first days of GIS.
Notice the essential role of many of these operations: taking vector data as input, doing some work and producing vector data as output.
While conceptually very simple, this logic often gets siloed, tied too closely to our specific implementations, formats, and systems.</p>
<p>The following is my take on the best practices for designing and building
your own vector processing modules using modern Python. The goals here are not primarily
performance but <strong>interoperability</strong> and <strong>composability</strong>.</p>
<h2>GeoJSON guides the way</h2>
<p>Using <strong>GeoJSON-like Feature mappings</strong> as a representation of simple features buys us a ton of interoperability.
It's not only <em>a</em> standard, but the only one that can be translated to fully represent a feature as a Python data structure.
Other standards specify file formats or data structures for geometries only.
Most Python modules that deal with geospatial data can speak GeoJSON-like data.
And if they don't, the data structure is easy to construct manually.
Let's take a look at our humble Feature:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span>
<span class="s1">'type'</span><span class="p">:</span> <span class="s1">'Feature'</span><span class="p">,</span>
<span class="s1">'properties'</span><span class="p">:</span> <span class="p">{</span>
<span class="s1">'name'</span><span class="p">:</span> <span class="s1">'Example'</span><span class="p">},</span>
<span class="s1">'geometry'</span><span class="p">:</span> <span class="p">{</span>
<span class="s1">'type'</span><span class="p">:</span> <span class="s1">'Point'</span><span class="p">,</span>
<span class="s1">'coordinates'</span><span class="p">:</span> <span class="p">[</span><span class="o">-</span><span class="mf">120.0</span><span class="p">,</span> <span class="mf">42.0</span><span class="p">]}}</span>
</code></pre></div>
<p>The <code>geometry</code>, the geographic component, is just iterables of lon, lat locations - you can represent points, lines, polygons or multis. The <code>properties</code> dictionary holds non-geographic information about the features, analogous to the "attribute table" in many GIS.</p>
<p>A quick note on the term "GeoJSON-like Feature mapping"... GeoJSON is a text serialization format. When we take GeoJSON and translate it into a Python data structure, it is no longer GeoJSON but a Python dictionary (mapping) that follows the semantics of a GeoJSON Feature. From here on out, I'll just refer to this GeoJSON-like Python data structure as a <strong>feature</strong>. If you're writing functions that work with vector data, they should accept and return features.</p>
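<p>The boundary between the two is just <code>json.loads</code> and <code>json.dumps</code>:</p>

```python
import json

# GeoJSON: text on the wire or on disk
text = '{"type": "Feature", "properties": {"name": "Example"}, "geometry": {"type": "Point", "coordinates": [-120.0, 42.0]}}'

# A GeoJSON-like feature mapping: a plain Python dict
feature = json.loads(text)
assert feature["geometry"]["coordinates"] == [-120.0, 42.0]

# ...and back to GeoJSON text for serialization
round_tripped = json.dumps(feature)
```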
<p><strong>That's the convention</strong>, the golden rule of writing Python vector processing functions</p>
<blockquote>
<p>Functions should take features as inputs and yield or return features.</p>
</blockquote>
<p>In other words, <strong>features in, features out</strong>. That's it. It's really that simple, and the simplicity buys you a great deal of potential.</p>
<h2>The IO Sandwich</h2>
<p>Functions which fit this convention will not read or write to anything outside of locally-scoped variables.
Does your function need to read from a file or write to the network in addition to processing features?
Why should one function be responsible for doing multiple tasks? We're striving for functions that do <em>one</em> thing - process vector features.</p>
<p>All the data your function needs should be passed in as arguments. Note that this is very different than passing in a file <em>path</em> and doing the reading and writing of data within your function:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># BAD</span>
<span class="n">process_features</span><span class="p">(</span><span class="s2">"/path/to/shapefile.shp"</span><span class="p">,</span> <span class="n">output</span><span class="o">=</span><span class="s2">"out.shp"</span><span class="p">)</span>
<span class="c1"># GOOD</span>
<span class="n">features</span> <span class="o">=</span> <span class="n">read_features</span><span class="p">(</span><span class="s2">"/path/to/shapefile.shp"</span><span class="p">)</span>
<span class="n">new_features</span> <span class="o">=</span> <span class="n">process_features</span><span class="p">(</span><span class="n">features</span><span class="p">)</span>
<span class="n">write_features</span><span class="p">(</span><span class="n">new_features</span><span class="p">,</span> <span class="n">output</span><span class="o">=</span><span class="s2">"out.shp"</span><span class="p">)</span>
</code></pre></div>
<p>You might be concerned about memory. But don't worry, well-behaved Python libraries can use <a href="https://wiki.python.org/moin/Generators">generators</a> to load the data as needed.</p>
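<p>For example, a reader for newline-delimited GeoJSON can be a lazy generator that holds only one feature in memory at a time. The <code>read_features</code> name here is a hypothetical sketch, not a library function:</p>

```python
import json


def read_features(path):
    """Lazily yield GeoJSON-like feature mappings from a
    newline-delimited GeoJSON file."""
    with open(path) as src:
        for line in src:
            line = line.strip()
            if line:
                yield json.loads(line)
```

<p>Because it yields instead of returning a list, downstream processing functions can consume features one at a time, no matter how large the file is.</p>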
<p>Another way to picture it is that your application should build an <strong>IO sandwich</strong> with all of the reading and writing happening outside of your processing function.</p>
<div class="highlight"><pre><span></span><code><span class="n">Read</span><span class="w"> </span><span class="n">Shapefile</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="n">Features</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">process_features</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">Write</span><span class="w"> </span><span class="n">Features</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">Shapefile</span><span class="w"></span>
</code></pre></div>
<p>That way anyone can use the same function with different inputs and outputs</p>
<div class="highlight"><pre><span></span><code><span class="n">Read</span><span class="w"> </span><span class="n">Web</span><span class="w"> </span><span class="kr">Service</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="n">Features</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">process_features</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">Write</span><span class="w"> </span><span class="n">Features</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">PostGIS</span><span class="w"></span>
</code></pre></div>
<p>Processing functions should not care where their input features come from or where the output features are going.
As long as <code>process_features</code> takes and returns features, any number of combinations are possible.</p>
<p>This not only decouples IO but allows us to <strong>compose</strong> processes together</p>
<div class="highlight"><pre><span></span><code><span class="n">Read</span><span class="w"> </span><span class="n">Features</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">process1</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">process2</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">process3</span><span class="w"> </span><span class="o">--></span><span class="w"> </span><span class="n">Write</span><span class="w"> </span><span class="n">Features</span><span class="w"></span>
</code></pre></div>
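<p>In code, that composition is just nested generators. A minimal sketch, where <code>tag</code> is a hypothetical stand-in for a real processing stage:</p>

```python
def tag(features, key):
    """A trivial processing stage: features in, features out.
    Returns new dicts rather than mutating the inputs."""
    for feature in features:
        properties = dict(feature["properties"], **{key: True})
        yield dict(feature, properties=properties)


features = [{"type": "Feature", "properties": {}, "geometry": None}]
result = list(tag(tag(features, "stage1"), "stage2"))
# result[0]["properties"] == {"stage1": True, "stage2": True}
```

<p>Each stage is oblivious to the others; you can insert, remove, or reorder stages without touching any IO code.</p>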
<h2>Other guidelines</h2>
<p>When possible, you should strive for <a href="https://en.wikipedia.org/wiki/Pure_function">pure functions</a>; avoid mutating data and return a clean copy.</p>
<p>Unless you have a specific reason, leave the original feature intact except for the thing your function is expected to manipulate. For instance, if your function just alters the geometry, don't drop or change existing properties.</p>
<p>There are some cases where it makes sense to collect your features into a collection and return the entire thing at once. This will generally occur if the features are not independent. In many cases though, your features will largely be independent and can be processed one-by-one. For these situations, it makes sense to use a generator (i.e. <code>yield feature</code> instead of <code>return features</code>).</p>
<p>Finally, you should aim to make your features <em>serializable</em>. You should be able to <code>json.dump()</code> the output features. The <code>properties</code> member should not contain nested dicts, which can confuse GIS formats that require a flat structure. And if possible, avoid extending the JSON with extra elements outside of <code>properties</code>.</p>
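<p>These conventions are easy to enforce with a small validation helper run over your output features. A sketch using only the standard library (the specific checks and the function name are my own, not a formal spec):</p>
<div class="highlight"><pre><span></span><code>import json


def check_feature(feature):
    """Raise if a feature breaks the serializability conventions."""
    # Must round-trip through JSON
    json.dumps(feature)
    # properties must be flat: no nested dicts or lists
    for key, value in feature.get('properties', {}).items():
        if isinstance(value, (dict, list)):
            raise ValueError("property %r is nested, not flat" % key)
    # avoid extra members outside the standard GeoJSON ones
    allowed = {'type', 'geometry', 'properties', 'id', 'bbox'}
    extra = set(feature) - allowed
    if extra:
        raise ValueError("non-standard members: %s" % sorted(extra))


check_feature({'type': 'Feature', 'properties': {'name': 'a'}, 'geometry': None})
</code></pre></div>
<p>Running something like this in a test suite catches a nested dict or stray top-level member long before a GIS format rejects it.</p>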
<h2>An Example</h2>
<p>In this simple example, we'll write a single vector processing function that buffers a geometry by a specified distance.
Taking an input of points, for example:</p>
<p><img src="/assets/img/points.png" ></p>
<p>and buffering them by 10 units.</p>
<p><img src="/assets/img/points_buffered.png"></p>
<p>Here is the core processing function, which follows the <strong>features in, features out</strong> convention:</p>
<div class="highlight"><pre><span></span><code>from shapely.geometry import shape


def buffer(features, buffer=1.0):
    """Buffer a feature by specified units"""
    for feature in features:
        geom = shape(feature['geometry'])  # Convert to shapely geometry to operate on it
        geom = geom.buffer(buffer)  # Buffer
        new_feature = feature.copy()
        new_feature['geometry'] = geom.__geo_interface__
        yield new_feature
</code></pre></div>
<p>Then we could use it in our IO sandwich by reading features from a shapefile and outputting the features as GeoJSON on stdout. Here's what our <strong>Python interface</strong> looks like:</p>
<div class="highlight"><pre><span></span><code>import fiona  # for input
import json  # for output

from process import buffer

with fiona.open("data/points.shp") as src:
    for feature in buffer(src, 10.0):
        print(json.dumps(feature))
</code></pre></div>
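<p>If the consumer wants a single FeatureCollection instead of line-delimited features, the generator output just gets collected at the end. A sketch with an in-memory list standing in for the fiona source (the <code>collect</code> helper is a hypothetical name of my own):</p>
<div class="highlight"><pre><span></span><code>import json


def collect(features):
    """Wrap an iterable of features in a GeoJSON FeatureCollection."""
    return {'type': 'FeatureCollection', 'features': list(features)}


# Stand-in for features read from fiona and run through a process
features = [{'type': 'Feature',
             'properties': {'name': 'a'},
             'geometry': {'type': 'Point', 'coordinates': [0.0, 0.0]}}]
fc = collect(features)
print(json.dumps(fc))
</code></pre></div>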
<p>So the Python interface is looking good. What if we wanted to use it in a <strong>command line interface</strong>? Luckily, click and <a href="https://github.com/mapbox/cligj">cligj</a> have the input covered. The <code>@cligj.features_in_arg</code> decorator reads an iterable of features from a file, a FeatureCollection, or a stream of Features.</p>
<div class="highlight"><pre><span></span><code>import click
import cligj
import json

from process import buffer


@click.command()
@click.argument("distance", type=float)
@cligj.features_in_arg
def buffer_cmd(features, distance):
    for feature in buffer(features, distance):
        click.echo(json.dumps(feature))


if __name__ == "__main__":
    buffer_cmd()
</code></pre></div>
<p>Which we can then use between <code>fio cat</code> and <code>fio collect</code> to process Features in a memory-efficient stream.</p>
<div class="highlight"><pre><span></span><code>$ fio cat data/points.shp <span class="p">|</span> python buffer_cmd.py <span class="m">10</span> <span class="p">|</span> fio collect > points_buffer.geojson
</code></pre></div>
<p>What about an <strong>HTTP interface</strong>? Flask provides us with a lightweight framework to turn our function into a web service:</p>
<div class="highlight"><pre><span></span><code>import json

from flask import Flask, request, Response

from process import buffer

app = Flask(__name__)


@app.route('/buffer/&lt;distance&gt;', methods=['POST'])
def index(distance):
    collection = request.get_json(force=True)
    distance = float(distance)
    new_features = list(
        buffer(collection['features'], distance))
    collection['features'] = new_features
    return Response(
        response=json.dumps(collection),
        status=200, mimetype="application/json")


if __name__ == '__main__':
    app.run(debug=True)
</code></pre></div>
<p>Which gives us a <code>buffer</code> web service to which you can post GeoJSON FeatureCollections and get back a buffered collection:</p>
<div class="highlight"><pre><span></span><code>$ fio dump data/points.shp <span class="p">|</span> <span class="se">\</span>
curl -X POST -d @- http://localhost:5000/buffer/10.0 > points_buffered.geojson
</code></pre></div>
<h2>Conclusion</h2>
<p>Writing your vector processing code to follow these simple conventions
enables great flexibility. You can use your code in a Python application,
a command line interface, an HTTP web service - all based on the same core processing functions.
Assuming you can write some glue code to express input and output as GeoJSON features,
this will work with <em>any</em> vector data source and is not constrained to a single context.
You can use this with any data, anywhere that supports Python. That's a pretty powerful concept,
all made possible by the simple convention of <strong>features in, features out</strong>.</p>Running Python with compiled code on AWS Lambda2015-10-10T00:00:00-06:002015-10-10T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2015-10-10:/running-python-with-compiled-code-on-aws-lambda.html<p>With the recent announcement that AWS Lambda <a href="https://aws.amazon.com/blogs/aws/aws-lambda-update-python-vpc-increased-function-duration-scheduling-and-more/">now supports Python</a>, I decided to take a look at using it for geospatial data processing.</p>
<p>Previously, I had built <a href="https://github.com/Ecotrust/growth-yield-batch">queue-based systems with Celery</a> that allow you to run discrete processing tasks in parallel on AWS infrastructure. Just start up as many workers …</p><p>With the recent announcement that AWS Lambda <a href="https://aws.amazon.com/blogs/aws/aws-lambda-update-python-vpc-increased-function-duration-scheduling-and-more/">now supports Python</a>, I decided to take a look at using it for geospatial data processing.</p>
<p>Previously, I had built <a href="https://github.com/Ecotrust/growth-yield-batch">queue-based systems with Celery</a> that allow you to run discrete processing tasks in parallel on AWS infrastructure. Just start up as many workers on EC2 instances as you need, set up a broker and a results store, add jobs to the queue and collect the results. The problem with this system is that you have to manage all of the infrastructure and services yourself.</p>
<p>Ideally you wouldn't need to worry about infrastructure at all. That is the promise of AWS Lambda. Lambda can respond to events, fire up a worker and run the task without you needing to provision a server. This is especially nice for sporadic workloads in response to events, like user-uploaded data, where you need to scale up or down regularly.</p>
<p>The reality of AWS Lambda is that you <em>do</em> need to worry about infrastructure in a different way. The constraints of the runtime environment mean that you need to get creative if you're doing anything beyond the basics. <strong>If your task relies on compiled code</strong>, either Python C extensions or shared libraries, you have to jump through some hoops. And for any geo data processing, you are going to use a good amount of compiled code to call into C libs (see numpy, rasterio, GDAL, geopandas, Fiona, and so on).</p>
<p>This article describes my approach to solving the problem of running Python with calls to native code on AWS Lambda.</p>
<h2>Outline</h2>
<p>The short version goes like this:</p>
<ol>
<li>Start an <strong>EC2 instance</strong> using the official Amazon Linux AMI (based on Red Hat Enterprise Linux)</li>
<li>On the EC2 instance, build any <strong>shared libraries</strong> from source.</li>
<li>Create a <strong>virtualenv</strong> with all your Python dependencies.</li>
<li>Write a python <strong>handler</strong> function to respond to events and interact with other parts of AWS (e.g. fetch data from S3)</li>
<li>Write a python <strong>worker</strong>, as a command line interface, to process the data</li>
<li><strong>Bundle</strong> the virtualenv, your code and the binary libs into a zip file</li>
<li><strong>Publish</strong> the zip file to AWS Lambda</li>
</ol>
<p>The deployment process is a bit clunky but the benefit is that, once it works, you don't have any servers to manage! A fair tradeoff IMO.</p>
<p>The process will take a raster dataset uploaded to the input s3 bucket</p>
<p><img alt="dem" src="/assets/img/grenada_srtm_raster.png"></p>
<p>and automatically extract the shape of the valid data region, placing the resulting GeoJSON in the output s3 bucket.</p>
<p><img alt="shape" src="/assets/img/grenada_srtm_shape.png"></p>
<h2>Start EC2</h2>
<p>Under the hood, your Lambda functions are running on EC2 with Amazon Linux. You don't have to think about that at runtime but, if you're calling native compiled code, it needs to be compiled on a similar OS. Theoretically you could do this with your own version of RHEL or CentOS but to be safe it's easier to use the official Amazon Linux since we know that's the exact environment our code will be run in.</p>
<p>I'm not going to go over the details of <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html">setting up EC2</a>, so I'll assume we already have our account set up. The AMI ids are listed <a href="https://aws.amazon.com/amazon-linux-ami/">here</a>; pick the appropriate one for your region:</p>
<div class="highlight"><pre><span></span><code>aws ec2 run-instances --image-id ami-9ff7e8af \
--count 1 --instance-type t2.micro \
--key-name your-key --security-groups your-sg
</code></pre></div>
<p>And SSH in:</p>
<div class="highlight"><pre><span></span><code><span class="n">ssh</span><span class="w"> </span><span class="o">-</span><span class="n">i</span><span class="w"> </span><span class="n">your</span><span class="o">-</span><span class="k">key</span><span class="p">.</span><span class="n">pem</span><span class="w"> </span><span class="n">ec2</span><span class="o">-</span><span class="k">user</span><span class="nv">@your</span><span class="p">.</span><span class="k">public</span><span class="p">.</span><span class="n">ip</span><span class="w"></span>
</code></pre></div>
<p>Make sure everything's up to date:</p>
<div class="highlight"><pre><span></span><code>sudo yum -y update
sudo yum -y upgrade
</code></pre></div>
<h2>Build shared libraries from source</h2>
<p>Because your Lambda function will run in a clean AWS linux environment, you can't assume any system libraries will be there. Compiling from source isn't the only option - you could install binaries from the <a href="http://elgis.argeo.org/">Enterprise Linux GIS</a> effort but those tend to be older versions. To get more recent libs, compiling from source is an effective approach.</p>
<p>First, install some compile-time dependencies:</p>
<div class="highlight"><pre><span></span><code>sudo yum install python27-devel python27-pip gcc libjpeg-devel zlib-devel gcc-c++
</code></pre></div>
<p>Then build and install proj4 to a local prefix:</p>
<div class="highlight"><pre><span></span><code>wget https://github.com/OSGeo/proj.4/archive/4.9.2.tar.gz
tar -zvxf 4.9.2.tar.gz
cd proj.4-4.9.2/
./configure --prefix=/home/ec2-user/lambda/local
make
make install
</code></pre></div>
<p>And build GDAL, statically linking proj4:</p>
<div class="highlight"><pre><span></span><code><span class="n">wget</span><span class="w"> </span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">download</span><span class="o">.</span><span class="n">osgeo</span><span class="o">.</span><span class="n">org</span><span class="o">/</span><span class="n">gdal</span><span class="o">/</span><span class="mf">1.11</span><span class="o">.</span><span class="mi">3</span><span class="o">/</span><span class="n">gdal</span><span class="o">-</span><span class="mf">1.11</span><span class="o">.</span><span class="mf">3.</span><span class="n">tar</span><span class="o">.</span><span class="n">gz</span><span class="w"></span>
<span class="n">tar</span><span class="w"> </span><span class="o">-</span><span class="n">xzvf</span><span class="w"> </span><span class="n">gdal</span><span class="o">-</span><span class="mf">1.11</span><span class="o">.</span><span class="mf">3.</span><span class="n">tar</span><span class="o">.</span><span class="n">gz</span><span class="w"></span>
<span class="n">cd</span><span class="w"> </span><span class="n">gdal</span><span class="o">-</span><span class="mf">1.11</span><span class="o">.</span><span class="mi">3</span><span class="w"></span>
<span class="o">./</span><span class="n">configure</span><span class="w"> </span><span class="o">--</span><span class="n">prefix</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">lambda</span><span class="o">/</span><span class="n">local</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="o">--</span><span class="n">with</span><span class="o">-</span><span class="n">geos</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">lambda</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">geos</span><span class="o">-</span><span class="n">config</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="o">--</span><span class="n">with</span><span class="o">-</span><span class="k">static</span><span class="o">-</span><span class="n">proj4</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">lambda</span><span class="o">/</span><span class="n">local</span><span class="w"></span>
<span class="n">make</span><span class="w"></span>
<span class="n">make</span><span class="w"> </span><span class="n">install</span><span class="w"></span>
</code></pre></div>
<p>This should leave us with a nice shared library at <code>/home/ec2-user/lambda/local/lib/libgdal.so.1</code> that can be safely
moved to another Amazon Linux box.</p>
<h2>Create a virtualenv</h2>
<p>Pretty straightforward, but keep in mind that some of the dependencies here are compiled extensions, so these builds are platform-specific - which is why we need to build them on the target Amazon Linux OS.</p>
<div class="highlight"><pre><span></span><code><span class="n">virtualenv</span><span class="w"> </span><span class="n">env</span><span class="w"></span>
<span class="n">source</span><span class="w"> </span><span class="n">env</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">activate</span><span class="w"></span>
<span class="k">export</span><span class="w"> </span><span class="n">GDAL_CONFIG</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">ec2</span><span class="o">-</span><span class="n">user</span><span class="o">/</span><span class="n">lambda</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">gdal</span><span class="o">-</span><span class="n">config</span><span class="w"></span>
<span class="n">pip</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">rasterio</span><span class="w"></span>
</code></pre></div>
<h2>Python handler function</h2>
<p>The handler's job is to respond to the event (e.g. a new file created in an S3 bucket), perform any Amazon-specific tasks (like fetching data from S3) and invoke the worker. Importantly, in the context of this article, the handler
must set <code>LD_LIBRARY_PATH</code> to point to any shared libraries that the worker may need.</p>
<div class="highlight"><pre><span></span><code>import os
import subprocess
import uuid

import boto3

libdir = os.path.join(os.getcwd(), 'local', 'lib')
s3_client = boto3.client('s3')


def handler(event, context):
    results = []
    for record in event['Records']:
        # Find input/output buckets and key names
        bucket = record['s3']['bucket']['name']
        output_bucket = "{}.geojson".format(bucket)
        key = record['s3']['object']['key']
        output_key = "{}.geojson".format(key)

        # Download the raster locally
        download_path = '/tmp/{}{}'.format(uuid.uuid4(), key)
        s3_client.download_file(bucket, key, download_path)

        # Call the worker, setting the environment variables
        command = 'LD_LIBRARY_PATH={} python worker.py "{}"'.format(libdir, download_path)
        output_path = subprocess.check_output(command, shell=True)

        # Upload the output of the worker to S3
        s3_client.upload_file(output_path.strip(), output_bucket, output_key)
        results.append(output_path.strip())
    return results
</code></pre></div>
<p>It's important that the handler function does not import any modules that require
dynamic linking. For example, you cannot <code>import rasterio</code> in the main Python
handler since the dynamic linker doesn't yet know where to look for the GDAL shared library.
You can control the linker paths using the <code>LD_LIBRARY_PATH</code> environment variable,
but only <em>before</em> the process is started, and Lambda doesn't give you any control over the environment variables
of the handler function itself. I
tried hacks like creating new processes within the handler using <code>os.execv</code> or <code>multiprocessing</code> pools, but the user running the lambda function
doesn't have the necessary permissions to do that (both raise <code>OSError</code> - <code>[Errno 13] Permission Denied</code> and <code>[Errno 38] Function not implemented</code> respectively).</p>
<p>Fortunately, Lambda lets you call out to the shell so we can just do our real work through a worker script exposed as a command line interface (details in the next section). While at first this feels clunky, it has the side benefit of forcing separation of your AWS code from your business logic which can be written and tested separately.</p>
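<p>The same shell-out can also be done without string interpolation, by passing an argument list and an explicit environment; this avoids quoting problems with odd file names. A sketch of the pattern (the child command here is a trivial stand-in so the example is self-contained; the real handler would run <code>worker.py</code>):</p>
<div class="highlight"><pre><span></span><code>import os
import subprocess
import sys

libdir = '/tmp/local/lib'  # wherever the bundled shared libraries live

# Copy the current environment and point the dynamic linker at our libs.
# Note that env= replaces the child's environment entirely, hence the copy.
env = dict(os.environ)
env['LD_LIBRARY_PATH'] = libdir

# In the real handler this would be [sys.executable, 'worker.py', download_path];
# here a stand-in child just echoes the variable back.
output = subprocess.check_output(
    [sys.executable, '-c', "import os; print(os.environ['LD_LIBRARY_PATH'])"],
    env=env)
print(output.decode().strip())
</code></pre></div>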
<h2>Worker</h2>
<p>The worker script can be written in any language, compiled or interpreted, so long as it follows the basic rules of command line interfaces. We're using Python in the handler to set up the appropriate environment. For this example, the worker will <em>also</em> be written in Python because of its awesome support for geospatial data processing. But it could be written in Bash or C or just about anything, so long as its runtime environment can be configured with environment variables and arguments.</p>
<p>In this case, the handler is calling <code>worker.py</code> which looks like:</p>
<div class="highlight"><pre><span></span><code>import json
import sys
from tempfile import NamedTemporaryFile

import rasterio
from rasterio import features


def raster_shape(raster_path):
    with rasterio.open(raster_path) as src:
        # read the first band and create a binary mask
        arr = src.read(1)
        ndv = src.nodata
        binarray = (arr == ndv).astype('uint8')

        # extract shapes from raster
        shapes = features.shapes(binarray, transform=src.transform)

        # create geojson feature collection
        fc = {
            'type': 'FeatureCollection',
            'features': []}
        for geom, val in shapes:
            if val == 0:  # not nodata, i.e. valid data
                feature = {
                    'type': 'Feature',
                    'properties': {'name': raster_path},
                    'geometry': geom}
                fc['features'].append(feature)

    # Write to file
    with NamedTemporaryFile(suffix=".geojson", delete=False) as temp:
        temp.file.write(json.dumps(fc))
    return temp.name


if __name__ == "__main__":
    in_path = sys.argv[1]
    out_path = raster_shape(in_path)
    print(out_path)
</code></pre></div>
<p>Notice how the worker itself has no knowledge of AWS events or S3 - it works entirely on the local filesystem and thus can be used in other contexts and tested much more easily.</p>
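<p>Because the worker never touches S3, its core logic can be exercised with plain Python objects. Here's a minimal sketch of how that might look; the <code>shapes_to_fc</code> helper is hypothetical, simply mirroring the loop above so it can be tested without rasterio or a real raster:</p>

```python
import json

def shapes_to_fc(shapes, name):
    # Build a GeoJSON FeatureCollection from (geometry, value) pairs,
    # keeping only the shapes whose mask value marks valid data.
    fc = {'type': 'FeatureCollection', 'features': []}
    for geom, val in shapes:
        if val == 0:  # not nodata, i.e. valid data
            fc['features'].append({
                'type': 'Feature',
                'properties': {'name': name},
                'geometry': geom})
    return fc

# Fake (geometry, value) pairs, shaped like what rasterio's
# features.shapes generator would yield:
shapes = [
    ({'type': 'Polygon', 'coordinates': [[[0, 0], [1, 0], [1, 1], [0, 0]]]}, 0),
    ({'type': 'Polygon', 'coordinates': [[[2, 2], [3, 2], [3, 3], [2, 2]]]}, 1),
]
fc = shapes_to_fc(shapes, 'test.tif')
assert len(fc['features']) == 1          # only the valid-data shape survives
assert json.loads(json.dumps(fc)) == fc  # round-trips as valid JSON
```

<p>The same function could then be shared between the Lambda worker and any local batch script or test suite.</p>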
<h2>Bundle</h2>
<p>In order to deploy to Lambda, you need to package it up in a zip file in a slightly unusual manner. All of your Python packages and your handler script should be at the root, while the shared libraries can be put in a directory (<code>local/lib</code> in this case).</p>
<div class="highlight"><pre><span></span><code>cd ~/lambda
zip -9 bundle.zip handler.py
zip -r9 bundle.zip worker.py
zip -r9 bundle.zip local/lib/libgdal.so.1
cd $VIRTUAL_ENV/lib/python2.7/site-packages
zip -r9 ~/lambda/bundle.zip *
cd $VIRTUAL_ENV/lib64/python2.7/site-packages
zip -r9 ~/lambda/bundle.zip *
</code></pre></div>
<h2>Publish</h2>
<p>The details of setting up a Lambda function are far too verbose for this article - I would suggest running through the <a href="http://docs.aws.amazon.com/lambda/latest/dg/python-walkthrough-s3-events-adminuser.html">AWS S3 walkthrough</a> to get the basic S3 example working first. Then use the AWS CLI to update your existing Lambda function:</p>
<div class="highlight"><pre><span></span><code>aws lambda update-function-code \
--function-name testfunc1 \
--zip-file fileb://bundle.zip
</code></pre></div>
<h1>The end result</h1>
<p>Uploading a raster dataset to your S3 bucket should now trigger the Lambda function which will create a new GeoJSON in the output bucket. All automatically invoked based on the S3 events and completely scalable without having to worry about managing or provisioning servers. Nifty!</p>
<p>The worker and handler code above are intentionally kept short to be more readable. In real usage they would need significantly more error handling and conditionals to handle edge cases, malformed inputs, etc.</p>
<p>It occurred to me after writing this that there really is nothing Python-specific about this approach - the handler could just as easily have been written in Javascript and the worker in some other language. But this should provide a general approach for incorporating native code of any sort in AWS Lambda.</p>
<p>It remains to be seen if this approach is faster or cheaper than a queue-based system with autoscaled EC2 instances. If you're doing a constantly-high workload with lots of data, it's probably safe to say that Lambda is not appropriate. If you're doing sporadic workloads with some discrete processing task based on user-uploaded data, Lambda might be the ticket. The primary advantage is not necessarily speed or cost but reduced infrastructure complexity and hands-off autoscaling.</p>Python affine transforms2015-09-13T00:00:00-06:002015-09-13T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2015-09-13:/python-affine-transforms.html<p><em>Raster data coordinate handling with 6-element geotransforms is a pain. Use the <a href="https://github.com/sgillies/affine">affine</a> Python library instead.</em></p>
<p>The typical geospatial coordinate reference system is defined on a cartesian plane with the 0,0 origin in the bottom left and X and Y increasing as you go up and to the right …</p><p><em>Raster data coordinate handling with 6-element geotransforms is a pain. Use the <a href="https://github.com/sgillies/affine">affine</a> Python library instead.</em></p>
<p>The typical geospatial coordinate reference system is defined on a cartesian plane with the 0,0 origin in the bottom left and X and Y increasing as you go up and to the right. But raster data, coming from its image processing origins, uses a different referencing system to access pixels. We refer to rows and columns with the 0,0 origin in the upper left; rows increase as you move <em>down</em> while columns increase as you go right. Still a cartesian plane but not the same one.
<img alt="xyrowcol" src="/assets/img/xyrowcol.png"></p>
<p>So how do you transform between the two? <a href="https://en.wikipedia.org/wiki/Transformation_matrix#Affine_transformations">Affine transformations</a> provide a simple way to do it through the use of matrix algebra. Geospatial software of all varieties use an affine transform (sometimes referred to as "geotransform") to go from raster rows/columns to the x/y of the coordinate reference system. Converting from x/y back to row/col uses the inverse of the affine transform. Of course the software implementations vary widely.</p>
<p>For the remainder, I'll assume the simple case of a non-rotated "north up" raster as that is by far the most common case. </p>
<p>If you're coming from the matrix algebra perspective, you can ignore the constants in the affine matrix and refer to the six parameters as <code>a, b, c, d, e, f</code>. This is the ordering and notation used by the <a href="https://github.com/sgillies/affine">affine</a> Python library.</p>
<ul>
<li><strong>a</strong> = width of a pixel</li>
<li><strong>b</strong> = row rotation (typically zero)</li>
<li><strong>c</strong> = x-coordinate of the upper-left corner of the upper-left pixel</li>
<li><strong>d</strong> = column rotation (typically zero)</li>
<li><strong>e</strong> = height of a pixel (typically negative)</li>
<li><strong>f</strong> = y-coordinate of the upper-left corner of the upper-left pixel</li>
</ul>
<p>Perhaps the most pervasive implementation of affine transform encoding in the GIS world is the <a href="http://webhelp.esri.com/arcims/9.3/General/topics/author_world_files.htm">ESRI World File</a>. The world file is a simple text file accompanying any raster image which uses six line-separated values in this order:</p>
<ul>
<li><strong>a</strong> = width of a pixel</li>
<li><strong>d</strong> = column rotation (typically zero)</li>
<li><strong>b</strong> = row rotation (typically zero)</li>
<li><strong>e</strong> = height of a pixel (typically negative)</li>
<li><strong>c</strong> = x-coordinate of the <em>center</em> of the upper-left pixel</li>
<li><strong>f</strong> = y-coordinate of the <em>center</em> of the upper-left pixel</li>
</ul>
<p>It's important to note that the <strong>c</strong> and <strong>f</strong> parameters refer to the center of the cell, not the origin!</p>
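<p>Converting between the two conventions is just a half-pixel shift along each axis. A quick sketch with made-up values for a north-up raster:</p>

```python
# Made-up north-up transform: 30m square pixels,
# upper-left *corner* of the upper-left pixel at (100000, 200000).
a, e = 30.0, -30.0               # pixel width, pixel height (negative)
corner_c, corner_f = 100000.0, 200000.0

# A world file stores the *center* of the upper-left pixel instead:
center_c = corner_c + a / 2      # 100015.0
center_f = corner_f + e / 2      # 199985.0

# ...and shifting back recovers the corner-based origin:
assert (center_c - a / 2, center_f - e / 2) == (corner_c, corner_f)
```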
<p>GDAL also uses the 6-parameter transform, in yet another order, with its "Geotransform" array:</p>
<ul>
<li><strong>c</strong> = x-coordinate of the upper-left corner of the upper-left pixel</li>
<li><strong>a</strong> = width of a pixel</li>
<li><strong>b</strong> = row rotation (typically zero)</li>
<li><strong>f</strong> = y-coordinate of the upper-left corner of the upper-left pixel</li>
<li><strong>d</strong> = column rotation (typically zero)</li>
<li><strong>e</strong> = height of a pixel (typically negative)</li>
</ul>
<p>None of those orderings are particularly intuitive but at least the first, as implemented by <code>affine</code>, is "correct" from the matrix algebra perspective. </p>
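<p>To make the three orderings concrete, here is one made-up north-up transform expressed in each convention (plain tuples, purely illustrative):</p>

```python
# Six parameters in the affine library's (matrix algebra) order:
a, b, c = 30.0, 0.0, 100000.0    # pixel width, row rotation, x of UL corner
d, e, f = 0.0, -30.0, 200000.0   # col rotation, pixel height, y of UL corner

affine_order = (a, b, c, d, e, f)
gdal_order = (c, a, b, f, d, e)                  # GDAL Geotransform order
world_file = (a, d, b, e,
              c + a / 2, f + e / 2)              # ESRI world file lines
                                                 # (c, f shifted to pixel center)
assert gdal_order == (100000.0, 30.0, 0.0, 200000.0, 0.0, -30.0)
assert world_file == (30.0, 0.0, 0.0, -30.0, 100015.0, 199985.0)
```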
<p>For python programmers looking to work with raster data, the <code>osgeo.gdal</code> library has existed for quite a while. With it the notion of a 6-tuple geotransform in GDAL ordering has become pervasive. And if ordering were the only issue, it wouldn't necessarily be worth switching to the use of the <code>affine</code> library. The more convincing argument for the use of <code>affine</code> is the ease with which you can transform coordinates. In other words, why should you have to worry about ordering of parameters at all?</p>
<p>When dealing with the geotransform as a simple 6-element tuple, you'll probably end up writing code like this to do the actual conversion: </p>
<div class="highlight"><pre><span></span><code># Using osgeo.gdal and GDAL geotransform 6-tuples
gt = ds.GetGeoTransform()
# col, row to x, y
x = (col * gt[1]) + gt[0]
y = (row * gt[5]) + gt[3]
# x,y to col,row
col = int((x - gt[0]) / gt[1])
row = int((y - gt[3]) / gt[5])
</code></pre></div>
<p>I'd be willing to guess that variations of that formula exist in hundreds of python codebases. Not very complicated math but opaque enough not to commit to memory. It's also very easy to slip up ("Is the y origin element 4 or 5?") and introduce non-obvious bugs. Why should such a basic formulation be reimplemented by every programmer? Again, why rely on element ordering at all? <code>affine</code>, through the use of clever operator overloading, gives you a much simpler interface:</p>
<div class="highlight"><pre><span></span><code># Using rasterio and affine
a = ds.affine
# col, row to x, y
x, y = a * (col, row)
# x, y to col, row
col, row = ~a * (x, y)
</code></pre></div>
<p>Clean, nice looking code that's harder to get wrong, wouldn't you agree? And as @Asgerpetersen <a href="https://twitter.com/perrygeo/status/643156086229331968">pointed out</a>, if there were a non-zero rotation parameter, the affine example would handle it seamlessly while the geotransform formula would fail. </p>
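<p>You can see that failure mode with a little pure-Python arithmetic (made-up values, no libraries required). With non-zero rotation terms, the full affine formula and the naive north-up formula disagree:</p>

```python
import math

# A made-up transform with a 10-degree rotation baked in:
theta = math.radians(10)
a, b, c = 30.0 * math.cos(theta), 30.0 * math.sin(theta), 100000.0
d, e, f = 30.0 * math.sin(theta), -30.0 * math.cos(theta), 200000.0

col, row = 10, 20
# Full affine transform (what the affine library computes):
x = a * col + b * row + c
y = d * col + e * row + f
# Naive north-up formula that ignores the rotation terms b and d:
x_naive = a * col + c
y_naive = e * row + f

assert (x, y) != (x_naive, y_naive)  # silently wrong when rotation is present
```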
<p>Also, interoperability with GDAL-style geotransforms is painless</p>
<div class="highlight"><pre><span></span><code><span class="c1"># construct from our GDAL geotransform</span><span class="w"></span>
<span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Affine</span><span class="o">.</span><span class="n">from_gdal</span><span class="p">(</span><span class="o">*</span><span class="n">gt</span><span class="p">)</span><span class="w"></span>
<span class="n">gt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="o">.</span><span class="n">to_gdal</span><span class="p">()</span><span class="w"></span>
</code></pre></div>
<p>As is the ability to read/write from World Files</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">affine</span> <span class="kn">import</span> <span class="n">loadsw</span><span class="p">,</span> <span class="n">dumpsw</span>
<span class="c1"># Read from World File</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'raster.tfw'</span><span class="p">)</span> <span class="k">as</span> <span class="n">tfw</span><span class="p">:</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">loadsw</span><span class="p">(</span><span class="n">tfw</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
<span class="c1"># Write to World File</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'other.wld'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">dest</span><span class="p">:</span>
    <span class="n">dest</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">dumpsw</span><span class="p">(</span><span class="n">a</span><span class="p">))</span>
</code></pre></div>
<p>With <a href="https://github.com/mapbox/rasterio">rasterio</a> planning to deprecate the use of GDAL-style geotransforms in the 1.0 release, it's never too early to start making the switch. Your cleaner raster coordinate code will be well worth the effort. </p>Raspberry Pi: real-time sensor plots with websocketd2015-03-02T00:00:00-07:002015-03-02T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2015-03-02:/raspberry-pi-real-time-sensor-plots-with-websocketd.html<p>This year I'm starting to delve into some electronics projects and hardware hacking.
What follows is an account of my first end-to-end Raspberry Pi project.
In terms of functionality, it doesn't do much at the moment - just reads from
a photoresistor sensor and plots the light levels in the corner …</p><p>This year I'm starting to delve into some electronics projects and hardware hacking.
What follows is an account of my first end-to-end Raspberry Pi project.
In terms of functionality, it doesn't do much at the moment - just reads from
a photoresistor sensor and plots the light levels in the corner of my office. Eventually,
I want to hook up a couple of light, moisture and temperature sensors throughout
my garden to do some science experiments and/or remind myself to water
the tomatoes. This is but the first step
in that larger project...</p>
<ul>
<li>
<p>The Pi is wired up to a 3.3v circuit with a photoresistor.</p>
</li>
<li>
<p>The state of the digital input pins are read by a python program.</p>
</li>
<li>
<p>The readings are streamed to a websocket via log file.</p>
</li>
<li>
<p>The HTML/Javascript interface connects with the websocket and plots the values in real time. </p>
</li>
</ul>
<p><img src="assets/img/rpi_websockets.png"></p>
<p>Although it's all just for fun at this point, I've discovered a lot of great unix networking
tools and javascript libraries
that will come in handy in my day job as well. Here's the details on how it all came together...</p>
<h2>The circuit</h2>
<p>I implemented the design
from <a href="https://learn.adafruit.com/basic-resistor-sensor-reading-on-raspberry-pi/basic-photocell-reading">the adafruit tutorial</a> on the subject. The adafruit
image shows the basic idea:</p>
<p><img src="https://learn.adafruit.com/system/assets/assets/000/001/321/medium800/raspberry_pi_photocell.jpg?1396770994"></p>
<p>The <strong>photoresistor</strong> provides increased resistance to electric current as the visible light becomes dimmer. Conversely, resistance decreases as light becomes brighter. It is an analog sensor but
the Raspberry Pi only has digital inputs (the general purpose input output or GPIO pins).
To solve that, we can employ a capacitor using "RC timing". </p>
<p>A <strong>capacitor</strong> builds up voltage
over time and, when this voltage hits ~1.4V, the digital input pin reads "high". So
instead of taking a direct analog reading, we set a loop and time how long it
takes for the capacitor to "fill up". </p>
<p>If the time interval is small (i.e. the capacitor is charging rapidly), there
is less resistance from our analog sensor which means more light. If the time
interval is large (i.e. the capacitor is taking a long time to charge on each cycle),
there is more resistance and less light.</p>
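<p>The counting loop at the heart of this is very simple. Here's a sketch of the idea; <code>read_pin</code> is a stand-in for an <code>RPi.GPIO</code> read, and the simulated pin below is purely illustrative:</p>

```python
def rc_time(read_pin, max_count=10000):
    # Count loop cycles until the input pin reads high, i.e. until
    # the capacitor has charged past the digital threshold (~1.4V).
    # More cycles => slower charging => more resistance => less light.
    count = 0
    while read_pin() == 0 and count < max_count:
        count += 1
    return count

# Simulate a capacitor that crosses the threshold after 37 cycles:
states = iter([0] * 37 + [1])
print(rc_time(lambda: next(states)))  # 37
```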
<p>Wired up to the photoresistor on my 25 year-old Radio Shack Electronics Learning Lab,
it looks a bit clunkier but still does the trick:</p>
<p><img src="/assets/img/withpi.png" alt="withpi"></p>
<p>Quick side note: The ribbon and connectors between the raspberry pi and the breadboard are called
a <a href="http://www.adafruit.com/product/914">Pi Cobbler</a>. It makes working with the
GPIO pins easier but, as you can tell from the photos, the incoming cable obstructs access a bit.
I might take a look at the <a href="https://www.adafruit.com/products/1105">T-Cobbler</a>
which promises to clear up some vertical space on the breadboard.</p>
<h2>Reading digital input pins from an analog sensor</h2>
<p>In order to read the digital input pins, we can use the <a href="https://pypi.python.org/pypi/RPi.GPIO">RPi.GPIO</a> python library. </p>
<p>There's not much more that I can add to <a href="https://learn.adafruit.com/basic-resistor-sensor-reading-on-raspberry-pi/basic-photocell-reading">the adafruit tutorial</a> which covers the topic well. I made a few modifications:</p>
<ul>
<li>output a unix timestamp along with the reading</li>
<li>flush the output to <code>stdout</code> after every reading to make sure the output isn't buffered.</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
<span class="c1"># Get sensor timing and unix timestamp</span>
<span class="n">reading</span> <span class="o">=</span> <span class="n">RCtime</span><span class="p">(</span><span class="mi">18</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="n">timestamp</span> <span class="o">=</span> <span class="n">to_unix_timestamp</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="nb">print</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">,</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">timestamp</span><span class="p">,</span> <span class="n">reading</span><span class="p">)</span>
<span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
</code></pre></div>
<p>You can read the complete <a href="https://github.com/perrygeo/pi_sensor_realtime/blob/master/read_sensor.py">read_sensor.py script</a> on github. </p>
<p>With the script in place and the circuit wired up, I can fire up the script </p>
<div class="highlight"><pre><span></span><code>sudo python read_sensor.py
</code></pre></div>
<p>and see the timestamp and sensor reading written to the console as comma-separated values:</p>
<div class="highlight"><pre><span></span><code><span class="mf">1425505117.05</span><span class="p">,</span><span class="mf">793</span><span class="w"></span>
<span class="mf">1425505117.16</span><span class="p">,</span><span class="mf">802</span><span class="w"></span>
<span class="mf">1425505117.38</span><span class="p">,</span><span class="mf">768</span><span class="w"></span>
<span class="mf">1425505117.82</span><span class="p">,</span><span class="mf">709</span><span class="w"></span>
<span class="mf">1425505117.93</span><span class="p">,</span><span class="mf">801</span><span class="w"></span>
<span class="mf">1425505118.05</span><span class="p">,</span><span class="mf">798</span><span class="w"></span>
</code></pre></div>
<p>So what do those values mean? They represent a count of the number of cycles it took to
charge the capacitor. Not a meaningful number by itself but it could be calibrated to
use standard units or simply used as relative values (lower value == brighter light)</p>
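<p>Downstream consumers just need to split those comma-separated lines back apart. For example, a tiny (assumed) parser that a plotting script could use:</p>

```python
from datetime import datetime

def parse_reading(line):
    # One "unix_timestamp,cycle_count" log line -> (datetime, int)
    ts, count = line.strip().split(",")
    return datetime.fromtimestamp(float(ts)), int(count)

when, count = parse_reading("1425505117.05,793")
assert count == 793
```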
<p>It's important to note that, on a Linux machine, you can't be guaranteed that your
event loop won't get interrupted by other processes. So you probably shouldn't use Linux
as a real-time sensor platform directly. However, it works well enough for demonstration
provided your Raspberry Pi isn't bogged down by other CPU-intensive processes.</p>
<p>Another caveat with this approach - we can only use a <em>single process</em> to access the GPIO
pins in this manner. Having multiple processes or threads setting/reading GPIO pin states
would cause inaccuracies as each process could reset the pins mid-cycle and interrupt
the timing of other processes. </p>
<h2>Streaming websockets</h2>
<p>Websockets are an extension to HTTP that allow data to be sent <em>from</em> a server <em>to</em> a client
using a persistent connection. Think push notifications. </p>
<p><a href="http://websocketd.com">websocketd</a> allows you to
take the standard output from any unix program and publish it on a
websocket. It can also work with standard input, opening up the doors for some
amazing software workflows: imagine taking any well-behaved Unix command and immediately
wrapping its functionality in a web protocol! </p>
<p>To output the sensor readings using a websocket, I'll first run the <code>read_sensor.py</code> script in the background with high priority (<code>nice -20</code>) and redirect the output to a logfile:</p>
<div class="highlight"><pre><span></span><code>sudo nice -20 python read_sensor.py > log.txt <span class="p">&</span>
</code></pre></div>
<p>Then I will run <code>websocketd</code> on port 8080, serve a few static
files and provide a command to run.
In this case, the command is the basic unix <code>tail -f</code> which streams the contents of the log file.</p>
<div class="highlight"><pre><span></span><code>websocketd --port<span class="o">=</span><span class="m">8080</span> --staticdir<span class="o">=</span>./static tail -f log.txt
</code></pre></div>
<p>Now the sensor readings are being logged and a websocket server is running.
For each client that connects to the websocket, a new process (<code>tail -f log.txt</code>) will be started
and <code>stdout</code> will be streamed to that client via websocket messages.</p>
<p>Note that the <code>tail -f</code> command is <em>not</em> yet running until a websocket client makes a
connection. Because it runs in its own process and simply reads the sensor log file,
we can start as many of them as our hardware can handle.</p>
<p>In summary, the pattern is: run a single process that reads from the GPIO pins and writes to a sensor log, then fire off multiple processes that read the log and stream the output over websockets.</p>
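<p>The reader side of that pattern is essentially <code>tail -f</code>, which is easy to sketch in Python if you'd rather consume the log in-process. The <code>max_polls</code> argument is an addition so the sketch can terminate; a real follower would loop forever:</p>

```python
import time

def follow(path, poll=0.1, max_polls=None):
    # Yield complete lines as they are appended to the log file,
    # mimicking `tail -f`. Stops after `max_polls` consecutive
    # empty reads (None = never stop).
    buf = ""
    polls = 0
    with open(path) as f:
        while max_polls is None or polls < max_polls:
            chunk = f.readline()
            if not chunk:               # nothing new yet; wait and retry
                polls += 1
                time.sleep(poll)
                continue
            buf += chunk
            if buf.endswith("\n"):      # only emit whole lines
                yield buf.rstrip("\n")
                buf = ""
                polls = 0
```

<p>Each consumer opens its own file handle, so - just like the <code>tail -f</code> processes websocketd spawns - you can run as many readers as the hardware can handle without touching the single GPIO-reading writer.</p>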
<p>Now we're ready to test it. </p>
<h2>HTML/Javascript interface</h2>
<p>Working with websockets in Javascript is fairly straightforward. First, create a connection</p>
<div class="highlight"><pre><span></span><code><span class="kd">var</span><span class="w"> </span><span class="nx">ws</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ow">new</span><span class="w"> </span><span class="nx">WebSocket</span><span class="p">(</span><span class="s1">'ws://example.org:8080/'</span><span class="p">);</span><span class="w"></span>
</code></pre></div>
<p>then set some callbacks to handle incoming messages from the server.</p>
<div class="highlight"><pre><span></span><code><span class="nx">ws</span><span class="p">.</span><span class="nx">onmessage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kd">function</span><span class="p">(</span><span class="nx">e</span><span class="p">){</span><span class="w"></span>
<span class="w"> </span><span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">"We got something:"</span><span class="p">,</span><span class="w"> </span><span class="nx">e</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Websockets are built into almost every modern browser so this functionality
works out of the box. But if the connection is lost for any reason, the native Websocket
implementations do not automatically reconnect. To solve that problem,
there is <a href="https://github.com/joewalnes/reconnecting-websocket">ReconnectingWebSocket</a>, which does exactly what it sounds like: it attempts to
reconnect automatically when needed.</p>
<p>Then to create an animated real time plot of the streaming data, you'll need a javascript library like <a href="http://smoothiecharts.org/">Smoothie Charts</a>.</p>
<p>I should also note that the server (<a href="http://websocketd.com">websocketd</a>), the javascript plotting library (<a href="http://smoothiecharts.org/">Smoothie Charts</a>), and the javascript networking library (<a href="https://github.com/joewalnes/reconnecting-websocket">ReconnectingWebSocket</a>) were all written by <a href="https://github.com/joewalnes/">joewalnes</a> - this guy is responsible for making the three biggest pieces of this system and deserves mad props! </p>
<p>All of the HTML and js can be found here: <a href="https://github.com/perrygeo/pi_sensor_realtime/blob/master/static/index.html">index.html</a>. </p>
<p>Finally, here is the result. A streaming, real time plot of sensor readings. This clip was recorded as I came into my office, opened a few
windows and turned on a light. As the room gets brighter, you can see the sensor readings drop, and then rise again as I pass my hand over the sensor a few times to block the light.</p>
<iframe id="player" type="text/html" width="640" height="390"
src="https://www.youtube.com/embed/CfwRj3HP3j0?enablejsapi=1&origin=http://example.com"
frameborder="0"></iframe>
<p>Maybe not incredibly useful in its current state but it provided an excellent learning experience to work on the entire stack, integrating electronics and hardware with web software. It opens the doors for all sorts of new projects. All of the code is available on my <a href="https://github.com/perrygeo/pi_sensor_realtime">github repo</a>. Any questions? Shoot me an email or message on twitter. I'm a beginner when it comes to electrical theory so somebody please correct me if I'm way off the mark on something. </p>Zonal statistics: histograms as user-defined aggregate functions2015-02-23T00:00:00-07:002015-02-23T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2015-02-23:/zonal-statistics-histograms-as-user-defined-aggregate-functions.html<h2>Introduction</h2>
<p>Zonal statistics allow you to summarize raster datasets based on vector geometries
by aggregating all pixels associated with each vector feature, typically to a single scalar value. For example, you might want the <em>mean</em> elevation of each country against an SRTM Digital Elevation
Model (DEM). This is easily accomplished in python using <a href="https://github.com/perrygeo/python-raster-stats"><code>rasterstats</code></a>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">rasterstats</span> <span class="kn">import</span> <span class="n">zonal_stats</span>
<span class="n">stats</span> <span class="o">=</span> <span class="n">zonal_stats</span><span class="p">(</span><span class="s1">'countries.shp'</span><span class="p">,</span> <span class="s1">'elevation.tif'</span><span class="p">,</span> <span class="n">stats</span><span class="o">=</span><span class="s2">"mean"</span><span class="p">,</span> <span class="n">copy_properties</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">stats</span><span class="p">:</span>
    <span class="nb">print</span> <span class="n">s</span><span class="p">[</span><span class="s1">'name'</span><span class="p">],</span> <span class="n">s</span><span class="p">[</span><span class="s1">'mean'</span><span class="p">]</span>
</code></pre></div>
<p>Which would give us output similar to below, with the mean elevation (meters) for each country:</p>
<div class="highlight"><pre><span></span><code>Afghanistan 1826.38
Netherlands 8.78
Nepal 2142.28
Zimbabwe 980.85
</code></pre></div>
<h2>Zonal Histograms</h2>
<p>Using the built-in aggregate functions in <code>rasterstats</code> can reveal a lot
about the underlying raster dataset (see <a href="https://github.com/perrygeo/python-raster-stats#statistics">statistics</a> for the full list). Most of the time the standard descriptive statistics
like min, max, mean, median, etc. can tell us everything we need to know.</p>
<p>But what if we want to retain more information about the underlying distribution of
values? Instead of simply stating </p>
<blockquote>
<p>Afghanistan is, on average, 1826.38 meters above sea level</p>
</blockquote>
<p>supposed we wanted to see how much of the country is in high vs low elevation areas.
We could bin the elevations into meaningful ranges (say 0-200 meters, 200 to 400 meters, etc) and create a histogram of pixel counts to show the shape of the underlying distribution. In this case, the aggregate function does not return a scalar value but a dictionary with
each bin as a key.</p>
<div class="highlight"><pre><span></span><code>>>> stats['elevation_histogram']
{'0 to 400m': ...,
 '400 to 1000m': ...,
 '1000 to 3000m': ...,
 '3000 to 5000m': ...,
 '5000 to 10000m': ...}
</code></pre></div>
<p>That's the goal, now how do we accomplish this? </p>
<h2>User-defined aggregate functions</h2>
<p>Because a histogram might need to <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html">specify a number of arguments</a> to customize the results, it's not
feasible for <code>rasterstats</code> to define a generic histogram function. However, as of <a href="https://pypi.python.org/pypi/rasterstats/0.6.1">version 0.6</a>, we
have the ability to create custom, user-defined aggregate functions such as the zonal histogram
idea described above.</p>
<p>First, we have to write our function. Its first and only argument is a masked numpy array
of raster values, typically processed with <code>numpy</code> functions. The function's return value will be added to the stats output for each feature. The returned
value does <em>not</em> need to be a scalar, it can be any valid python value (though it's probably
best to stick with dicts, lists and other simple data structures that are easily
serializable).</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">itertools</span>
<span class="k">def</span> <span class="nf">elevation_histogram</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">bin_edges</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">400</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">3000</span><span class="p">,</span> <span class="mi">5000</span><span class="p">,</span> <span class="mi">10000</span><span class="p">]</span>
<span class="n">hist</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="kp">histogram</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">bin_edges</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">upper</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">itertools</span><span class="o">.</span><span class="n">izip</span><span class="p">(</span><span class="n">bin_edges</span><span class="p">,</span> <span class="n">bin_edges</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">hist</span><span class="p">):</span>
<span class="n">key</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2"> to </span><span class="si">{}</span><span class="s2">m"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">upper</span><span class="p">,</span> <span class="n">lower</span><span class="p">)</span>
<span class="n">data</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span>
<span class="k">return</span> <span class="n">data</span>
</code></pre></div>
<p>And then add our custom <code>elevation_histogram</code> function to our <code>zonal_stats</code> call
using the <code>add_stats</code> keyword argument:</p>
<div class="highlight"><pre><span></span><code><span class="nv">stats</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">zonal_stats</span><span class="ss">(</span><span class="s1">'countries.shp'</span>,<span class="w"> </span><span class="s1">'elevation.tif'</span>,<span class="w"> </span><span class="nv">copy_properties</span><span class="o">=</span><span class="nv">True</span>,<span class="w"></span>
<span class="w"> </span><span class="nv">add_stats</span><span class="o">=</span>{<span class="s1">'elevation_histogram'</span>:<span class="w"> </span><span class="nv">elevation_histogram</span>}<span class="ss">)</span><span class="w"></span>
<span class="k">for</span><span class="w"> </span><span class="nv">s</span><span class="w"> </span><span class="nv">in</span><span class="w"> </span><span class="nv">stats</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">print</span><span class="w"> </span><span class="nv">s</span>[<span class="s1">'name'</span>],<span class="w"> </span><span class="nv">s</span>[<span class="s1">'mean'</span>],<span class="w"> </span><span class="nv">s</span>[<span class="s1">'elevation_histogram'</span>]<span class="w"></span>
</code></pre></div>
<p>which produces output similar to the following: raw pixel counts
for each of the elevation bins (formatted for readability)</p>
<div class="highlight"><pre><span></span><code>Afghanistan 1826.38 {
'3000 to 5000m': 1099730,
'0 to 400m': 1754317,
'1000 to 3000m': 2884917,
'5000 to 10000m': 83158,
'400 to 1000m': 1907790}
</code></pre></div>
<p>The only caveat with using this technique is that nested dictionaries and other
non-scalar values might cause difficulty when trying to serialize this
data structure to other formats. For example, most GIS formats don't support hierarchical
properties (nested dictionaries) so you might have to flatten the data before
writing to e.g. PostGIS or an ESRI shapefile. </p>
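<p>As a sketch of that flattening step (the helper below is hypothetical, not part of <code>rasterstats</code>), nested histogram counts can be promoted to top-level scalar keys before writing to a flat format:</p>

```python
def flatten_feature(props, nested_key='elevation_histogram'):
    """Promote a nested dict of counts to top-level scalar keys."""
    flat = {k: v for k, v in props.items() if k != nested_key}
    for bin_label, count in props.get(nested_key, {}).items():
        # e.g. 'elevation_histogram_0 to 400m' -> 1754317
        flat['{}_{}'.format(nested_key, bin_label)] = count
    return flat

feature = {'name': 'Afghanistan', 'mean': 1826.38,
           'elevation_histogram': {'0 to 400m': 1754317, '400 to 1000m': 1907790}}
flat = flatten_feature(feature)
```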
<p>With the ability to write user-defined aggregate functions, I can keep the core
of <code>rasterstats</code> light while allowing for the possibility of complex aggregate analysis
that might be needed in the future. Good stuff.</p>Topological simplification of simple features2015-01-11T00:00:00-07:002015-01-11T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2015-01-11:/topological-simplification-of-simple-features.html<h2>The case for topology</h2>
<p><a href="http://en.wikipedia.org/wiki/Simple_Features">Simple feature</a> representations
of polygon geometries are ubiquitous due to their ease of use.
Thinking of spatial features as having a single, independent geometry is easy and fits most use cases.
But that ease of use disappears when we need to represent the topological
relationship between features.</p>
<p>In this article, I'll focus on one particular task with simple features data that
would benefit from topology - namely simplifying a polygon dataset by removing vertices.
Here's the original dataset, a 30+MB shapefile with very dense line work.</p>
<p><img alt="original" src="/images/topo_simplify/original.png"></p>
<p>Geometries <em>can</em> be simplified under the Simple Features model but, since
each geometry is processed independently, the <strong>topological relationships between
features can be disrupted</strong>. For instance, using the <code>Simplify Geometries</code> tool in
QGIS, I can simplify the polygons dramatically but we see gaps between polygons
and other side effects.</p>
<p><img alt="no_topology" src="/images/topo_simplify/no_topology.png"></p>
<h2>The plan</h2>
<p>Because we'll need to <em>build</em> topology before acting on it, the process for simplifying
simple features datasets involves converting the data to topological structure,
simplifying it, then converting it back to a simple features representation.</p>
<p>Many of the big GIS systems (ESRI's ArcInfo "coverages" and their .e00 interchange format, and GRASS vectors)
have their own topological data structures. More recently, we've seen the rise of
the OpenStreetMap (OSM) format and TopoJSON, both of which model topological relationships.</p>
<p>Of these options, I selected <a href="https://github.com/mbostock/topojson/wiki">TopoJSON</a>
because of its robust <a href="https://github.com/mbostock/topojson/wiki/Command-Line-Reference">command-line tool</a>
which handles building topology and simplification in one step. Additionally, it
works with GeoJSON and Shapefile inputs, two of the most common
data formats for simple features.</p>
<p>The workflow goes something like this:</p>
<ol>
<li>Convert data into a shapefile with the EPSG:4326 spatial reference (lonlat, wgs84)</li>
<li>Convert to topojson and simplify</li>
<li>Convert to geojson</li>
<li>Optionally, convert geojson to other formats supported by OGR</li>
</ol>
<p>To follow along, you'll need to have the following software installed:</p>
<ul>
<li>GDAL command line utilities (we'll use <code>ogr2ogr</code> at the command line)<ul>
<li><code>apt-get install gdal-bin</code></li>
</ul>
</li>
<li>The <code>topojson</code> command line utility<ul>
<li><code>npm install -g topojson</code></li>
</ul>
</li>
<li>Python with the <code>shapely</code> package installed.<ul>
<li><code>pip install shapely</code></li>
</ul>
</li>
</ul>
<h2>Step 1: Convert to WGS84 shapefile</h2>
<p>If you're already working with an ESRI Shapefile or GeoJSON format and your data
is already in unprojected WGS84 coordinates (i.e. EPSG:4326), you can skip to step 2.</p>
<p>Otherwise, <code>ogr2ogr</code> makes that conversion simple:</p>
<div class="highlight"><pre><span></span><code>ogr2ogr -t_srs epsg:4326 -f <span class="s2">"ESRI Shapefile"</span> <span class="se">\</span>
ecoregions_original.shp EcoregionSummaries3.gdb.zip EcoRegions
</code></pre></div>
<h2>Step 2: Convert to TopoJSON and simplify</h2>
<p>The simplification, the quantization (more on that later), and the conversion to
a topological data model are all handled by <code>topojson</code>.</p>
<p>You have two options for specifying how aggressively you want to simplify your data.</p>
<ol>
<li>Use a tolerance, specified in <a href="http://en.wikipedia.org/wiki/Steradian#SI_multiples">steradians</a>, with the <code>-s</code> flag</li>
<li>Use a proportion of points, 0 to 1, to retain with the <code>--simplify-proportion</code> flag</li>
</ol>
<p>One quirk of the topojson implementation is that it uses a relatively low quantization factor by default.
Effectively, this snaps coordinates to a grid in order to save space and simplify geometries even further.
This yields nice small coordinates but can result in a "stair step" effect at higher
zoom levels. The default is <code>-q 1E4</code> but I've found good results with <code>-q 1E6</code> as
recommended in the topojson docs.</p>
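<p>To get an intuition for what quantization does (a rough sketch in plain Python, not the actual topojson implementation), coordinates are scaled onto a q-by-q integer grid and back, snapping nearby vertices together:</p>

```python
def quantize(coords, q=1e4, extent=(-180.0, -90.0, 180.0, 90.0)):
    """Snap (lon, lat) pairs to a q x q grid covering the extent."""
    x0, y0, x1, y1 = extent
    kx = (q - 1) / (x1 - x0)  # grid cells per degree of longitude
    ky = (q - 1) / (y1 - y0)  # grid cells per degree of latitude
    return [(round((x - x0) * kx) / kx + x0,
             round((y - y0) * ky) / ky + y0) for x, y in coords]

# At q=1E4 these nearly identical vertices collapse onto the same grid node,
# which is the source of the "stair step" effect at high zoom levels.
a, b = quantize([(12.34567, 45.67891), (12.34570, 45.67895)])
```

Raising <code>-q</code> to <code>1E6</code> makes the grid 100x finer in each dimension, which is why it reduces the stair-step artifacts at the cost of slightly larger files.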
<p>As an example, let's take our <code>ecoregions_original.shp</code> and convert it to topojson
with a tolerance of <code>1E-8</code> steradians. We want to make sure we explicitly declare
that the data is in spherical (unprojected) coordinates and to retain the properties
of the original attribute table:</p>
<div class="highlight"><pre><span></span><code>topojson --spherical <span class="se">\</span>
--properties <span class="se">\</span>
-s 1E-8 <span class="se">\</span>
-q 1E6 <span class="se">\</span>
-o temp.topojson <span class="se">\</span>
ecoregions_original.shp
</code></pre></div>
<h2>Step 3: Convert to GeoJSON</h2>
<p>This part was a bit trickier than I anticipated. Luckily Sean Gillies has written
some preliminary <a href="http://sgillies.net/blog/1159/topojson-with-python">python functions</a>
for converting topojson geometries to standard GeoJSON-like python dictionaries.</p>
<p>In order to make a higher-level conversion utility, I started working on <a href="https://gist.github.com/perrygeo/1e767e42e8bc54ad7262#file-topo2geojson-py">topo2geojson.py</a> which provides a command line
interface to perform TopoJSON to GeoJSON conversions.</p>
<div class="highlight"><pre><span></span><code>python topo2geojson.py temp.topojson ecoregions_simple.geojson
</code></pre></div>
<p>There is some additional logic to ensure validity of polygons though it is very
basic and I'm sure there are ways to make the geometry conversions more robust.
Please note that I've only tested this script on this one dataset and it likely needs
additional work to be considered a full-fledged conversion tool; consider it more of a
starting point than an out-of-the box solution.</p>
<h2>Optional Step 4: Convert to any OGR format</h2>
<p>Once data is in GeoJSON format, we're free to do what we want with it, including
converting it back to a shapefile or any other OGR supported data format.</p>
<div class="highlight"><pre><span></span><code>ogr2ogr -f <span class="s2">"ESRI Shapefile"</span> ecoregions_simple.shp ecoregions_simple.geojson OGRGeoJson
</code></pre></div>
<h1>Case study: evaluating simplification tolerances</h1>
<p>In the remainder of this article, I'll walk through a demonstration of these steps
in order to find an optimal simplification tolerance for my test data. The optimal
tolerance depends on your needs, what scales you will be using your data and how
aggressively you need to reduce file size. Ultimately, it's a <strong>tradeoff between
small file size and accurate line work</strong>.</p>
<p>We can easily script this solution in order to test multiple simplification tolerances.
As a bonus, we can fire off multiple iterations at once to leverage multiple cores.
Since I've got 4 cores on my laptop, I can run 4 processes in nearly the same time
it takes to run 1 using some simple shell tricks (Linux/OSX only; sorry Windows users but I don't know .bat files well enough to demonstrate)</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> tolerance <span class="k">in</span> 1E-7 1E-8 1E-9 1E-10
<span class="k">do</span>
topojson --spherical <span class="se">\</span>
--properties <span class="se">\</span>
-s <span class="nv">$tolerance</span> <span class="se">\</span>
-q 1E6 <span class="se">\</span>
-o temp_<span class="nv">$tolerance</span>.topojson <span class="se">\</span>
ecoregions_original.shp <span class="o">&&</span>
<span class="c1"># Convert it to GeoJSON</span>
python topo2geojson.py temp_<span class="nv">$tolerance</span>.topojson temp_<span class="nv">$tolerance</span>.geojson <span class="o">&&</span>
<span class="c1"># Optionally, convert GeoJSON to any OGR data source</span>
ogr2ogr -f <span class="s2">"ESRI Shapefile"</span> ecoregions_<span class="nv">$tolerance</span>.shp temp_<span class="nv">$tolerance</span>.geojson OGRGeoJson <span class="p">&</span>
<span class="k">done</span>
<span class="nb">wait</span>
</code></pre></div>
<p>Then we can take a look at the resulting .topojson file sizes:</p>
<div class="highlight"><pre><span></span><code>$ ls -lh *.topojson
-rw-rw-r-- <span class="m">1</span> mperry mperry <span class="m">4</span>.5M Jan <span class="m">11</span> <span class="m">12</span>:25 temp_1E-10.topojson
-rw-rw-r-- <span class="m">1</span> mperry mperry <span class="m">2</span>.1M Jan <span class="m">11</span> <span class="m">12</span>:25 temp_1E-9.topojson
-rw-rw-r-- <span class="m">1</span> mperry mperry 869K Jan <span class="m">11</span> <span class="m">12</span>:25 temp_1E-8.topojson
-rw-rw-r-- <span class="m">1</span> mperry mperry 362K Jan <span class="m">11</span> <span class="m">12</span>:25 temp_1E-7.topojson
</code></pre></div>
<p>OK, so with a simplification tolerance of 1E-10 steradians, we get a 4.5M file.
If we increase it to 1E-7, we get a 362K file - a 12.5x reduction. Is the reduction
in file size worth the reduction in geometric accuracy? The only way to find out is to
render maps of the resulting datasets and visually assess them.</p>
<table>
<tr>
<th> </th>
<th>Original</th>
<th>1E-7</th>
<th>1E-8</th>
<th>1E-9</th>
</tr>
<tr>
<th> </th>
<th><img src="/images/topo_simplify/original.png" width=200></th>
<th><img src="/images/topo_simplify/1E-7.png" width=200></th>
<th><img src="/images/topo_simplify/1E-8.png" width=200></th>
<th><img src="/images/topo_simplify/1E-9.png" width=200></th>
</tr>
</table>
<p>The first thing we notice: all of the results have retained topology, with no gaps or slivers introduced
(the key benefit of this workflow).</p>
<p>Next, we notice that at this scale (roughly 1:500k on my monitor) we can barely
see a difference between the 1E-9 version and the original. And the 1E-7 version
looks a bit too simplified and chunky. So, in this case, we can say that a simplification
tolerance of around 1E-8 steridians is an optimal balance of file size and detail.</p>
<p>Of course other datasets, scales and uses may have completely different results so please try
it out and let me know how it goes. Just don't settle for simple features simplification next time you need to
reduce file sizes!</p>Sensitivity Analysis in Python2014-01-19T00:00:00-07:002014-01-19T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2014-01-19:/sensitivity-analysis-in-python.html<h3>Demonstrates the use of the <code>SALib</code> python module to sample and test the sensitivity of models</h3>
<hr>
<p>As (geo)data scientists, we spend much of our time working with data models that try (with varying degrees of success) to capture some essential truth about the world while still being as simple as possible to provide a useful abstraction. Inevitably, complexity starts to creep into every model and we don't often stop to assess the value added by that complexity. When working with models that require a large number of parameters and a huge domain of potential inputs that are expensive to collect, it becomes difficult to answer the question:</p>
<p><strong>What parameters of the model are the most sensitive?</strong></p>
<p>In other words, if I am going to spend my resources obtaining/refining data for this model, where should I focus
my efforts in order to get the best bang for the buck? If I spend weeks working on deriving a single parameter for the model,
I want some assurance that the parameter is critically important to the model's prediction.
The flip-side, of course, is that if a parameter is <em>not</em> that important to the model's predictive power, I could
save some time by perhaps just using some quick-and-dirty approximation. </p>
<h3>SALib: a python module for testing model sensitivity</h3>
<p>I was thrilled to find <a href="http://jdherman.github.io/SALib/">SALib</a> which implements a number of vetted methods for quantitatively
assessing parameter sensitivity. There are three basic steps to running SALib:</p>
<ol>
<li>Define the parameters to test, define their domain of possible values and generate <em>n</em> sets of randomized input parameters. </li>
<li>Run the model <em>n</em> times and capture the results.</li>
<li>Analyze the results to identify the most/least sensitive parameters.</li>
</ol>
<p>I'll leave the details of these steps to the <a href="http://jdherman.github.io/SALib/">SALib documentation</a>.
The beauty of the SALib approach is that you have the flexibility[1] to run any model in any way you want, so long as you can manipulate the inputs and outputs adequately.</p>
<h3>Case Study: Climate effects on forestry</h3>
<p>I wanted to compare a forest growth and yield model under different climate change scenarios in order to assess what the most sensitive climate-related variables were. I identified 4 variables:</p>
<ul>
<li>Climate model (4 global circulation models)</li>
<li>Representative Concentration Pathways (RCPs; 3 different emission trajectories)</li>
<li>Mortality factor for species viability (0 to 1)</li>
<li>Mortality factor for equivalent elevation change (0 to 1)</li>
</ul>
<p>In this case, I was using the <a href="http://www.fs.fed.us/fmsc/fvs/">Forest Vegetation Simulator</a> (FVS), which requires
a configuration file for every model iteration. So, for Step 2, I had to iterate through each set of input variables and use them to generate an appropriate configuration file. This involved translating the real numbers from the samples into categorical variables in some cases. Finally, in order to get the result of each model iteration, I had to parse the outputs of FVS and do some post-processing to obtain the variable of interest (the average volume of standing timber over 100 years). So the flexibility of SALib comes at a slight cost: unless your model works directly with the file formats SALib expects, the inputs and outputs may require some data manipulation. </p>
<p>After running all the required iterations of the model[2] I was able to analyze the results and assess the sensitivity of the four parameters. </p>
<p>Here's the output of SALib's analysis (formatted slightly for readability):</p>
<div class="highlight"><pre><span></span><code>Parameter First_Order First_Order_Conf Total_Order Total_Order_Conf
circulation 0.193685 0.041254 0.477032 0.034803
rcp 0.517451 0.047054 0.783094 0.049091
mortviab -0.007791 0.006993 0.013050 0.007081
mortelev -0.005971 0.005510 0.007162 0.006693
</code></pre></div>
<p>The <em>first order effects</em> represent the effect of that parameter alone. The <em>total order effects</em> are arguably more
relevant to understanding the overall interaction of that parameter with your model. The "Conf" columns represent confidence and can be interpreted as error bars.</p>
<p>In this case, we interpret the output as follows:</p>
<div class="highlight"><pre><span></span><code>Parameter Total Order Effect
circulation 0.47 +- 0.03 (moderate influence)
rcp 0.78 +- 0.05 (dominant parameter)
mortviab 0.01 +- 0.007 (weak influence)
mortelev 0.007 +- 0.006 (weak influence)
</code></pre></div>
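<p>Translating the raw indices into a readable summary like the one above is easy to script; a sketch (the thresholds are my own rough labels, not part of SALib):</p>

```python
def describe(total_order, conf):
    """Attach a rough qualitative label to a total-order sensitivity index."""
    if total_order >= 0.5:
        label = 'dominant parameter'
    elif total_order >= 0.1:
        label = 'moderate influence'
    else:
        label = 'weak influence'
    return '{:.2f} +- {:.2f} ({})'.format(total_order, conf, label)

results = {'circulation': (0.477032, 0.034803), 'rcp': (0.783094, 0.049091),
           'mortviab': (0.013050, 0.007081), 'mortelev': (0.007162, 0.006693)}
summary = {name: describe(t, c) for name, (t, c) in results.items()}
```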
<p>We can graph each of the input parameters against the results to visualize this:</p>
<p><img alt="sagraph" src="/assets/img/sagraph.png"></p>
<p>Note that the 'mortelev' component is basically flat (as the factor increases, the result stays the same) whereas the choice of 'rcp' has a heavy influence (as emissions increase to the highest level, the resulting prediction for timber volumes are noticeably decreased).</p>
<p>The conclusion is that the climate variables, particularly the RCPs related to human-caused emissions, were the strongest determinants[3] of tree growth <em>for this particular forest stand</em>. This ran counter to our initial intuition that the mortality factors would play a large role in the model. Based on this sensitivity analysis, we may be able to avoid wasting effort on refining parameters that are of minor consequence to the output.</p>
<hr>
<p>Footnotes:</p>
<ol>
<li>Compared to more tightly integrated, model-specific methods of sensitivity analysis</li>
<li>20 thousand iterations took approximately 8 hours; sensitivity analysis generally requires lots of processing</li>
<li>Note that the influence of a parameter says nothing about direct <em>causality</em></li>
</ol>Leaflet SimpleCSV2013-09-30T00:00:00-06:002013-09-30T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2013-09-30:/leaflet-simplecsv.html<h3>Simple leaflet-based template for mapping tabular point data on a slippy map</h3>
<hr>
<p>Anyone who's worked with spatial data and the web has run across the need to take
some simple tabular data and put points on an interactive map.
It's the fundamental "<em>Hello World</em>" of web mapping. Yet I always find myself spending way too much time
solving this seemingly simple problem. When you consider zoom levels, attributes,
interactivity, clustering, querying, etc... it becomes apparent that interactive maps
require a bit more legwork. But that functionality is fairly consistent case-to-case so I've developed a generalized solution that works for the majority of basic use cases out there: </p>
<p><a class="btn btn-primary" href="https://github.com/perrygeo/leaflet-simple-csv">leaftlet-simple-csv on github</a></p>
<p>The idea is pretty generic but useful for most point marker maps:</p>
<ul>
<li>Data is in tabular delimited-text (csv, etc.) with two required columns: <code>lat</code> and <code>lng</code></li>
<li>Points are plotted on a full-screen <a href="https://github.com/Leaflet/Leaflet">Leaflet</a> map</li>
<li>Point markers are clustered dynamically based on zoom level.</li>
<li>Clicking on a point cluster will zoom into the extent of the underlying features.</li>
<li>Hovering on a point will display the name.</li>
<li>Clicking will display a popup with columns/properties displayed as an html table.</li>
<li>Full text filtering with typeahead</li>
<li>Completely client-side javascript with all dependencies included or linked via CDN</li>
</ul>
<p>Of course this is mostly just a packaged version of existing work, namely <a href="https://github.com/Leaflet/Leaflet">Leaflet</a> with the <a href="https://github.com/joker-x/Leaflet.geoCSV">geoCSV</a> and <a href="https://github.com/Leaflet/Leaflet.markercluster">markercluster</a> plugins.</p>
<h2>Usage</h2>
<ol>
<li>Grab the <a href="https://github.com/perrygeo/leaflet-simple-csv/archive/master.zip">leaflet-simple-csv zip file</a> and unzip it to a location accessible through a web server. </li>
<li>Copy the <code>config.js.template</code> to <code>config.js</code></li>
<li>Visit the <a href="assets/leaflet-simple-csv/index.html">index.html</a> page to confirm everything is working with the built-in example.</li>
<li>Customize your <code>config.js</code> for your dataset.</li>
</ol>
<p>An example config:</p>
<div class="highlight"><pre><span></span><code>var dataUrl = 'data/data.csv';
var maxZoom = 9;
var fieldSeparator = '|';
var baseUrl = 'http://otile{s}.mqcdn.com/tiles/1.0.0/osm/{z}/{x}/{y}.jpg';
var baseAttribution = 'Data, imagery and map information provided by <span class="nt"><a</span> <span class="na">href=</span><span class="s">"http://open.mapquest.co.uk"</span> <span class="na">target=</span><span class="s">"_blank"</span><span class="nt">></span>MapQuest<span class="nt"></a></span>, <span class="nt"><a</span> <span class="na">href=</span><span class="s">"http://www.openstreetmap.org/"</span> <span class="na">target=</span><span class="s">"_blank"</span><span class="nt">></span>OpenStreetMap<span class="nt"></a></span> and contributors, <span class="nt"><a</span> <span class="na">href=</span><span class="s">"http://creativecommons.org/licenses/by-sa/2.0/"</span> <span class="na">target=</span><span class="s">"_blank"</span><span class="nt">></span>CC-BY-SA<span class="nt"></a></span>';
var subdomains = '1234';
var clusterOptions = {showCoverageOnHover: false, maxClusterRadius: 50};
var labelColumn = "Name";
var opacity = 1.0;
</code></pre></div>
<p>The example dataset:</p>
<div class="highlight"><pre><span></span><code>Country|Name|lat|lng|Altitude
United States|New York City|40.7142691|-74.0059738|2.0
United States|Los Angeles|34.0522342|-118.2436829|115.0
United States|Chicago|41.8500330|-87.6500549|181.0
United States|Houston|29.7632836|-95.3632736|15.0
...
</code></pre></div>
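<p>The only hard requirement on the data is the <code>lat</code>/<code>lng</code> columns. A quick way to sanity-check a pipe-delimited file before pointing the map at it (plain Python, not part of the project):</p>

```python
import csv
import io

raw = """Country|Name|lat|lng|Altitude
United States|New York City|40.7142691|-74.0059738|2.0
United States|Chicago|41.8500330|-87.6500549|181.0"""

rows = list(csv.DictReader(io.StringIO(raw), delimiter='|'))
# every row needs parseable lat/lng for its marker to plot
points = [(float(r['lat']), float(r['lng'])) for r in rows]
```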
<p>I make no claims that this is the "right" way to do it but leveraging
100% client-side javascript libraries and native delimited-text formats seems like the simplest approach.
Many of the features included here (clustering, filtering) are useful enough
to apply to most situations and hopefully you'll find it useful.</p>
<hr>
<div><iframe src="http://blog.perrygeo.net/assets/leaflet-simple-csv/index.html" height="450" width="740"></iframe></div>Python rasterstats2013-09-24T00:00:00-06:002013-09-24T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2013-09-24:/python-rasterstats.html<h3>This article introduces a python module for summarizing geospatial raster datasets based on vector geometries (i.e. zonal statistics).</h3>
<p>A common task in many of my data workflows involves "zonal statistics"; summarizing raster data based on vector geometries. Despite many
alternatives (starspan, the QGIS Zonal Statistics plugin, ArcPy and R) there
were none that were</p>
<ul>
<li>open source</li>
<li>fast enough</li>
<li>flexible enough</li>
<li>worked with python data structures</li>
</ul>
<p>We'd written a wrapper around starspan for madrona (see <a href="https://github.com/Ecotrust/madrona/blob/master/docs/raster_stats.rst">madrona.raster_stats</a> ) but
relying on shell calls and an aging, unmaintained C++ code base was not cutting
it.</p>
<p>So I set out to create a solution using numpy, GDAL and python. The
<code>rasterstats</code> package was born. </p>
<p><a href="https://github.com/perrygeo/python-raster-stats" class="btn btn-primary">`python-raster-stats` on github</a></p>
<h2>Example</h2>
<p>Let's jump into an example. I've got a polygon shapefile of continental US
<em>state boundaries</em> and a raster dataset of <em>annual precipitation</em> from the
<a href="http://www.cec.org/Page.asp?PageID=924&ContentID=2336">North American Environmental
Atlas</a>.</p>
<p><img alt="states_precip" src="/assets/img/states_precip.jpeg"></p>
<div class="highlight"><pre><span></span><code><span class="n">states</span> <span class="o">=</span> <span class="s2">"data/boundaries_contus.shp"</span>
<span class="n">precip</span> <span class="o">=</span> <span class="s2">"data/precipitation.tif"</span>
</code></pre></div>
<p>The <code>raster_stats</code> function is the main entry point. Provide a vector and a
raster as input and expect a list of dicts, one for each input feature.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">rasterstats</span> <span class="kn">import</span> <span class="n">raster_stats</span>
<span class="n">rain_stats</span> <span class="o">=</span> <span class="n">raster_stats</span><span class="p">(</span><span class="n">states</span><span class="p">,</span> <span class="n">precip</span><span class="p">,</span> <span class="n">stats</span><span class="o">=</span><span class="s2">"*"</span><span class="p">,</span> <span class="n">copy_properties</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="nb">len</span><span class="p">(</span><span class="n">rain_stats</span><span class="p">)</span> <span class="c1"># continental US; 48 states plus District of Columbia</span>
<span class="mi">49</span>
</code></pre></div>
<p>Print out the stats for a given state:</p>
<div class="highlight"><pre><span></span><code><span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">rain_stats</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="s1">'NAME'</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"Oregon"</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="p">{</span><span class="s1">'COUNTRY'</span><span class="p">:</span> <span class="s1">'USA'</span><span class="p">,</span>
<span class="s1">'EDIT'</span><span class="p">:</span> <span class="s1">'NEW'</span><span class="p">,</span>
<span class="s1">'EDIT_DATE'</span><span class="p">:</span> <span class="s1">'20060803'</span><span class="p">,</span>
<span class="s1">'NAME'</span><span class="p">:</span> <span class="s1">'Oregon'</span><span class="p">,</span>
<span class="s1">'STATEABB'</span><span class="p">:</span> <span class="s1">'US-OR'</span><span class="p">,</span>
<span class="s1">'Shape_Area'</span><span class="p">:</span> <span class="mf">250563567264.0</span><span class="p">,</span>
<span class="s1">'Shape_Leng'</span><span class="p">:</span> <span class="mf">2366783.00361</span><span class="p">,</span>
<span class="s1">'UIDENT'</span><span class="p">:</span> <span class="mi">124704</span><span class="p">,</span>
<span class="s1">'__fid__'</span><span class="p">:</span> <span class="mi">35</span><span class="p">,</span>
<span class="s1">'count'</span><span class="p">:</span> <span class="mi">250510</span><span class="p">,</span>
<span class="s1">'majority'</span><span class="p">:</span> <span class="mi">263</span><span class="p">,</span>
<span class="s1">'max'</span><span class="p">:</span> <span class="mf">3193.0</span><span class="p">,</span>
<span class="s1">'mean'</span><span class="p">:</span> <span class="mf">779.2223903237395</span><span class="p">,</span>
<span class="s1">'median'</span><span class="p">:</span> <span class="mf">461.0</span><span class="p">,</span>
<span class="s1">'min'</span><span class="p">:</span> <span class="mf">205.0</span><span class="p">,</span>
<span class="s1">'minority'</span><span class="p">:</span> <span class="mi">3193</span><span class="p">,</span>
<span class="s1">'range'</span><span class="p">:</span> <span class="mf">2988.0</span><span class="p">,</span>
<span class="s1">'std'</span><span class="p">:</span> <span class="mf">631.539502512283</span><span class="p">,</span>
<span class="s1">'sum'</span><span class="p">:</span> <span class="mf">195203001.0</span><span class="p">,</span>
<span class="s1">'unique'</span><span class="p">:</span> <span class="mi">2865</span><span class="p">}</span>
</code></pre></div>
<p>Find the three driest states:</p>
<div class="highlight"><pre><span></span><code><span class="p">[(</span><span class="n">x</span><span class="p">[</span><span class="s1">'NAME'</span><span class="p">],</span> <span class="n">x</span><span class="p">[</span><span class="s1">'mean'</span><span class="p">])</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span>
<span class="nb">sorted</span><span class="p">(</span><span class="n">rain_stats</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span> <span class="n">k</span><span class="p">[</span><span class="s1">'mean'</span><span class="p">])[:</span><span class="mi">3</span><span class="p">]]</span>
<span class="p">[(</span><span class="s1">'Nevada'</span><span class="p">,</span> <span class="mf">248.23814034118908</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Utah'</span><span class="p">,</span> <span class="mf">317.668743027571</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Arizona'</span><span class="p">,</span> <span class="mf">320.6157232064074</span><span class="p">)]</span>
</code></pre></div>
<p>And write the data out to a csv.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">rasterstats</span> <span class="kn">import</span> <span class="n">stats_to_csv</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'out.csv'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
<span class="n">fh</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">stats_to_csv</span><span class="p">(</span><span class="n">rain_stats</span><span class="p">))</span>
</code></pre></div>
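<p>Under the hood, a helper like <code>stats_to_csv</code> just has to flatten a list of result dicts into CSV text. A standalone sketch of that idea (not rasterstats' actual implementation; <code>dicts_to_csv</code> is a hypothetical name):</p>

```python
import csv
import io

def dicts_to_csv(records):
    """Flatten a list of result dicts into CSV text, using the union
    of all keys (sorted) as the header row."""
    fields = sorted({key for rec in records for key in rec})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

rows = [{"NAME": "Nevada", "mean": 248.2},
        {"NAME": "Utah", "mean": 317.7}]
csv_text = dicts_to_csv(rows)
print(csv_text)
```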
<h2>Geo interface</h2>
<p>The basic usage above passes the path of an entire OGR vector layer as the first argument. But <code>rasterstats</code>
also supports several other vector feature/geometry inputs:</p>
<ul>
<li>Well-Known Text/Binary</li>
<li>GeoJSON string and mappings</li>
<li>Any python object that supports the <a href="https://gist.github.com/sgillies/2217756">geo_interface</a></li>
<li>Single objects or iterables</li>
</ul>
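<p>The geo_interface support means any Python object exposing a <code>__geo_interface__</code> property can be passed in directly. A minimal sketch of such an object (the <code>Square</code> class is invented purely for illustration):</p>

```python
# A minimal object implementing the geo_interface protocol: any function
# that accepts geo_interface objects can consume an instance directly.
class Square:
    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size

    @property
    def __geo_interface__(self):
        x0, y0, s = self.x0, self.y0, self.size
        # A closed polygon ring: first and last coordinates are the same.
        return {
            "type": "Polygon",
            "coordinates": [[(x0, y0), (x0 + s, y0), (x0 + s, y0 + s),
                             (x0, y0 + s), (x0, y0)]],
        }

print(Square(0, 0, 10).__geo_interface__["type"])  # Polygon
```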
<p>In this example, I use a GeoJSON-like Python mapping to specify a single geometry:</p>
<div class="highlight"><pre><span></span><code><span class="n">geom</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'coordinates'</span><span class="p">:</span> <span class="p">[[</span>
<span class="p">[</span><span class="o">-</span><span class="mf">594335.108537269</span><span class="p">,</span> <span class="o">-</span><span class="mf">570957.932799394</span><span class="p">],</span>
<span class="p">[</span><span class="o">-</span><span class="mf">422374.54395311</span><span class="p">,</span> <span class="o">-</span><span class="mf">593387.5716581973</span><span class="p">],</span>
<span class="p">[</span><span class="o">-</span><span class="mf">444804.1828119133</span><span class="p">,</span> <span class="o">-</span><span class="mf">765348.1362423564</span><span class="p">],</span>
<span class="p">[</span><span class="o">-</span><span class="mf">631717.839968608</span><span class="p">,</span> <span class="o">-</span><span class="mf">735441.9510972851</span><span class="p">],</span>
<span class="p">[</span><span class="o">-</span><span class="mf">594335.108537269</span><span class="p">,</span> <span class="o">-</span><span class="mf">570957.932799394</span><span class="p">]]],</span>
<span class="s1">'type'</span><span class="p">:</span> <span class="s1">'Polygon'</span><span class="p">}</span>
<span class="n">raster_stats</span><span class="p">(</span><span class="n">geom</span><span class="p">,</span> <span class="n">precip</span><span class="p">,</span> <span class="n">stats</span><span class="o">=</span><span class="s2">"min median max"</span><span class="p">)</span>
<span class="p">[{</span><span class="s1">'__fid__'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">'max'</span><span class="p">:</span> <span class="mf">1011.0</span><span class="p">,</span> <span class="s1">'median'</span><span class="p">:</span> <span class="mf">451.0</span><span class="p">,</span> <span class="s1">'min'</span><span class="p">:</span> <span class="mf">229.0</span><span class="p">}]</span>
</code></pre></div>
<h2>Categorical</h2>
<p>We're not limited to descriptive statistics for <em>continuous</em> rasters either; we
can get unique pixel counts for <em>categorical</em> rasters as well. In this example,
we've got a raster of 2005 land cover (i.e. general vegetation type). </p>
<p><img alt="states_veg" src="/assets/img/states_veg.jpeg"></p>
<p>Note that we specify only the stats that make sense for categorical data, and that <code>categorical=True</code>
adds a count of each unique pixel value.</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">landcover</span> <span class="o">=</span> <span class="s2">"/data/workspace/rasterstats_blog/NA_LandCover_2005.img"</span>
<span class="o">>>></span> <span class="n">veg_stats</span> <span class="o">=</span> <span class="n">raster_stats</span><span class="p">(</span><span class="n">states</span><span class="p">,</span> <span class="n">landcover</span><span class="p">,</span>
<span class="n">stats</span><span class="o">=</span><span class="s2">"count majority minority unique"</span><span class="p">,</span>
<span class="n">copy_properties</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">nodata_value</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">categorical</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="o">>>></span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">veg_stats</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="s1">'NAME'</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"Oregon"</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="p">{</span><span class="mi">1</span><span class="p">:</span> <span class="mi">999956</span><span class="p">,</span>
<span class="mi">3</span><span class="p">:</span> <span class="mi">6</span><span class="p">,</span>
<span class="mi">5</span><span class="p">:</span> <span class="mi">3005</span><span class="p">,</span>
<span class="mi">6</span><span class="p">:</span> <span class="mi">198535</span><span class="p">,</span>
<span class="mi">8</span><span class="p">:</span> <span class="mi">2270805</span><span class="p">,</span>
<span class="mi">10</span><span class="p">:</span> <span class="mi">126199</span><span class="p">,</span>
<span class="mi">14</span><span class="p">:</span> <span class="mi">20883</span><span class="p">,</span>
<span class="mi">15</span><span class="p">:</span> <span class="mi">301884</span><span class="p">,</span>
<span class="mi">16</span><span class="p">:</span> <span class="mi">17452</span><span class="p">,</span>
<span class="mi">17</span><span class="p">:</span> <span class="mi">39246</span><span class="p">,</span>
<span class="mi">18</span><span class="p">:</span> <span class="mi">28872</span><span class="p">,</span>
<span class="mi">19</span><span class="p">:</span> <span class="mi">2174</span><span class="p">,</span>
<span class="s1">'COUNTRY'</span><span class="p">:</span> <span class="s1">'USA'</span><span class="p">,</span>
<span class="s1">'EDIT'</span><span class="p">:</span> <span class="s1">'NEW'</span><span class="p">,</span>
<span class="s1">'EDIT_DATE'</span><span class="p">:</span> <span class="s1">'20060803'</span><span class="p">,</span>
<span class="s1">'NAME'</span><span class="p">:</span> <span class="s1">'Oregon'</span><span class="p">,</span>
<span class="s1">'STATEABB'</span><span class="p">:</span> <span class="s1">'US-OR'</span><span class="p">,</span>
<span class="s1">'Shape_Area'</span><span class="p">:</span> <span class="mf">250563567264.0</span><span class="p">,</span>
<span class="s1">'Shape_Leng'</span><span class="p">:</span> <span class="mf">2366783.00361</span><span class="p">,</span>
<span class="s1">'UIDENT'</span><span class="p">:</span> <span class="mi">124704</span><span class="p">,</span>
<span class="s1">'__fid__'</span><span class="p">:</span> <span class="mi">35</span><span class="p">,</span>
<span class="s1">'count'</span><span class="p">:</span> <span class="mi">4009017</span><span class="p">,</span>
<span class="s1">'majority'</span><span class="p">:</span> <span class="mi">8</span><span class="p">,</span>
<span class="s1">'minority'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
<span class="s1">'unique'</span><span class="p">:</span> <span class="mi">12</span><span class="p">}</span>
</code></pre></div>
<p>Of course the raw pixel values alone don't make much sense. We need to interpret
them as land cover classes:</p>
<div class="highlight"><pre><span></span><code>Value  Class_name
1 Temperate or sub-polar needleleaf forest
2 Sub-polar taiga needleleaf forest
3 Tropical or sub-tropical broadleaf evergreen
4 Tropical or sub-tropical broadleaf deciduous
5 Temperate or sub-polar broadleaf deciduous
6 Mixed Forest
7 Tropical or sub-tropical shrubland
8 Temperate or sub-polar shrubland
9 Tropical or sub-tropical grassland
10 Temperate or sub-polar grassland
11 Sub-polar or polar shrubland-lichen-moss
12 Sub-polar or polar grassland-lichen-moss
13 Sub-polar or polar barren-lichen-moss
14 Wetland
15 Cropland
16 Barren Lands
17 Urban and Built-up
18 Water
19 Snow and Ice
</code></pre></div>
<p>So, for our Oregon example above we can see that, despite Oregon's reputation as
a lush green landscape, the majority land cover class (#8) is "Temperate or
sub-polar shrubland", at 2.27 million pixels out of roughly 4 million total.</p>
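<p>Doing that interpretation in Python is just a dict lookup over the integer keys. A sketch, using a hypothetical subset of the class table and a trimmed-down copy of the Oregon record above:</p>

```python
# Hypothetical subset of the Value -> Class_name table above.
CLASS_NAMES = {
    1: "Temperate or sub-polar needleleaf forest",
    6: "Mixed Forest",
    8: "Temperate or sub-polar shrubland",
    10: "Temperate or sub-polar grassland",
}

# Trimmed-down stand-in for one record of the categorical output.
oregon = {1: 999956, 6: 198535, 8: 2270805, 10: 126199,
          'NAME': 'Oregon', 'count': 4009017}

# Keep only the integer (pixel-value) keys and relabel them.
named = {CLASS_NAMES.get(k, k): v for k, v in oregon.items()
         if isinstance(k, int)}
majority = max(named, key=named.get)
print(majority)
```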
<p>There's a lot more functionality that isn't covered in this post but you get the
picture... please check it out and let me know what you think.</p>
<h1>Creating UTFGrids directly from a polygon datasource</h1>
<p><em>2012-08-20 · Matthew T. Perry</em></p>
<p>We've begun to rely on the interactivity provided by <a href="http://mapbox.com/mbtiles-spec/utfgrid/">UTFGrids</a> in many of our recent web maps. (Quick recap: UTFGrids are "invisible" map tiles that allow direct interactivity with feature attributes without querying the server.) Earlier this year, I created the <a href="/2012/02/24/utfgrids-with-openlayers-and-tilestache/">initial OpenLayers UTFGrid support</a> and was glad to see it accepted into OpenLayers 2.12 (with some enhancements).</p>
<p>With the client-side javascript support in place, the only missing piece in the workflow was to create the UTFGrid .json files.
We had experimented with several alternate <a href="https://github.com/springmeyer/utfgrid-example-writers">UTFGrid renderers</a>, but Mapnik's rendering was by far the fastest and produced the best results.
Using Tilemill was a convenient way to leverage the Mapnik UTFGrid renderer, but it came at the cost of a somewhat circuitous and manual workflow: </p>
<ol>
<li>Load the data into <a href="http://mapbox.com/tilemill/">Tilemill</a></li>
<li>Configure interactivity fields</li>
<li>Export to .mbtiles</li>
<li><a href="http://blog.perrygeo.net/2012/03/25/working-with-mbtiles-in-python/">Convert to .json files</a></li>
</ol>
<p>What we really needed was a <strong>script to take a polygon shapefile and render the UTFGrids directly to files</strong>. <a href="http://mapnik.org">Mapnik</a> would provide the rendering while the <a href="http://www.maptiler.org/google-maps-coordinates-tile-bounds-projection/globalmaptiles.py">Global Map Tiles</a> python module would provide the logic for going back and forth between geographic coordinates and tile grid coordinates. From there it's just a matter of determining the extent of the data set and, for a specified set of zoom levels, looping through and using Mapnik to render the UTFGrid to a .json file in <code>Z/X/Y.json</code> directory structure. </p>
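<p>The coordinate math at the heart of that loop can be sketched in a few lines. This is a toy stand-in for the Global Map Tiles module's logic (the function and constant names are mine), converting spherical-mercator meters to tile indices in a top-origin scheme:</p>

```python
TILE_SIZE = 256
ORIGIN = 20037508.342789244  # half the web-mercator world width, in meters

def meters_to_tile(mx, my, zoom):
    """Spherical-mercator meters -> (tx, ty) in a top-origin (OSM/Google)
    tile scheme. The real GlobalMercator module also handles pixel coords,
    tile bounds, and TMS (bottom-origin) flipping."""
    res = 2 * ORIGIN / (TILE_SIZE * 2 ** zoom)  # meters per pixel
    px = (mx + ORIGIN) / res
    py = (ORIGIN - my) / res                    # top-origin: y grows southward
    return int(px // TILE_SIZE), int(py // TILE_SIZE)

print(meters_to_tile(0.0, 0.0, 1))  # the origin falls in tile (1, 1) at zoom 1
```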
<p><a href="https://github.com/Ecotrust/create-utfgrids" class="btn btn-primary">Get `create-utfgrids` on github</a></p>
<p>If we have a mercator polygon shapefile of ecoregions and want to render UTFGrids for zoom levels 3 through 5 using the <code>dom_desc</code> and <code>div_desc</code> attributes, we could use a command like</p>
<div class="highlight"><pre><span></span><code>$ ./create_utfgrids.py test_data/bailey_merc.shp <span class="m">3</span> <span class="m">5</span> ecoregions -f dom_desc,div_desc
WARNING:
This script assumes a polygon shapefile <span class="k">in</span> spherical mercator projection.
If any of these assumptions are not true, don<span class="err">'</span>t count on the results!
* Processing Zoom Level <span class="m">3</span>
* Processing Zoom Level <span class="m">4</span>
* Processing Zoom Level <span class="m">5</span>
</code></pre></div>
<p>and inspect the output (e.g. zoom level 5, X=20, Y=18)</p>
<div class="highlight"><pre><span></span><code>$ cat ecoregions/5/20/18.json <span class="p">|</span> python -mjson.tool
<span class="o">{</span>
<span class="s2">"data"</span>: <span class="o">{</span>
<span class="s2">"192"</span>: <span class="o">{</span>
<span class="s2">"div_desc"</span>: <span class="s2">"RAINFOREST REGIME MOUNTAINS"</span>,
<span class="s2">"dom_desc"</span>: <span class="s2">"HUMID TROPICAL DOMAIN"</span>
<span class="o">}</span>,
...
<span class="s2">"grid"</span>: <span class="o">[</span>
<span class="s2">" !!!!!!!!!#####</span>$<span class="s2">%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"</span>,
...
</code></pre></div>
<p>Some caveats:</p>
<ul>
<li>This currently only works for polygon datasets in a Web Mercator projection.</li>
<li>It's only tested with shapefiles as it assumes a single-layer datasource at the moment. Full OGR Datasource support would not be too difficult to add for PostGIS, etc.</li>
<li>It assumes a top-origin tile scheme (as do OSM and Google Maps). Supporting TMS bottom-origin schemes in the future should be straightforward. </li>
<li>Requires OGR and Mapnik >= 2.0 with python bindings. Finding windows binaries for the required version of Mapnik may be difficult so using OSX/Linux is recommended at this time. </li>
</ul>
<p>Many thanks to Dane Springmeyer for his help on UTFGrid related matters
and to Klokan Petr Přidal for his <a href="http://www.maptiler.org/google-maps-coordinates-tile-bounds-projection/">MapTiler docs</a>.</p>
<h1>Introducing the Madrona framework</h1>
<p><em>2012-07-11 · Matthew T. Perry</em></p>
<h3><a href="http://madrona.ecotrust.org">Madrona</a>: A software framework for effective place-based decision making</h3>
<p><img alt="Madrona" src="http://madrona.ecotrust.org/assets/img/madrona-logo.png"></p>
<p>My work at <a href="http://www.ecotrust.org/">Ecotrust</a> mainly revolves around creating web-based spatial analysis tools - software to bring data-driven science to the place-based decision making process. This began several years ago when I joined the MarineMap team. Since working with Ecotrust, we've taken the MarineMap software far beyond its original niche. What was once a specific tool for marine protected area planning has now become a powerful framework for <a href="http://madrona.ecotrust.org/experience/">all sorts of web-based spatial tools</a> in the realms of marine, forestry, conservation planning, aquatic habitat restoration, etc. So, in a sense, <a href="http://madrona.ecotrust.org">Madrona</a> is a recognition of that evolution.</p>
<p>From the official <a href="http://madrona.ecotrust.org">Madrona</a> release announcement from the <a href="http://blog.ecotrust.org/software-for-21st-century-decisions-2/">Ecotrust blog post</a>:</p>
<blockquote>
<p>Over the last year we’ve distilled the best ideas from our most successful tools into a suite of software building blocks that can be mixed and matched to create cutting-edge software for decision support and spatial planning at any scale. These building blocks are already at the heart of our work and now we’re ready to share them with you.</p>
</blockquote>
<p>So what is <a href="http://madrona.ecotrust.org">Madrona</a> from a developer's perspective? </p>
<ul>
<li>A set of <em>python</em> <em>django</em> apps that provide models, views and templates for representing spatial features and solving problems specific to spatial decision tools.</li>
<li>A RESTful <em>API</em> for accessing spatial features</li>
<li>A collection of <em>javascript</em> libraries (based on JQuery) to provide a web-based interface to the API.</li>
</ul>
<p>In short, we think it's a great platform for spatial tools and we want to open it up to the wider developer audience. Ecotrust already has many <a href="http://madrona.ecotrust.org/experience/">madrona-based apps</a> in the wild (with many more in development) but we're hoping to get other folks using (and contributing to) the Madrona framework in the future.</p>
<p>I know this post is short on technical details but there will be more to come ... for now, check out the <a href="http://madrona.ecotrust.org/technology/">technology page</a> for an overview or the <a href="http://madrona.ecotrust.org/developer/">developer's page</a> to dive in.</p>
<h1>Migrating from Wordpress to Jekyll</h1>
<p><em>2012-04-28 · Matthew T. Perry</em></p>
<p>I just switched this blog from an ancient version of wordpress running on a VPS
to a static-file <a href="http://jekyllbootstrap.com/">jekyll bootstrap</a> site
(hosted by <a href="http://github.com/perrygeo/perrygeo.github.com">github</a>).
Let me know if you experience any weirdness on the site or feeds. I've taken good measures to make sure links don't break (old URLs should get a 301 permanent redirect to blog.perrygeo.net) but let me know if you get any 404s.</p>
<h3>So why do it?</h3>
<ol>
<li>Having a PHP-MySQL app running on a VPS just to serve up a bunch of blog posts seemed excessive. I don't have the desire to maintain that sort of infrastructure for a simple blog!</li>
<li>Wordpress' editing and admin interface suck. I prefer vim and bash.</li>
<li>Markdown is a great language for quickly banging out blog posts.</li>
<li>Static files just make sense for what is basically static content.</li>
<li>Github pages provides the hosting for me and even handles CNAMEs for DNS.</li>
<li>Managing revisions with <code>git</code>.</li>
</ol>
<h3>The conversion process</h3>
<p>It was not an entirely smooth transition, most of which can be traced directly to dumb decisions on my part. I won't recount the entire process (there are plenty of guides on the internet) but I'll outline the major steps here:</p>
<ol>
<li>Export the wordpress blog to an xml file. I had to use <code>xmllint</code> to clean it up a bit.</li>
<li>Set up a <a href="http://disqus.com">disqus</a> account and import my wordpress file. Disqus will handle all the comments which are the only dynamic content on the page. </li>
<li>Use <a href="https://github.com/thomasf/exitwp">exitwp.py</a> to convert the xml to jekyll markdown files. This worked OK. Not great. Tags and formatting did not come through as expected and I had to wrestle the script a bit. Tables were destroyed and some iframes (youtube links) were lost. </li>
<li>Forked Jekyll Bootstrap and brought in my posts. </li>
<li>Started tweaking the CSS and markdown to get the formatting right. Still have a ways to go on this front - let me know if there is any content you'd like me to restore faster than others.</li>
<li>Had to write a little web service to redirect posts; the old blog stupidly used the default wordpress URLs like <code>/wordpress/?p=4</code> which needed to go to <code>/2010/01/01/blah</code></li>
<li>My images were all over the place; some I had in wordpress uploads, others on various servers, some were absolute links, others relative. Gathering them all in one place and using some sed-fu to get the paths right was essential.</li>
<li>Retagged some posts - still working on tags.</li>
<li>Set up Google Analytics to track usage. </li>
</ol>
<p>I think that's about it. There are still some big formatting problems on older posts (mostly due to the fact that I used blockquotes for code). And tables are still destroyed. I'll be working on cleaning these up as I go along. </p>
<p>Overall impression of Jekyll-Bootstrap and hosting with Github pages? <strong>Awesome</strong>. I would highly recommend it to anyone starting a new blog or converting a smaller/better-behaved wordpress site.
It is so much better than having to deal with PHP and MySQL (hopefully the last time I'll ever see them!). But the conversion was a bit tricky and took way more of my Friday and Saturday than I'd like to admit. I would not want to do that again... but I'm glad I did.</p>
<p>What do you think of the new digs?</p>
<h1>Working with mbtiles in python</h1>
<p><em>2012-03-25 · Matthew T. Perry</em></p>
<p><a href="https://github.com/perrygeo/python-mbtiles">python-mbtiles</a>. Check it out.</p>
<p>I've been working a bit with Tilemill lately and love the Carto CSS styling, interactivity through UTFGrids and being able to export the whole deal as a single <a href="http://mapbox.com/mbtiles-spec/">mbtiles</a> sqlite database. But when it comes to working with the mbtiles databases, I've found both Tilestache and Tilestream to be fairly limiting:</p>
<p>Tilestache serves images but does not (yet) serve up UTFGrids <em>directly from mbtiles</em>, while Tilestream hardcodes a "grid()" JSONP callback around the returned json data, making it fairly specific to the Wax client libraries.</p>
<p>So I went down two paths, first trying to export all the tiles out of mbtiles to json and png files (for those times when you just want to serve static files), then trying to write a simple server that would do dynamic jsonp callbacks. Turns out that in the process, I was able to abstract a lot of the Python&lt;-&gt;SQLite interaction into some generic python classes.</p>
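<p>The core of those generic classes is just parameterized SQL against the <code>tiles</code> table defined by the mbtiles spec. A minimal reader sketch (not python-mbtiles' actual API; names are illustrative), exercised against a throwaway database:</p>

```python
import os
import sqlite3
import tempfile

class MBTiles:
    """Minimal reader for an mbtiles 'tiles' table.

    A sketch of the generic-class idea, not python-mbtiles' real API.
    The table schema comes from the mbtiles spec.
    """
    def __init__(self, path):
        self.db = sqlite3.connect(path)

    def tile(self, z, x, y):
        """Return the raw tile blob for (z, x, y), or None if absent."""
        row = self.db.execute(
            "SELECT tile_data FROM tiles"
            " WHERE zoom_level=? AND tile_column=? AND tile_row=?",
            (z, x, y),
        ).fetchone()
        return row[0] if row else None

# Build a throwaway database just to exercise the reader.
path = os.path.join(tempfile.mkdtemp(), "demo.mbtiles")
con = sqlite3.connect(path)
con.execute("CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER,"
            " tile_row INTEGER, tile_data BLOB)")
con.execute("INSERT INTO tiles VALUES (0, 0, 0, ?)", (b"fake-png-bytes",))
con.commit()

mb = MBTiles(path)
print(mb.tile(0, 0, 0))
```

Note that mbtiles stores rows in TMS (bottom-origin) order, so a real reader also has to flip the y coordinate for top-origin schemes.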
<p>Thus <a href="https://github.com/perrygeo/python-mbtiles">python-mbtiles</a> was born. It provides a simple mbtiles web server, a conversion script, and some python classes to work with. No frills, no anything really at this point. More an experiment gone right that might be useful to someone out there in GeoPython land. Enjoy and let me know if you have any ideas!</p>
<h1>Average Aspect</h1>
<p><em>2012-03-18 · Matthew T. Perry</em></p>
<p>Ever try to figure out what the average aspect of an area is? i.e.</p>
<blockquote>
<p>What direction does this hillside face?</p>
</blockquote>
<p>Let's say we want to determine the average elevation of an area based on a raster DEM. Just take the arithmetic mean of all the elevation cells contained in the area - a simple zonal statistics problem.</p>
<p>Turns out that aspect is not quite as straightforward. True, we can easily use <a href="http://www.gdal.org/gdaldem.html">gdaldem</a> to create an aspect map.</p>
<p><code>gdaldem aspect elevation.tif aspect.tif</code></p>
<p>This gives a raster with values in degrees: 0 is north, 90 is east, 180 is south, etc... but note that 360 is north as well. We're dealing with angular units, not linear units. </p>
<p>For example, take a nearly north-facing hillside; the left edge faces slightly NW (350 degrees) while the right edge faces slightly NE (10 degrees).</p>
<p>The arithmetic mean of the aspect values = <code>(350+350+10+10)/4 = 180°</code>. Due south? That's entirely wrong! It doesn't take into account the angular units. For that we need to create grids representing the <em>sin</em> and <em>cos</em> of the aspect. </p>
<p>Luckily you can use the handy <a href="http://svn.osgeo.org/gdal/trunk/gdal/swig/python/scripts/gdal_calc.py">gdal_calc.py</a> utility that comes with recent versions of gdal. This allows you to apply numpy's trigonometric functions to a raster...</p>
<div class="highlight"><pre><span></span><code>gdal_calc.py -A aspect.tif --calc "cos(radians(A))" --format "GTiff" --outfile cos_aspect.tif
gdal_calc.py -A aspect.tif --calc "sin(radians(A))" --format "GTiff" --outfile sin_aspect.tif
</code></pre></div>
<p>Now we can look at the sum of the cos/sin grid cells for our area and take the arctangent according to this python code</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">math</span>
<span class="c1"># atan2(east, north) gives the compass bearing (0 = north) of the summed vectors</span>
<span class="n">avg_aspect_rad</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">atan2</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">sin_cells</span><span class="p">),</span> <span class="nb">sum</span><span class="p">(</span><span class="n">cos_cells</span><span class="p">))</span>
<span class="n">avg_aspect_deg</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">degrees</span><span class="p">(</span><span class="n">avg_aspect_rad</span><span class="p">)</span> <span class="o">%</span> <span class="mi">360</span>
<span class="nb">print</span><span class="p">(</span><span class="n">avg_aspect_deg</span><span class="p">)</span>
</code></pre></div>
<p>In our example avg_aspect_deg comes out to an aspect of 0 degrees (due north) which is exactly what we'd expect. </p>
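<p>The whole circular-mean trick fits in a few self-contained lines. A sketch (the <code>mean_aspect</code> helper name is mine, not part of any library), using the hillside example from above:</p>

```python
import math

def mean_aspect(bearings):
    """Circular mean of compass bearings (0 = north), in degrees.

    Decompose each bearing into east (sin) and north (cos) components,
    sum them, and take the bearing of the resulting vector.
    """
    east = sum(math.sin(math.radians(b)) for b in bearings)
    north = sum(math.cos(math.radians(b)) for b in bearings)
    return math.degrees(math.atan2(east, north)) % 360

hillside = [350, 350, 10, 10]
naive = sum(hillside) / len(hillside)   # 180.0 -- due south, nonsense
circular = mean_aspect(hillside)        # ~0 -- due north, as expected
print(naive, circular)
```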
<p>Thanks to Dan Patterson for his <a href="http://forums.esri.com/Thread.asp?c=3&f=40&t=119358&mc=8#343468">forum post</a> which clued me into this approach.</p>
<h1>UTFGrids with OpenLayers and Tilestache</h1>
<p><em>2012-02-24 · Matthew T. Perry</em></p>
<p>A while back, the Development Seed team developed the <a href="http://mapbox.com/mbtiles-spec/utfgrid/">UTFGrid spec</a> to provide</p>
<blockquote>
<p>a standard, scalable way of encoding data for hundreds or thousands of features alongside your map tiles.</p>
</blockquote>
<h3>The basics</h3>
<p>In more detail, UTFGrids are invisible "ASCII art" and attribute data embedded in json. They sit "behind" your map tiles (they are not rendered visually) and allow quick attribute lookups <em>without</em> going back to the server. This enables a high degree of real-time map interactivity in an HTML web map - something that has typically been the strong point of plugin-based maps.</p>
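<p>The "ASCII art" characters aren't arbitrary: per the UTFGrid spec, each one encodes an index into the tile's <code>keys</code> array, with the codepoints for <code>"</code> and <code>\</code> skipped so the grid remains valid JSON. A small decoder sketch:</p>

```python
def grid_index(ch):
    """UTFGrid character -> index into the tile's 'keys' array,
    following the decoding rules in the UTFGrid spec."""
    code = ord(ch)
    if code >= 93:   # account for skipped backslash (92)
        code -= 1
    if code >= 35:   # account for skipped double quote (34)
        code -= 1
    return code - 32

# ' ' is key 0 (typically "no feature"); '!' is key 1, '#' key 2, and so on.
print([grid_index(c) for c in " !#$%"])
```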
<p>So take this tile image...</p>
<p><img alt="" src="http://vmap0.tiles.osgeo.org/wms/vmap0?LAYERS=basic&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&FORMAT=image%2Fjpeg&SRS=EPSG%3A900913&BBOX=-0.0007999986410141,5009377.084,5009377.084,10018754.1688&WIDTH=256&HEIGHT=256"> </p>
<p>and its corresponding "utfgrid" ...</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="o">!</span><span class="err">######</span><span class="o">$$$$%%%</span><span class="w"> </span><span class="o">%%%%</span><span class="w"> </span><span class="o">%</span><span class="w"> </span>
<span class="w"> </span><span class="o">!</span><span class="err">#######</span><span class="o">$$$$%%%</span><span class="w"> </span><span class="o">%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!</span><span class="err">#####</span><span class="w"> </span><span class="o">$$$%%%</span><span class="w"> </span><span class="o">%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!</span><span class="err">######</span><span class="w"> </span><span class="o">$$$$%%%</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="o">%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!!</span><span class="err">####</span><span class="w"> </span><span class="o">$$$$$%%%%</span><span class="w"> </span><span class="o">%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="o">!</span><span class="err">######</span><span class="w"> </span><span class="o">$$$$$$%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="o">!!</span><span class="err">#####</span><span class="w"> </span><span class="o">$$$$$$$%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!!!!</span><span class="err">####</span><span class="w"> </span><span class="o">$$$$$$%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!!!!</span><span class="err">####</span><span class="w"> </span><span class="o">$$$$$$%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!!!!</span><span class="err">####</span><span class="w"> </span><span class="o">$$$$$%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!!!!</span><span class="err">#####</span><span class="o">%</span><span class="w"> </span><span class="o">$$</span><span class="w"> </span><span class="o">%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!!!!</span><span class="err">###</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="o">%%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!!!</span><span class="w"> </span><span class="err">#####</span><span class="w"> </span><span class="s1">''''</span><span class="o">%%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="err">###</span><span class="w"> </span><span class="o">(</span><span class="err">'</span><span class="o">%%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">)</span><span class="w"> </span><span class="err">###</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="o">(</span><span class="w"> </span><span class="o">((%%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">))</span><span class="w"> </span><span class="err">##</span><span class="w"> </span><span class="o">(((((%%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">))</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="o">****(+%%%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">)</span><span class="w"> </span><span class="o">%**++++%%%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="o">,</span><span class="w"> </span><span class="o">,</span><span class="w"> </span><span class="nt">------</span><span class="o">*+++++%%%%%%%%%</span><span class="w"></span>
<span class="o">.</span><span class="w"> </span><span class="o">,,,,,</span><span class="nt">------</span><span class="o">+++++++%%%%%%%%</span><span class="w"></span>
<span class="o">..</span><span class="w"> </span><span class="o">/,,,,,,</span><span class="nt">------</span><span class="o">++++++%%%%%%%%%</span><span class="w"></span>
<span class="o">.</span><span class="w"> </span><span class="o">//,,,,,,</span><span class="nt">------000</span><span class="o">++</span><span class="nt">000</span><span class="o">%%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="nt">211</span><span class="o">,,,,,</span><span class="nt">33------00000000</span><span class="o">%%%%%%</span><span class="w"></span>
<span class="w"> </span><span class="nt">2221</span><span class="o">,,,,</span><span class="nt">33333---00000000000</span><span class="o">%%%%</span><span class="w"></span>
<span class="nt">222222</span><span class="o">,,,,</span><span class="nt">3635550000000000000</span><span class="o">%%%</span><span class="w"></span>
<span class="nt">222222</span><span class="o">,,,,</span><span class="nt">6665777008900000000</span><span class="o">%%%</span><span class="w"></span>
<span class="nt">22222</span><span class="p">::</span><span class="nd">66666777788889900000</span><span class="w"> </span><span class="o">%%%%</span><span class="w"></span>
<span class="nt">22222</span><span class="o">:;;;;%%=</span><span class="nt">7</span><span class="o">%</span><span class="nt">8888890</span><span class="w"> </span><span class="nt">0</span><span class="w"> </span><span class="o">%%%%</span><span class="w"></span>
<span class="nt">22222</span><span class="o">;;;;</span><span class="w"> </span><span class="o">==??%%</span><span class="nt">888888</span><span class="w"> </span><span class="nt">00</span><span class="w"> </span><span class="o">%%%%%</span><span class="w"></span>
<span class="nt">222222</span><span class="w"> </span><span class="o">;;</span><span class="w"> </span><span class="o">=??%%%</span><span class="nt">8888</span><span class="w"> </span><span class="o">%%%%</span><span class="w"></span>
<span class="nt">222</span><span class="w"> </span><span class="o">;;</span><span class="w"> </span><span class="o">?</span><span class="nt">A</span><span class="o">>>@@@</span><span class="w"> </span><span class="nt">B</span><span class="o">%</span><span class="w"></span>
<span class="nt">CCC</span><span class="w"> </span><span class="o">;;</span><span class="w"> </span><span class="nt">DEE</span><span class="o">@@@</span><span class="w"> </span><span class="nt">BB</span><span class="w"></span>
</code></pre></div>
<p>You can see how each character corresponds with a country. The character's code is used as a lookup key to retrieve the data associated with that feature (which is also included in the json tile).</p>
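<p>To make that lookup concrete, here is a small Python sketch of how a client resolves one grid cell. The character decoding follows the UTFGrid spec (ids are shifted by 32 and skip the two codepoints that would break JSON strings); the tiny tile object is a made-up example, not real tile output:</p>

```python
def decode_id(ch):
    """Map one grid character back to its index in the tile's 'keys' array.

    Per the UTFGrid spec, ids are encoded as codepoint = id + 32, skipping
    34 (") and 92 (backslash) so the grid stays valid inside JSON strings.
    """
    code = ord(ch)
    if code >= 93:
        code -= 1
    if code >= 35:
        code -= 1
    return code - 32

def lookup(tile, row, col):
    """Resolve the attributes for one cell, or None for an empty cell."""
    key = tile["keys"][decode_id(tile["grid"][row][col])]
    return tile["data"].get(key)

# A hypothetical, minimal 2x2 tile (real grids are much larger,
# e.g. 64x64 cells for a 256px tile at scale 4)
tile = {
    "grid": ["!!", " #"],
    "keys": ["", "77", "40"],
    "data": {"77": {"NAME": "Canada"}, "40": {"NAME": "Brazil"}},
}
```

<p>Here <code>lookup(tile, 0, 0)</code> returns the attributes keyed by "77" (the key index encoded by <code>!</code>), while a space character maps to key index 0 — the "no feature" cell.</p>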
<p>If you want to dig in, check out the <a href="http://mapbox.com/demo/visiblemap/">mapbox demo</a>. </p>
<h3>The Server side</h3>
<p>I'm going to assume you have <a href="http://tilestache.org/">Tilestache</a> and <a href="https://github.com/mapnik/mapnik">Mapnik 2+</a> already installed (if not, you should!). The steps to configure your server for UTFGrids are fairly simple. </p>
<p><strong>First</strong>, set up a Mapnik XML file pointing to your data source.</p>
<div class="highlight"><pre><span></span><code><span class="cp"><?xml version="1.0"?></span>
<span class="cm"><!-- An ultra simple Mapnik stylesheet --></span>
<span class="cp"><!DOCTYPE Map [</span>
<span class="cp"><!ENTITY google_mercator "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs +over"></span>
]>
<span class="nt"><Map</span> <span class="na">srs=</span><span class="s">"&google_mercator;"</span><span class="nt">></span>
<span class="nt"><Style</span> <span class="na">name=</span><span class="s">"style"</span><span class="nt">></span>
<span class="nt"><Rule></span>
<span class="nt"><PolygonSymbolizer></span>
<span class="nt"><CssParameter</span> <span class="na">name=</span><span class="s">"gamma"</span><span class="nt">></span>.65<span class="nt"></CssParameter></span>
<span class="nt"><CssParameter</span> <span class="na">name=</span><span class="s">"fill"</span><span class="nt">></span>green<span class="nt"></CssParameter></span>
<span class="nt"><CssParameter</span> <span class="na">name=</span><span class="s">"fill-opacity"</span><span class="nt">></span>0.5<span class="nt"></CssParameter></span>
<span class="nt"></PolygonSymbolizer></span>
<span class="nt"><LineSymbolizer></span>
<span class="nt"><CssParameter</span> <span class="na">name=</span><span class="s">"stroke"</span><span class="nt">></span>#666<span class="nt"></CssParameter></span>
<span class="nt"><CssParameter</span> <span class="na">name=</span><span class="s">"stroke-width"</span><span class="nt">></span>0.3<span class="nt"></CssParameter></span>
<span class="nt"></LineSymbolizer></span>
<span class="nt"></Rule></span>
<span class="nt"></Style></span>
<span class="nt"><Layer</span> <span class="na">name=</span><span class="s">"layer"</span> <span class="na">srs=</span><span class="s">"&google_mercator;"</span><span class="nt">></span>
<span class="nt"><StyleName></span>style<span class="nt"></StyleName></span>
<span class="nt"><Datasource></span>
<span class="nt"><Parameter</span> <span class="na">name=</span><span class="s">"type"</span><span class="nt">></span>shape<span class="nt"></Parameter></span>
<span class="nt"><Parameter</span> <span class="na">name=</span><span class="s">"file"</span><span class="nt">></span>sample_data/world_merc.shp<span class="nt"></Parameter></span>
<span class="nt"></Datasource></span>
<span class="nt"></Layer></span>
<span class="nt"></Map></span>
</code></pre></div>
<p><strong>Next</strong>, set up the TileStache configuration file.</p>
<div class="highlight"><pre><span></span><code>{
  "cache": {
    "name": "Disk",
    "path": "/tmp/stache"
  },
  "layers": {
    "world":
    {
      "provider": {"name": "mapnik", "mapfile": "style.xml"}
    },
    "world_utfgrid":
    {
      "provider":
      {
        "class": "TileStache.Goodies.Providers.MapnikGrid:Provider",
        "kwargs":
        {
          "mapfile": "style.xml",
          "fields": ["NAME", "POP2005"],
          "layer_index": 0,
          "scale": 4
        }
      }
    }
  }
}
</code></pre></div>
<p>Finally, you're ready to run the TileStache server...</p>
<div class="highlight"><pre><span></span><code>tilestache-server.py -c your.cfg -i localhost -p 7890
</code></pre></div>
<p>Now you should be serving UTFGrids at <code>http://localhost:7890/world_utfgrid/</code></p>
<h3>The Client side</h3>
<p>Now we need something to consume the UTFGrid tiles and interact with them in an HTML/JS environment. The original client implementation of UTFGrid support is provided by <a href="http://mapbox.com/wax/">Wax</a> which sits atop mapping clients like Modest Maps and Leaflet. Wax is very slick and easy to use but doesn't work so well for more complex arrangements or with OpenLayers-based maps. </p>
<p>Rather than clog up Wax with the complex UTFGrid use cases that we envisioned, we decided to implement a UTFGrid client in native OpenLayers. Hence my project for the <a href="http://wiki.osgeo.org/wiki/IslandWood_Code_Sprint_2012">OSGEO code sprint</a> was born.</p>
<p><img alt="olexample.PNG" src="/assets/img/uploads/2012/02/olexample.PNG"></p>
<p>The result was a new OpenLayers Layer which loads up the json "tiles" behind the scenes...</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="k">var</span><span class="w"> </span><span class="n">grid_layer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new</span><span class="w"> </span><span class="n">OpenLayers</span><span class="o">.</span><span class="n">Layer</span><span class="o">.</span><span class="n">UTFGrid</span><span class="p">(</span><span class="w"> </span>
<span class="w"> </span><span class="s1">'Invisible UTFGrid Layer'</span><span class="p">,</span><span class="w"> </span>
<span class="w"> </span><span class="s2">"./utfgrid/world_utfgrid/${z}/${x}/${y}.json"</span><span class="w"></span>
<span class="w"> </span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="n">map</span><span class="o">.</span><span class="n">addLayer</span><span class="p">(</span><span class="n">grid_layer</span><span class="p">);</span><span class="w"></span>
</code></pre></div>
<p>and an OpenLayers Control that handles how the mouse events interact with the grid. In this example, as the mouse moves over the map, a custom callback is fired off which updates a div with some attribute information.</p>
<div class="highlight"><pre><span></span><code>var callback = function(attributes) {
    if (attributes) {
        var msg = "<span class="nt"><strong></span>In 2005, " + attributes.NAME;
        msg += " had a population of " + attributes.POP2005 + " people.<span class="nt"></strong></span>";
        var element = OpenLayers.Util.getElement('attrsdiv');
        element.innerHTML = msg;
        return true;
    } else {
        this.element.innerHTML = '';
        return false;
    }
};
var control = new OpenLayers.Control.UTFGrid({
    'handlerMode': 'move',
    'callback': callback
});
map.addControl(control);
</code></pre></div>
<p>Overall the design goal was to decouple the loading/tiling of the UTFGrids from the interactivity/control. I think this works out nicely and, while a bit more cumbersome than the method used by Wax, it is more flexible and integrates well with existing OpenLayers apps. </p>
<p>You can see them in action on the examples pages:</p>
<ul>
<li>
<p>Demonstrating the use of <a href="http://labs.ecotrust.org/utfgrid/events.html">different event handlers</a> (click, hover, move)</p>
</li>
<li>
<p>Demonstrating <a href="http://labs.ecotrust.org/utfgrid/multi.html">multiple interactivity layers</a> (the interactivity layer need not be visible in the map tiles!)</p>
</li>
</ul>
<p>And feel free to check out <a href="https://github.com/perrygeo/openlayers/tree/utfgrid">my github fork</a> for the code. </p>
<p>What do you think? Let me know...</p>Optimizing KML for hierarchical polygon data2011-05-18T00:00:00-06:002011-05-18T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2011-05-18:/optimizing-kml-for-hierarchical-polygon-data.html<p>For all the benefits of KML, it is decidedly a step backwards for handling large vector datasets. Most KML clients, including the canonical Google Earth application, experience debilitating slow-down when viewing a couple dozen MB of vector data - datasets that I could easily open on a Pentium 4 in ArcView …</p><p>For all the benefits of KML, it is decidedly a step backwards for handling large vector datasets. Most KML clients, including the canonical Google Earth application, experience debilitating slow-down when viewing a couple dozen MB of vector data - datasets that I could easily open on a Pentium 4 in ArcView 3.2 10 years ago! </p>
<p>The unfortunate reality is that optimizing the performance of KML datasets is conflated with the structure of the data and is thus the responsibility of the data publisher. The wisdom of combining styling, performance-related structure, organizational structure, geometry and attributes into a single file format may be questionable, but KML has become the de facto geographic markup language due to its other benefits. </p>
<p>Anyway, back to performance enhancements on big vector datasets... The concept of "regionation" is used by several KML tools to improve performance. From the <a href="http://google-latlong.blogspot.com/2010/09/faster-larger-closer-regionation-in.html">Google LatLong Blog</a>:</p>
<blockquote>
<p>You can think of Regionation as a <strong>hierarchical subdivision of points or tiles</strong>, which shows less detail from afar, and more detail as you zoom in to the globe. This dynamic loading creates clearer visualizations by minimizing clutter, while simultaneously speeding up the rendering process.</p>
</blockquote>
<p>In most implementations, there is a generic strategy for determining this hierarchy based on attributes or geometry size (in the case of vectors) or by a tile system. Neither is ideal when you want to preserve the vector nature of the data, split it into small, easily-loadable files and determine its view based on the <strong>natural hierarchy that is built into the data structure</strong>.</p>
<p>Specifically, I am thinking about watersheds here - the US <a href="http://nwis.waterdata.usgs.gov/tutorial/huc_def.html">Hydrologic Units</a>. Hydrologic units are watershed boundaries that are organized in a nested hierarchy; higher levels contain smaller watersheds that are contained within a single watershed from a "parent" level. The unique identifiers (hydrologic unit codes or HUCs) are rather ingenious as well; each level is represented by two digits, which are concatenated to form a single identifier that can be used to determine its "parent". For example:</p>
<p><img alt="Level 4 HUCs" src="/assets/img/uploads/2011/05/huc8.png"></p>
<p><img alt="Level 5 HUCs" src="/assets/img/uploads/2011/05/huc10.png"></p>
<p><img alt="Level 6 HUCs" src="/assets/img/uploads/2011/05/huc12.png"></p>
<p>Level 4 HUCs <br>
e.g. 170900<strong>11</strong></p>
<p>Level 5 HUCs <br>
e.g. 17090011<strong>04</strong></p>
<p>Level 6 HUCs <br>
e.g. 1709001104<strong>03</strong></p>
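<p>Given that structure, walking from any HUC up through its enclosing watersheds is just string slicing — a minimal sketch (helper names are mine, not from any library):</p>

```python
def huc_parent(huc):
    """Drop the trailing two digits to get the enclosing watershed's code."""
    return huc[:-2] if len(huc) > 2 else None

def huc_ancestors(huc):
    """All enclosing HUCs, from immediate parent up to the top level."""
    out = []
    parent = huc_parent(huc)
    while parent:
        out.append(parent)
        parent = huc_parent(parent)
    return out
```

<p>For example, <code>huc_ancestors("170900110403")</code> yields <code>["1709001104", "17090011", "170900", "1709", "17"]</code>.</p>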
<p>Instead of fabricating a hierarchy of features, why not just use this natural hierarchy to structure the KML documents?</p>
<p><img alt="hucs-1.png" src="/assets/img/uploads/2011/05/hucs-1.png"></p>
<p>Or as KML markup:</p>
<div class="highlight"><pre><span></span><code><span class="nt"><Placemark></span>
  <span class="nt"><name></span>17090009<span class="nt"></name></span>
  <span class="nt"><styleUrl></span>#HUC_8-default<span class="nt"></styleUrl></span>
  <span class="nt"><Polygon><outerBoundaryIs><LinearRing><coordinates></span>...
  <span class="nt"></coordinates></LinearRing></outerBoundaryIs></Polygon></span>
<span class="nt"></Placemark></span>
<span class="nt"><NetworkLink></span>
  <span class="nt"><name></span>17090009_children<span class="nt"></name></span>
  <span class="nt"><Region></span>
    <span class="nt"><LatLonAltBox></span>
      <span class="nt"><west></span>-123.001645628<span class="nt"></west></span>
      <span class="nt"><south></span>44.8300083641<span class="nt"></south></span>
      <span class="nt"><east></span>-122.203351254<span class="nt"></east></span>
      <span class="nt"><north></span>45.298653051<span class="nt"></north></span>
    <span class="nt"></LatLonAltBox></span>
    <span class="nt"><Lod></span>
      <span class="nt"><minLodPixels></span>256<span class="nt"></minLodPixels></span>
      <span class="nt"><maxLodPixels></span>1600<span class="nt"></maxLodPixels></span>
    <span class="nt"></Lod></span>
  <span class="nt"></Region></span>
  <span class="nt"><Link></span>
    <span class="nt"><href></span>./17090009_children.kml<span class="nt"></href></span>
    <span class="nt"><viewRefreshMode></span>onRegion<span class="nt"></viewRefreshMode></span>
  <span class="nt"></Link></span>
<span class="nt"></NetworkLink></span>
</code></pre></div>
<p>The advantages to this design are that you don't have to break the geometries up to fit into a square tiling pattern, data loads and renders in a logical pattern and there will always be 100 or fewer (usually far fewer) placemarks per file due to the design of the HUC data structure. File sizes stay low, network links load quickly and request/rendering occurs only when they come into view. For this example dataset totaling 300M of shapefiles, there are several hundred resulting kmz files without any repeated features and all less than ~ 150K each. In essence, it achieves optimal performance by its very design. </p>
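<p>The grouping step at the heart of that guarantee can be sketched in a few lines — this is a simplification of the approach, with hypothetical names, not code from the actual script:</p>

```python
from collections import defaultdict

def group_children(hucs):
    """Bucket each HUC code under its parent's code.

    Each bucket becomes one KML file. Since a level adds only two digits,
    a parent can have at most 100 children, so no file ever holds more
    than 100 placemarks.
    """
    groups = defaultdict(list)
    for huc in hucs:
        groups[huc[:-2]].append(huc)
    return dict(groups)
```

<p>Writing one <code>*_children.kml</code> per bucket, each wrapped in a NetworkLink with a Region, gives the lazy-loading behavior shown above.</p>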
<p>Here's a video of it in action:</p>
<iframe width="420" height="315" src="http://www.youtube.com/embed/5FgOfLEVX8M" frameborder="0"></iframe>
<p>This was all done with <a href="http://watershed-priorities.googlecode.com/hg/util/kml_regionate_heirarchy.py">a fairly "hackish" python script</a>. I'll continue to refine it as needed for this particular application but, at this time, it's not intended to be a reusable tool - if you want to use it, be prepared to dig through the source code and get your hands dirty. The same concept could theoretically be applied to any spatially-hierarchical vector data (think geographic boundaries ... country > state > county > city).</p>Um - nice “review” of QGIS2010-12-20T00:00:00-07:002010-12-20T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2010-12-20:/um-nice-review-of-qgis.html<p>RJ Zimmer at American Surveyor magazine did what he described as a comparison of several free GIS applications entitled "<a href="http://www.amerisurv.com/PDF/TheAmericanSurveyor_Zimmer-SomethingForNothing_Vol7No8.pdf">Something for Nothing</a>"</p>
<p>First of all, the title bugs me. The idea that the sole benefit of free software is simply cost savings is pretty naive. It disregards openness, community support …</p><p>RJ Zimmer at American Surveyor magazine did what he described as a comparison of several free GIS applications entitled "<a href="http://www.amerisurv.com/PDF/TheAmericanSurveyor_Zimmer-SomethingForNothing_Vol7No8.pdf">Something for Nothing</a>"</p>
<p>First of all, the title bugs me. The idea that the sole benefit of free software is simply cost savings is pretty naive. It disregards openness, community support, ability to transfer knowledge, freedom from restrictive licensing, etc. But I can live with the title.</p>
<p>I can also live with his decision to include only a single open-source GIS application alongside 3 closed-but-gratis applications. He doesn't claim that it's a comprehensive review despite the fact that the ecosystem of Free GIS is far more diverse.</p>
<p>But I can't accept his treatment of Quantum GIS:</p>
<blockquote>
<p>I did not fully test Quantum GIS. I did download and install it but the software was too complicated to use "right out of the box", and I did not have time to learn to use it.</p>
</blockquote>
<p>The feature comparison chart includes mainly "?" in the QGIS column. </p>
<p>OK we get it - your deadline hit before you could bother to learn one of the applications you were supposedly reviewing. One even wonders why he included QGIS in the review at all. This is nothing short of irresponsible reporting. When people post stuff like this, it really rubs me the wrong way - now a whole audience of users has an inaccurate view of QGIS and the entire free GIS ecosystem thanks to his slacker journalism.</p>kmltree2010-06-09T00:00:00-06:002010-06-09T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2010-06-09:/kmltree.html<p>When the <a href="http://marinemap.org">MarineMap</a> team started delving into the <a href="http://earth.google.com/plugin/">Google Earth plugin</a>, it was apparent that it supported the display and rendering of KML files <em>almost</em> as well as the Google Earth desktop application. The missing piece of functionality was the nice tree-style legend that is provided with the desktop app …</p><p>When the <a href="http://marinemap.org">MarineMap</a> team started delving into the <a href="http://earth.google.com/plugin/">Google Earth plugin</a>, it was apparent that it supported the display and rendering of KML files <em>almost</em> as well as the Google Earth desktop application. The missing piece of functionality was the nice tree-style legend that is provided with the desktop app. The plugin lets you add KML for display but gives you no HTML interface to work with it. For simple apps, you can just roll your own html/js form. But that quickly becomes unmanageable if you're adding KML dynamically and need to create a tree-style legend for any arbitrary KML document. </p>
<p>Enter <a href="http://code.google.com/p/kmltree/">kmltree</a>. </p>
<blockquote>
<p>kmltree is a javascript tree widget that can be used in conjunction with the Google Earth API. It replicates the functionality of the Google Earth desktop client, and is fast, extensible, and stable for use in advanced web applications. It's built utilizing the earth-api-utility-library and jQuery. </p>
</blockquote>
<p><a href="/assets/img/uploads/2010/06/screen-shot-2010-06-09-at-81707-am.png"><img alt="kmltree" src="/assets/img/uploads/2010/06/screen-shot-2010-06-09-at-81707-am.png"></a></p>
<p>Any arbitrary KML can be parsed and represented in a tree-style legend right in the web browser. <a href="http://kmltree.googlecode.com/hg/examples/refresh.html">Try it out</a>.</p>
<p>Kmltree is the brainchild of <a href="http://www.google.com/profiles/underbluewaters">Chad Burt</a> who developed it as part of the marinemap codebase but had the foresight to realize that this would be useful to a much wider audience and abstracted it into its own javascript library. If you're building a web mapping application with the Google Earth API, give it a shot!</p>MarineMap wins award for Environmental Conflict Resolution2010-05-27T00:00:00-06:002010-05-27T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2010-05-27:/marinemap-wins-award-for-environmental-conflict-resolution.html<p>For the last year or so, I've had the pleasure of working with the <a href="http://www.marinemap.org">MarineMap Consortium</a>. We just learned yesterday that the U.S. Institute for Environmental Conflict Resolution <a href="http://eon.businesswire.com/portal/site/eon/permalink/?ndmViewId=news_view&newsId=20100526007072&newsLang=en">awarded</a> MarineMap the “Innovation in Technology and Environmental Conflict Resolution” award.</p>
<iframe width="560" height="315" src="http://www.youtube.com/embed/GCUxpnUSiUg" frameborder="0"></iframe>
<p>I joined the team after the launch of the <a href="http://southcoast.marinemap.org/marinemap/">South …</a></p><p>For the last year or so, I've had the pleasure of working with the <a href="http://www.marinemap.org">MarineMap Consortium</a>. We just learned yesterday that the U.S. Institute for Environmental Conflict Resolution <a href="http://eon.businesswire.com/portal/site/eon/permalink/?ndmViewId=news_view&newsId=20100526007072&newsLang=en">awarded</a> MarineMap the “Innovation in Technology and Environmental Conflict Resolution” award.</p>
<iframe width="560" height="315" src="http://www.youtube.com/embed/GCUxpnUSiUg" frameborder="0"></iframe>
<p>I joined the team after the launch of the <a href="http://southcoast.marinemap.org/marinemap/">South Coast of California</a> site which was already widely recognized as a successful decision-support tool for marine spatial planning. We've since been working on version 2 of the MarineMap tool which is deployed currently for the <a href="http://northcoast.marinemap.org/marinemap">North Coast of California</a> in support of their Marine Life Protection Act (MLPA) process. </p>
<p>It's been a tremendous challenge to bring a <a href="http://code.google.com/p/marinemap/">new version of the software</a> to life and have it meet and exceed the standards set by its predecessor. It has also been tremendously rewarding and having our work recognized at this level is a great honor. It's nice to know that the tools we've developed have been so helpful and instrumental in the marine planning process along the coast of California. Looking forward, I see MarineMap growing beyond a tool for a specific purpose (supporting the MLPA Initiative) to a robust framework for developing web-based spatial planning tools for all sorts of environmental applications, both marine and terrestrial. And this award confirms that we are already heading in the right direction. Very exciting news!</p>Exploring Geometry2010-05-06T00:00:00-06:002010-05-06T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2010-05-06:/exploring-geometry.html<p>I don't know how I let this gem slip past my radar for so long. It was only via <a href="http://lin-ear-th-inking.blogspot.com/2010/05/random-points-in-polygon-in-jts.html">a post by Dr. JTS</a> himself (aka Martin Davis) that I saw a screenshot of JTS TestBuilder and decided to check it out. </p>
<p>I was actually just talking with someone about …</p><p>I don't know how I let this gem slip past my radar for so long. It was only via <a href="http://lin-ear-th-inking.blogspot.com/2010/05/random-points-in-polygon-in-jts.html">a post by Dr. JTS</a> himself (aka Martin Davis) that I saw a screenshot of JTS TestBuilder and decided to check it out. </p>
<p>I was actually just talking with someone about a tool that could provide simple visualization of WKT geometries; JTS Test Builder does that and much more. </p>
<p>You can input geometries (graphically or by well-known text) and compare two geometries based on spatial predicates:</p>
<p><a href="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81418-pm.png"><img alt="spatial predicates" src="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81418-pm.png"></a></p>
<p>Do overlay analyses with the two geometries. Note that you can see the result as WKT below.</p>
<p><a href="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81502-pm.png"><img alt="overlay" src="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81502-pm.png"></a></p>
<p>And there are a host of other spatial operations to generate geometries using buffers...
<a href="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81602-pm.png"><img alt="buffers" src="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81602-pm.png"></a></p>
<p>... convex hulls ...
<a href="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81716-pm.png"><img alt="convex hull" src="/assets/img/uploads/2010/05/screen-shot-2010-05-06-at-81716-pm.png"></a></p>
<p>This app provides a user-friendly way to quickly explore and test geometric operations. To try it out, <a href="http://sourceforge.net/projects/jts-topo-suite/">download JTS</a> and unzip the contents somewhere. If you're on Windows, the .bat file is provided. If you're running anything else, you have to cook up a shell script that will set up the environment and run JTS TestBuilder:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>JTS_HOME=/usr/share/java/jts-1.11
CP=$CLASSPATH
for i in $JTS_HOME/lib/*.jar; do CP=$i:$CP; done
java -Xmx256m -cp $CP com.vividsolutions.jtstest.testbuilder.JTSTestBuilder $*
</code></pre></div>
</blockquote>Distributed2010-03-31T00:00:00-06:002010-03-31T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2010-03-31:/distributed.html<p>I've been playing around with some distributed version control systems (DVCS) to replace svn. </p>
<p>First, the <em>why</em>: I'll leave the details up to Joel in his excellent <a href="http://hginit.com/">HgInit tutorial</a>. It's mercurial-specific, but the general concepts apply to any DVCS. The takeaway message for any project with > 1 developer is this …</p><p>I've been playing around with some distributed version control systems (DVCS) to replace svn. </p>
<p>First, the <em>why</em>: I'll leave the details up to Joel in his excellent <a href="http://hginit.com/">HgInit tutorial</a>. It's mercurial-specific, but the general concepts apply to any DVCS. The takeaway message for any project with > 1 developer is this:</p>
<blockquote>
<p>Mercurial [ed: DVCS] separates the act of committing new code from the act of inflicting it on everybody else.</p>
</blockquote>
<p>Next, the <em>implementation</em>: I'm using <strong>git</strong> to work on another project (<a href="http://goldencheetah.org/">Golden Cheetah</a>) and it's been a tough learning curve. Git is no doubt the most powerful DVCS out there. You can do magical things with it like combine commits and mess with history trees. And you can also screw things up pretty badly if you misinterpret the esoteric docs for some non-intuitive piece of the workflow. </p>
<p>I just tried <strong>mercurial</strong> this morning - hg seems to fit my mind well. There is less power but the workflow is very clear and intuitive. And there are docs written for people who don't want to do an in-depth study of their version control software. It stays out of the way. </p>
<p>Long story short, I'm going to use mercurial/hg for my new projects. Ah what the heck, my old/ongoing projects as well. My <a href="http://code.google.com/p/perrygeo/">googlecode repository</a> has been converted over to Mercurial. Svn will stick around but won't be updated.</p>Lazy raster processing with GDAL VRTs2010-02-18T00:00:00-07:002010-02-18T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2010-02-18:/lazy-raster-processing-with-gdal-vrts.html<p>No, not lazy as in REST :-) ... Lazy as in "<a href="http://en.wikipedia.org/wiki/Lazy_evaluation">Lazy evaluation</a>":</p>
<blockquote>
<p>In computer programming, lazy evaluation is the technique of delaying a computation until the result is required.</p>
</blockquote>
<p>Take an <strong>example raster processing workflow</strong> to go from a bunch of tiled, latlong, GeoTiff digital elevation models to a single shaded …</p><p>No, not lazy as in REST :-) ... Lazy as in "<a href="http://en.wikipedia.org/wiki/Lazy_evaluation">Lazy evaluation</a>":</p>
<blockquote>
<p>In computer programming, lazy evaluation is the technique of delaying a computation until the result is required.</p>
</blockquote>
<p>Take an <strong>example raster processing workflow</strong> to go from a bunch of tiled, latlong, GeoTiff digital elevation models to a single shaded relief GeoTiff in projected space:</p>
<ol>
<li>Merge the tiles together </li>
<li>Reproject the merged DEM (using bilinear or cubic interpolation) </li>
<li>Generate the hillshade from the merged DEM </li>
</ol>
<p>Simple enough to do with GDAL tools on the command line. Here's the typical, <strong>process-as-you-go</strong> implementation:</p>
<div class="highlight"><pre><span></span><code>gdal_merge.py -of GTiff -o srtm_merged.tif srtm_12_*.tif
gdalwarp -t_srs epsg:3310 -r bilinear -of GTiff srtm_merged.tif srtm_merged_3310.tif
gdaldem hillshade srtm_merged_3310.tif srtm_merged_3310_shade.tif -of GTiff
</code></pre></div>
<p>Alternatively, we can simulate <strong>lazy evaluation</strong> by using <a href="http://www.gdal.org/gdal_vrttut.html">GDAL Virtual Rasters</a> (VRT) to perform the intermediate steps, only outputting the GeoTiff as the final step. </p>
<div class="highlight"><pre><span></span><code>gdalbuildvrt srtm_merged.vrt srtm_12_0*.tif
gdalwarp -t_srs epsg:3310 -r bilinear -of VRT srtm_merged.vrt srtm_merged_3310.vrt
gdaldem hillshade srtm_merged_3310.vrt srtm_merged_3310_shade2.tif -of GTiff
</code></pre></div>
<p>So what's the advantage of doing it the VRT way? They both produce <em>exactly</em> the same output raster. Let's compare:</p>
<table class="table table-striped table-bordered table-condensed">
<thead>
<tr>
<th> </th>
<th>Process-As-You-Go</th>
<th>"Lazy" VRTs</th>
</tr>
</thead>
<tbody>
<tr>
<th>Merge (#1) time</th>
<td>3.1 sec</td>
<td>0.05 sec</td>
</tr>
<tr>
<th>Warp (#2) time </th>
<td>7.3 sec </td>
<td>0.10 sec </td>
</tr>
<tr>
<th>Hillshade (#3) time</th>
<td>10.5 sec </td>
<td>19.75 sec</td>
</tr>
<tr>
<th>Total processing time</th>
<td>20.9 sec</td>
<td>19.9 sec </td>
</tr>
<tr>
<th>Intermediate files</th>
<td>2 tifs</td>
<td>2 vrts</td>
</tr>
<tr>
<th>Intermediate file size</th>
<td>261 MB</td>
<td>0.005 MB</td>
</tr>
</tbody>
</table>
<p>The Lazy VRT method <strong>delays all the computationally-intensive processing until it is actually required</strong>. The intermediate files, instead of containing the raw raster output of the actual computation, are XML files which contain the <em>instructions</em> to get the desired output. This allows GDAL to do all the processing in one pass (the final step #3). The <em>total</em> processing time is not significantly different between the two methods, but in terms of the productivity of the GIS analyst, the VRT method is superior. Imagine working with datasets 1000x this size with many more steps: typing a command, waiting 2 hours, typing the next, and so on would be a waste of human time. With VRTs, you can assemble the instructions up front and kick off the final processing step as you leave the office for a long weekend.</p>
<p>Additionally, the VRT method produces only <strong>small intermediate xml files</strong> instead of a potentially huge data management nightmare of shuffling around GB (or TB) of intermediate outputs! Plus, those xml files serve as an excellent piece of metadata describing the exact processing steps, which you can refer to later or adapt to different datasets. </p>
<p>So next time you have a multi-step raster workflow, use GDAL VRTs to your full advantage - you'll save yourself time and disk space by being lazy. </p>Peaksware licensing revisited …2009-12-16T00:00:00-07:002009-12-16T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2009-12-16:/peaksware-licensing-revisted.html<p>I had previously <a href="http://www.perrygeo.net/wordpress/?p=138">bitched and moaned</a> about the licensing restrictions on the <a href="http://www.trainingpeaks.com/WKO">TrainingPeaks WKO+</a> software. Truth be told, the reason I was so put off by their crappy licensing scheme was that my cycling training relied so heavily on their software. It was not perfect but it was the best …</p><p>I had previously <a href="http://www.perrygeo.net/wordpress/?p=138">bitched and moaned</a> about the licensing restrictions on the <a href="http://www.trainingpeaks.com/WKO">TrainingPeaks WKO+</a> software. Truth be told, the reason I was so put off by their crappy licensing scheme was that my cycling training relied so heavily on their software. It was not perfect but it was the best tool available. I've since discovered <a href="http://goldencheetah.org/">Golden Cheetah</a>, which is a viable open-source alternative, but it still lags behind WKO+ in many critical features.</p>
<p>Now, fresh in time for the 2010 training season, Peaksware has released a new version 3.0 of WKO+ which, amongst many UI and functionality improvements, has made considerable progress on the licensing front.</p>
<blockquote>
<p>We know, our licensing has been a challenge to deal with for our customers in the past, but we’ve always tried to be as helpful as possible getting you back up and running after a hard drive crash or new computer. To remedy this, we’re pleased to announce an all new flexible licensing system. First, with every purchase we now allow you to install WKO+ 3.0 on up to two computers; second, we’ve built an online activation/deactivation system so you are free to move your active licenses from machine to machine. Are you leaving on a 2 week trip? Just de-activate your home computer, activate your laptop, and you’re on your way. When you get home, de-activate your laptop, re-activate your desktop and you’re all set.</p>
</blockquote>
<p>It ain't open source (there is still a place in this world for proprietary software if they can push the boundaries and innovate) but the sensitivity to the licensing issue just may have restored my faith in their company. </p>Nice examples of ESRIs geoprocessing python module (9.3)2009-08-10T00:00:00-06:002009-08-10T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2009-08-10:/nice-examples-of-esris-geoprocessing-python-module-93.html<p>Just thought I'd point out a great presentation about the "new" 9.3 geoprocessing (gp) python module from ESRI. </p>
<p>Ghislain Prince and Elizabeth Flanary do a great job of introduction by examples. The latest gp module is much more pythonic and these examples show how to leverage that to its …</p><p>Just thought I'd point out a great presentation about the "new" 9.3 geoprocessing (gp) python module from ESRI. </p>
<p>Ghislain Prince and Elizabeth Flanary do a great job of introducing it by example. The latest gp module is much more pythonic and these examples show how to leverage that to its full advantage. If you try to do this with older gp versions, the code would make most pythonistas cringe. This latest version returns objects and lists, uses real booleans, and uses true objects instead of funky string parameters. Basic OO stuff for most python libraries but a big improvement for gp. </p>
<p>Here's the <a href="http://arcscripts.esri.com/details.asp?dbid=16509">powerpoint presentation</a>. Thanks to Jamey Rosen for the tip!</p>Peaksware licensing hell2009-06-23T00:00:00-06:002009-06-23T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2009-06-23:/peaksware-licensing-hell.html<p>I've been using Peaksware's WKO+, a cycling and running training tool to manage data from heart rate monitors, GPS units, power meters, etc. Its a powerful tool with a clunky UI but I've gotten used to it. </p>
<p>You pay $100 for a "personal" license. Not a big deal to me …</p><p>I've been using Peaksware's WKO+, a cycling and running training tool to manage data from heart rate monitors, GPS units, power meters, etc. It's a powerful tool with a clunky UI but I've gotten used to it. </p>
<p>You pay $100 for a "personal" license. Not a big deal to me since they basically have a monopoly on this software niche. I first installed it on my work computer to test the data from my daily bike commute. Cool it works. Then I went to install it at home since that's where I'll be using it. Works ok. I proceed to gather all my fitness data into their proprietary binary format. </p>
<p>Fast forward a few months. I'm reformatting the hard drive on the laptop and want to move all my data and software to my desktop. But installing WKO+ is giving me a headache ("Error: Too many installations"). The registration process takes a hardware fingerprint and you must activate it via the web to get a registration code. However, hidden within their EULA is a term which <strong>disallows the transfer of the license</strong> to any computer other than the one on which it was originally installed. The second installation is just an allowance they make for "hard drive crashes" and such.</p>
<p>Since neither of those machines would be available to me, certainly there would be a way to transfer it? After several progressively more desperate communications with Matt Allen at peaksware support, he informed me that there was no way they would transfer the license (the non-transfer clause IS in the EULA after all). <strong>I would need to purchase another license simply because I switched computers</strong>!</p>
<p>Here is my response:</p>
<blockquote>
<p>Basically what you are telling me is that I can no longer use WKO+
without paying again. I get to use the software for a few months and
you revoke my right to use it because I buy a new computer! I am a
paying customer, trying to be totally legit here, willing to support
your business in exchange for a license to use your software and you
insist on screwing me over. Brilliant.</p>
</blockquote>
<p>This is one of the most unprofessional and idiotic stances I have ever
seen from a software company. Your intention appears to be to screw
over your paying customers and milk as much cash from them as possible
- you might want to rethink that business model unless you want to
lose customers! I will never endorse, recommend or purchase another
product or service from peaksware nor will any of my family, friends,
teammates or readers once the word gets out about your disrespectful
policies.</p>
<p>There are numerous typical situations where a new copy of the software
would need to be installed including:</p>
<ul>
<li>Hard drive failure</li>
<li>Operating system upgrades</li>
<li>New computer purchases</li>
<li>Extended traveling and touring (installing onto a laptop or netbook)</li>
</ul>
<p>Now I fully understand why your policy is one license per computer. It
makes perfect sense. I have seen plenty of other software with a
similar licensing model. But they also allow you to uninstall the software
and re-register it on another computer due to these circumstances.
There is simply no technological reason why you could not implement a
licensing structure that allowed the user more freedom to transfer
licenses while still preventing piracy. As it stands, your licensing
model treats paying customers like criminals if they happen to run
across any one of the above situations.</p>
<p>So, to sum it up - your foolish license policy has lost you one
customer and many future ones.</p>
<p>Good riddance.</p>
<p>So if you want to support a company that treats its paying customers like criminals because they get a new computer, go right ahead and support Peaksware. But anyone who expects to use software that they pay for even if they happen to buy a new computer should steer clear.</p>
<p>The real kicker is that all that work is locked away in their proprietary file format simply because of their draconian licensing. This is the real take home lesson to all software users (not just fitness geeks): <strong>If you lock your data away in a proprietary format and are beholden to a single company in order to access it, they can and will screw you. Always insist on open data formats, even if using proprietary software</strong>. Oh and always read the EULA carefully before clicking OK!</p>Reading XFS partition from Windows2009-06-21T00:00:00-06:002009-06-21T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2009-06-21:/reading-xfs-partition-from-windows.html<p>When I was setting up my linux system a few years ago, I did some research into filesystems and determined that the <a href="http://en.wikipedia.org/wiki/XFS">XFS file system</a>, being particularly proficient in dealing with large files, would be ideal for my home directory. And it was. But the one factor I didn't consider …</p><p>When I was setting up my linux system a few years ago, I did some research into filesystems and determined that the <a href="http://en.wikipedia.org/wiki/XFS">XFS file system</a>, being particularly proficient in dealing with large files, would be ideal for my home directory. And it was. But the one factor I didn't consider was portability. Turns out that there is basically no support for XFS in windows. </p>
<p>So how do you access your files from Windows if they are on an XFS partition? I had just shy of 1 TB of data to transfer so using my other linux box and transferring across the network would have taken forever. The solution I came up with is a bit convoluted but it has some real advantages:</p>
<ol>
<li>Install Sun's VirtualBox.</li>
<li>Download an iso for your favorite linux distribution (mine being Ubuntu 9.04).</li>
<li>Create a virtual machine from the linux iso.</li>
<li>Install the VBoxGuestAdditions in the linux virtual machine.</li>
<li>Create a shared folder on the windows host and register it with the virtual machine. This will allow you to transfer files from the guest (linux) to the host (windows). You may have to manually mount the drive in the linux guest:
<div class="highlight"><pre><span></span><code>mount -t vboxsf share_name /mnt/share_name
</code></pre></div>
</li>
<li>Using the windows host cmd line, create a vmdk from the physical drive that your XFS partition resides on. In this case, PhysicalDrive1 corresponds to the second SATA connector. This will allow your guest OS to talk directly with the drive:
<div class="highlight"><pre><span></span><code><span class="n">cd</span><span class="w"> </span><span class="nl">C:</span><span class="n">\Program</span><span class="w"> </span><span class="n">Files\Sun\xVM</span><span class="w"> </span><span class="n">VirtualBox</span><span class="w"></span>
<span class="n">VBoxManage</span><span class="p">.</span><span class="n">exe</span><span class="w"> </span><span class="n">internalcommands</span><span class="w"> </span><span class="n">createrawvmdk</span><span class="w"> </span>
<span class="w">    </span><span class="o">-</span><span class="n">filename</span><span class="w"> </span><span class="s">"C:\Documents and Settings\perry\.VirtualBox\HardDisks\Physical1.vmdk"</span><span class="w"> </span>
<span class="w">    </span><span class="o">-</span><span class="n">rawdisk</span><span class="w"> </span><span class="n">\\.\PhysicalDrive1</span><span class="w"> </span><span class="o">-</span><span class="n">register</span><span class="w"></span>
</code></pre></div>
Once completed, you should see:
<div class="highlight"><pre><span></span><code>RAW host disk access VMDK file
C:\Documents and Settings\perry\.VirtualBox\HardDisks\Physical1.vmdk created successfully.
</code></pre></div>
</li>
<li>Add the physical drive to your list of hard drives in the linux guest options. Restart the linux guest virtual machine and your XFS partition should already be mounted. Now you can begin transferring files between your XFS partition and the shared folder on the windows host.</li>
</ol>
<p>Whew. Lots of hassle for a simple file transfer, right! But the side benefit is that now you have a fully functional linux virtual machine with a shared folder set up to the windows host. Very useful - even when you must run windows, it helps to have a linux VM standing by!</p>IronPython (2.6) and ArcGIS - ready for prime time!!2009-06-16T00:00:00-06:002009-06-16T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2009-06-16:/ironpython-26-and-arcgis-ready-for-prime-time.html<p>Not sure why this didn't occur to me <em>before</em> I wrote <a href="http://www.perrygeo.net/wordpress/?p=135">that last post</a> but I tried the "pythonic" version of the code under the <a href="http://ironpython.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=25126">IronPython 2.6 Beta 1</a> release and it works!</p>
<div class="highlight"><pre><span></span><code>lyr = Carto.LayerFileClass()
lyr.Open('C:\\test.lyr')
print lyr.Filename
</code></pre></div>
<p>Works perfectly now. So IronPython …</p><p>Not sure why this didn't occur to me <em>before</em> I wrote <a href="http://www.perrygeo.net/wordpress/?p=135">that last post</a> but I tried the "pythonic" version of the code under the <a href="http://ironpython.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=25126">IronPython 2.6 Beta 1</a> release and it works!</p>
<div class="highlight"><pre><span></span><code>lyr = Carto.LayerFileClass()
lyr.Open('C:\\test.lyr')
print lyr.Filename
</code></pre></div>
<p>Works perfectly now. So IronPython <strong>2.6</strong> promises to be a viable option for extending ArcGIS. My enthusiasm has been renewed.</p>IronPython and ArcGIS - not quite ready for prime time2009-06-16T00:00:00-06:002009-06-16T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2009-06-16:/ironpython-and-arcgis-not-quite-ready-for-prime-time.html<p>Occasionally I find myself in the C#/.NET world in order to write code using ESRI ArcObjects. Today I was toying with the idea of automating the creation of ESRI Layer files (a file which defines the cartographic styling of a dataset). Of course they are in an undocumented binary …</p><p>Occasionally I find myself in the C#/.NET world in order to write code using ESRI ArcObjects. Today I was toying with the idea of automating the creation of ESRI Layer files (a file which defines the cartographic styling of a dataset). Of course they are in an undocumented binary file format, <a href="http://blog.cleverelephant.ca/2009/04/esri-formats-back-to-future.html">inaccessible to anything but ESRI software</a>. So I pop open Visual Studio .... </p>
<p>I feel a nagging unease every time I type a set of curly braces. And VB just makes me insane. I prefer, of course, to use python. Luckily there is <a href="http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython">IronPython</a> which runs on .NET - which means I could theoretically use it to interact with ArcGIS. </p>
<p>I only found a <a href="http://moreati.org.uk/blog/2009/01/27/from-esriarcgis-import-geodatabase/">single working example</a> of using ArcObjects through IronPython. But it looked promising enough to close Visual Studio and give it a go. </p>
<p>The first nagging problem is an IronPython-specific one. Relatively minor annoyance but you have to add the reference to a .NET assembly (library) before you can load it. </p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">clr</span>
<span class="n">clr</span><span class="o">.</span><span class="n">AddReference</span><span class="p">(</span><span class="s1">'ESRI.ArcGIS.System'</span><span class="p">)</span>
<span class="n">clr</span><span class="o">.</span><span class="n">AddReference</span><span class="p">(</span><span class="s1">'ESRI.ArcGIS.Carto'</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">ESRI.ArcGIS</span> <span class="kn">import</span> <span class="n">esriSystem</span>
<span class="kn">from</span> <span class="nn">ESRI.ArcGIS</span> <span class="kn">import</span> <span class="n">Carto</span>
</code></pre></div>
<p>Now there is the issue of grabbing an ESRI license. A little verbose IMO but it could easily be encapsulated in a helper function to clean things up. </p>
<div class="highlight"><pre><span></span><code><span class="nv">aoc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">esriSystem</span>.<span class="nv">AoInitializeClass</span><span class="ss">()</span><span class="w"></span>
<span class="nv">res</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">esriSystem</span>.<span class="nv">IAoInitialize</span>.<span class="nv">IsProductCodeAvailable</span><span class="ss">(</span><span class="nv">aoc</span>,<span class="w"> </span>
<span class="w"> </span><span class="nv">esriSystem</span>.<span class="nv">esriLicenseProductCode</span>.<span class="nv">esriLicenseProductCodeArcView</span><span class="ss">)</span><span class="w"></span>
<span class="k">if</span><span class="w"> </span><span class="nv">res</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nv">esriSystem</span>.<span class="nv">esriLicenseStatus</span>.<span class="nv">esriLicenseAvailable</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">esriSystem</span>.<span class="nv">IAoInitialize</span>.<span class="nv">Initialize</span><span class="ss">(</span><span class="nv">aoc</span>,<span class="w"> </span>
<span class="w"> </span><span class="nv">esriSystem</span>.<span class="nv">esriLicenseProductCode</span>.<span class="nv">esriLicenseProductCodeArcView</span><span class="ss">)</span><span class="w"></span>
</code></pre></div>
<p>Now that we've satisfied the demands of our proprietary license overlords, we can proceed with the real work ... in this case I just want to open an existing Layer file and see if the resulting object knows its own file path. Really simple, right?</p>
<div class="highlight"><pre><span></span><code>lyr = Carto.LayerFileClass()
if "Open" in dir(lyr): print "The Layer object has an Open method but...."
lyr.Open('C:\\test.lyr')
print lyr.Filename
The Layer object has an Open method but....
Traceback (most recent call last):
File "&lt;stdin&gt;", line 1, in &lt;module&gt;
AttributeError: 'GenericComObject' object has no attribute 'Open'
</code></pre></div>
<p>Hrm. Looks like we've run across <a href="http://www.codeplex.com/IronPython/WorkItem/View.aspx?WorkItemId=1506">bug 1506</a> which doesn't allow access to the properties and methods of a given instance - instead you have to work through the functions provided by the implementation. Grr...</p>
<div class="highlight"><pre><span></span><code>Carto.ILayerFile.Open(lyr, 'C:\\test.lyr')
print Carto.ILayerFile.Filename.GetValue(lyr)
</code></pre></div>
<p>That is unwieldy, ugly and <a href="http://shalabh.infogami.com/Be_Pythonic2">unpythonic</a>. What's the point of object oriented programming if you can't access the methods and properties of an object directly? Since all ArcObjects applications are based on extending COM interfaces, this would be a major pain in any non-trivial application. Basically, until these .NET-accessible COM objects can be treated in a pythonic way, I don't see any compelling reason to pursue IronPython and ArcGIS integration. Looks like its back to C# for the moment ... (/me take a deep sigh and opens Visual Studio) ... unless of course anyone has some brilliant solution to share!!</p>The GPS told me to do it2009-06-12T00:00:00-06:002009-06-12T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2009-06-12:/the-gps-told-me-to-do-it.html<p>Another disastrous consequence of inaccurate spatial information... Not only can you accidentally <a href="http://www.perrygeo.net/wordpress/?p=75">tag your neighbor as a criminal</a>, now it appears that sloppy spatial data has lead to <a href="http://www.wsbtv.com/news/19715994/detail.html">the wrong house getting demolished</a>. </p>
<p>I've asked it before but its worth repeating ... with all the recent advances in spatial data publishing …</p><p>Another disastrous consequence of inaccurate spatial information... Not only can you accidentally <a href="http://www.perrygeo.net/wordpress/?p=75">tag your neighbor as a criminal</a>, now it appears that sloppy spatial data has lead to <a href="http://www.wsbtv.com/news/19715994/detail.html">the wrong house getting demolished</a>. </p>
<p>I've asked it before but it's worth repeating ... with all the recent advances in spatial data publishing, where are the advances in metadata and data quality assurance? How do you know where the data comes from, what's been done to it and by whom? What is the intended use of the data? For the vast majority of the data being shoved out onto the web, these bits of metadata are sorely lacking.</p>
<p>Of course this case is more a matter of one person's sheer stupidity; I'm not sure any caveats in the metadata would have stopped the wrecking ball!</p>The magic bullet2009-03-25T00:00:00-06:002009-03-25T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2009-03-25:/the-magic-bullet.html<p>Dealing with corrupted shapefiles can be a painful experience: programs crash for seemingly no reason, attribute tables get screwy, features get lost, queries results don't look right and ArcGIS processing tools fail with mysterious error codes:</p>
<p><img alt="Dissolve error" src="/assets/img/uploads/2009/03/dissolve_error.jpg"></p>
<p>Never fear, OGR is here. The magic bullet for fixing corrupted shapefiles is, 90 …</p><p>Dealing with corrupted shapefiles can be a painful experience: programs crash for seemingly no reason, attribute tables get screwy, features get lost, query results don't look right and ArcGIS processing tools fail with mysterious error codes:</p>
<p><img alt="Dissolve error" src="/assets/img/uploads/2009/03/dissolve_error.jpg"></p>
<p>Never fear, OGR is here. The magic bullet for fixing corrupted shapefiles is, 90% of the time, simply using ogr2ogr to convert the shapefile to another shapefile. </p>
<div class="highlight"><pre><span></span><code>ogr2ogr -f "ESRI Shapefile" shiny_new_clean_dataset.shp corrupted_dataset.shp corrupted_dataset
</code></pre></div>
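<p>To apply the same fix across a whole directory of shapefiles, you can script it. The sketch below (a hypothetical helper, not part of OGR itself) just builds the ogr2ogr argument list for each file and runs it; it assumes ogr2ogr is on your PATH and writes the cleaned copies into a <code>clean/</code> subdirectory:</p>

```python
import glob
import os
import subprocess

def clean_command(shp_path):
    """Build the ogr2ogr argv that rewrites a shapefile into clean/."""
    out_shp = os.path.join("clean", os.path.basename(shp_path))
    return ["ogr2ogr", "-f", "ESRI Shapefile", out_shp, shp_path]

# Run the conversion for every shapefile in the current directory.
for shp in glob.glob("*.shp"):
    subprocess.run(clean_command(shp), check=True)
```

Since ogr2ogr rewrites each shapefile through OGR's internal data model, every output comes back clean without touching the originals.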
<p>OGR's internal data model cleans it up and the output is a fresh shiny new shapefile that works without hassle. </p>TV cycling coverage is dead2009-02-19T00:00:00-07:002009-02-19T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2009-02-19:/tv-cycling-coverage-is-dead.html<p>Real-time spatial application developers take note...</p>
<p>I've been following the Tour of California this week (looking forward to the Solvang Time Trial this Friday) and have been disappointed with the TV coverage on Versus. Its not that the coverage is bad, its just that long-distance endurance sports don't lend themselves …</p><p>Real-time spatial application developers take note...</p>
<p>I've been following the Tour of California this week (looking forward to the Solvang Time Trial this Friday) and have been disappointed with the TV coverage on Versus. It's not that the coverage is bad, it's just that long-distance endurance sports don't lend themselves to the traditional 2 announcers and 1 camera format. There are multiple groups of riders and so much spatial information to keep track of if one really wants to understand the dynamics of a cycling event.</p>
<p>Maybe I've just been spoiled by the <a href="http://tracker.amgentourofcalifornia.com/">Amgen Tour Tracker</a>. It is a crowning example of a spatially-aware real-time web application.</p>
<p><a href="/assets/img/tour_tracker.png"><img alt="" src="/assets/img/tour_tracker_thumb.jpg"></a></p>
<p>It provides two cameras of live coverage, live commentary with interviews, chat, summary updates, gps tracking of riders shown on both an elevation profile and a yahoo-based aerial map, "gps+" location prediction, race standings, time checks, etc. Far more information than any TV coverage without resorting to information overload. </p>Stimulus watch2009-02-12T00:00:00-07:002009-02-12T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2009-02-12:/stimulus-watch.html<p>Last time I posted on this blog, Hillary and Obama were still battling it out for the Democratic nomination. Now Barack Obama is our president with an uphill battle to save the economy. So yeah, it's been a while. I haven't been doing too much innovative Geo-related stuff lately, hence …</p><p>Last time I posted on this blog, Hillary and Obama were still battling it out for the Democratic nomination. Now Barack Obama is our president with an uphill battle to save the economy. So yeah, it's been a while. I haven't been doing too much innovative Geo-related stuff lately, hence the lack of blog posts. I'll try to pick up the pace a bit, even if I have to resort to fluff pieces like this one...</p>
<p>Well, it looks like the economic stimulus bill is going to pass. The bill doesn’t actually specify the projects that will be funded; the money will be allocated to cities and some federal grant agencies. The mayors have already proposed thousands of “shovel-ready” projects that might get a green light depending on how much funding the city gets.</p>
<p>There’s a great site, <a href="http://www.stimuluswatch.org">stimuluswatch.org</a>, that allows the public to review these proposals. Good to know where our tax dollars are headed!</p>
<p>There are several <a href="http://www.stimuluswatch.org/project/search/GIS">GIS proposals</a> ranging from projects with specific, well-defined (and measurable) objectives to the nebulous "Give us $500,000 to upgrade our cities' GIS program". It will be interesting to see which ones pan out, which ones produce results and which ones are just a pure waste of taxpayer dollars. </p>
<p>P.S. If you'd like to see where most of my time and energy is going these days, it's training for the US National Cup mountain bike race series. My <a href="http://viedevelo.wordpress.com/">cycling exploits are available for all</a> who are inclined to read them.</p>R is for Radiohead2008-07-15T00:00:00-06:002008-07-15T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-07-15:/r-is-for-radiohead.html<p>Radiohead realeased their video for <a href="http://code.google.com/creative/radiohead/">House of Cards</a> yesterday. Besides being a big radiohead fan, I was also loving the <a href="http://www.velodyne.com/lidar/">LIDAR </a><a href="http://www.geometricinformatics.com/">technology </a>behind the video. </p>
<p>If you want to check it out yourself, there are code samples on the site as well as access to the raw data. The csv …</p><p>Radiohead released their video for <a href="http://code.google.com/creative/radiohead/">House of Cards</a> yesterday. Besides being a big radiohead fan, I was also loving the <a href="http://www.velodyne.com/lidar/">LIDAR </a><a href="http://www.geometricinformatics.com/">technology </a>behind the video. </p>
<p>If you want to check it out yourself, there are code samples on the site as well as access to the raw data. The csv files have four columns (x, y, z, and intensity). For me the quickest way to visualize the data was through R and its OpenGL interface called rgl (which is a wonderful high-level 3D data visualization environment). </p>
<p>Assuming you have R installed, rgl is a simple add-on through the CRAN repositories:</p>
<div class="highlight"><pre><span></span><code>install.packages("rgl")
</code></pre></div>
<p>Then you need to load the library, read the csv, and scale the intensity values from 0 to 1. Then it's a simple rgl.points command to get an interactive 3D rendering:</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">rgl</span><span class="p">)</span>
<span class="n">d</span> <span class="o"><-</span> <span class="nf">read.csv</span><span class="p">(</span><span class="s">"C:/temp/radiohead/22.csv"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span>
<span class="c1"># scale intensity values from 0 to 1</span>
<span class="n">d</span><span class="o">$</span><span class="n">int</span> <span class="o"><-</span> <span class="n">d</span><span class="p">[,</span><span class="m">4</span><span class="p">]</span> <span class="o">/</span> <span class="m">255</span>
<span class="c1"># rgl.points(x,y,z,size=__,color=__)</span>
<span class="c1"># note y value is inverted</span>
<span class="c1"># color is a grayscale rgb based on intensity</span>
<span class="nf">rgl.points</span><span class="p">(</span><span class="n">d</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="n">d</span><span class="p">[,</span><span class="m">2</span><span class="p">]</span><span class="o">*</span><span class="m">-1</span><span class="p">,</span><span class="n">d</span><span class="p">[,</span><span class="m">3</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="m">3</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="nf">rgb</span><span class="p">(</span><span class="n">d</span><span class="o">$</span><span class="n">int</span><span class="p">,</span><span class="n">d</span><span class="o">$</span><span class="n">int</span><span class="p">,</span><span class="n">d</span><span class="o">$</span><span class="n">int</span><span class="p">))</span>
</code></pre></div>
<p>That's all it takes to render Thom Yorke in all his 3D digital glory:</p>
<p><img alt="" src="/assets/img/radiohead2.jpg"></p>Geospatial Reddit - 2 weeks later2008-06-12T00:00:00-06:002008-06-12T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-06-12:/geospatial-reddit-2-weeks-later.html<p>So, despite frustrations with getting submitted URLs to appear, <a href="http://www.reddit.com/r/geospatial/">Geospatial Reddit</a> is still puttering along. Not exactly a vibrant community <em>yet</em> but there are currently 133 subscribers. If you're subscribed, take a minute to submit your favorite URLs. If you haven't subscribed, <a href="http://www.reddit.com/r/geospatial/">check it out</a>.</p>
<p>I thought 133 subscribers was …</p><p>So, despite frustrations with getting submitted URLs to appear, <a href="http://www.reddit.com/r/geospatial/">Geospatial Reddit</a> is still puttering along. Not exactly a vibrant community <em>yet</em> but there are currently 133 subscribers. If you're subscribed, take a minute to submit your favorite URLs. If you haven't subscribed, <a href="http://www.reddit.com/r/geospatial/">check it out</a>.</p>
<p>I thought 133 subscribers was a decent number until I found that the <a href="http://www.reddit.com/r/bacon/">Bacon subreddit</a> has over 500. Apparently the world would rather discuss their greasy breakfast food than maps. </p>Jabref - Open Source Alternative to EndNote2008-06-08T00:00:00-06:002008-06-08T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-06-08:/jabref-open-source-alternative-to-endnote.html<p>For those of you that use EndNote to keep track of your bibliographies/references, there is an alternative: <a href="http://jabref.sourceforge.net">JabRef</a>. I find the <a href="http://jabref.sourceforge.net/images/Jabref-ScreenShot-MainWindow.png">UI</a> to be very intuitive and it has a range of customizable import/export formats. JabRef uses the <a href="http://en.wikipedia.org/wiki/BibTeX">BibTex</a> format as its native file format so, of course …</p><p>For those of you that use EndNote to keep track of your bibliographies/references, there is an alternative: <a href="http://jabref.sourceforge.net">JabRef</a>. I find the <a href="http://jabref.sourceforge.net/images/Jabref-ScreenShot-MainWindow.png">UI</a> to be very intuitive and it has a range of customizable import/export formats. JabRef uses the <a href="http://en.wikipedia.org/wiki/BibTeX">BibTex</a> format as its native file format so, of course, it integrates very well with <a href="http://en.wikipedia.org/wiki/LaTeX">LaTeX</a>.</p>
<p>One of the neat features is the ability to create custom bibliographies in HTML, complete with javascript-based search capabilities. Here's <a href="http://perrygeo.net/references.html">my reference list</a> which I'll be slowly adding to as I convert all my old text-based and EndNote reference lists over. </p>Geospatial Reddit - A democratic solution to geo blog overload?2008-05-28T00:00:00-06:002008-05-28T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-05-28:/geospatial-reddit-a-democratic-solution-to-geo-blog-overload.html<p>All the great GIS news/blog aggregators out there (planetgs, slashgeo, etc) are moderator driven - a few people act as the gatekeepers and inevitably <a href="http://www.spatiallyadjusted.com/2008/05/23/planet-geospatial-reboot-coming/">have to decide what information is useful</a>. This is <a href="http://zcologia.com/news/762/planet-geospatial/">not the ideal way</a> to do things. </p>
<p>There's a more democratic and distributed way to spread the …</p><p>All the great GIS news/blog aggregators out there (planetgs, slashgeo, etc) are moderator driven - a few people act as the gatekeepers and inevitably <a href="http://www.spatiallyadjusted.com/2008/05/23/planet-geospatial-reboot-coming/">have to decide what information is useful</a>. This is <a href="http://zcologia.com/news/762/planet-geospatial/">not the ideal way</a> to do things. </p>
<p>There's a more democratic and distributed way to spread the role - it's called <em>reddit</em>. <img alt="" src="http://reallystatic.reddit.com/static/create-a-reddit.png"> More specifically, <a href="http://reddit.com/r/geospatial">Geospatial Reddit</a>. For those unfamiliar with reddit (or similar sites like digg), the idea is simple: users submit stories and users vote on stories. The most popular ones rise to the top and, theoretically, the best articles magically appear on the front page. Much like democracy itself, there are flaws in the theory but it's the best thing we've got.</p>
<p>Geospatial Reddit is public so sign up, submit your favorite stories and vote. Let's see if we can make this work.</p>Posting to Geospatial Reddit2008-05-28T00:00:00-06:002008-05-28T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-05-28:/posting-to-geospatial-reddit.html<p>Some folks have had trouble submitting links so I figured I should post a bit more detail on that. To get articles to show up on the <em>geospatial</em> reddit (not the main reddit), go to <a href="http://reddit.com/r/geospatial/submit">http://reddit.com/r/geospatial/submit</a> or click the "Submit a Link" button on the …</p><p>Some folks have had trouble submitting links so I figured I should post a bit more detail on that. To get articles to show up on the <em>geospatial</em> reddit (not the main reddit), go to <a href="http://reddit.com/r/geospatial/submit">http://reddit.com/r/geospatial/submit</a> or click the "Submit a Link" button on the right - from the geospatial page. When you're submitting the URL, you should see "submit to geospatial" as the page header. </p>
<p>I know at least 2 of us have been successful at posting. If this doesn't work for you, please let me know and I'll try and figure it out. </p>So you want to learn to learn about kriging …2008-05-25T00:00:00-06:002008-05-25T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-05-25:/so-you-want-to-learn-to-learn-about-kriging.html<p>Guides like <a href="http://spatial-analyst.net/">Tomislav Hengl's</a> <a href="http://eusoils.jrc.it/ESDB_Archive/eusoils_docs/other/EUR22904en.pdf">Practical Guide to Geostatistical Mapping of Environmental Variables</a> and Rossiter's <a href="http://www.itc.nl/~rossiter/teach/stats/ssi_short.pdf">Introduction to applied geostatistics</a> do an excellent job of providing a grounded, relatively easy to understand, introduction to geostatistical prediction and kriging.</p>
<p>But if you're an experiential learner (like me), you don't absorb the mathematics fully …</p><p>Guides like <a href="http://spatial-analyst.net/">Tomislav Hengl's</a> <a href="http://eusoils.jrc.it/ESDB_Archive/eusoils_docs/other/EUR22904en.pdf">Practical Guide to Geostatistical Mapping of Environmental Variables</a> and Rossiter's <a href="http://www.itc.nl/~rossiter/teach/stats/ssi_short.pdf">Introduction to applied geostatistics</a> do an excellent job of providing a grounded, relatively easy to understand, introduction to geostatistical prediction and kriging.</p>
<p>But if you're an experiential learner (like me), you don't absorb the mathematics fully without <em>doing</em> something with the knowledge; seeing it in action brings the concepts to life. Unfortunately most geostats/kriging software is either too complex for exploratory learning (not enough immediate feedback) or too simplistic (making too many assumptions, disallowing access to the nitty-gritty details). Either way, you're bound to produce output with fundamental flaws because you're not aware of the finer details of variogram modelling. I speak from experience!</p>
<p>Luckily Dennis J. J. Walvoort of the Wageningen University & Research Center saw the same problem and created a nifty learning tool to explore variogram models and spatial predictions using ordinary kriging - <a href="http://www.ai-geostats.org/index.php?id=114">EZ-Kriging</a>. No degree in math or statistical theory required. Just drag the points around, play with the parameters, alter the underlying data as a table, and see the results immediately.</p>
<p><a href="/assets/img/ezkriging.jpg"><img alt="" src="/assets/img/ezkriging_thumb.jpg"></a></p>
<p>It's nothing more than a simulation, so don't expect to load your own datasets or produce any meaningful output with it. But it truly excels as a learning tool for understanding the core concepts behind kriging and is a great complement to Hengl and Rossiter's work. With that knowledge you can do the real deal in Surfer, R, ILWIS or your geostats software of choice.</p>
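<p>If you want to peek under the hood of that first modelling step yourself, the empirical semivariogram - half the squared difference between sample values, averaged within bins of separation distance - takes only a few lines of code. Here's a rough, self-contained Python sketch (my own illustration, not part of EZ-Kriging or the guides above; the function name and binning scheme are made up for the example):</p>

```python
# Rough sketch: empirical semivariogram by brute-force pair comparison.
# Bins half the squared value differences by lag (separation) distance.
import math

def semivariogram(points, values, lag_width, n_lags):
    """points: list of (x, y) tuples; values: measurements at those points."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            h = math.hypot(dx, dy)   # separation distance between the pair
            k = int(h // lag_width)  # which lag bin this pair falls in
            if k < n_lags:
                sums[k] += 0.5 * (values[i] - values[j]) ** 2
                counts[k] += 1
    # average semivariance per bin (None for empty bins)
    return [s / c if c else None for s, c in zip(sums, counts)]

# three samples on a line: nearby pairs differ less than distant ones
print(semivariogram([(0, 0), (1, 0), (2, 0)], [0.0, 1.0, 2.0], 1.5, 2))
# -> [0.5, 2.0]
```

<p>Fitting a model curve (spherical, exponential, and so on) to those binned values is exactly the step EZ-Kriging lets you play with interactively.</p>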
<p>EDIT: One complaint I have about EZ-Kriging: it doesn't show the observed sample variogram cloud overlaid on the variogram model. Oh well, still a nice tool.</p>
<p>EDIT2: It's a Windows .exe but it runs smoothly under Wine in Linux.</p>Ubuntu as a GIS workstation (updated for Hardy Heron)2008-05-14T00:00:00-06:002008-05-14T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-05-14:/ubuntu-as-a-gis-workstation-updated-for-hardy-heron.html<p>As a follow-up to my previous post on <a href="http://www.perrygeo.net/wordpress/?p=10">turning Ubuntu Gutsy into a GIS workstation</a>, here are the revised instructions for Ubuntu 8.04 (The Hardy Heron). </p>
<p>Note that there are a few additional apps and changes in here:</p>
<ul>
<li>
<p>Postgis</p>
</li>
<li>
<p>Mapnik</p>
</li>
<li>
<p>New version of QGIS installed via repository</p>
</li>
<li>
<p>OpenStreetMap tools …</p></li></ul><p>As a follow-up to my previous post on <a href="http://www.perrygeo.net/wordpress/?p=10">turning Ubuntu Gutsy into a GIS workstation</a>, here are the revised instructions for Ubuntu 8.04 (The Hardy Heron). </p>
<p>Note that there are a few additional apps and changes in here:</p>
<ul>
<li>
<p>Postgis</p>
</li>
<li>
<p>Mapnik</p>
</li>
<li>
<p>New version of QGIS installed via repository</p>
</li>
<li>
<p>OpenStreetMap tools (JOSM and osm2pgsql)</p>
</li>
<li>
<p>Geotiff utilities</p>
</li>
<li>
<p>Some nice python spatial libs (shapely, owslib, geopy and pyproj) </p>
</li>
</ul>
<p>Run the following as root on your new Hardy installation, answer a few configuration questions and you'll be ready to go.</p>
<div class="highlight"><pre><span></span><code><span class="n">echo</span><span class="w"> </span><span class="s1">'deb http://ppa.launchpad.net/qgis/ubuntu hardy main'</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="o">/</span><span class="n">etc</span><span class="o">/</span><span class="n">apt</span><span class="o">/</span><span class="n">sources</span><span class="o">.</span><span class="n">list</span><span class="w"></span>
<span class="n">apt</span><span class="o">-</span><span class="n">get</span><span class="w"> </span><span class="n">update</span><span class="w"></span>
<span class="n">apt</span><span class="o">-</span><span class="n">get</span><span class="w"> </span><span class="o">-</span><span class="n">y</span><span class="w"> </span><span class="o">--</span><span class="n">force</span><span class="o">-</span><span class="n">yes</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">grass</span><span class="w"> </span><span class="n">mapserver</span><span class="o">-</span><span class="n">bin</span><span class="w"> </span>\<span class="w"></span>
<span class="n">gdal</span><span class="o">-</span><span class="n">bin</span><span class="w"> </span><span class="n">cgi</span><span class="o">-</span><span class="n">mapserver</span><span class="w"> </span><span class="n">python</span><span class="o">-</span><span class="n">qt4</span><span class="w"> </span><span class="n">python</span><span class="o">-</span><span class="n">sip4</span><span class="w"> </span><span class="n">python</span><span class="o">-</span><span class="n">gdal</span><span class="w"> </span>\<span class="w"></span>
<span class="n">python</span><span class="o">-</span><span class="n">mapscript</span><span class="w"> </span><span class="n">gmt</span><span class="w"> </span><span class="n">gmt</span><span class="o">-</span><span class="n">coastline</span><span class="o">-</span><span class="n">data</span><span class="w"> </span><span class="n">r</span><span class="o">-</span><span class="n">recommended</span><span class="w"> </span><span class="n">gpsbabel</span><span class="w"> </span>\<span class="w"></span>
<span class="n">shapelib</span><span class="w"> </span><span class="n">qgis</span><span class="w"> </span><span class="n">qgis</span><span class="o">-</span><span class="n">plugin</span><span class="o">-</span><span class="n">grass</span><span class="w"> </span><span class="n">python</span><span class="o">-</span><span class="n">setuptools</span><span class="w"> </span>\<span class="w"></span>
<span class="n">python</span><span class="o">-</span><span class="n">mapnik</span><span class="w"> </span><span class="n">mapnik</span><span class="o">-</span><span class="n">plugins</span><span class="w"> </span><span class="n">mapnik</span><span class="o">-</span><span class="n">utils</span><span class="w"> </span><span class="n">osm2pgsql</span><span class="w"> </span><span class="n">josm</span><span class="w"> </span><span class="n">postgresql</span><span class="o">-</span><span class="mf">8.3</span><span class="o">-</span><span class="n">postgis</span><span class="w"> </span>\<span class="w"></span>
<span class="n">python</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span><span class="n">build</span><span class="o">-</span><span class="n">essential</span><span class="w"> </span><span class="n">libgdal</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span><span class="n">geotiff</span><span class="o">-</span><span class="n">bin</span><span class="w"> </span><span class="n">sun</span><span class="o">-</span><span class="n">java6</span><span class="o">-</span><span class="n">jre</span><span class="w"></span>
<span class="n">easy_install</span><span class="w"> </span><span class="n">shapely</span><span class="w"> </span><span class="n">geopy</span><span class="w"> </span><span class="n">owslib</span><span class="w"> </span><span class="n">pyproj</span><span class="w"></span>
</code></pre></div>
<p>EDIT: If you're looking for more up-to-date packages for GEOS, GDAL, etc., try adding <code>deb http://les-ejk.cz/ubuntu/ hardy multiverse</code> to your /etc/apt/sources.list </p>'Hike of Doom #2- OGC KML'2008-04-21T00:00:00-06:002008-04-21T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-04-21:/hike-of-doom-2-ogc-kml.html<p>In commemoration of the <a href="http://www.opengeospatial.org/pressroom/pressreleases/857">OGC approval of KML</a> as an open standard to share geographic content over the web, I'd like to share our recent <a href="/assets/img/hikeofdoom2/hikeofdoom_20080413.kmz">"Hike of Doom #2"</a> (KML provided by Mark Dotson).</p>
<hr>
<p>The first weekend to hit 90 degrees, my friends and I travel inland to dive and …</p><p>In commemoration of the <a href="http://www.opengeospatial.org/pressroom/pressreleases/857">OGC approval of KML</a> as an open standard to share geographic content over the web, I'd like to share our recent <a href="/assets/img/hikeofdoom2/hikeofdoom_20080413.kmz">"Hike of Doom #2"</a> (kml provided by Mark Dotson).</p>
<hr>
<p>The first weekend to hit 90 degrees, my friends and I travel inland to dive and swim in the Santa Ynez river. It is billed as a "30 minute" hike to our favorite watering hole. It becomes much more than that. </p>
<p>Of course the road leading up to the trailhead is closed due to construction, so we have 3 miles of hiking on pavement just to get to the former trailhead - the Red Rocks parking lot.
<img alt="" src="/assets/img/hikeofdoom2/IMGP5144.jpg"> </p>
<p>Then the fun begins. A decent rainy season and some dam releases make for high flows, and we've got a half-dozen major river crossings to contend with. The recent fires added a good deal of organic matter to the river and the algae has bloomed accordingly. It is a wet, hot, rocky and slimy hike. </p>
<p><img alt="" src="/assets/img/hikeofdoom2/IMGP5178.jpg"></p>
<p>We make it to the swimming hole and enjoy the day. We dive, laugh, have a few beers.</p>
<p><img alt="" src="/assets/img/hikeofdoom2/IMGP5199.jpg">
<img alt="" src="/assets/img/hikeofdoom2/IMGP5206.jpg"> </p>
<p>The sun sets and the fun <em>really</em> begins. </p>
<p>Klaus, the Bavarian cyclist whom we'd met at the swimming hole, caught up with us just after my girlfriend, Joselyne, sprained her ankle on a rock. Her ankle hadn't started to swell yet, but I could tell, drawing on my past basketball injuries, that she was not putting weight on it any time soon. We fashioned crutches from some driftwood. We also met some turkey hunters (dressed in camouflage more effective than most military uniforms) who helped us out by providing some ankle wrap.
<img alt="" src="/assets/img/hikeofdoom2/IMGP5254.jpg"> </p>
<p>David and Andy began the trek back to the car to get help. The rest of us could either go back via the river bed, a rocky and treacherous endeavor given the setting sun, or head up to the main road. We decided on the main road and Shaun took off to alert the others to our plans. The main fire road was a trek in the <em>opposite</em> direction - longer, with more elevation changes, but smooth enough for a bike or truck and more accessible to vehicles. </p>
<p>I carried Jos, over my shoulder fireman-style and/or piggy-back, over the river crossings.
<img alt="" src="/assets/img/hikeofdoom2/IMGP5256.jpg">
On the flats, Mark and I pushed Jos on Klaus' bike.
<img alt="" src="/assets/img/hikeofdoom2/IMGP5261.jpg"></p>
<p>We pushed on up the trail until we reached the main road. Klaus, after drinking the last of our beer, biked up to the dam keeper's residence at Gibraltar Dam while Christina, Sarah, Mark, Jos and I continued up the trail. A half-hour later, Klaus and the dam keeper arrived in a pickup and drove the rest of us back to the Red Rock "parking lot".
<img alt="" src="/assets/img/hikeofdoom2/IMGP5268.jpg"> </p>
<p>But the construction and rebar on the causeway meant there was no way to cross with a normal vehicle so we went by foot. Jos got back on Klaus' bike and we pushed.
<img alt="" src="/assets/img/hikeofdoom2/IMGP5274.jpg"></p>
<p>Luckily the slight downhill grade allowed her to glide back for a good portion, graciously sparing Mark and me from permanent back injury.</p>
<p>Meanwhile the away team had gotten some semblance of cellular reception and attempted to call the authorities. The goal was to get a ranger truck to drive out to get us or at least unlock the gate to meet us halfway at the Red Rock parking lot. The authorities' response was fantastic, if a bit overzealous. By the time we had gotten within a 1/4 mile of our car, we spotted helicopters. Then a firetruck. Then an ambulance. Joselyne was coasting by on Klaus' bike and they didn't even stop for her on the first pass! Apparently expecting to rescue a mangled body from the wilderness, the EMTs were somewhat disappointed at the less challenging situation they faced - a girl, coasting down the road on a bike with a sprained ankle.
<img alt="" src="/assets/img/hikeofdoom2/IMGP5279.jpg"></p>
<p>We were back in the car, on the road before dark and got home in time for pizza.</p>
<p>So what did we learn from this? Well as a Boy Scout, I am ashamed to say I wasn't prepared. A well prepped emergency kit would have helped a lot. At least we had an LED headlamp. Some rope would have gone a long way towards making a stretcher. An instant-ice-pack, ankle wrap and some ibuprofen would have been handy. We were wet and the mercury was falling quickly; some emergency shelter and clothing would have assuaged my concerns about the nighttime chill.</p>
<p>But this was offset by the generosity of the many people we met for the first time - the hunters who lent us their medical supplies, the dam keeper who got up from his Sunday dinner to make sure we got back safely, the EMTs who put tremendous resources into organizing a military-scale search party, and Klaus, who so generously stuck with us and shared his bike, his wisdom and his company. Without their help and our group of friends, the story might have a less happy ending. </p>
<p>Never underestimate the power of human kindness, generosity and cooperation! And never believe me when I say it's a short hike.</p>A quick Cython introduction2008-04-19T00:00:00-06:002008-04-19T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-04-19:/a-quick-cython-introduction.html<p>I love python for its beautiful code and practicality. But it's not going to win a <a href="http://shootout.alioth.debian.org/debian/benchmark.php?test=all&lang=python&lang2=gcc">pure speed race</a> with most languages. Most people think of speed and ease-of-use as polar opposites - probably because they remember the pain of writing C. <a href="http://www.cython.org/">Cython</a> tries to eliminate that duality and lets you …</p><p>I love python for its beautiful code and practicality. But it's not going to win a <a href="http://shootout.alioth.debian.org/debian/benchmark.php?test=all&lang=python&lang2=gcc">pure speed race</a> with most languages. Most people think of speed and ease-of-use as polar opposites - probably because they remember the pain of writing C. <a href="http://www.cython.org/">Cython</a> tries to eliminate that duality and lets you have python syntax with C data types and functions - the best of both worlds. Keeping in mind that I'm by no means an expert at this, here are my notes based on my first real experiment with Cython:</p>
<p>EDIT: Based on some feedback I've received there seems to be some confusion - Cython is for generating <em>C extensions to Python</em> not standalone programs. The whole point is to speed up an existing python app one function at a time. No rewriting the whole application in C or Lisp. No <a href="http://www.dalkescientific.com/writings/NBN/c_extensions.html">writing C extensions by hand</a>. Just an easy way to get C speed and C data types into your slow python functions. </p>
<hr>
<p>So let's say we want to make this function faster. It is the <a href="http://mathworld.wolfram.com/GreatCircle.html">"great circle" calculation</a>, a quick spherical trig problem to calculate distance along the earth's surface between two points:</p>
<p><em>p1.py</em></p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">math</span>
<span class="k">def</span> <span class="nf">great_circle</span><span class="p">(</span><span class="n">lon1</span><span class="p">,</span><span class="n">lat1</span><span class="p">,</span><span class="n">lon2</span><span class="p">,</span><span class="n">lat2</span><span class="p">):</span>
<span class="n">radius</span> <span class="o">=</span> <span class="mi">3956</span> <span class="c1">#miles</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">pi</span><span class="o">/</span><span class="mf">180.0</span>
<span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="mf">90.0</span><span class="o">-</span><span class="n">lat1</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="mf">90.0</span><span class="o">-</span><span class="n">lat2</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">theta</span> <span class="o">=</span> <span class="p">(</span><span class="n">lon2</span><span class="o">-</span><span class="n">lon1</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">acos</span><span class="p">((</span><span class="n">math</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">b</span><span class="p">))</span> <span class="o">+</span>
<span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">radius</span><span class="o">*</span><span class="n">c</span>
</code></pre></div>
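<p>Before timing it, a quick sanity check on the math doesn't hurt: 90 degrees of longitude along the equator should come out to a quarter of the circumference (3956 × π/2, roughly 6214 miles), and antipodal points to half of it. A minimal check (the function is just the p1.py code repeated so the snippet is self-contained):</p>

```python
# check_p1.py - sanity checks for great_circle (same code as p1.py above)
import math

def great_circle(lon1, lat1, lon2, lat2):
    radius = 3956  # miles
    x = math.pi / 180.0
    a = (90.0 - lat1) * x
    b = (90.0 - lat2) * x
    theta = (lon2 - lon1) * x
    c = math.acos((math.cos(a) * math.cos(b)) +
                  (math.sin(a) * math.sin(b) * math.cos(theta)))
    return radius * c

# a quarter of the equator: 3956 * pi / 2 miles
assert abs(great_circle(0.0, 0.0, 90.0, 0.0) - 3956 * math.pi / 2) < 1e-3
# antipodal points: half the circumference
assert abs(great_circle(0.0, 0.0, 180.0, 0.0) - 3956 * math.pi) < 1e-3
```

<p>(One caveat worth knowing: for two identical points, floating-point rounding can push the acos argument just above 1.0 and raise a ValueError, so production code usually clamps that value to [-1, 1].)</p>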
<p>Let's try it out and <a href="http://www.diveintopython.net/performance_tuning/timeit.html">time it</a> over half a million function calls:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">timeit</span>
<span class="n">lon1</span><span class="p">,</span> <span class="n">lat1</span><span class="p">,</span> <span class="n">lon2</span><span class="p">,</span> <span class="n">lat2</span> <span class="o">=</span> <span class="o">-</span><span class="mf">72.345</span><span class="p">,</span> <span class="mf">34.323</span><span class="p">,</span> <span class="o">-</span><span class="mf">61.823</span><span class="p">,</span> <span class="mf">54.826</span>
<span class="n">num</span> <span class="o">=</span> <span class="mi">500000</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">Timer</span><span class="p">(</span><span class="s2">"p1.great_circle(</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">)"</span> <span class="o">%</span> <span class="p">(</span><span class="n">lon1</span><span class="p">,</span><span class="n">lat1</span><span class="p">,</span><span class="n">lon2</span><span class="p">,</span><span class="n">lat2</span><span class="p">),</span>
<span class="s2">"import p1"</span><span class="p">)</span>
<span class="nb">print</span> <span class="s2">"Pure python function"</span><span class="p">,</span> <span class="n">t</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="n">num</span><span class="p">),</span> <span class="s2">"sec"</span>
</code></pre></div>
<p>About <strong>2.2 seconds</strong>. Too slow! </p>
<p>Let's try a quick rewrite in Cython and see if that makes a difference:
<em>c1.pyx</em></p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">math</span>
<span class="k">def</span> <span class="nf">great_circle</span><span class="p">(</span><span class="nb">float</span> <span class="n">lon1</span><span class="p">,</span><span class="nb">float</span> <span class="n">lat1</span><span class="p">,</span><span class="nb">float</span> <span class="n">lon2</span><span class="p">,</span><span class="nb">float</span> <span class="n">lat2</span><span class="p">):</span>
<span class="n">cdef</span> <span class="nb">float</span> <span class="n">radius</span> <span class="o">=</span> <span class="mf">3956.0</span>
<span class="n">cdef</span> <span class="nb">float</span> <span class="n">pi</span> <span class="o">=</span> <span class="mf">3.14159265</span>
<span class="n">cdef</span> <span class="nb">float</span> <span class="n">x</span> <span class="o">=</span> <span class="n">pi</span><span class="o">/</span><span class="mf">180.0</span>
<span class="n">cdef</span> <span class="nb">float</span> <span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">theta</span><span class="p">,</span><span class="n">c</span>
<span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="mf">90.0</span><span class="o">-</span><span class="n">lat1</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="mf">90.0</span><span class="o">-</span><span class="n">lat2</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">theta</span> <span class="o">=</span> <span class="p">(</span><span class="n">lon2</span><span class="o">-</span><span class="n">lon1</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">acos</span><span class="p">((</span><span class="n">math</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">b</span><span class="p">))</span> <span class="o">+</span> <span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">radius</span><span class="o">*</span><span class="n">c</span>
</code></pre></div>
<p>Notice that we still <em>import math</em> - Cython lets you mix and match python and C data types to some extent. The conversion is handled automatically, though not without cost. In this example all we've done is define a <em>python</em> function, declare its input parameters to be floats, and declare a static C float data type for all the variables. It still uses the python math module to do the calcs. </p>
<p>Now we need to convert this to C code and compile the python extension. The best way to do this is through a <a href="http://ldots.org/pyrex-guide/2-compiling.html#distutils">setup.py distutils script</a>. But we'll do it the <a href="http://ldots.org/pyrex-guide/2-compiling.html#gcc">manual way</a> for now to see what's happening:</p>
<div class="highlight"><pre><span></span><code>#<span class="w"> </span><span class="nv">this</span><span class="w"> </span><span class="nv">will</span><span class="w"> </span><span class="nv">create</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="nv">c1</span>.<span class="nv">c</span><span class="w"> </span><span class="nv">file</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">the</span><span class="w"> </span><span class="nv">C</span><span class="w"> </span><span class="nv">source</span><span class="w"> </span><span class="nv">code</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">build</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="nv">python</span><span class="w"> </span><span class="nv">extension</span><span class="w"></span>
<span class="nv">cython</span><span class="w"> </span><span class="nv">c1</span>.<span class="nv">pyx</span><span class="w"></span>
#<span class="w"> </span><span class="nv">Compile</span><span class="w"> </span><span class="nv">the</span><span class="w"> </span><span class="nv">object</span><span class="w"> </span><span class="nv">file</span><span class="w"> </span>
<span class="nv">gcc</span><span class="w"> </span><span class="o">-</span><span class="nv">c</span><span class="w"> </span><span class="o">-</span><span class="nv">fPIC</span><span class="w"> </span><span class="o">-</span><span class="nv">I</span><span class="o">/</span><span class="nv">usr</span><span class="o">/</span><span class="k">include</span><span class="o">/</span><span class="nv">python2</span>.<span class="mi">5</span><span class="o">/</span><span class="w"> </span><span class="nv">c1</span>.<span class="nv">c</span><span class="w"></span>
#<span class="w"> </span><span class="nv">Link</span><span class="w"> </span><span class="nv">it</span><span class="w"> </span><span class="nv">into</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="nv">shared</span><span class="w"> </span><span class="nv">library</span><span class="w"></span>
<span class="nv">gcc</span><span class="w"> </span><span class="o">-</span><span class="nv">shared</span><span class="w"> </span><span class="nv">c1</span>.<span class="nv">o</span><span class="w"> </span><span class="o">-</span><span class="nv">o</span><span class="w"> </span><span class="nv">c1</span>.<span class="nv">so</span><span class="w"></span>
</code></pre></div>
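<p>The setup.py route mentioned above boils down to a very small build script. Here is a sketch using Cython's distutils integration (the file and module names match this example, but it assumes Cython is installed and is untested here):</p>

```python
# setup.py - build the c1 extension (a sketch; assumes Cython is installed)
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    name="c1",
    cmdclass={"build_ext": build_ext},
    ext_modules=[Extension("c1", ["c1.pyx"])],
)
```

<p>Running <code>python setup.py build_ext --inplace</code> should leave an importable c1.so next to your source, replacing the two gcc invocations above.</p>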
<p>Now you should have a c1.so (or .dll) file which can be imported in python. Let's give it a run:</p>
<div class="highlight"><pre><span></span><code> <span class="n">t</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">Timer</span><span class="p">(</span><span class="s2">"c1.great_circle(</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">)"</span> <span class="o">%</span> <span class="p">(</span><span class="n">lon1</span><span class="p">,</span><span class="n">lat1</span><span class="p">,</span><span class="n">lon2</span><span class="p">,</span><span class="n">lat2</span><span class="p">),</span>
<span class="s2">"import c1"</span><span class="p">)</span>
<span class="nb">print</span> <span class="s2">"Cython function (still using python math)"</span><span class="p">,</span> <span class="n">t</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="n">num</span><span class="p">),</span> <span class="s2">"sec"</span>
</code></pre></div>
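<p>For completeness, here is a self-contained version of that timing harness. Since the compiled c1 module may not be present, a pure-python stand-in with the same formula is used; the test coordinates come from this post, while the call count here is an assumption kept small:</p>

```python
import math
import timeit

# Pure-python stand-in for the compiled c1.great_circle
# (same formula as in the post; radius in miles)
def great_circle(lon1, lat1, lon2, lat2):
    radius = 3956.0
    x = math.pi / 180.0
    a = (90.0 - lat1) * x
    b = (90.0 - lat2) * x
    theta = (lon2 - lon1) * x
    c = math.acos(math.cos(a) * math.cos(b)
                  + math.sin(a) * math.sin(b) * math.cos(theta))
    return radius * c

lon1, lat1, lon2, lat2 = -72.345, 34.323, -61.823, 54.826
num = 100000  # the post presumably loops more; kept small here

# Timer also accepts a callable, which avoids the setup-string import dance
elapsed = timeit.Timer(lambda: great_circle(lon1, lat1, lon2, lat2)).timeit(num)
print("Pure python function", elapsed, "sec")
```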
<p>About <strong>1.8 seconds</strong>. Not the kind of speedup we were hoping for, but it's a start. The bottleneck must be the use of the python math module. Let's use the C standard library trig functions instead:</p>
<p><em>c2.pyx</em></p>
<div class="highlight"><pre><span></span><code><span class="nv">cdef</span><span class="w"> </span><span class="nv">extern</span><span class="w"> </span><span class="nv">from</span><span class="w"> </span><span class="s2">"math.h"</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">theta</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">sinf</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">theta</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">acosf</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">theta</span><span class="ss">)</span><span class="w"></span>
<span class="nv">def</span><span class="w"> </span><span class="nv">great_circle</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">lon1</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lat1</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lon2</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lat2</span><span class="ss">)</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">radius</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">3956</span>.<span class="mi">0</span><span class="w"> </span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">pi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">3</span>.<span class="mi">14159265</span><span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">pi</span><span class="o">/</span><span class="mi">180</span>.<span class="mi">0</span><span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">a</span>,<span class="nv">b</span>,<span class="nv">theta</span>,<span class="nv">c</span><span class="w"></span>
<span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">(</span><span class="mi">90</span>.<span class="mi">0</span><span class="o">-</span><span class="nv">lat1</span><span class="ss">)</span><span class="o">*</span><span class="ss">(</span><span class="nv">x</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">(</span><span class="mi">90</span>.<span class="mi">0</span><span class="o">-</span><span class="nv">lat2</span><span class="ss">)</span><span class="o">*</span><span class="ss">(</span><span class="nv">x</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">(</span><span class="nv">lon2</span><span class="o">-</span><span class="nv">lon1</span><span class="ss">)</span><span class="o">*</span><span class="ss">(</span><span class="nv">x</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">acosf</span><span class="ss">((</span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">a</span><span class="ss">)</span><span class="o">*</span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">b</span><span class="ss">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="ss">(</span><span class="nv">sinf</span><span class="ss">(</span><span class="nv">a</span><span class="ss">)</span><span class="o">*</span><span class="nv">sinf</span><span class="ss">(</span><span class="nv">b</span><span class="ss">)</span><span class="o">*</span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">theta</span><span class="ss">)))</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nv">radius</span><span class="o">*</span><span class="nv">c</span><span class="w"></span>
</code></pre></div>
<p>Instead of importing the math module, we use <em>cdef extern</em>, which takes the C function declarations from the specified include header (in this case math.h from the C standard library). We've replaced the calls to the expensive python functions and are ready to build the new shared library and re-test:</p>
<div class="highlight"><pre><span></span><code> <span class="n">t</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">Timer</span><span class="p">(</span><span class="s2">"c2.great_circle(</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">,</span><span class="si">%f</span><span class="s2">)"</span> <span class="o">%</span> <span class="p">(</span><span class="n">lon1</span><span class="p">,</span><span class="n">lat1</span><span class="p">,</span><span class="n">lon2</span><span class="p">,</span><span class="n">lat2</span><span class="p">),</span>
<span class="s2">"import c2"</span><span class="p">)</span>
<span class="nb">print</span> <span class="s2">"Cython function (using trig function from math.h)"</span><span class="p">,</span> <span class="n">t</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="n">num</span><span class="p">),</span> <span class="s2">"sec"</span>
</code></pre></div>
<p>Now that's a bit more like it: <strong>0.4 seconds</strong> - a 5x speed increase over the pure python function. What else can we do? Well, c2.great_circle() is still a python function, which means that calling it incurs the overhead of the python API, constructing the argument tuple, etc. If we could write it as a pure C function, we might be able to speed things up further. </p>
<p><em>c3.pyx</em></p>
<div class="highlight"><pre><span></span><code><span class="nv">cdef</span><span class="w"> </span><span class="nv">extern</span><span class="w"> </span><span class="nv">from</span><span class="w"> </span><span class="s2">"math.h"</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">theta</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">sinf</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">theta</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">acosf</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">theta</span><span class="ss">)</span><span class="w"></span>
<span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">_great_circle</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">lon1</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lat1</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lon2</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lat2</span><span class="ss">)</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">radius</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">3956</span>.<span class="mi">0</span><span class="w"> </span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">pi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">3</span>.<span class="mi">14159265</span><span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">pi</span><span class="o">/</span><span class="mi">180</span>.<span class="mi">0</span><span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">a</span>,<span class="nv">b</span>,<span class="nv">theta</span>,<span class="nv">c</span><span class="w"></span>
<span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">(</span><span class="mi">90</span>.<span class="mi">0</span><span class="o">-</span><span class="nv">lat1</span><span class="ss">)</span><span class="o">*</span><span class="ss">(</span><span class="nv">x</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">(</span><span class="mi">90</span>.<span class="mi">0</span><span class="o">-</span><span class="nv">lat2</span><span class="ss">)</span><span class="o">*</span><span class="ss">(</span><span class="nv">x</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">(</span><span class="nv">lon2</span><span class="o">-</span><span class="nv">lon1</span><span class="ss">)</span><span class="o">*</span><span class="ss">(</span><span class="nv">x</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="nv">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">acosf</span><span class="ss">((</span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">a</span><span class="ss">)</span><span class="o">*</span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">b</span><span class="ss">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="ss">(</span><span class="nv">sinf</span><span class="ss">(</span><span class="nv">a</span><span class="ss">)</span><span class="o">*</span><span class="nv">sinf</span><span class="ss">(</span><span class="nv">b</span><span class="ss">)</span><span class="o">*</span><span class="nv">cosf</span><span class="ss">(</span><span class="nv">theta</span><span class="ss">)))</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nv">radius</span><span class="o">*</span><span class="nv">c</span><span class="w"></span>
<span class="nv">def</span><span class="w"> </span><span class="nv">great_circle</span><span class="ss">(</span><span class="nv">float</span><span class="w"> </span><span class="nv">lon1</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lat1</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lon2</span>,<span class="nv">float</span><span class="w"> </span><span class="nv">lat2</span>,<span class="nv">int</span><span class="w"> </span><span class="nv">num</span><span class="ss">)</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">int</span><span class="w"> </span><span class="nv">i</span><span class="w"></span>
<span class="w"> </span><span class="nv">cdef</span><span class="w"> </span><span class="nv">float</span><span class="w"> </span><span class="nv">x</span><span class="w"></span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">i</span><span class="w"> </span><span class="nv">from</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="nv">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nv">num</span>:<span class="w"></span>
<span class="w"> </span><span class="nv">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">_great_circle</span><span class="ss">(</span><span class="nv">lon1</span>,<span class="nv">lat1</span>,<span class="nv">lon2</span>,<span class="nv">lat2</span><span class="ss">)</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nv">x</span><span class="w"></span>
</code></pre></div>
<p>Notice that we still have a python function wrapper (<em>def</em>) which takes an extra argument, num. The looping is done inside this function with <code>for i from 0 <= i < num:</code> instead of the more pythonic but slower <code>for i in range(num):</code>. The actual work is done in a C function (<em>cdef</em>) which returns a C float. This runs in <strong>0.2 seconds</strong> - a 10x speed boost over the original python function. </p>
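<p>The wrapper-overhead point can be felt from pure python alone. This toy benchmark (not from the original post) times the same arithmetic done through a python function call on every iteration versus inline in one loop:</p>

```python
import timeit

def mult(a, b):
    return a * b

def loop_calling(num):
    # one python-level function call per iteration
    x = 0.0
    for _ in range(num):
        x = mult(2.0, 3.0)
    return x

def loop_inline(num):
    # same arithmetic with no per-iteration call
    x = 0.0
    for _ in range(num):
        x = 2.0 * 3.0
    return x

n = 200000
t_call = timeit.timeit(lambda: loop_calling(n), number=1)
t_inline = timeit.timeit(lambda: loop_inline(n), number=1)
print("with calls: %.4f sec, inline: %.4f sec" % (t_call, t_inline))
```

<p>The gap between the two timings is roughly the price of the python call machinery - the overhead that moving the work into a <em>cdef</em> function sidesteps.</p>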
<p>Just to confirm that we're doing things optimally, lets write a little app in pure C and time it:</p>
<div class="highlight"><pre><span></span><code><span class="cp">#include</span><span class="w"> </span><span class="cpf"><math .h></span><span class="cp"></span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf"><stdio .h></span><span class="cp"></span>
<span class="cp">#define NUM 500000</span>
<span class="kt">float</span><span class="w"> </span><span class="nf">great_circle</span><span class="p">(</span><span class="kt">float</span><span class="w"> </span><span class="n">lon1</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">lat1</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">lon2</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">lat2</span><span class="p">){</span><span class="w"></span>
<span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">radius</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">3956.0</span><span class="p">;</span><span class="w"></span>
<span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">pi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">3.14159265</span><span class="p">;</span><span class="w"></span>
<span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pi</span><span class="o">/</span><span class="mf">180.0</span><span class="p">;</span><span class="w"></span>
<span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">theta</span><span class="p">,</span><span class="n">c</span><span class="p">;</span><span class="w"></span>
<span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="mf">90.0</span><span class="o">-</span><span class="n">lat1</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="mf">90.0</span><span class="o">-</span><span class="n">lat2</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">lon2</span><span class="o">-</span><span class="n">lon1</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">acos</span><span class="p">((</span><span class="n">cos</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">*</span><span class="n">cos</span><span class="p">(</span><span class="n">b</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">sin</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">*</span><span class="n">sin</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="o">*</span><span class="n">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">)));</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">radius</span><span class="o">*</span><span class="n">c</span><span class="p">;</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
<span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"></span>
<span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">x</span><span class="p">;</span><span class="w"></span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">NUM</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span>
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">great_circle</span><span class="p">(</span><span class="mf">-72.345</span><span class="p">,</span><span class="w"> </span><span class="mf">34.323</span><span class="p">,</span><span class="w"> </span><span class="mf">-61.823</span><span class="p">,</span><span class="w"> </span><span class="mf">54.826</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">"%f"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">);</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Now compile it with <code>gcc -lm -o ctest ctest.c</code> and test it with <code>time ./ctest</code>... about <strong>0.2 seconds as well</strong>. This gives me confidence that my Cython extension is at least as efficient as my C code (which probably isn't saying much as my C skills are weak).</p>
<hr>
<p>How much Cython helps depends on how much looping, number-crunching, and python function calling is slowing you down. In some cases people have reported 100 to 1000x speed boosts; for other tasks it might not be so helpful. Before going crazy rewriting your python code in Cython, keep this in mind:</p>
<blockquote>
<p>"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth</p>
</blockquote>
<p>In other words, write your program in python first and see if it works alright. Most of the time it will... sometimes it will bog down. Use a <a href="http://docs.python.org/lib/module-hotshot.html">profiler</a> to find the slow functions, re-implement them in Cython, and you should see a quick return on investment.</p>
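<p>The hotshot profiler linked above has long since been removed from the standard library; cProfile is its replacement. A minimal sketch of the find-the-slow-function workflow (the great_circle body is copied from this post):</p>

```python
import cProfile
import io
import math
import pstats

def great_circle(lon1, lat1, lon2, lat2):
    # same formula as the post's pure-python version
    x = math.pi / 180.0
    a = (90.0 - lat1) * x
    b = (90.0 - lat2) * x
    theta = (lon2 - lon1) * x
    return 3956.0 * math.acos(math.cos(a) * math.cos(b)
                              + math.sin(a) * math.sin(b) * math.cos(theta))

def main():
    for _ in range(10000):
        great_circle(-72.345, 34.323, -61.823, 54.826)

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # hot functions rise to the top of the listing
```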
<p>Links:
<a href="http://trac.gispython.org/projects/PCL/wiki/WorldMill">WorldMill</a> - a python module by Sean Gillies which uses Cython to provide a fast, clean python interface to the libgdal library for handling vector geospatial data.</p>
<p><a href="http://www.sagemath.org:9001/WritingFastPyrexCode">Writing Fast Pyrex code</a> (Pyrex is the predecessor of Cython, with similar goals and syntax)</p>
<h2>Spatial data in SQLite (2008-04-15, Matthew T. Perry)</h2>
<p>Slashgeo pointed me to a very interesting set of projects - <a href="http://www.gaia-gis.it/spatialite/">SpatiaLite and VirtualShape</a>. They provide a spatial data engine for the <a href="http://www.sqlite.org/index.html">sqlite</a> database. Think of it as the PostGIS of SQLite. It looks like this extends sqlite's spatial capabilities far beyond the <a href="http://www.gdal.org/ogr/drv_sqlite.html">sqlite OGR driver</a>.</p>
<p>SpatiaLite provides many of the basic OGC Simple Features functions - transforming geometries between projections, spatial operations on bounding boxes, and some basic functions to dissect, analyze and export geometries. </p>
<p>VirtualShape provides the really neat ability to access a shapefile through the SpatiaLite/SQLite interface without having to import a copy - it reads directly off the shapefile, exposing it and its attributes as a "virtual table". I can think of a million uses for this. For example, let's say you have a shapefile of US counties with the number of voters in the 2004 election as an attribute in the dbf, and you want to find the total voter count in each state:</p>
<div class="highlight"><pre><span></span><code><span class="o">$</span><span class="w"> </span><span class="n">ls</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="w"> </span><span class="n">counties</span><span class="o">.*</span><span class="w"></span>
<span class="n">counties</span><span class="o">.</span><span class="n">dbf</span><span class="w"></span>
<span class="n">counties</span><span class="o">.</span><span class="n">prj</span><span class="w"></span>
<span class="n">counties</span><span class="o">.</span><span class="n">shp</span><span class="w"></span>
<span class="n">counties</span><span class="o">.</span><span class="n">shx</span><span class="w"></span>
<span class="o">$</span><span class="w"> </span><span class="n">sqlite3</span><span class="w"> </span><span class="n">test</span><span class="o">.</span><span class="n">db</span><span class="w"></span>
<span class="n">sqlite</span><span class="o">></span><span class="w"> </span><span class="o">.</span><span class="n">load</span><span class="w"> </span><span class="s1">'SpatiaLite.so'</span><span class="w"></span>
<span class="n">sqlite</span><span class="o">></span><span class="w"> </span><span class="o">.</span><span class="n">load</span><span class="w"> </span><span class="s1">'VirtualShape.so'</span><span class="w"></span>
<span class="n">sqlite</span><span class="o">></span><span class="w"> </span><span class="n">CREATE</span><span class="w"> </span><span class="n">virtual</span><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="n">virtual_counties</span><span class="w"> </span><span class="n">using</span><span class="w"> </span><span class="n">VirtualShape</span><span class="p">(</span><span class="n">counties</span><span class="p">);</span><span class="w"></span>
<span class="n">sqlite</span><span class="o">></span><span class="w"> </span><span class="n">select</span><span class="w"> </span><span class="n">sum</span><span class="p">(</span><span class="n">voters</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">total_voters</span><span class="p">,</span><span class="w"> </span><span class="n">state_name</span><span class="w"> </span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">virtual_counties</span><span class="w"> </span>
<span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">state_name</span><span class="w"> </span>
<span class="w"> </span><span class="n">order</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">total_voters</span><span class="w"> </span><span class="n">desc</span><span class="p">;</span><span class="w"></span>
<span class="mf">9830550.0</span><span class="o">|</span><span class="n">California</span><span class="w"></span>
<span class="mf">7563055.0</span><span class="o">|</span><span class="n">Florida</span><span class="w"></span>
<span class="mf">7346779.0</span><span class="o">|</span><span class="n">Texas</span><span class="w"></span>
<span class="o">...</span><span class="w"></span>
</code></pre></div>
<p>Now this is fairly straightforward non-spatial SQL, but the ability to run it against a shapefile without exporting to an intermediate data format is a very valuable tool. </p>
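<p>The same aggregate is easy to drive from python with the standard sqlite3 module. This sketch uses an ordinary in-memory table with made-up voter counts standing in for the virtual_counties virtual table (actually loading SpatiaLite/VirtualShape from python would additionally need the connection's enable_load_extension):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ordinary table standing in for the shapefile-backed virtual table
conn.execute("CREATE TABLE virtual_counties (state_name TEXT, voters REAL)")
conn.executemany(
    "INSERT INTO virtual_counties VALUES (?, ?)",
    [("California", 9000000.0), ("California", 830550.0),  # made-up numbers
     ("Florida", 7563055.0), ("Texas", 7346779.0)],
)
rows = conn.execute(
    """SELECT sum(voters) AS total_voters, state_name
       FROM virtual_counties
       GROUP BY state_name
       ORDER BY total_voters DESC"""
).fetchall()
for total_voters, state_name in rows:
    print("%s|%s" % (total_voters, state_name))
```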
<p>Links:
* <a href="http://www.sqlite.org/whentouse.html">When to use SQLite.</a>
* A <a href="http://video.google.com/videoplay?docid=-5160435487953918649">video presentation</a> by Richard Hipp (the author of sqlite).</p>
<h2>Shell history - Why not? (2008-04-11, Matthew T. Perry)</h2>
<p>What an odd meme. I don't know why, but I expected some more interesting results. I guess the majority of the commands I use are pretty pedestrian.</p>
<div class="highlight"><pre><span></span><code><span class="n">history</span><span class="o">|</span><span class="n">awk</span><span class="w"> </span><span class="s1">'{a[$2]++ } END{for(i in a){print a[i] " " i}}'</span><span class="o">|</span><span class="n">sort</span><span class="w"> </span><span class="o">-</span><span class="n">rn</span><span class="o">|</span><span class="n">head</span><span class="w"></span>
<span class="mi">163</span><span class="w"> </span><span class="n">vi</span><span class="w"></span>
<span class="mi">48</span><span class="w"> </span><span class="n">screen</span><span class="w"></span>
<span class="mi">29</span><span class="w"> </span><span class="n">python</span><span class="w"></span>
<span class="mi">28</span><span class="w"> </span><span class="n">ls</span><span class="w"></span>
<span class="mi">17</span><span class="w"> </span><span class="n">cp</span><span class="w"></span>
<span class="mi">17</span><span class="w"> </span><span class="n">cd</span><span class="w"></span>
<span class="mi">9</span><span class="w"> </span><span class="n">sqlite3</span><span class="w"></span>
<span class="mi">6</span><span class="w"> </span><span class="n">rm</span><span class="w"></span>
<span class="mi">5</span><span class="w"> </span><span class="n">sudo</span><span class="w"></span>
<span class="mi">4</span><span class="w"> </span><span class="n">htop</span><span class="w"></span>
</code></pre></div>Working hard for some REST2008-04-02T00:00:00-06:002008-04-02T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-04-02:/working-hard-for-some-rest.html<p>I don't spend much time with web programming these days but I decided to give <a href="http://webpy.org/">web.py</a> (the minimalist python web framework) a shot and, while I was at it, try implementing a simple REST api.</p>
<p>First of all, web.py is truly everything it claims to be - small, light …</p><p>I don't spend much time with web programming these days but I decided to give <a href="http://webpy.org/">web.py</a> (the minimalist python web framework) a shot and, while I was at it, try implementing a simple REST api.</p>
<p>First of all, web.py is truly everything it claims to be - small, light and easy to deploy behind <a href="http://www.lighttpd.net/">lighttpd</a>. It gives you a ton of flexibility to implement anything however you want - which is a plus or a minus depending on how you look at it. I liked the infinite flexibility, but I can see a lot of refactoring taking place and features needing to be implemented just to match the functionality built into a more structured framework like Django.</p>
<p>Back to the REST side of things. So I created a url-mapping to my "resources" or "nouns" and used the HTTP verbs (POST, GET, PUT, DELETE) to supply the interface. web.py made this a joy to implement.</p>
<div class="highlight"><pre><span></span><code>urls = ("/thing/(\d+)", "thing")
...
class thing:
    def GET(self, thingid):
        # select query and render to template
        ....
    def POST(self, thingid):
        # insert query and redirect to /thing/thingid
        ....
    def DELETE(self, thingid):
        # delete query
        ....
    def PUT(self, thingid):
        # use cgi args to run update query on specified thing
        ....
</code></pre></div>
<p>The hard part came when I realized that HTML forms do not implement the DELETE or PUT methods! Two of the four cornerstone HTTP verbs are not implemented in HTML forms? </p>
<p>Surely this can be accomplished with a top-notch AJAX library. I tried Prototype.js, and it appears that the PUT and DELETE methods are simply tunneled over POST with an extra arg attached; the server side has to handle it accordingly. So I ended up just using a straight XMLHttpRequest, which works but has its own problems.</p>
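<p>The tunneling workaround is simple to sketch on the server side. The following is a hypothetical dispatcher, not actual web.py or Prototype.js code, and the hidden <code>_method</code> field name is an assumption for illustration: the form always POSTs, and the handler routes on the override.</p>

```python
# Hypothetical sketch: an HTML form can only GET or POST, so the real verb
# rides along in a hidden "_method" field and the server dispatches on it.
# The field name and handler shape are illustrative, not web.py's actual API.
def dispatch(handler, http_method, params):
    """Route to handler.<VERB>(), honoring a tunneled _method override."""
    verb = params.pop("_method", http_method).upper()
    if verb not in ("GET", "POST", "PUT", "DELETE"):
        raise ValueError("unsupported method: " + verb)
    return getattr(handler, verb)(params)

class Thing:
    def PUT(self, params):
        return "updated thing %s" % params["id"]
    def DELETE(self, params):
        return "deleted thing %s" % params["id"]

# The browser POSTs, but the hidden field says DELETE:
print(dispatch(Thing(), "POST", {"_method": "DELETE", "id": "7"}))  # deleted thing 7
```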
<p>How are you supposed to call PUT or DELETE through a web page? Is XMLHttpRequest the only way? What about browsers without javascript?</p>Upcoming books2008-03-12T00:00:00-06:002008-03-12T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2008-03-12:/upcoming-books.html<p>There are two new books coming out this summer which fill a valuable niche in the open-source GIS bookshelf:</p>
<ul>
<li>
<p><a href="http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-78170-9">Applied Spatial Data Analysis with R</a></p>
</li>
<li>
<p><a href="http://www.pragprog.com/titles/gsdgis">Desktop GIS: Mapping the Planet with Open Source</a></p>
</li>
</ul>
<p>These are both written by some of the top developers within their respective topics and I'm really …</p><p>There are two new books coming out this summer which fill a valuable niche in the open-source GIS bookshelf:</p>
<ul>
<li>
<p><a href="http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-78170-9">Applied Spatial Data Analysis with R</a></p>
</li>
<li>
<p><a href="http://www.pragprog.com/titles/gsdgis">Desktop GIS: Mapping the Planet with Open Source</a></p>
</li>
</ul>
<p>These are both written by some of the top developers within their respective topics and I'm really looking forward to reading them. </p>Google Earth and the tilt sensor joystick on the X61s2008-02-17T00:00:00-07:002008-02-17T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2008-02-17:/google-earth-and-the-tilt-sensor-joystick-on-the-x61s.html<p>The X61s is one bad-ass machine. Besides the great performance, battery life and solid engineering, there are other hidden gems. For example, the tilt sensors that were designed to protect the hard drive in case of a drop can also be used to detect the laptop's motion under more normal circumstances …</p><p>The X61s is one bad-ass machine. Besides the great performance, battery life and solid engineering, there are other hidden gems. For example, the tilt sensors that were designed to protect the hard drive in case of a drop can also be used to detect the laptop's motion under more normal circumstances. </p>
<p>There are <a href="http://www-128.ibm.com/developerworks/linux/library/l-knockage.html">some</a> <a href="http://www.pberndt.com/Programme/Linux/pyhdaps/index.html#">interesting</a> <a href="http://blog.micampe.it/articles/2006/06/04/here-comes-the-smackpad">applications</a> that use some simple statistics to determine when the machine is "tapped" or jolted to the left or right. You can then assign actions to unique combinations of taps.</p>
<p>These applications all use the sysfs interface to the sensors (<code>cat /sys/bus/platform/devices/hdaps/position</code> will show your position in the x and y axes). But the sensors also provide a joystick interface that allows you to tilt the laptop along the two horizontal axes to control any number of applications. Including Google Earth.</p>
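<p>To illustrate how little code it takes to consume that sysfs interface, here's a hedged Python sketch; the <code>(x,y)</code> text format is assumed from the position file described above.</p>

```python
# Hedged sketch: parse the hdaps position file. The "(x,y)" format is an
# assumption based on the sysfs file mentioned in the text.
def parse_position(raw):
    """Turn a string like "(12,-3)" into an (x, y) tuple of ints."""
    x, y = raw.strip().lstrip("(").rstrip(")").split(",")
    return int(x), int(y)

def read_position(path="/sys/bus/platform/devices/hdaps/position"):
    # Requires the tp_smapi/hdaps module to be loaded.
    with open(path) as f:
        return parse_position(f.read())

print(parse_position("(12,-3)"))  # (12, -3)
```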
<ol>
<li>
<p>Install <a href="http://www.thinkwiki.org/wiki/Tp_smapi">tp_smapi</a></p>
</li>
<li>
<p>Test the sensors by running <code>hdaps-gl</code>, a simple OpenGL app showing the real-time tilt of your ThinkPad.</p>
</li>
<li>
<p>Run jscal to calibrate the joystick. You'll need to install the "joystick" package for this. The command is:
<code>jscal -c /dev/input/js0</code>
After which you should keep your laptop level for a few seconds. Then, when prompted, tilt left, center, right, back (towards you), center, then forward.</p>
</li>
<li>
<p>Now fire up Google Earth. Open the Options menu, go to Navigation and select Enable Controller. </p>
</li>
</ol>
<p><img alt="" src="/assets/img/GE_joystick.jpg"></p>
<ol start="5">
<li>You should now be able to zoom around by tilting the laptop. The keyboard shortcuts really help when you're in this mode (Ctrl-Up/Down to zoom, Shift-Up/Down to tilt, Shift-Left/Right to pivot). </li>
</ol>
<p>There's also a neat <a href="http://www.metafilter.com/52312/More-accellerometer-goodness">Perl-script technique to control a web-based Google Map</a>, which has some cool potential for an OpenLayers-based system. </p>
<p>Since most Apple laptops have a similar sensor, you should be able to get the same thing going on your MacBook. Try it out... it's a lot more fun than using the mouse!</p>The shiny new X61s2008-02-16T00:00:00-07:002008-02-16T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2008-02-16:/the-shiny-new-x61s.html<p>My HP laptop was nearing 5 years old. It had held up extremely well but most modern software taxed it to the absolute limits (just having firefox open with a flash ad in one tab was enough to send the system load through the roof). So I decided to try …</p><p>My HP laptop was nearing 5 years old. It had held up extremely well, but most modern software taxed it to the absolute limits (just having Firefox open with a Flash ad in one tab was enough to send the system load through the roof). So I decided to try something new.</p>
<p>I was looking for something in the ultra-portable range. I tried out the OLPC and looked seriously at the Asus Eee PC for a while. But they were far too difficult for me to type on. Ergonomics were extremely important, and the only ultraportables that consistently rated high in that department were the IBM/Lenovo ThinkPads. The X61s was appealing with its low-voltage Core 2 Duo and 2GB of RAM. All that in a small package, about 3 lbs and an inch thick.</p>
<p><img alt="" src="/assets/img/x61s.jpg"></p>
<p>So the X61s arrived and I figured I'd give it a try with the "stock" software. It was my first experience with Vista and I gave it my best shot. After about a half hour of excessive clicking, sluggish performance and pop-up windows, I shrunk the NTFS partition and installed Ubuntu Hardy Heron Alpha 4. </p>
<p>Sound, wireless with WPA, Compiz with 3D; the major things that normally plague a Linux laptop install worked right out of the box. On the other hand, I'm running into a few bugs in Nautilus (this is alpha software after all), I can't get Bluetooth working, suspending to RAM works but is a little buggy (I have to restart some services manually) and I had to edit a few config files and compile a kernel module to utilize all the bells and whistles provided by the hardware. But it is still more fun than using Vista.</p>
<p>One thing that really shines on this machine is the battery. I got the 8-cell extended life battery and used some powertop tweaks to cut my power consumption, getting the wattage down into the 10 to 15 watt range depending on usage patterns. No wonder it is Energy Star compliant! With that kind of wattage and battery capacity, I'm easily getting about 6 to 7 full hours of battery life.</p>
<p>Some tips if you're setting up Linux on your X61s:</p>
<ul>
<li>
<p>First and foremost, read <a href="http://thinkwiki.org">thinkwiki</a>. There you'll find 95% of your answers. But to summarize my experience: </p>
</li>
<li>
<p>Upgrade your BIOS first (this is a good reason to keep your Vista partition around, since Lenovo ships some handy update utils for Windows). </p>
</li>
<li>
<p>Install the <a href="http://www.thinkwiki.org/wiki/Tp_smapi">tp_smapi kernel module</a> with HDAPS support. This will enable Linux to access the hard drive sensors for disk protection, motion sensing and the joystick interface</p>
</li>
<li>
<p>The big blue "Thinkvantage" button doesn't work out of the box. I'm not sure what it <em>should</em> do, but it's a nicely placed button so <a href="http://www.thinkwiki.org/wiki/How_to_get_special_keys_to_work#acpi_fakekey">don't let it go to waste</a>.</p>
</li>
<li>
<p>Tweak the power consumption. For the impatient, just install powertop and follow the instructions .. it will tell you what processes are waking your CPU and how to stop them. Also check out <a href="http://lesswatts.org">Less Watts</a> - a full resource for tweaking linux power consumption.</p>
</li>
<li>
<p>Configure your <a href="http://www.thinkwiki.org/wiki/How_to_configure_the_TrackPoint">trackpoint</a> pointer and buttons. This involves setting up your xorg.conf file to emulate a middle scroll wheel as well as tweaking the speed and sensitivity of the pointer. BTW - if you've never tried a pointer, give it a shot ... I've found it much more comfortable than a touchpad.</p>
</li>
<li>
<p><a href="http://samwel.tk/laptop_mode/">Laptop-mode</a>, a set of kernel and userspace tools to manage hard-drive power consumption, can be handy. It can also be <a href="https://bugs.launchpad.net/mandriva/+source/laptop-mode-tools/+bug/59695">deadly to your disk if configured incorrectly</a>. Basically, it aggressively spins down the disk after short periods of inactivity to save power. Inevitably an application will try to hit the disk again and it will spin right back up. This leads to an unreasonably high number of load cycles (100 per hour), and the drive can only handle a finite number before failure (~600,000). You can configure it for more sane behavior, but do your research before you enable laptop-mode! And check out smartctl to monitor the disk's health. </p>
</li>
<li>
<p>If, after you unsuspend the machine, your screen is way too dark, try Ctl-Alt-F1 followed by Ctl-Alt-F7. There are some other hacks involving acpi configuration or grub kernel options but none of them have worked for me yet.</p>
</li>
</ul>Human Impacts on the Global Marine Ecosystem2008-02-15T00:00:00-07:002008-02-15T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2008-02-15:/human-impacts-on-the-global-marine-ecosystem.html<p><a href="http://sciencenow.sciencemag.org/cgi/content/full/2008/214/2">We did it</a>!</p>
<p>As some of you may know, from 2005 through 2006 I was part of a research team[1], led by Ben Halpern at NCEAS, developing a global model of human impacts on the marine ecosystem. We created or compiled 17 high-resolution global datasets of human-induced threats (land-based …</p><p><a href="http://sciencenow.sciencemag.org/cgi/content/full/2008/214/2">We did it</a>!</p>
<p>As some of you may know, from 2005 through 2006 I was part of a research team[1], led by Ben Halpern at NCEAS, developing a global model of human impacts on the marine ecosystem. We created or compiled 17 high-resolution global datasets of human-induced threats (land-based pollutants, fishing, shipping, climate change, etc.) and 20 ocean habitat datasets. These were combined to create an impact index which models the cumulative level of human-induced stress on our oceans. </p>
<p><a href="http://ebm.nceas.ucsb.edu/GlobalMarine/models/model/jpg/model_high_res.jpg"><img alt="" src="/assets/img/map_400.jpg"></a></p>
<p>The results were published today in <a href="http://www.sciencemag.org/cgi/content/abstract/319/5865/948">Science</a> magazine and presented yesterday at the <a href="http://news.aaas.org/releases/2008_ann_mtg/scientists-track-human-footpri.html">AAAS Annual Meeting</a>. To summarize, we found that the entire ocean is affected and 40% is heavily impacted. It is not all bad news as there are many areas of relatively low impact which could provide examples for ecosystem restoration and opportunities for conservation. The global map is the first of its kind and will help clarify and quantify our cumulative impacts on the ocean and allow us to focus efforts geographically. The model is not perfect and can't really be used to make decisions at a very localized scale but, given the available globally-consistent, reasonably-high-resolution data for all the various ocean threats and habitats, this is the best effort to date. The model itself is relatively simple with a very clear methodology which will allow scientists to tweak the parameters and add better data as it becomes available. For those of you interested in the GIS modeling end, NCEAS has a <a href="http://www.nceas.ucsb.edu/GlobalMarine">great summary</a> of the data used in the model. Most of the data are available as raster data products or KML.</p>
<p>The media has picked up on the story with <a href="http://www.npr.org/templates/story/story.php?storyId=19059595">NPR</a>, <a href="http://www.msnbc.msn.com/id/23155918/">MSNBC</a>, <a href="http://www.washingtonpost.com/wp-dyn/content/article/2008/02/14/AR2008021401992.html?hpid=topnews">The Washington Post</a>, <a href="http://www.usatoday.com/tech/science/environment/2008-02-14-oceans-human-activity_N.htm">USA Today</a> and <a href="http://news.nationalgeographic.com/news/2008/02/080214-oceans.html">National Geographic</a> covering it (to name a few). I especially recommend the NPR site as it has a great animation and an audio segment. </p>
<p>So congratulations to everyone who made this happen! </p>
<p><em>[1] Benjamin S. Halpern, Shaun Walbridge, Kimberly A. Selkoe, Carrie V. Kappel, Fiorenza Micheli, Caterina D'Agrosa, John F. Bruno, Kenneth S. Casey, Colin Ebert, Helen E. Fox, Rod Fujita, Dennis Heinemann, Hunter S. Lenihan, Elizabeth M.P. Madin, Matthew T. Perry, Elizabeth R. Selig, Mark Spalding, Robert Steneck, Reg Watson (2008). A global map of human impact on marine ecosystems. Science, vol. 319</em></p>
<hr>
<p>EDIT:</p>
<p>Some additional articles:</p>
<ul>
<li>
<p><a href="http://www.nytimes.com/interactive/2008/02/25/science/earth/20080225_COAST_GRAPHIC.html">New York Times</a></p>
</li>
<li>
<p><a href="http://youtube.com/watch?v=0qh49Da5A5M">BBC Video</a> on YouTube </p>
</li>
</ul>Why is the command line a dying art?2008-02-02T00:00:00-07:002008-02-02T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2008-02-02:/why-is-the-command-line-a-dying-art.html<p>Sadly, a lot of GIS folks have never come into contact with a command line interface (CLI). I've met even experienced computer users who, when faced with a command-line prompt, experience some autonomic nervous system lock-up that causes their eyes to glaze over and prevents any knowledge from entering …</p><p>Sadly, a lot of GIS folks have never come into contact with a command line interface (CLI). I've met even experienced computer users who, when faced with a command-line prompt, experience some autonomic nervous system lock-up that causes their eyes to glaze over and prevents any knowledge from entering their brain from that moment forward. The all-Windows, all-GUI mentality of the current GIS market leaders just doesn't expose you to it (if you remember working with coverages at the ESRI Arc/Info command line, you officially qualify as an "old-timer"). And the DOS command line is virtually invisible to XP and Vista users. Linux users are more CLI-aware, but even this is becoming less important as distros such as Ubuntu GUI-ify everything.</p>
<p>So why the fear of the command line? Why is it assumed to be more "complicated" than a graphical user interface (GUI)? I have found that, in some cases, the opposite is true ... there is something reassuringly simple about typing something and getting a response back. It feels like you are in direct control of the computer. Which, indeed, you <em>always</em> are. The computers always do exactly what you tell them, whether you are in a GUI or a CLI. But GUIs attempt to abstract away the details so that you <em>don't need to know</em> exactly what you're telling the computer to do. This nice fluffy feeling comes at the cost of many important factors. </p>
<h2>The benefits of the command line interface</h2>
<h3>Automation</h3>
<p>If you had, for instance, monitoring data coming in on an hourly basis and needed to process it, would you want to be on call 24 hours a day to click a few buttons? Of course not. Write a command that performs the job and schedule it to execute at some regular interval. (I wonder if those guys on LOST ever thought to just set up a cron job to enter the numbers in the hatch?) </p>
<h3>Repeatability</h3>
<p>Whenever I show someone a CLI-based method for solving their problem, they almost immediately say (or at least imply) that the typing is too much trouble. Consider this command to convert a .tif image to ERDAS .img (HFA) format:</p>
<div class="highlight"><pre><span></span><code>cd /data/images
gdal_translate -of HFA aerial.tif aerial.tif.img
</code></pre></div>
<p>You might ask, "Why not just use a GUI, click a button or two, and get your output". Sure. Now do that for 2,000 tif images. With a CLI you only have to type a few extra lines. </p>
<div class="highlight"><pre><span></span><code><span class="nv">cd</span><span class="w"> </span><span class="o">/</span><span class="nv">data</span><span class="o">/</span><span class="nv">images</span><span class="w"></span>
<span class="k">for</span><span class="w"> </span><span class="nv">i</span><span class="w"> </span><span class="nv">in</span><span class="w"> </span><span class="o">*</span>.<span class="nv">tif</span><span class="c1">; do </span><span class="w"></span>
<span class="w"> </span><span class="nv">gdal_translate</span><span class="w"> </span><span class="o">-</span><span class="nv">of</span><span class="w"> </span><span class="nv">HFA</span><span class="w"> </span>$<span class="nv">i</span><span class="w"> </span>$<span class="nv">i</span>.<span class="nv">img</span><span class="c1">;</span><span class="w"></span>
<span class="nv">done</span><span class="w"></span>
</code></pre></div>
<h3>Documentability</h3>
<p>There is nothing more important to a GIS Analyst than documenting his/her work! We live by metadata and methods write-ups. Now picture an intense 5 hour work session ... everything needed to get out by 2pm. You're done and now it's time to document your procedure and methods. With the CLI, you copy and paste your commands from the terminal or simply look at your command history which will show <em>exactly</em> what you did and how. You can store this in a text file and come back to it months later and be able to re-run the procedure. </p>
<p>With the GUI, you have to remember and describe every click, every sub-menu, every option, every action taken to arrive at the answer. Often this requires verbose description, screenshots, etc. None of which is recorded in any history file of course. And of course, when the client inevitably comes back the next day with modifications, none of it is repeatable in any automated fashion with a GUI. </p>
<h3>Accessibility</h3>
<p>It's just plain text with a CLI. You can print it out and study it on the bus. You can email the whole process to co-workers. You can use a concurrent versioning system to keep track of changes to scripted procedures. You can transfer massive amounts of knowledge without having to sit down and go through everything step-by-step, click-by-click in a visual interface. </p>
<h3>Accuracy</h3>
<p>Far too often, GUI designers make over-reaching assumptions about how things should work. The idea is often that the user should not need to know anything more than the absolute minimum. To use a car analogy, the driver turns the key, presses the pedal and steers but does not need to know what goes on under the hood. This works most of the time. But the <a href="http://en.wikipedia.org/wiki/Leaky_abstraction">law of leaky abstractions</a> usually takes hold and something inevitably breaks or performs differently than expected. Since the CLI does not hold your hand (it executes the exact command you give it) it more accurately mimics the actual physical interaction with the computer and is much more useful in debugging and investigating complex problems. </p>
<p>So basically, don't make the mistake of thinking that a pretty window will always contain the magic button to get the job done. In many cases, a command line is much more efficient, even essential. If you don't know how to effectively work in a command-line environment, do yourself a huge favor and learn.</p>
<p>Oh and I'd be remiss if I didn't mention <a href="http://www.amazon.com/Beginning-was-Command-Line-Neal-Stephenson/dp/0380815931">Neal Stephenson's book</a> on the subject ... a bit technically outdated but a great quick read on why command lines are still very relevant in the face of increasingly sophisticated graphical interfaces.</p>Impervious surface delineation with GRASS2008-01-26T00:00:00-07:002008-01-26T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2008-01-26:/impervious-surface-deliniation-with-grass.html<p>Watersheds with lots of roads, buildings, parking lots, rock surfaces, compacted dirt, etc. tend to prevent infiltration and cause rapid runoff in response to rainfall. This poses a <a href="http://chesapeake.towson.edu/landscape/impervious/what_imp.asp">number of challenges for managing stormwater</a> and water quality. Not surprisingly, the percentage of hydrologically impervious surface in a given watershed is …</p><p>Watersheds with lots of roads, buildings, parking lots, rock surfaces, compacted dirt, etc. tend to prevent infiltration and cause rapid runoff in response to rainfall. This poses a <a href="http://chesapeake.towson.edu/landscape/impervious/what_imp.asp">number of challenges for managing stormwater</a> and water quality. Not surprisingly, the percentage of hydrologically impervious surface in a given watershed is an important factor in many hydrologic models. Using standard aerial photography and GRASS, it's a relatively simple process to create an impervious surface map using supervised classification.</p>
<p>First, find an aerial photo. I grabbed a NAIP image from <a href="http://new.casil.ucdavis.edu/casil/remote_sensing/naip_2005/county_mosaics/">CASIL</a> but you might want to try <a href="http://crschmidt.net/blog/archives/285/producing-a-large-image-from-openaerialmap/">using OpenAerialMap</a>. The red, green and blue visible bands are usually sufficient for differentiating between impervious and pervious land use types. For distinguishing different types of vegetation you might want to use a multispectral imagery source with non-visible bands (i.e. near-infrared), but this is usually lower resolution (e.g. the 30-meter pixels of Landsat) or much more expensive.</p>
<p>Next we jump into GRASS and import our image into a new location:</p>
<div class="highlight"><pre><span></span><code>r.in.gdal -e input=naip.img output=naip location=impervious
</code></pre></div>
<p>Exit and log back into your new location. If you look at the imported rasters, you'll see three rasters, not one. Each band (R, G and B) gets imported separately.</p>
<div class="highlight"><pre><span></span><code>GRASS 6.3.cvs (impervious):~/> g.list rast
raster files available in mapset permanent:
naip.1 naip.2 naip.3
</code></pre></div>
<p>We need to indicate that these rasters form a logical group</p>
<div class="highlight"><pre><span></span><code><span class="n">i</span><span class="p">.</span><span class="k">group</span><span class="w"> </span><span class="k">group</span><span class="o">=</span><span class="n">naip2</span><span class="w"> </span><span class="n">subgroup</span><span class="o">=</span><span class="n">naip2</span><span class="w"> </span><span class="k">input</span><span class="o">=</span><span class="n">naip</span><span class="mf">.3</span><span class="nv">@PERMANENT</span><span class="p">,</span><span class="n">naip</span><span class="mf">.2</span><span class="nv">@PERMANENT</span><span class="p">,</span><span class="n">naip</span><span class="mf">.1</span><span class="nv">@PERMANENT</span><span class="w"></span>
<span class="n">i</span><span class="p">.</span><span class="n">target</span><span class="w"> </span><span class="o">-</span><span class="n">c</span><span class="w"> </span><span class="k">group</span><span class="o">=</span><span class="n">naip2</span><span class="w"></span>
</code></pre></div>
<p>At any time you can list the rasters in a given group/subgroup to confirm.</p>
<div class="highlight"><pre><span></span><code>i.group -l -g group=naip2 subgroup=naip2
</code></pre></div>
<p>Now the real heart of the process. We need to define "training areas", which are polygons around representative land use types. I used QGIS to load the aerial photo and create a new polygon layer with an integer attribute field called vegnum. I digitized a few rocks, paved areas, rooftops and dirt roads to represent the impervious areas, to which I assigned vegnum=1. Then I selected some grasslands, forests, lakes and chaparral and assigned 2 as the vegnum. The next step is to load the polygon data into GRASS and rasterize it (<em>in retrospect it would have been easier to create the GRASS vector layer from scratch in QGIS to avoid the import step</em>). Note that the vegnum field is specified as the raster value column.</p>
<div class="highlight"><pre><span></span><code>v.in.ogr -o dsn=./training/train1_utm/train1_utm.shp output=train1 layer=train1_utm min_area=0.0001 type=boundary snap=-1
v.to.rast input=train1 output=train1 use=attr column=vegnum type=point,line,area layer=1 value=1 rows=4096
</code></pre></div>
<p>Next we use i.gensig to generate a spectral signature (the statistical profile; mean and covariance matrix of the input pixels) for the training areas. </p>
<div class="highlight"><pre><span></span><code>i.gensig trainingmap=train1 group=naip2 subgroup=naip2 signaturefile=naip2_train1.sig
</code></pre></div>
<p>Now that we have a signature of impervious vs. non-impervious surfaces, we can use the maximum likelihood method to classify each pixel into the highest probability category.</p>
<div class="highlight"><pre><span></span><code>i.maxlik group=naip2 subgroup=naip2 sigfile=naip2_train1.sig class=imperv
</code></pre></div>
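<p>For intuition on what i.maxlik is doing, here is a toy one-band, scalar-variance sketch of maximum-likelihood classification. The real tool works per-band with the full covariance matrices from the .sig file; the class statistics below are made up for illustration.</p>

```python
# Toy maximum-likelihood classifier: assign a pixel to the class whose
# Gaussian signature (mean, variance) gives it the highest log-likelihood.
# One band with scalar variance only; i.maxlik uses full covariance matrices.
import math

def log_likelihood(value, mean, var):
    # Log of the univariate normal density (constant terms included).
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def classify(pixel, signatures):
    # signatures: {class_id: (mean, variance)}
    return max(signatures, key=lambda c: log_likelihood(pixel, *signatures[c]))

# Hypothetical signatures: impervious surfaces are bright and uniform,
# pervious cover is darker and more variable.
sigs = {1: (200.0, 100.0),   # impervious
        2: (80.0, 400.0)}    # pervious
print(classify(190.0, sigs))  # 1
print(classify(90.0, sigs))   # 2
```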
<p>You might notice a slight speckled, noisy appearance due to things like shadows, reflections or imperfect training areas. Usually these small one-pixel deviations are not interesting enough to keep, so we can smooth out the image by taking the mode (most common) cell in a 3x3 window.</p>
<div class="highlight"><pre><span></span><code>r.neighbors input=imperv output=imperv_mode method=mode size=3
</code></pre></div>
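<p>The effect of that mode filter is easy to demonstrate in pure Python. This is just a stand-in for <code>r.neighbors method=mode size=3</code>, with edge cells simply left unchanged:</p>

```python
# Pure-Python 3x3 mode filter over a classified grid (list of lists of ints).
# Interior cells take the most common value in their 3x3 neighborhood.
from collections import Counter

def mode_filter(grid):
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # edges are copied through unchanged
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [grid[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = Counter(window).most_common(1)[0][0]
    return out

# A lone "impervious" (1) speckle surrounded by pervious (2) gets smoothed away:
speckled = [[2, 2, 2],
            [2, 1, 2],
            [2, 2, 2]]
print(mode_filter(speckled)[1][1])  # 2
```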
<p>And here are the results... calculating imperviousness will most likely be an iterative process so be prepared to evaluate the output, tweak the training areas and rerun the process a few times. Once you're happy with the results, you can use zonal statistics with a tool like starspan to find the percent imperviousness of your watersheds or other regions.</p>
<p><img alt="" src="/assets/img/aerial.jpg"></p>
<p><img alt="" src="/assets/img/imperv_smooth.png"></p>A GUI for GDAL and GMT'2008-01-06T00:00:00-07:002008-01-06T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2008-01-06:/a-gui-for-gdal-and-gmt.html<p>In the why-haven't-I-ever-heard-of-this department:</p>
<blockquote>
<p><a href="http://w3.ualg.pt/%7Ejluis/mirone/manual.htm">Mirone</a> is a Windows MATLAB-based framework tool that allows the display and manipulation of a large number of grid formats through its interface with the GDAL library. Its main purpose is to provide users with an easy-to-use graphical interface to the more commonly used programs of …</p></blockquote><p>In the why-haven't-I-ever-heard-of-this department:</p>
<blockquote>
<p><a href="http://w3.ualg.pt/%7Ejluis/mirone/manual.htm">Mirone</a> is a Windows MATLAB-based framework tool that allows the display and manipulation of a large number of grid formats through its interface with the GDAL library. Its main purpose is to provide users with an easy-to-use graphical interface to the more commonly used programs of the GMT package. </p>
</blockquote>
<p>There is also a version that does not depend on MATLAB which is what I decided to try. This is a great package; easy to install, very usable, lots of high-end raster functionality, and a good sense of humor...</p>
<p><img alt="" src="/assets/img/mirone.png"></p>
<p>Considering GMT and GDAL can be a bit challenging and unfamiliar for a typical windows user, Mirone is a huge step forward. </p>
<p>Among some of the functionality that is an absolute pleasure to work with compared to some other software packages: surface profiles, image-flipping, DEM derivatives, color-ramping, contouring, histograms, kernel filtering... And that's just scratching the surface. I highly recommend checking it out.</p>More on Google Charts and a Python interface2007-12-19T00:00:00-07:002007-12-19T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-12-19:/more-on-google-charts-and-a-python-interface.html<p>Well, it's been almost two full weeks since the <a href="http://code.google.com/apis/chart/">Google Charts API</a> came out. A really nice service, but it's only going to be useful with a high-level programming API. Enter <a href="http://pygooglechart.slowchop.com/">PyGoogleChart</a>... a Python interface to generate Google Chart URLs. </p>
<p>Taking one of my <a href="http://www.perrygeo.net/wordpress/?p=64">previous example datasets</a>, here's the 10-second …</p><p>Well, it's been almost two full weeks since the <a href="http://code.google.com/apis/chart/">Google Charts API</a> came out. A really nice service, but it's only going to be useful with a high-level programming API. Enter <a href="http://pygooglechart.slowchop.com/">PyGoogleChart</a>... a Python interface to generate Google Chart URLs. </p>
<p>Taking one of my <a href="http://www.perrygeo.net/wordpress/?p=64">previous example datasets</a>, here's the 10-second howto:</p>
<div class="highlight"><pre><code>from pygooglechart import SimpleLineChart
chart = SimpleLineChart(400, 200)
data = [32.5, 35.2, 39.9, 40.8, 43.9, 48.2, 50.5, 51.9, 53.1, 55.9, 60.7, 64.4]
chart.add_data(data)
url = chart.get_url()
print url
</code></pre></div>
<p>which gives us:</p>
<blockquote>
<p>http://chart.apis.google.com/chart?cht=lc&chs=400x200&chd=t:32.5,35.2,39.9,40.8,43.9,48.2,50.5,51.9,53.1,55.9,60.7,64.4</p>
</blockquote>
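<p>Under the hood there's no magic; the library is just assembling query string parameters. Here's a hand-rolled sketch (my own illustration in modern Python, not pygooglechart's code) that shows the URL anatomy: <code>cht</code> for chart type, <code>chs</code> for size, <code>chd</code> for data:</p>

```python
from urllib.parse import urlencode

def simple_line_chart_url(width, height, values):
    """Assemble a Google Chart URL by hand: cht=lc selects a line
    chart, chs gives the pixel size, chd=t: carries the raw data."""
    params = {
        "cht": "lc",
        "chs": "%dx%d" % (width, height),
        "chd": "t:" + ",".join(str(v) for v in values),
    }
    # keep ':' and ',' literal so the URL matches what the API expects
    return "http://chart.apis.google.com/chart?" + urlencode(params, safe=":,")

data = [32.5, 35.2, 39.9, 40.8, 43.9, 48.2, 50.5, 51.9, 53.1, 55.9, 60.7, 64.4]
print(simple_line_chart_url(400, 200, data))
```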
<p>and our chart image:</p>
<p><img alt="" src="http://chart.apis.google.com/chart?cht=lc&chs=400x200&chd=t:32.5,35.2,39.9,40.8,43.9,48.2,50.5,51.9,53.1,55.9,60.7,64.4"></p>Geologist vs. Engineer2007-12-12T00:00:00-07:002007-12-12T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-12-12:/geologist-vs-engineer.html<p><a href="http://uncyclopedia.org/wiki/Main_Page">Uncyclopedia</a>, the self-proclaimed encyclopedia "full of misinformation and utter lies", has a hilarious <a href="http://uncyclopedia.org/wiki/Geologist">article about Geologists</a>. I especially like the "<a href="http://uncyclopedia.org/wiki/Geologist#The_Great_Geologist-Engineer_Controversy">Geologist-Engineer Controversy</a>" which, having worked with both geologists and engineers extensively, is a pretty accurate portrayal of their respective approaches.</p>
<blockquote>
<p>Geology, being an art as much as a science, has …</p></blockquote><p><a href="http://uncyclopedia.org/wiki/Main_Page">Uncyclopedia</a>, the self-proclaimed encyclopedia "full of misinformation and utter lies", has a hilarious <a href="http://uncyclopedia.org/wiki/Geologist">article about Geologists</a>. I especially like the "<a href="http://uncyclopedia.org/wiki/Geologist#The_Great_Geologist-Engineer_Controversy">Geologist-Engineer Controversy</a>" which, having worked with both geologists and engineers extensively, is a pretty accurate portrayal of their respective approaches.</p>
<blockquote>
<p>Geology, being an art as much as a science, has always baffled and worried engineers, hence the engineers' defensive weapons of pocket protectors, slide rules, black socks, and eventually computers.</p>
</blockquote>
<p>A related joke:</p>
<blockquote>
<p>A geologist and an engineer walk into a job interview. They are each asked a simple math question: 'What is 2 times 2?'. The engineer replies, 'It's 4.00000'. The geologist replies, 'Ah.. it's about 4'</p>
</blockquote>Quick way to publish a point shapefile to html2007-12-10T00:00:00-07:002007-12-10T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-12-10:/quick-way-to-publish-a-point-shapefile-to-html.html<p>There are better ways to put data on the web but my latest little project wasn't about the <em>best</em> way but the quickest way to get some spatial data into the hands of those unfortunate souls who don't have GIS software. The goals were pretty simple:</p>
<ul>
<li>
<p>Take a single point …</p></li></ul><p>There are better ways to put data on the web but my latest little project wasn't about the <em>best</em> way but the quickest way to get some spatial data into the hands of those unfortunate souls who don't have GIS software. The goals were pretty simple:</p>
<ul>
<li>
<p>Take a single point shapefile (or other OGR readable vector data source)</p>
</li>
<li>
<p>Convert it into html/js that would use one of the web mapping APIs to display the points and all their attributes. </p>
</li>
<li>
<p>The output had to be a standalone, self-contained html file that could be emailed. No server side anything required.</p>
</li>
</ul>
<p>I came up with a quick python hack to do the job (<a href="http://perrygeo.googlecode.com/svn/trunk/gis-bin/shp2Mapstraction.py">source code</a>). <a href="http://www.mapstraction.com/">Mapstraction</a>, with its goal of providing a common javascript API for a number of map providers, seemed like an obvious choice. The python portion of the code reads the shapefile using OGR (you will need the python-gdal bindings, see FWTools) and constructs the html/js. All the javascript is sourced to external URLs so there is no software dependency except for a working network connection. </p>
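<p>The html/js generation half of the idea is plain string templating. A stripped-down sketch of that part (the <code>addMarker</code> helper here is hypothetical; the real script emits calls to the Mapstraction API and carries the full attribute table):</p>

```python
def points_to_html(points, title="points"):
    """Emit a standalone HTML page; each (x, y, label) tuple becomes
    a hypothetical addMarker() call wired up on page load."""
    calls = "\n".join(
        '      addMarker(%f, %f, "%s");' % (x, y, label) for x, y, label in points
    )
    return """<html>
<head><title>%s</title></head>
<body onload="init()">
<script type="text/javascript">
    function init() {
%s
    }
</script>
</body>
</html>""" % (title, calls)

# one point, as it might come out of the OGR read loop
page = points_to_html([(-119.55, 37.75, "bear box 1")], title="bearboxes")
```

Because everything ends up in a single self-contained file, the result can be emailed around with no server-side anything, which was the whole point.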
<p>This allows for a single command:</p>
<blockquote>
<p>shp2Mapstraction.py bearboxes.shp bearboxes.html Yahoo</p>
</blockquote>
<p><img alt="" src="/assets/img/bearboxes.jpg"></p>
<p>which produces <a href="/assets/img/bearboxes.html">an html file</a> providing a Yahoo maps interface to the data; in this case the point location of all the bear boxes (food storage lockers to keep your stuff separated from the bears) in the Sierra Nevada. </p>
<p>Currently it just supports Microsoft Virtual Earth and Yahoo. I had to bypass Google because their key system is restricted by URL. And the mapstraction-to-openlayers connection wasn't working too well though I haven't really investigated.</p>
<p>Anyways, it provides a quick and easy way to deliver spatial data to anyone with a browser and internet connection. </p>Google Charts - their latest web service2007-12-06T00:00:00-07:002007-12-06T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-12-06:/google-charts-their-latest-web-service.html<p>Google Charts is a <a href="http://code.google.com/apis/chart/">web based API</a> for generating charts/graphs. It supports a lot of the common types of graphics including line, pie, bar, scatter plots and Venn diagrams. I've relied on a bunch of other server-side graph generators (<a href="http://www.maptools.org/owtchart/index.phtml">owtchart</a>, <a href="http://www.aditus.nu/jpgraph/">jpgraph</a>, <a href="http://www.perrygeo.net/wordpress/?p=64">sparklines</a>, <a href="http://matplotlib.sourceforge.net/">matplotlib</a>, etc) but this looks like it might …</p><p>Google Charts is a <a href="http://code.google.com/apis/chart/">web based API</a> for generating charts/graphs. It supports a lot of the common types of graphics including line, pie, bar, scatter plots and Venn diagrams. I've relied on a bunch of other server-side graph generators (<a href="http://www.maptools.org/owtchart/index.phtml">owtchart</a>, <a href="http://www.aditus.nu/jpgraph/">jpgraph</a>, <a href="http://www.perrygeo.net/wordpress/?p=64">sparklines</a>, <a href="http://matplotlib.sourceforge.net/">matplotlib</a>, etc) but this looks like it might be a contender.</p>
<p>Still there is no higher-level programming API yet ... but give it a few days (interface with numpy anyone?). <a href="http://exilejedi.livejournal.com/189606.html">ExileJedi blog lists </a>some other potential disadvantages:</p>
<blockquote>
<ul>
<li>You are limited to 50,000 queries per user per day, which may pose some scalability concerns if you plan to build something big on this.</li>
<li>You have to be careful about the number of data points you submit in your request as you can quickly exceed the allowable URL length, and furthermore you might end up with illegibly smooshed-together data points due to the scale of your output.</li>
<li>There's always the "OMG Google will absorb all our data and become sentient, turn evil, and unleash an army of death robots on us all, run for your lives!" paranoia, but that's really just silly talk.</li>
</ul>
</blockquote>
<p>EDIT: It appears this service only supports GET requests. On one hand you're adding new data so you should be POSTing it, right? On the other hand, you're asking to GET a graphical representation of a set of numerical values. What would a "restful" version of a web graphing API look like? Maybe some of the REST gurus can clear that up.</p>Take the larger view of GIS2007-12-05T00:00:00-07:002007-12-05T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-12-05:/take-the-larger-view-of-gis.html<p>It's interesting to see the passionate responses to <a href="http://apb.directionsmag.com/archives/3703-Neogeography-is-not-GIS;-not-LI.html">Joe Francia's article</a> claiming that neogeography is != GIS. On one side there is a small group of folks bashing neogeography and claiming the superiority of "GIS". On the other side there is the attitude claiming that some "revolution" has occurred which has …</p><p>It's interesting to see the passionate responses to <a href="http://apb.directionsmag.com/archives/3703-Neogeography-is-not-GIS;-not-LI.html">Joe Francia's article</a> claiming that neogeography is != GIS. On one side there is a small group of folks bashing neogeography and claiming the superiority of "GIS". On the other side there is the attitude claiming that some "revolution" has occurred which has supplanted traditional geographic techniques. You'd think there was a cold war going on! Both memes are as wrong as they are arrogant. </p>
<p>I have always defined GIS as</p>
<blockquote>
<p>Geographic Information System: The integration of hardware, software, procedures and people to manage the collection, creation, analysis, synthesis, sharing and visualization of spatial information.</p>
</blockquote>
<p>Neogeography easily fits that bill. So does Enterprise IT. So does Desktop mapping. So does Geostatistics. Geodesy. Web Mapping. Remote Sensing. LBS mobile technologies. Cartography. Surveying. Spatial Analysis and Modeling. Database management. Sensor webs. GPS... These disciplines are all a small piece of the larger puzzle that is GIS (whether their staunch adherents will admit to it or not!). </p>
<p>The key word in this controversial acronym is <strong>S</strong>ystem. In order for any organization to implement a successful GIS, they must figure out a) which technologies will work for them and b) how to integrate them into a coherent whole. All of these aspects of GIS have something to offer so it's important not to get stuck in a rut with blinders on. This goes for all "sides" of this ridiculous "neogeo vs GIS" argument. </p>For the cartographers in the house…2007-12-04T00:00:00-07:002007-12-04T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-12-04:/for-the-cartographers-in-the-house.html<p>Here's another one for the blogrolls:</p>
<p><a href="http://strangemaps.wordpress.com/">http://strangemaps.wordpress.com/</a></p>Privacy, Location Technology and Bad Journalism2007-11-20T00:00:00-07:002007-11-20T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-11-20:/privacy-location-technology-and-bad-journalism.html<p>The Ventura Star has run an article about <a href="http://www.venturacountystar.com/news/2007/nov/18/where-are-you-in-life-if-you-dont-know-others-do/">privacy issues and modern geolocation technology</a>.</p>
<p>As important as this topic is, <a href="http://www.venturacountystar.com/staff/john-moore/contact/">John Moore</a> (the author) is clearly uninformed. This is a <em>horrible</em> piece of journalism. Moore mixes the potential negative effects of various technologies such as RFID, cellular communication, sensor networks …</p><p>The Ventura Star has run an article about <a href="http://www.venturacountystar.com/news/2007/nov/18/where-are-you-in-life-if-you-dont-know-others-do/">privacy issues and modern geolocation technology</a>.</p>
<p>As important as this topic is, <a href="http://www.venturacountystar.com/staff/john-moore/contact/">John Moore</a> (the author) is clearly uninformed. This is a <em>horrible</em> piece of journalism. Moore mixes the potential negative effects of various technologies such as RFID, cellular communication, sensor networks, nanotech, community data collection efforts, navigation systems, and GPS into one chilling, over-simplified and baseless viewpoint. Instead of reporting the details of <a href="http://www.geog.ucsb.edu/~good/">Michael Goodchild's</a> talk at Ventura College, he treated us to his own paranoid, incoherent vision of the future of technology. Moore's entire premise is based on the fact that: </p>
<blockquote>
<p>"GPS is a system that basically allows you to know where you are anywhere in the world within one meter" </p>
</blockquote>
<p>That much is true. He uses this fact to extrapolate the conclusion that GPS allows some nefarious force to monitor your groceries, cell phone calls, and indeed your every movement.</p>
<p>GPS <em>receives</em> satellite signals and translates those signals into a location. It takes an entirely different technology to transmit these locations to some third party. I guarantee you that none of my gps tracks have gotten into anyone's hands without my consent (come on John Moore, prove me otherwise). </p>
<p>The title speaks volumes to his ignorance: </p>
<blockquote>
<p>"Where are you in life? If you don't know, others using GPS devices do" </p>
</blockquote>
<p>Suggesting that other people with GPS can a) track my movements or b) be tracked by me, shows a complete lack of understanding of the technology. Sure there are privacy dangers. But those dangers must be presented clearly and concisely by someone with half a clue, not this paranoid bullshit journalism. This article would not even pass as a high school essay. </p>Looking for LIDAR services2007-11-12T00:00:00-07:002007-11-12T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2007-11-12:/looking-for-lidar-services.html<p>I'm looking for a LIDAR specialist to fly some sensors around San Diego. Ideally we would need someone who could collect both LIDAR data and digital aerial photography (high-res but only visible spectrum), process the data (generate bare-earth DEMs and georeferenced aerials) and deliver it in a GIS-compatible format. This …</p><p>I'm looking for a LIDAR specialist to fly some sensors around San Diego. Ideally we would need someone who could collect both LIDAR data and digital aerial photography (high-res but only visible spectrum), process the data (generate bare-earth DEMs and georeferenced aerials) and deliver it in a GIS-compatible format. This is in response to the recent fires related to erosion control.. with rainy season coming we'd be on a tight schedule.</p>
<p>Does anyone have any suggestions of good companies who could provide this service? Please feel free to recommend your own services if you think it would be a good fit.</p>
<p>You can also contact me directly at perrygeo+lidar at gmail.com</p>Poetics of Cartography2007-10-20T00:00:00-06:002007-10-20T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-10-20:/poetics-of-cartography.html<p>In case you missed the fantastic Chicago public radio program last night on <a href="http://www.thisamericanlife.org/Radio_Episode.aspx?episode=110">This American Life</a>, the NPR-syndicated show did an entire program on "mapping". It goes well beyond the idea of simply mapping our physical infrastructure and really opens up the idea of mapping to the widest possible definition …</p><p>In case you missed the fantastic Chicago public radio program last night on <a href="http://www.thisamericanlife.org/Radio_Episode.aspx?episode=110">This American Life</a>, the NPR-syndicated show did an entire program on "mapping". It goes well beyond the idea of simply mapping our physical infrastructure and really opens up the idea of mapping to the widest possible definition; using all our senses to create a multi-dimensional representation of our world. Within the vast experience of life, mapping is described as the abstract process of summarizing and synthesizing a singular slice of that experience.</p>
<p>The show is available<a href="http://www.thisamericanlife.org/Radio_Episode.aspx?episode=110"> as a stream</a> and is really worth a listen this weekend.</p>
<p>P.S. The title of this post comes directly from a quote by Denis Wood, the author of <a href="http://www.amazon.com/Power-Maps-Denis-Wood/dp/0898624932/ref=pd_bbs_2/104-8757092-7919961?ie=UTF8&s=books&qid=1192849090&sr=8-2">The Power Of Maps</a> and geographer who is mapping some non-conventional aspects of his neighborhood in Raleigh, North Carolina. The first and arguably most interesting portion of the show from a geographer's standpoint.</p>Turning Ubuntu into a GIS workstation2007-10-20T00:00:00-06:002007-10-20T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-10-20:/turning-ubuntu-into-a-gis-workstation.html<p>It just keeps getting easier and easier to get a fully functional open source GIS workstation up and running thanks to Ubuntu. The following instructions will take your vanilla installation of <a href="http://www.ubuntu.com/getubuntu">Ubuntu 7.10</a> and add the following top-notch desktop GIS applications:</p>
<ul>
<li>
<p>Postgresql/PostGIS : a relational database with vector spatial …</p></li></ul><p>It just keeps getting easier and easier to get a fully functional open source GIS workstation up and running thanks to Ubuntu. The following instructions will take your vanilla installation of <a href="http://www.ubuntu.com/getubuntu">Ubuntu 7.10</a> and add the following top-notch desktop GIS applications:</p>
<ul>
<li>
<p>Postgresql/PostGIS : a relational database with vector spatial data handling </p>
</li>
<li>
<p>GRASS : A full blown GIS analysis toolset </p>
</li>
<li>
<p>Quantum GIS: A user-friendly graphical GIS application </p>
</li>
<li>
<p>GDAL, Proj, Geos : Libraries and utilities for processing spatial data </p>
</li>
<li>
<p>Mapserver : web mapping program and utilites</p>
</li>
<li>
<p>Python bindings for QGIS, mapserver and GDAL </p>
</li>
<li>
<p>GPSBabel : for converting between various GPS formats </p>
</li>
<li>
<p>R : a high-end statistics package with spatial capabilities </p>
</li>
<li>
<p>GMT : the Generic Mapping Tools for automated high-quality map output </p>
</li>
</ul>
<p>While this is not a comprehensive list of open source GIS software, these packages cover most of my needs. If you want to live on the bleeding edge and have to have the absolute latest versions, you'll be better off installing these from source. But for those of us that want a stable and highly functional GIS workstation with minimal fuss, this is the way to go:</p>
<ol>
<li>
<p>Go to <em>System > Administration > Software Sources</em> and make sure the universe and multiverse repositories are turned on. Close the window and the list of available software packages will be refreshed.</p>
</li>
<li>
<p>Open up a terminal (ie the command line) via <em>Applications > Accessories > Terminal</em> and type the following:</p>
</li>
</ol>
<div class="highlight"><pre><code>sudo apt-get -y install qgis grass qgis-plugin-grass mapserver-bin gdal-bin cgi-mapserver \
python-qt4 python-sip4 python-gdal python-mapscript gmt gmt-coastline-data \
r-recommended gpsbabel shapelib libgdal1-1.4.0-grass
</code></pre></div>
<p>The <em>sudo</em> part indicates that the command will be run as the administrator user; <em>apt-get -y install</em> is the command telling it to install the list of packages and answer yes to any questions that pop up. </p>
<ol>
<li>There is one package that is worth upgrading to the latest and greatest - Quantum GIS. The latest version (0.9) is due out very shortly and has the ability to write plugins using the python programming language. A big plus! </li>
</ol>
<p>Download the latest build from <a href="http://qgis.org/uploadfiles/testbuilds/qgis0.9.0.debs_ubuntu_gutsy.tar.gz">http://qgis.org/uploadfiles/testbuilds/qgis0.9.0.debs_ubuntu_gutsy.tar.gz</a> and extract it ( right-click > Extract Here ). In the directory you'll see 4 .deb files, only 3 of which you'll need unless you plan on doing any development work.</p>
<p>Double click libqgis1_0.9.0_i386.deb and you'll get a message saying an older version is available directly from Ubuntu. We already know this so just close and ignore it. Click <em>Install Package</em> and wait for it to complete then close out.</p>
<p>Repeat for qgis_0.9.0_i386.deb and qgis-plugin-grass_0.9.0_i386.deb (in that order).</p>
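<p>A quick sanity check I like to run afterwards (my own habit, not part of the install steps) is to try importing the Python bindings. Note that <code>osgeo.gdal</code> and <code>osgeo.ogr</code> are the newer module names; older python-gdal packages exposed plain <code>gdal</code> and <code>ogr</code>:</p>

```python
# try each binding in turn; record and print the result rather than
# letting one failed import kill the whole check
results = {}
for module in ("osgeo.gdal", "osgeo.ogr", "mapscript"):
    try:
        __import__(module)
        results[module] = "ok"
    except ImportError as err:
        results[module] = "missing (%s)" % err
    print("%s: %s" % (module, results[module]))
```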
<p>And there we have it, about 15 minutes depending on your internet speed and you've installed a high-end GIS workstation built completely on free and open source software.</p>Update to QGIS Geocoding plugin2007-10-19T00:00:00-06:002007-10-19T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-10-19:/update-to-qgis-geocoding-plugin.html<p>With the release of QGIS 0.9 imminent, I decided to install it on Windows XP and noticed that <a href="http://www.perrygeo.net/wordpress/?p=60">the geocoding plugin </a>was failing... sure enough I had hardcoded linux temporary directories. So I reworked the python code to determine the temp dir in a more cross-platform way (using tempfile …</p><p>With the release of QGIS 0.9 imminent, I decided to install it on Windows XP and noticed that <a href="http://www.perrygeo.net/wordpress/?p=60">the geocoding plugin </a>was failing... sure enough I had hardcoded linux temporary directories. So I reworked the python code to determine the temp dir in a more cross-platform way (using tempfile.gettempdir() ) and it works fine.</p>
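<p>For the curious, the fix boils down to asking the standard library for the temp directory instead of hardcoding <code>/tmp</code> (the filename below is just for illustration):</p>

```python
import os
import tempfile

# resolves to something like C:\...\Temp on Windows and /tmp on Linux,
# so the same plugin code runs on both platforms
geocode_result = os.path.join(tempfile.gettempdir(), "geocode_result.shp")
print(geocode_result)
```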
<p>The update can be downloaded <a href="http://perrygeo.googlecode.com/svn/trunk/qgis/geocode.zip">here</a>.</p>
<p>Assuming you've installed qgis in the standard location, just unzip this into C:\Program Files\Quantum GIS\python\plugins (windows) or /usr/share/qgis/python/plugins (Linux) and you should be good to go. Note that you'll have to create the "plugins" directory if it doesn't exist.</p>CTech software goes multithreaded2007-10-12T00:00:00-06:002007-10-12T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-10-12:/ctech-software-goes-multithreaded.html<p>CTech has announced that the next version of its flagship software package, <a href="http://www.ctech.com/index.php?page=evspro">EVS (Environmental Visualization System)</a>, will take full advantage of multiple processors. </p>
<p><img alt="" src="/assets/img/evs.gif"></p>
<p>My experience with EVS is mostly in the realm of 3-dimensional kriging and geostatistics. Given the amount of data crunching involved, it's always been sluggish when dealing …</p><p>CTech has announced that the next version of its flagship software package, <a href="http://www.ctech.com/index.php?page=evspro">EVS (Environmental Visualization System)</a>, will take full advantage of multiple processors. </p>
<p><img alt="" src="/assets/img/evs.gif"></p>
<p>My experience with EVS is mostly in the realm of 3-dimensional kriging and geostatistics. Given the amount of data crunching involved, it's always been sluggish when dealing with a non-trivial amount of data. Nothing is more frustrating than seeing one of your CPU cores cranking away while the others sit idle! But <a href="http://www.ctech.com/forum/viewtopic.php?pid=213#213">some users are reporting</a> that the new multithreaded modules get nearly linear performance increases when adding more processing cores.</p>
<p>CTech is certainly not the first scientific/geostats application to go parallel. But it is the first program that I personally use on a regular basis that will take advantage of a multi-processor system. I hope this marks the beginning of an industry trend in that direction.</p>Autodesk open sources coordinate system software2007-09-25T00:00:00-06:002007-09-25T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-09-25:/autodesk-open-sources-coordinate-system-software.html<p>Not very often do I see open source mentioned on the front page of my Google Finance page (let alone Geospatial Open Source). But here it is.. the announcement was made at FOSS4G2007 that <a href="http://money.cnn.com/news/newsfeeds/articles/prnewswire/AQTU16425092007-1.htm"> autodesk will be open sourcing part of its coordinate system and map projection technology</a>. </p>
<p>So what …</p><p>Not very often do I see open source mentioned on the front page of my Google Finance page (let alone Geospatial Open Source). But here it is.. the announcement was made at FOSS4G2007 that <a href="http://money.cnn.com/news/newsfeeds/articles/prnewswire/AQTU16425092007-1.htm"> autodesk will be open sourcing part of its coordinate system and map projection technology</a>. </p>
<p>So what motivation does Autodesk (or any other company) have to open source its technology? An important line from Lisa Campbell, vice president, Autodesk Geospatial:</p>
<blockquote>
<p>"Our intent to contribute again to the open source community is a reflection of our customers' desire for faster innovation, more frequent product releases, and lower total cost of ownership."</p>
</blockquote>Parallel python and GIS2007-09-18T00:00:00-06:002007-09-18T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-09-18:/parallel-python-and-gis.html<p>Let's face it - processing speeds aren't going to be increasing according to Moore's Law anymore; instead of faster CPUs, <a href="http://www.gotw.ca/publications/concurrency-ddj.htm">we'll be getting more of them</a>. The future of programming, it seems to me, lies in the ability to leverage multiple processors. In other words, we have to write parallel code …</p><p>Let's face it - processing speeds aren't going to be increasing according to Moore's Law anymore; instead of faster CPUs, <a href="http://www.gotw.ca/publications/concurrency-ddj.htm">we'll be getting more of them</a>. The future of programming, it seems to me, lies in the ability to leverage multiple processors. In other words, we have to write parallel code. Until I read <a href="http://zcologia.com/news/571/catching-up-with-python/">Sean's post</a>, I was unaware that there was a viable python solution. I had been growing quite disillusioned by python's dreaded <a href="http://www.pyzine.com/Issue001/Section_Articles/article_ThreadingGlobalInterpreter.html">Global Interpreter Lock</a> which confines python to a single processing core. I've even started learning <a href="http://www.erlang.org/">Erlang</a> to leverage SMP processing (until I realized that Erlang and its standard libraries are virtually useless for anything that needs to handle geospatial data).</p>
<p>So I gave <a href="http://www.parallelpython.com/">Parallel Python</a> (pp) a shot. Since Sean also offered up a bounty for the first GIS application that used pp, I thought it might be a good time to try ;-)</p>
<p>A good candidate for parallel processing is any application that has to crunch away on lists/arrays of data and whose individual members can be handled independently (see <a href="http://www.erlang.org/ml-archive/erlang-questions/200606/msg00130.html">pmap in Erlang</a>). I have been working on <a href="http://perrygeo.googlecode.com/svn/trunk/gis-bin/bezier_smooth_pp.py">an application to smooth linework using bezier curves</a>. It's not quite polished yet but the image below shows the before and after</p>
<p><img alt="" src="/assets/img/smoothed.jpg"></p>
<p>... but <a href="http://en.wikipedia.org/wiki/B%C3%A9zier_curve">bezier curves</a> aren't quite the subject of this post. Let's just say the algorithm takes some time to compute (if you're using a high density of vertices) and can be handled one LineString feature at a time. This makes it a prime candidate for parallelization.</p>
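<p>As an aside, the per-vertex work being farmed out boils down to evaluating the standard cubic bezier polynomial. A sketch of that formula (my own, not the script's actual <code>computeBezier</code>):</p>

```python
def cubic_bezier_point(p0, p1, p2, p3, t):
    """B(t) = (1-t)^3*P0 + 3(1-t)^2*t*P1 + 3(1-t)*t^2*P2 + t^3*P3,
    evaluated coordinate-by-coordinate for 0 <= t <= 1."""
    mt = 1.0 - t
    return tuple(
        mt**3 * a + 3 * mt**2 * t * b + 3 * mt * t**2 * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

# the curve starts at the first control point and ends at the last
start = cubic_bezier_point((0, 0), (1, 2), (3, 2), (4, 0), 0.0)
end = cubic_bezier_point((0, 0), (1, 2), (3, 2), (4, 0), 1.0)
```

Sampling t at a handful of evenly spaced values between each pair of original vertices is what produces the extra, smoothed vertices.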
<p>Given a list of input LineStrings, I could process them the sequential way:</p>
<div class="highlight"><pre><span></span><code>smooth_lines = []
for line in lines:
    smooth_lines.append( calcBezierFromLine( line, num_bezpts, beztype, t) )
</code></pre></div>
<p>Or use pp to start up a "job server" which doles tasks out to a pool of "workers". A busy worker utilizes a single processing core, so a good rule of thumb is to start as many workers as you have CPU cores:</p>
<div class="highlight"><pre><span></span><code>numworkers = 2  # dual-core machine
job_server = pp.Server(numworkers, ppservers=ppservers)
smooth_lines = []
jobs = [(line, job_server.submit(calcBezierFromLine, (line, num_bezpts, beztype, t), \
        (computeBezier, getPointOnCubicBezier), ("numpy",) )) for line in lines]
for input, job in jobs:
    smooth_lines.append( job() )
</code></pre></div>
<p>Theoretically, the parallelized version should run twice as fast as the sequential version on my core2 duo machine. And reality was pretty darn close to that:</p>
<div class="highlight"><pre><span></span><code>$ time python bezier_smooth_pp.py 2
Shapefile contains 1114 lines
Starting pp with 2 workers
Completed 1114 new lines with 8 additional verticies for each line segment along a cubic bezier curve
real 0m10.908s
...
$ time python bezier_smooth_pp.py 1
Shapefile contains 1114 lines
Starting pp with 1 workers
Completed 1114 new lines with 8 additional verticies for each line segment along a cubic bezier curve
real 0m20.007s
...
</code></pre></div>
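<p>The same fan-out pattern now lives in Python's standard library. A minimal sketch using <code>multiprocessing</code> (not the post's actual script; <code>smooth</code> is a hypothetical stand-in for <code>calcBezierFromLine</code>):</p>

```python
# Sketch of the pp fan-out pattern with the stdlib multiprocessing module.
# `smooth` is a hypothetical stand-in for the expensive per-feature work.
from multiprocessing import Pool

def smooth(line):
    # placeholder computation: pretend to densify/transform the vertices
    return [(x * 2, y * 2) for x, y in line]

def smooth_all(lines, numworkers=2):
    # one worker per CPU core is a good rule of thumb
    with Pool(numworkers) as pool:
        return pool.map(smooth, lines)

if __name__ == "__main__":
    lines = [[(0, 0), (1, 1)], [(2, 2), (3, 3)]]
    print(smooth_all(lines))
```

<p>As with pp, the only requirement is that each feature can be processed independently; the pool handles doling the work out to the cores.</p>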
<p>Just think of the possibilities. In the foreseeable future, the average computer might have 8+ cores to work with. This could mean that your app will run 8x faster if you parallelize the code (assuming there are no IO or bandwidth bottlenecks). I'd love to test it out on a system with more than 2 processing cores but, unfortunately, I don't have access to any <a href="http://www.calvin.edu/~adams/research/microwulf/">beowulf clusters</a>, <a href="http://www.sun.com/processors/UltraSPARC-T1/"> Sun UltraSparc servers,</a> or <a href="http://www.apple.com/macpro/">8-core Xeon Mac Pros</a>. This is what I <em>really need</em> to complete my research ;-) So if anyone wants to donate to the cause, send me an email! </p>
<p>And to answer Sean's bounty, I don't consider this an actual application (yet) but I hope it can spur some interest and move things in that direction. But if you feel the need to send me some New Belgium swag (or one of the machines listed above), feel free ;-)</p>The world turned right-side up2007-09-05T00:00:00-06:002007-09-05T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-09-05:/the-world-turned-right-side-up.html<p>I've been working a lot in <a href="http://www.goldensoftware.com/products/surfer/surfer.shtml">Surfer</a> these days, an excellent geostats and surface mapping package. I was very happy to find that GDAL reads its .grd binary format until I noticed the output from gdalinfo:</p>
<div class="highlight"><pre><span></span><code>> C:\Workspace\Temp\interpolation>gdalinfo svpce_5.grd
Driver: GS7BG/Golden Software 7 Binary Grid (.grd …</code></pre></div><p>I've been working a lot in <a href="http://www.goldensoftware.com/products/surfer/surfer.shtml">Surfer</a> these days, an excellent geostats and surface mapping package. I was very happy to find that GDAL reads its .grd binary format until I noticed the output from gdalinfo:</p>
<div class="highlight"><pre><span></span><code>> C:\Workspace\Temp\interpolation>gdalinfo svpce_5.grd
Driver: GS7BG/Golden Software 7 Binary Grid (.grd)
Files: svpce_5.grd
Size is 555, 339
Coordinate System is `'
Origin = (383371.000000000000000,3764907.000000000000000)
Pixel Size = (0.500000000000000,0.500000000000000)
Corner Coordinates:
Upper Left ( 383371.000, 3764907.000)
Lower Left ( 383371.000, 3765076.500)
Upper Right ( 383648.500, 3764907.000)
Lower Right ( 383648.500, 3765076.500)
Center ( 383509.750, 3764991.750)
Band 1 Block=555x1 Type=Float64, ColorInterp=Undefined
NoData Value=1.70141e+038
</code></pre></div>
<p>Notice that the upper Y value is <em>south</em> of the lower Y value! Basically, the raster's row order is reversed (bottom-to-top instead of the normal raster orientation of top-to-bottom). I've also experienced the same issue with some NetCDF files, so I thought it would be good to have a generic solution to the problem.</p>
<p>So I hacked up the gdal_merge.py script (distributed with gdal, fwtools, etc) and created a raster flip script that will invert the image along the y axis and retain the georeferencing and metadata. The resulting <a href="http://perrygeo.googlecode.com/svn/trunk/gis-bin/flip_raster.py">flip_raster.py</a> script seems to work pretty well though it is far from tested.</p>
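<p>The bookkeeping involved can be sketched in a few lines. This is a pure-Python stand-in, not the flip_raster.py code itself (the real script reads and writes bands through GDAL); the geotransform follows GDAL's six-element convention:</p>

```python
# Flip a bottom-up raster to the usual north-up orientation:
# reverse the row order and move the origin to the true top edge.
# gt follows GDAL's convention: (x0, px_w, rot1, y0, rot2, px_h).

def flip_raster(rows, gt):
    x0, px_w, r1, y0, r2, px_h = gt
    # the new origin sits one full raster height away; pixel height flips sign
    new_gt = (x0, px_w, r1, y0 + px_h * len(rows), r2, -px_h)
    return rows[::-1], new_gt
```

<p>Plugging in the gdalinfo numbers above: a 339-row grid with origin Y 3764907 and pixel height 0.5 comes out with origin Y 3765076.5 and pixel height -0.5, which matches the corner coordinates you'd expect from a north-up file.</p>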
<p>Here's an example:</p>
<p>The standard gdal_translate method (which doesn't account for the inverted coordinate space):</p>
<blockquote>
<p>gdal_translate -of GTiff krig1.grd krig1_translate.tif</p>
</blockquote>
<p><img alt="" src="/assets/img/standard.jpg"></p>
<p>And the flipped raster method:</p>
<blockquote>
<p>flip_raster.py -o krig1_flip.tif -of GTiff krig1.grd </p>
</blockquote>
<p><img alt="" src="/assets/img/flipped.jpg"></p>
<p>And we're good. gdalinfo confirms that we have the same extents, pixel sizes, metadata, etc as the original dataset. </p>Mapserver vs Mapnik revisited2007-09-04T00:00:00-06:002007-09-04T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-09-04:/mapserver-vs-mapnik-revisited.html<p>A while ago, I was enamored with mapnik's image quality despite its limitations compared to the vast configurability of the mapserver mapfile. Now that mapserver uses the AGG rendering library, it might not be necessary to compromise configurability in order to get beautiful linework. I just installed the recent beta …</p><p>A while ago, I was enamored with mapnik's image quality despite its limitations compared to the vast configurability of the mapserver mapfile. Now that mapserver uses the AGG rendering library, it might not be necessary to compromise configurability in order to get beautiful linework. I just installed the recent beta of mapserver 5.0 and the image quality is very crisp... but this comes at the expense of rendering speed.</p>
<p>All the times below are the average of ten runs using a full global view of a simplified shapefile of country borders. </p>
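<p>A benchmark like this is easy to script. A minimal sketch (the command list is a placeholder; substitute your shp2img or mapnik invocation):</p>

```python
# Time an external rendering command N times and report the mean.
# The cmd list is a placeholder -- substitute the renderer under test.
import subprocess
import time

def mean_runtime(cmd, runs=10):
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

if __name__ == "__main__":
    # e.g. mean_runtime(["shp2img", "-m", "test.map", "-o", "out.jpg"])
    import sys
    print(mean_runtime([sys.executable, "-c", "pass"], runs=3))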
<p><img alt="" src="/assets/img/mapserver_gd_test.jpg"></p>
<p><strong>mapserver (gd) : 0.082 sec , 18kb</strong></p>
<blockquote>
<p>OUTPUTFORMAT
NAME "GD_JPEG"
DRIVER "GD/JPEG"
MIMETYPE "image/jpeg"
IMAGEMODE RGB
EXTENSION "jpg"
END</p>
</blockquote>
<p>shp2img -m test.map -o mapserver_gd_test.jpg</p>
<p><img alt="" src="/assets/img/mapserver_agg_test.jpg"></p>
<p><strong>mapserver (agg) : 0.188 sec , 16kb</strong></p>
<blockquote>
<p>IMAGEQUALITY 80
OUTPUTFORMAT
NAME 'AGG_JPEG'
DRIVER AGG/JPEG
IMAGEMODE RGB
END</p>
</blockquote>
<ul>
<li>Note that if we bump IMAGEQUALITY up to 90% to (roughly) match the mapnik image, the rendering time and size increase a bit (.201 sec, 25kb)</li>
</ul>
<p>shp2img -m test.map -o mapserver_agg_test.jpg</p>
<p><img alt="" src="/assets/img/mapnik_output.jpg"></p>
<p><strong>mapnik (agg) : 0.282 sec, 23kb</strong></p>
<p>python test_mapnik.py</p>
<ul>
<li>Running this through the python interpreter likely adds overhead of its own, so these times may not be directly comparable to shp2img.</li>
</ul>
<p>Using these preliminary results, it looks like mapserver 5.0 with AGG rendering is roughly equal to mapnik based on a balance of quality/speed/image size. But since I'd prefer to use mapfiles over the undocumented mapnik xml format any day, I think I'll stick with my beloved mapserver. Kudos to the mapserver developers for raising the bar once again.</p>Performance testing rasters with mapserver2007-09-04T00:00:00-06:002007-09-04T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-09-04:/performance-testing-rasters-with-mapserver.html<p>There's been some good talk on the mapserver list (thanks to Gregor's diligent testing) about performance related to serving up raster imagery. </p>
<p>First off, comparisons of <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=1526">image</a> <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=1526">formats. </a>Then a look at some TIFF <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=2214">optimization</a> <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=4492">techniques</a> like overviews (similar to "pyramids" in ESRI land) and internal tiling to boost rendering …</p><p>There's been some good talk on the mapserver list (thanks to Gregor's diligent testing) about performance related to serving up raster imagery. </p>
<p>First off, comparisons of <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=1526">image</a> <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=1526">formats. </a>Then a look at some TIFF <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=2214">optimization</a> <a href="http://lists.umn.edu/cgi-bin/wa?A2=ind0709&L=mapserver-users&T=0&O=D&P=4492">techniques</a> like overviews (similar to "pyramids" in ESRI land) and internal tiling to boost rendering speed. </p>
<p>Most of the conclusions are not all that staggering: </p>
<ul>
<li>
<p>TIFF is fastest but takes up more space compared to ECW and JPEG2000. </p>
</li>
<li>
<p>Overviews speed up TIFFs tremendously when zoomed out (i.e. when mapserver would otherwise have to perform some heavy downsampling) </p>
</li>
<li>
<p>Internal tiles in GeoTIFF format give a boost when zoomed in (only the necessary tiles are read from disk) </p>
</li>
<li>
<p>The TIFF comparison was run on two setups: a monstrous 8-core, RAID-5 equipped beast and a low-memory virtual machine on low-end PC hardware. The TIFF optimizations are very noticeable on the lesser machine but almost completely negligible on the high-end machine. </p>
</li>
</ul>
<blockquote>
<p>Both tiling and overviews are useful, but only on machines with resource
shortages, such as slow disks or a lack of spare RAM for caching.</p>
</blockquote>
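<p>For reference, both optimizations are applied with stock GDAL command-line tools (the filenames here are placeholders):</p>

```shell
# Build overviews ("pyramids") so zoomed-out requests read downsampled data
gdaladdo -r average ortho.tif 2 4 8 16

# Rewrite with internal tiling so zoomed-in requests read only nearby blocks
gdal_translate -co TILED=YES ortho.tif ortho_tiled.tif
```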
<p>Nothing earth-shattering (these techniques are often mentioned as best practices) but is very nice to see some hard numbers to back it up. Plus the verbose test logs provide a good example for a newbie trying to implement them. Good stuff Gregor!</p>Mapping the Undesirable2007-08-28T00:00:00-06:002007-08-28T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-08-28:/mapping-the-undesirable.html<p>While by no means a new phenomenon, <a href="http://www.thevision2020.com/LocateSexOffenders.aspx"> Vision 20/20</a> is offering a service allowing you to see a map of the registered sex offenders in your area. WorldChanging, one of my favorite blogs on emerging technologies, has a great article discussing the issues surrounding <a href="http://www.worldchanging.com/archives/007189.html"> mapping of sex offenders </a>. </p>
<blockquote>
<p>Is …</p></blockquote><p>While by no means a new phenomenon, <a href="http://www.thevision2020.com/LocateSexOffenders.aspx"> Vision 20/20</a> is offering a service allowing you to see a map of the registered sex offenders in your area. WorldChanging, one of my favorite blogs on emerging technologies, has a great article discussing the issues surrounding <a href="http://www.worldchanging.com/archives/007189.html"> mapping of sex offenders </a>. </p>
<blockquote>
<p>Is this sort of service, based on powerful networked technologies -- and one being sold on the basis of fear -- an appropriate use of the technology? Where is the data being sourced from? How are the people inputting it being supervised? And what rights to privacy and presumptions of innocence are the people it tracks entitled to? </p>
</blockquote>
<p>These are good points, but even more disturbing to me as a citizen and a GIS professional is that these maps use geocoding services that are <a href="http://www.ij-healthgeographics.com/content/2/1/10/abstract/"> not nearly accurate enough</a> for the scale at which they are being viewed. Even in suburban areas, linear-referenced geocoding techniques can still yield errors of hundreds of meters! The margin of error in the geocoding engine alone is enough to place the sex offender icon directly on an innocent citizen's home.</p>
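<p>To see why, consider how linear-referenced (address-range) geocoding works: the house number is interpolated linearly along the street segment, so any mismatch between the assumed and actual parcel spacing shifts the point. A toy sketch (the segment coordinates and address ranges are made up):</p>

```python
# Toy linear-referenced geocoder: place house number `n` by linear
# interpolation along a street segment with address range lo..hi.
def interpolate_address(n, lo, hi, start, end):
    t = (n - lo) / float(hi - lo)  # fraction of the way along the segment
    return (start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]))

# A 196 m block assumed to span numbers 100-198: number 150 lands near the
# middle, regardless of where the actual parcel sits on the block.
print(interpolate_address(150, 100, 198, (0.0, 0.0), (196.0, 0.0)))
```

<p>If the real parcels cluster at one end of the block, the interpolated point can be off by most of the block length, which is exactly the kind of error that drops an icon on the wrong house.</p>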
<p>For instance, which of the homes in the map below is the residence of a sex offender? Does the ambiguity bother you? Would it matter more if <em>you</em> were the innocent person living next door?</p>
<p><img alt="" src="/assets/img/offender.png"></p>
<p>For maps with this much social weight, I think that a bit more diligence is due to ensure that this data is as accurate as it needs to be! </p>Zaca Lake Fire Map2007-08-03T00:00:00-06:002007-08-03T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-08-03:/zaca-lake-fire-map.html<p><img alt="" src="/assets/img/viewfromwest027.jpg"></p>
<p>Ah the joys of living in Southern California. The Zaca Lake fire has been burning since July 4th and recently <a href="http://independent.com/news/2007/aug/03/zaca-fire-explodes/">flared up again</a> with a shift in winds which is blowing ash and a very ominous plume of smoke all over downtown Santa Barbara. While it's still burning in the …</p><p><img alt="" src="/assets/img/viewfromwest027.jpg"></p>
<p>Ah the joys of living in Southern California. The Zaca Lake fire has been burning since July 4th and recently <a href="http://independent.com/news/2007/aug/03/zaca-fire-explodes/">flared up again</a> with a shift in winds which is blowing ash and a very ominous plume of smoke all over downtown Santa Barbara. While it's still burning in the wilderness areas north of town, the Paradise Road area along the Santa Ynez river has been evacuated. <a href="http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=105524280382284020010.0004351434f7c4b6bb5eb&ll=34.787162,-120.029583&spn=0.137739,0.144711&t=h&z=13&om=1">Check it out on Google Maps</a>.</p>
<p>The Santa Barbara News Press is reporting the fire has reached 39,000 acres and has cost $43 million thus far to contain. The county supervisors are likely to declare a state of emergency and there is already a health warning in effect. So much for my bike ride this afternoon...</p>Desktop vs Web UI2007-06-11T00:00:00-06:002007-06-11T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-06-11:/desktop-vs-web-ui.html<p>This might be a dup story for some but I thought it was interesting enough to post nonetheless:</p>
<p>Jeff Atwood wrote an interesting piece about Desktop vs Web UI that is directly relevant to mapping : <a href="http://www.codinghorror.com/blog/archives/000883.html">Who Killed the Desktop Application?</a>. He compares the usability of Microsoft Streets and Trips with …</p><p>This might be a dup story for some but I thought it was interesting enough to post nonetheless:</p>
<p>Jeff Atwood wrote an interesting piece about Desktop vs Web UI that is directly relevant to mapping : <a href="http://www.codinghorror.com/blog/archives/000883.html">Who Killed the Desktop Application?</a>. He compares the usability of Microsoft Streets and Trips with Google Maps and concludes </p>
<blockquote>
<p>All the innovation in user interface seems to be taking place on the web, and desktop applications just aren't keeping up. </p>
</blockquote>OGR and matplotlib examples2007-06-10T00:00:00-06:002007-06-10T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-06-10:/ogr-and-matplotlib-examples.html<p>Jose Gomez-Dans posted a great example of using OGR, Postgis and Matplotlib with Python - <a href="http://jgomezdans.googlepages.com/ogr%2Cpythonymatplotlib">OGR, Python y Matplotlib</a> (Spanish only).</p>FDO, GDAL/OGR and FME ?2007-05-31T00:00:00-06:002007-05-31T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-31:/fdo-gdalogr-and-fme.html<p><a href="http://fdo.osgeo.org/">FDO</a>, <a href="http://gdal.osgeo.org/">GDAL</a> and <a href="http://safe.com/products/fme/index.php">FME</a> all seem to operate in roughly the same domain - Providing a data model, API and tools to translate between spatial data formats. Does anyone know of any good write-ups comparing/contrasting the features of these three libraries? </p>QGIS Geocoding plugin2007-05-28T00:00:00-06:002007-05-28T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-28:/qgis-geocoding-plugin.html<p>A few weeks back, I decided to take the plunge and learn the python bindings for QGIS 0.9. My first experiment was to implement a geocoder plugin. What started mostly as a learning experiment turned into something that might actually be useful!</p>
<p>The idea was to use web services …</p><p>A few weeks back, I decided to take the plunge and learn the python bindings for QGIS 0.9. My first experiment was to implement a geocoder plugin. What started mostly as a learning experiment turned into something that might actually be useful!</p>
<p>The idea was to use web services to do all the actual geocoding work (the hard part!) and the delimited text provider to load the results into qgis. Right now it's built on top of the <a href="http://developer.yahoo.com/maps/rest/V1/geocode.html">Yahoo geocoder</a> which is, IMO, the best out there: very flexible about the input format. The <a href="http://exogen.case.edu/projects/geopy/">geopy module</a> is used to interact with the geocoding services so it could potentially support other engines such as geocoder.us, virtual earth, google, etc. </p>
<p>The user interface is very straightforward; enter a list of addresses/placenames separated by line breaks, pick an output file and go. To be legitimate, you should also sign up for a Yahoo API key, though the 'YahooDemo' key will work ok for testing purposes.</p>
<p><a href="/assets/img/dialog.jpg"><img alt="" src="/assets/img/dialog_thumb.jpg"></a></p>
<p><a href="/assets/img/result.jpg"><img alt="" src="/assets/img/result_thumb.jpg"></a></p>
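<p>The overall flow is simple enough to sketch: geocode each input line, then write a delimited text file the QGIS provider can load. In this sketch the geocoding call is a stand-in function (the real plugin goes through geopy and a web service):</p>

```python
# Sketch of the plugin's flow: geocode addresses, emit delimited text.
# geocode_fn is a hypothetical stand-in for the geopy/web-service call;
# it should return a (lon, lat) pair for a free-form address string.
import csv
import io

def geocode_to_csv(addresses, geocode_fn):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["address", "lon", "lat"])  # header for the provider
    for addr in addresses:
        lon, lat = geocode_fn(addr)
        writer.writerow([addr, lon, lat])
    return buf.getvalue()
```

<p>Swapping geocoding engines then just means swapping the function you pass in, which is exactly the flexibility geopy offers.</p>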
<p>Here's the install process (assuming you already have <a href="http://www.reprojected.com/presentations/Videos/qgis_install_051407/install_qgis.txt">python, pyqt4, qgis 0.9, qgis bindings, etc. set up</a>):</p>
<blockquote>
<p>svn checkout http://perrygeo.googlecode.com/svn/trunk/qgis/geocode
cd geocode
emacs Makefile # change install directory if needed
sudo make install</p>
</blockquote>
<p>This is just a rough cut and it's my first attempt at using the qgis and qt apis so there are probably many things that could be improved upon. Ideally this plugin could:</p>
<ul>
<li>
<p>Parse text files as input </p>
</li>
<li>
<p>Allow for a choice of geocoding engine </p>
</li>
<li>
<p>??? </p>
</li>
</ul>
<p>Feedback (and patches) welcome ;-)</p>Python gpsd bindings2007-05-27T00:00:00-06:002007-05-27T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-27:/python-gpsd-bindings.html<p>If you want to get a linux/unix machine talking to your GPS unit, most likely you'll be using <a href="http://gpsd.berlios.de/">gpsd</a>. There are many great apps that build off of gpsd such as kismet and gpsdrive. </p>
<p>Installing gpsd on debian/ubuntu systems is as simple as </p>
<div class="highlight"><pre><span></span><code>sudo apt-get install gpsd gpsd-clients …</code></pre></div><p>If you want to get a linux/unix machine talking to your GPS unit, most likely you'll be using <a href="http://gpsd.berlios.de/">gpsd</a>. There are many great apps that build off of gpsd such as kismet and gpsdrive. </p>
<p>Installing gpsd on debian/ubuntu systems is as simple as </p>
<div class="highlight"><pre><span></span><code>sudo apt-get install gpsd gpsd-clients
</code></pre></div>
<p>You should be able to connect your gps via serial port and start a gpsd server </p>
<div class="highlight"><pre><span></span><code>sudo gpsd /dev/ttyS0
</code></pre></div>
<p>The gpsd server reads NMEA sentences from the gps unit and is accessed on port 2947. You can test if everything is working by running a pre-built gpsd client such as xgps.</p>
<p>This is very useful for situations where you need lower-level access to the gps data; for logging your position to a postgres database for example. The debian packages (and most others I'm assuming) come with gps.py, a python interface to gpsd allowing you to pull your lat/long from the gps in real time. This opens the door for all sorts of neat real-time gps apps.</p>
<blockquote>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">gps</span><span class="o">,</span> <span class="nn">os</span><span class="o">,</span> <span class="nn">time</span>
<span class="n">session</span> <span class="o">=</span> <span class="n">gps</span><span class="o">.</span><span class="n">gps</span><span class="p">()</span>
<span class="k">while</span> <span class="mi">1</span><span class="p">:</span>
    <span class="n">os</span><span class="o">.</span><span class="n">system</span><span class="p">(</span><span class="s1">'clear'</span><span class="p">)</span>
    <span class="n">session</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s1">'admosy'</span><span class="p">)</span>
    <span class="c1"># a = altitude, d = date/time, m=mode, </span>
    <span class="c1"># o=position/fix, s=status, y=satellites</span>
    <span class="nb">print</span>
    <span class="nb">print</span> <span class="s1">' GPS reading'</span>
    <span class="nb">print</span> <span class="s1">'----------------------------------------'</span>
    <span class="nb">print</span> <span class="s1">'latitude '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">latitude</span>
    <span class="nb">print</span> <span class="s1">'longitude '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">longitude</span>
    <span class="nb">print</span> <span class="s1">'time utc '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">utc</span><span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">time</span>
    <span class="nb">print</span> <span class="s1">'altitude '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">altitude</span>
    <span class="nb">print</span> <span class="s1">'eph '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">eph</span>
    <span class="nb">print</span> <span class="s1">'epv '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">epv</span>
    <span class="nb">print</span> <span class="s1">'ept '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">ept</span>
    <span class="nb">print</span> <span class="s1">'speed '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">speed</span>
    <span class="nb">print</span> <span class="s1">'climb '</span> <span class="p">,</span> <span class="n">session</span><span class="o">.</span><span class="n">fix</span><span class="o">.</span><span class="n">climb</span>
    <span class="nb">print</span>
    <span class="nb">print</span> <span class="s1">' Satellites (total of'</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">session</span><span class="o">.</span><span class="n">satellites</span><span class="p">)</span> <span class="p">,</span> <span class="s1">' in view)'</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">session</span><span class="o">.</span><span class="n">satellites</span><span class="p">:</span>
        <span class="nb">print</span> <span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">i</span>
    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
</blockquote>
<p>... which gives you a simple readout to the terminal every 3 seconds.</p>
<p><img alt="" src="/assets/img/gpsd_python.jpg"></p>
<p>Obviously there are much more interesting applications for this ( logging data to postgis, displaying real-time tracking data in QGIS via a python plugin, etc). But this is a good start for any python based app.</p>Sparklines in python2007-05-19T00:00:00-06:002007-05-19T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-19:/sparklines-in-python.html<p>Edward Tufte, the outspoken guru of data visualization, has long been an advocate of clear and concise (almost minimalist) graphical representations of data. He's got a lot of great ideas relevant to cartography (my cartography course at Humboldt State used his book "The Visual Display of Quantitative Information" as our …</p><p>Edward Tufte, the outspoken guru of data visualization, has long been an advocate of clear and concise (almost minimalist) graphical representations of data. He's got a lot of great ideas relevant to cartography (my cartography course at Humboldt State used his book "The Visual Display of Quantitative Information" as our text). </p>
<p>One of the coolest ideas is "sparklines", which he describes as "data-intense, design-simple, word-sized graphics". Instead of standalone charts that are often placed on their own and separate from the text that discusses them, sparklines are meant to be placed in-line with the text and provide memorable, simple and contextually-relevant data to support the surrounding text. For example:</p>
<p><em>The US National Debt as a percentage of GDP increased during the Reagan and Bush presidencies <img alt="" src="/assets/img/reaganbush.GIF"> but dropped off slightly during the Clinton administration <img alt="" src="/assets/img/clinton.GIF">.</em></p>
<p>Now of course I had to figure out how to produce these in python. There's a great <a href="http://bitworking.org/projects/sparklines/#source">cgi application</a>, written in python by Joe Gregorio, that does sparklines. I needed something that was abstracted away from the CGI framework, more of a proper python module. Replacing all the CGI-specific code was straightforward and I came up with a standalone sparkline python module (<a href="http://perrygeo.googlecode.com/svn/trunk/gis-bin/spark.py">view / download the source code</a>). The only dependencies are python and the python imaging library.</p>
<p>In the minimalist spirit of sparklines, the interface was kept simple. First you create a list of data values, then simply pass the list to one of the sparkline generators:</p>
<blockquote>
<p>import spark
a = [32.5,35.2,39.9,40.8,43.9,48.2,50.5,51.9,53.1,55.9,60.7,64.4]
spark.sparkline_smooth(a).show()</p>
</blockquote>
<p>Or if you prefer a more discrete, bar-graph-style <img alt="" src="/assets/img/discrete.GIF"> instead of a smooth line:</p>
<blockquote>
<p>spark.sparkline_discrete(a).show()</p>
</blockquote>
<p>There's plenty of room for configuration. For example, in the national debt example above I wanted to keep the y axis at the same scale (instead of the default min-max scaling) and make each step 6 pixels wide:</p>
<blockquote>
<p>spark.sparkline_smooth(a, dmin=30,dmax=70, step=6).show()</p>
</blockquote>
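<p>The core idea fits in a few lines even without the imaging library. Here's a minimal sketch of the same scaling logic that emits a word-sized SVG polyline instead of a PNG (this is not the spark.py code, just an illustration; it assumes at least two data points):</p>

```python
# Sketch: scale a data series into a tiny inline SVG polyline.
# Assumes len(data) >= 2; the real spark.py module draws with PIL instead.
def sparkline_svg(data, width=100, height=15):
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1  # avoid dividing by zero on a flat series
    step = width / float(len(data) - 1)
    points = " ".join(
        "%.1f,%.1f" % (i * step, height - (v - lo) / span * height)
        for i, v in enumerate(data)
    )
    return ('<svg width="%d" height="%d"><polyline points="%s" '
            'fill="none" stroke="black"/></svg>' % (width, height, points))
```

<p>The min-max scaling here is the default behavior; pinning the y axis (like the dmin/dmax arguments above) just means substituting fixed bounds for min(data)/max(data).</p>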
<p>How does this relate to cartography? GIS typically takes a snapshot representation of earth, frozen in time. Since sparklines seem particularly good at representing change over time, they could be an interesting way to add a time dimension to a 2-D map. For example, instead of just displaying country polygons with labels, you could place a sparkline right under the label showing the population changes over the last century. It seems like an ideal way to embed a lot of useful information into a small map. </p>
<p>Anyone know of any good examples?</p>Blessed Unrest - Paul Hawken’s presentation2007-05-14T00:00:00-06:002007-05-14T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-14:/blessed-unrest-paul-hawkens-presentation.html<p>I got the chance to see Paul Hawken speak tonight in Santa Barbara. I knew him best as the author of <a href="http://www.natcap.org/">Natural Capitalism</a> which provided a great roadmap for integrating ecologically sustainable practices with the business world. This talk was based on his recent book - <a href="http://blessedunrest.com/">Blessed Unrest - How the Largest …</a></p><p>I got the chance to see Paul Hawken speak tonight in Santa Barbara. I knew him best as the author of <a href="http://www.natcap.org/">Natural Capitalism</a> which provided a great roadmap for integrating ecologically sustainable practices with the business world. This talk was based on his recent book - <a href="http://blessedunrest.com/">Blessed Unrest - How the Largest Movement in the World Came into Being and Why No One Saw It Coming</a>. </p>
<p>The basis of this book is simple: that organically-developed, bottom-up, non-hierarchical organizations (which number in the millions according to his research) are now leading the world in many diverse areas of service. He describes these environmental and social justice organizations as the "immune system" of our societies; our response to destructive and corrupt habits perpetrated by those in power who are willing to compromise our future for short-term gain. </p>
<p>One thing that struck me about the subject was the importance of sharing <em>information</em> and <em>ideas</em> (as opposed to spreading an <em>ideology</em>). I thought one of the most interesting stories of the night was his description of how the meme of non-violent civil disobedience evolved... from Emerson, to Thoreau, to Gandhi, to Rosa Parks to Martin Luther King, Jr. At each turn of the story, there was someone (often unnamed but vitally important) who introduced each of these people to the ideas of those who came before. </p>
<p>Paul was eager to point out the role of technology in this inter-connected mesh of grassroots community organizations. He mentioned open-source software a few times and even gave a shout out to Ruby on Rails (which I gather was the backbone for his <a href="http://wiserearth.org/">WiserEarth.org</a> site focused on connecting these diverse organizations).</p>
<p>It was a careful mix of optimism and pessimism; Paul was careful in noting the many severe challenges we've been handed but was confident that this bottom-up mesh of interconnected citizens can form a community strong enough to withstand anything that comes its way. In the end, his message was about doing what you love, connecting with others and standing up for your values. Sounds like good advice to me.</p>Cleaning up CAD data with postgis2007-05-14T00:00:00-06:002007-05-14T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-14:/cleaning-up-cad-data-with-postgis.html<p>Don't you just love getting CAD data into GIS! I received a .dwg file with study areas delineated as polylines which we needed as polygons for analysis purposes. And it wasn't just one polyline surrounding each study area ... there were hundreds of little line segments which outlined a couple dozen …</p><p>Don't you just love getting CAD data into GIS! I received a .dwg file with study areas delineated as polylines which we needed as polygons for analysis purposes. And it wasn't just one polyline surrounding each study area ... there were hundreds of little line segments which outlined a couple dozen areas (what was this CAD tech thinking?). Luckily each segment had a name to associate it with the proper area.</p>
<p>I found that ArcMap's tools for doing this are painfully inadequate so I turned to postgis. After converting the dataset to a shapefile, the solution was simple:</p>
<div class="highlight"><pre><span></span><code>shp2pgsql "study_areas.shp" areas | psql -d gisdata
pgsql2shp -f "study_areas_poly.shp" gisdata \
    "SELECT BuildArea(collect(the_geom)) AS the_geom, name
     FROM areas
     GROUP BY name"
</code></pre></div>
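<p>The same merge-and-polygonize idea can be sketched outside the database, too. Here's a minimal pure-Python illustration (the coordinates are hypothetical, not from the actual dataset) of chaining unordered segments into a closed ring and computing its area with the shoelace formula:</p>

```python
# Assemble unordered line segments into a closed ring, then compute
# its area with the shoelace formula. Coordinates are hypothetical.
def build_ring(segments):
    ring = list(segments[0])
    rest = [list(s) for s in segments[1:]]
    while rest:
        for seg in rest:
            if seg[0] == ring[-1]:
                ring.append(seg[1])
                rest.remove(seg)
                break
            if seg[1] == ring[-1]:
                ring.append(seg[0])
                rest.remove(seg)
                break
        else:
            raise ValueError("segments do not form a single ring")
    return ring  # first point == last point once the ring closes

def shoelace_area(ring):
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(ring, ring[1:]))) / 2.0

# Four segments, in no particular order, outlining a unit square
segments = [((0, 0), (1, 0)), ((1, 1), (0, 1)),
            ((1, 0), (1, 1)), ((0, 1), (0, 0))]
ring = build_ring(segments)
print(shoelace_area(ring))  # 1.0
```

<p>This is essentially what BuildArea does for each named group, with the database handling noding and multiple rings for you.</p>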
<p>Voila... a new shapefile with my proper polygons instead of CAD chicken scratch. </p>Back on the train2007-05-13T00:00:00-06:002007-05-13T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-13:/back-on-the-train.html<p>I'd like to have some interesting excuse as to why I haven't posted since last July. But I don't. </p>
<p>I've since left my position at NCEAS, started a new job at <a href="http://www.geosyntec.com">Geosyntec</a> and have been keeping busy with life, love and the pursuit of happiness. Oh and GIS of course …</p><p>I'd like to have some interesting excuse as to why I haven't posted since last July. But I don't. </p>
<p>I've since left my position at NCEAS, started a new job at <a href="http://www.geosyntec.com">Geosyntec</a> and have been keeping busy with life, love and the pursuit of happiness. Oh and GIS of course.</p>
<p>Anyway, I expect to be posting on a much more regular basis from here on (unless I get distracted again ;-) ).</p>Worldwind Java - Jython example2007-05-13T00:00:00-06:002007-05-13T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2007-05-13:/worldwind-java-jython-example.html<p>The <a href="http://worldwind.arc.nasa.gov/java/index.html"> worldwind java sdk </a> has finally been released. It's a neat SDK, well organized, <a href="http://tleilax.chinoy.com/worldwind/articles/20070510-FirstImpressions.html">easy to bring into Eclipse</a> with some good examples to start hacking away.</p>
<p>The only problem is the examples are written in Java ;-) . If braces make you cringe but you still want to work with all …</p><p>The <a href="http://worldwind.arc.nasa.gov/java/index.html"> worldwind java sdk </a> has finally been released. It's a neat SDK, well organized, <a href="http://tleilax.chinoy.com/worldwind/articles/20070510-FirstImpressions.html">easy to bring into Eclipse</a> with some good examples to start hacking away.</p>
<p>The only problem is the examples are written in Java ;-) . If braces make you cringe but you still want to work with all the excellent Java libraries out there, you'll want to take a look at Jython. Taking the AWT1Up.java code and porting a subset of the functionality to Jython was surprisingly easy and yielded much more readable code in my opinion. And the ability to manipulate objects at the interactive prompt is just so sweet. </p>
<p><a href="/assets/img/wwj_jython.jpg"> <img alt="" src="/assets/img/wwj_jython_thumb.jpg"> </a></p>
<p><a href="http://perrygeo.googlecode.com/svn/trunk/gis-bin/wwj_demo.py"> View the Source Code </a></p>
<p>Setup is not too terrible:</p>
<ol>
<li>
<p>Get a Java JDK (I'm using sun java 6) </p>
</li>
<li>
<p>Download and install Jython 2.2b2 </p>
</li>
<li>
<p>Download and unzip the worldwind java sdk (ex: /opt/wwj )</p>
</li>
<li>
<p>Set your LD_LIBRARY_PATH variable to /opt/wwj</p>
</li>
<li>
<p>Set your CLASSPATH variable to /opt/wwj/worldwind.jar</p>
</li>
<li>
<p>Run <code>jython wwj_demo.py</code></p>
</li>
</ol>
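<p>Assuming the install location from step 3, the environment setup in steps 4 and 5 might look like this in your shell profile (a sketch, with hypothetical paths; adjust to wherever you unzipped the SDK):</p>

```shell
# Hypothetical install location from step 3 -- adjust to taste
export WWJ_HOME=/opt/wwj
# The SDK's native libraries need to be on the loader path
export LD_LIBRARY_PATH=$WWJ_HOME:$LD_LIBRARY_PATH
# worldwind.jar must be on the classpath for Jython to find the classes
export CLASSPATH=$WWJ_HOME/worldwind.jar:$CLASSPATH
```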
<p>One thing that is a bit disappointing with the WorldWind SDK in general is the lack of support for rendering common formats. Maybe I missed something but I couldn't get gpx or georss feeds working properly. It is version 0.2 so I expect support for GeoRSS and GPX to improve and for GML, KML, GeoJSON, Shapefiles, Rasters, WMS, etc to be included eventually.</p>
<p>Anyone else out there started playing with Jython / Worldwind yet?</p>The reliability of web services2006-07-24T00:00:00-06:002006-07-24T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-07-24:/the-reliability-of-web-services.html<p>A few months back I posted a link to my <a href="http://www.perrygeo.net/wordpress/?p=35">ten favorite Web Mapping Services</a>. The post included live links directly to the WMS servers. At first I questioned this move as locally hosted images would be far more reliable. But I thought it would be a neat experiment to …</p><p>A few months back I posted a link to my <a href="http://www.perrygeo.net/wordpress/?p=35">ten favorite Web Mapping Services</a>. The post included live links directly to the WMS servers. At first I questioned this move as locally hosted images would be far more reliable. But I thought it would be a neat experiment to see the downtime of each site. So I checked it daily just out of curiosity...</p>
<p>Well, with today's apparent disappearance of the <a href="http://wms.jpl.nasa.gov/wms.cgi?request=GetCapabilities">NASA JPL site</a>, all but one of my WMS layers mentioned have been down for at least a significant portion of a day. (The only one that's been consistently up has been http://mesonet.agron.iastate.edu.)</p>
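<p>An informal daily check like this is easy to automate. A sketch in modern Python (<code>is_up</code> is my own helper, not part of any library, written so the HTTP call can be swapped out for testing):</p>

```python
import urllib.request

def is_up(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the endpoint answers with a non-error HTTP status."""
    try:
        resp = opener(url, timeout=timeout)
        return 200 <= resp.getcode() < 400
    except Exception:
        return False

# Usage sketch (base URL from the post; append your own GetCapabilities query):
#   for url in ["http://mesonet.agron.iastate.edu"]:
#       print("UP" if is_up(url) else "DOWN", url)
```

<p>Run it from cron once a day and you have a crude uptime log for your favorite servers.</p>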
<p>This echoes back to what I was complaining about with the whole <a href="http://www.perrygeo.net/wordpress/?p=43">USGS National Map debacle</a>. The bottom line is that whenever we rely heavily on a web service to deliver essential data, we are risking the integrity of the end product. The chain is only as strong as its weakest link and, unfortunately, as the USGS and NASA have shown, those links can and will fail completely from time to time.</p>Converting Shapefiles (and more) to KML2006-07-14T00:00:00-06:002006-07-14T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-07-14:/converting-shapefiles-and-more-to-kml.html<p>A while back I wrote about converting <a href="http://www.perrygeo.net/wordpress/?p=3">KML files into a shapefile</a> for use with GIS apps other than GoogleEarth. I got a ton of emails and site traffic from people looking to go the opposite direction: getting their GIS data into KML. </p>
<p>There are, of course, a couple of …</p><p>A while back I wrote about converting <a href="http://www.perrygeo.net/wordpress/?p=3">KML files into a shapefile</a> for use with GIS apps other than GoogleEarth. I got a ton of emails and site traffic from people looking to go the opposite direction: getting their GIS data into KML. </p>
<p>There are, of course, several utilities already implemented: ArcMap-based extensions including <a href="http://arcscripts.esri.com/details.asp?dbid=14344">KML Home Companion</a> and <a href="http://www.arc2earth.com/">Arc2Earth</a>, a nice MapWindow app called <a href="http://interactiveearth.blogspot.com/2006/06/download-shape2earth-beta-2.html"> Shape2Earth</a>, and the open source WMS server <a href="http://docs.codehaus.org/display/GEOS/Home">Geoserver</a> all support KML output. </p>
<p>Not to be left behind, GDAL/OGR now supports KML output. Oddly enough it does not yet read KML. But hand it any <a href="http://ogr.maptools.org/ogr_formats.html">OGR-readable vector dataset</a> and it can be converted into KML. It currently doesn't offer as much control over the output as the above options but is quicker to implement, works with a wide variety of input formats and can be easily scripted.</p>
<p>This functionality is in CVS only at the moment but should be included in the next release. If you can't wait and don't feel like compiling from cvs source, try the 1.0.5 version of <a href="http://fwtools.maptools.org/">FWTools</a> (for Windows and Linux).</p>
<p>The conversion process is pretty straightforward. For example, the following will convert a shapefile (sbpoints.shp) to KML (mypoints.kml). </p>
<div class="highlight"><pre><span></span><code>ogr2ogr -f KML mypoints.kml sbpoints.shp sbpoints
</code></pre></div>
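<p>And because it's a plain command line tool, batch conversion is easy to script. A small sketch (the directory layout and the <code>kml_commands</code> helper name are mine, not OGR's):</p>

```python
import glob
import os
import subprocess

def kml_commands(shp_dir):
    """Build one ogr2ogr command per shapefile found in shp_dir."""
    cmds = []
    for shp in sorted(glob.glob(os.path.join(shp_dir, "*.shp"))):
        kml = os.path.splitext(shp)[0] + ".kml"
        cmds.append(["ogr2ogr", "-f", "KML", kml, shp])
    return cmds

# To actually convert (requires ogr2ogr on your PATH):
#   for cmd in kml_commands("shapes/"):
#       subprocess.call(cmd)
```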
<p>The KML format flies in the face of the GIS mantra that content should be separate from styling. Since styling information is purposefully absent from most standard vector formats, it makes for pretty bland KML output. The attributes just get dumped out into one big text block and there is no classification or styling control.
<img alt="" src="/assets/img/ogrkml.jpg"></p>
<p>But in terms of getting your data into Google Earth quickly (esp. point data), the OGR method looks promising.</p>Wardriving with Ubuntu Linux and Google Earth2006-07-03T00:00:00-06:002006-07-03T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-07-03:/wardriving-with-ubuntu-linux-and-google-earth.html<p>Wardriving is fun. Going around the neighborhood and mapping all the wireless networks may be nothing more than a geeky hobby but it can sure teach you a lot. And viewing the results in Google Earth is icing on the cake.</p>
<p>I've used NetStumbler on Windows and this works great but …</p><p>Wardriving is fun. Going around the neighborhood and mapping all the wireless networks may be nothing more than a geeky hobby but it can sure teach you a lot. And viewing the results in Google Earth is icing on the cake.</p>
<p>I've used NetStumbler on Windows and this works great but since my computers at home are now nearly Microsoft-free, I had to relearn the process on Linux. It breaks down into a few easy steps:</p>
<ol>
<li>
<p>Install the <strong>drivers</strong> for your wireless card. On my HP laptop with a Broadcom card, I followed the instructions on the <a href="http://ubuntuforums.org/showthread.php?p=1071920&mode=linear"> ubuntu forums </a> which worked great with one exception: the driver link on that page doesn't have a valid md5 sum so you can download it from <a href="http://forums.fedoraforum.org/forum/attachment.php?attachmentid=7759">this url</a> instead.</p>
</li>
<li>
<p>Install <strong>gpsd.</strong> This is the software that talks to your gps unit and is available in the ubuntu packages through apt. The one hitch is that I had to set my Magellan GPS unit up for the correct baud rate and NMEA output. Once installed, I connected the GPS unit via a serial port, turned it on and ran <em>gpsd /dev/ttyS0</em> to start the gpsd server.</p>
</li>
<li>
<p>Install <strong>kismet,</strong> the wireless packet sniffer. The version in the ubuntu repository is not recent enough to support my Broadcom driver so I had to download the latest source and compile it with the standard <em>configure, make, sudo make install</em>. Then I had to edit /usr/local/etc/kismet.conf to reflect my system configuration; I changed the <em>suiduser</em>, <em>source</em> and <em>logtemplate</em> variables. Once configured, you can start it with the command <em>sudo kismet</em>.</p>
</li>
<li>
<p>Now <strong>drive/bike/walk around</strong> for a bit with your laptop and gps unit. When you're done, shutdown kismet and you'll have a bunch of fresh logfiles to work with.</p>
</li>
<li>
<p>The main kismet log is an xml file containing all the info on the available wireless networks including their SSID, their encryption scheme, transfer rate and their geographic position via gpsd. I worked up a small python script, <a href="http://perrygeo.googlecode.com/svn/trunk/gis-bin/kismet2kml.py">kismet2kml.py</a> (based on a blog entry at <a href="http://www.larsen-b.com/Article/204.html">jkx@Home</a>), to <strong>parse the logfile into a KML file</strong> for use with Google Earth. It could certainly use some tweaking but it's a start. To run it, give it the kismet logfile and pipe the output to a kml file: </p>
</li>
</ol>
<div class="highlight"><pre><span></span><code>kismet2kml.py kismet-log-Jul-03-2006-1.xml > wardrive.kml
</code></pre></div>
<ol start="6">
<li>Now fire up <strong>Google Earth</strong> (Linux version now available!) and load your KML file.</li>
</ol>
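<p>The heart of a script like kismet2kml.py is just XML in, KML out. Here's a stripped-down sketch of the idea; note that the element names below are assumptions based on one kismet version's log format, so check your own logs (and real code should XML-escape the SSID):</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical kismet log snippet -- element names are assumptions
SAMPLE = """<detection-run>
  <wireless-network>
    <SSID>coffeeshop</SSID>
    <gps-info><min-lat>34.41</min-lat><min-lon>-119.69</min-lon></gps-info>
  </wireless-network>
</detection-run>"""

def kismet_to_kml(xml_text):
    root = ET.fromstring(xml_text)
    placemarks = []
    for net in root.findall("wireless-network"):
        ssid = net.findtext("SSID", default="(no ssid)")
        lat = net.findtext("gps-info/min-lat")
        lon = net.findtext("gps-info/min-lon")
        if lat is None or lon is None:
            continue  # no GPS fix for this network
        placemarks.append(
            "<Placemark><name>%s</name>"
            "<Point><coordinates>%s,%s</coordinates></Point>"
            "</Placemark>" % (ssid, lon, lat)
        )
    return ('<?xml version="1.0"?><kml xmlns="http://earth.google.com/kml/2.0">'
            "<Document>%s</Document></kml>" % "".join(placemarks))

print(kismet_to_kml(SAMPLE))
```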
<p><img alt="" src="/assets/img/kismetkml.jpg"></p>
<p>Also, as James Fee <a href="http://www.spatiallyadjusted.com/2006/07/03/help-me-think-of-a-good-mashup-to-create/">points out</a>, posting your data as KML files means that the data can be integrated into a growing number of kml-ready apps including google maps (just upload the kml and point your browser to <em>http://maps.google.com/maps?q=http://your.server/wardrive.kml</em>). </p>
<p>Another neat application I've found for dealing with kismet logs is the <a href="http://wiki.openstreetmap.org/index.php/User:Dutch#Converting_Kismet_.gps_files_to_gpx">kismet2gpx script</a> for converting the kismet gps tracklog into gpx. Since most gps units have pretty tight limitations on the length of stored tracks, logging them to your laptop with kismet could be an effective way of creating detailed tracks on very long trips.</p>Mapserver Include2006-06-25T00:00:00-06:002006-06-25T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-06-25:/mapserver-include.html<p>If you manage even a small number of Mapserver sites, eventually you notice that you use a number of identical layers in multiple mapfiles. The way this is typically done is to copy and paste the LAYER definition into each mapfile. But inevitably you'll need to change the styling or …</p><p>If you manage even a small number of Mapserver sites, eventually you notice that you use a number of identical layers in multiple mapfiles. The way this is typically done is to copy and paste the LAYER definition into each mapfile. But inevitably you'll need to change the styling or the data source and you have to manually go through each mapfile to sync the changes. Wouldn't it be nice to define the layer in a single file and use it in many mapfiles?</p>
<p>While Mapserver has no concept of an "include", the C preprocessor (cpp) does. This is mentioned on the Mapserver list every time the subject of includes comes up. Still I have yet to find an actual example so I thought I'd share my notes on how I accomplish a mapserver include:</p>
<ol>
<li>Create your mapfile as usual but leave out any LAYER definitions that you wish to share amongst mapfiles. Instead use something like :</li>
</ol>
<blockquote>
<p>#include "landsat.layer"</p>
</blockquote>
<ol start="2">
<li>
<p>The C preprocessor doesn't deal well with "#", which is the mapfile's chosen comment character. Replace it with "##" to indicate a comment. </p>
</li>
<li>
<p>Save this pseudo-mapfile as <em>mymap.template</em></p>
</li>
<li>
<p>Create a file in the same directory called <em>landsat.layer</em> with the LAYER block. </p>
</li>
<li>
<p>Run the template through the preprocessor to generate the real mapfile :</p>
</li>
</ol>
<blockquote>
<p>cpp -P -C -o mymap.map mymap.template </p>
</blockquote>
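<p>If depending on cpp feels heavy, the include step itself is simple enough to sketch in Python. This is a hypothetical stand-in, not a drop-in cpp replacement (no macros, no conditionals, and no "##" comment handling):</p>

```python
import re
from pathlib import Path

# Match lines like: #include "landsat.layer"
INCLUDE = re.compile(r'^\s*#include\s+"(.+)"\s*$')

def expand(template):
    """Recursively splice #include'd files into a mapfile template."""
    template = Path(template)
    out = []
    for line in template.read_text().splitlines():
        m = INCLUDE.match(line)
        if m:
            # Included paths are resolved relative to the template
            out.append(expand(template.parent / m.group(1)))
        else:
            out.append(line)
    return "\n".join(out)

# Usage: Path("mymap.map").write_text(expand("mymap.template"))
```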
<p>The next step would be to script the preprocessing of <em>all</em> your mapfiles so that changing a layer definition in multiple mapfiles is as simple as editing the *.layer file and re-running the script. </p>Some thoughts on Where 2.02006-06-15T00:00:00-06:002006-06-15T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-06-15:/some-thoughts-on-where-20.html<p>Oh man, it's a long drive from San Jose back to Santa Barbara! Anyways, just got back from where 2.0 and want to throw out my quick summary of the event.</p>
<ul>
<li>There was a lot of talk about all things <strong>open</strong>; open data, open source and open standards. There was …</li></ul><p>Oh man, it's a long drive from San Jose back to Santa Barbara! Anyways, just got back from where 2.0 and want to throw out my quick summary of the event.</p>
<ul>
<li>There was a lot of talk about all things <strong>open</strong>; open data, open source and open standards. There was lots of buzz around the open street map project, osgeo applications like grass, ossim, gdal, mapbender, etc., and tons of discussion of WMS, WFS and other relevant standards. This is great as I think all three will be the cornerstone of the spatial industry in the near future. </li>
</ul>
<p>But, as I've mentioned before, people throw the word "open" around so much that it begins to lose meaning. From a lot of conversations I had, I found many people were confused about the differences. Some folks seemed to think that the osgeo foundation was a data repository for open data (it may soon be! .. but not quite yet) and also that osgeo was an open standards organization trying to "compete" with the OGC. But that is what an event like this is for: to reach out and communicate, clarify and bridge the gaps between communities.</p>
<p>Of course I had to laugh as I heard a couple dozen people refer to Google Maps as an "open source" application.... it's proprietary source code using proprietary data through a proprietary data transfer mechanism. It may be "free" as in beer but that's about the extent of its openness.</p>
<ul>
<li>
<p><strong>Social Data</strong>: using location technology as the basis for sharing personal experiences and social networking was a powerful theme at Where 2.0. It ran the gamut from tagging locations to writing personal travelogues to mobile location-based games to virtual worlds to mobile apps that could differentiate strangers vs. acquaintances in range of your Bluetooth device. </p>
</li>
<li>
<p><strong>Security and privacy</strong>: The where 2.0 mindset has real implications. Publishing your location and personal information in real time through the web and mobile devices brings up some frightening security and privacy issues. Who owns the data? What licenses are your personal data distributed under? Do you need others' permission to post their photos or locations? Who decides what is acceptable and what gets taken down? How is spam dealt with? Only two speakers were brave enough to fully address these issues head on and the panel had some good discussion on these topics. Kudos to them. </p>
</li>
<li>
<p>Bringing location technology to <strong>the masses</strong>: A few speakers repeated that in order to be successful in spatial technologies you need to bring your service to the masses. Certainly if you're trying to compete in the social networking space, this is true. But in general GIS and spatial tech have applications that are far beyond the interests of the vast majority of people: emergency management, infrastructure, environmental, real estate, etc. </p>
</li>
</ul>
<p>The mantra that spatial data and services must appeal to a wide audience is analogous to saying that family cars are the only successful type of motorized vehicle. In terms of numbers, they may be a majority. But in terms of utility, there is a reason that construction companies pay hundreds of thousands of dollars for heavy industrial machinery... because trying to haul tons of earth and debris with a Toyota Camry just doesn't work. Likewise there is a similar reason most municipalities don't use a Google Mashup to manage their parcel data... it simply doesn't work. So what is appropriate for mass consumption may have little applicability to business/government/industry/research. And vice versa. </p>
<ul>
<li>
<p><strong>Mobile Applications</strong>: So much potential here and some really cool innovations in geotagging content. Really, for the first time, I got a sense that these personal devices could become a means for creating a vast database of socially relevant information. But given the lack of security and privacy safeguards, the domination of the cellular networks and the heterogeneous environment of mobile platforms, I still view most of this as pie-in-the-sky.</p>
</li>
<li>
<p>Some new discoveries: </p>
<ul>
<li>
<p>metacarta: A text parsing engine with a public API to extract geo info from plain text! </p>
</li>
<li>
<p>gutenkarte: An application of the above to classic works of literature.</p>
</li>
<li>
<p>open layers: A javascript application with a slick UI and simple API for displaying WMS and WFS</p>
</li>
<li>
<p>open street map: A fantastic project focusing on collaborative development of a public street database </p>
</li>
<li>
<p>mapstraction: A javascript layer on top of the 'Big 3' Mapping APIs that allows you to switch seamlessly between the service providers.</p>
</li>
<li>
<p>Google Earth &amp; Sketchup: GE for linux!!! Wooo-hooo!! There was also a sweet demo of creating 3D drawings in Sketchup and placing them in GE. Very slick.</p>
</li>
<li>
<p>Google Maps: Now with kml support! Just try http://maps.google.com/?q=http://path.to.your.kml </p>
</li>
<li>
<p>Mapguide: I am embarrassed to say I have never tried out Autodesk's open source offering but the demo was sweet.. a very high powered GIS for a web app. And the Autodesk folks were about the nicest group of guys you could meet.</p>
</li>
<li>
<p>ArcGIS/Server 9.2: Author a map in ArcMap. Save as .mxd. Drop into web server. Instant kml and wms server! </p>
</li>
<li>
<p>And while not new to me, there were a lot of good overviews of some of my favorite software packages like OSSIM, GRASS, GDAL, Geoserver and World Wind (Java version coming this fall!!). </p>
</li>
</ul>
</li>
<li>
<p>Finally, the prize for most interesting talk goes to Chris Spurgeon who spoke about the best geohacks of the last 3000 years. Long before computers, Chris showed how Eratosthenes measured the diameter of the earth, how the Polynesians used the stars as an advanced navigation system, how the post-Renaissance world <em>re</em>discovered the stars as the key to navigation. And in more recent times he showed how Harry Beck reinvented the cartography of transportation with the London subway maps and how the VOR transmitters created highways in the featureless sky. This presentation really put current innovations in location technologies into perspective.</p>
</li>
</ul>
<p>OK sorry about the lack of links but it's too late in the evening for that. Hope you enjoyed my rundown and I'm sure I'll have more to say after I get some sleep!</p>Animating the Blue Marble2006-06-09T00:00:00-06:002006-06-09T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-06-09:/animating-the-blue-marble.html<p>A while back I posted my technique for creating an <a href="http://www.perrygeo.net/wordpress/?p=39">animated gif</a> out of a time series of maps. While this may have been the pinnacle of web animation circa 1997, the animated gif just didn't quite seem hip enough for this day and age.</p>
<p>Today I found a more …</p><p>A while back I posted my technique for creating an <a href="http://www.perrygeo.net/wordpress/?p=39">animated gif</a> out of a time series of maps. While this may have been the pinnacle of web animation circa 1997, the animated gif just didn't quite seem hip enough for this day and age.</p>
<p>Today I found a more modern example. This <a href="http://worldkit.org/wmstimenav/">WorldKit interface</a>, built with Flash, shows the seasonal progression of snow and land cover changes courtesy of the next generation Blue marble images. Complete with time slider, image fading and full animation controls, this interface really shines at providing an interactive experience rather than a passive visual display. </p>HostGIS Linux 3.6 Released2006-06-03T00:00:00-06:002006-06-03T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-06-03:/hostgis-linux-36-released.html<p>Though probably not as big of a news item as this week's <a href="http://www.ubuntu.com/news/606released">release of Ubuntu Dapper</a>, there's another Linux release that might be of interest to us GIS folk:</p>
<p>Built off of a <a href="http://www.slackware.com/">Slackware</a> base (one of the oldest, most stable linux distros), <a href="http://www.hostgis.com/linux/">HostGIS Linux</a> aims to be a "minimal …</p><p>Though probably not as big of a news item as this week's <a href="http://www.ubuntu.com/news/606released">release of Ubuntu Dapper</a>, there's another Linux release that might be of interest to us GIS folk:</p>
<p>Built off of a <a href="http://www.slackware.com/">Slackware</a> base (one of the oldest, most stable linux distros), <a href="http://www.hostgis.com/linux/">HostGIS Linux</a> aims to be a "minimal yet complete" distribution specifically built with GIS in mind. It is first and foremost a server platform; it does not include any window system at all. If you're looking for desktop GIS applications out-of-box, it might not be the best for you. </p>
<p>But for a GIS server, it comes with most of the open source stack preinstalled and configured. This latest release has <a href="http://www.hostgis.com/linux/manual/changes.html">a few changes</a> and version upgrades for most of the components.</p>
<ul>
<li>
<p>PHP, Python and Perl Mapscript </p>
</li>
<li>
<p>GDAL/OGR with PHP, Python and perl bindings </p>
</li>
<li>
<p>Postgresql 8.1 with PostGIS 1.1 </p>
</li>
<li>
<p>drivers for many extra formats including jpeg2000 and ecw </p>
</li>
<li>
<p>Apache web server with Mapserver CGI </p>
</li>
</ul>
<p>The primary motivation for creating HGL was to speed up the installation of new gis-enabled servers. Gregor Mosheh, the head programmer for HostGIS, has done an excellent job pretty much single-handedly putting this together. (In full disclosure, I do consulting work for HostGIS, though I wasn't really involved in the creation of HostGIS Linux.)</p>
<p>The setup is your standard text-based install and is a piece of cake if you've ever installed Linux before. When you're through, you have the good ole' black and white text console staring at you. Not very interesting... But the really satisfying part is to fire up a web browser after the install and be able to point it to a working webGIS application. Anyone who has spent the time to set up the mapserver stack and its seemingly infinite dependencies can appreciate the amount of work this saves! </p>
<p>If you're not into learning a new distro, there is always the <a href="http://www.maptools.org/fgs/">FGS</a> linux installer which will set up a similar software stack on pretty much any linux.</p>
<p>And for Desktop GIS, many linux distros have a selection of GIS apps in their package repositories (you'll certainly want to grab GRASS, GDAL and QGIS). <a href="http://fwtools.maptools.org/">FWTools</a> can be a good option on both Linux and Windows to get you up and running quickly. Finally there are a number of other more desktop-oriented distros for GIS including <a href="http://www.sourcepole.com/gis-knoppix/"> Knoppix GIS</a> and <a href="http://www.geolivre.org.br/modules/news/">GeoLivre</a>, both of which run as a live-cd so you can check it out before you install.</p>
<p>Anyways, back to sum up HostGIS Linux: </p>
<p>If you need to set up a GIS server with minimal fuss and you have some experience with Linux, you might like to try it out. It will save lots of time. </p>
<p>If you're a GIS user who needs a graphical windowing environment to do GIS work on the Desktop, HostGIS Linux will not really make you happy out-of-the-box. Of course, since HGL is slackware based, you <em>can</em> use the slackware package management system to build an impressive Desktop system. But if you don't need to run a server or really care about having the latest versions, Ubuntu comes with a solid desktop environment and packages for a lot of good GIS apps. </p>More on Mapnik WMS2006-05-18T00:00:00-06:002006-05-18T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-05-18:/more-on-mapnik-wms.html<p>One of my initial complaints about the Mapnik WMS server was that it would not accept any parameters that were not in the OGC WMS spec. Some WMS clients will tag on extra parameters for various reasons and the OGC supports this in relation to vendor-specific parameters. The fix was …</p><p>One of my initial complaints about the Mapnik WMS server was that it would not accept any parameters that were not in the OGC WMS spec. Some WMS clients will tag on extra parameters for various reasons and the OGC supports this in relation to vendor-specific parameters. The fix was pretty simple; in <strong>mapnik/ogcserver/common.py</strong> you can simply comment out </p>
<div class="highlight"><pre><span></span><code>        #for paramname in params.keys():
        #    if paramname not in self.SERVICE_PARAMS[requestname].keys():
        #        raise OGCException('Unknown request parameter "%s".' % paramname)
</code></pre></div>
<p>to get the desired effect.</p>
<hr>
<p>There was also the question of speed and how it compared to other WMS servers such as Mapserver. Since I already had both a Mapnik and Mapserver WMS set up using the exact same data source, styled in the same fashion, it was pretty simple to write a quick python script that would smack each WMS server with a given number of back-to-back WMS GetMap requests:</p>
<div class="highlight"><pre><span></span><code>#!/usr/bin/env python
import sys
import urllib

server = sys.argv[1]
hits = int(sys.argv[2])
if server == 'mapnik':
    url = "http://localhost/fcgi-bin/wms?VERSION=1.1.1&amp;REQUEST=GetMap&amp;SERVICE=WMS&amp;LAYERS=world_borders&amp;SRS=EPSG:4326&amp;BBOX=-4.313249999999993,20.803500000000003,59.58675000000002,52.75350000000002&amp;WIDTH=800&amp;HEIGHT=400&amp;FORMAT=image/png&amp;STYLES=&amp;TRANSPARENT=TRUE&amp;UNIQUEID="
elif server == 'mapserver':
    url = "http://localhost/cgi-bin/mapserv?map=/home/perrygeo/mapfiles/world.map&amp;VERSION=1.1.1&amp;REQUEST=GetMap&amp;SERVICE=WMS&amp;LAYERS=worldborders&amp;SRS=EPSG:4326&amp;BBOX=-4.313249999999993,20.803500000000003,59.58675000000002,52.75350000000002&amp;WIDTH=800&amp;HEIGHT=400&amp;FORMAT=image/png&amp;STYLES=&amp;TRANSPARENT=TRUE&amp;UNIQUEID="
for i in range(0, hits):
    urllib.urlretrieve(url)
</code></pre></div>
<p>Then just run the script from the command line, specifying the server and number of hits, and wrap it in the <em>time</em> command. Here are the results:</p>
<p><img alt="" src="/assets/img/manik_vs_mapserv_speed.png"></p>
<p>Pretty close. Mapserver was just slightly faster in every case. This is just a preliminary test, though, and it would be interesting to see a comparison:</p>
<ul>
<li>
<p>With larger datasets and more complex styling including classification and text labelling</p>
</li>
<li>
<p>With data from other sources such as PostGIS, where the connection overhead might be significant</p>
</li>
<li>
<p>With Mapserver running as a FastCGI process</p>
</li>
<li>
<p>With concurrent requests as opposed to back-to-back requests </p>
</li>
</ul>
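<p>To sketch that last point, the back-to-back loop above can be contrasted with a threaded harness. This is only an illustration in modern Python 3 (the 2006 script targets Python 2), and the injectable <code>fetch</code> callable and worker count are my assumptions, not part of the original benchmark:</p>

```python
# Sketch: time N back-to-back requests vs. N concurrent requests.
# `fetch` is any zero-argument callable, e.g. one that retrieves a
# WMS GetMap URL; it is injectable so the harness itself is testable.
import time
from concurrent.futures import ThreadPoolExecutor


def bench_sequential(fetch, hits):
    """Issue `hits` requests one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(hits):
        fetch()
    return time.perf_counter() - start


def bench_concurrent(fetch, hits, workers=8):
    """Issue `hits` requests from a thread pool; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch) for _ in range(hits)]
        for f in futures:
            f.result()  # re-raise any exception from a worker thread
    return time.perf_counter() - start
```

<p>Against a live server, <code>fetch</code> could be something like a hypothetical <code>lambda: urllib.request.urlopen(getmap_url).read()</code>; the gap between the two timings would show how well each server handles simultaneous clients.</p>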
<p>Overall though, my opinion of Mapnik WMS remains high and I'd love to put it in production use in the near future. Stay tuned...</p>Mapnik WMS Server2006-05-17T00:00:00-06:002006-05-17T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-05-17:/mapnik-wms-server.html<p>A few months ago, <a href="http://mapnik.org/"> Mapnik</a> came onto my radar and I was immediately impressed with the <a href="http://mapnik.org/maps/">beautiful</a> <a href="http://static.flickr.com/35/106561736_afcdc30ddb_o.png">cartography</a>. But, until recently, it was just a C++ library with some python bindings that could be used to programmatically build nice map images from shapefiles, geotiffs or postgis layers. There were no …</p><p>A few months ago, <a href="http://mapnik.org/"> Mapnik</a> came onto my radar and I was immediately impressed with the <a href="http://mapnik.org/maps/">beautiful</a> <a href="http://static.flickr.com/35/106561736_afcdc30ddb_o.png">cartography</a>. But, until recently, it was just a C++ library with some python bindings that could be used to programmatically build nice map images from shapefiles, geotiffs or postgis layers. There were no common interfaces such as WMS to access mapnik... until last month. Jean Francois Doyon recently added <a href="http://mapnik.org/news/2006/apr/18/wms/">a prototype WMS interface</a> to Mapnik. It runs as a fastcgi script under apache. It is still a bit rough around the edges, but the result is well worth a little extra setup effort. </p>
<p>I set up Mapnik as a WMS server recently and would like to share my process and results. This tutorial assumes you already have python, postgresql/postgis, proj4, the python imaging library and apache2 running. The examples are for Ubuntu Dapper Drake; they may work well on other versions of Ubuntu and Debian, but for other unixes (and certainly windows) many things may need to be tweaked.</p>
<p>First off, we have to install the base mapnik libs. These depend on the boost python bindings and the whole compile process is very simple (if a bit slow) in Ubuntu:</p>
<blockquote>
<div class="highlight"><pre><span></span><code><span class="n">sudo</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="n">get</span><span class="w"> </span><span class="n">install</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">python1</span><span class="o">.</span><span class="mf">33.1</span><span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">python</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">regex1</span><span class="o">.</span><span class="mf">33.1</span><span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">regex</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">serialization</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">signals1</span><span class="o">.</span><span class="mf">33.1</span><span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">signals</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">thread1</span><span class="o">.</span><span class="mf">33.1</span><span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">thread</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">program</span><span class="o">-</span><span class="n">options1</span><span class="o">.</span><span class="mf">33.1</span><span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">program</span><span class="o">-</span><span class="n">options</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">filesystem1</span><span class="o">.</span><span class="mf">33.1</span><span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">filesystem</span><span class="o">-</span><span class="n">dev</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">iostreams1</span><span class="o">.</span><span class="mf">33.1</span><span class="w"> </span><span class="n">libboost</span><span class="o">-</span><span class="n">iostreams</span><span class="o">-</span><span class="n">dev</span><span class="w"></span>
<span class="n">cd</span><span class="w"> </span><span class="o">~/</span><span class="n">src</span><span class="w"></span>
<span class="n">svn</span><span class="w"> </span><span class="n">checkout</span><span class="w"> </span><span class="n">svn</span><span class="p">:</span><span class="o">//</span><span class="n">svn</span><span class="o">.</span><span class="n">berlios</span><span class="o">.</span><span class="n">de</span><span class="o">/</span><span class="n">mapnik</span><span class="o">/</span><span class="n">trunk</span><span class="w"> </span><span class="n">mapnik</span><span class="w"></span>
<span class="n">cd</span><span class="w"> </span><span class="n">mapnik</span><span class="w"></span>
<span class="n">python</span><span class="w"> </span><span class="n">scons</span><span class="o">/</span><span class="n">scons</span><span class="o">.</span><span class="n">py</span><span class="w"> </span><span class="n">PYTHON</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">python</span><span class="w"> </span><span class="n">PGSQL_INCLUDES</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">include</span><span class="o">/</span><span class="n">postgresql</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">PGSQL_LIBS</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">postgresql</span><span class="w"> </span><span class="n">BOOST_INCLUDES</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">include</span><span class="o">/</span><span class="n">boost</span><span class="w"> </span><span class="n">BOOST_LIBS</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">lib</span><span class="w"></span>
<span class="n">sudo</span><span class="w"> </span><span class="n">python</span><span class="w"> </span><span class="n">scons</span><span class="o">/</span><span class="n">scons</span><span class="o">.</span><span class="n">py</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">PYTHON</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">python</span><span class="w"> </span><span class="n">PGSQL_INCLUDES</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">include</span><span class="o">/</span><span class="n">postgresql</span><span class="w"> </span>\<span class="w"></span>
<span class="w"> </span><span class="n">PGSQL_LIBS</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">postgresql</span><span class="w"> </span><span class="n">BOOST_INCLUDES</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">include</span><span class="o">/</span><span class="n">boost</span><span class="w"> </span><span class="n">BOOST_LIBS</span><span class="o">=/</span><span class="n">usr</span><span class="o">/</span><span class="n">lib</span><span class="w"></span>
<span class="n">sudo</span><span class="w"> </span><span class="n">ldconfig</span><span class="w"></span>
</code></pre></div>
</blockquote>
<p>Now we have to set up some additional libs in order to run the WMS:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>cd ~/src
wget http://easynews.dl.sourceforge.net/sourceforge/jonpy/jonpy-0.06.tar.gz
tar -xzvf jonpy-0.06.tar.gz
cd jonpy-0.06/
sudo python setup.py install
# copy the ogcserver stuff into its own dir
mkdir /opt/mapnik; cd /opt/mapnik
cp ~/src/mapnik/utils/ogcserver/* .
</code></pre></div>
</blockquote>
<p>Now you'll want to edit the <strong>ogcserver.conf</strong> file and change the following lines. The <em>module</em> is essentially the name of a python file (minus the .py extension) that we'll create later. The height and width settings cap the maximum image size that can be requested.</p>
<blockquote>
<div class="highlight"><pre><span></span><code> <span class="k">module</span>=<span class="n">worldMapFactory</span>
<span class="n">maxheight</span>=<span class="mi">2048</span>
<span class="n">maxwidth</span>=<span class="mi">2048</span>
</code></pre></div>
</blockquote>
<p>Create our "map factory" module defining data sources, styles, etc. (<strong>worldMapFactory.py</strong>). Most of this configuration is explained in the mapnik docs and well-commented examples. One thing to note is that the shapefile must be specified <em>without</em> the .shp extension:</p>
<blockquote>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">mapnik.ogcserver.WMS</span> <span class="kn">import</span> <span class="n">BaseWMSFactory</span>
<span class="kn">from</span> <span class="nn">mapnik</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">class</span> <span class="nc">WMSFactory</span><span class="p">(</span><span class="n">BaseWMSFactory</span><span class="p">):</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">BaseWMSFactory</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
        <span class="n">sty</span> <span class="o">=</span> <span class="n">Style</span><span class="p">()</span>
        <span class="n">rl</span> <span class="o">=</span> <span class="n">Rule</span><span class="p">()</span>
        <span class="n">rl</span><span class="o">.</span><span class="n">symbols</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">PolygonSymbolizer</span><span class="p">(</span><span class="n">Color</span><span class="p">(</span><span class="mi">248</span><span class="p">,</span><span class="mi">216</span><span class="p">,</span><span class="mi">136</span><span class="p">)))</span>
        <span class="n">rl</span><span class="o">.</span><span class="n">symbols</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">LineSymbolizer</span><span class="p">(</span><span class="n">Color</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span><span class="mi">1</span><span class="p">))</span>
        <span class="n">sty</span><span class="o">.</span><span class="n">rules</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> <span class="n">rl</span> <span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">register_style</span><span class="p">(</span><span class="s1">'style1'</span><span class="p">,</span> <span class="n">sty</span><span class="p">)</span>
        <span class="n">lyr</span> <span class="o">=</span> <span class="n">Layer</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'world_borders'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s1">'shape'</span><span class="p">,</span> \
            <span class="n">file</span><span class="o">=</span><span class="s1">'/opt/data/world_borders/world_borders'</span><span class="p">)</span>
        <span class="n">lyr</span><span class="o">.</span><span class="n">styles</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">'style1'</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">register_layer</span><span class="p">(</span><span class="n">lyr</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">finalize</span><span class="p">()</span>
</code></pre></div>
</blockquote>
<p>Now we need to set up apache2 to handle fastcgi:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>sudo apt-get install libapache2-mod-fcgid
sudo a2enmod fcgid
</code></pre></div>
</blockquote>
<p>... and add some config lines to the apache config file, usually /etc/apache/httpd.conf but, in the case of this Ubuntu install, <strong>/etc/apache2/sites-enabled/default</strong>:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>ScriptAlias /fcgi-bin/ /usr/lib/fcgi-bin/
&lt;Directory "/usr/lib/fcgi-bin"&gt;
    AllowOverride All
    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
    Order allow,deny
    Allow from all
    SetHandler fastcgi-script
&lt;/Directory&gt;
</code></pre></div>
</blockquote>
<p>Create the fastcgi directory referred to by apache:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>sudo mkdir /usr/lib/fcgi-bin
</code></pre></div>
</blockquote>
<p>Now create the actual server script as <strong>/usr/lib/fcgi-bin/wms</strong>:</p>
<blockquote>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="ch">#!/usr/bin/env python</span>
<span class="c1"># Your mapnik dir containing the map factory </span>
<span class="c1"># must be in the python path!</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">'/opt/mapnik'</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">mapnik.ogcserver.cgiserver</span> <span class="kn">import</span> <span class="n">Handler</span>
<span class="kn">import</span> <span class="nn">jon.fcgi</span> <span class="k">as</span> <span class="nn">fcgi</span>
<span class="k">class</span> <span class="nc">WMSHandler</span><span class="p">(</span><span class="n">Handler</span><span class="p">):</span>
    <span class="n">configpath</span> <span class="o">=</span> <span class="s1">'/opt/mapnik/ogcserver.conf'</span>
<span class="n">fcgi</span><span class="o">.</span><span class="n">Server</span><span class="p">({</span><span class="n">fcgi</span><span class="o">.</span><span class="n">FCGI_RESPONDER</span><span class="p">:</span> <span class="n">WMSHandler</span><span class="p">})</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
</code></pre></div></td></tr></table></div>
</blockquote>
<p>Finally, restart the apache server:</p>
<blockquote>
<div class="highlight"><pre><span></span><code><span class="n">sudo</span><span class="w"> </span><span class="o">/</span><span class="n">etc</span><span class="o">/</span><span class="n">init</span><span class="o">.</span><span class="n">d</span><span class="o">/</span><span class="n">apache2</span><span class="w"> </span><span class="n">force</span><span class="o">-</span><span class="n">reload</span><span class="w"></span>
</code></pre></div>
</blockquote>
<p>Now you can access it with a WMS request like so:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>http://localhost/fcgi-bin/wms?VERSION=1.1.1&amp;REQUEST=GetMap&amp;LAYERS=world_borders&amp;
FORMAT=image/png&amp;SRS=EPSG:4326&amp;STYLES=&amp;BBOX=-81.54375,-58.3125,-59.04375,-47.0625&amp;
EXCEPTIONS=application/vnd.ogc.se_inimage&amp;width=600&amp;height=300
</code></pre></div>
</blockquote>
<p><img alt="" src="/assets/img/mapnik.png"></p>
<p>Compare the linework with a comparable WMS service using UMN Mapserver on the backend. I'll let the results speak for themselves...</p>
<p><img alt="" src="/assets/img/mapserv.png"></p>
<p>Even if its map rendering is smooth, Mapnik's WMS server is still a bit rough around the edges:</p>
<ul>
<li>
<p>It does not support GetFeatureInfo requests</p>
</li>
<li>
<p>The server has trouble with extra parameters. For instance, some WMS clients like mapbuilder like to tack an extra 'UNIQUEID' parameter onto the URL, and this causes an unnecessary error with mapnik's WMS server.</p>
</li>
<li>
<p>Mapnik itself does not support reprojection </p>
</li>
<li>
<p>It only supports shapefiles, geotiffs and postgis layers.</p>
</li>
</ul>
<p>The readme.txt file in the docs/ogcserver/ directory of a recent mapnik SVN checkout has a full list of known features and caveats, so refer to it for the complete story.</p>
<p>But, all in all, I am <em>very</em> impressed with the quality of the Mapnik WMS server. I figured that, since Mapnik's goal has been high-quality cartographic output, speed would be sacrificed, but I didn't notice any significant lag; on the contrary, I think it was actually about on par with Mapserver running as a CGI. If it was any slower, I didn't notice it immediately. But then again, it was only working with a relatively small shapefile and I was the only user. I'd like to do more rigorous stress tests on the Mapnik WMS to see how it compares to Mapserver and Geoserver under varying loads with greater volumes of data.</p>Educational ways to waste some time2006-05-12T00:00:00-06:002006-05-12T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-05-12:/educational-ways-to-waste-some-time.html<p>It's always great to find fun internet-based games that actually challenge you in "real world" skills. (And no, working on your wizard's Ether Flame spell in EverQuest is NOT a real world skill). After all, if you're going to waste some time, it might as well be educational, right? Can …</p><p>It's always great to find fun internet-based games that actually challenge you in "real world" skills. (And no, working on your wizard's Ether Flame spell in EverQuest is NOT a real world skill). After all, if you're going to waste some time, it might as well be educational, right? Can you tell that my mother is a school teacher? Happy Mother's Day!</p>
<p>Anyways, these might be old news to some folks but I've found two fun games that will keep your brain fresh.</p>
<p>First, there is <a href="http://geosense.net">GeoSense</a>. This is a fantastic interactive game that pits users one-on-one in a timed geography quiz. You're given a city and country and you have 10 seconds to click the map. The player with the best combination of speed and accuracy wins. Given <a href="http://news.nationalgeographic.com/news/2006/05/0502_060502_geography.html">American youth's horrible knowledge of geography</a>, this site could be really helpful. I would recommend it to children of all ages if it weren't for the chatroom being infested with pubescent teen sex fiends. Just go use myspace or something...</p>
<p>Secondly, for you Python programmers out there, there is the <a href="http://www.pythonchallenge.com/">Python Challenge</a>, a surprisingly challenging and mind-boggling course of puzzles that can be solved with Python. Actually, some people have solved them with UNIX shell commands, perl or ruby, but many of the hints are python-specific. They require a good dose of logic, persistence, knowledge of python libraries and a knack for finding patterns. Basically, your goal is, given a minimal set of hints, to find and process the data that will lead you to the next URL. I'm on level 9 right now and, well, I'm not going to admit to anyone how long it took to get there. Addictively challenging...</p>
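<p>For a taste of the flavor (this example is the well-known warm-up level, not something from this post): the first Python Challenge page hints at evaluating <code>2**38</code>, and the resulting number becomes the next page's URL.</p>

```python
# Python Challenge warm-up: the hint boils down to evaluating 2**38;
# the printed number gets pasted into the next level's URL.
print(2 ** 38)  # 274877906944
```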
<p>That's it for now. Have fun.</p>The impact of urban areas on CO2 emissions2006-05-06T00:00:00-06:002006-05-06T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-05-06:/the-impact-of-urban-areas-on-co2-emmissions.html<p>Increases in atmospheric carbon dioxide (CO2) due to vehicle emissions are considered one of the most important human-induced factors of climate change. Conventional wisdom would say that urban areas, with their huge populations, dense road networks and congested freeways, are the biggest offenders. This is true to some extent. But …</p><p>Increases in atmospheric carbon dioxide (CO2) due to vehicle emissions are considered one of the most important human-induced factors of climate change. Conventional wisdom would say that urban areas, with their huge populations, dense road networks and congested freeways, are the biggest offenders. This is true to some extent. But, viewed from a different perspective, the <em>per-capita</em> CO2 emissions for these urban areas can be considerably less than surrounding rural and suburban areas.</p>
<p>Travelmatters.org has posted <a href="http://www.travelmatters.org/maps/regional/"> a series of maps</a> comparing these two conflicting views. Here's a sample from Chicago that demonstrates the sharp dichotomy; both views are entirely accurate, but they analyze the same data in different ways:</p>
<p><img alt="" src="/assets/img/co2-map-chi-med.gif"></p>
<p>In every case, the <em>total</em> CO2 emissions are much greater in dense urban areas. But, <em>per-capita</em>, the urban areas have much lower emissions, sometimes dramatically lower. This second view indicates, as <a href="http://www.worldchanging.com/archives/004390.html"> WorldChanging </a> points out, that living in denser neighborhoods can reduce your climate impact. It makes sense that living closer to the places you need to go on a daily basis and having more access to public transportation would reduce the emissions impact. Maybe cities are "greener" than most of us perceive them to be? </p>USGS Seamless is back2006-05-05T00:00:00-06:002006-05-05T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-05-05:/usgs-seamless-is-back.html<p>Two weeks after I first noticed something had gone awry with the USGS Seamless site, they appear to have fixed their server issues. As of this morning, the interactive <a href="http://seamless.usgs.gov/website/seamless/viewer.php">data viewer and download interface</a> is fully functional as far as I can tell. </p>
<p>Now be gentle on their server. Rumour …</p><p>Two weeks after I first noticed something had gone awry with the USGS Seamless site, they appear to have fixed their server issues. As of this morning, the interactive <a href="http://seamless.usgs.gov/website/seamless/viewer.php">data viewer and download interface</a> is fully functional as far as I can tell. </p>
<p>Now be gentle on their server. Rumour has it, if you download more than 3 DEMs at a time, the server might go down for another 2 weeks! Just kidding... everything seems to be working fine. Download away....</p>What’s going on with seamless.usgs.gov?2006-04-25T00:00:00-06:002006-04-25T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-04-25:/whats-going-on-with-seamlessusgsgov.html<p>Since April 21, I have not been able to view or extract any data from the USGS Seamless site, ostensibly the central distribution center for the US National Map. The site has been changing rapidly from day to day ever since and it seems that changes are underway, so at …</p>
<p>When I develop an internet application, even if it's only used by a few people, I usually seperate the development version from the stable, live version to minimize any downtime. And if you absolutely can't keep the app running, at least put a big banner on the page indicating that the system is down so people (like me) don't waste half an hour trying to figure out what they're doing wrong. Is this too much to ask of the USGS? They are supposed to be the official portal for accessing our nation's spatial data, right? And we're not talking about a small server hiccup here, it has been down since at least April 21st with no public indication that problems are occuring on the site. </p>
<p>I just received an email this morning from the USGS web mapping admin. The emphasis is mine:</p>
<blockquote>
<p>We apologize for any issues you may have experienced lately. The Seamless server, and all related map services will be unavailable for at least the next few days. During this time, <strong>the sites may still appear to be functioning.</strong> Some may ask for a password, and others may not show up at all. Normally our status messages are posted at http://seamless.usgs.gov. However, since this server has been affected by this outage, users are being re-directed to http://gisdata.usgs.net. We are in the process of posting a message here as well, which you will be able to monitor for any updates. We are estimating that the site will be available again by <strong>Monday May 1st 2006.</strong> Our team is working diligently to have this service available as soon as possible. We appreciate you patience during this time. </p>
</blockquote>
<p>I really shouldn't be surprised that a government agency botched it so badly; that seems to be the norm here in the US. But I've really come to rely on the seamless site for a lot of data and it seems that 10 days of downtime for the <em>sole</em> distributor of our seamless national spatial data archive is a bit... amateur. </p>The distinction between open source and open standards2006-04-23T00:00:00-06:002006-04-23T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-04-23:/the-distinction-between-open-source-and-open-standards.html<p>Time and time again I see <em>open source</em> and <em>open standards</em> mentioned in the <a href="http://veryspatial.com/?p=802">same</a> <a href="http://www.ced.org/projects/ecom.shtml#open">sentence</a>. While I'm a strong proponent of both, it is a bit disheartening to see how closely intertwined the two concepts are in the eyes of many GIS folks. </p>
<p>Open source refers to <em>software</em> distributed …</p><p>Time and time again I see <em>open source</em> and <em>open standards</em> mentioned in the <a href="http://veryspatial.com/?p=802">same</a> <a href="http://www.ced.org/projects/ecom.shtml#open">sentence</a>. While I'm a strong proponent of both, it is a bit disheartening to see how closely intertwined the two concepts are in the eyes of many GIS folks. </p>
<p>Open source refers to <em>software</em> distributed with a license that allows access to view and modify the source code. There are also some <a href="http://www.opensource.org/index.php">other criteria</a> but unrestricted access to the source code is the key component. </p>
<p>Open standards refers to <em>software-neutral</em> specifications, usually developed collaboratively, to accomplish a technical goal. In the GIS world, this typically means <a href="http://www.opengeospatial.org/specs/?page=specs">OpenGIS specifications</a> for sharing data across a network (WMS/ WFS/ WCS), data formats (GML), or for working with spatial data in a relational database (Simple Features Spec for SQL). We could arguably include pseudo-open specifications for data such as <a href="http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf">shapefiles</a> and <a href="http://earth.google.com/kml/kml_intro.html">KML</a>.</p>
<p>Open source applications do not always conform to open standards. Standards-compliant software does not necessarily have to be open source. So why are the two often mentioned in the same breath as though they were synonymous? Perhaps open source software is perceived as being "ahead" of other types of software in terms of adoption of standards; and maybe that's true. But there are many proprietary software companies that have devoted a lot of effort towards making their software communicate via open standards and their efforts should not go unnoticed (<a href="http://www.esri.com/software/standards/ogc-support.html">ESRI</a> and <a href="http://www.cadcorp.com/">Cadcorp</a> just to name the two I'm familiar with). </p>
<p>The promise of open standards is that anyone can develop and use compliant applications that can easily interoperate regardless of the chosen software package. While that promise is far from being fully realized, associating open standards with a particular type of software will not get us any closer. </p>
<p><strong>Update</strong>: Or maybe we <em>are</em> getting close... check out <a href="http://geospatial.blogs.com/geospatial/2006/04/interoperabilit.html">Geoff Ziess' post</a> on the OGC interoperability demonstration in Tampa. Ten vendors interoperating and sharing data in real time... this is what it's all about.</p>Animating Static Maps - The Geologic Evolution of North America2006-04-11T00:00:00-06:002006-04-11T00:00:00-06:00Matthew T. Perrytag:www.perrygeo.com,2006-04-11:/animating-static-maps-the-geologic-evolution-of-north-america.html<p>The Cartography blog <a href="http://ccablog.blogspot.com/2006/04/paleogeographic-maps.html"> recently talked about </a> a series of <a href="http://jan.ucc.nau.edu/%7Ercb7/nam.html">excellent Paleogeographic maps</a> developed by Dr. Ron Blakey at Northern Arizona University. Ever since I first studied geology, I had dreamed of an atlas that would clearly and visually demonstrate how our current land masses came to be. This time series …</p><p>The Cartography blog <a href="http://ccablog.blogspot.com/2006/04/paleogeographic-maps.html"> recently talked about </a> a series of <a href="http://jan.ucc.nau.edu/%7Ercb7/nam.html">excellent Paleogeographic maps</a> developed by Dr. Ron Blakey at Northern Arizona University. Ever since I first studied geology, I had dreamed of an atlas that would clearly and visually demonstrate how our current land masses came to be. This time series of maps focuses on North America and the geologic events that have shaped it for the last 500 million years. Truly fascinating and excellent work. I encourage everyone to check out the site and read a little about it as well as <a href="http://bldgblog.blogspot.com/2006/04/assembling-north-america_11.html">the narrative by Geoff Manaugh</a>. </p>
<p><img alt="" src="/assets/img/29.gif"></p>
<p>Now it occurred to me that a time series of maps lends itself very well to an animated sequence. While I am no graphic artist, I have done a few projects in the past that required stitching together a time-series of maps into an animated gif. The process is fairly simple:</p>
<ol>
<li>
<p>Download or create each map you want to include in the series. For best results, all maps should have the same size and extents.</p>
</li>
<li>
<p>Rename the images in alpha-numeric order (001.jpg, 002.jpg.... 045.jpg) </p>
</li>
<li>
<p>Install <a href="http://www.imagemagick.org/script/index.php">ImageMagick</a> - a collection of efficient command line tools for image processing. It supports almost every common image format available these days.</p>
</li>
<li>
<p>Run the <em>convert</em> command to create the animated gif:</p>
</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="nv">convert</span><span class="w"> </span><span class="o">-</span><span class="nv">geometry</span><span class="w"> </span><span class="mi">500</span><span class="nv">x483</span><span class="w"> </span><span class="o">-</span><span class="nv">delay</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="o">-</span><span class="k">loop</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="o">*</span>.<span class="nv">jpg</span><span class="w"> </span><span class="nv">mymovie</span>.<span class="nv">gif</span><span class="w"></span>
</code></pre></div>
<p>The geometry is simply the WIDTHxHEIGHT dimensions of the output image (it helps if this is proportional to the original image dimensions). </p>
<p>The delay parameter specifies how many hundredths of a second delay occurs between each frame. </p>
<p>The loop parameter, when set to zero, indicates the gif will loop infinitely.</p>
<p>The *.jpg, if your operating environment supports wildcards, will take each of the jpg images in the current directory and stitch them into an animated gif named mymovie.gif.</p>
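<p>The renaming step above can also be scripted rather than done by hand. Here's a minimal Python sketch (the directory name and extension in the comment are assumptions for illustration, not part of my original workflow):</p>

```python
import pathlib

def sequence_names(directory, ext=".jpg"):
    """Pair each image with a zero-padded name (001.jpg, 002.jpg, ...)
    so that wildcard expansion yields the frames in order."""
    files = sorted(pathlib.Path(directory).glob("*" + ext))
    return [(f, f.with_name("%03d%s" % (i, ext)))
            for i, f in enumerate(files, start=1)]

# To apply the renames (assuming a "maps" directory of frames):
# for old, new in sequence_names("maps"):
#     old.rename(new)
```

<p>Sorting first, then numbering, guarantees the frames land in the same alpha-numeric order that the wildcard will expand them in.</p>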
<p>Voilà! An animated movie from a series of static maps. In the case of the Paleogeographic maps, there were 41 maps which produced a sizable animated gif (about 7.5 MB). You can <a href="/assets/img/geo_evolution.gif">check out the results here</a>. I could watch this play for hours!! Really fascinating stuff... many thanks to Dr. Ron Blakey for putting this project together.</p>LIDAR data processing with open source tools2006-04-01T00:00:00-07:002006-04-01T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-04-01:/lidar-data-processing-with-open-source-tools.html<p>LIDAR data is certainly a hot technology these days. LIght Detection And Ranging data can be used to create extremely detailed terrain models but there are lots of barriers to using LIDAR data effectively. <a href="http://lidar.cr.usgs.gov/"> USGS Center for LIDAR Information Coordination and Knowledge </a> was put in place to "<em>facilitate data access …</em></p><p>LIDAR data is certainly a hot technology these days. LIght Detection And Ranging data can be used to create extremely detailed terrain models but there are lots of barriers to using LIDAR data effectively. <a href="http://lidar.cr.usgs.gov/"> USGS Center for LIDAR Information Coordination and Knowledge </a> was put in place to "<em>facilitate data access, user coordination and education of lidar remote sensing for scientific needs</em>". </p>
<p>Beyond the sheer size of the datasets and the knowledge and hardware required to process them, software is a big issue. In the realm of open-source GIS tools, there are many applications (GRASS being the most prominent) for dealing with elevation point data and processing it into more meaningful products such as elevation DEMs and contours. </p>
<p>Usually the data comes as simple ASCII text files and the x,y and z values are easily extracted from such a file. But take a look at the USGS data distribution site and you'll notice some of the datasets are distributed as <a href="http://www.lasformat.org/">LAS binary files</a>. It makes sense to store such massive datasets in binary, so I started looking for some LAS conversion tools. After some searching, I found a bunch of proprietary products for working with LAS but no open source tools. Luckily, the format is <a href="http://www.lasformat.org/documents/ASPRS%20LAS%20Format%20Documentation%20-%20V1.1%20-%2003.07.05.pdf">well documented</a> thanks to the efforts by the ASPRS to make it an open specification.</p>
<p>So dusting off my notes about parsing binary files in python, I set out to create a python module for extracting LIDAR data from LAS files. The LAS format contains a header which needs to be parsed first in order to read the point cloud. Once you have the header info, you can scan your way through the dataset to pick out the x,y,z values. </p>
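<p>For a flavor of what that scanning looks like: each point record in the basic LAS point format begins with X, Y, Z stored as little-endian signed 32-bit integers, which are turned into real coordinates using scale factors and offsets read from the header. A simplified sketch using Python's struct module — the scale and offset values below are invented for the example, not read from a real header:</p>

```python
import struct

def unpack_xyz(record, scale, offset):
    """Decode the leading X,Y,Z int32 triple of one LAS point record
    and rescale it into real-world coordinates."""
    ints = struct.unpack_from("<3i", record, 0)  # 3 little-endian int32s
    return tuple(i * s + o for i, s, o in zip(ints, scale, offset))

# A synthetic 12-byte record, as if read straight from the file:
raw = struct.pack("<3i", 123456, 654321, 15000)
x, y, z = unpack_xyz(raw, scale=(0.01, 0.01, 0.01), offset=(0.0, 0.0, 0.0))
```

<p>The integer storage plus per-file scale/offset is what keeps these massive files compact while preserving centimeter-level precision.</p>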
<p>Here's an example of the python interface that will read the first 10,000 points into a 2D shapefile with the elevation as an attribute in the dbf:</p>
<blockquote>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pylas</span>
<span class="n">infile</span> <span class="o">=</span> <span class="s1">'sanand000001.las'</span>
<span class="n">outfile</span> <span class="o">=</span> <span class="s1">'lidar.shp'</span>
<span class="n">header</span> <span class="o">=</span> <span class="n">pylas</span><span class="o">.</span><span class="n">parseHeader</span><span class="p">(</span><span class="n">infile</span><span class="p">)</span>
<span class="n">pylas</span><span class="o">.</span><span class="n">createShp</span><span class="p">(</span><span class="n">outfile</span><span class="p">,</span> <span class="n">header</span><span class="p">,</span> <span class="n">numpts</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">rand</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div>
</blockquote>
<p>The issue I struggled with is the sheer size of these datasets. A USGS quarter quad can contain 10 million points, which is an excessive number of points to create, say, a 10 meter DEM over such a small area. Clearly there was a need to extract a subset of this dataset, but taking the points sequentially gives you points from only one portion of the total area. So, by default, pylas randomly scans the data to pull the specified number of points so that the point cloud covers the entire area (at a much lower point density). Without numpts specified, it will randomly select 1/2000th of the total number.</p>
<p>So the simplified interface to make a more manageable lidar shapefile would be:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>header = pylas.parseHeader(infile)
pylas.createShp(outfile, header)
</code></pre></div>
</blockquote>
<p>Once the shapefile is created, you can bring it into GRASS to do the processing to generate DEMs, contours and other derived elevation products:</p>
<blockquote>
<div class="highlight"><pre><span></span><code><span class="n">v</span><span class="p">.</span><span class="n">in</span><span class="p">.</span><span class="n">ogr</span><span class="w"> </span><span class="n">dsn</span><span class="o">=</span><span class="n">lidar</span><span class="p">.</span><span class="n">shp</span><span class="w"> </span><span class="n">layer</span><span class="o">=</span><span class="n">lidar</span><span class="w"> </span><span class="k">output</span><span class="o">=</span><span class="n">lidar</span><span class="w"></span>
<span class="n">g</span><span class="p">.</span><span class="n">region</span><span class="w"> </span><span class="n">vect</span><span class="o">=</span><span class="n">lidar</span><span class="w"></span>
<span class="n">g</span><span class="p">.</span><span class="n">region</span><span class="w"> </span><span class="n">res</span><span class="o">=</span><span class="mh">10</span><span class="w"></span>
<span class="n">v</span><span class="p">.</span><span class="n">surf</span><span class="p">.</span><span class="n">rst</span><span class="w"> </span><span class="k">input</span><span class="o">=</span><span class="n">lidar</span><span class="w"> </span><span class="n">elev</span><span class="o">=</span><span class="n">lidar_dem</span><span class="w"> </span><span class="n">zcolumn</span><span class="o">=</span><span class="n">elev</span><span class="w"></span>
<span class="p">#</span><span class="w"> </span><span class="n">Launch</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">interactive</span><span class="w"> </span><span class="mh">3</span><span class="n">D</span><span class="w"> </span><span class="n">viewer</span><span class="w"></span>
<span class="n">nviz</span><span class="w"> </span><span class="n">lidar_dem</span><span class="w"></span>
</code></pre></div>
</blockquote>
<p><img alt="" src="/assets/img/nviz_lidar.png"></p>
<p>Of course the method I just described is very simplistic and does not even come close to utilizing the full potential of the LIDAR point cloud, but it's a start.</p>
<p>The pylas.py module can be <a href="http://pylas.googlecode.com/svn/trunk/pylas.py">downloaded here</a>. The code has worked for me on the few datasets I've tested it with but it should certainly be considered a rough-cut, alpha product. There is much room for improvement and, of course, if you have any suggestions or contributions, please get in touch.</p>My Top Ten'2006-03-26T00:00:00-07:002006-03-26T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-03-26:/my-top-ten.html<p>Web Mapping Services (WMS) are not always my preferred option for accessing data; relying on a remote server to generate a pretty picture of the data is hardly a substitute for having the raw data in hand. But for many cases, I just need a decent looking basemap image and …</p><p>Web Mapping Services (WMS) are not always my preferred option for accessing data; relying on a remote server to generate a pretty picture of the data is hardly a substitute for having the raw data in hand. But for many cases, I just need a decent looking basemap image and don't want to download gigabytes of data, especially if that data is updated frequently. </p>
<p>Software like GeoServer and Mapserver are making it easier to publish data via WMS and the number of WMS servers is surely growing... but how do you find them? There is no central registry for WMS servers but efforts like the <a href="http://www.refractions.net/white_papers/ogcsurvey/">refractions research ogc survey</a>, <a href="http://www.mapdex.org/wms_list.cfm">mapdex</a> and a few <a href="http://chris.narx.net/2006/01/19/wms-service-mining/">google tricks</a> are making it easier to find data distributed via WMS. After many hours digging through WMS services to find the ones that suit my mapping needs, I've come across a number of gems that I use time and time again. Hopefully this will inspire some others to share their secret stash of WMS servers! </p>
<p>(<strong>Update:</strong> <a href="http://my.opera.com/gisuser/blog/show.dml/199960">Anything Geospatial</a> has a great link to a well-organized <a href="http://www.skylab-mobilesystems.com/en/wms_serverlist.html"> WMS server list</a> for public use. Nice. )</p>
<p>You should be able to provide the online resource URL to your favorite WMS client software (my personal choice is <a href="http://openjump.org/wiki/show/HomePage">openjump</a>) and the client should display the list of layers available from that service. </p>
<p>If you're constructing WMS URLs "by hand" or in a browser, you can do a capabilities request (the online resource URL with <em>service=WMS&request=GetCapabilities</em> appended to it) which will return an XML document describing the available layers, image formats, projections, etc. Take a look at the image src for any of the thumbnails below to see how the map request is constructed.</p>
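<p>Assembling a GetMap request by hand is mostly string plumbing, so it's easy to script. A small Python sketch using only the standard library — the endpoint, layer, and bounding box mirror the TerraServer example below, but any WMS 1.1.1 server should accept the same parameter set:</p>

```python
from urllib.parse import urlencode

def getmap_url(base, layers, bbox, width=150, height=150,
               srs="EPSG:4326", fmt="jpeg"):
    """Build a WMS 1.1.1 GetMap URL from its required parameters."""
    params = {
        "VERSION": "1.1.1", "SERVICE": "WMS", "REQUEST": "GetMap",
        "LAYERS": layers, "STYLES": "", "SRS": srs,
        "BBOX": ",".join(str(c) for c in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width, "HEIGHT": height, "FORMAT": fmt,
    }
    return base + urlencode(params)

url = getmap_url("http://terraservice.net/ogcmap.ashx?", "DRG",
                 (-124.1, 41.2, -123.9, 41.4))
```

<p>Note that urlencode percent-escapes the commas and colons, which servers are required to decode; the hand-written URLs in the thumbnails below just leave them raw.</p>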
<ol>
<li>TerraServer Digital Raster Graphic (DRG): USGS Topo Quads
<strong> Online Resource URL </strong> : <em>http://terraservice.net/ogcmap.ashx?</em> </li>
</ol>
<p><strong> Layer Name </strong> : <em>DRG</em>
<img alt="" src="http://terraservice.net/ogcmap.ashx?VERSION=1.1.1&SERVICE=wms&request=GetMap&LAYERS=DRG&FORMAT=jpeg&styles=&SRS=EPSG:4326&BBOX=-124.1,41.2,-123.9,41.4&WIDTH=150&HEIGHT=150"></p>
<ol start="2">
<li>TerraServer Digital Ortho Photo Quads (DOQ): Black and white aerial photos for the US
<strong> Online Resource URL </strong> : <em>http://terraservice.net/ogcmap.ashx?</em> </li>
</ol>
<p><strong> Layer Name </strong> : <em>DOQ</em>
<img alt="" src="http://terraservice.net/ogcmap.ashx?VERSION=1.1.1&SERVICE=wms&request=GetMap&LAYERS=DOQ&FORMAT=jpeg&styles=&SRS=EPSG:4326&BBOX=-124.1,41.2,-123.9,41.4&WIDTH=150&HEIGHT=150"></p>
<ol start="3">
<li>NASA Landsat Imagery
The Landsat mosaic is available in false color (default) or in natural color (style=visual) as shown below. </li>
</ol>
<p><strong> Online Resource URL </strong> : <em>http://onearth.jpl.nasa.gov/wms.cgi?</em> </p>
<p><strong> Layer Name </strong> : <em>global_mosaic</em>
<img alt="" src="http://onearth.jpl.nasa.gov/wms.cgi?VERSION=1.1.1&SERVICE=wms&request=GetMap&LAYERS=global_mosaic&FORMAT=image/png&styles=visual&SRS=EPSG:4326&BBOX=-124.1,41.2,-123.9,41.4&WIDTH=150&HEIGHT=150"></p>
<ol start="4">
<li>45-minute Weather Radar Images (NEXRAD Base Reflectivity).
Since this is a dynamic data source, the image below may look really boring (i.e. blank) if there are no storms over the Continental US. </li>
</ol>
<p><strong> Online Resource URL </strong> : <em>http://mesonet.agron.iastate.edu/cgi-bin/wms/nexrad/n0r.cgi?</em> </p>
<p><strong> Layer Name </strong> : <em>nexrad-n0r-m45m</em> </p>
<p><img alt="" src="http://mesonet.agron.iastate.edu/cgi-bin/wms/nexrad/n0r.cgi?VERSION=1.1.1&SERVICE=wms&request=GetMap&LAYERS=nexrad-n0r-m45m&FORMAT=jpeg&styles=&SRS=EPSG:4326&BBOX=-125,25,-65,55&WIDTH=300&HEIGHT=150"></p>
<ol start="5">
<li>USGS National Landcover
The 30-meter national landcover dataset. USGS is nice enough to provide a legend, of course. </li>
</ol>
<p><strong> Online Resource URL </strong> : <em>http://gisdata.usgs.net/servlet/com.esri.wms.Esrimap?ServiceName=USGS_WMS_NLCD&</em> </p>
<p><strong> Layer Name </strong> : <em>US_NLCD</em> </p>
<p><img alt="" src="http://gisdata.usgs.net/servlet/com.esri.wms.Esrimap?ServiceName=USGS_WMS_NLCD&request=GetMap&LAYERS=US_NLCD&FORMAT=image/png&SRS=EPSG:4326&BBOX=-124.1,41.2,-123.9,41.4&WIDTH=150&HEIGHT=150"></p>
<p><img alt="" src="http://gisdata.usgs.net/Image_Library/legends/Legend_NLCD5.png"></p>
<ol start="6">
<li>USGS National Elevation - Shaded Relief
<strong> Online Resource URL </strong> : <em>http://gisdata.usgs.net:80/servlet/com.esri.wms.Esrimap?servicename=USGS_WMS_NED&</em> </li>
</ol>
<p><strong> Layer Name </strong> : <em>US_NED_Shaded_Relief</em>
<img alt="" src="http://gisdata.usgs.net:80/servlet/com.esri.wms.Esrimap?servicename=USGS_WMS_NED&request=GetMap&LAYERS=US_NED_Shaded_Relief&FORMAT=image/jpeg&SRS=EPSG:4326&BBOX=-124.1,41.2,-123.9,41.4&WIDTH=150&HEIGHT=150"></p>
<ol start="7">
<li>USGS Reference Maps
<strong> Online Resource URL </strong> : <em>http://gisdata.usgs.net:80/servlet/com.esri.wms.Esrimap?servicename=USGS_WMS_REF&</em> </li>
</ol>
<p><strong> Layer Names </strong> : <em>States,County,Roads,Route_Numbers,Streams,Federal_Lands</em>
<img alt="" src="http://gisdata.usgs.net:80/servlet/com.esri.wms.Esrimap?servicename=USGS_WMS_REF&request=GetMap&LAYERS=States,County,Roads,Route_Numbers,Streams,Federal_Lands&FORMAT=image/png&SRS=EPSG:4326&BBOX=-124.1,41.2,-123.9,41.4&WIDTH=150&HEIGHT=150"></p>
<ol start="8">
<li>Life Mapper
Besides the standard WMS parameters, some services can take extra parameters in order to render a map. In this excellent service, LifeMapper requires that you provide the species name and it will render maps of known species locations and modelled distributions. Here's an example of the distribution of Black Bear (i.e. <em>Ursus americanus</em>) over central California </li>
</ol>
<p><strong> Online Resource URL </strong> : <em>http://www.lifemapper.org/Services/WMS/?ScientificName=Ursus%20americanus&</em> </p>
<p><strong> Layer Names </strong> : <em>Species Distribution Models,Political Boundaries,Species Data Points</em> </p>
<p><img alt="" src="http://www.lifemapper.org/Services/WMS/?Version=1.1.0&Request=GetMap&width=150&height=150&Bbox=-124.1,35.4,-118.1,41.4&Layers=Species%20Distribution%20Models,Political%20Boundaries,Species%20Data%20Points&Styles=&ScientificName=Ursus%20americanus&SRS=EPSG:4326&format=image/gif"></p>
<ol start="9">
<li>MODIS Daily Satellite Imagery
<strong> Online Resource URL </strong> : <em>http://wms.jpl.nasa.gov/wms.cgi?</em> </li>
</ol>
<p><strong> Layer Names </strong> : <em>daily_terra, daily_aqua</em> </p>
<p>Terra
Aqua </p>
<p><img alt="" src="http://wms.jpl.nasa.gov/wms.cgi?request=GetMap&LAYERS=daily_terra&FORMAT=image/png&SRS=EPSG:4326&BBOX=-124.1,35.4,-118.1,41.4&WIDTH=150&HEIGHT=150&styles="></p>
<p><img alt="" src="http://wms.jpl.nasa.gov/wms.cgi?request=GetMap&LAYERS=daily_aqua&FORMAT=image/png&SRS=EPSG:4326&BBOX=-124.1,35.4,-118.1,41.4&WIDTH=150&HEIGHT=150&styles="></p>
<ol start="10">
<li>SRTMPlus 90 Meter DEM
The image below doesn't make for a very good basemap OR a very good DEM for analytical purposes since all the values are scaled to an 8-bit color depth. However, JPL also offers this layer as an integer (16bit) GeoTIFF (Use <em>format=image/geotiff</em> and <em>styles=short_int</em>), so this can be a valuable way to quickly grab a DEM for a given region.
<strong> Online Resource URL </strong> : <em>http://wms.jpl.nasa.gov/wms.cgi?</em> </li>
</ol>
<p><strong> Layer Names </strong> : <em>srtmplus</em>
<img alt="" src="http://wms.jpl.nasa.gov/wms.cgi?request=GetMap&LAYERS=srtmplus&FORMAT=image/png&SRS=EPSG:4326&BBOX=-124.1,35.4,-118.1,41.4&WIDTH=150&HEIGHT=150&styles="></p>
<hr>
<p>If you'd like to view these layers interactively, here's a mapserver application which "cascades" the above WMS layers through a single interface. If you're interested in setting up these layers in a mapserver application, check out the <a href="http://perrygeo.net/download/fav_wms.txt">WMS Mapfile </a> for some examples.</p>StarSpan for vector-on-raster analysis2006-02-17T00:00:00-07:002006-02-17T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-02-17:/starspan-for-vector-on-raster-analysis.html<p>It's amazing how many excellent open source GIS applications are out there just waiting to be discovered. I've been working with open source GIS for over 3 years now and I still find new and interesting software on a regular basis. The latest "Why haven't I heard of this before …</p><p>It's amazing how many excellent open source GIS applications are out there just waiting to be discovered. I've been working with open source GIS for over 3 years now and I still find new and interesting software on a regular basis. The latest "Why haven't I heard of this before?" discovery came from the GRASS mailing list discussion on <a href="http://starspan.casil.ucdavis.edu/">StarSpan</a>, a tool developed at University of California at Davis "<em>designed to bridge the raster and vector worlds of spatial analysis using fast algorithms for pixel level extraction from geometry features</em>". </p>
<p>Our research project for the <a href="http://ebm.nceas.ucsb.edu">Ecosystem Based Management group at UCSB</a> is in need of this exact tool in order to extract raster statistics based on a vector watersheds layer. ArcGIS and GRASS both have <em>some</em> of the capabilities we need through the Zonal_Statistics and v.rast.stats functions respectively. However they have their limitations and neither really handles categorical raster summaries by polygon. StarSpan looks like a more efficient option in terms of speed, scriptability and capabilities.</p>
<p>Installation is very smooth. It requires a recent version of GDAL (>= 1.2.6) and GEOS (>= 2.1.2). Once the dependencies are met, compilation on a unix system is as easy as configure, make, make install (There are also Windows binaries available). There is a single command line interface for all the functionality and StarSpan is able to handle all <a href="http://www.gdal.org/formats_list.html">GDAL rasters</a> and <a href="http://www.gdal.org/ogr/ogr_formats.html">OGR vectors</a>.</p>
<p>For classified rasters such as a land cover raster, we'd like to get the number of pixels for each landcover class by watershed. StarSpan creates a nice, normalized csv with three columns; The vector feature id, the raster value, and the number of pixels. There will be up to (number of features X number of classes) rows.</p>
<blockquote>
<p>starspan --vector watershed.shp --raster landcover.tif --count-by-class landcover_by_watershed.csv</p>
</blockquote>
<p>In order to find the percentage of a given raster class for each watershed, you can bring the csv into a relational database and do a quick SQL query. Here's an example of finding the percentage of cropland (class value is 12) for each watershed:</p>
<blockquote>
<div class="highlight"><pre><span></span><code><span class="nt">SELECT</span><span class="w"> </span><span class="nt">t</span><span class="p">.</span><span class="nc">fid</span><span class="w"> </span><span class="nt">AS</span><span class="w"> </span><span class="nt">fid</span><span class="o">,</span><span class="w"> </span><span class="o">(</span><span class="nt">t</span><span class="p">.</span><span class="nc">count</span><span class="p">::</span><span class="nd">numeric</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nt">s</span><span class="p">.</span><span class="nc">total</span><span class="p">::</span><span class="nd">numeric</span><span class="o">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nt">100</span><span class="w"> </span><span class="nt">AS</span><span class="w"> </span><span class="nt">percentage_cropland</span><span class="w"></span>
<span class="nt">FROM</span><span class="w"> </span><span class="nt">landcover_by_watershed</span><span class="w"> </span><span class="nt">t</span><span class="o">,</span><span class="w"></span>
<span class="w"> </span><span class="o">(</span><span class="nt">SELECT</span><span class="w"> </span><span class="nt">fid</span><span class="o">,</span><span class="w"> </span><span class="nt">sum</span><span class="o">(</span><span class="nt">count</span><span class="o">)</span><span class="w"> </span><span class="nt">AS</span><span class="w"> </span><span class="nt">total</span><span class="w"> </span>
<span class="w"> </span><span class="nt">FROM</span><span class="w"> </span><span class="nt">lancover_by_watershed</span><span class="w"> </span>
<span class="w"> </span><span class="nt">GROUP</span><span class="w"> </span><span class="nt">BY</span><span class="w"> </span><span class="nt">fid</span><span class="o">)</span><span class="w"> </span><span class="nt">as</span><span class="w"> </span><span class="nt">s</span><span class="w"> </span>
<span class="nt">WHERE</span><span class="w"> </span><span class="nt">t</span><span class="p">.</span><span class="nc">fid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nt">s</span><span class="p">.</span><span class="nc">fid</span><span class="w"></span>
<span class="nt">AND</span><span class="w"> </span><span class="nt">t</span><span class="p">.</span><span class="nc">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nt">12</span><span class="o">;</span><span class="w"></span>
</code></pre></div>
</blockquote>
<p>Which gives us...</p>
<blockquote>
<div class="highlight"><pre><span></span><code> fid | percentage_cropland
-----+------------------------------------------------
1 | 28.571428571428571429
2 | 71.428571428571428571
3 | 36.363636363636363636
4 | 63.636363636363636364
</code></pre></div>
</blockquote>
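<p>If you'd rather skip the database round-trip, the same percentage can be computed straight from StarSpan's csv. A Python sketch — note the column names (fid, class, count) are my assumption about the csv header; adjust them to match what StarSpan actually writes:</p>

```python
import csv
from collections import defaultdict

def pct_of_class(csv_path, target_class):
    """Percent of pixels in target_class per feature id, from a
    StarSpan --count-by-class csv with columns fid,class,count."""
    hits, totals = defaultdict(int), defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            n = int(row["count"])
            totals[row["fid"]] += n          # pixels in this watershed
            if int(row["class"]) == target_class:
                hits[row["fid"]] += n        # pixels of the target class
    return {fid: 100.0 * hits[fid] / total for fid, total in totals.items()}
```

<p>This is exactly the join-and-divide the SQL performs, done in one pass over the normalized csv.</p>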
<p>For continuous surfaces such as elevations and slopes, we'll need to get quantitative statistics of those rasters by watershed. StarSpan can easily generate averages, mode, standard deviation, min and max:</p>
<blockquote>
<p>starspan --vector watershed.shp --raster slope.tif --stats slope_stats.csv avg mode stdev min max</p>
</blockquote>
<p>Which outputs a csv with one row per feature identified by feature id and each stat as a column:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>FID,numPixels,avg_Band1,mode_Band1,stdev_Band1,min_Band1,max_Band1
1,25921,34.694822,38.917000,14.491952,0.347465,66.241035
2,21755,7.965552,0.000000,5.484245,0.000000,42.017155
...
</code></pre></div>
</blockquote>
<p>While I can confirm that these small test cases work very quickly and give us pretty much the exact outputs we need, it will be interesting to see how well it stacks up to ArcGIS and GRASS when it comes to cranking out the big datasets. We'll likely try all three methods and I'll make sure to post the results. </p>
<p>Oh and the comparison between StarSpan and GRASS may become a moot point in the future since there is talk about integrating it with the GRASS project. While a GRASS module would be nice, not everyone has GRASS installed so I would hope the stand-alone version is still maintained since it can deal with pretty much any vector or raster data source.</p>Forest Service plans largest land sale in decades2006-02-13T00:00:00-07:002006-02-13T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-02-13:/forest-service-plans-largest-land-sale-in-decades.html<p>The <a href="http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=landsales11m&date=20060211">Seattle Times is reporting</a> some details on President Bush's "Secure Rural Schools Initiative" which involves the largest US Forest Service land sales in decades in order to pay for rural schools and roads. Some 309,421 acres will be up for sale which amounts to only 0.16 % of …</p><p>The <a href="http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=landsales11m&date=20060211">Seattle Times is reporting</a> some details on President Bush's "Secure Rural Schools Initiative" which involves the largest US Forest Service land sales in decades in order to pay for rural schools and roads. Some 309,421 acres will be up for sale, which amounts to only 0.16 % of the 190 million acres managed by the Forest Service. Most of the parcels are isolated areas bordering private land. </p>
<p>Details and some limited maps of the initiative can be found <a href="http://www.fs.fed.us/land/staff/rural_schools.shtml">here</a> as well as <a href="http://www.fs.fed.us/land/staff/spd.html"> a listing of forest service land</a> that are potentially eligible for sale. </p>
<p>No doubt environmentalists, developers and timber companies will be scrutinizing these pieces of land in the coming months. More details and maps should be available around Feb 28th. Since the Forest Service is required to request public input on the sales, it would be nice if they could provide a GIS version of the maps for download... A public web GIS would inform the process immensely. I'll be keeping an eye out for some detailed GIS data. Let me know if you know of a good source.</p>GDAL-based DEM utilities2006-02-08T00:00:00-07:002006-02-08T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-02-08:/gdal-based-dem-utilities.html<div class="alert alert-error">These DEM tools have been incorporated into GDAL. The code referenced on this page is no longer maintained and I'd highly recommend using <a href="http://www.gdal.org/gdaldem.html">gdaldem</a> instead.</div>
<p>A few months ago, I began looking for some efficient command-line tools to analyze and visualize DEMs. I typically use GRASS for such tasks but …</p><div class="alert alert-error">These DEM tools have been incorporated into GDAL. The code referenced on this page is no longer maintained and I'd highly recommend using <a href="http://www.gdal.org/gdaldem.html">gdaldem</a> instead.</div>
<p>A few months ago, I began looking for some efficient command-line tools to analyze and visualize DEMs. I typically use GRASS for such tasks, but GRASS only works with its native raster format. Sure, you can import/export to common formats, but that's not as efficient as a single command-line tool that works with the DEM's native format, runs on systems without GRASS installed, and is easily scriptable. </p>
<p>Not having found anything that fit the bill, I decided to port some of the common GRASS DEM modules to C++ using the GDAL libraries. For someone with very little C++ experience, this was surprisingly manageable, though I learned quite a lot along the way. The result: three command-line utilities to generate hillshades, slope maps, and aspect maps, plus one excellent utility contributed by Paul Surgeon to apply color ramping to a DEM.</p>
<h3>Installation</h3>
<h4>Requirements</h4>
<p>I built these utilities on Ubuntu Linux. I admittedly have no idea how to compile them on Windows, but some folks have confirmed that the hillshade code compiles under VC++. To get these running under Linux (and presumably other Unixes), the requirements are minimal:</p>
<ol>
<li>GDAL shared libraries </li>
<li>GNU C++ Compiler</li>
</ol>
<h4>Download</h4>
<p>Get the <a href="/download/gdaldemtools_20060207.zip">current source</a> and unzip it. <em><strong>EDIT</strong></em>: This code is now available through my SVN repository: <a href="http://perrygeo.googlecode.com/svn/trunk/demtools/">http://perrygeo.googlecode.com/svn/trunk/demtools/</a>.</p>
<h4>Compiling</h4>
<p>Alas, there is no makefile, but installation should be fairly painless. To compile the source code under Linux, the following commands should take care of it:</p>
<div class="highlight"><pre><span></span><code>g++ hillshade.cpp -lgdal -o hillshade
g++ color-relief.cxx -lgdal -o color-relief
g++ aspect.cpp -lgdal -o aspect
g++ slope.cpp -lgdal -o slope
</code></pre></div>
<p>The four binaries can then be placed wherever your local binaries reside (typically /usr/local/bin).</p>
<h3>Examples</h3>
<h4>The original DEM</h4>
<p>In this particular example the input DEM is a GeoTIFF but these utilities can use any <a href="http://gdal.maptools.org/formats_list.html">GDAL-supported raster source</a>.</p>
<p><img alt="" src="/assets/img/dem/dem.jpg"></p>
<h4>Slope</h4>
<p>This command will take a DEM raster and output a 32-bit GeoTIFF of slope values. You have the option of specifying the type of slope value you want: degrees or percent slope. In cases where the horizontal units differ from the vertical units, you can also supply a scaling factor.</p>
<div class="highlight"><pre><span></span><code>slope dem.tif slope.tif
</code></pre></div>
<p><img alt="" src="/assets/img/dem/slope.jpg"></p>
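<p>The computation behind this is a simple 3x3 moving window. Here's a rough pure-Python sketch of the Horn-style gradient math such a tool typically uses (an illustration of the idea, not the actual C++ source):</p>

```python
import math

def slope_degrees(window, cellsize=1.0, scale=1.0):
    """Slope at the center of a 3x3 elevation window using Horn's
    finite-difference gradients; `scale` plays the role of the
    vertical/horizontal scaling factor mentioned above."""
    (z1, z2, z3), (z4, z5, z6), (z7, z8, z9) = window
    dzdx = ((z3 + 2 * z6 + z9) - (z1 + 2 * z4 + z7)) / (8.0 * cellsize)
    dzdy = ((z7 + 2 * z8 + z9) - (z1 + 2 * z2 + z3)) / (8.0 * cellsize)
    return math.degrees(math.atan(scale * math.hypot(dzdx, dzdy)))

# A plane rising one unit of elevation per cell slopes at ~45 degrees:
print(slope_degrees([[0, 1, 2], [0, 1, 2], [0, 1, 2]]))
```

<p>Percent slope, the other output option, is just 100 times the tangent of the same angle.</p>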
<h4>Aspect</h4>
<p>This command outputs a 32-bit GeoTIFF with values between 0 and 360 representing the azimuth of the terrain.</p>
<div class="highlight"><pre><span></span><code>aspect dem.tif aspect.tif
</code></pre></div>
<p><img alt="" src="/assets/img/dem/aspect.jpg"></p>
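<p>Aspect falls out of the same 3x3 window as slope; only the final trigonometry differs. A hypothetical pure-Python version, assuming row 0 of the window is the northern edge and azimuth is measured clockwise from north:</p>

```python
import math

def aspect_degrees(window, cellsize=1.0):
    """Downslope azimuth (0-360, clockwise from north) at the center
    of a 3x3 elevation window (row 0 = north). Flat cells have no
    defined aspect; this sketch simply returns 0.0 for them."""
    (z1, z2, z3), (z4, z5, z6), (z7, z8, z9) = window
    dzdx = ((z3 + 2 * z6 + z9) - (z1 + 2 * z4 + z7)) / (8.0 * cellsize)
    dzdy = ((z7 + 2 * z8 + z9) - (z1 + 2 * z2 + z3)) / (8.0 * cellsize)
    # Downhill direction: east component is -dzdx, north component is dzdy.
    return math.degrees(math.atan2(-dzdx, dzdy)) % 360.0

# A slope falling away to the east faces ~90 degrees:
print(aspect_degrees([[2, 1, 0], [2, 1, 0], [2, 1, 0]]))
```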
<h4>Hillshade</h4>
<p>This command outputs an 8-bit GeoTIFF with a nice shaded-relief effect. It's very useful for visualizing the terrain. You can optionally specify the azimuth and altitude of the light source, a vertical exaggeration factor, and a scaling factor to account for differences between vertical and horizontal units.</p>
<div class="highlight"><pre><span></span><code>hillshade dem.tif shade.tif
</code></pre></div>
<p><img alt="" src="/assets/img/dem/shade.jpg"></p>
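<p>Under the hood, the shading compares each cell's orientation against the light vector. A rough pure-Python sketch of the standard formulation (illustrative only; the real utility's scaling may differ slightly):</p>

```python
import math

def hillshade(window, cellsize=1.0, azimuth=315.0, altitude=45.0, z=1.0):
    """8-bit shaded-relief value for the center of a 3x3 elevation
    window (row 0 = north); `z` is the vertical exaggeration factor."""
    (z1, z2, z3), (z4, z5, z6), (z7, z8, z9) = window
    dzdx = ((z3 + 2 * z6 + z9) - (z1 + 2 * z4 + z7)) / (8.0 * cellsize)
    dzdy = ((z7 + 2 * z8 + z9) - (z1 + 2 * z2 + z3)) / (8.0 * cellsize)
    slope = math.atan(z * math.hypot(dzdx, dzdy))
    aspect = math.atan2(-dzdx, dzdy)  # downslope azimuth, in radians
    alt = math.radians(altitude)
    shade = (math.sin(alt) * math.cos(slope)
             + math.cos(alt) * math.sin(slope)
             * math.cos(math.radians(azimuth) - aspect))
    return max(0, int(round(255 * shade)))

# Flat terrain lit from 45 degrees up reflects sin(45) of full white:
print(hillshade([[10, 10, 10], [10, 10, 10], [10, 10, 10]]))  # → 180
```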
<h4>Color ramps</h4>
<p>After I posted the hillshade utility to the gdal-dev mailing list, there was some discussion about creating color relief maps to supplement the hillshades. Paul Surgeon took up the challenge and created a GDAL-based C++ utility to colorize DEMs (or any other single-band raster data source, for that matter). The technique is simple and powerful; by using a text-based color configuration file, you can create any range of color ramps for your data. </p>
<div class="highlight"><pre><span></span><code>color-relief dem.tif scale.txt colordem.tif
</code></pre></div>
<p>Here scale.txt is a text file containing four columns per line: an elevation value and the corresponding RGB values:</p>
<div class="highlight"><pre><span></span><code><span class="mf">3500</span><span class="w"> </span><span class="mf">255</span><span class="w"> </span><span class="mf">255</span><span class="w"> </span><span class="mf">255</span><span class="w"></span>
<span class="mf">2500</span><span class="w"> </span><span class="mf">235</span><span class="w"> </span><span class="mf">220</span><span class="w"> </span><span class="mf">175</span><span class="w"></span>
<span class="mf">1500</span><span class="w"> </span><span class="mf">190</span><span class="w"> </span><span class="mf">185</span><span class="w"> </span><span class="mf">135</span><span class="w"></span>
<span class="mf">700</span><span class="w"> </span><span class="mf">240</span><span class="w"> </span><span class="mf">250</span><span class="w"> </span><span class="mf">150</span><span class="w"></span>
<span class="mf">0</span><span class="w"> </span><span class="mf">50</span><span class="w"> </span><span class="mf">180</span><span class="w"> </span><span class="mf">50</span><span class="w"></span>
<span class="o">-</span><span class="mf">32768</span><span class="w"> </span><span class="mf">200</span><span class="w"> </span><span class="mf">230</span><span class="w"> </span><span class="mf">255</span><span class="w"></span>
</code></pre></div>
<p>The colors between the given elevation values are blended smoothly and the result is a nice colorized DEM:
<img alt="" src="/assets/img/dem/colordem.jpg"></p>
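<p>The blending between rows of the color file is just linear interpolation on each channel. A hypothetical pure-Python version of the idea (not Paul Surgeon's actual code; the ramp below is the scale.txt above, sorted ascending):</p>

```python
def color_for(elev, ramp):
    """Linearly interpolate an (r, g, b) color from a ramp given as
    [(elevation, r, g, b), ...] sorted by ascending elevation."""
    if elev <= ramp[0][0]:
        return tuple(ramp[0][1:])
    if elev >= ramp[-1][0]:
        return tuple(ramp[-1][1:])
    for (e0, *c0), (e1, *c1) in zip(ramp, ramp[1:]):
        if e0 <= elev <= e1:
            t = (elev - e0) / (e1 - e0)  # position between the two rows
            return tuple(round(a + t * (b - a)) for a, b in zip(c0, c1))

ramp = [(-32768, 200, 230, 255), (0, 50, 180, 50), (700, 240, 250, 150),
        (1500, 190, 185, 135), (2500, 235, 220, 175), (3500, 255, 255, 255)]
print(color_for(350, ramp))  # halfway between the 0 m and 700 m colors
```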
<h4>Color Shaded Relief (blending hillshade and colorized DEM)</h4>
<p>There are two ways I've come up with to blend the hillshade and the colorized DEM:</p>
<ol>
<li>
<p>Using GIMP or Photoshop, open both images, copy the shaded relief, paste on top of the color DEM and adjust the opacity in the layers dialog.</p>
</li>
<li>
<p>If you're publishing to the web with Mapserver, just stack the two images in your mapfile and set the TRANSPARENCY for the hillshade to a value between 30 and 70, depending on your preference.</p>
</li>
</ol>
<p>Though both methods work nicely, neither is really ideal, since they don't generate a georeferenced TIFF. You can get around this in the GIMP method by creating a <a href="http://gdal.maptools.org/frmt_various.html#WLD">world file (.tfw)</a> for the output TIFF. It might be nice to do this step programmatically in the future, but for now...
<img alt="" src="/assets/img/dem/combine.jpg"></p>
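<p>If you did want to script the compositing itself, the per-pixel math is plain alpha blending: weight the hillshade value against each color channel. A hypothetical sketch:</p>

```python
def blend_pixel(shade, rgb, opacity=0.5):
    """Composite an 8-bit hillshade value over an (r, g, b) pixel;
    `opacity` is the weight given to the hillshade layer."""
    return tuple(round(opacity * shade + (1 - opacity) * c) for c in rgb)

# A mid-gray hillshade over the 2500 m color from scale.txt, blended 50/50:
print(blend_pixel(200, (235, 220, 175)))  # → (218, 210, 188)
```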
<p>Let me know if you've got any suggestions or comments. The technique for all of these utilities is a simple 3x3 moving window so this code might serve as a good template to develop other raster processing utilities... let me know what you come up with!</p>First thoughts on the Open Source Geospatial Foundation2006-02-04T00:00:00-07:002006-02-04T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-02-04:/first-thoughts-on-the-open-source-geospatial-foundation.html<p>Well after a long and productive day in Chicago, the 25 attendees (and a few dozen more from IRC) were able to establish a solid plan for the foundation. Gary Sherman at Spatial Galaxy has <a href="http://spatialgalaxy.net/?p=8">a good overview of the meeting outcome</a> and has set up a very helpful <a href="http://logs.qgis.org/geofoundation/">IRC …</a></p><p>Well after a long and productive day in Chicago, the 25 attendees (and a few dozen more from IRC) were able to establish a solid plan for the foundation. Gary Sherman at Spatial Galaxy has <a href="http://spatialgalaxy.net/?p=8">a good overview of the meeting outcome</a> and has set up a very helpful <a href="http://logs.qgis.org/geofoundation/">IRC log</a> of the meeting and the focus group discussions (Go Gary!). Tyler Mitchell has posted some <a href="http://www1.mapserverfoundation.org/chicago-pics/images.html">photos of the meeting</a>. I attended via IRC and phone for only a few hours so my understanding of the entire meeting is limited but I'll add a few thoughts on what went down.</p>
<p>First of all, the name was decided early on to be the "<em>Open Source Geospatial Foundation</em>". IMO, this name fits very well. Now that Autodesk's open source contribution has <a href="http://www.oreillynet.com/pub/wlg/9055?wlg=yes"> been rebranded</a> from <em>Mapserver Enterprise</em> to <em>MapGuide Open Source</em>, I am glad to see the final chapter in the whole naming debacle! </p>
<p>I was also very interested in the funding discussion. The general consensus seemed to be that the foundation would generate income through sponsorships. The benefits of being a sponsor/supporter of OSGF include official recognition and the obvious PR value, in addition to being able to direct your funds to a particular project. It would work something like this: 2/3 of your donation could be directed to a particular software project while 1/3 would go to the foundation itself. Of the 2/3 going to the project, the Project Steering Committee (PSC) would decide how best to allocate those funds. There was brief discussion of doing some sort of "bounty" system that would allow sponsors to fund a particular feature, but this was generally thought to be a bad idea since there are so many aspects of software development that are not "sexy" enough to generate income... like cleaning up and optimizing code, bug fixes, etc. By allowing the PSC to allocate the funds, the focus can be on a solid code base and careful feature additions. Of course those who want to fund specific features can still contract directly with the developers.</p>
<p>One of the ironies of the initial foundation's project membership is that Mapserver (the project that was the center of the original Mapserver Foundation) is not yet a member! While this may seem strange at first, the reasoning is that this gives the Mapserver community a chance to vote on the matter. Other community-based efforts such as QGIS are likely waiting to hear from their users as well. Once the official statements from the OSGF are released, I suspect there will be a vote from these communities (and others) to decide whether they should join.</p>
<p>The criteria for projects to join the foundation were not entirely clear, but it appears they will be based on the commonalities of the initial projects. Requirements such as licensing and open standards are still foggy, but will likely be written in such a way that they don't conflict with any of the initial projects... a sort of reverse engineering of the criteria, if you will. </p>
<p>There were many interesting discussions as far as the implementation of the foundation web presence, the legal protections that would be provided by the foundation, the expected costs of running the foundation, promotion and the structure of the governing board. I'll wait until the official announcement to see how these issues were resolved.</p>
<p>Overall it was an exciting and historic day for open source GIS. Many thanks to all the attendees and IRC participants for all the interesting and productive discussions. The future of the foundation is looking very bright and I look forward to seeing where it's heading in the coming months...</p>
<p>One quick update: Schuyler Erle, who deserves an extra round of thanks for his amazing efforts to keep us IRC attendees informed of the meeting, has some great first-hand <a href="http://mappinghacks.com/index.cgi/2006/02/04#osgeo-foundation">insights on the OSGF</a>.</p>
<p>OK another quick update: The official foundation website will <em>eventually</em> reside at <a href="http://www.osgeo.org">www.osgeo.org</a>. </p>Mexico-US Border Crossing Maps2006-01-24T00:00:00-07:002006-01-24T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-01-24:/mexico-us-border-crossing-maps.html<p>Thousands of Mexicans come to the United States every year and, besides the legal troubles of border crossing, they face a tough journey across the desert in order to reach their destination. They have very little information to go on and many die from dehydration as they attempt to find …</p><p>Thousands of Mexicans come to the United States every year and, besides the legal troubles of border crossing, they face a tough journey across the desert in order to reach their destination. They have very little information to go on and many die from dehydration as they attempt to find their way through the vast deserts of the american southwest. </p>
<p>A faith-based organization called <a href="http://www.humaneborders.org/about/about_index.html"> Humane Borders</a> is trying to help the situation. They have produced <a href="http://www.humaneborders.org/news/news4.html">a number of maps</a> documenting town locations, roads, water stations, walking distances, cell phone towers and even places where other immigrants have died along the way. This was made possible in part by GIS software donated by ESRI.</p>
<p>I heard today on CNN with Lou Dobbs that the Mexican government is now printing and distributing these maps to citizens. Though the maps will clearly state "Don't Do It! It's Hard! There's Not Enough Water!", critics are saying the maps aid criminals and will encourage illegal aliens to cross the border. Others have pointed out that, from an economic standpoint, this may benefit the US Border Patrol, since so much of its budget is devoted to aiding sick and injured immigrants and properly taking care of the dead. Humane Borders is hoping to make people aware of the risks so that they can either choose not to go or be better prepared should they decide to cross.</p>
<p>In any case, it is an interesting example of how geographic information is still so important (and controversial) in our society.</p>Geocoding an address list to shapefile2006-01-20T00:00:00-07:002006-01-20T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2006-01-20:/geocoding-an-address-list-to-shapefile.html<p>Most commercial software comes with fairly elaborate geocoding engines and there are nice geocoding services on the web that can do one-at-a-time geocoding but the <a href="http://www.spatiallyadjusted.com/2006/01/20/batch-geocode-tabular-address-data-via-your-web-browser/">recent post</a> at Spatially Adjusted pointed out a great free resource for batch geocoding named, conveniently enough, <a href="http://www.batchgeocode.com/">Batch Geocode</a>. Just give it a list of …</p><p>Most commercial software comes with fairly elaborate geocoding engines and there are nice geocoding services on the web that can do one-at-a-time geocoding but the <a href="http://www.spatiallyadjusted.com/2006/01/20/batch-geocode-tabular-address-data-via-your-web-browser/">recent post</a> at Spatially Adjusted pointed out a great free resource for batch geocoding named, conveniently enough, <a href="http://www.batchgeocode.com/">Batch Geocode</a>. Just give it a list of tab or pipe delimited addresses and it outputs a table with your original data plus a lat/long for every row.</p>
<p>I have been working on a python script to convert text files into point shapefiles and thought this would be a great chance to put it to work. The only dependency is a recent version of python with the ogr module (see <a href="http://fwtools.maptools.org">FWTools</a> for an easy to install package for windows or linux).</p>
<p>First, I take a list of cities and feed it to batchgeocode.com (a very nice feature is that the yahoo geocoder, on which batchgeocode is based, does not <em>require</em> street level addresses):</p>
<blockquote>
<p>City|State
Santa Barbara|CA
Arcata|CA
New Milford|CT
Blacksburg|VA</p>
</blockquote>
<p>After running the geocoder, I get back a table with lat/longs:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>City|State|lat|long|precision
Santa Barbara|CA|34.419769|-119.696747|city
Arcata|CA|40.866261|-124.081673|city
New Milford|CT|41.576599|-73.408821|city
Blacksburg|VA|37.229359|-80.413963|city
</code></pre></div>
</blockquote>
<p>Copy and paste that into a text file and add a second header row that defines the data type for each column. It would be possible to autodetect the column types, but there are cases where a string of numeric digits should be kept as a string (for instance, the zipcode <em>06776</em> would become <em>6776</em> if it was read as an integer). The possible column types are <em>string, integer, real, x</em> and <em>y</em>, with x and y representing the coordinates.</p>
<blockquote>
<div class="highlight"><pre><span></span><code>City|State|lat|long|precision
string|string|y|x|string
Santa Barbara|CA|34.419769|-119.696747|city
Arcata|CA|40.866261|-124.081673|city
New Milford|CT|41.576599|-73.408821|city
Blacksburg|VA|37.229359|-80.413963|city
</code></pre></div>
</blockquote>
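<p>That second type row makes the parsing logic trivial. A hypothetical sketch of how such a file can be read into typed records (illustrating the idea behind txt2shp.py, not its actual code):</p>

```python
def parse_typed(text, delimiter="|"):
    """Parse delimited text whose second header row declares column
    types (string/integer/real/x/y). Columns typed `string` are left
    untouched, so zip codes like 06776 keep their leading zero."""
    lines = text.strip().splitlines()
    names = lines[0].split(delimiter)
    types = lines[1].split(delimiter)
    cast = {"integer": int, "real": float, "x": float, "y": float}
    rows = []
    for line in lines[2:]:
        values = line.split(delimiter)
        rows.append({name: cast.get(kind, str)(value)
                     for name, kind, value in zip(names, types, values)})
    return rows

sample = """City|State|lat|long|precision
string|string|y|x|string
Santa Barbara|CA|34.419769|-119.696747|city"""
rows = parse_typed(sample)
print(rows[0]["City"], rows[0]["lat"], rows[0]["long"])
```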
<p>Now run the <em>txt2shp.py</em> utility. The input and output parameters are self-explanatory and the d parameter defines the string used as a delimiter. Notice that the syntax follows the GRASS standard of <em>parameter=value</em>:</p>
<blockquote>
<div class="highlight"><pre><span></span><code>txt2shp.py input=cities.txt output=cities.shp d='|'
</code></pre></div>
</blockquote>
<p>And now you've got a shapefile of the geocoded cities! </p>
<p><img alt="Cities Shapefile" src="/assets/img/cities.png"></p>
<p>The txt2shp.py script can be downloaded <a href="http://perrygeo.net/download/txt2shp.py"> here</a>. Try it out and let me know how it's working for you.</p>
<p><strong>Update:</strong> In order to generate a .prj file for your output shapefile, you can use the epsg_tr.py utility if you know the EPSG code. Batch Geocoder returns everything in lat/long (presumably with a WGS84 datum?) so you can use EPSG code 4326:</p>
<blockquote>
<p>epsg_tr.py -wkt 4326 > cities.prj</p>
</blockquote>KML to Shapefile Scripting2005-12-11T00:00:00-07:002005-12-11T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2005-12-11:/kml-to-shapefile-scripting.html<p>Christian Spanring has been doing some great work with Google Earth's KML data format. The latest offering is a fairly robust <a href="http://spanring.name/blog/2005/12/11/kml2gml/">XSLT stylesheet for transforming KML into GML</a>. </p>
<p>In the article, he mentions ogr2ogr as a method to convert GML to shapefiles so I immediately had to try it out …</p><p>Christian Spanring has been doing some great work with Google Earth's KML data format. The latest offering is a fairly robust <a href="http://spanring.name/blog/2005/12/11/kml2gml/">XSLT stylesheet for transforming KML into GML</a>. </p>
<p>In the article, he mentions ogr2ogr as a method to convert GML to shapefiles so I immediately had to try it out! I came up with a simple bash script, <strong>kml2shp.sh</strong>, that provides a quick command-line interface:</p>
<blockquote>
<p>kml2shp.sh input.kml output.shp</p>
</blockquote>
<p>Here's the step-by-step:</p>
<ol>
<li>
<p>Make sure you have xsltproc (the command-line xslt processor) and OGR installed.</p>
</li>
<li>
<p>Copy the <a href="http://spanring.name/blog/wp-content/files/kml2gml.xsl">xslt stylesheet </a> to /usr/local/share/kml2gml/</p>
</li>
<li>
<p>Create the kml2shp.sh script below (make sure to change the paths to reflect your system, chmod +x it, etc)</p>
</li>
</ol>
<div class="highlight"><pre><span></span><code>#!/bin/bash
if [ $# -ne 2 ]; then
    echo "usage: kml2shp.sh input.kml output.shp"
    exit
fi

echo "Processing KML file"
sed 's/ xmlns=\"http\:\/\/earth.google.com\/kml\/2.0\"//' $1 > /tmp/temp.kml
xsltproc -o /tmp/temp.gml /usr/local/share/kml2gml/kml2gml.xsl /tmp/temp.kml

echo "Creating new Shapefile"
ogr2ogr $2 /tmp/temp.gml myFeature

echo "Cleaning up temp files"
rm /tmp/temp.gml
rm /tmp/temp.kml

echo "New shapefile has been created:"
echo $2
</code></pre></div>
<p>Now as far as I can tell, the XSLT is fairly robust, although I've only tested it on a few datasets. The wrapper script, however, could use a lot of work. Type and error checking would be nice for starters, and a better method of removing the XML namespace might be necessary. This is really meant as a starting point.</p>
<p>One potential problem with this technique is that you will most likely get a 3D shapefile (x, y AND z coordinates). Many applications can handle 3D shapefiles but some (QGIS, others?) cannot at the present time. Once the geometry type is known, one could always specify the ogr2ogr "-nlt" parameter to force 2D output. But that's all for now... let me know if anyone has any suggestions on improving this technique.</p>Tissot Indicatrix - Examining the distortion of map projections2005-12-11T00:00:00-07:002005-12-11T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2005-12-11:/tissot-indicatrix-examining-the-distortion-of-map-projections.html<p>The Tissot Indicatrix is a valuable tool for showing the distortions caused by map projections. It is essentially a series of imaginary polygons that represent perfect circles of equal area on a 3D globe. When projected onto a 2D map, their shape, size and/or angles will be distorted accordingly …</p><p>The Tissot Indicatrix is a valuable tool for showing the distortions caused by map projections. It is essentially a series of imaginary polygons that represent perfect circles of equal area on a 3D globe. When projected onto a 2D map, their shape, size and/or angles will be distorted accordingly allowing you to quickly assess the projection's accuracy for a given part of the globe. </p>
<p>I've seen great Tissot diagrams in textbooks, but I wanted to create the indicatrix as a polygon dataset so that I could project and overlay it with other data in a GIS. To do this I wrote a Python script using the OGR libraries, which I will revisit in a minute. But first the visually interesting part:</p>
<p>Here is a world countries shapefile overlaid with the Tissot circles in geographic (unprojected lat-long) coordinates:</p>
<p><img alt="Latlong tissot" src="/assets/img/latlong.png"></p>
<p>Next I reprojected the datasets to the Mercator projection using ogr2ogr:</p>
<div class="highlight"><pre><span></span><code>ogr2ogr -t_srs "+proj=merc" countries_merc.shp countries_simpl.shp countries_simpl
ogr2ogr -t_srs "+proj=merc" tissot_merc.shp tissot.shp tissot
</code></pre></div>
<p>Note that the angles are perfectly preserved (the trademark feature of the Mercator projection) but the size is badly distorted.</p>
<p><img alt="Mercator tissot" src="/assets/img/mercator.png"></p>
<p>Now lets try Lambert Azimuthal Equal Area (in this case the US National Atlas standard projection - EPSG code 2163). </p>
<div class="highlight"><pre><span></span><code>ogr2ogr -t_srs "epsg:2163" countries_lambert.shp countries_simpl.shp countries_simpl
ogr2ogr -t_srs "epsg:2163" tissot_lambert.shp tissot.shp tissot
</code></pre></div>
<p>This is a great projection for preserving area, but outside the center, shapes become badly distorted:</p>
<p><img alt="LAEA tissot" src="/assets/img/lambert.png"></p>
<p>The best way to experiment with this is to bring the tissot.shp file into ArcMap (or another program that supports on-the-fly projection) and play with it in real time. The distortions of every projection just leap off the screen...</p>
<p>OK, now for the geeky part. Here's the python/OGR script used to create the tissot shapefile. The basic process is to lay out a grid of points across the globe in latlong, loop through the points and reproject each one to an orthographic projection centered directly on the point, buffer it, then reproject to latlong. The end result is a latlong shapefile representing circles of equal area on a globe.</p>
<div class="highlight"><pre><span></span><code><span class="c1">#!/usr/bin/env python</span>
<span class="c1"># Tissot Circles</span>
<span class="c1"># Represent perfect circles of equal area on a globe</span>
<span class="c1"># but will appear distorted in ANY 2d projection.</span>
<span class="c1"># Used to show the size, shape and directional distortion</span>
<span class="c1"># by Matthew T. Perry</span>
<span class="c1"># 12/10/2005</span>
<span class="kn">import</span> <span class="nn">ogr</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">osr</span>
<span class="n">output</span> <span class="o">=</span> <span class="s1">'tissot.shp'</span>
<span class="n">debug</span> <span class="o">=</span> <span class="kc">False</span>
<span class="c1"># Create the Shapefile</span>
<span class="n">driver</span> <span class="o">=</span> <span class="n">ogr</span><span class="o">.</span><span class="n">GetDriverByName</span><span class="p">(</span><span class="s1">'ESRI Shapefile'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">output</span><span class="p">):</span>
    <span class="n">driver</span><span class="o">.</span><span class="n">DeleteDataSource</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">driver</span><span class="o">.</span><span class="n">CreateDataSource</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
<span class="n">layer</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">CreateLayer</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">geom_type</span><span class="o">=</span><span class="n">ogr</span><span class="o">.</span><span class="n">wkbPolygon</span><span class="p">)</span>
<span class="c1"># Set up spatial reference systems</span>
<span class="n">latlong</span> <span class="o">=</span> <span class="n">osr</span><span class="o">.</span><span class="n">SpatialReference</span><span class="p">()</span>
<span class="n">ortho</span> <span class="o">=</span> <span class="n">osr</span><span class="o">.</span><span class="n">SpatialReference</span><span class="p">()</span>
<span class="n">latlong</span><span class="o">.</span><span class="n">ImportFromProj4</span><span class="p">(</span><span class="s1">'+proj=latlong'</span><span class="p">)</span>
<span class="c1"># For each grid point, reproject to ortho centered on itself,</span>
<span class="c1"># buffer by 640,000 meters, reproject back to latlong,</span>
<span class="c1"># and output the latlong ellipse to shapefile</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="o">-</span><span class="mi">165</span><span class="p">,</span><span class="mi">180</span><span class="p">,</span><span class="mi">30</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span> <span class="p">(</span><span class="o">-</span><span class="mi">60</span><span class="p">,</span><span class="mi">90</span><span class="p">,</span><span class="mi">30</span><span class="p">):</span>
        <span class="n">f</span><span class="o">=</span> <span class="n">ogr</span><span class="o">.</span><span class="n">Feature</span><span class="p">(</span><span class="n">feature_def</span><span class="o">=</span><span class="n">layer</span><span class="o">.</span><span class="n">GetLayerDefn</span><span class="p">())</span>
        <span class="n">wkt</span> <span class="o">=</span> <span class="s1">'POINT(</span><span class="si">%f</span><span class="s1"> </span><span class="si">%f</span><span class="s1">)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">ogr</span><span class="o">.</span><span class="n">CreateGeometryFromWkt</span><span class="p">(</span><span class="n">wkt</span><span class="p">)</span>
        <span class="n">p</span><span class="o">.</span><span class="n">AssignSpatialReference</span><span class="p">(</span><span class="n">latlong</span><span class="p">)</span>
        <span class="n">proj</span> <span class="o">=</span> <span class="s1">'+proj=ortho +lon_0=</span><span class="si">%f</span><span class="s1"> +lat_0=</span><span class="si">%f</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
        <span class="n">ortho</span><span class="o">.</span><span class="n">ImportFromProj4</span><span class="p">(</span><span class="n">proj</span><span class="p">)</span>
        <span class="n">p</span><span class="o">.</span><span class="n">TransformTo</span><span class="p">(</span><span class="n">ortho</span><span class="p">)</span>
        <span class="n">b</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">Buffer</span><span class="p">(</span><span class="mi">640000</span><span class="p">)</span>
        <span class="n">b</span><span class="o">.</span><span class="n">AssignSpatialReference</span><span class="p">(</span><span class="n">ortho</span><span class="p">)</span>
        <span class="n">b</span><span class="o">.</span><span class="n">TransformTo</span><span class="p">(</span><span class="n">latlong</span><span class="p">)</span>
        <span class="n">f</span><span class="o">.</span><span class="n">SetGeometryDirectly</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
        <span class="n">layer</span><span class="o">.</span><span class="n">CreateFeature</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
        <span class="n">f</span><span class="o">.</span><span class="n">Destroy</span><span class="p">()</span>
<span class="n">ds</span><span class="o">.</span><span class="n">Destroy</span><span class="p">()</span>
</code></pre></div>Processing S57 soundings2005-12-03T00:00:00-07:002005-12-03T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2005-12-03:/processing-s57-soundings.html<p>NOAA Electronic Navigational Charts (ENC) contain (among many other things) depth soundings that can be processed into raster bathymetry grids. The ENC files are available as a huge torrent from geotorrent.org (<a href="http://geotorrent.org/details.php?id=58">http://geotorrent.org/details.php?id=58</a>). </p>
<p>Download this torrent and check readme.txt to find the chart …</p><p>NOAA Electronic Navigational Charts (ENC) contain (among many other things) depth soundings that can be processed into raster bathymetry grids. The ENC files are available as a huge torrent from geotorrent.org (<a href="http://geotorrent.org/details.php?id=58">http://geotorrent.org/details.php?id=58</a>). </p>
<p>Download this torrent and check readme.txt to find the chart of interest:</p>
<blockquote>
<p>Port Hueneme to Santa Barbara|5|2005-10-03|2005-10-03|US5CA65M</p>
</blockquote>
<p>First check out the gdal documentation for s57 files at <a href="http://www.gdal.org/ogr/drv_s57.html">http://www.gdal.org/ogr/drv_s57.html</a>. </p>
<p>Change to the US5CA65M directory and you'll see a .000 file (and maybe .001, .002 etc). Run ogrinfo on the .000 file and you'll see ~ 61 layers, one of which ("SOUNDG") represents the soundings. Let's start by examining the soundings layer:</p>
<div class="highlight"><pre><span></span><code>ogrinfo -summary US5CA65M.000 SOUNDG
</code></pre></div>
<p>We see that there are 43 "features", but since the features are multipoints, there are actually thousands of soundings. The multipoints are 3D, so if we convert to a shapefile with ogr2ogr's default settings we lose the third dimension. To solve this, we need to append "25D" to the layer type. Furthermore, the multipoint geometry confuses some applications, so we want to split it into a layer of simple 3D point geometries. Luckily there is a SPLIT_MULTIPOINT option that must be specified as an environment variable:</p>
<div class="highlight"><pre><span></span><code>export OGR_S57_OPTIONS="RETURN_PRIMITIVES=ON,RETURN_LINKAGES=ON,LNAM_REFS=ON,SPLIT_MULTIPOINT=ON,ADD_SOUNDG_DEPTH=ON"
ogr2ogr -nlt POINT25D test3.shp US5CA65M.000 SOUNDG
</code></pre></div>
<p>Now we get ~3,000 3D points, with the depth added as an attribute for good measure.</p>
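<p>Conceptually, the SPLIT_MULTIPOINT and ADD_SOUNDG_DEPTH options explode each 3D multipoint feature into simple 3D points and copy the z value into a regular attribute. A rough sketch of that transformation in plain Python (the coordinates and depths below are made up for illustration):</p>

```python
# Conceptual sketch of what OGR's SPLIT_MULTIPOINT=ON plus ADD_SOUNDG_DEPTH=ON
# accomplish: each 3D multipoint is exploded into simple 3D points, and the
# z value (the depth) is duplicated into an ordinary attribute field.
def split_multipoints(features):
    """features: list of multipoints, each a list of (x, y, z) tuples."""
    points = []
    for multipoint in features:
        for x, y, z in multipoint:
            # one simple point geometry per vertex, plus a DEPTH attribute
            points.append({"x": x, "y": y, "z": z, "DEPTH": z})
    return points

soundings = split_multipoints([
    [(-119.2, 34.1, 10.5), (-119.3, 34.2, 12.0)],  # one multipoint, 2 vertices
    [(-119.4, 34.3, 8.7)],                         # one multipoint, 1 vertex
])
print(len(soundings))  # 3 simple points from 2 multipoint features
```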
<p>Next, bring these into GRASS and create a raster:</p>
<div class="highlight"><pre><span></span><code>v.in.ogr -zo dsn=test3.shp output=soundg layer=test3
v.info soundg
g.region vect=soundg nsres=0.001 ewres=0.001
v.surf.rst input=soundg elev=bathy layer=0
r.info bathy
</code></pre></div>
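<p>v.surf.rst interpolates the scattered soundings onto a grid using regularized splines with tension; that algorithm is too involved to reproduce here, but the general idea of scattered-points-to-raster interpolation can be illustrated with a much simpler, hypothetical inverse-distance-weighting stand-in (this is <em>not</em> what GRASS actually uses):</p>

```python
# Hypothetical inverse-distance-weighting (IDW) sketch, only to illustrate
# the idea of interpolating a grid cell from scattered sample points.
# NOTE: v.surf.rst actually uses regularized splines with tension, not IDW.
def idw(points, x, y, power=2):
    """Interpolate a value at (x, y) from (px, py, pz) sample points."""
    num = den = 0.0
    for px, py, pz in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        if d2 == 0:
            return pz  # exactly on a sample point: use its value
        w = 1.0 / d2 ** (power / 2)
        num += w * pz
        den += w
    return num / den

samples = [(0.0, 0.0, 10.0), (1.0, 0.0, 20.0)]
print(idw(samples, 0.5, 0.0))  # midpoint gets equal weights -> 15.0
```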
<p>Since depths show up as positive elevations, we want to multiply the grid by -1:</p>
<div class="highlight"><pre><span></span><code>r.mapcalc sb_bathy=bathy*-1
</code></pre></div>
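<p>The mapcalc expression is just a cell-by-cell negation, turning positive depths into negative elevations. In Python terms (grid values made up):</p>

```python
# r.mapcalc sb_bathy=bathy*-1 negates every cell of the raster so that
# positive depths become negative elevations. Toy 2x2 grid for illustration.
bathy = [
    [10.5, 12.0],
    [8.7, 9.3],
]
sb_bathy = [[-cell for cell in row] for row in bathy]
print(sb_bathy)  # [[-10.5, -12.0], [-8.7, -9.3]]
```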
<p>And of course we want to make some nice shaded relief and contour maps for viewing with QGIS:</p>
<div class="highlight"><pre><span></span><code>r.shaded.relief map=sb_bathy shadedmap=sb_shade altitude=45 azimuth=315
r.contour input=sb_bathy output=sb_contour step=5
qgis &
</code></pre></div>
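<p>Under the hood, r.shaded.relief derives an illumination value for each cell from the sun's altitude and azimuth together with the local slope and aspect. A minimal sketch of the standard hillshade formula, which the GRASS implementation may differ from in details and scaling:</p>

```python
import math

# Standard hillshade formula: illumination of a surface element given sun
# altitude/azimuth and local slope/aspect (all in degrees). This is a
# conceptual sketch; r.shaded.relief's exact implementation may differ.
def hillshade(slope, aspect, altitude=45.0, azimuth=315.0):
    zenith = math.radians(90.0 - altitude)
    slope_r = math.radians(slope)
    shade = (math.cos(zenith) * math.cos(slope_r)
             + math.sin(zenith) * math.sin(slope_r)
             * math.cos(math.radians(azimuth) - math.radians(aspect)))
    return max(0.0, shade)  # clamp cells facing fully away from the sun

# Flat terrain: brightness depends only on sun altitude (cos 45 deg)
print(round(hillshade(slope=0.0, aspect=0.0), 3))  # 0.707
```

Note that a slope facing the sun (aspect near the azimuth) comes out brighter than the same slope facing away, which is what produces the relief effect.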
<p><img alt="s57 results" src="/assets/img/s57.png"></p>
<p>From the screenshot, we can see pits and spikes from potential outliers, so we might want to go back and adjust the tension and smoothing parameters on the raster interpolation (the v.surf.rst command).</p>The new blog2005-12-03T00:00:00-07:002005-12-03T00:00:00-07:00Matthew T. Perrytag:www.perrygeo.com,2005-12-03:/the-new-blog.html<p>Well, I finally got around to installing some real blogging software. SimplePHP Blog was just not cutting it, and WordPress looks like a healthy option. So far I've been really impressed! Let me know if you have any troubles accessing it...</p>