
Performance & data compactness

Joost Farla edited this page Feb 18, 2016 · 15 revisions


Complex geo features provided by base registries (like multipolygons) are often very accurate. Especially when querying feature collections, this results in very large response payloads that eat up bandwidth and cause slow response times. For web applications, such payload sizes are unacceptable and will make applications slow or unresponsive.

There are several solutions to overcome this issue:

Compression

Compression is an easy way to reduce the response payload size. While this speeds up the transfer of the data from the server to the client, the data still needs to be decompressed and loaded into memory within the browser or app.

HTTP compression

HTTP compression is a capability that can be built into web servers and web clients to improve transfer speed and bandwidth utilization. The most common compression schemes include Gzip and Deflate.

Gzip is a (de-)compression utility using Lempel-Ziv coding (LZ77) and is broadly used by webservers like Apache and Nginx to compress response payloads without the risk of data loss.

In order to receive a gzip-encoded response, the client must send an Accept-Encoding request header. Here is an example of a properly formed HTTP header for enabling Gzip compression:

Accept-Encoding: gzip
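To give a rough feel for the gains, the following sketch compresses a repetitive GeoJSON-like payload with Python's standard gzip module. The payload is made up for illustration; real feature collections, with their repeated property keys and long coordinate arrays, typically compress very well too.

```python
import gzip

# A hypothetical, repetitive GeoJSON-like payload. Real feature
# collections also contain lots of repeated keys and coordinates,
# which is exactly what LZ77-based compression exploits.
payload = ('{"type": "Feature", "geometry": {"type": "Point", '
           '"coordinates": [5.1214, 52.0907]}}' * 100).encode("utf-8")

compressed = gzip.compress(payload)
print(len(payload), len(compressed))

# Compression is lossless: decompressing restores the exact bytes.
assert gzip.decompress(compressed) == payload
```

On a payload like this the compressed size is a small fraction of the original, which is why web servers like Apache and Nginx enable it almost for free.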

TopoJSON

TopoJSON is an extension of GeoJSON that encodes topology. Rather than representing geometries discretely, geometries in the TopoJSON format are stitched together from shared line segments called arcs. Arcs are sequences of points, while line strings and polygons are defined as sequences of arcs. Each arc is defined only once, but can be referenced several times by different shapes, thus reducing redundancy and decreasing the payload size. This way of storing spatial data is most efficient on polygons that share a lot of borders with their neighbours, for example city boundaries.

Besides GeoJSON, TopoJSON supports other source formats like ESRI shapefiles and CSV. It is also capable of simplifying geo features, a topic we will cover in the next section. Similar to HTTP compression, TopoJSON needs to be decoded client-side before the data can be used. Since TopoJSON performs compression at another level than HTTP, the two methods can be used in conjunction to return the smallest payload possible.
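The arc-sharing idea can be illustrated with a deliberately simplified, TopoJSON-inspired structure (not the full specification: quantization and delta-encoding are omitted). Two adjacent squares store their common border once, as arc 0, and each ring lists arc indices; as in real TopoJSON, a negative index -n - 1 means arc n traversed in reverse.

```python
# Simplified TopoJSON-style topology: two adjacent squares sharing
# one border. The shared border (arc 0) is stored only once.
topology = {
    "arcs": [
        [[2, 0], [2, 2]],                  # arc 0: the shared border
        [[2, 2], [0, 2], [0, 0], [2, 0]],  # arc 1: rest of the left square
        [[2, 0], [4, 0], [4, 2], [2, 2]],  # arc 2: rest of the right square
    ],
    "objects": {
        # Rings are lists of arc indices; -n - 1 means arc n reversed.
        "left":  {"type": "Polygon", "arcs": [[0, 1]]},
        "right": {"type": "Polygon", "arcs": [[2, -1]]},  # -1 = arc 0 reversed
    },
}

def ring_points(topology, arc_indices):
    """Expand arc references back into a closed point sequence."""
    points = []
    for i in arc_indices:
        arc = topology["arcs"][i] if i >= 0 else topology["arcs"][-i - 1][::-1]
        # Consecutive arcs share their joint point; drop the duplicate.
        points.extend(arc if not points else arc[1:])
    return points

print(ring_points(topology, topology["objects"]["left"]["arcs"][0]))
print(ring_points(topology, topology["objects"]["right"]["arcs"][0]))
```

Both polygons decode to closed rings, yet the border coordinates appear only once in the payload; over a whole country's municipal boundaries this redundancy saving adds up quickly.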

Reduce precision

Many use cases don't require the level of accuracy the source data offers. To simplify geometry to suit the displayed resolution, various line simplification algorithms exist.

Ramer–Douglas–Peucker

The Ramer–Douglas–Peucker algorithm (RDP) is an algorithm for reducing the number of points in a curve that is approximated by a series of points. It does so by "thinking" of a line between the first and last point in the set of points that form the curve. It then checks which in-between point is farthest away from this line. If that point (and consequently every other in-between point) is closer than a given distance epsilon, all the in-between points are removed. If, on the other hand, this outlier point is farther away from the imaginary line than epsilon, the curve is split in two at that point and the algorithm is applied recursively to both halves.

Visvalingam–Whyatt

While Douglas–Peucker is the most well-known, the Visvalingam–Whyatt algorithm, also used by TopoJSON, may be more effective and has a remarkably intuitive explanation: it progressively removes the points with the least-perceptible change. For example, the GeoJSON file used in the TopoJSON documentation to draw the continental United States can be reduced from 531KB to 27KB with only minor visual changes.
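The "least-perceptible change" criterion is concrete: each interior point forms a triangle with its two neighbours, and the point whose triangle has the smallest area is removed first. A minimal sketch (quadratic for clarity; production implementations use a priority queue, and the sample points are made up):

```python
def triangle_area(a, b, c):
    # Area of the triangle spanned by three points (cross-product formula).
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def visvalingam(points, n_keep):
    """Visvalingam-Whyatt: repeatedly remove the interior point whose
    triangle with its neighbours has the smallest area, i.e. the point
    whose removal changes the shape the least."""
    pts = list(points)
    while len(pts) > max(n_keep, 2):
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        smallest = areas.index(min(areas)) + 1
        del pts[smallest]
    return pts

line = [(0, 0), (1, 0.01), (2, 0), (3, 3), (4, 0)]
print(visvalingam(line, 4))  # the nearly-collinear point goes first
```

Note the contrast with RDP: Visvalingam–Whyatt ranks every point globally by visual importance, which makes it easy to simplify to an exact point budget rather than to a distance tolerance.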

Cluster

Lots of libraries make it very easy to draw markers (geo points) on a map. However, spatial datasets often get very large and the markers start to clutter the map. This image illustrates the problem:

[Image: cluttered markers]

Some libraries have clustering capabilities (example), but this is only a partial solution to the problem. The client (browser/app) still has to deal with large payloads and high memory/CPU consumption, which can result in slow or unresponsive behavior. Therefore, it would be better if the server handles the clustering and only serves the data necessary for a clustered map view.

The result we want looks like this:

[Image: clustered markers]

There are several ways to handle clustering server-side.
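Before looking at specific techniques, the common idea behind all of them can be sketched with the simplest possible server-side approach: snap points to a fixed grid and return one centroid plus a count per occupied cell, instead of every marker. This is an illustrative sketch, not tied to any particular server stack; the cell size and coordinates are assumptions.

```python
from collections import defaultdict

def grid_cluster(points, cell_size):
    """Grid-based server-side clustering sketch: bucket points by grid
    cell and return one centroid + count per occupied cell.
    cell_size is in the same units as the coordinates (here, degrees)."""
    cells = defaultdict(list)
    for lon, lat in points:
        cells[(int(lon // cell_size), int(lat // cell_size))].append((lon, lat))
    clusters = []
    for members in cells.values():
        n = len(members)
        clusters.append({
            "lon": sum(p[0] for p in members) / n,
            "lat": sum(p[1] for p in members) / n,
            "count": n,
        })
    return clusters

# Two points near Amsterdam, one near Utrecht (made-up sample data).
points = [(4.88, 52.37), (4.90, 52.36), (5.12, 52.09)]
print(grid_cluster(points, 0.5))
```

The payload now grows with the number of visible clusters, not the number of points; choosing the cell size per zoom level is what the geohash and quadtree approaches below formalize.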

Geohash

Geohash is a hierarchical spatial data structure which subdivides space into grid-shaped buckets. Geohashes offer arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision). Example:

[Image: geohash example]

Each bucket has a short unique identifier, which can be of great help indexing these in an efficient manner. A popular search engine that supports geohashing is Elasticsearch.
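The encoding itself is a straightforward bisection: longitude and latitude ranges are halved alternately, each choice emits one bit, and every five bits become one base-32 character. A minimal sketch:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash(lat, lon, precision=9):
    """Encode a lat/lon pair as a geohash by alternately bisecting the
    longitude and latitude ranges, 5 bits per base-32 character."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid  # keep the upper half
        else:
            rng[1] = mid  # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# Truncating the code simply yields the coarser parent cell.
print(geohash(57.64911, 10.40744, 11))
print(geohash(57.64911, 10.40744, 5))
```

Because a shorter code is always a prefix of the longer one, a server can aggregate points by geohash prefix and let the prefix length follow the map's zoom level.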

Quadtree

[Image: quadtree subdivision]

Aggregate

Distance

Geohash

Serve from (fast) index

Elasticsearch
