Vector Data and FATs

Problem Description

We need to ingest ESRI Shapefiles into Cate Desktop so that geometries can be displayed and users can interact with them on the 3D globe.

While the Cate WebAPI can easily read and process Shapefiles through GeoPandas and/or Fiona, the challenge is to efficiently stream even very large Shapefiles into the display components of Cate Desktop. Note that, for example, the Glaciers CCI products contain ~80 MB of binary geometry coordinates.

Note that GeoPandas internally uses Fiona, which in turn uses GDAL for reading from and writing to geo-data files. The vector data formats supported by both are therefore the ones supported by GDAL, e.g. GeoJSON (.geojson), Shapefile (.shp), or zipped Shapefile directory (*.zip).

The following facts have a major impact on the streaming and display performance and need to be addressed either in the back-end or front-end:

geometry data must be converted from binary Shapefile format into memory representation used by some Python library (e.g. geopandas, shapely), from there to some textual representation which can be interpreted by JavaScript libraries (e.g. GeoJSON, GML, KML, CZML), and finally into memory representation used by some JavaScript library (e.g. Cesium, OpenLayers, D3).
geometry data must be transformed from its source CRS to some target CRS used in the display. For example, source coordinates may be in UTM, but GeoJSON only supports EPSG-4326, and the display may be set to Polar Stereographic.
geometry data should be loaded only for the visible portion of the Earth and only for an adequate level of detail, i.e. hide all details that are not perceivable at the display's current zoom level.
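
The third point, loading only what is visible, can be illustrated with a minimal sketch. It assumes features are plain GeoJSON-like dicts and that the visible area is given as a (min_x, min_y, max_x, max_y) box in the same CRS as the coordinates. The function names are hypothetical and not part of Cate, and views crossing the antimeridian are not handled.

```python
from typing import Iterable, Iterator, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def iter_coords(coords) -> Iterator[Tuple[float, float]]:
    """Yield (x, y) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_coords(part)


def geometry_bbox(geometry: dict) -> BBox:
    """Bounding box of a non-empty GeoJSON geometry."""
    xs, ys = zip(*iter_coords(geometry["coordinates"]))
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def visible_features(features: Iterable[dict], view: BBox) -> Iterator[dict]:
    """Lazily yield only the features whose bounding box intersects the view."""
    for feature in features:
        if bboxes_intersect(geometry_bbox(feature["geometry"]), view):
            yield feature
```

Because `visible_features` is a generator, it can sit in front of any later conversion step without materializing the whole collection.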

Current solution

Note: This is implemented only in branch 477-nf-support_glaciers_cci in cate and cate-desktop.

Python back-end:

In the Python back-end, a geo-data resource is represented either

by the geopandas.GeoDataFrame returned by the read_geo_data_frame operation, where the variables are the columns of the FAT (the geometry column being a gpd.GeoSeries), or
by the fiona.Collection returned by the read_geo_data_collection operation, where each variable is represented by a property within each collection record; the records are GeoJSON Feature objects.

Data Model for Geo-Data / FATs

There are many pros and cons for representing a geo-data or FAT resource as either a geopandas.GeoDataFrame or a fiona.Collection.

Pros and cons of using a gpd.GeoDataFrame as geo-data or FAT data model:

(+) provides many ready-to-use functions, including all of those from Pandas, so operations such as fat_min, fat_max, and fat_query can be easily implemented in Cate. GeoPandas geometry objects are actually Shapely geometry objects, which provide numerous geometric functions that can be used to implement Cate operations.
(+) objects don't change state while reading records.
(+) objects are compatible with a pandas.DataFrame thus can share operations with it. (pandas.DataFrame can be read from CSV/Excel files and are useful for e.g. in-situ data.)
(-) objects hold all FAT data in-memory which makes it hard to deal with large geo-data resources.
(-) objects must be converted to GeoJSON to be used as WebAPI response and for display in Cate Desktop. This is slow, because Fiona has already converted the geo-data file to GeoJSON, GeoPandas has then converted it into Pandas DataFrames plus Shapely geometry objects, and these must then be transformed back to GeoJSON. (This is not only slow, but also stupid.)

Pros and cons of using a fiona.Collection as geo-data or FAT data model:

(-) Fiona uses a plain Python, GeoJSON representation for geo-data and FATs. Fiona geometry objects are therefore GeoJSON geometry objects with no functionality. There are no ready-to-use functions, so geometric operations and operations such as fat_min, fat_max, fat_query must be implemented in Cate either
- from scratch (fast), or
- by internally converting to geopandas.GeoDataFrame (slow).
(-) objects change state while reading records, see here
(-) objects are not compatible with a pandas.DataFrame thus cannot share operations with it. (pandas.DataFrame can be read from CSV/Excel files and are useful for e.g. in-situ data.)
(+) objects don't hold all FAT data in-memory, instead records are created while read which allows for fast streaming of large geo-data resources.
(+) objects can be directly used as GeoJSON WebAPI response and for display in Cate Desktop. This is way faster (several 10x) than using a geopandas.GeoDataFrame.
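
To illustrate the record-oriented model, here is a minimal sketch of a from-scratch fat_min/fat_max and of writing a GeoJSON response piece by piece. It assumes each record is a plain GeoJSON-like dict with a "properties" mapping, as described above. The function names `fat_min_max` and `iter_geojson_chunks` are hypothetical and not Cate's actual operations.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def fat_min_max(records: Iterable[dict],
                var_name: str) -> Tuple[Optional[float], Optional[float]]:
    """Compute min and max of one variable in a single pass over the records.

    Records whose value is missing or None are skipped.
    """
    lo = hi = None
    for record in records:
        value = record["properties"].get(var_name)
        if value is None:
            continue
        lo = value if lo is None else min(lo, value)
        hi = value if hi is None else max(hi, value)
    return lo, hi


def iter_geojson_chunks(records: Iterable[dict]) -> Iterator[str]:
    """Yield the text of a GeoJSON FeatureCollection one feature at a time."""
    yield '{"type": "FeatureCollection", "features": ['
    for index, record in enumerate(records):
        prefix = ", " if index else ""
        yield prefix + json.dumps(record)
    yield "]}"
```

Both functions consume an iterator exactly once and never hold more than one record in memory, which is the property that makes the fiona.Collection model attractive for large resources.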

Regarding point 5 (direct use as GeoJSON WebAPI response), we've implemented a RESTful method in Cate's WebAPI that streams GeoJSON from resources of type fiona.Collection at a given level of detail (/res/geojson/<resource_name>?level=<level>&...):

Simplification is performed on source CRS coordinates using our own implementation of Visvalingam’s algorithm, accelerated by numba JIT compilation. It still uses the pure Python heapq implementation of a min-heap, which makes it slow. We remove a ratio of p polygon points where p = 2 ** -(num_levels - (level + 1)) and where max_level=8 (hard-coded constant). If level=0 we turn any geometry into a single point computed from the averages of its longitudes and latitudes, respectively;
Transformation from the CRS used by the fiona.Collection object to EPSG-4326, as required by GeoJSON, is done by the proj4 package;
Streaming is done by a Tornado request handler implemented as an asynchronous coroutine (we still have issues here with concurrent invocations!).
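
The simplification step can be sketched as follows. This is a plain-Python illustration of Visvalingam's algorithm with a heapq min-heap, not Cate's numba-based implementation. The target point count `num_keep` is a hypothetical parameter, and how it would be derived from the level is deliberately left out. `mean_point` shows the level-0 case of averaging the coordinates.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Area of the triangle spanned by three points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def simplify_visvalingam(points: Sequence[Point], num_keep: int) -> List[Point]:
    """Remove the least significant interior points until num_keep remain.

    The first and last point are always kept.
    """
    n = len(points)
    num_keep = max(num_keep, 2)
    if n <= num_keep:
        return list(points)
    prev = list(range(-1, n - 1))  # index of the previous surviving point
    nxt = list(range(1, n + 1))    # index of the next surviving point
    area = [0.0] * n
    removed = [False] * n
    heap = []
    for i in range(1, n - 1):
        area[i] = triangle_area(points[i - 1], points[i], points[i + 1])
        heap.append((area[i], i))
    heapq.heapify(heap)
    remaining = n
    while remaining > num_keep and heap:
        a, i = heapq.heappop(heap)
        if removed[i] or a != area[i]:
            continue  # stale heap entry, a newer one exists or point is gone
        removed[i] = True
        remaining -= 1
        p, q = prev[i], nxt[i]
        nxt[p], prev[q] = q, p  # unlink point i from the neighbour chain
        for j in (p, q):
            if 0 < j < n - 1:
                # Recompute the neighbour's area; never let it fall below the
                # area of the point just removed.
                new_area = triangle_area(points[prev[j]], points[j], points[nxt[j]])
                area[j] = max(a, new_area)
                heapq.heappush(heap, (area[j], j))
    return [pt for pt, gone in zip(points, removed) if not gone]


def mean_point(points: Sequence[Point]) -> Point:
    """Average of all x and all y coordinates (the level-0 representation)."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)
```

Stale heap entries are skipped on pop instead of being deleted in place, which keeps the sketch short at the cost of a somewhat larger heap.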

An urgent problem is how to decide when to convert to points and when it is OK to stay with the original geometries. Ideally, we would cluster the geometries and symbolize them at low levels of detail until we reach the highest level of detail, where we display the full geometries. An additional option would be to offer an extra clickable symbol (button) at the highest level of detail that allows expanding into the original geometry and collapsing back into a symbol by clicking it.

JavaScript front-end:

Plan: Invoke that REST method for a given resource-variable pair if the resource's type is fiona.Collection or gpd.GeoDataFrame. Within a display, any polygons are then filled by the selected variable's value between a display min/max range using a given color bar (similar to raster data).
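
The planned fill-color mapping amounts to a clamped linear interpolation along a color bar. The following is a minimal sketch of that arithmetic, shown in Python. The function name and the representation of a color bar as a list of RGB tuples are assumptions for illustration, not Cate's actual color map handling.

```python
from typing import Sequence, Tuple

Color = Tuple[int, int, int]


def value_to_color(value: float, display_min: float, display_max: float,
                   color_bar: Sequence[Color]) -> Color:
    """Map a value in [display_min, display_max] onto a color bar.

    color_bar must contain at least two RGB colors; values outside the
    display range are clamped to the first or last color.
    """
    if display_max <= display_min:
        return color_bar[0]
    t = (value - display_min) / (display_max - display_min)
    t = min(1.0, max(0.0, t))              # clamp out-of-range values
    pos = t * (len(color_bar) - 1)         # fractional position on the bar
    i = min(int(pos), len(color_bar) - 2)  # index of the left color stop
    f = pos - i                            # blend factor towards the right stop
    c0, c1 = color_bar[i], color_bar[i + 1]
    return tuple(round(a + (b - a) * f) for a, b in zip(c0, c1))
```

For example, with a blue-white-red bar `[(0, 0, 255), (255, 255, 255), (255, 0, 0)]` and a display range of 0 to 10, the value 5 maps to white and any value above 10 maps to red.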

State: We stream GeoJSON into the 3D globe and the 2D map using custom data sources for the respective Cesium and OpenLayers APIs and display polygons at a constant simplification level using the default style settings (no styling by color mapping implemented yet). Loading of the GeoJSON stream is done in a separate Web Worker.

The current (not really award-winning) solution for displaying large Shapefiles (or any large geo-data sources) is to shrink them beforehand using some external tool, e.g. the GDAL command-line tool ogr2ogr:

$ ogr2ogr output.shp input.shp -simplify 0.0001

There is still a lot of work to be done to let users efficiently work with vector layers in cate-desktop:

Implement loading of GeoJSON data only for visible area and the required level of detail (this may be the hardest part!)
Cancel a streaming process if no longer required (e.g. view closed, selected variable changed).
Implement a mapping from the selected variable used for the current vector layer to some geometry style. Provide GUI for the mapping and the style settings (e.g. details section of LAYERS panel if vector layer is selected)
- For polygon data there must be a mapping from the values of the selected variable to polygon fill colors
- For point data there must be a mapping from the values of the selected variable to symbols of varying shape, icon, size, color.
Implement the default style settings for geometry of a selected vector layer if no variables exist or no variable is selected. Provide GUI for the default style settings (e.g. details section of LAYERS panel if vector layer is selected and/or user preferences dialog)
- For polygon data set the default stroke and fill
- For line data set the default stroke
- For point data set the default symbol
- Define how a selected geometry shall appear
We must allow users to interact with geometry, e.g. select a point or polygon, and then to use a selected geometry as an input for operations that accept geometry objects (subset_spatial, tseries_point, etc).

For a complete list refer to issue #477.

Performance Results

Data: glaciers_cci_gi_rgi05_TM-ETM_1994-2009_v150505/cci_gi_greenland_2000.shp (20280 polygons + FAT)

Here are some performance numbers for the Python WebAPI (cate/webapi/rest.py) for full polygons:

~7s for reading from the Shapefile with Fiona, including coordinate transformation from AEA to WGS84
~20s additionally for conversion of features to GeoJSON text
~25s additionally for actual writing to the WebAPI output stream when transferring to Cate Desktop
~45s additionally for adding the features as polygon entities to the Cesium 3D globe. Actual drawing in Cesium takes much longer (1-2 minutes)

If we simplify to center points, the 45s drop to 6s! Actual drawing in Cesium is then done within a few seconds.

Remaining open issues

How to best implement a filter, e.g. filter_geo_data_collection, that will return a new Fiona collection? Fiona does not seem to support creating new in-memory collections.
Each collection must have its own layer and style. How to style the geometries for the display?
Features of a collection must be selectable, so that we can use their geometry as input to other operations --> Features may be harmonized with our current implementation of Placemarks and the current Countries layer.
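
One conceivable direction for the filter question is to return a lazy, re-iterable view instead of a new collection. The sketch below is only an illustration under that assumption. `FilteredRecords` is a hypothetical class, not an existing Fiona or Cate type, and it works on any source of GeoJSON-like dict records.

```python
from typing import Callable, Iterable, Iterator


class FilteredRecords:
    """Lazy, re-iterable view of the records that satisfy a predicate.

    open_source is called anew for every iteration, so the view does not
    depend on the read position of a previously used iterator.
    """

    def __init__(self, open_source: Callable[[], Iterable[dict]],
                 predicate: Callable[[dict], bool]):
        self._open_source = open_source
        self._predicate = predicate

    def __iter__(self) -> Iterator[dict]:
        return (r for r in self._open_source() if self._predicate(r))
```

Such a view holds no records itself, so it keeps the streaming property of the underlying collection, but it does not offer the metadata (schema, CRS) of a real collection.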

The GeoJSON DataSource is currently downloaded for a selected variable:

export function getGeoJSONUrl(baseUrl: string, baseDir: string, layer: VariableVectorLayerState): string {
    return baseUrl + `ws/res/geojson/${encodeURIComponent(baseDir)}/${encodeURIComponent(layer.resName)}?`
        + `level=8`
        + `&var=${encodeURIComponent(layer.varName)}`
        + `&index=${encodeURIComponent((layer.varIndex || []).join())}`
        + `&cmap=${encodeURIComponent(layer.colorMapName)}`
        + `&min=${encodeURIComponent(layer.displayMin + '')}`
        + `&max=${encodeURIComponent(layer.displayMax + '')}`;
}

This means that if we change the selected variable, Cate Desktop will request a new GeoJSON DataSource! This is very inefficient, and we should get rid of the variable dependency. Color coding w.r.t. a selected variable shall be done in Cate Desktop. We also need to share a DataSource across multiple instances of the Cesium 3D globe; otherwise a second globe will create, download, and keep in memory a second DataSource.
