Vector Data and FATs

Problem Description

We need to ingest ESRI Shapefiles into Cate Desktop so that geometries can be displayed and users can interact with them on the 3D globe.

While the Cate WebAPI can easily read and process Shapefiles through geopandas and/or fiona, the challenge is to efficiently stream even very large Shapefiles into the display components of Cate Desktop. Note that, for example, the Glaciers CCI products contain ~80 MB of binary geometry coordinates.

The following facts have a major impact on the streaming and display performance and need to be addressed either in the back-end or front-end:

geometry data must be converted from the binary Shapefile format into the in-memory representation used by some Python library (e.g. geopandas, shapely), from there to some textual representation which can be interpreted by JavaScript libraries (e.g. GeoJSON, GML, KML, CZML), and finally into the in-memory representation used by some JavaScript library (e.g. Cesium, OpenLayers, D3).
geometry data must be transformed from its source CRS to some target CRS used in the display. For example, source coordinates may be in UTM, but GeoJSON only supports EPSG:4326, and the display may be set to Polar Stereographic.
geometry data should be loaded only for the visible portion of the Earth and only for an adequate level of detail, i.e. hide all details that are not perceivable at the display's current zoom level.
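
The "visible portion" requirement in the last point can be illustrated with a simple bounding-box test on GeoJSON-like feature dictionaries. This is only a minimal sketch: the function names are hypothetical and not part of Cate, it assumes that feature coordinates and the view rectangle share the same CRS, that geometries are non-empty, and it ignores views that cross the antimeridian.

```python
from typing import Iterable, Iterator, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)


def iter_coords(coords):
    """Yield (x, y) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_coords(part)


def feature_bbox(feature: dict) -> BBox:
    """Compute the bounding box of a single GeoJSON-like feature."""
    xs, ys = zip(*iter_coords(feature["geometry"]["coordinates"]))
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """Return True if the two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def visible_features(features: Iterable[dict], view: BBox) -> Iterator[dict]:
    """Yield only those features whose bounding box intersects the view."""
    for feature in features:
        if bboxes_intersect(feature_bbox(feature), view):
            yield feature
```

A real implementation would more likely rely on a spatial index than on a linear scan, but the scan shows which features a given view actually needs.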

Current solution

Note: This is implemented only in branch 477-nf-support_glaciers_cci in cate and cate-desktop.

Python back-end:

In the Python back-end, a geo-data resource is represented either

by a gpd.GeoDataFrame, in which case a variable is represented by a gpd.GeoSeries, or
by a fiona.Collection, in which case a variable is represented by a property within each collection record; the records are GeoJSON Feature objects.

We currently provide the read_geo_data_collection operation, which returns a fiona.Collection given a file path to a supported vector data format such as GeoJSON (.geojson), Shapefile (.shp), or a zipped Shapefile directory.

Important note: we prefer using Fiona over GeoPandas because reading Shapefiles by means of a fiona.Collection is much faster (by a factor of several tens) than using a gpd.GeoDataFrame. This means that Cate users should read Shapefiles using the read_geo_data_collection() operation rather than read_geo_data_frame() for maximum display performance.

In addition, we've implemented a RESTful method in Cate's WebAPI which allows streaming GeoJSON from resources of type fiona.Collection at a given level of detail (/res/geojson/<resource_name>?level=<level>&...):

Simplification is performed on source CRS coordinates using our own implementation of Visvalingam’s algorithm, accelerated by numba JIT compilation. However, it still uses the pure-Python heapq implementation of a min-heap, which makes it slow. We remove a ratio of p polygon points where p = 2 ** -(num_levels - (level + 1)) and where max_level=8 (hard-coded constant). If level=0, we turn any geometry into a single point computed from the averages of its longitudes and latitudes, respectively;
Transformation from the CRS used by the fiona.Collection object to EPSG:4326, as required by GeoJSON, is done with the proj4 package;
Streaming is done by a Tornado request handler implemented as an asynchronous "co-routine" (we still have issues here with concurrent invocations!).
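
The simplification step above can be sketched in pure Python as follows. This is not Cate's implementation: it is a minimal illustration of Visvalingam's algorithm on a min-heap, without numba, and the names simplify_line and num_keep are hypothetical. Instead of deriving the number of points from a level, the sketch takes the number of points to keep as an explicit parameter.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Area of the triangle spanned by three points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def simplify_line(points: Sequence[Point], num_keep: int) -> List[Point]:
    """Repeatedly drop the point with the smallest effective triangle area.

    The first and last point are always kept, so a closed polygon ring
    (first point == last point) stays closed.
    """
    n = len(points)
    num_keep = max(num_keep, 2)
    if n <= num_keep:
        return list(points)

    prev = list(range(-1, n - 1))  # index of the previous remaining point
    nxt = list(range(1, n + 1))    # index of the next remaining point
    removed = [False] * n
    version = [0] * n              # bumped whenever a point's area is recomputed

    heap = []
    for i in range(1, n - 1):
        area = triangle_area(points[i - 1], points[i], points[i + 1])
        heapq.heappush(heap, (area, i, 0))

    remaining = n
    while remaining > num_keep and heap:
        area, i, ver = heapq.heappop(heap)
        if removed[i] or ver != version[i]:
            continue  # stale heap entry, a newer one exists or the point is gone
        removed[i] = True
        remaining -= 1
        p, q = prev[i], nxt[i]
        nxt[p], prev[q] = q, p
        # Re-evaluate both neighbours, which now span different triangles.
        for j in (p, q):
            if 0 < j < n - 1:
                version[j] += 1
                new_area = triangle_area(points[prev[j]], points[j], points[nxt[j]])
                # Never let a neighbour look less significant than the removed point.
                heapq.heappush(heap, (max(new_area, area), j, version[j]))

    return [pt for i, pt in enumerate(points) if not removed[i]]
```

Stale heap entries are skipped rather than deleted, which avoids needing a heap with a decrease-key operation.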

An urgent problem is how to decide when to convert geometries to points and when it is OK to keep the original geometries. Ideally, we would cluster the geometries and symbolize them at low levels of detail, displaying the full geometries only once the highest level of detail is reached. An additional option would be to offer an extra clickable symbol (button) at the highest level of detail that allows expanding into the original geometry and collapsing back into a symbol by clicking it.
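
One possible way to cluster at low levels of detail is to bin representative points into a regular grid and show one symbol per occupied cell. The sketch below only illustrates that idea; it is not implemented in Cate, the function names are hypothetical, and the choice of cell_size per level is left open. The representative point is the average of x and y coordinates, as used for level=0.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]


def mean_point(coords: Iterable[Point]) -> Point:
    """Average of all x and all y coordinates, respectively."""
    xs, ys = zip(*coords)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def cluster_on_grid(points: Iterable[Point], cell_size: float) -> List[Tuple[Point, int]]:
    """Group points by grid cell and return (cluster centre, member count) pairs."""
    cells: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return [(mean_point(members), len(members)) for members in cells.values()]
```

A larger cell_size yields fewer, coarser symbols; the member count could drive the symbol size.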

JavaScript front-end:

Plan: Invoke that REST method for a given resource-variable pair if the resource's type is fiona.Collection or gpd.GeoDataFrame. Within a display, polygons are then filled according to the selected variable's value, which is mapped to a color within a display min/max range using a given color bar (similar to raster data).

State: We stream GeoJSON into the 3D globe and the 2D map using custom data sources for the respective Cesium and OpenLayers APIs and display polygons at a constant simplification level using the default style settings (no styling by color mapping implemented yet). Loading of the GeoJSON stream is done in a separate Web Worker.

The current (not really award-winning) solution for displaying large Shapefiles (or any large geo-data sources) is to shrink them beforehand using some external tool, e.g. the GDAL command-line tool ogr2ogr:

$ ogr2ogr output.shp input.shp -simplify 0.0001

There is still a lot of work to be done to let users efficiently work with vector layers in cate-desktop:

Implement loading of GeoJSON data only for visible area and the required level of detail (this may be the hardest part!)
Cancel a streaming process if no longer required (e.g. view closed, selected variable changed).
Implement a mapping from the selected variable used for the current vector layer to some geometry style. Provide GUI for the mapping and the style settings (e.g. details section of LAYERS panel if vector layer is selected)
- For polygon data there must be a mapping from the values of the selected variable to polygon fill colors
- For point data there must be a mapping from the values of the selected variable to symbols of varying shape, icon, size, color.
Implement the default style settings for geometry of a selected vector layer if no variables exist or no variable is selected. Provide GUI for the default style settings (e.g. details section of LAYERS panel if vector layer is selected and/or user preferences dialog)
- For polygon data set the default stroke and fill
- For line data set the default stroke
- For point data set the default symbol
- Define how a selected geometry shall appear
We must allow users to interact with geometry, e.g. select a point or polygon, and then use the selected geometry as an input for operations that accept geometry objects (subset_spatial, tseries_point, etc.).
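
The mapping from variable values to polygon fill colors mentioned in the list above could work like the following sketch, which linearly interpolates within a color bar between a display minimum and maximum. The function name and the representation of the color bar as a list of RGB stops are assumptions made for illustration; in Cate Desktop such logic would live in the front-end.

```python
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]


def value_to_color(value: float, display_min: float, display_max: float,
                   color_bar: Sequence[RGB]) -> RGB:
    """Map a value to a color by linear interpolation between color bar stops.

    Values outside [display_min, display_max] are clamped to the end colors.
    """
    if display_max <= display_min:
        raise ValueError("display_max must be greater than display_min")
    if len(color_bar) < 2:
        raise ValueError("color_bar needs at least two stops")
    t = (value - display_min) / (display_max - display_min)
    t = min(max(t, 0.0), 1.0)
    pos = t * (len(color_bar) - 1)          # fractional position along the stops
    i = min(int(pos), len(color_bar) - 2)   # index of the lower stop
    frac = pos - i
    lo, hi = color_bar[i], color_bar[i + 1]
    return tuple(round(a + (b - a) * frac) for a, b in zip(lo, hi))
```

With a two-stop bar from blue (0, 0, 255) to red (255, 0, 0) and a display range of 0 to 10, a value of 0 maps to blue, and any value of 10 or above maps to red.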

Remaining open issues

How can we best implement a filter, e.g. filter_geo_data_collection, that returns a new Fiona collection? Fiona does not seem to support creating new in-memory collections.
Each collection must have its own layer and style. How to style the geometries for the display?
Features of a collection must be selectable, so that we can use their geometry as input to other operations --> Features may be harmonized with our current implementation of Placemarks and the current Countries layer.

The GeoJSON DataSource is currently downloaded for a selected variable:

export function getGeoJSONUrl(baseUrl: string, baseDir: string, layer: VariableVectorLayerState): string {
    return baseUrl + `ws/res/geojson/${encodeURIComponent(baseDir)}/${encodeURIComponent(layer.resName)}?`
        + `level=8`
        + `&var=${encodeURIComponent(layer.varName)}`
        + `&index=${encodeURIComponent((layer.varIndex || []).join())}`
        + `&cmap=${encodeURIComponent(layer.colorMapName)}`
        + `&min=${encodeURIComponent(layer.displayMin + '')}`
        + `&max=${encodeURIComponent(layer.displayMax + '')}`;
}

This means that if we change the selected variable, Cate Desktop will request a new GeoJSON DataSource! This is very inefficient, and we should get rid of the variable dependency. Color coding w.r.t. a selected variable shall be done in Cate Desktop. We also need to share a DataSource across multiple instances of the Cesium 3D globe; otherwise a 2nd globe will create, download, and keep in memory a 2nd DataSource.
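
The idea of a variable-independent, shared DataSource can be sketched as a cache keyed only by resource and level. The sketch is written in Python for brevity, although the real logic would live in the TypeScript front-end. The class DataSourceCache and the URL without var, cmap, min and max parameters are hypothetical; whether the existing endpoint accepts such a request is an assumption.

```python
from typing import Any, Callable, Dict, Tuple
from urllib.parse import quote


def geojson_url(base_url: str, base_dir: str, res_name: str, level: int) -> str:
    """Build a GeoJSON URL that does not depend on the selected variable."""
    return (f"{base_url}ws/res/geojson/{quote(base_dir, safe='')}/"
            f"{quote(res_name, safe='')}?level={level}")


class DataSourceCache:
    """Load each (base_dir, res_name, level) data source at most once."""

    def __init__(self, load: Callable[[str, str, int], Any]):
        self._load = load
        self._sources: Dict[Tuple[str, str, int], Any] = {}

    def get(self, base_dir: str, res_name: str, level: int) -> Any:
        key = (base_dir, res_name, level)
        if key not in self._sources:
            self._sources[key] = self._load(*key)
        return self._sources[key]
```

Every globe or map view would then call get() with the same key and receive the same object, while changing the selected variable would only change the styling applied on top of it.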
