Commit

Merge branch 'databrickslabs:main' into main
a0x8o authored Apr 14, 2023
2 parents 2c80340 + f59edd2 commit ea07ae6
Showing 61 changed files with 3,400 additions and 244 deletions.
5 changes: 1 addition & 4 deletions .github/workflows/build_main.yml
@@ -8,10 +8,7 @@ on:
- "scala/*"
pull_request:
branches:
- "R/*"
- "r/*"
- "python/*"
- "scala/*"
- "**"
jobs:
build:
runs-on: ubuntu-20.04
2 changes: 2 additions & 0 deletions docs/source/api/api.rst
@@ -4,6 +4,8 @@ API Documentation
.. toctree::
:maxdepth: 2

vector-format-readers
raster-format-readers
geometry-constructors
geometry-accessors
spatial-functions
154 changes: 154 additions & 0 deletions docs/source/api/raster-format-readers.rst
@@ -0,0 +1,154 @@
=====================
Raster Format Readers
=====================


Intro
################
Mosaic provides Spark readers for the following raster formats:
* GTiff (GeoTiff) using .tif file extension - https://gdal.org/drivers/raster/gtiff.html
* COG (Cloud Optimized GeoTiff) using .tif file extension - https://gdal.org/drivers/raster/cog.html
* HDF4 using .hdf file extension - https://gdal.org/drivers/raster/hdf4.html
* HDF5 using .h5 file extension - https://gdal.org/drivers/raster/hdf5.html
* NetCDF using .nc file extension - https://gdal.org/drivers/raster/netcdf.html
* JP2ECW using .jp2 file extension - https://gdal.org/drivers/raster/jp2ecw.html
* JP2KAK using .jp2 file extension - https://gdal.org/drivers/raster/jp2kak.html
* JP2OpenJPEG using .jp2 file extension - https://gdal.org/drivers/raster/jp2openjpeg.html
* PDF using .pdf file extension - https://gdal.org/drivers/raster/pdf.html
* PNG using .png file extension - https://gdal.org/drivers/raster/png.html
* VRT using .vrt file extension - https://gdal.org/drivers/raster/vrt.html
* XPM using .xpm file extension - https://gdal.org/drivers/raster/xpm.html
* GRIB using .grb file extension - https://gdal.org/drivers/raster/grib.html
* Zarr using .zarr file extension - https://gdal.org/drivers/raster/zarr.html
Other formats supported by GDAL will be added in future releases.

Mosaic provides two flavors of readers:
* spark.read.format("gdal") for reading one file per Spark task
* mos.read().format("raster_to_grid"), a reader that automatically converts rasters to grid cells.


spark.read.format("gdal")
*************************
A base Spark SQL data source for reading GDAL raster data sources.
It reads the raster's metadata and exposes direct paths to the raster files.
The output of the reader is a DataFrame with the following columns:
* path - path to the raster file on dbfs (StringType)
* ySize - height of the raster in pixels (IntegerType)
* xSize - width of the raster in pixels (IntegerType)
* bandCount - number of bands in the raster (IntegerType)
* metadata - raster metadata (MapType(StringType, StringType))
* subdatasets - raster subdatasets (MapType(StringType, StringType))
* srid - raster spatial reference system identifier (IntegerType)
* proj4Str - raster spatial reference system proj4 string (StringType)

.. function:: spark.read.format("gdal").load(path)

Loads a GDAL raster file and returns the result as a DataFrame.
It uses the standard spark.read.format(*).option(*).load(*) pattern.

:param path: path to the raster file on dbfs
    :type path: StringType
:rtype: DataFrame
:example:

.. tabs::
.. code-tab:: py

>>> df = spark.read.format("gdal")\
.option("driverName", "TIF")\
.load("dbfs:/path/to/raster.tif")
>>> df.show()
+--------------------+-----+-----+---------+--------------------+--------------------+----+--------------------+
| path|ySize|xSize|bandCount| metadata| subdatasets|srid| proj4Str|
+--------------------+-----+-----+---------+--------------------+--------------------+----+--------------------+
|dbfs:/path/to/ra...| 100| 100| 1|{AREA_OR_POINT=Po...| null| 4326|+proj=longlat +da...|
+--------------------+-----+-----+---------+--------------------+--------------------+----+--------------------+

.. code-tab:: scala

>>> val df = spark.read.format("gdal")
.option("driverName", "TIF")
.load("dbfs:/path/to/raster.tif")
>>> df.show()
+--------------------+-----+-----+---------+--------------------+--------------------+----+--------------------+
| path|ySize|xSize|bandCount| metadata| subdatasets|srid| proj4Str|
+--------------------+-----+-----+---------+--------------------+--------------------+----+--------------------+
|dbfs:/path/to/ra...| 100| 100| 1|{AREA_OR_POINT=Po...| null| 4326|+proj=longlat +da...|
+--------------------+-----+-----+---------+--------------------+--------------------+----+--------------------+



mos.read().format("raster_to_grid")
***********************************
Reads a GDAL raster file and converts it to a grid.
It uses a pattern similar to the standard spark.read.format(*).option(*).load(*) pattern.
The only difference is that it uses mos.read() instead of spark.read().
The raster pixels are converted to grid cells using the specified combiner operation (default is mean).
If the raster pixels are larger than the grid cells, the cell values can be calculated using interpolation.
The interpolation method used is Inverse Distance Weighting (IDW), where the distance function is the k_ring
distance of the grid.
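The IDW scheme described above can be sketched as follows. This is an illustrative outline only, not Mosaic's internal implementation; the function and parameter names (`idw_cell_value`, `k_ring_distance`) are hypothetical:

```python
# Illustrative sketch of Inverse Distance Weighting over a grid, where the
# distance between two cells is their k_ring distance (how many "rings" apart
# they are on the grid). Not Mosaic's actual code.

def idw_cell_value(target_cell, samples, k_ring_distance):
    """Estimate a cell value from (cell, value) samples using IDW
    with a k_ring distance function."""
    num, den = 0.0, 0.0
    for cell, value in samples:
        d = k_ring_distance(target_cell, cell)
        if d == 0:
            return value  # the target cell itself carries a measured value
        w = 1.0 / d      # inverse-distance weight
        num += w * value
        den += w
    return num / den

# Toy 1-D "grid" where the k_ring distance is just the index difference.
samples = [(0, 10.0), (4, 20.0)]
print(idw_cell_value(2, samples, lambda a, b: abs(a - b)))  # -> 15.0
```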
The reader supports the following options:
* fileExtension - file extension of the raster file (StringType) - default is *.*
* vsizip - if the rasters are zipped files, set this to true (BooleanType)
* resolution - resolution of the output grid (IntegerType)
* combiner - combiner operation to use when converting raster to grid (StringType) - default is mean
* retile - if the rasters are too large they can be re-tiled to smaller tiles (BooleanType)
* tileSize - size of the re-tiled tiles, tiles are always squares of tileSize x tileSize (IntegerType)
* readSubdatasets - if the raster has subdatasets set this to true (BooleanType)
* subdatasetNumber - if the raster has subdatasets, select a specific subdataset by index (IntegerType)
* subdatasetName - if the raster has subdatasets, select a specific subdataset by name (StringType)
* kRingInterpolate - if the raster pixels are larger than the grid cells, use k_ring interpolation with n = kRingInterpolate (IntegerType)


.. function:: mos.read().format("raster_to_grid").load(path)

Loads a GDAL raster file and returns the result as a DataFrame.
It uses the standard mos.read().format(*).option(*).load(*) pattern.

:param path: path to the raster file on dbfs
    :type path: StringType
:rtype: DataFrame
:example:

.. tabs::
.. code-tab:: py

>>> df = mos.read().format("raster_to_grid")\
.option("fileExtension", "tif")\
.option("resolution", "8")\
.option("combiner", "mean")\
.option("retile", "true")\
.option("tileSize", "1000")\
.option("kRingInterpolate", "2")\
.load("dbfs:/path/to/raster.tif")
>>> df.show()
+--------+--------+------------------+
|band_id |cell_id |cell_value |
+--------+--------+------------------+
| 1| 1|0.1400000000000000|
| 1| 2|0.1400000000000000|
| 1| 3|0.2464000000000000|
| 1| 4|0.2464000000000000|
+--------+--------+------------------+

.. code-tab:: scala

>>> val df = MosaicContext.read.format("raster_to_grid")
.option("fileExtension", "tif")
.option("resolution", "8")
.option("combiner", "mean")
.option("retile", "true")
.option("tileSize", "1000")
.option("kRingInterpolate", "2")
.load("dbfs:/path/to/raster.tif")
>>> df.show()
+--------+--------+------------------+
|band_id |cell_id |cell_value |
+--------+--------+------------------+
| 1| 1|0.1400000000000000|
| 1| 2|0.1400000000000000|
| 1| 3|0.2464000000000000|
| 1| 4|0.2464000000000000|
+--------+--------+------------------+
14 changes: 7 additions & 7 deletions docs/source/api/spatial-indexing.rst
@@ -8,13 +8,13 @@ from the selected spatial grid.
The grid system can be specified by using the spark configuration `spark.databricks.labs.mosaic.index.system`
before enabling Mosaic.

The valid values are
* `H3` - Good all-rounder for any location on earth
* `BNG` - Local grid system Great Britain (EPSG:27700)
* `CUSTOM(minX,maxX,minY,maxY,splits,rootCellSizeX,rootCellSizeY)` - Can be used with any local or global CRS
* `minX`,`maxX`,`minY`,`maxY` can be positive or negative integers defining the grid bounds
* `splits` defines how many splits are applied to each cell for an increase in resolution step (usually 2 or 10)
* `rootCellSizeX`,`rootCellSizeY` define the size of the cells on resolution 0
The valid values are:
* `H3` - Good all-rounder for any location on earth
* `BNG` - Local grid system Great Britain (EPSG:27700)
* `CUSTOM(minX,maxX,minY,maxY,splits,rootCellSizeX,rootCellSizeY)` - Can be used with any local or global CRS
* `minX`,`maxX`,`minY`,`maxY` can be positive or negative integers defining the grid bounds
* `splits` defines how many splits are applied to each cell for an increase in resolution step (usually 2 or 10)
* `rootCellSizeX`,`rootCellSizeY` define the size of the cells on resolution 0
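As a minimal sketch of setting this configuration (assuming a live SparkSession named `spark` and a Databricks notebook environment providing `dbutils`):

```python
# Select the grid system BEFORE enabling Mosaic; here we assume BNG.
spark.conf.set("spark.databricks.labs.mosaic.index.system", "BNG")

# Enabling Mosaic afterwards picks up the configured index system.
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
```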

Example

