Merge pull request #316 from s22s/docs/constant_tiles_hotfix
Display-oriented tweaks to docs
metasim authored Aug 24, 2019
2 parents f1e4646 + 6aa3c53 commit ba3d30d
Showing 4 changed files with 65 additions and 88 deletions.
59 changes: 35 additions & 24 deletions pyrasterframes/src/main/python/docs/aggregation.pymd
@@ -8,50 +8,62 @@ from pyrasterframes.rasterfunctions import *
from pyspark.sql import *
import os

import numpy as np
np.set_printoptions(precision=3, floatmode='maxprec')

spark = create_rf_spark_session()
```

There are three types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
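
As a minimal sketch of how the three differ (assuming a DataFrame `rf` with a _tile_ column named `tile`, like the one constructed in the next section):

```python
# Sketch only: the same tile column aggregated three ways.
rf.select(rf_tile_mean('tile'))      # tile aggregate: one scalar per row
rf.agg(rf_agg_mean('tile'))          # DataFrame aggregate: one scalar for the whole DataFrame
rf.agg(rf_agg_local_mean('tile'))    # local aggregate: one tile of cell-wise means across rows
```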

## Tile Mean Example

We can illustrate aggregate differences by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
We can illustrate aggregate differences by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_. The _tiles_ will contain normally distributed cell values with the first row's mean at 1.0 and the second row's mean at 5.0. For details on use of the `Tile` class, see @ref:[the page on NumPy interoperability](numpy-pandas.md).

```python, sql_dataframe
import pyspark.sql.functions as F
```python, create_tile1
from pyrasterframes.rf_types import Tile, CellType

df1 = spark.range(1).select('id', rf_make_ones_tile(5, 5, 'float32').alias('tile'))
df2 = spark.range(1).select('id', rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), F.lit(3)).alias('tile'))
t1 = Tile(1 + 0.1 * np.random.randn(5,5), CellType('float64raw'))

rf = df1.union(df2)
t1.cells # display the array in the Tile
```

tiles = rf.select("tile").collect()
print(tiles[0]['tile'].cells)
print(tiles[1]['tile'].cells)
```python, showt5
t5 = Tile(5 + 0.1 * np.random.randn(5,5), CellType('float64raw'))
t5.cells
```
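
As a quick check on the sample data (a minimal sketch; `cells` exposes each _tile_ as a NumPy array, so ordinary NumPy reductions apply), the per-tile means should land near 1.0 and 5.0:

```python
# Sketch: means of the two sample tiles, computed locally with NumPy.
t1.cells.mean(), t5.cells.mean()
```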

We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
Create a Spark DataFrame from the Tile objects.

```python, create_dataframe
import pyspark.sql.functions as F
from pyspark.sql import Row

rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=2, tile=t5)
]).orderBy('id')
```

We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is about 1.0 and the second mean is about 5.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.

```python, tile_mean
means = rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
means
rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
```

We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages values across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.

```python, agg_mean
mean = rf.agg(rf_agg_mean(F.col('tile')))
mean
rf.agg(rf_agg_mean(F.col('tile')))
```

We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of each pair of corresponding cell values, one from each of the two _tiles_ (one near 1.0 and one near 5.0), doing so twenty-five times, once for each position in the _tile_.

To compute an element-wise local aggregate, _tiles_ need to have the same dimensions. In this case, both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.

```python, local_mean
t = rf.agg(rf_agg_local_mean(F.col('tile')).alias('local_mean')) \
    .collect()[0]['local_mean']
print(t.cells)
rf.agg(rf_agg_local_mean('tile')) \
    .first()[0].cells.data # display the contents of the Tile array
```
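
The equal-dimension requirement noted above can be verified before aggregating. A minimal sketch using @ref:[`rf_dimensions`](reference.md#rf-dimensions), which reports each _tile_'s column and row counts:

```python
# Sketch: a single distinct row means every tile shares the same dimensions,
# so the element-wise local aggregate is safe to compute.
rf.select(rf_dimensions('tile').alias('dims')).distinct()
```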

## Cell Counts Example
@@ -92,12 +104,11 @@ stats
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.

```python, agg_local_stats
df1 = spark.range(1).select('id', rf_make_ones_tile(5, 5, 'float32').alias('tile'))
df2 = spark.range(1).select('id', rf_make_constant_tile(3, 5, 5, 'float32').alias('tile'))
df3 = spark.range(1).select('id', rf_make_constant_tile(5, 5, 5, 'float32').alias('tile'))

rf = df1.union(df2).union(df3) \
    .agg(rf_agg_local_stats('tile').alias('stats'))
rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=3, tile=t1 * 3),
    Row(id=5, tile=t1 * 5)
]).agg(rf_agg_local_stats('tile').alias('stats'))

agg_local_stats = rf.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').collect()

@@ -1,5 +1,28 @@
# Overview

```python, setup, echo=False
import pyrasterframes
import pyrasterframes.rf_ipython
from pyrasterframes.rasterfunctions import rf_crs, rf_extent, rf_tile, rf_data_cells
from pyspark.sql.functions import col, lit
spark = pyrasterframes.get_spark_session()

# Note that this is the same URI as in the getting started page...
df = spark.read.raster('https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/MCD43A4.A2019059.h11v08.006.2019072203257_B02.TIF')


df = df.select(
    lit('2019-02-28').alias('timestamp'),
    rf_crs('proj_raster').alias('crs'),
    rf_extent('proj_raster').alias('extent'),
    col('proj_raster').alias('tile')
) \
    .orderBy(-rf_data_cells('tile')) \
    .limit(4)

assert df.select('crs').first() is not None, "example dataframe is going to be empty"
```

RasterFrames® provides a DataFrame-centric view over arbitrary Earth-observation (EO) data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of [Apache Spark](https://spark.apache.org/docs/latest/) [ML](https://spark.apache.org/docs/latest/ml-guide.html) algorithms. It provides APIs in @ref:[Python, SQL, and Scala](languages.md), and can scale from a laptop computer to a large distributed cluster, enabling _global_ analysis with satellite imagery in a wholly new, flexible, and convenient way.

## Context
@@ -29,7 +52,9 @@ RasterFrames introduces georectified raster imagery to Spark SQL. It quantizes s

As shown in the figure below, a "RasterFrame" is a Spark DataFrame with one or more columns of type @ref:[_tile_](concepts.md#tile). A _tile_ column typically represents a single frequency band of sensor data, such as "blue" or "near infrared", but can also be quality assurance information, land classification assignments, or any other raster spatial data. Along with _tile_ columns there is typically an @ref:[`extent`](concepts.md#extent) specifying the geographic location of the data, the map projection of that geometry (@ref:[`crs`](concepts.md#coordinate-reference-system--crs-)), and a `timestamp` column representing the acquisition time. These columns can all be used in the `WHERE` clause when filtering.

@@include[RasterFrame Example](static/rasterframe-sample.md)
```python, show_example_df, echo=False
df
```
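
As a minimal sketch of the filtering described above (using only the columns defined in this page's setup block; the literal date matches the `timestamp` column created there):

```python
# Sketch: column-based filtering on a RasterFrame, analogous to a SQL WHERE clause.
df.filter(col('timestamp') == '2019-02-28') \
    .filter(rf_data_cells('tile') > 0) \
    .select('crs', 'extent', 'tile')
```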

RasterFrames also includes support for working with vector data, such as [GeoJSON][GeoJSON]. RasterFrames vector data operations let you filter with geospatial relationships like contains or intersects, mask cells, convert vectors to rasters, and more.

48 changes: 0 additions & 48 deletions pyrasterframes/src/main/python/docs/static/rasterframe-sample.md

This file was deleted.

19 changes: 4 additions & 15 deletions pyrasterframes/src/main/python/docs/vector-data.pymd
@@ -57,37 +57,26 @@ Since it is a geometry we can do things like this:
the_first['geometry'].wkt
```

You can also write user-defined functions that take geometries as input, output, or both, via user defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry.
You can also write user-defined functions that take geometries as input, output, or both, via the user-defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple **but inefficient** example of a user-defined function that takes a geometry as both input and output to compute the centroid of a geometry. Observe in the sample of the data below that the geometry columns print as well-known text (WKT).

```python, add_centroid
from pyspark.sql.functions import udf
from geomesa_pyspark.types import PointUDT

@udf(PointUDT())
def get_centroid(g):
def inefficient_centroid(g):
    return g.centroid

df = df.withColumn('naive_centroid', get_centroid(df.geometry))
df.printSchema()
```

We can take a look at a sample of the data. Notice the geometry columns print as well known text (wkt).

```python, show_centroid
df.limit(3)
df.select(df.state_code, inefficient_centroid(df.geometry))
```


## GeoMesa Functions and Spatial Relations

As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.


```python, native_centroid
from pyrasterframes.rasterfunctions import st_centroid
df = df.withColumn('centroid', st_centroid(df.geometry))
centroids = df.select('geometry', 'name', 'naive_centroid', 'centroid')
centroids.limit(3)
df.select(df.state_code, inefficient_centroid(df.geometry), st_centroid(df.geometry))
```

The RasterFrames vector functions and GeoMesa functions also provide a variety of spatial relations that are useful in combination with the geometric properties of projected rasters. In this example, we use the @ref:[built-in Landsat catalog](raster-catalogs.md#using-built-in-experimental-catalogs) which provides an extent. We will convert the extent to a polygon and filter to those within approximately 500 km of a selected point.
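
A minimal sketch of that pattern follows. It is not the page's original code: the catalog DataFrame `cat` and its `extent` struct column (with `xmin`, `ymin`, `xmax`, `ymax` fields in EPSG:4326) are assumptions, and "within approximately 500 km" is crudely approximated as about 4.5 degrees from the extent's center to keep the sketch self-contained.

```python
from pyrasterframes.rasterfunctions import st_geometry
from pyspark.sql.functions import col

lon, lat = -93.0, 42.0  # hypothetical point of interest

# Convert each catalog extent into a Polygon footprint.
footprints = cat.withColumn('footprint', st_geometry(col('extent')))

# Keep rows whose extent center falls within ~4.5 degrees (roughly 500 km) of the point.
nearby = footprints.where(
    ((col('extent.xmin') + col('extent.xmax')) / 2 - lon) ** 2 +
    ((col('extent.ymin') + col('extent.ymax')) / 2 - lat) ** 2 < 4.5 ** 2
)
```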
