Merge pull request #316 from s22s/docs/constant_tiles_hotfix
Display-oriented tweaks to docs
metasim authored Aug 24, 2019
2 parents f1e4646 + 6aa3c53 commit ba3d30d
Showing 4 changed files with 65 additions and 88 deletions.
59 changes: 35 additions & 24 deletions pyrasterframes/src/main/python/docs/aggregation.pymd
@@ -8,50 +8,62 @@ from pyrasterframes.rasterfunctions import *
from pyspark.sql import *
import os

import numpy as np
np.set_printoptions(precision=3, floatmode='maxprec')

spark = create_rf_spark_session()
```

There are three types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
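
As a minimal sketch of how the three differ (assuming a DataFrame `rf` with a _tile_ column named `tile`, like the one constructed in the next section):

```python
# Sketch only: the same tile column aggregated three ways.
rf.select(rf_tile_mean('tile'))      # tile aggregate: one scalar per row
rf.agg(rf_agg_mean('tile'))          # DataFrame aggregate: one scalar for the whole DataFrame
rf.agg(rf_agg_local_mean('tile'))    # local aggregate: one tile of cell-wise means across rows
```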

## Tile Mean Example

We can illustrate aggregate differences by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
We can illustrate aggregate differences by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_. The _tiles_ will contain normally distributed cell values with the first row's mean at 1.0 and the second row's mean at 5.0. For details on use of the `Tile` class, see @ref:[the page on NumPy interoperability](numpy-pandas.md).

```python, sql_dataframe
import pyspark.sql.functions as F
```python, create_tile1
from pyrasterframes.rf_types import Tile, CellType

df1 = spark.range(1).select('id', rf_make_ones_tile(5, 5, 'float32').alias('tile'))
df2 = spark.range(1).select('id', rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), F.lit(3)).alias('tile'))
t1 = Tile(1 + 0.1 * np.random.randn(5,5), CellType('float64raw'))

rf = df1.union(df2)
t1.cells # display the array in the Tile
```

tiles = rf.select("tile").collect()
print(tiles[0]['tile'].cells)
print(tiles[1]['tile'].cells)
```python, showt5
t5 = Tile(5 + 0.1 * np.random.randn(5,5), CellType('float64raw'))
t5.cells
```
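
As a quick check on the sample data (a minimal sketch; `cells` exposes each _tile_ as a NumPy array, so ordinary NumPy reductions apply), the per-tile means should land near 1.0 and 5.0:

```python
# Sketch: means of the two sample tiles, computed locally with NumPy.
t1.cells.mean(), t5.cells.mean()
```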

We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
Create a Spark DataFrame from the Tile objects.

```python, create_dataframe
import pyspark.sql.functions as F
from pyspark.sql import Row

rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=2, tile=t5)
]).orderBy('id')
```

We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is about 1.0 and the second mean is about 5.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.

```python, tile_mean
means = rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
means
rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
```

We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages values across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.

```python, agg_mean
mean = rf.agg(rf_agg_mean(F.col('tile')))
mean
rf.agg(rf_agg_mean(F.col('tile')))
```

We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of each pair of corresponding cell values, one from each of the two _tiles_ (one near 1.0 and one near 5.0), doing so twenty-five times, once for each position in the _tile_.

To compute an element-wise local aggregate, _tiles_ need to have the same dimensions. In this case, both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.

```python, local_mean
t = rf.agg(rf_agg_local_mean(F.col('tile')).alias('local_mean')) \
    .collect()[0]['local_mean']
print(t.cells)
rf.agg(rf_agg_local_mean('tile')) \
    .first()[0].cells.data # display the contents of the Tile array
```
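
The equal-dimension requirement noted above can be verified before aggregating. A minimal sketch using @ref:[`rf_dimensions`](reference.md#rf-dimensions), which reports each _tile_'s column and row counts:

```python
# Sketch: a single distinct row means every tile shares the same dimensions,
# so the element-wise local aggregate is safe to compute.
rf.select(rf_dimensions('tile').alias('dims')).distinct()
```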

## Cell Counts Example
@@ -92,12 +104,11 @@ stats
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.

```python, agg_local_stats
df1 = spark.range(1).select('id', rf_make_ones_tile(5, 5, 'float32').alias('tile'))
df2 = spark.range(1).select('id', rf_make_constant_tile(3, 5, 5, 'float32').alias('tile'))
df3 = spark.range(1).select('id', rf_make_constant_tile(5, 5, 5, 'float32').alias('tile'))

rf = df1.union(df2).union(df3) \
    .agg(rf_agg_local_stats('tile').alias('stats'))
rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=3, tile=t1 * 3),
    Row(id=5, tile=t1 * 5)
]).agg(rf_agg_local_stats('tile').alias('stats'))

agg_local_stats = rf.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').collect()

@@ -1,5 +1,28 @@
# Overview

```python, setup, echo=False
import pyrasterframes
import pyrasterframes.rf_ipython
from pyrasterframes.rasterfunctions import rf_crs, rf_extent, rf_tile, rf_data_cells
from pyspark.sql.functions import col, lit
spark = pyrasterframes.get_spark_session()

# Note that this is the same URI as in the getting started page...
df = spark.read.raster('https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/MCD43A4.A2019059.h11v08.006.2019072203257_B02.TIF')


df = df.select(
    lit('2019-02-28').alias('timestamp'),
    rf_crs('proj_raster').alias('crs'),
    rf_extent('proj_raster').alias('extent'),
    col('proj_raster').alias('tile')
) \
    .orderBy(-rf_data_cells('tile')) \
    .limit(4)

assert df.select('crs').first() is not None, "example dataframe is going to be empty"
```

RasterFrames® provides a DataFrame-centric view over arbitrary Earth-observation (EO) data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of [Apache Spark](https://spark.apache.org/docs/latest/) [ML](https://spark.apache.org/docs/latest/ml-guide.html) algorithms. It provides APIs in @ref:[Python, SQL, and Scala](languages.md), and can scale from a laptop computer to a large distributed cluster, enabling _global_ analysis with satellite imagery in a wholly new, flexible, and convenient way.

## Context
@@ -29,7 +52,9 @@ RasterFrames introduces georectified raster imagery to Spark SQL. It quantizes s

As shown in the figure below, a "RasterFrame" is a Spark DataFrame with one or more columns of type @ref:[_tile_](concepts.md#tile). A _tile_ column typically represents a single frequency band of sensor data, such as "blue" or "near infrared", but can also be quality assurance information, land classification assignments, or any other raster spatial data. Along with _tile_ columns there is typically an @ref:[`extent`](concepts.md#extent) specifying the geographic location of the data, the map projection of that geometry (@ref:[`crs`](concepts.md#coordinate-reference-system--crs-)), and a `timestamp` column representing the acquisition time. These columns can all be used in the `WHERE` clause when filtering.

@@include[RasterFrame Example](static/rasterframe-sample.md)
```python, show_example_df, echo=False
df
```
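
As a minimal sketch of the filtering described above (using only the columns defined in this page's setup block; the literal date matches the `timestamp` column created there):

```python
# Sketch: column-based filtering on a RasterFrame, analogous to a SQL WHERE clause.
df.filter(col('timestamp') == '2019-02-28') \
    .filter(rf_data_cells('tile') > 0) \
    .select('crs', 'extent', 'tile')
```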

RasterFrames also includes support for working with vector data, such as [GeoJSON][GeoJSON]. RasterFrames vector data operations let you filter with geospatial relationships like contains or intersects, mask cells, convert vectors to rasters, and more.

48 changes: 0 additions & 48 deletions pyrasterframes/src/main/python/docs/static/rasterframe-sample.md

This file was deleted.

19 changes: 4 additions & 15 deletions pyrasterframes/src/main/python/docs/vector-data.pymd
@@ -57,37 +57,26 @@ Since it is a geometry we can do things like this:
the_first['geometry'].wkt
```

You can also write user-defined functions that take geometries as input, output, or both, via user defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry.
You can also write user-defined functions that take geometries as input, output, or both, via the user-defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple **but inefficient** example of a user-defined function that takes a geometry as both input and output to compute the centroid of a geometry. Observe in the sample of the data below that the geometry columns print as well-known text (WKT).

```python, add_centroid
from pyspark.sql.functions import udf
from geomesa_pyspark.types import PointUDT

@udf(PointUDT())
def get_centroid(g):
def inefficient_centroid(g):
    return g.centroid

df = df.withColumn('naive_centroid', get_centroid(df.geometry))
df.printSchema()
```

We can take a look at a sample of the data. Notice the geometry columns print as well known text (wkt).

```python, show_centroid
df.limit(3)
df.select(df.state_code, inefficient_centroid(df.geometry))
```


## GeoMesa Functions and Spatial Relations

As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.


```python, native_centroid
from pyrasterframes.rasterfunctions import st_centroid
df = df.withColumn('centroid', st_centroid(df.geometry))
centroids = df.select('geometry', 'name', 'naive_centroid', 'centroid')
centroids.limit(3)
df.select(df.state_code, inefficient_centroid(df.geometry), st_centroid(df.geometry))
```

The RasterFrames vector functions and GeoMesa functions also provide a variety of spatial relations that are useful in combination with the geometric properties of projected rasters. In this example, we use the @ref:[built-in Landsat catalog](raster-catalogs.md#using-built-in-experimental-catalogs) which provides an extent. We will convert the extent to a polygon and filter to those within approximately 500 km of a selected point.
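
A minimal sketch of that pattern follows. It is not the page's original code: the catalog DataFrame `cat` and its `extent` struct column (with `xmin`, `ymin`, `xmax`, `ymax` fields in EPSG:4326) are assumptions, and "within approximately 500 km" is crudely approximated as about 4.5 degrees from the extent's center to keep the sketch self-contained.

```python
from pyrasterframes.rasterfunctions import st_geometry
from pyspark.sql.functions import col

lon, lat = -93.0, 42.0  # hypothetical point of interest

# Convert each catalog extent into a Polygon footprint.
footprints = cat.withColumn('footprint', st_geometry(col('extent')))

# Keep rows whose extent center falls within ~4.5 degrees (roughly 500 km) of the point.
nearby = footprints.where(
    ((col('extent.xmin') + col('extent.xmax')) / 2 - lon) ** 2 +
    ((col('extent.ymin') + col('extent.ymax')) / 2 - lat) ** 2 < 4.5 ** 2
)
```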
