Add STAC Summary Endpoint for Global 10m LULC Data #117

rajadain · 2024-05-30T01:50:09Z

Overview

Adds a Stac Summary endpoint that takes a shape, a year, a STAC URI and a STAC Collection, and returns the histogram of pixels intersecting the AoI.

There is also a sister repository https://github.com/rajadain/mmw-io-10m-lulc-summary which has helper scripts to exercise this new endpoint, as well as run a Python based comparison for two sample shapes.

Currently, the Python implementation is faster:

time ./scripts/run_python examples/LowerSchuylkill.geojson

{
  "List(0)": 19620912,
  "List(1)": 113354,
  "List(11)": 400816,
  "List(2)": 831411,
  "List(4)": 369,
  "List(5)": 225703,
  "List(7)": 3708295,
  "List(8)": 4074
}

________________________________________________________
Executed in   11.88 secs    fish           external
   usr time    2.47 secs    0.08 millis    2.47 secs
   sys time    2.39 secs    1.22 millis    2.39 secs

time ./scripts/run_geotrellis examples/LowerSchuylkill.geojson

{
  "List(1)": 106351,
  "List(11)": 390430,
  "List(2)": 792084,
  "List(4)": 3060,
  "List(5)": 173543,
  "List(7)": 3529171,
  "List(8)": 3357
}

________________________________________________________
Executed in   15.82 secs    fish           external
   usr time    0.38 secs  131.00 micros    0.38 secs
   sys time    3.85 secs  994.00 micros    3.85 secs

with the GeoTrellis implementation operating at 0.5-0.75x the Python speed.

The resulting numbers are quite comparable though:

Note that "List(0)" represents NODATA values which are present in the Python output but not in the GeoTrellis one. Ultimately they are ignored so it doesn't matter.

The close percentage values are especially promising, because that is what is used in Model My Watershed.

Closes #113

Demo

xh --print HhBb :8090/stac < scratch/request.json

POST /stac HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Length: 149983
Content-Type: application/json
Host: localhost:8090
User-Agent: xh/0.22.0

{
    "shape": "{\"type\":\"MultiPolygon\",\"coordinates\":[[[[-75.638687083659,40.3002222895293],...]]]}",
    "year": 2019,
    "stacUri": "https://api.impactobservatory.com/stac-aws",
    "stacCollection": "io-10m-annual-lulc"
}



HTTP/1.1 200 OK
Content-Length: 118
Content-Type: application/json
Date: Thu, 30 May 2024 01:14:45 GMT
Server: akka-http/10.5.3

{
    "List(2)": 792084,
    "List(5)": 173543,
    "List(11)": 390430,
    "List(7)": 3529171,
    "List(1)": 106351,
    "List(4)": 3060,
    "List(8)": 3357
}

Notes

I'm looking for help in three places:

Correctness: In the Spark RDD case, we use Rasterizer.foreachCellByMultiPolygon() to count the pixels. Here we're doing a mask and then histogram.binCount(). Is this implementation correct?
Performance: In my tests using the https://github.com/rajadain/mmw-io-10m-lulc-summary repo, the Python version is twice as fast as the GeoTrellis version. How can this be improved?
Idiomaticness: Is the code written in an idiomatic GeoTrellis style?

Testing Instructions

Check out this branch
Run ./scripts/update && ./script/server
With this server running at http://localhost:8090/, check out this repository: https://github.com/rajadain/mmw-io-10m-lulc-summary
Within that repo, run ./scripts/setup
Within that repo, run ./scripts/run_python examples/LowerSchuylkill.geojson to get a baseline
Within that repo, run ./scripts/run_geotrellis examples/LowerSchuylkill.geojson to exercise this new endpoint

We want to use an Equal Area (Albers) projection for this. If the AoI is in the Continental United States, then we use ConusAlbers.

jpolchlo · 2024-06-17T15:55:34Z

I've generated a flame graph for the run of the Lower Schuylkill example:

The slowdown relative to Python appears to be arising from the fact that you're doing two reproject operations: once in MosaicRasterSource and once in StacAssetReprojectRasterSource. I'm not clear on whether both operations are necessary, but if this can be consolidated down to a single reproject step, then you should be in much better shape, since those two operations are currently taking up 35% and 48% of the total runtime (though I can't guarantee that runtime is exactly accurate, since I'm not a Java Flight Recorder expert).

Edit: looking at this more closely, I misspoke: there's a resample operation and a reproject operation, both arising from MosaicRasterSource that should be joinable. I know from experience that reprojection should be able to resample in a single operation.

rajadain · 2024-06-18T18:31:43Z

Thanks! Taking a look at this now. How did you generate this flamegraph? Would be useful to do as I iterate on this.

Looks like we're reprojecting on line 83, but also passing the targetCRS as a parameter in lines 88 and 90:

mmw-geoprocessing/api/src/main/scala/package.scala

Lines 83 to 90 in 025bfac

    
           val reprojectedSources = NonEmptyList.of(head, tail: _*).map(_.reproject(targetCRS)) 
        
           val attributes = reprojectedSources.toList.attributesByName 
        
           val mosaicRasterSource = 
        
             if (parallelMosaicEnabled) 
        
               MosaicRasterSourceIO.instance(reprojectedSources, targetCRS, collectionName, attributes)(IORuntime.global) 
        
             else 
        
               MosaicRasterSource.instance(reprojectedSources, targetCRS, collectionName, attributes)

If I remove the reprojection from line 83, I get an empty output (albeit much faster, likely because it's not reading any data 😅)

diff --git a/api/src/main/scala/package.scala b/api/src/main/scala/package.scala
index 199fce0..010a792 100644
--- a/api/src/main/scala/package.scala
+++ b/api/src/main/scala/package.scala
@@ -80,14 +80,14 @@ package object geoprocessing {
       sources match {
         case head :: Nil => head.some
         case head :: tail =>
-          val reprojectedSources = NonEmptyList.of(head, tail: _*).map(_.reproject(targetCRS))
-          val attributes = reprojectedSources.toList.attributesByName
+          val sources = NonEmptyList.of(head, tail: _*)
+          val attributes = sources.toList.attributesByName
 
           val mosaicRasterSource =
             if (parallelMosaicEnabled)
-              MosaicRasterSourceIO.instance(reprojectedSources, targetCRS, collectionName, attributes)(IORuntime.global)
+              MosaicRasterSourceIO.instance(sources, targetCRS, collectionName, attributes)(IORuntime.global)
             else
-              MosaicRasterSource.instance(reprojectedSources, targetCRS, collectionName, attributes)
+              MosaicRasterSource.instance(sources, targetCRS, collectionName, attributes)
 
           mosaicRasterSource.some
         case _ => None

time ./scripts/run_geotrellis examples/LowerSchuylkill.geojson
{}

________________________________________________________
Executed in    6.81 secs      fish           external
   usr time   20.76 millis    0.10 millis   20.65 millis
   sys time   21.00 millis    1.04 millis   19.95 millis

I'll see if I can figure out how to do the resampling only once while still selecting the right data. The above may imply that 6.8s is a timing floor below which we cannot go.

rajadain added 13 commits May 10, 2024 15:45

Add stac4s for querying STACs

e6db479

WIP

b69f7ac

WIP

89652f8

Query pixel counts

6c519dd

Clean up

2aeb60a

Get STAC summary working

7b27df4

Always reproject everything to targetCRS

86f728c

We want to use an Equal Area (Albers) projection for this. If the AoI is in the Continental United States, then we use ConusAlbers.

Limit counts to the AoI

3db3fab

Add minor comments

4feb66a

Receive uri, year, collection from input as well

a684964

Enable parallel mosaicking for speedup

3d3b813

Follow VS Code recommendation

038dead

Add credit, source

025bfac

rajadain requested a review from jpolchlo May 30, 2024 01:50

rajadain assigned jpolchlo May 30, 2024

rajadain changed the title ~~Add Stac Summary Endpoint for Global 10m LULC Data~~ Add STAC Summary Endpoint for Global 10m LULC Data Jun 4, 2024

rajadain mentioned this pull request Jul 15, 2024

AWS 2-4: Add Celery Task for Global Land Cover Data Fetching WikiWatershed/model-my-watershed#3633

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add STAC Summary Endpoint for Global 10m LULC Data #117

Add STAC Summary Endpoint for Global 10m LULC Data #117

rajadain commented May 30, 2024 •

edited

Loading

jpolchlo commented Jun 17, 2024 •

edited

Loading

rajadain commented Jun 18, 2024

Add STAC Summary Endpoint for Global 10m LULC Data #117

Are you sure you want to change the base?

Add STAC Summary Endpoint for Global 10m LULC Data #117

Conversation

rajadain commented May 30, 2024 • edited Loading

Overview

Demo

Notes

Testing Instructions

jpolchlo commented Jun 17, 2024 • edited Loading

rajadain commented Jun 18, 2024

rajadain commented May 30, 2024 •

edited

Loading

jpolchlo commented Jun 17, 2024 •

edited

Loading