Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add STAC Summary Endpoint for Global 10m LULC Data #117

Open
wants to merge 13 commits into
base: develop
Choose a base branch
from

Conversation

rajadain
Copy link
Member

@rajadain rajadain commented May 30, 2024

Overview

Adds a Stac Summary endpoint that takes a shape, a year, a STAC URI and a STAC Collection, and returns the histogram of pixels intersecting the AoI.

There is also a sister repository https://github.com/rajadain/mmw-io-10m-lulc-summary which has helper scripts to exercise this new endpoint, as well as run a Python based comparison for two sample shapes.

Currently, the Python implementation is faster:

time ./scripts/run_python examples/LowerSchuylkill.geojson

{
  "List(0)": 19620912,
  "List(1)": 113354,
  "List(11)": 400816,
  "List(2)": 831411,
  "List(4)": 369,
  "List(5)": 225703,
  "List(7)": 3708295,
  "List(8)": 4074
}

________________________________________________________
Executed in   11.88 secs    fish           external
   usr time    2.47 secs    0.08 millis    2.47 secs
   sys time    2.39 secs    1.22 millis    2.39 secs
time ./scripts/run_geotrellis examples/LowerSchuylkill.geojson

{
  "List(1)": 106351,
  "List(11)": 390430,
  "List(2)": 792084,
  "List(4)": 3060,
  "List(5)": 173543,
  "List(7)": 3529171,
  "List(8)": 3357
}

________________________________________________________
Executed in   15.82 secs    fish           external
   usr time    0.38 secs  131.00 micros    0.38 secs
   sys time    3.85 secs  994.00 micros    3.85 secs

with the GeoTrellis implementation operating at 0.5-0.75x the Python speed.

The resulting numbers are quite comparable though:

image

image

Note that "List(0)" represents NODATA values which are present in the Python output but not in the GeoTrellis one. Ultimately they are ignored so it doesn't matter.

The close percentage values are especially promising, because that is what is used in Model My Watershed.

Closes #113

Demo

image
xh --print HhBb :8090/stac < scratch/request.json
POST /stac HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Length: 149983
Content-Type: application/json
Host: localhost:8090
User-Agent: xh/0.22.0

{
    "shape": "{\"type\":\"MultiPolygon\",\"coordinates\":[[[[-75.638687083659,40.3002222895293],...]]]}",
    "year": 2019,
    "stacUri": "https://api.impactobservatory.com/stac-aws",
    "stacCollection": "io-10m-annual-lulc"
}



HTTP/1.1 200 OK
Content-Length: 118
Content-Type: application/json
Date: Thu, 30 May 2024 01:14:45 GMT
Server: akka-http/10.5.3

{
    "List(2)": 792084,
    "List(5)": 173543,
    "List(11)": 390430,
    "List(7)": 3529171,
    "List(1)": 106351,
    "List(4)": 3060,
    "List(8)": 3357
}

Notes

I'm looking for help in three places:

  1. Correctness: In the Spark RDD case, we use Rasterizer.foreachCellByMultiPolygon() to count the pixels. Here we're doing a mask and then histogram.binCount(). Is this implementation correct?
  2. Performance: In my tests using the https://github.com/rajadain/mmw-io-10m-lulc-summary repo, the Python version is twice as fast as the GeoTrellis version. How can this be improved?
  3. Idiomaticness: Is the code written in an idiomatic GeoTrellis style?

Testing Instructions

  • Check out this branch
  • Run ./scripts/update && ./script/server
  • With this server running at http://localhost:8090/, check out this repository: https://github.com/rajadain/mmw-io-10m-lulc-summary
  • Within that repo, run ./scripts/setup
  • Within that repo, run ./scripts/run_python examples/LowerSchuylkill.geojson to get a baseline
  • Within that repo, run ./scripts/run_geotrellis examples/LowerSchuylkill.geojson to exercise this new endpoint

@rajadain rajadain requested a review from jpolchlo May 30, 2024 01:50
@rajadain rajadain changed the title Add Stac Summary Endpoint for Global 10m LULC Data Add STAC Summary Endpoint for Global 10m LULC Data Jun 4, 2024
@jpolchlo
Copy link
Contributor

jpolchlo commented Jun 17, 2024

I've generated a flame graph for the run of the Lower Schuylkill example:
flamegraph
The slowdown relative to Python appears to be arising from the fact that you're doing two reproject operations: once in MosaicRasterSource and once in StacAssetReprojectRasterSource. I'm not clear on whether both operations are necessary, but if this can be consolidated down to a single reproject step, then you should be in much better shape, since those two operations are currently taking up 35% and 48% of the total runtime (though I can't guarantee that runtime is exactly accurate, since I'm not a Java Flight Recorder expert).

Edit: looking at this more closely, I misspoke: there's a resample operation and a reproject operation, both arising from MosaicRasterSource that should be joinable. I know from experience that reprojection should be able to resample in a single operation.

@rajadain
Copy link
Member Author

Thanks! Taking a look at this now. How did you generate this flamegraph? Would be useful to do as I iterate on this.

Looks like we're reprojecting on line 83, but also passing the targetCRS as a parameter in lines 88 and 90:

val reprojectedSources = NonEmptyList.of(head, tail: _*).map(_.reproject(targetCRS))
val attributes = reprojectedSources.toList.attributesByName
val mosaicRasterSource =
if (parallelMosaicEnabled)
MosaicRasterSourceIO.instance(reprojectedSources, targetCRS, collectionName, attributes)(IORuntime.global)
else
MosaicRasterSource.instance(reprojectedSources, targetCRS, collectionName, attributes)

If I remove the reprojection from line 83, I get an empty output (albeit much faster, likely because it's not reading any data 😅)

diff --git a/api/src/main/scala/package.scala b/api/src/main/scala/package.scala
index 199fce0..010a792 100644
--- a/api/src/main/scala/package.scala
+++ b/api/src/main/scala/package.scala
@@ -80,14 +80,14 @@ package object geoprocessing {
       sources match {
         case head :: Nil => head.some
         case head :: tail =>
-          val reprojectedSources = NonEmptyList.of(head, tail: _*).map(_.reproject(targetCRS))
-          val attributes = reprojectedSources.toList.attributesByName
+          val sources = NonEmptyList.of(head, tail: _*)
+          val attributes = sources.toList.attributesByName
 
           val mosaicRasterSource =
             if (parallelMosaicEnabled)
-              MosaicRasterSourceIO.instance(reprojectedSources, targetCRS, collectionName, attributes)(IORuntime.global)
+              MosaicRasterSourceIO.instance(sources, targetCRS, collectionName, attributes)(IORuntime.global)
             else
-              MosaicRasterSource.instance(reprojectedSources, targetCRS, collectionName, attributes)
+              MosaicRasterSource.instance(sources, targetCRS, collectionName, attributes)
 
           mosaicRasterSource.some
         case _ => None
time ./scripts/run_geotrellis examples/LowerSchuylkill.geojson
{}

________________________________________________________
Executed in    6.81 secs      fish           external
   usr time   20.76 millis    0.10 millis   20.65 millis
   sys time   21.00 millis    1.04 millis   19.95 millis

I'll see if I can figure out how to do the resampling only once while still selecting the right data. The above may imply that 6.8s is a timing floor below which we cannot go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS 2-3: Add New Operation for RasterGroupedCount on COGs
2 participants