[SEDONA-630] Improve ST_Union_Aggr performance #1526

zhangfengcdt · 2024-07-18T20:49:41Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a JIRA ticket?

Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format [SEDONA-XXX] my subject.

What changes were proposed in this PR?

Switch to the JTS `OverlayNGRobust.union` function to perform the geometry union, and add a geometry cache capability.
https://locationtech.github.io/jts/javadoc/org/locationtech/jts/operation/overlayng/OverlayNGRobust.html
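
For readers unfamiliar with the JTS call named above, here is a minimal sketch of unioning a batch of geometries with `OverlayNGRobust.union`. It is not the code from this PR: the object name `OverlayUnionSketch` and the helper `unionAll` are made up for illustration, it assumes JTS 1.18 or later on the classpath, and it does not attempt to show the geometry cache.

```scala
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.WKTReader
import org.locationtech.jts.operation.overlayng.OverlayNGRobust

// Hypothetical illustration only; not taken from the Sedona code base.
object OverlayUnionSketch {
  private val reader = new WKTReader()

  // Parse each WKT string and union the whole batch in one OverlayNGRobust call.
  def unionAll(wkts: Seq[String]): Geometry = {
    val geoms = new java.util.ArrayList[Geometry]()
    wkts.foreach(w => geoms.add(reader.read(w)))
    OverlayNGRobust.union(geoms)
  }

  def main(args: Array[String]): Unit = {
    // Two unit squares that share an edge.
    val merged = unionAll(Seq(
      "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))",
      "POLYGON ((1 0, 1 1, 2 1, 2 0, 1 0))"))
    println(merged.toText)
    println(merged.getArea)
  }
}
```

The sketch passes the whole collection to a single call, so JTS sees all inputs at once rather than merging them pairwise.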

How was this patch tested?

All existing unit tests should pass.

Did this PR include necessary documentation updates?

No, this PR does not affect any public API so no need to change the documentation.

zhangfengcdt · 2024-07-18T22:46:37Z

@jiayuasu I noticed that after switching from `geo.buffer` to `OverlayNGRobust.union`, the representation of a complex geometry returned by `ST_Union_Aggr` may change because the polygon/polyline vertices are reordered. For example:

New: POLYGON ((1 0, 0 0, 0 1, 1 1, 2 1, 2 0, 1 0))

Old: POLYGON ((0 0, 0 1, 1 1, 2 1, 2 0, 1 0, 0 0))

They represent the same polygon, but the vertex order has changed.
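
To make the equivalence concrete, the following dependency-free sketch checks that two closed rings list the same vertices in the same cyclic order and differ only in their starting vertex. The object name `RingEquivalenceSketch` and both helpers are hypothetical, the parser only handles a single-ring `POLYGON ((...))` string, and a ring traversed in the opposite direction is deliberately not treated as equal. With JTS available, `Geometry.equalsTopo` or comparing `norm()` results would be the more usual check.

```scala
// Hypothetical illustration only; not taken from the Sedona code base.
object RingEquivalenceSketch {
  type Coord = (Double, Double)

  // Parse the single ring of a simple "POLYGON ((x y, x y, ...))" string.
  def parseRing(wkt: String): Vector[Coord] = {
    val body = wkt.substring(wkt.indexOf("((") + 2, wkt.indexOf("))"))
    body.split(",").toVector.map { pair =>
      val parts = pair.trim.split("\\s+")
      (parts(0).toDouble, parts(1).toDouble)
    }
  }

  // True if both closed rings visit the same vertices in the same cyclic
  // order, regardless of which vertex each ring starts from.
  def sameRingUpToRotation(a: Vector[Coord], b: Vector[Coord]): Boolean = {
    val openA = a.dropRight(1) // drop the repeated closing vertex
    val openB = b.dropRight(1)
    openA.length == openB.length &&
      (openA.isEmpty || openA.indices.exists { shift =>
        openA.indices.forall(i => openA((i + shift) % openA.length) == openB(i))
      })
  }

  def main(args: Array[String]): Unit = {
    val newRing = parseRing("POLYGON ((1 0, 0 0, 0 1, 1 1, 2 1, 2 0, 1 0))")
    val oldRing = parseRing("POLYGON ((0 0, 0 1, 1 1, 2 1, 2 0, 1 0, 0 0))")
    println(sameRingUpToRotation(newRing, oldRing))
  }
}
```

A test that compares WKT strings literally would fail on the two outputs above, whereas a check of this kind (or a topological comparison) would not.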

jiayuasu · 2024-07-18T23:06:06Z

> @jiayuasu I noticed that after switching from `geo.buffer` to `OverlayNGRobust.union`, the representation of a complex geometry returned by `ST_Union_Aggr` may change because the polygon/polyline vertices are reordered. For example:
>
> New: POLYGON ((1 0, 0 0, 0 1, 1 1, 2 1, 2 0, 1 0))
>
> Old: POLYGON ((0 0, 0 1, 1 1, 2 1, 2 0, 1 0, 0 0))
>
> They represent the same polygon, but the vertex order has changed.

I think this is fine.

In addition, we want to make sure the behavior of ST_Union_Aggr is similar to PostGIS ST_Union (array variant): https://postgis.net/docs/ST_Union.html

jiayuasu

@zhangfengcdt Did you see a performance improvement with this implementation compared to the previous one?

zhangfengcdt · 2024-07-19T17:36:21Z

> @zhangfengcdt Did you see a performance improvement with this implementation compared to the previous one?

Yes, I am adding some tests that report performance measurements, so we can see the improvement for different cases there.

zhangfengcdt · 2024-07-22T17:30:53Z

> @zhangfengcdt Did you see a performance improvement with this implementation compared to the previous one?

@jiayuasu I used the newly added test to measure the old and new runtimes for different numbers of geometries. Here are the results:

| Number of polygons | Old ST_Union_Aggr (ms) | New ST_Union_Aggr (ms) |
| -----------------: | ---------------------: | ---------------------: |
| 100                | 297                    | 354                    |
| 500                | 750                    | 386                    |
| 1,000              | 2,231                  | 430                    |
| 5,000              | 53,870                 | 1,400                  |
| 10,000             | 243,465                | 3,474                  |

I think this clearly shows that the new method is much more efficient and scalable.
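
As a quick way to read the table above, this small sketch turns each row into a speedup factor (old runtime divided by new runtime). The numbers are copied from the table; the object name `UnionAggrSpeedupSketch` and the helper `speedup` are hypothetical.

```scala
// Hypothetical illustration only; the measurements are copied from the table above.
object UnionAggrSpeedupSketch {
  // (number of polygons, old runtime in ms, new runtime in ms)
  val measurements: Seq[(Int, Long, Long)] = Seq(
    (100, 297L, 354L),
    (500, 750L, 386L),
    (1000, 2231L, 430L),
    (5000, 53870L, 1400L),
    (10000, 243465L, 3474L)
  )

  // Speedup = old / new; a value below 1.0 means the new path is slower.
  def speedup(oldMs: Long, newMs: Long): Double = oldMs.toDouble / newMs

  def main(args: Array[String]): Unit =
    measurements.foreach { case (n, oldMs, newMs) =>
      println(f"$n%6d polygons: ${speedup(oldMs, newMs)}%.2fx")
    }
}
```

Read this way, the new implementation is slightly slower at 100 polygons (a factor of about 0.84) and pulls ahead from 500 polygons onward, reaching roughly 70x at 10,000 polygons.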

jiayuasu · 2024-07-22T18:21:05Z

spark/common/src/test/scala/org/apache/sedona/sql/aggregateFunctionTestScala.scala

+         |SELECT explode(array($polygonArrayStr)) AS geom
+     """.stripMargin
+
+    sparkSession.sql(sqlQuery).createOrReplaceTempView("geometry_table")


Can you return a reference to the DataFrame as the function's return value instead of creating a new temp view? Otherwise this might pollute the global namespace and lead to bugs that are hard to find.
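
One possible shape for that suggestion is sketched below. It is not the refactoring that was committed: the signature, the `wkt` column, and the unit-square generator are assumptions made for illustration, and the sketch assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical illustration only; not the helper from the PR.
object PolygonFixtureSketch {
  // Build numPolygons unit squares as WKT strings and hand the DataFrame
  // back to the caller instead of registering a temp view.
  def createPolygonDataFrame(spark: SparkSession, numPolygons: Int): DataFrame = {
    import spark.implicits._
    (0 until numPolygons)
      .map(i => s"POLYGON (($i 0, $i 1, ${i + 1} 1, ${i + 1} 0, $i 0))")
      .toDF("wkt")
  }
}
```

A test could then keep the result in a local `val` and, with Sedona registered on the session, aggregate it with something like `df.selectExpr("ST_Union_Aggr(ST_GeomFromWKT(wkt))")`, so that no view name is shared between test cases.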

jiayuasu · 2024-07-22T18:22:23Z

spark/common/src/test/scala/org/apache/sedona/sql/aggregateFunctionTestScala.scala

+      createPolygonDataFrame(numPolygons)
+
+      // cache the table to eliminate the time of table scan
+      sparkSession.sql("cache table geometry_table")


Can you also unpersist this table at the end of the test case? Otherwise this will lead to a memory leak.
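
A minimal sketch of one way to guarantee the cleanup, assuming the test works with a DataFrame reference rather than a named table. The object name `CachedFixtureSketch` and the helper `withCached` are hypothetical and not part of the PR.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical illustration only; not the code from the PR.
object CachedFixtureSketch {
  // Cache df, run body against it, and always release the cache afterwards,
  // even if body throws.
  def withCached[T](df: DataFrame)(body: DataFrame => T): T = {
    val cached = df.cache()
    try {
      cached.count() // materialize the cache so later timings exclude the scan
      body(cached)
    } finally {
      cached.unpersist()
    }
  }
}
```

If the test keeps the SQL form `cache table geometry_table` instead, the matching cleanup would be `sparkSession.catalog.uncacheTable("geometry_table")` (or `UNCACHE TABLE geometry_table`) in the same kind of `finally` block.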

[SEDONA-630] Improve ST_Union_Aggr performance

6aec540

Switch to JTS `OverlayNGRobust.union` function to perform geometry union and add geometry cache capability.

github-actions bot added the sedona-spark label Jul 18, 2024

zhangfengcdt marked this pull request as ready for review July 18, 2024 20:56

zhangfengcdt requested a review from jiayuasu as a code owner July 18, 2024 20:56

fix python test

a217d4f

github-actions bot added the sedona-python label Jul 19, 2024

jiayuasu reviewed Jul 19, 2024

jiayuasu added this to the sedona-1.6.1 milestone Jul 20, 2024

jiayuasu added the improvement label Jul 20, 2024

add unit test to measure the ST_Union_aggr time

b380387

jiayuasu requested changes Jul 22, 2024

zhangfengcdt added 2 commits July 22, 2024 11:28

address review comments by refactoring unit tests

f8a2245

rename test table

db116d6

jiayuasu approved these changes Jul 22, 2024

jiayuasu merged commit bab1f77 into apache:master Jul 22, 2024
50 checks passed
