
[SEDONA-655] DBSCAN #1589

Merged
merged 12 commits into from
Oct 8, 2024
Conversation

@james-willis (Contributor) commented Sep 17, 2024

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

This PR adds a DBSCAN function to the Scala and Python APIs of the Spark implementation of Sedona.
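For readers, a hedged sketch of how the new Scala API might be called. The import path and parameter names here are assumptions inferred from the code snippets later in this thread, not an authoritative signature:

```scala
import org.apache.spark.sql.SparkSession
// Assumed location of the new function; inferred from the PR title, not confirmed in this thread
import org.apache.sedona.stats.clustering.DBSCAN.dbscan

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A DataFrame with a geometry column, e.g. built with ST_GeomFromWKT
val points = spark.sql(
  "SELECT id, ST_GeomFromWKT(wkt) AS geometry FROM raw_points")

// epsilon = neighborhood radius; minPts = neighbors required for a core point
val clustered = dbscan(points, 0.5, 4)
clustered.show()
```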

How was this patch tested?

unit tests

Did this PR include necessary documentation updates?

@jiayuasu (Member)

@james-willis The memory consumption of the new tests seems to be very high?

@james-willis (Contributor Author)

> @james-willis The memory consumption of the new tests seems to be very high?

This is something I had worked on with Kristin in the past. We had thought it was related to cached DataFrames holding references to broadcast relations, but this committed version of the code does not contain any caching, just checkpoints.

There is some persisting inside of the connected components implementation, but those DataFrames should be getting cleaned up. I'll check whether something here is leaking persisted DataFrames.

@james-willis (Contributor Author)

There are some DataFrames inside of the connected components algorithm that don't get unpersisted. I will look into raising a PR in that repo tomorrow to fix that.

@jiayuasu jiayuasu linked an issue Sep 20, 2024 that may be closed by this pull request
@james-willis (Contributor Author)

I submitted a fix to GraphFrames for this. I can clear the Spark catalog if the disable-broadcast fix doesn't resolve the issue.

graphframes/graphframes#459
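For reference, "clearing the Spark catalog" here would amount to something like the following. This is a sketch of standard Spark API calls, not code from this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.getActiveSession.get

// Unpersist every cached table/DataFrame tracked by the session's catalog
spark.catalog.clearCache()

// Or, to eagerly release one specific leaked DataFrame:
// leakedDf.unpersist(blocking = true)
```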

@jiayuasu (Member)

what are the remaining tasks for this PR?

@james-willis (Contributor Author)

just documentation

@zwu-net (Contributor) commented Sep 24, 2024

> just documentation

After you guys (@james-willis @jiayuasu) finish this, I'll write an article on this.

useSpheroid: Boolean = false): DataFrame = {

// We want to disable broadcast joins because the broadcast references were using too much driver memory
val spark = SparkSession.getActiveSession.get
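Disabling broadcast joins at the session level is typically done through Spark's standard `spark.sql.autoBroadcastJoinThreshold` conf. A minimal sketch of that pattern follows; the save-and-restore wrapper is an assumption for illustration, not necessarily what this PR does:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.getActiveSession.get

// Remember the current threshold so it can be restored afterwards
val previous = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

// -1 disables automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
try {
  // ... run the join-heavy DBSCAN stages here ...
} finally {
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", previous)
}
```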
Member

Is it possible to have an empty session here?

Contributor Author

Only if the session crashed in some other test. The parent class initializes the Spark session.

.withColumnRenamed("id", ID_COLUMN)
.withColumn("id", sha2(to_json(struct("*")), 256))
} else {
dataframe.withColumn("id", sha2(to_json(struct("*")), 256))
Member

Should duplicated records be considered equal and aggregated, or not? I am also wondering whether the monotonically_increasing_id function would be enough here. sha2(to_json(struct("*"))) might be costly, and as a join key it might not be very efficient compared to bigints.

Contributor Author

monotonically_increasing_id is not deterministic, so it can give incorrect results in joins unless you checkpoint immediately after generating the ids. This blog post does a decent job of describing the issue: https://xebia.com/blog/spark-surprises-for-the-uninitiated/ I've been bitten by this issue before, especially when an executor crashes.

> Should duplicated records be considered equal and aggregated, or not?

This implementation does not deal with duplicated records well. There is only a comment in the docstring saying not to provide duplicates. monotonically_increasing_id + checkpoint would handle duplicates.
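A hedged sketch of the alternative discussed above, with an immediate checkpoint so the non-deterministic ids are materialized before any join (the helper name is hypothetical, and a checkpoint directory is assumed to be set):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

// Requires spark.sparkContext.setCheckpointDir(...) to have been called
def withStableIds(df: DataFrame): DataFrame =
  df.withColumn("id", monotonically_increasing_id())
    // Materialize the plan so the ids cannot be recomputed differently
    // on retry; this also makes later joins on "id" safe
    .checkpoint()
```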

@jiayuasu jiayuasu marked this pull request as draft September 27, 2024 14:29
@github-actions github-actions bot added the docs label Oct 1, 2024
@james-willis james-willis marked this pull request as ready for review October 1, 2024 01:27
docs/api/stats/sql.md (outdated review threads, resolved)
james-willis and others added 2 commits October 1, 2024 11:08
Co-authored-by: Kelly-Ann Dolor <kellyanndolor@gmail.com>
Co-authored-by: Kelly-Ann Dolor <kellyanndolor@gmail.com>
Member

This file is not included in mkdocs.yml, hence it will not show up in the website navigation bar.

Contributor Author

On it. Also fixing the test failures.

@jiayuasu jiayuasu added this to the sedona-1.7.0 milestone Oct 8, 2024
@jiayuasu jiayuasu merged commit f83915e into apache:master Oct 8, 2024
50 checks passed

Successfully merging this pull request may close these issues.

ST_ClusterDBSCAN Feature Request
5 participants