Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-625] Add ST_GeneratePoints #1520

Merged
merged 14 commits into from
Jul 17, 2024

Conversation

furqaankhan
Copy link
Contributor

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

  • Add ST_GeneratePoints

How was this patch tested?

  • Add new unit tests

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

@jiayuasu
Copy link
Member

@Kontinuation The result of this function is non-deterministic. Will this create a problem to the catalyst optimizer?

@jiayuasu jiayuasu added this to the sedona-1.6.1 milestone Jul 11, 2024
@Kontinuation
Copy link
Member

Kontinuation commented Jul 11, 2024

@Kontinuation The result of this function is non-deterministic. Will this create a problem to the catalyst optimizer?

The correct way to implement non-deterministic functions is to extend the Nondeterministic trait and implement the initializeInternal method to calculate the seed for each partition. The function evaluates using the initialized seed, so the results are identical when partitions are recomputed. Please refer to randomExpressions.scala#L93 to understand how this works.

Unfortunately, RandomPointsBuilder does not allow constructing the builder with a given seed. We have to inherit this class and override the createRandomCoord function to generate coordinates using the specified seed.

@jiayuasu
Copy link
Member

@furqaankhan can you solve the issue using Kristin's method?

@furqaankhan
Copy link
Contributor Author

Yes, I am working on it.

@jiayuasu
Copy link
Member

This PR should also add SRID to the generated points following the style of this PR: #1521

Comment on lines 1461 to 1468
override protected def initializeInternal(partitionIndex: Int): Unit = random = new Random(
randomSeed.getOrElse(0L) + partitionIndex)

override protected def evalInternal(input: InternalRow): Any = {
val geom = children.head.toGeometry(input)
val numPoints = children(1).eval(input).asInstanceOf[Int]
if (nArgs == 3) {
val seed = children(2).eval(input).asInstanceOf[Int]
return GeometrySerializer.serialize(Functions.generatePoints(geom, numPoints, seed))
}
GeometrySerializer.serialize(Functions.generatePoints(geom, numPoints))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should construct an instance of RandomPointsBuilderSeed in the initializeInternal function, and call setExtent, setNumPoints, getGeometry methods using that instance in evalInternal.

Comment on lines 1442 to 1449
case class ST_GeneratePoints(inputExpressions: Seq[Expression], randomSeed: Option[Long] = None)
extends Expression
with Nondeterministic
with ExpectsInputTypes
with CodegenFallback
with ExpressionWithRandomSeed {

def this(inputExpressions: Seq[Expression]) = this(inputExpressions, Some(0L))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExpressionWithRandomSeed has binary incompatible changes since Spark 3.0, so we cannot use it to write binary compatible code for Spark 3.0 ~ 3.5. Maintaining compatibility with old Spark versions (<= 3.2) is really a burden for us.

We can initialize ST_GeneratePoints in the old Spark 2.3 way:

case class ST_GeneratePoints(inputExpressions: Seq[Expression], randomSeed: Long) extends RDG {

  def this(inputExpressions: Seq[Expression]) = this(inputExpressions, Utils.random.nextLong())

The consequence is that the random seed will be fixed for multiple minibatches when running streaming jobs. However, that's the best we can do to maintain compatibility with various Spark versions.

Copy link
Contributor Author

@furqaankhan furqaankhan Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using RDG doesn't work, as it extends UnaryExpression. Hence, I used all the traits that RDG was using but replaced the UnaryExpression with Expression. Please let me know if you think otherwise

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I made a mistake when writing the example code. It should be

case class ST_GeneratePoints(inputExpressions: Seq[Expression], randomSeed: Long) extends Expression
    with Nondeterministic
    with ExpectsInputTypes
    with CodegenFallback  {

  def this(inputExpressions: Seq[Expression]) = this(inputExpressions, Utils.random.nextLong())
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to push this changes by removing the ExpressionWithRandomSeed trait and it's overridden members, to test it.

@furqaankhan
Copy link
Contributor Author

This PR should also add SRID to the generated points following the style of this PR: #1521

I am using the factory of the input geometry. that will preserve the SRID. Please let me know if you think otherwise.

https://github.com/furqaankhan/sedona/blob/919136e2d67e730bd864f87c466a7f8be3b84407/common/src/main/java/org/apache/sedona/common/Functions.java#L1808

@jiayuasu
Copy link
Member

@furqaankhan Can you create a PR on snowflake-tester?

@furqaankhan
Copy link
Contributor Author

I did create one, I closed it as it had passed. I reopened it as I added a new function signature. The PR.

return generatePoints(geom, numPoints, 0);
}

public static Geometry generatePoints(Geometry geom, int numPoints, Random random) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there another overloaded function that takes Random as input?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for using the random member of ST_GeneratePoints to generate consistent random streams

if (seed > 0) {
Functions.generatePoints(geom, numPoints, seed)
} else {
Functions.generatePoints(geom, numPoints, random)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? This if condition seems to be repeating the same condition in Functions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for generating consistent random streams even if seed is 0.

@jiayuasu jiayuasu merged commit 1f24ca6 into apache:master Jul 17, 2024
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants