
Integer overflows for kNN search? #101

Open
JulienPeloton opened this issue Sep 21, 2018 · 0 comments
Assignees
Labels
3D methods bug Something isn't working pyspark3d

Comments

@JulienPeloton
Member

pyspark3d issue.

kNN search on data sets larger than 2G elements seems to go crazy :D
I was running kNN with k=1000 on a data set of 5,000,000,000 elements.

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.astrolabsoftware.spark3d.spatialOperator.SpatialQuery.KNN.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 556 in stage 0.0 failed 4
times, most recent failure: Lost task 556.3 in stage 0.0 (TID 967, 134.158.75.162, executor 3): 
java.lang.IllegalArgumentException: Comparison method violates its general contract!
	at java.util.TimSort.mergeHi(TimSort.java:899)
	at java.util.TimSort.mergeAt(TimSort.java:516)
	at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
	at java.util.TimSort.sort(TimSort.java:254)
	at java.util.Arrays.sort(Arrays.java:1512)
	at com.google.common.collect.Ordering.leastOf(Ordering.java:708)
	at com.astrolabsoftware.spark3d.utils.Utils$.com$astrolabsoftware$spark3d$utils$Utils$$takeOrdered(Utils.scala:174)
	at com.astrolabsoftware.spark3d.utils.Utils$$anonfun$1.apply(Utils.scala:154)
	at com.astrolabsoftware.spark3d.utils.Utils$$anonfun$1.apply(Utils.scala:152)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Note that the same problem appears regardless of whether we request distinct objects or not.
My best guess is that we need to trade Int for Long.
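The exception above ("Comparison method violates its general contract!") is the classic symptom of an overflowing subtraction-based comparator: once the compared values span more than the Int range, a - b wraps around, the ordering becomes intransitive, and TimSort detects it. A minimal Java sketch of the failure mode (illustrative only, not the actual comparator in Utils.scala):

```java
import java.util.Comparator;

public class OverflowDemo {
    public static void main(String[] args) {
        // Subtraction-based comparator: fine for small values, but a - b
        // wraps around once the gap between operands exceeds Integer.MAX_VALUE.
        Comparator<Integer> broken = (a, b) -> a - b;

        int big = 2_000_000_000;
        int small = -2_000_000_000;

        // big - small = 4_000_000_000, which overflows int to -294_967_296,
        // so the broken comparator wrongly claims big < small.
        System.out.println(broken.compare(big, small));   // negative: wrong
        System.out.println(Integer.compare(big, small));  // positive: correct

        // Fed enough such inconsistent comparisons, java.util.TimSort throws
        // "Comparison method violates its general contract!".
    }
}
```

If anything on the path through takeOrdered compares quantities that can exceed the Int range this way, switching to Long (or to the overflow-safe Integer.compare / Long.compare) would remove the wrap-around.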

ADDED:
Interestingly though, this does not happen in the pure Scala version.
The only difference that comes to mind is the default storage level of the RDD (None in Scala, MEMORY_ONLY in Python).

@JulienPeloton JulienPeloton added bug Something isn't working 3D methods pyspark3d labels Sep 21, 2018
@JulienPeloton JulienPeloton self-assigned this Sep 21, 2018