[BUG] cudf_udf failed in all spark release intermittently #2521

Closed
pxLi opened this issue May 27, 2021 · 12 comments · Fixed by #2539
Labels
bug Something isn't working P0 Must have for release

Comments

@pxLi
Collaborator

pxLi commented May 27, 2021

Describe the bug
rapids_databricks_nightly-dev, ID 19, 20
rapids_integration-dev spark-301 302, ID 186, 187
rapids_it-3.1.x-SNAPSHOT-dev spark-312-SNAPSHOT, ID 149, 150

This is not 100% reproducible and can fail in any of the environments. The cudf_udf integration tests failed:

[2021-05-27T08:17:32.021Z] Traceback (most recent call last):
[2021-05-27T08:17:32.021Z]   File "/home/ubuntu/spark-rapids/dist/target/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/daemon_databricks.py", line 132, in manager
[2021-05-27T08:17:32.021Z]   File "/home/ubuntu/spark-rapids/dist/target/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/worker.py", line 37, in initialize_gpu_mem
[2021-05-27T08:17:32.021Z]     from cudf import rmm
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/__init__.py", line 11, in <module>
[2021-05-27T08:17:32.021Z]     from cudf import core, datasets, testing
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/__init__.py", line 3, in <module>
[2021-05-27T08:17:32.021Z]     from cudf.core import _internals, buffer, column, column_accessor, common
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/_internals/__init__.py", line 3, in <module>
[2021-05-27T08:17:32.021Z]     from cudf.core._internals.where import where
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/_internals/where.py", line 11, in <module>
[2021-05-27T08:17:32.022Z]     from cudf.core.column import ColumnBase
[2021-05-27T08:17:32.022Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/column/__init__.py", line 17, in <module>
[2021-05-27T08:17:32.022Z]     from cudf.core.column.datetime import DatetimeColumn  # noqa: F401
[2021-05-27T08:17:32.022Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/column/datetime.py", line 20, in <module>
[2021-05-27T08:17:32.022Z]     from cudf.core.column import (
[2021-05-27T08:17:32.022Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/column/string.py", line 78, in <module>
[2021-05-27T08:17:32.022Z]     from cudf._lib.strings.combine import (
[2021-05-27T08:17:32.022Z] ImportError: /databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/_lib/strings/combine.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN4cudf7strings18join_list_elementsERKNS_17lists_column_viewERKNS_19strings_column_viewERKNS_13string_scalarES9_NS0_18separator_on_nullsEPN3rmm2mr22device_memory_resourceE
[2021-05-27T08:18:59.417Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o2591.collectToPython.
[2021-05-27T08:18:59.417Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 35.0 failed 1 times, most recent failure: Lost task 5.0 in stage 35.0 (TID 220, 10.2.128.4, executor driver): java.net.SocketException: Connection reset
[2021-05-27T08:18:59.417Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:210)
[2021-05-27T08:18:59.417Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:141)
[2021-05-27T08:18:59.417Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:224)
[2021-05-27T08:18:59.417Z] E                   	at java.io.DataInputStream.readInt(DataInputStream.java:387)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:210)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:233)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:225)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:119)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:192)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:183)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:138)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:101)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:117)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
[2021-05-27T08:18:59.418Z] E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-05-27T08:18:59.418Z] E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-05-27T08:18:59.418Z] E                   	at java.lang.Thread.run(Thread.java:748)
[2021-05-27T08:18:59.418Z] E
[2021-05-27T08:18:59.418Z] E                   Driver stacktrace:
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
[2021-05-27T08:18:59.418Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2021-05-27T08:18:59.418Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2021-05-27T08:18:59.418Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
[2021-05-27T08:18:59.418Z] E                   	at scala.Option.foreach(Option.scala:407)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2339)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2434)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:273)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:308)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:480)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:401)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3497)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3709)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3707)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3495)
[2021-05-27T08:18:59.419Z] E                   	at sun.reflect.GeneratedMethodAccessor219.invoke(Unknown Source)
[2021-05-27T08:18:59.419Z] E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2021-05-27T08:18:59.419Z] E                   	at java.lang.reflect.Method.invoke(Method.java:498)
[2021-05-27T08:18:59.419Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2021-05-27T08:18:59.419Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
[2021-05-27T08:18:59.419Z] E                   	at py4j.Gateway.invoke(Gateway.java:295)
[2021-05-27T08:18:59.419Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2021-05-27T08:18:59.419Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2021-05-27T08:18:59.419Z] E                   	at py4j.GatewayConnection.run(GatewayConnection.java:251)
[2021-05-27T08:18:59.419Z] E                   	at java.lang.Thread.run(Thread.java:748)
[2021-05-27T08:18:59.419Z] E                   Caused by: java.net.SocketException: Connection reset
[2021-05-27T08:18:59.419Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:210)
[2021-05-27T08:18:59.419Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:141)
[2021-05-27T08:18:59.419Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:224)
[2021-05-27T08:18:59.419Z] E                   	at java.io.DataInputStream.readInt(DataInputStream.java:387)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:210)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:233)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:225)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:119)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:192)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:183)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:138)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:101)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:117)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
[2021-05-27T08:18:59.419Z] E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-05-27T08:18:59.419Z] E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-05-27T08:18:59.419Z] E                   	... 1 more
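
The unresolved symbol in the ImportError above points at an ABI mismatch between the cudf Python extension and the installed libcudf. As a diagnostic, the mangled name can be demangled to see which libcudf C++ API the extension expects; a minimal sketch, assuming binutils' c++filt is on the PATH:

# Diagnostic sketch only: demangle the unresolved symbol from the ImportError
# above into a readable C++ signature (requires binutils' c++filt on the PATH).
import subprocess

symbol = (
    "_ZN4cudf7strings18join_list_elementsERKNS_17lists_column_viewERKNS_"
    "19strings_column_viewERKNS_13string_scalarES9_NS0_18separator_on_nulls"
    "EPN3rmm2mr22device_memory_resourceE"
)
print(subprocess.run(["c++filt", symbol], capture_output=True, text=True).stdout)
# Prints roughly: cudf::strings::join_list_elements(cudf::lists_column_view const&, ...)
# i.e. the extension was built against a libcudf that exports this API, while
# the libcudf actually installed predates it.
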
@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 27, 2021
@sameerz sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels May 27, 2021
@sameerz sameerz added this to the May 24 - Jun 4 milestone May 27, 2021
@pxLi
Collaborator Author

pxLi commented May 28, 2021

Passed in today's run. We will keep monitoring this.

@pxLi pxLi changed the title [BUG] cudf_udf failed in databricks env [BUG] cudf_udf failed in spark 3.1.X-SNAPSHOT May 28, 2021
@pxLi
Collaborator Author

pxLi commented May 28, 2021

The error has started failing other tests:

rapids_integration-dev spark-301 302, ID 186
rapids_it-3.1.x-SNAPSHOT-dev spark-312-SNAPSHOT, ID 149 150

Vanilla Spark standalone executor logs:

21/05/28 09:08:54 INFO Executor: Finished task 10.0 in stage 35.0 (TID 437). 6424 bytes result sent to driver
21/05/28 09:08:54 INFO Executor: Finished task 11.0 in stage 35.0 (TID 438). 6424 bytes result sent to driver
21/05/28 09:08:54 INFO CodeGenerator: Code generated in 6.260165 ms
INFO: Process 4023 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/rapids_integration-dev-github-302/jars/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/daemon.py", line 131, in manager
  File "/home/jenkins/agent/workspace/rapids_integration-dev-github-302/jars/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
  File "/opt/conda/lib/python3.8/site-packages/cudf/__init__.py", line 11, in <module>
    from cudf import core, datasets, testing
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/__init__.py", line 3, in <module>
    from cudf.core import _internals, buffer, column, column_accessor, common
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/__init__.py", line 3, in <module>
    from cudf.core._internals.where import where
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/where.py", line 11, in <module>
    from cudf.core.column import ColumnBase
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/__init__.py", line 3, in <module>
    from cudf.core.column.categorical import CategoricalColumn
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/categorical.py", line 25, in <module>
    from cudf import _lib as libcudf
  File "/opt/conda/lib/python3.8/site-packages/cudf/_lib/__init__.py", line 4, in <module>
    from . import (
ImportError: /opt/conda/lib/python3.8/site-packages/cudf/_lib/groupby.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN4cudf7groupby7groupby5shiftERKNS_10table_viewENS_9host_spanIKiLm18446744073709551615EEERKSt6vectorISt17reference_wrapperIKNS_6scalarEESaISC_EEPN3rmm2mr22device_memory_resourceE
21/05/28 09:08:55 ERROR Executor: Exception in task 5.0 in stage 35.0 (TID 427)
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:120)
	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:136)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:135)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:131)
	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:138)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@pxLi pxLi changed the title [BUG] cudf_udf failed in spark 3.1.X-SNAPSHOT [BUG] cudf_udf failed in all spark release intermittently May 28, 2021
@pxLi
Collaborator Author

pxLi commented May 28, 2021

Still failing randomly in different environments.
The cudf-py versions where we saw the error:
cudf-21.06.00a210527 cuda_11.0_py38_g773fc7aa93_394
cudf-21.06.00a210528 cuda_11.0_py38_g0eeb0c9239_404

pxLi added a commit to pxLi/spark-rapids that referenced this issue May 31, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
pxLi added a commit that referenced this issue May 31, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
@pxLi pxLi reopened this May 31, 2021
@pxLi
Collaborator Author

pxLi commented May 31, 2021

Reopening; this was accidentally closed by #2539.

@firestarman
Collaborator

firestarman commented May 31, 2021

It is an environment issue; the cudf Python package seems to be broken.

Verified locally by importing the cudf Python library in a Python shell, which hits the same error:

firestarman@firestarman-ubuntu18:~/work/projects/on_github/spark-rapids$ docker run --runtime=nvidia -it --name debug-cudf-test -v ~/.m2:/root/.m2 -v /usr/local/spark:/usr/local/spark ${docker-repo}/plugin:it-ubuntu18.04-cuda11.0-blossom-dev 
root@27d721fdaf1b:/# conda list cudf
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
cudf                      21.06.00a210530 cuda_11.0_py38_g0eeb0c9239_404    rapidsai-nightly
libcudf                   21.06.00a210525 cuda11.0_g6dbf2d58d1_379    rapidsai-nightly
root@27d721fdaf1b:/# python
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/cudf/__init__.py", line 11, in <module>
    from cudf import core, datasets, testing
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/__init__.py", line 3, in <module>
    from cudf.core import _internals, buffer, column, column_accessor, common
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/__init__.py", line 3, in <module>
    from cudf.core._internals.where import where
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/where.py", line 11, in <module>
    from cudf.core.column import ColumnBase
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/__init__.py", line 3, in <module>
    from cudf.core.column.categorical import CategoricalColumn
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/categorical.py", line 25, in <module>
    from cudf import _lib as libcudf
  File "/opt/conda/lib/python3.8/site-packages/cudf/_lib/__init__.py", line 4, in <module>
    from . import (
ImportError: /opt/conda/lib/python3.8/site-packages/cudf/_lib/groupby.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN4cudf7groupby7groupby5shiftERKNS_10table_viewENS_9host_spanIKiLm18446744073709551615EEERKSt6vectorISt17reference_wrapperIKNS_6scalarEESaISC_EEPN3rmm2mr22device_memory_resourceE
>>> 
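
For anyone wanting to gate a test run on the same check, the manual verification above is easy to script; a minimal sketch (the helper is hypothetical, not part of the plugin's test harness):

# Preflight sketch: exit non-zero if `import cudf` fails in the current env,
# printing the ImportError traceback (hypothetical helper, not plugin code).
import importlib
import sys
import traceback

def cudf_import_ok() -> bool:
    try:
        importlib.import_module("cudf")
        return True
    except ImportError:
        traceback.print_exc()
        return False

if __name__ == "__main__":
    sys.exit(0 if cudf_import_ok() else 1)
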

@pxLi
Collaborator Author

pxLi commented May 31, 2021

Confirmed that installing cudf via conda can pull in a libcudf with a mismatched version.

command: conda install -y -c rapidsai -c rapidsai-nightly -c nvidia -c conda-forge -c defaults cudf=21.06 python=3.8 cudatoolkit=${CUDA_VER}

(screenshot: conda list output showing cudf and libcudf from different nightly builds)
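
The mismatch check itself can be automated; a hedged sketch, assuming conda is on the PATH and parsing its JSON package listing:

# Compare the conda-installed cudf and libcudf versions and flag a mismatch
# (sketch; assumes `conda` is on the PATH).
import json
import subprocess

pkgs = json.loads(subprocess.run(
    ["conda", "list", "--json"], capture_output=True, text=True, check=True
).stdout)
versions = {p["name"]: p["version"] for p in pkgs if p["name"] in ("cudf", "libcudf")}
print(versions)  # e.g. {'cudf': '21.06.00a210530', 'libcudf': '21.06.00a210525'}
if len(set(versions.values())) > 1:
    raise SystemExit(f"cudf/libcudf version mismatch: {versions}")
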

@pxLi
Collaborator Author

pxLi commented May 31, 2021

Filed rapidsai/cudf#8404 to track the conda install version mismatch issue.

@GaryShen2008
Collaborator

Since it's not a code issue in spark-rapids, moving it to the 21.08 target.

@firestarman
Collaborator

So, what's the next action here?
Do we need to wait for rapidsai/cudf#8404 to be fixed?

@GaryShen2008
Collaborator

Yes, I think so. Let's wait for the fix.

@sameerz sameerz removed this from the May 24 - Jun 4 milestone Jun 5, 2021
@sameerz sameerz added this to the June 7 - June 18 milestone Jun 5, 2021
@pxLi
Collaborator Author

pxLi commented Jun 7, 2021

Looks like the version mismatch in the 21.06 nightly has not recurred in the last 5 days.

Going to re-enable the cudf_udf tests.

pxLi added a commit to pxLi/spark-rapids that referenced this issue Jun 7, 2021
This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>
pxLi added a commit that referenced this issue Jun 7, 2021
* Revert "disable cudf_udf tests for #2521"

This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>

* add minAllocFraction for nightly cudf_udf test
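
For context on the minAllocFraction tweak in the commit above: the cudf_udf tests start Python workers that allocate GPU memory through cudf alongside the JVM plugin, so the JVM-side RMM pool has to be bounded. A hedged sketch of such a session, using config names from the spark-rapids docs; the values here are illustrative assumptions, not the ones the nightly script uses:

# Illustrative cudf_udf session config (values are assumptions; the RAPIDS jar
# must be on the driver/executor classpath for spark.plugins to resolve).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cudf-udf-smoke")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.python.gpu.enabled", "true")
    # Bound the JVM-side pool so the Python workers can also grab GPU memory.
    .config("spark.rapids.memory.gpu.allocFraction", "0.3")
    .config("spark.rapids.memory.gpu.minAllocFraction", "0")
    .config("spark.rapids.python.memory.gpu.allocFraction", "0.3")
    .getOrCreate()
)
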
@pxLi
Collaborator Author

pxLi commented Jun 7, 2021

Verified the integration tests with the new cudf-py on multiple Databricks and standalone environments; everything worked as expected.

Closing the issue for now; will reopen if it happens again.

@pxLi pxLi closed this as completed Jun 7, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
* Revert "disable cudf_udf tests for NVIDIA#2521"

This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>

* add minAllocFraction for nightly cudf_udf test
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
* Revert "disable cudf_udf tests for NVIDIA#2521"

This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>

* add minAllocFraction for nightly cudf_udf test