[BUG] cudf_udf failed in all spark release intermittently #2521

Closed
pxLi opened this issue May 27, 2021 · 12 comments · Fixed by #2539
Labels
bug Something isn't working P0 Must have for release

Comments

@pxLi
Collaborator

pxLi commented May 27, 2021

Describe the bug
rapids_databricks_nightly-dev, ID 19, 20
rapids_integration-dev spark-301 302, ID 186, 187
rapids_it-3.1.x-SNAPSHOT-dev spark-312-SNAPSHOT, ID 149, 150

This is not 100% reproducible and can fail in any of the environments. The cudf_udf integration tests failed:

[2021-05-27T08:17:32.021Z] Traceback (most recent call last):
[2021-05-27T08:17:32.021Z]   File "/home/ubuntu/spark-rapids/dist/target/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/daemon_databricks.py", line 132, in manager
[2021-05-27T08:17:32.021Z]   File "/home/ubuntu/spark-rapids/dist/target/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/worker.py", line 37, in initialize_gpu_mem
[2021-05-27T08:17:32.021Z]     from cudf import rmm
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/__init__.py", line 11, in <module>
[2021-05-27T08:17:32.021Z]     from cudf import core, datasets, testing
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/__init__.py", line 3, in <module>
[2021-05-27T08:17:32.021Z]     from cudf.core import _internals, buffer, column, column_accessor, common
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/_internals/__init__.py", line 3, in <module>
[2021-05-27T08:17:32.021Z]     from cudf.core._internals.where import where
[2021-05-27T08:17:32.021Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/_internals/where.py", line 11, in <module>
[2021-05-27T08:17:32.022Z]     from cudf.core.column import ColumnBase
[2021-05-27T08:17:32.022Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/column/__init__.py", line 17, in <module>
[2021-05-27T08:17:32.022Z]     from cudf.core.column.datetime import DatetimeColumn  # noqa: F401
[2021-05-27T08:17:32.022Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/column/datetime.py", line 20, in <module>
[2021-05-27T08:17:32.022Z]     from cudf.core.column import (
[2021-05-27T08:17:32.022Z]   File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/core/column/string.py", line 78, in <module>
[2021-05-27T08:17:32.022Z]     from cudf._lib.strings.combine import (
[2021-05-27T08:17:32.022Z] ImportError: /databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/cudf/_lib/strings/combine.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN4cudf7strings18join_list_elementsERKNS_17lists_column_viewERKNS_19strings_column_viewERKNS_13string_scalarES9_NS0_18separator_on_nullsEPN3rmm2mr22device_memory_resourceE
[2021-05-27T08:18:59.417Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o2591.collectToPython.
[2021-05-27T08:18:59.417Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 35.0 failed 1 times, most recent failure: Lost task 5.0 in stage 35.0 (TID 220, 10.2.128.4, executor driver): java.net.SocketException: Connection reset
[2021-05-27T08:18:59.417Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:210)
[2021-05-27T08:18:59.417Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:141)
[2021-05-27T08:18:59.417Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:224)
[2021-05-27T08:18:59.417Z] E                   	at java.io.DataInputStream.readInt(DataInputStream.java:387)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:210)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:233)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:225)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:119)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:192)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:183)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:138)
[2021-05-27T08:18:59.417Z] E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:101)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:117)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
[2021-05-27T08:18:59.418Z] E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-05-27T08:18:59.418Z] E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-05-27T08:18:59.418Z] E                   	at java.lang.Thread.run(Thread.java:748)
[2021-05-27T08:18:59.418Z] E
[2021-05-27T08:18:59.418Z] E                   Driver stacktrace:
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
[2021-05-27T08:18:59.418Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2021-05-27T08:18:59.418Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2021-05-27T08:18:59.418Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
[2021-05-27T08:18:59.418Z] E                   	at scala.Option.foreach(Option.scala:407)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2339)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2434)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:273)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:308)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:480)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:401)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3497)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3709)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
[2021-05-27T08:18:59.418Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3707)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3495)
[2021-05-27T08:18:59.419Z] E                   	at sun.reflect.GeneratedMethodAccessor219.invoke(Unknown Source)
[2021-05-27T08:18:59.419Z] E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2021-05-27T08:18:59.419Z] E                   	at java.lang.reflect.Method.invoke(Method.java:498)
[2021-05-27T08:18:59.419Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2021-05-27T08:18:59.419Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
[2021-05-27T08:18:59.419Z] E                   	at py4j.Gateway.invoke(Gateway.java:295)
[2021-05-27T08:18:59.419Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2021-05-27T08:18:59.419Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2021-05-27T08:18:59.419Z] E                   	at py4j.GatewayConnection.run(GatewayConnection.java:251)
[2021-05-27T08:18:59.419Z] E                   	at java.lang.Thread.run(Thread.java:748)
[2021-05-27T08:18:59.419Z] E                   Caused by: java.net.SocketException: Connection reset
[2021-05-27T08:18:59.419Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:210)
[2021-05-27T08:18:59.419Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:141)
[2021-05-27T08:18:59.419Z] E                   	at java.net.SocketInputStream.read(SocketInputStream.java:224)
[2021-05-27T08:18:59.419Z] E                   	at java.io.DataInputStream.readInt(DataInputStream.java:387)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:210)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:233)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:225)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:119)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:192)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:183)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:138)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:101)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:117)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
[2021-05-27T08:18:59.419Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
[2021-05-27T08:18:59.419Z] E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-05-27T08:18:59.419Z] E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-05-27T08:18:59.419Z] E                   	... 1 more
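
The unresolved symbol in the ImportError above points at an ABI mismatch between the cudf Python extension and the installed libcudf. As a diagnostic, the mangled name can be demangled to see which libcudf C++ API the extension expects; a minimal sketch, assuming binutils' c++filt is on the PATH:

# Diagnostic sketch only: demangle the unresolved symbol from the ImportError
# above into a readable C++ signature (requires binutils' c++filt on the PATH).
import subprocess

symbol = (
    "_ZN4cudf7strings18join_list_elementsERKNS_17lists_column_viewERKNS_"
    "19strings_column_viewERKNS_13string_scalarES9_NS0_18separator_on_nulls"
    "EPN3rmm2mr22device_memory_resourceE"
)
print(subprocess.run(["c++filt", symbol], capture_output=True, text=True).stdout)
# Prints roughly: cudf::strings::join_list_elements(cudf::lists_column_view const&, ...)
# i.e. the extension was built against a libcudf that exports this API, while
# the libcudf actually installed predates it.
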
@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 27, 2021
@sameerz sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels May 27, 2021
@sameerz sameerz added this to the May 24 - Jun 4 milestone May 27, 2021
@pxLi
Collaborator Author

pxLi commented May 28, 2021

Passed in today's run. We will keep monitoring this.

@pxLi pxLi changed the title [BUG] cudf_udf failed in databricks env [BUG] cudf_udf failed in spark 3.1.X-SNAPSHOT May 28, 2021
@pxLi
Collaborator Author

pxLi commented May 28, 2021

The error has started failing other tests:

rapids_integration-dev spark-301 302, ID 186
rapids_it-3.1.x-SNAPSHOT-dev spark-312-SNAPSHOT, ID 149 150

Vanilla Spark standalone executor logs:

21/05/28 09:08:54 INFO Executor: Finished task 10.0 in stage 35.0 (TID 437). 6424 bytes result sent to driver
21/05/28 09:08:54 INFO Executor: Finished task 11.0 in stage 35.0 (TID 438). 6424 bytes result sent to driver
21/05/28 09:08:54 INFO CodeGenerator: Code generated in 6.260165 ms
INFO: Process 4023 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/rapids_integration-dev-github-302/jars/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/daemon.py", line 131, in manager
  File "/home/jenkins/agent/workspace/rapids_integration-dev-github-302/jars/rapids-4-spark_2.12-21.06.0-SNAPSHOT.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
  File "/opt/conda/lib/python3.8/site-packages/cudf/__init__.py", line 11, in <module>
    from cudf import core, datasets, testing
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/__init__.py", line 3, in <module>
    from cudf.core import _internals, buffer, column, column_accessor, common
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/__init__.py", line 3, in <module>
    from cudf.core._internals.where import where
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/where.py", line 11, in <module>
    from cudf.core.column import ColumnBase
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/__init__.py", line 3, in <module>
    from cudf.core.column.categorical import CategoricalColumn
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/categorical.py", line 25, in <module>
    from cudf import _lib as libcudf
  File "/opt/conda/lib/python3.8/site-packages/cudf/_lib/__init__.py", line 4, in <module>
    from . import (
ImportError: /opt/conda/lib/python3.8/site-packages/cudf/_lib/groupby.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN4cudf7groupby7groupby5shiftERKNS_10table_viewENS_9host_spanIKiLm18446744073709551615EEERKSt6vectorISt17reference_wrapperIKNS_6scalarEESaISC_EEPN3rmm2mr22device_memory_resourceE
21/05/28 09:08:55 ERROR Executor: Exception in task 5.0 in stage 35.0 (TID 427)
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:120)
	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:136)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:135)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:131)
	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:138)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@pxLi pxLi changed the title [BUG] cudf_udf failed in spark 3.1.X-SNAPSHOT [BUG] cudf_udf failed in all spark release intermittently May 28, 2021
@pxLi
Collaborator Author

pxLi commented May 28, 2021

Still failing randomly in different environments.
The cudf-py versions where we saw the error:
cudf-21.06.00a210527 cuda_11.0_py38_g773fc7aa93_394
cudf-21.06.00a210528 cuda_11.0_py38_g0eeb0c9239_404

pxLi added a commit to pxLi/spark-rapids that referenced this issue May 31, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
pxLi added a commit that referenced this issue May 31, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
@pxLi pxLi reopened this May 31, 2021
@pxLi
Collaborator Author

pxLi commented May 31, 2021

Reopening; this was accidentally closed by #2539.

@firestarman
Collaborator

firestarman commented May 31, 2021

It is an environment issue; the cudf Python package seems to be broken.

Verified locally by importing the cudf Python library in a Python shell, which hits the same error:

firestarman@firestarman-ubuntu18:~/work/projects/on_github/spark-rapids$ docker run --runtime=nvidia -it --name debug-cudf-test -v ~/.m2:/root/.m2 -v /usr/local/spark:/usr/local/spark ${docker-repo}/plugin:it-ubuntu18.04-cuda11.0-blossom-dev 
root@27d721fdaf1b:/# conda list cudf
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
cudf                      21.06.00a210530 cuda_11.0_py38_g0eeb0c9239_404    rapidsai-nightly
libcudf                   21.06.00a210525 cuda11.0_g6dbf2d58d1_379    rapidsai-nightly
root@27d721fdaf1b:/# python
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/cudf/__init__.py", line 11, in <module>
    from cudf import core, datasets, testing
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/__init__.py", line 3, in <module>
    from cudf.core import _internals, buffer, column, column_accessor, common
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/__init__.py", line 3, in <module>
    from cudf.core._internals.where import where
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/_internals/where.py", line 11, in <module>
    from cudf.core.column import ColumnBase
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/__init__.py", line 3, in <module>
    from cudf.core.column.categorical import CategoricalColumn
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/categorical.py", line 25, in <module>
    from cudf import _lib as libcudf
  File "/opt/conda/lib/python3.8/site-packages/cudf/_lib/__init__.py", line 4, in <module>
    from . import (
ImportError: /opt/conda/lib/python3.8/site-packages/cudf/_lib/groupby.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN4cudf7groupby7groupby5shiftERKNS_10table_viewENS_9host_spanIKiLm18446744073709551615EEERKSt6vectorISt17reference_wrapperIKNS_6scalarEESaISC_EEPN3rmm2mr22device_memory_resourceE
>>> 
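
For anyone wanting to gate a test run on the same check, the manual verification above is easy to script; a minimal sketch (the helper is hypothetical, not part of the plugin's test harness):

# Preflight sketch: exit non-zero if `import cudf` fails in the current env,
# printing the ImportError traceback (hypothetical helper, not plugin code).
import importlib
import sys
import traceback

def cudf_import_ok() -> bool:
    try:
        importlib.import_module("cudf")
        return True
    except ImportError:
        traceback.print_exc()
        return False

if __name__ == "__main__":
    sys.exit(0 if cudf_import_ok() else 1)
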

@pxLi
Collaborator Author

pxLi commented May 31, 2021

Confirmed that installing cudf via conda can pull in a libcudf with a mismatched version.

command: conda install -y -c rapidsai -c rapidsai-nightly -c nvidia -c conda-forge -c defaults cudf=21.06 python=3.8 cudatoolkit=${CUDA_VER}

(screenshot: conda list output showing cudf and libcudf from different nightly builds)
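
The mismatch check itself can be automated; a hedged sketch, assuming conda is on the PATH and parsing its JSON package listing:

# Compare the conda-installed cudf and libcudf versions and flag a mismatch
# (sketch; assumes `conda` is on the PATH).
import json
import subprocess

pkgs = json.loads(subprocess.run(
    ["conda", "list", "--json"], capture_output=True, text=True, check=True
).stdout)
versions = {p["name"]: p["version"] for p in pkgs if p["name"] in ("cudf", "libcudf")}
print(versions)  # e.g. {'cudf': '21.06.00a210530', 'libcudf': '21.06.00a210525'}
if len(set(versions.values())) > 1:
    raise SystemExit(f"cudf/libcudf version mismatch: {versions}")
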

@pxLi
Collaborator Author

pxLi commented May 31, 2021

Filed rapidsai/cudf#8404 to track the conda install version mismatch issue.

@GaryShen2008
Collaborator

Since it's not a code issue in spark-rapids, moving it to the 21.08 target.

@firestarman
Collaborator

So, what's the next action here?
Do we need to wait for rapidsai/cudf#8404 to be fixed?

@GaryShen2008
Collaborator

Yes, I think so. Let's wait for the fix.

@sameerz sameerz removed this from the May 24 - Jun 4 milestone Jun 5, 2021
@sameerz sameerz added this to the June 7 - June 18 milestone Jun 5, 2021
@pxLi
Collaborator Author

pxLi commented Jun 7, 2021

Looks like the version mismatch in the 21.06 nightly has not recurred in the last 5 days.

Going to re-enable the cudf_udf tests.

pxLi added a commit to pxLi/spark-rapids that referenced this issue Jun 7, 2021
This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>
pxLi added a commit that referenced this issue Jun 7, 2021
* Revert "disable cudf_udf tests for #2521"

This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>

* add minAllocFraction for nightly cudf_udf test
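
For context on the minAllocFraction tweak in the commit above: the cudf_udf tests start Python workers that allocate GPU memory through cudf alongside the JVM plugin, so the JVM-side RMM pool has to be bounded. A hedged sketch of such a session, using config names from the spark-rapids docs; the values here are illustrative assumptions, not the ones the nightly script uses:

# Illustrative cudf_udf session config (values are assumptions; the RAPIDS jar
# must be on the driver/executor classpath for spark.plugins to resolve).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cudf-udf-smoke")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.python.gpu.enabled", "true")
    # Bound the JVM-side pool so the Python workers can also grab GPU memory.
    .config("spark.rapids.memory.gpu.allocFraction", "0.3")
    .config("spark.rapids.memory.gpu.minAllocFraction", "0")
    .config("spark.rapids.python.memory.gpu.allocFraction", "0.3")
    .getOrCreate()
)
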
@pxLi
Collaborator Author

pxLi commented Jun 7, 2021

Verified the integration tests with the new cudf-py on multiple Databricks and standalone environments; everything worked as expected.

Closing the issue for now; will reopen if it happens again.

@pxLi pxLi closed this as completed Jun 7, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
* Revert "disable cudf_udf tests for NVIDIA#2521"

This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>

* add minAllocFraction for nightly cudf_udf test
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
Signed-off-by: Peixin Li <pxli@nyu.edu>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
* Revert "disable cudf_udf tests for NVIDIA#2521"

This reverts commit 19bb201.

Signed-off-by: Peixin Li <pxli@nyu.edu>

* add minAllocFraction for nightly cudf_udf test