
[BUG] ai.rapids.cudf.CudaException: an illegal instruction was encountered in Databricks 9.1 #4548

Closed
pxLi opened this issue Jan 18, 2022 · 2 comments
Labels
bug

Comments


pxLi commented Jan 18, 2022

Describe the bug
Integration tests failed in the Databricks 9.1 runtime.

Some results (non-deterministic across runs):
8014 failed, 4989 passed, 171 skipped, 470 xfailed, 163 xpassed, 510 warnings, 19 errors in 6579.11s (1:49:39)
10983 failed, 2039 passed, 171 skipped, 545 xfailed, 88 xpassed, 638 warnings, 31 errors in 5842.30s (1:37:22)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2828)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2775)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2769)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2769)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1305)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1305)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1305)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3036)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2977)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2965)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1067)
	at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2476)
	at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:264)
	at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:299)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
	at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:75)
	at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:62)
	at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:512)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:511)
	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:399)
	at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:374)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:406)
	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3613)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3825)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:130)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:273)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:104)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:223)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3823)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3611)
	at sun.reflect.GeneratedMethodAccessor139.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
Caused by: ai.rapids.cudf.CudaException: an illegal instruction was encountered
	at ai.rapids.cudf.Cuda.memcpyOnStream(Native Method)
	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:475)
	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:291)
	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:43)
	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:105)
	at ai.rapids.cudf.HostColumnVector.copyToDevice(HostColumnVector.java:198)
	at ai.rapids.cudf.HostColumnVector$ColumnBuilder.buildAndPutOnDevice(HostColumnVector.java:1290)
	at com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilder.buildAndPutOnDevice(GpuColumnVector.java:402)
	at com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilderBase.build(GpuColumnVector.java:277)
	at com.nvidia.spark.rapids.RowToColumnarIterator.$anonfun$buildBatch$3(GpuRowToColumnarExec.scala:656)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.RowToColumnarIterator.withResource(GpuRowToColumnarExec.scala:585)
	at com.nvidia.spark.rapids.RowToColumnarIterator.$anonfun$buildBatch$1(GpuRowToColumnarExec.scala:655)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.RowToColumnarIterator.withResource(GpuRowToColumnarExec.scala:585)
	at com.nvidia.spark.rapids.RowToColumnarIterator.buildBatch(GpuRowToColumnarExec.scala:612)
	at com.nvidia.spark.rapids.RowToColumnarIterator.next(GpuRowToColumnarExec.scala:608)
	at com.nvidia.spark.rapids.RowToColumnarIterator.next(GpuRowToColumnarExec.scala:585)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:196)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.RebaseHelper$.withResource(RebaseHelper.scala:25)
	at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:195)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:248)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:240)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:188)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:239)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:216)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:256)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:178)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:91)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1605)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
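
For context on where this throws: the Caused by section shows the failure during GpuRowToColumnarExec, i.e. while a batch built on the host is copied to the device (HostColumnVector.copyToDevice -> Cuda.memcpyOnStream). Below is a minimal PySpark sketch of the kind of query that exercises that same path. This is illustrative only, not a confirmed reproducer, and assumes a cluster with the RAPIDS Accelerator jar installed and a GPU attached:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: a locally built DataFrame starts as CPU rows, so running
# a GPU expression over it inserts GpuRowToColumnarExec, whose host-to-device
# copy is where the CudaException in the trace above is thrown.
spark = (SparkSession.builder
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .getOrCreate())

df = spark.createDataFrame([(i, float(i)) for i in range(100_000)], "a INT, b DOUBLE")

# collect() goes through Dataset.collectToPython, matching the top of the trace;
# each batch crosses the host -> device memcpy before the GPU projection runs.
rows = df.selectExpr("a + 1 AS a1", "b * 2 AS b2").collect()
print(len(rows))
```

Any query whose input starts as CPU rows (as most of the integration tests' generated DataFrames do) crosses this host-to-device copy, which would explain why the failures are spread across so many tests.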
pxLi added the bug and ? - Needs Triage labels on Jan 18, 2022

pxLi commented Jan 18, 2022

I tried re-running the tests on Databricks 9.1 on AWS and Azure multiple times today, and cannot reproduce the failure.

It seems the scheduled Azure GPU instances were simply not healthy at that specific moment.
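
Since this points at flaky instances rather than the plugin, one cheap sanity check before re-running the suite is to ask every executor whether its GPU is visible at all. Illustrative sketch only; assumes an active spark session on the suspect cluster:

```python
import subprocess

def check_gpu(_):
    # A wedged or failing GPU typically makes nvidia-smi error out, so a
    # non-zero exit here is a strong hint the instance itself is bad.
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    yield out.stdout.strip() if out.returncode == 0 else "nvidia-smi failed: " + out.stderr.strip()

sc = spark.sparkContext
# One task per default slot so the check lands on every worker (approximately).
print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .mapPartitions(check_gpu)
        .collect())
```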

pxLi removed the ? - Needs Triage label on Jan 18, 2022

pxLi commented Jan 18, 2022

Closing this for now.

pxLi closed this as completed on Jan 18, 2022