TPC-DS causes OOM #594

Closed
vaibhawvipul opened this issue Jun 25, 2024 · 8 comments · Fixed by #639
Labels
bug Something isn't working

Comments

@vaibhawvipul
Contributor

Describe the bug

I initially started with a 3TB dataset, which I then scaled down to 200GB. This is the driver and executor config on my end:

--conf spark.driver.memory=10g \
--conf spark.executor.cores=4 \
--conf spark.executor.memory=12g \
--conf spark.driver.memoryOverhead=3000 \
--conf spark.executor.memoryOverhead=4G \

Java options:

--conf spark.driver.defaultJavaOptions="-XX:OnOutOfMemoryError='kill -9 %p' -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70" \

--conf spark.executor.defaultJavaOptions="-verbose:gc -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70" \

Comet configurations are as described in the benchmarking section of the website.
I am running this with 40 executors and observe some OOMs, which is intriguing because the dataset is small.
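
For a rough sense of scale, the settings above can be turned into per-pod and cluster-wide memory figures. This is a minimal sketch, not part of the original report. It assumes each executor pod requests roughly heap plus `memoryOverhead`, that one task runs per core, and that no additional off-heap or Comet-specific memory settings apply; the function and variable names are made up for illustration.

```python
# Rough memory budget implied by the settings in this report.
# Assumptions (not from the report): pod request ~= heap + memoryOverhead,
# one task per core, no extra off-heap or Comet-specific memory settings.

EXECUTOR_HEAP_GB = 12       # spark.executor.memory=12g
EXECUTOR_OVERHEAD_GB = 4    # spark.executor.memoryOverhead=4G
EXECUTOR_CORES = 4          # spark.executor.cores=4
NUM_EXECUTORS = 40          # "running this with 40 executors"
DATASET_GB = 200            # scaled-down TPC-DS dataset


def memory_budget(heap_gb, overhead_gb, cores, executors, dataset_gb):
    """Return coarse memory figures for one executor pod and the whole job."""
    per_pod_gb = heap_gb + overhead_gb
    return {
        "per_pod_gb": per_pod_gb,
        "cluster_gb": per_pod_gb * executors,
        "overhead_per_task_gb": overhead_gb / cores,
        "dataset_per_executor_gb": dataset_gb / executors,
    }


budget = memory_budget(
    EXECUTOR_HEAP_GB, EXECUTOR_OVERHEAD_GB, EXECUTOR_CORES, NUM_EXECUTORS, DATASET_GB
)
for name, value in budget.items():
    print(f"{name}: {value}")
```

Under these assumptions the aggregate executor memory is several times the dataset size, which is why OOMs look surprising here. Any native allocations made outside the JVM heap would have to fit in the overhead portion, so that is one figure worth watching.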

Steps to reproduce

No response

Expected behavior

No OOM

Additional context

image

@vaibhawvipul vaibhawvipul added the bug Something isn't working label Jun 25, 2024
@vaibhawvipul
Contributor Author

And then the driver hangs with the following:

2024-06-25 14:36:22,219 WARN serde.QueryPlanSerde: Comet does not guarantee correct results for cast from DecimalType(22,6) to DecimalType(32,6) with timezone Some(Etc/UTC) and evalMode LEGACY
2024-06-25 14:43:32,069 ERROR util.Utils: Uncaught exception in thread kubernetes-executor-pod-polling-sync
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:543)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:427)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:392)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:93)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsPollingSnapshotSource$PollRunnable.$anonfun$run$1(ExecutorPodsPollingSnapshotSource.scala:91)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1509)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsPollingSnapshotSource$PollRunnable.run(ExecutorPodsPollingSnapshotSource.scala:74)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:520)

@lmouhib

lmouhib commented Jun 25, 2024

Logs from one of the executors:

2024-06-25 14:36:03,380 WARN executor.Executor: task 181.0 in stage 573.0 (TID 35314) encountered a org.apache.spark.shuffle.FetchFailedException and failed, but the org.apache.spark.shuffle.FetchFailedException was hidden by another exception.  Spark is handling this like a fetch failure and ignoring the other exception: org.apache.comet.CometNativeException: General execution error with reason org.apache.comet.CometNativeException: called `Result::unwrap()` on an `Err` value: JNI { source: NullPtr("get_object_class") }
        at std::backtrace::Backtrace::create(__internal__:0)
        at comet::errors::init::{{closure}}(__internal__:0)
        at std::panicking::rust_panic_with_hook(__internal__:0)
        at std::panicking::begin_panic_handler::{{closure}}(__internal__:0)
        at std::sys_common::backtrace::__rust_end_short_backtrace(__internal__:0)
        at rust_begin_unwind(__internal__:0)
        at core::panicking::panic_fmt(__internal__:0)
        at core::result::unwrap_failed(__internal__:0)
        at comet::execution::operators::scan::ScanExec::get_next(__internal__:0)
        at comet::execution::operators::scan::ScanExec::get_next_batch(__internal__:0)
        at comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}(__internal__:0)
        at Java_org_apache_comet_Native_executePlan(__internal__:0)
        at <unknown>(__internal__:0).

@lmouhib

lmouhib commented Jun 25, 2024

Spark version is 3.4.3.

@andygrove
Member

@vaibhawvipul I am curious if you see the same issue if you disable Comet shuffle?

@andygrove
Member

@lmouhib would you be able to test again with the changes in #598 to see if it resolves the panic and shows us more information about the root cause?

@andygrove
Member

@lmouhib also #600 may help

@lmouhib

lmouhib commented Jun 27, 2024

I can try #600 once it's merged into main. The exception that led to the JNI error is logged at WARN level, so I'm not sure it is the root cause. It seems to happen when an executor tries to fetch intermediate results during a shuffle and the remote executor has already been killed due to OOM.

@lmouhib

lmouhib commented Jun 27, 2024

@vaibhawvipul I am curious if you see the same issue if you disable Comet shuffle?

Shouldn't it be when enabling Comet shuffle? I ask because I am able to run the test when I disable Comet shuffle. I am using Kubernetes; maybe if we had access to a YARN-based cluster we could try running it with and without Comet shuffle?
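
The with/without comparison discussed here could be set up by generating two otherwise identical sets of `--conf` flags. This is a hypothetical sketch: `conf_args` and `BASE_CONF` are illustrative names, and it assumes the toggle is `spark.comet.exec.shuffle.enabled`, which should be checked against the configuration guide for the Comet version in use.

```python
# Hypothetical helper for an A/B run: same job, Comet shuffle on vs. off.
# Assumption: the toggle key is spark.comet.exec.shuffle.enabled; verify it
# against the configuration guide for the Comet version being tested.

COMET_SHUFFLE_KEY = "spark.comet.exec.shuffle.enabled"

# Memory settings copied from the issue description.
BASE_CONF = {
    "spark.driver.memory": "10g",
    "spark.executor.cores": "4",
    "spark.executor.memory": "12g",
    "spark.driver.memoryOverhead": "3000",
    "spark.executor.memoryOverhead": "4G",
}


def conf_args(comet_shuffle_enabled):
    """Build spark-submit --conf arguments for one side of the comparison."""
    conf = dict(BASE_CONF)
    conf[COMET_SHUFFLE_KEY] = "true" if comet_shuffle_enabled else "false"
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


for enabled in (True, False):
    print(" ".join(conf_args(enabled)))
```

Keeping every other setting fixed means a difference in OOM behavior between the two runs can be attributed to the shuffle implementation rather than to the memory configuration.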
