Add limitations for Databricks doc (#3501)
* Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Add limitations for databricks.

* Update docs/get-started/getting-started-databricks.md

Add indentation

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update getting-started-databricks.md

updated

* Update getting-started-databricks.md

* Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Add databricks limitation to FAQ

* Update docs/FAQ.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/FAQ.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/get-started/getting-started-databricks.md

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update FAQ.md

* Update getting-started-databricks.md

Move limitations section to top.

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
3 people authored Sep 16, 2021
1 parent 454bfba commit 20057bc
Showing 2 changed files with 39 additions and 0 deletions.
6 changes: 6 additions & 0 deletions docs/FAQ.md
@@ -259,12 +259,18 @@ efficient to stay on the CPU instead of going back and forth.

Yes, DPP still works. It might not be as efficient as it could be, and we are working to improve it.

DPP is not supported on Databricks with the plugin.
Queries on Databricks will not fail, but they cannot benefit from DPP.

### Is Adaptive Query Execution (AQE) Supported?

In the 0.2 release, AQE is supported but all exchanges will default to the CPU. As of the 0.3
release, running on Spark 3.0.1 and higher any operation that is supported on GPU will now stay on
the GPU when AQE is enabled.

AQE is not supported on Databricks with the plugin.
If AQE is enabled on Databricks, queries may fail with a `StackOverflowError`.

#### Why does my query show as not on the GPU when Adaptive Query Execution is enabled?

When running an `explain()` on a query where AQE is on, it is possible that AQE has not finalized
33 changes: 33 additions & 0 deletions docs/get-started/getting-started-databricks.md
@@ -21,6 +21,39 @@ runtimes which may impact the behavior of the plugin.

The number of GPUs per node dictates the number of Spark executors that can run in that node.

## Limitations

1. Adaptive query execution (AQE) and Delta optimized writes do not work. These should be disabled
when using the plugin. Queries may still see significant speedups even with AQE disabled.

```bash
spark.databricks.delta.optimizeWrite.enabled false
spark.sql.adaptive.enabled false
```

See [issue-1059](https://github.com/NVIDIA/spark-rapids/issues/1059) for more detail.
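
As an illustrative aid only (this helper is not part of the plugin or of Databricks; the
function name and the `REQUIRED_OFF` list are assumptions for this sketch), a plain-Python
check can flag cluster conf entries that should be disabled but are not:

```python
# Hypothetical helper: given a Spark conf mapping, report which settings
# known to conflict with the plugin on Databricks are missing or not
# explicitly set to "false".
REQUIRED_OFF = (
    "spark.sql.adaptive.enabled",
    "spark.databricks.delta.optimizeWrite.enabled",
)

def conflicting_settings(conf):
    """Return the conf keys that are absent or not set to 'false'."""
    return sorted(k for k in REQUIRED_OFF
                  if str(conf.get(k, "true")).lower() != "false")

print(conflicting_settings({"spark.sql.adaptive.enabled": "false"}))
# -> ['spark.databricks.delta.optimizeWrite.enabled']
```

Running it against the conf you plan to paste into the cluster's Spark Configuration tab is a
quick way to catch a missing entry before starting the cluster.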

2. Dynamic partition pruning (DPP) does not work. This results in poor performance for queries which
would normally benefit from DPP. See
[issue-3143](https://github.com/NVIDIA/spark-rapids/issues/3143) for more detail.

3. When selecting GPU nodes, Databricks requires the driver node to be a GPU node. Outside of
Databricks the plugin can operate with the driver as a CPU node and workers as GPU nodes.

4. Multiple executors cannot be run on a multi-GPU node.

Even though it is possible to set `spark.executor.resource.gpu.amount=N` (where N is the number
of GPUs per node) in the Spark Configuration tab, Databricks overrides this to
`spark.executor.resource.gpu.amount=1`. This will result in failed executors when starting the
cluster.
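
Given that override, the only arrangement expected to work on Databricks is a single GPU per
executor; a sketch of the corresponding Spark Configuration entry (restating the limitation
above, not an official recommendation):

```bash
spark.executor.resource.gpu.amount 1
```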

5. Databricks makes changes to the runtime without notification.

Databricks applies patches to existing runtimes without notification.
[Issue-3098](https://github.com/NVIDIA/spark-rapids/issues/3098) is one example of this. We run
regular integration tests on the Databricks environment to catch these issues and fix them once
detected.

## Start a Databricks Cluster
Create a Databricks cluster by going to Clusters, then clicking `+ Create Cluster`. Ensure the
cluster meets the prerequisites above by configuring it as follows:
