[DOCS] Improve the Databricks setup guide (#1582)
jiayuasu committed Sep 6, 2024
1 parent a66d4e7 commit b408227
Showing 4 changed files with 40 additions and 15 deletions.
40 changes: 28 additions & 12 deletions docs/setup/databricks.md
@@ -1,4 +1,4 @@
-Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](../maven-coordinates). Databricks Spark and Apache Spark's compatibility can be found here: https://docs.databricks.com/en/release-notes/runtime/index.html
+Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](maven-coordinates.md). The compatibility between Databricks Runtime and Apache Spark versions can be found [here](https://docs.databricks.com/en/release-notes/runtime/index.html).

## Community edition (free-tier)

@@ -8,18 +8,18 @@ You just need to install the Sedona jars and Sedona Python on Databricks using D

1) From the Libraries tab, install from Maven Coordinates

-```
-org.apache.sedona:sedona-spark-shaded-3.0_2.12:{{ sedona.current_version }}
-org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
-```
+```
+org.apache.sedona:sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}
+org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
+```

2) To enable Python support, from the Libraries tab install from PyPI

-```
-apache-sedona
-keplergl==0.3.2
-pydeck==0.8.0
-```
+```
+apache-sedona=={{ sedona.current_version }}
+keplergl==0.3.2
+pydeck==0.8.0
+```

### Initialize
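
A minimal sketch of what the initialization typically looks like in a notebook (`spark` is the session Databricks already provides; assumes the Sedona packages installed above and a recent Sedona release):

```python
from sedona.spark import SedonaContext

# Register Sedona's types and SQL functions on the existing Databricks session
sedona = SedonaContext.create(spark)
```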

@@ -66,10 +66,15 @@ curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/geotools-wrapper-{
curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/sedona-spark-shaded-3.4_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.4_2.12-{{ sedona.current_version }}.jar"
```

Of course, you can also do the steps above manually.

### Create an init script

!!!warning
-    Starting from December 2023, Databricks has disabled all DBFS based init script (/dbfs/XXX/<script-name>.sh). So you will have to store the init script from a workspace level (`/Users/<user-name>/<script-name>.sh`) or Unity Catalog volume (`/Volumes/<catalog>/<schema>/<volume>/<path-to-script>/<script-name>.sh`). Please see https://docs.databricks.com/en/init-scripts/cluster-scoped.html#configure-a-cluster-scoped-init-script-using-the-ui
+    Starting from December 2023, Databricks has disabled all DBFS-based init scripts (/dbfs/XXX/<script-name>.sh), so you will have to store the init script at the workspace level (`/Workspace/Users/<user-name>/<script-name>.sh`) or in a Unity Catalog volume (`/Volumes/<catalog>/<schema>/<volume>/<path-to-script>/<script-name>.sh`). Please see [Databricks init scripts](https://docs.databricks.com/en/init-scripts/cluster-scoped.html#configure-a-cluster-scoped-init-script-using-the-ui) for more information.

+!!!note
+    If you are creating a Shared cluster, you won't be able to use init scripts and jars stored under `Workspace`. Please store them in `Volumes` instead. The overall process should be the same.

Create an init script in `Workspace` that loads the Sedona jars into the cluster's default jar directory. You can create it from any notebook by running:

@@ -86,13 +91,14 @@ cat > /Workspace/Shared/sedona/sedona-init.sh <<'EOF'
# File: sedona-init.sh
#
# On cluster startup, this script will copy the Sedona jars to the cluster's default jar directory.
# In order to activate Sedona functions, remember to add to your spark configuration the Sedona extensions: "spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
cp /Workspace/Shared/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
EOF
```

Of course, you can also do the steps above manually.

### Set up cluster config

From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options` -> `Spark`) activate the Sedona functions and the kryo serializer by adding to the Spark Config
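As a sketch, the Spark Config typically contains entries like the following (the extension and registrator class names here are the commonly documented ones and may vary between Sedona versions):

```
spark.sql.extensions org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```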
@@ -120,3 +126,13 @@ pydeck==0.8.0

!!!tips
    You need to install the Sedona libraries via an init script because libraries installed via the UI are installed after the cluster has already started; therefore, the classes specified by the configs `spark.sql.extensions`, `spark.serializer`, and `spark.kryo.registrator` are not available at startup time.

+### Verify installation
+
+After you have started the cluster, you can verify that Sedona is correctly installed by running the following code in a notebook:
+
+```python
+spark.sql("SELECT ST_Point(1, 1)").show()
+```
+
+Note that you don't need to run `SedonaRegistrator.registerAll(spark)` or `SedonaContext.create(spark)` in the advanced edition, because `org.apache.sedona.sql.SedonaSqlExtensions` in the cluster config takes care of that.
10 changes: 10 additions & 0 deletions docs/setup/emr.md
@@ -52,3 +52,13 @@ When you create an EMR cluster, in the software configuration, add the following

!!!note
    If you use Sedona 1.3.1-incubating, please use the `sedona-python-adpater-3.0_2.12` jar in the content above, instead of `sedona-spark-shaded-3.0_2.12`.

+## Verify installation
+
+After the cluster is created, you can verify the installation by running the following code in a Jupyter notebook:
+
+```python
+spark.sql("SELECT ST_Point(0, 0)").show()
+```
+
+Note that you don't need to run `SedonaRegistrator.registerAll(spark)` or `SedonaContext.create(spark)`, because `org.apache.sedona.sql.SedonaSqlExtensions` in the config takes care of that.
1 change: 0 additions & 1 deletion docs/tutorial/benchmark.md
@@ -3,5 +3,4 @@
We welcome people to use Sedona for benchmarking purposes. To achieve the best performance or enjoy all features of Sedona,

* Please always use the latest version or state the version used in your benchmark so that we can trace issues back.
-* Please consider using Sedona core instead of Sedona SQL. Due to the limitation of SparkSQL (for instance, not support clustered index), we are not able to expose all features to SparkSQL.
* Please enable the Sedona Kryo serializer to reduce the memory footprint, as sketched below.
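
A minimal Python sketch of enabling it when building the session (assuming Sedona >= 1.4.1; the registrator class path is an assumption that may differ by version):

```python
from sedona.spark import SedonaContext

# Build a session with Kryo serialization and Sedona's Kryo registrator enabled
config = (
    SedonaContext.builder()
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```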
4 changes: 2 additions & 2 deletions docs/tutorial/sql.md
Expand Up @@ -43,7 +43,7 @@ Detailed SedonaSQL APIs are available here: [SedonaSQL API](../api/sql/Overview.

## Create Sedona config

-Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please skip this step and can use `spark` directly.
+Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please ==skip this step==.

==Sedona >= 1.4.1==
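
A sketch of the config creation on Sedona >= 1.4.1 (the builder comes from the Sedona Python API; extra `.config(...)` options can be chained as needed):

```python
from sedona.spark import SedonaContext

# Build a Spark config/session wired up for Sedona
config = SedonaContext.builder().getOrCreate()
```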

@@ -147,7 +147,7 @@ The following method has been deprecated since Sedona 1.4.1. Please use the meth

## Initiate SedonaContext

-Add the following line after creating Sedona config. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please call `SedonaContext.create(spark)` instead.
+Add the following line after creating the Sedona config. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please call `sedona = SedonaContext.create(spark)` instead. For ==Databricks==, the situation is more complicated; please refer to the [Databricks setup guide](../setup/databricks.md), but generally you don't need to create a SedonaContext.

==Sedona >= 1.4.1==
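
A sketch of the corresponding call (assuming the `config` object created in the previous section):

```python
from sedona.spark import SedonaContext

sedona = SedonaContext.create(config)
```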
