[DOCS] Improve the Databricks setup guide (#1582)
jiayuasu committed Sep 6, 2024
1 parent a66d4e7 commit b408227
Showing 4 changed files with 40 additions and 15 deletions.
40 changes: 28 additions & 12 deletions docs/setup/databricks.md
@@ -1,4 +1,4 @@
-Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](../maven-coordinates). Databricks Spark and Apache Spark's compatibility can be found here: https://docs.databricks.com/en/release-notes/runtime/index.html
+Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](maven-coordinates.md). The compatibility between Databricks Runtime and Apache Spark versions can be found [here](https://docs.databricks.com/en/release-notes/runtime/index.html).

## Community edition (free-tier)

@@ -8,18 +8,18 @@ You just need to install the Sedona jars and Sedona Python on Databricks using D

1) From the Libraries tab, install from Maven Coordinates

-```
-org.apache.sedona:sedona-spark-shaded-3.0_2.12:{{ sedona.current_version }}
-org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
-```
+```
+org.apache.sedona:sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}
+org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
+```

2) To enable Python support, from the Libraries tab install from PyPI

-```
-apache-sedona
-keplergl==0.3.2
-pydeck==0.8.0
-```
+```
+apache-sedona=={{ sedona.current_version }}
+keplergl==0.3.2
+pydeck==0.8.0
+```

### Initialize
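
A minimal sketch of what the initialization typically looks like in a notebook (`spark` is the session Databricks already provides; assumes the Sedona packages installed above and a recent Sedona release):

```python
from sedona.spark import SedonaContext

# Register Sedona's types and SQL functions on the existing Databricks session
sedona = SedonaContext.create(spark)
```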

@@ -66,10 +66,15 @@ curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/geotools-wrapper-{
curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/sedona-spark-shaded-3.4_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.4_2.12-{{ sedona.current_version }}.jar"
```

Of course, you can also do the steps above manually.

### Create an init script

!!!warning
-    Starting from December 2023, Databricks has disabled all DBFS based init script (/dbfs/XXX/<script-name>.sh). So you will have to store the init script from a workspace level (`/Users/<user-name>/<script-name>.sh`) or Unity Catalog volume (`/Volumes/<catalog>/<schema>/<volume>/<path-to-script>/<script-name>.sh`). Please see https://docs.databricks.com/en/init-scripts/cluster-scoped.html#configure-a-cluster-scoped-init-script-using-the-ui
+    Starting from December 2023, Databricks has disabled all DBFS-based init scripts (/dbfs/XXX/<script-name>.sh), so you will have to store the init script at the workspace level (`/Workspace/Users/<user-name>/<script-name>.sh`) or in a Unity Catalog volume (`/Volumes/<catalog>/<schema>/<volume>/<path-to-script>/<script-name>.sh`). Please see [Databricks init scripts](https://docs.databricks.com/en/init-scripts/cluster-scoped.html#configure-a-cluster-scoped-init-script-using-the-ui) for more information.

+!!!note
+    If you are creating a Shared cluster, you won't be able to use init scripts and jars stored under `Workspace`. Please store them in `Volumes` instead. The overall process should be the same.

Create an init script in `Workspace` that loads the Sedona jars into the cluster's default jar directory. You can create it from any notebook by running:

@@ -86,13 +91,14 @@ cat > /Workspace/Shared/sedona/sedona-init.sh <<'EOF'
# File: sedona-init.sh
#
# On cluster startup, this script will copy the Sedona jars to the cluster's default jar directory.
# In order to activate Sedona functions, remember to add to your spark configuration the Sedona extensions: "spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
cp /Workspace/Shared/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
EOF
```

Of course, you can also do the steps above manually.

### Set up cluster config

From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options` -> `Spark`) activate the Sedona functions and the kryo serializer by adding to the Spark Config
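As a sketch, the Spark Config typically contains entries like the following (the extension and registrator class names here are the commonly documented ones and may vary between Sedona versions):

```
spark.sql.extensions org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```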
@@ -120,3 +126,13 @@ pydeck==0.8.0

!!!tips
    You need to install the Sedona libraries via an init script because libraries installed via the UI are installed after the cluster has already started; therefore, the classes specified by the configs `spark.sql.extensions`, `spark.serializer`, and `spark.kryo.registrator` are not available at startup time.

+### Verify installation
+
+After you have started the cluster, you can verify that Sedona is correctly installed by running the following code in a notebook:
+
+```python
+spark.sql("SELECT ST_Point(1, 1)").show()
+```
+
+Note that you don't need to run `SedonaRegistrator.registerAll(spark)` or `SedonaContext.create(spark)` in the advanced edition, because `org.apache.sedona.sql.SedonaSqlExtensions` in the cluster config takes care of that.
10 changes: 10 additions & 0 deletions docs/setup/emr.md
@@ -52,3 +52,13 @@ When you create an EMR cluster, in the software configuration, add the following

!!!note
    If you use Sedona 1.3.1-incubating, please use the `sedona-python-adpater-3.0_2.12` jar in the content above, instead of `sedona-spark-shaded-3.0_2.12`.

+## Verify installation
+
+After the cluster is created, you can verify the installation by running the following code in a Jupyter notebook:
+
+```python
+spark.sql("SELECT ST_Point(0, 0)").show()
+```
+
+Note that you don't need to run `SedonaRegistrator.registerAll(spark)` or `SedonaContext.create(spark)`, because `org.apache.sedona.sql.SedonaSqlExtensions` in the config takes care of that.
1 change: 0 additions & 1 deletion docs/tutorial/benchmark.md
@@ -3,5 +3,4 @@
We welcome people to use Sedona for benchmarking purposes. To achieve the best performance or enjoy all features of Sedona,

* Please always use the latest version or state the version used in your benchmark so that we can trace issues back.
-* Please consider using Sedona core instead of Sedona SQL. Due to the limitation of SparkSQL (for instance, not support clustered index), we are not able to expose all features to SparkSQL.
* Please enable the Sedona Kryo serializer to reduce the memory footprint, as sketched below.
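
A minimal Python sketch of enabling it when building the session (assuming Sedona >= 1.4.1; the registrator class path is an assumption that may differ by version):

```python
from sedona.spark import SedonaContext

# Build a session with Kryo serialization and Sedona's Kryo registrator enabled
config = (
    SedonaContext.builder()
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```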
4 changes: 2 additions & 2 deletions docs/tutorial/sql.md
Expand Up @@ -43,7 +43,7 @@ Detailed SedonaSQL APIs are available here: [SedonaSQL API](../api/sql/Overview.

## Create Sedona config

-Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please skip this step and can use `spark` directly.
+Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please ==skip this step==.

==Sedona >= 1.4.1==
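
A sketch of the config creation on Sedona >= 1.4.1 (the builder comes from the Sedona Python API; extra `.config(...)` options can be chained as needed):

```python
from sedona.spark import SedonaContext

# Build a Spark config/session wired up for Sedona
config = SedonaContext.builder().getOrCreate()
```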

@@ -147,7 +147,7 @@ The following method has been deprecated since Sedona 1.4.1. Please use the meth

## Initiate SedonaContext

-Add the following line after creating Sedona config. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please call `SedonaContext.create(spark)` instead.
+Add the following line after creating the Sedona config. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please call `sedona = SedonaContext.create(spark)` instead. For ==Databricks==, the situation is more complicated; please refer to the [Databricks setup guide](../setup/databricks.md), but generally you don't need to create a SedonaContext.

==Sedona >= 1.4.1==
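
A sketch of the corresponding call (assuming the `config` object created in the previous section):

```python
from sedona.spark import SedonaContext

sedona = SedonaContext.create(config)
```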
