additional docs, use pathlib for writing script.
mjohns-databricks committed May 13, 2024
1 parent 8ab2bb8 commit f4245d1
Showing 4 changed files with 72 additions and 80 deletions.
18 changes: 11 additions & 7 deletions README.md
The supported languages are Scala, Python, R, and SQL.

## How does it work?

The Mosaic library is written in Scala (JVM) to guarantee maximum performance with Spark and, when possible,
it uses code generation to give an extra performance boost.

__The other supported languages (Python, R and SQL) are thin wrappers around the Scala (JVM) code.__
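
To illustrate the thin-wrapper idea, here is a hedged sketch (not Mosaic's actual source) of how a Python binding can forward a column through a JVM implementation; the JVM class path below is hypothetical:

```python
# Illustrative sketch only -- not Mosaic's actual source.
# A "thin wrapper" binding forwards a PySpark Column to the JVM implementation.
from pyspark import SparkContext
from pyspark.sql.column import Column

def st_area(geom: Column) -> Column:
    jvm = SparkContext._active_spark_context._jvm
    # Hypothetical JVM entry point standing in for Mosaic's registered expression.
    return Column(jvm.com.databricks.labs.mosaic.functions.st_area(geom._jc))
```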

Image1: Mosaic logical design.
:warning: **geopandas 0.14.4 not supported**

For Mosaic <= 0.4.1 `%pip install databricks-mosaic` will no longer install "as-is" in DBRs because Mosaic
left geopandas unpinned in those versions. With geopandas 0.14.4, the numpy dependency conflicts with the limits of
scikit-learn in DBRs. The workaround is `%pip install geopandas==0.14.3 databricks-mosaic`.
Mosaic 0.4.2+ limits the geopandas version.

### Mosaic 0.4.x Series [Latest]

We recommend using Databricks Runtime version 13.3 LTS with Photon enabled.
__Language Bindings__

As of Mosaic 0.4.0 / DBR 13.3 LTS (subject to change in follow-on releases)...

* [Assigned Clusters](https://docs.databricks.com/en/compute/configure.html#access-modes)
  * Mosaic Python, SQL, R, and Scala APIs.
* [Shared Access Clusters](https://docs.databricks.com/en/compute/configure.html#access-modes)
  * Mosaic Scala API (JVM) with Admin [allowlisting](https://docs.databricks.com/en/data-governance/unity-catalog/manage-privileges/allowlist.html).
  * Mosaic Python bindings (to Mosaic Scala APIs) are blocked by Py4J Security on Shared Access Clusters.
  * Mosaic SQL expressions cannot yet be registered with [Unity Catalog](https://www.databricks.com/product/unity-catalog) due to API changes affecting DBRs >= 13; more [here](https://docs.databricks.com/en/udf/index.html).

__Additional Notes:__
84 changes: 36 additions & 48 deletions docs/source/index.rst
:target: https://github.com/databrickslabs/mosaic/actions/workflows/docs.yml
:alt: Mosaic sphinx docs


.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
:target: https://github.com/psf/black
:alt: Code style: black



| Mosaic is an extension to the `Apache Spark <https://spark.apache.org/>`_ framework for fast and easy processing
| of very large geospatial datasets. It provides:
|
| [1] A choice of Scala, SQL and Python language bindings (all backed by the Scala/JVM core).
| [2] Raster and Vector APIs.
| [3] Easy conversion between common spatial data encodings (WKT, WKB and GeoJSON).
| [4] Constructors to easily generate new geometries from Spark native data types.
| [5] Many of the OGC SQL standard :code:`ST_` functions implemented as Spark Expressions for transforming,
| aggregating and joining spatial datasets.
| [6] High performance through implementation of Spark code generation within the core Mosaic functions.
| [7] Optimisations for point-in-polygon joins using an approach we co-developed with Ordnance Survey
| (`blog post <https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html>`_).
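
A minimal sketch of [3] and [5] in Python, assuming Mosaic is installed and enabled on an Assigned cluster
(:code:`spark` and :code:`dbutils` are the notebook-provided objects; function names follow the documented API):

.. code-block:: python

    import mosaic as mos
    mos.enable_mosaic(spark, dbutils)

    df = spark.createDataFrame([("POINT (30 10)",)], ["wkt"])
    df = (df
          .withColumn("geom", mos.st_geomfromwkt("wkt"))     # WKT -> internal geometry
          .withColumn("area", mos.st_area("geom"))           # OGC ST_ function
          .withColumn("geojson", mos.st_asgeojson("geom")))  # geometry -> GeoJSON
    df.show(truncate=False)
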
.. note::
For Mosaic versions < 0.4 please use the `0.3 docs <https://databrickslabs.github.io/mosaic/v0.3.x/index.html>`_.
We recommend using Databricks Runtime with Photon enabled to leverage the Databricks H3 expressions.
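
Item [7]'s grid-indexed point-in-polygon join, as a hedged sketch: :code:`grid_tessellateexplode` and
:code:`grid_pointascellid` follow the documented grid API, but the resolution and the chip-struct fields
(:code:`index_id`, :code:`is_core`, :code:`wkb`) are assumptions to verify against your Mosaic version.

.. code-block:: python

    from pyspark.sql import functions as F
    import mosaic as mos

    # Tessellate polygons into grid chips; index points at the same resolution.
    polys = polygons_df.select(
        "poly_id", mos.grid_tessellateexplode("poly_geom", F.lit(9)).alias("chip")
    )
    pts = points_df.withColumn("cell_id", mos.grid_pointascellid("pt_geom", F.lit(9)))

    # Equi-join on cell id; core chips need no geometry test, boundary chips do.
    joined = pts.join(polys, pts["cell_id"] == F.col("chip.index_id"))
    result = joined.filter(
        F.col("chip.is_core") | mos.st_contains(F.col("chip.wkb"), F.col("pt_geom"))
    )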

Version 0.4.x Series
====================

We recommend using Databricks Runtime version 13.3 LTS with Photon enabled.
.. warning::
For Mosaic <= 0.4.1 :code:`%pip install databricks-mosaic` will no longer install "as-is" in DBRs because Mosaic
left geopandas unpinned in those versions. With geopandas 0.14.4, the numpy dependency conflicts with the limits of
scikit-learn in DBRs. The workaround is :code:`%pip install geopandas==0.14.3 databricks-mosaic`.
Mosaic 0.4.2+ limits the geopandas version.

Mosaic 0.4.x series only supports DBR 13.x. If running on a different DBR, it will throw an exception:

DEPRECATION ERROR: Mosaic v0.4.x series only supports Databricks Runtime 13.
You can specify :code:`%pip install 'databricks-mosaic<0.4,>=0.3'` for DBR < 13.
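
A hedged sketch of reproducing this version gate yourself; the conf key is a standard Databricks one, but
Mosaic's own check may differ:

.. code-block:: python

    # Inspect the runtime version before installing; values look like "13.3.x-scala2.12".
    dbr = spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "")
    if not dbr.startswith("13."):
        raise RuntimeError(
            "DEPRECATION ERROR: Mosaic v0.4.x series only supports Databricks Runtime 13."
        )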

Mosaic 0.4.x series issues an ERROR on standard, non-Photon clusters `ADB <https://learn.microsoft.com/en-us/azure/databricks/runtime/>`_ |
`AWS <https://docs.databricks.com/runtime/index.html/>`_ |
`GCP <https://docs.gcp.databricks.com/runtime/index.html/>`_:

DEPRECATION ERROR: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for
spatial AI benefits; Mosaic 0.4.x series restricts executing this cluster.

As of Mosaic 0.4.0 / DBR 13.3 LTS (subject to change in follow-on releases):

* `Assigned Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_
  * Mosaic Python, SQL, R, and Scala APIs.
* `Shared Access Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_
  * Mosaic Scala API (JVM) with Admin `allowlisting <https://docs.databricks.com/en/data-governance/unity-catalog/manage-privileges/allowlist.html>`_.
  * Mosaic Python bindings (to Mosaic Scala APIs) are blocked by Py4J Security on Shared Access Clusters.
  * Mosaic SQL expressions cannot yet be registered due to `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_
    API changes; more `here <https://docs.databricks.com/en/udf/index.html>`_.

.. note::
As of Mosaic 0.4.0 (subject to change in follow-on releases)

* `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_ enforces process isolation which is difficult
to accomplish with custom JVM libraries; as such only built-in (aka platform provided) JVM APIs can be invoked from
other supported languages in Shared Access Clusters.
* Clusters (both Assigned and Shared Access) can read `Volumes <https://docs.databricks.com/en/connect/unity-catalog/volumes.html>`_
  via relevant built-in readers and writers or via custom Python calls which do not involve any custom JVM code.
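
For example, a hedged sketch with a hypothetical Volume path:

.. code-block:: python

    # Read from a Unity Catalog Volume with a built-in reader -- no custom JVM code.
    df = (spark.read.format("binaryFile")
          .load("/Volumes/my_catalog/my_schema/my_volume/rasters/"))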


Version 0.3.x Series
====================

We recommend using Databricks Runtime version 12.2 LTS with Photon enabled.

.. warning::
Mosaic 0.3.x series does not support DBR 13.x.

As of the 0.3.11 release, Mosaic issues the following WARNING when initialized on a cluster that is neither Photon Runtime
nor Databricks Runtime ML `ADB <https://learn.microsoft.com/en-us/azure/databricks/runtime/>`_ |
`AWS <https://docs.databricks.com/runtime/index.html/>`_ |
`GCP <https://docs.gcp.databricks.com/runtime/index.html/>`_:

DEPRECATION WARNING: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for
spatial AI benefits; Mosaic will stop working on this cluster after v0.3.x.

The reason we are making this change is that we are streamlining Mosaic internals to be more aligned with future
product APIs which are powered by Photon. Along this direction of change, Mosaic has standardized to JTS as its
default and supported Vector Geometry Provider.

.. note::
For Mosaic versions < 0.4 please use the `0.3 docs <https://databrickslabs.github.io/mosaic/v0.3.x/index.html>`_.


Documentation
=============
44 changes: 23 additions & 21 deletions docs/source/usage/installation.rst
Supported platforms
###################

.. warning::
For Mosaic <= 0.4.1 :code:`%pip install databricks-mosaic` will no longer install "as-is" in DBRs because Mosaic
left geopandas unpinned in those versions. With geopandas 0.14.4, the numpy dependency conflicts with the limits of
scikit-learn in DBRs. The workaround is :code:`%pip install geopandas==0.14.3 databricks-mosaic`.
Mosaic 0.4.2+ limits the geopandas version.

Mosaic 0.4.x series only supports DBR 13.x. If running on a different DBR, it will throw an exception:

DEPRECATION ERROR: Mosaic v0.4.x series only supports Databricks Runtime 13.
You can specify :code:`%pip install 'databricks-mosaic<0.4,>=0.3'` for DBR < 13.

Mosaic 0.4.x series issues an ERROR on standard, non-Photon clusters `ADB <https://learn.microsoft.com/en-us/azure/databricks/runtime/>`_ |
`AWS <https://docs.databricks.com/runtime/index.html/>`_ |
`GCP <https://docs.gcp.databricks.com/runtime/index.html/>`_:

DEPRECATION ERROR: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for
spatial AI benefits; Mosaic 0.4.x series restricts executing this cluster.

As of Mosaic 0.4.0 / DBR 13.3 LTS (subject to change in follow-on releases):

* `Assigned Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_
  * Mosaic Python, SQL, R, and Scala APIs.
* `Shared Access Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_
  * Mosaic Scala API (JVM) with Admin `allowlisting <https://docs.databricks.com/en/data-governance/unity-catalog/manage-privileges/allowlist.html>`_.
  * Mosaic Python bindings (to Mosaic Scala APIs) are blocked by Py4J Security on Shared Access Clusters.
  * Mosaic SQL expressions cannot yet be registered due to `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_
    API changes; more `here <https://docs.databricks.com/en/udf/index.html>`_.

.. note::
As of Mosaic 0.4.0 (subject to change in follow-on releases)

* `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_ enforces process isolation which is difficult
to accomplish with custom JVM libraries; as such only built-in (aka platform provided) JVM APIs can be invoked from
other supported languages in Shared Access Clusters.
* Clusters (both Assigned and Shared Access) can read `Volumes <https://docs.databricks.com/en/connect/unity-catalog/volumes.html>`_
  via relevant built-in readers and writers or via custom Python calls which do not involve any custom JVM code.

If you have cluster creation permissions in your Databricks
workspace, you can create a cluster using the instructions
The mechanism for enabling the Mosaic functions varies by language:
enableMosaic()

.. note::
* We recommend use of :code:`import mosaic as mos` to namespace the Python API and avoid conflicts with other
  similarly named functions. By default, the Python import will handle installing the JAR and registering Spark
  Expressions, which are suitable for Assigned (vs Shared Access) clusters.
* It is possible to initialize the Python bindings without providing :code:`dbutils`; if you do this, :code:`%%mosaic_kepler`
  won't be able to render maps in notebooks.
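
A short sketch of both initialization paths (:code:`enable_mosaic` is the documented Python entry point;
:code:`spark` and :code:`dbutils` are the notebook-provided objects):

.. code-block:: python

    import mosaic as mos

    mos.enable_mosaic(spark, dbutils)   # full setup; %%mosaic_kepler can render maps
    # mos.enable_mosaic(spark)          # without dbutils; kepler map rendering is disabled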

Unless you are specially adding the JAR to your cluster (outside :code:`%pip` or the WHL file), please always initialize
with Python first, then you can initialize Scala (after the JAR has been auto-attached by Python); otherwise, you don't
6 changes: 2 additions & 4 deletions python/mosaic/api/fuse.py
def configure(self, test_mode: bool = False) -> bool:
# volumes must be pre-generated in unity catalog
os.makedirs(self.to_fuse_dir, exist_ok=True)


# - start with the un-configured script (from repo)
# this is using a different (repo) folder in 0.4.2+ (to allow prior versions to work)
GITHUB_CONTENT_TAG_URL = "https://raw.githubusercontent.com/databrickslabs/mosaic/main"
)

# - write the configured init script
script_out_path = Path(self.to_fuse_dir) / self.script_out_name
script_out_path.write_text(script, encoding='utf-8')
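# NOTE: assumes `from pathlib import Path` at the module top; Path.write_text
# replaces the previous open()/write() pair and closes the file automatically.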

# --- end of script config ---
