Global Checkpoint #562

Merged: 40 commits, merged May 28, 2024.

Changes from all commits (40 commits):
af37d92
checkpoint handling, streamline gdal setup, test log changes
mjohns-databricks May 1, 2024
ff7a840
undo local pom changes
mjohns-databricks May 1, 2024
07be355
revert StringType in readRaster to use createInfo
mjohns-databricks May 1, 2024
3b999d0
reducing long-running test data
mjohns-databricks May 1, 2024
53de83f
logging level adjusted; no checkpoint test adjusted.
mjohns-databricks May 1, 2024
62faf13
.gitignore updated for local notebooks
mjohns-databricks May 1, 2024
2b429bd
Merge pull request #559 from databrickslabs/main
mjohns-databricks May 3, 2024
7ccb8d8
merged codecov change, optional local build tweaks, reorder a test
mjohns-databricks May 6, 2024
dec0a94
Appended './0.4.2' to default fuse path, updating checkpoint fields …
mjohns-databricks May 6, 2024
ee8782e
spark configs set for checkpoint.
mjohns-databricks May 6, 2024
ee89699
adjusted local build, refreshing expression configs for checkpoint
mjohns-databricks May 7, 2024
6159d60
checkpoint accessors, deserialize handling.
mjohns-databricks May 8, 2024
a718021
deserialize checking
mjohns-databricks May 8, 2024
67cc869
revert deserialize checking
mjohns-databricks May 8, 2024
6992e90
fallback to path on raster ClassCastException
mjohns-databricks May 8, 2024
f65dfd1
serialize fallback to path on raster ClassCastException
mjohns-databricks May 8, 2024
0074150
deserialization handling
mjohns-databricks May 8, 2024
b660135
classcastexception and null check.
mjohns-databricks May 9, 2024
f062397
RasterTileTile now checkpoint aware.
mjohns-databricks May 9, 2024
920263d
pin geopandas ver pinned to 0.14, h3 ver pinned to 3.7, deserializati…
mjohns-databricks May 13, 2024
0387a12
additional docs, use pathlib for writing script.
mjohns-databricks May 13, 2024
e1a2754
clarify custom JVM vs built-in language in docs.
mjohns-databricks May 14, 2024
7dc5b13
pin geopandas ver pinned to 0.14, h3 ver pinned to 3.7, deserializati…
mjohns-databricks May 13, 2024
cab5db5
additional docs, use pathlib for writing script.
mjohns-databricks May 13, 2024
63435fc
clarify custom JVM vs built-in language in docs.
mjohns-databricks May 14, 2024
0770a2f
Merge remote-tracking branch 'refs/remotes/origin/main' into HEAD
mjohns-databricks May 14, 2024
d442b35
merge changes prior to 0.4.2 release, remove pyspark dep.
mjohns-databricks May 14, 2024
74ae84b
checkpoint handling for file and content.
mjohns-databricks May 15, 2024
acae6a8
enableGDALWithCheckpoint re-inits MosaicContext
mjohns-databricks May 16, 2024
5ef3620
re-register spark expressions for checkpoint
mjohns-databricks May 16, 2024
3e6938a
refresh python mosaic context. adjusted config for python and ipython…
mjohns-databricks May 16, 2024
8e658b0
additional adjustments to mosaic_context and additional functions for…
mjohns-databricks May 16, 2024
406027a
streamline python bindings wrt mosaic_context.
mjohns-databricks May 17, 2024
f11a271
version to 0.4.3, hasContext() function added to MosaicContext class.
mjohns-databricks May 17, 2024
b0f3230
gdal API handles checkpoint config changes.
mjohns-databricks May 20, 2024
705970c
reset checkpoint support, additional testing.
mjohns-databricks May 21, 2024
a3fee82
new functions to gdal __all__
mjohns-databricks May 21, 2024
ef3ee9a
small commit to trigger github build.
mjohns-databricks May 21, 2024
2e3f3bd
changelog, docker, and pyspark version changes.
mjohns-databricks May 24, 2024
4b80084
library handling.
mjohns-databricks May 24, 2024
2 changes: 1 addition & 1 deletion .github/workflows/build_main.yml
@@ -19,7 +19,7 @@ jobs:
         python: [ 3.10.12 ]
         numpy: [ 1.22.4 ]
         gdal: [ 3.4.1 ]
-        spark: [ 3.4.0 ]
+        spark: [ 3.4.1 ]
         R: [ 4.2.2 ]
     steps:
       - name: checkout code
2 changes: 1 addition & 1 deletion .github/workflows/build_python.yml
@@ -15,7 +15,7 @@ jobs:
         python: [ 3.10.12 ]
         numpy: [ 1.22.4 ]
         gdal: [ 3.4.1 ]
-        spark: [ 3.4.0 ]
+        spark: [ 3.4.1 ]
         R: [ 4.2.2 ]
     steps:
       - name: checkout code
2 changes: 1 addition & 1 deletion .github/workflows/build_r.yml
@@ -16,7 +16,7 @@ jobs:
         python: [ 3.10.12 ]
         numpy: [ 1.22.4 ]
         gdal: [ 3.4.1 ]
-        spark: [ 3.4.0 ]
+        spark: [ 3.4.1 ]
         R: [ 4.2.2 ]
     steps:
       - name: checkout code
2 changes: 1 addition & 1 deletion .github/workflows/build_scala.yml
@@ -14,7 +14,7 @@ jobs:
         python: [ 3.10.12 ]
         numpy: [ 1.22.4 ]
         gdal: [ 3.4.1 ]
-        spark: [ 3.4.0 ]
+        spark: [ 3.4.1 ]
         R: [ 4.2.2 ]
     steps:
       - name: checkout code
2 changes: 1 addition & 1 deletion .github/workflows/pypi-release.yml
@@ -12,7 +12,7 @@ jobs:
         python: [ 3.10.12 ]
         numpy: [ 1.22.4 ]
         gdal: [ 3.4.1 ]
-        spark: [ 3.4.0 ]
+        spark: [ 3.4.1 ]
         R: [ 4.2.2 ]
     steps:
       - name: checkout code
6 changes: 6 additions & 0 deletions .gitignore
@@ -159,3 +159,9 @@ spark-warehouse
 .DS_Store
 .Rproj.user
 docker/.m2/
+/python/notebooks/
+/scripts/m2/
+/python/mosaic_test/
+/python/checkpoint/
+/python/checkpoint-new/
+/scripts/docker/docker-build/ubuntu-22-spark-3.4/Dockerfile
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,12 @@
+## v0.4.3 [DBR 13.3 LTS]
+- Pyspark requirement removed from python setup.cfg as it is supplied by DBR
+- Python version limited to "<3.11,>=3.10" for DBR
+- iPython dependency limited to "<8.11,>=7.4.2" for both DBR and keplergl-jupyter
+- Expanded support for fuse-based checkpointing (persisted raster storage), managed through:
+  - spark config 'spark.databricks.labs.mosaic.raster.use.checkpoint' in addition to 'spark.databricks.labs.mosaic.raster.checkpoint'.
+  - python: `mos.enable_gdal(spark, with_checkpoint_path=path)`.
+  - scala: `MosaicGDAL.enableGDALWithCheckpoint(spark, path)`.
+
 ## v0.4.2 [DBR 13.3 LTS]
 - Geopandas now fixed to "<0.14.4,>=0.14" due to conflict with minimum numpy version in geopandas 0.14.4.
 - H3 python changed from "==3.7.0" to "<4.0,>=3.7" to pick up patches.
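For context, a minimal sketch of the new checkpointing entry point described in the v0.4.3 notes above, assuming a Databricks notebook where `spark` is predefined; the fuse directory value is a hypothetical placeholder, not a path from this PR:

import mosaic as mos

# Hypothetical fuse-mounted directory for persisted raster tiles.
checkpoint_dir = "/dbfs/tmp/mosaic/checkpoint"

# Enable GDAL with checkpointing in one call, per the changelog entry above.
# The Scala equivalent is MosaicGDAL.enableGDALWithCheckpoint(spark, path).
mos.enable_gdal(spark, with_checkpoint_path=checkpoint_dir)
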
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -84,7 +84,7 @@ The repository is structured as follows:
 ## Test & build Mosaic
 
 Given that DBR 13.3 is Ubuntu 22.04, we recommend using docker,
-see [mosaic-docker.sh](https://github.com/databrickslabs/mosaic/blob/main/scripts/mosaic-docker.sh).
+see [mosaic-docker.sh](https://github.com/databrickslabs/mosaic/blob/main/scripts/docker/mosaic-docker.sh).
 
 ### Scala JAR
 
2 changes: 1 addition & 1 deletion R/sparkR-mosaic/sparkrMosaic/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: sparkrMosaic
 Title: SparkR bindings for Databricks Mosaic
-Version: 0.4.2
+Version: 0.4.3
 Authors@R:
     person("Robert", "Whiffin", , "robert.whiffin@databricks.com", role = c("aut", "cre")
     )
2 changes: 1 addition & 1 deletion R/sparklyr-mosaic/sparklyrMosaic/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: sparklyrMosaic
 Title: sparklyr bindings for Databricks Mosaic
-Version: 0.4.2
+Version: 0.4.3
 Authors@R:
     person("Robert", "Whiffin", , "robert.whiffin@databricks.com", role = c("aut", "cre")
     )
2 changes: 1 addition & 1 deletion R/sparklyr-mosaic/tests.R
@@ -9,7 +9,7 @@ library(sparklyr.nested)
 spark_home <- Sys.getenv("SPARK_HOME")
 spark_home_set(spark_home)
 
-install.packages("sparklyrMosaic_0.4.2.tar.gz", repos = NULL)
+install.packages("sparklyrMosaic_0.4.3.tar.gz", repos = NULL)
 library(sparklyrMosaic)
 
 # find the mosaic jar in staging
1 change: 0 additions & 1 deletion docs/source/api/rasterio-udfs.rst
@@ -248,7 +248,6 @@ depending on your needs.
     def write_raster(raster, driver, file_id, fuse_dir):
         from io import BytesIO
         from pathlib import Path
-        from pyspark.sql.functions import udf
         from rasterio.io import MemoryFile
         import numpy as np
         import rasterio
2 changes: 1 addition & 1 deletion docs/source/usage/install-gdal.rst
@@ -112,7 +112,7 @@ Here are spark session configs available for raster, e.g. :code:`spark.conf.set(
      - Checkpoint location, e.g. :ref:`rst_maketiles`
    * - spark.databricks.labs.mosaic.raster.use.checkpoint
      - "false"
-     - Checkpoint for session, in 0.4.2+
+     - Checkpoint for session, in 0.4.3+
    * - spark.databricks.labs.mosaic.raster.tmp.prefix
      - "" (will use "/tmp")
      - Local directory for workers
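A hedged sketch of setting the configs from the docs table above in a notebook session (`spark` assumed predefined; the paths are placeholders). Per `refresh_context` in python/mosaic/api/enable.py further down, changing these after enablement requires refreshing the Mosaic context:

# Turn on session checkpointing and point it at a fuse location (placeholder path).
spark.conf.set("spark.databricks.labs.mosaic.raster.use.checkpoint", "true")
spark.conf.set("spark.databricks.labs.mosaic.raster.checkpoint", "/dbfs/tmp/mosaic/checkpoint")

# Optional: redirect worker-local temp files (default "" resolves to "/tmp").
spark.conf.set("spark.databricks.labs.mosaic.raster.tmp.prefix", "/local_disk0")
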
62 changes: 39 additions & 23 deletions pom.xml
@@ -146,27 +146,6 @@
             </execution>
         </executions>
     </plugin>
-    <plugin>
-        <groupId>org.scoverage</groupId>
-        <artifactId>scoverage-maven-plugin</artifactId>
-        <version>2.0.2</version>
-        <executions>
-            <execution>
-                <id>scoverage-report</id>
-                <phase>package</phase>
-                <goals>
-                    <goal>check</goal>
-                    <goal>report-only</goal>
-                </goals>
-            </execution>
-        </executions>
-        <configuration>
-            <minimumCoverage>${minimum.coverage}</minimumCoverage>
-            <failOnMinimumCoverage>true</failOnMinimumCoverage>
-            <scalaVersion>${scala.version}</scalaVersion>
-            <additionalForkedProjectProperties>skipTests=false</additionalForkedProjectProperties>
-        </configuration>
-    </plugin>
     <plugin>
         <!-- see http://davidb.github.com/scala-maven-plugin -->
         <groupId>net.alchim31.maven</groupId>
@@ -277,8 +256,45 @@
         <properties>
             <scala.version>2.12.10</scala.version>
             <scala.compat.version>2.12</scala.compat.version>
-            <spark.version>3.4.0</spark.version>
-            <mosaic.version>0.4.2</mosaic.version>
+            <spark.version>3.4.1</spark.version>
+            <mosaic.version>0.4.3</mosaic.version>
         </properties>
+        <build>
+            <plugins>
+                <plugin>
+                    <groupId>org.scoverage</groupId>
+                    <artifactId>scoverage-maven-plugin</artifactId>
+                    <version>2.0.2</version>
+                    <executions>
+                        <execution>
+                            <id>scoverage-report</id>
+                            <phase>package</phase>
+                            <goals>
+                                <goal>check</goal>
+                                <goal>report-only</goal>
+                            </goals>
+                        </execution>
+                    </executions>
+                    <configuration>
+                        <minimumCoverage>${minimum.coverage}</minimumCoverage>
+                        <failOnMinimumCoverage>true</failOnMinimumCoverage>
+                        <scalaVersion>${scala.version}</scalaVersion>
+                        <additionalForkedProjectProperties>skipTests=false</additionalForkedProjectProperties>
+                    </configuration>
+                </plugin>
+            </plugins>
+        </build>
     </profile>
+    <profile>
+        <!-- local testing `mvn test -PskipScoverage -DskipTests=false -Dsuite=...` -->
+        <id>skipScoverage</id>
+        <properties>
+            <scala.version>2.12.10</scala.version>
+            <scala.compat.version>2.12</scala.compat.version>
+            <spark.version>3.4.1</spark.version>
+            <mosaic.version>0.4.3</mosaic.version>
+            <maven.compiler.source>1.8</maven.compiler.source>
+            <maven.compiler.target>1.8</maven.compiler.target>
+        </properties>
+    </profile>
 </profiles>
2 changes: 1 addition & 1 deletion python/mosaic/__init__.py
@@ -4,4 +4,4 @@
 from .models import SpatialKNN
 from .readers import read
 
-__version__ = "0.4.2"
+__version__ = "0.4.3"
2 changes: 1 addition & 1 deletion python/mosaic/api/__init__.py
@@ -1,7 +1,7 @@
 from .accessors import *
 from .aggregators import *
 from .constructors import *
-from .enable import enable_mosaic
+from .enable import enable_mosaic, get_install_version, get_install_lib_dir
 from .functions import *
 from .fuse import *
 from .predicates import *
55 changes: 43 additions & 12 deletions python/mosaic/api/enable.py
@@ -1,3 +1,5 @@
+import importlib.metadata
+import importlib.resources
 import warnings
 
 from IPython.core.getipython import get_ipython
@@ -72,24 +74,25 @@ def enable_mosaic(
     if not jar_autoattach:
         spark.conf.set("spark.databricks.labs.mosaic.jar.autoattach", "false")
         print("...set 'spark.databricks.labs.mosaic.jar.autoattach' to false")
+        config.jar_autoattach=False
     if jar_path is not None:
         spark.conf.set("spark.databricks.labs.mosaic.jar.path", jar_path)
         print(f"...set 'spark.databricks.labs.mosaic.jar.path' to '{jar_path}'")
+        config.jar_path=jar_path
     if log_info:
         spark.sparkContext.setLogLevel("info")
+        config.log_info=True
 
     # Config global objects
-    # - add MosaicContext after MosaicLibraryHandler
-    config.mosaic_spark = spark
-    _ = MosaicLibraryHandler(config.mosaic_spark, log_info=log_info)
-    config.mosaic_context = MosaicContext(config.mosaic_spark)
-
-    # Register SQL functions
-    optionClass = getattr(spark._sc._jvm.scala, "Option$")
-    optionModule = getattr(optionClass, "MODULE$")
-    config.mosaic_context._context.register(
-        spark._jsparkSession, optionModule.apply(None)
-    )
-
-    isSupported = config.mosaic_context._context.checkDBR(spark._jsparkSession)
-    if not isSupported:
+    _ = MosaicLibraryHandler(spark, log_info=log_info)
+    config.mosaic_context = MosaicContext(spark)
+    config.mosaic_context.jRegister(spark)
+
+    _jcontext = config.mosaic_context.jContext()
+    is_supported = _jcontext.checkDBR(spark._jsparkSession)
+    if not is_supported:
+        # unexpected - checkDBR returns true or throws exception
+        print("""WARNING: checkDBR returned False.""")
@@ -104,3 +107,31 @@ def enable_mosaic(
         from mosaic.utils.kepler_magic import MosaicKepler
 
         config.ipython_hook.register_magics(MosaicKepler)
+
+
+def get_install_version() -> str:
+    """
+    :return: mosaic version installed
+    """
+    return importlib.metadata.version("databricks-mosaic")
+
+
+def get_install_lib_dir(override_jar_filename=None) -> str:
+    """
+    This is looking for the library dir under site packages using the jar name.
+    :return: located library dir.
+    """
+    v = get_install_version()
+    jar_filename = f"mosaic-{v}-jar-with-dependencies.jar"
+    if override_jar_filename:
+        jar_filename = override_jar_filename
+    with importlib.resources.path("mosaic.lib", jar_filename) as p:
+        return p.parent.as_posix()
+
+
+def refresh_context():
+    """
+    Refresh mosaic context, using previously configured information.
+    - This is needed when spark configs change, such as for checkpointing.
+    """
+    config.mosaic_context.jContextReset()
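A short usage sketch for the helpers added above, assuming a Databricks notebook with `spark` predefined; the positional `spark` argument to `enable_mosaic` and the printed values are assumptions, while the function names come from this diff:

import mosaic as mos
from mosaic.api.enable import refresh_context

mos.enable_mosaic(spark)

print(mos.get_install_version())   # e.g. "0.4.3", from pip metadata for databricks-mosaic
print(mos.get_install_lib_dir())   # site-packages dir containing the bundled mosaic JAR

# After flipping checkpoint-related spark configs, rebuild the JVM context:
spark.conf.set("spark.databricks.labs.mosaic.raster.use.checkpoint", "true")
refresh_context()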