-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to Dask page in Xarray docs #9495
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. |
for more information, see https://pre-commit.ci
|
||
ds.to_netcdf("manipulated-example-data.nc") | ||
When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable ``HDF5_USE_FILE_LOCKING=FALSE`` to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see anything in the release notes about a fix/update for this, so I'm assuming it's still true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a link to the GitHub issue to track this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a corresponding open issue, but looks like Joe added it in #1793.
I noticed https://github.com/pydata/xarray/issues/1836
is referenced, but I think this is a slightly different issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scharlottej13 for working on this! Some parts of this sorely needed an update. I've left a few comments.
Also FYI there is also https://tutorial.xarray.dev/intermediate/xarray_and_dask.html to be aware of.
doc/user-guide/dask.rst
Outdated
.. tab:: HDF5 | ||
|
||
from dask.diagnostics import ProgressBar | ||
Open HDF5 files with :py:func:`~xarray.open_dataset`:: | ||
|
||
# or distributed.progress when using the distributed scheduler | ||
delayed_obj = ds.to_netcdf("manipulated-example-data.nc", compute=False) | ||
with ProgressBar(): | ||
results = delayed_obj.compute() | ||
xr.open_dataset("/path/to/my/file.h5", chunks='auto') | ||
|
||
.. ipython:: python | ||
:suppress: | ||
See :ref:`io.hdf5` for more details. | ||
|
||
os.remove("manipulated-example-data.nc") # Was not opened. | ||
.. tab:: GeoTIFF | ||
|
||
.. note:: | ||
Open large geoTIFF files with rioxarray:: | ||
|
||
When using Dask's distributed scheduler to write NETCDF4 files, | ||
it may be necessary to set the environment variable `HDF5_USE_FILE_LOCKING=FALSE` | ||
to avoid competing locks within the HDF5 SWMR file locking scheme. Note that | ||
writing netCDF files with Dask's distributed scheduler is only supported for | ||
the `netcdf4` backend. | ||
xds = rioxarray.open_rasterio("my-satellite-image.tif", chunks='auto') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Xarray has dedicated documentation on IO for different formats. Unless these functions behaves differently when used with dask than one would expect from the above explanation, I suggest either moving these examples to the dedicated IO docs page or leaving them out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean this page https://docs.xarray.dev/en/stable/user-guide/io.html, right?
When I went through those sections they seemed more focused on the file format than Dask, so it seemed like adding in Dask-specific snippets or explanations would be overly complicated for people who don't need to use Dask.
My thought with adding tabs to the Dask page is they quickly show which file formats you can use with Dask, a simple example, and then link out to different sections of https://docs.xarray.dev/en/stable/user-guide/io.html.
Definitely defer to what you think is a better fit for the Xarray docs, though!
doc/user-guide/dask.rst
Outdated
from flox.xarray import xarray_reduce | ||
import xarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example seem unconnected to the surrounding text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was wondering if this made sense.
I was trying to have the code snippet be a demonstrative, simple example of putting some of the optimization tips together. I swapped the order and added a sentence of explanation. Maybe that's better? Open to other ideas!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @scharlottej13 ! this definitely needed an update :)
|
||
ds.to_netcdf("manipulated-example-data.nc") | ||
When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable ``HDF5_USE_FILE_LOCKING=FALSE`` to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a link to the GitHub issue to track this one?
doc/user-guide/dask.rst
Outdated
*eager*, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method: | ||
To do this, you can use :py:meth:`~xarray.Dataset.load`, which is similar to :py:meth:`~xarray.Dataset.compute`, but instead changes results in-place: | ||
|
||
.. ipython:: python | ||
|
||
ds.load() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's emphasize .compute()
(and .persist()
instead of .load()
(which is a strange method that modifies an xarray object in place)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to mention load()
at all? I added a note that it's recommended people use compute()
instead
Co-authored-by: Tom Nicholas <tom@cworthy.org> Co-authored-by: Stephan Hoyer <shoyer@google.com>
for more information, see https://pre-commit.ci
@shoyer @TomNicholas thank you both for the review! I believe I've addressed all of your comments, but please let me know if I missed anything. |
@shoyer @TomNicholas gentle ping here, thanks again for the thorough review! |
@shoyer @TomNicholas another gentle ping here, thanks again! |
* main: (125 commits) http:// → https:// (pydata#9748) Discard useless `!s` conversion in f-string (pydata#9752) Apply ruff/flake8-simplify rule SIM401 (pydata#9749) Use micromamba 1.5.10 where conda is needed (pydata#9737) pin array-api-strict<=2.1 (pydata#9751) Reorganise ruff rules (pydata#9738) use new conda-forge package pydap-server (pydata#9741) Enforce ruff/flake8-pie rules (PIE) (pydata#9740) Enforce ruff/flake8-comprehensions rules (C4) (pydata#9724) Enforce ruff/Perflint rules (PERF) (pydata#9730) Apply ruff rule RUF007 (pydata#9739) chmod -x (pydata#9725) Aplpy ruff rules (RUF) (pydata#9731) Fix typos found by codespell (pydata#9721) support for additional scipy nd interpolants (pydata#9599) Apply ruff/flake8-simplify rules (SIM) (pydata#9727) Apply ruff/flake8-implicit-str-concat rules (ISC) (pydata#9722) Apply ruff/flake8-pie rules (PIE) (pydata#9726) Enforce ruff/pygrep-hooks rules (PGH) (pydata#9729) Move to micromamba 2 (pydata#9732) ...
Sorry for the really long delay here @scharlottej13 |
Thank you @dcherian!! |
* main: fix html repr indexes section (pydata#9768) Bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.2 in the actions group (pydata#9763) unpin array-api-strict, as issues are resolved upstream (pydata#9762) rewrite the `min_deps_check` script (pydata#9754) CI runs ruff instead of pep8speaks (pydata#9759) Specify copyright holders in main license file (pydata#9756) Compress PNG files (pydata#9747) Dispatch to Dask if nanquantile is available (pydata#9719) Updates to Dask page in Xarray docs (pydata#9495) http:// → https:// (pydata#9748) Discard useless `!s` conversion in f-string (pydata#9752) Apply ruff/flake8-simplify rule SIM401 (pydata#9749) Use micromamba 1.5.10 where conda is needed (pydata#9737) pin array-api-strict<=2.1 (pydata#9751) Reorganise ruff rules (pydata#9738) use new conda-forge package pydap-server (pydata#9741)
whats-new.rst
api.rst
Hi! It's been a while since the Dask docs page had a comprehensive update, so thought I'd open a PR with some suggestions.
I chatted a little bit about updating this page with @jhamman and @dcherian, some of the things I was trying to keep in mind:
There are certainly a lot of things I wasn't sure about (I'm certainly not an xarray expert). I'll comment in-line with questions.
cc @phofl @jrbourbeau