Xarray NetCDFDataSet #165

lucasjamar · 2022-03-15T08:31:26Z

Description

Read and write netcdf file into Xarray.
https://xarray.pydata.org/en/stable/user-guide/io.html#netcdf

Context

Should attract the weather data science community to use kedro :)

Possible Implementation

Should be quite similar to pandas.CSVDataSet. I'll give this implementation a shot in my free time.

Has anyone ever implemented such a custom dataset already?

datajoely · 2022-03-15T11:36:12Z

@lucasjamar to my knowledge this hasn't been requested before - we'd absolutely appreciate a PR :)

antonymilne · 2022-03-16T09:12:11Z

Always interesting to hear about communities I'm not familiar with and how they might use kedro 🙂 🎉 👍 FYI @noklam - am I right in thinking you're from geoscience originally?

noklam · 2022-03-16T10:17:51Z

@AntonyMilneQB Close enough! I did Earth System Science.

noklam · 2022-03-29T10:27:01Z

@lucasjamar Hi Lucas, just checking in to see if you need any help with this.

lucasjamar · 2022-03-29T12:12:23Z

Hi @noklam,

Quite busy right now sorry. Im hoping to have a look at this over easter

noklam · 2022-03-29T12:32:12Z

@lucasjamar No worries! Looking forward to see your PR😀.

noklam · 2022-05-19T16:46:45Z

@lucasjamar Are you still interested in this?

lucasjamar · 2022-05-30T11:33:12Z

@noklam I started during the holiday break but didn't get very far. Im afraid i wont find the time to get any further with this... Terribly sorry

lucasjamar · 2022-06-08T09:29:12Z

Sorry this is as far as i could get:

"""``GenericDataSet`` loads/saves data from/to a NetCDF file using an underlying
filesystem (e.g.: local, S3, GCS). It uses xarray to handle the NetCDF file.
"""
import logging
from copy import deepcopy
from io import BytesIO
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import xarray as xr

from kedro.io.core import (
    PROTOCOL_DELIMITER,
    AbstractVersionedDataSet,
    DataSetError,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)

logger = logging.getLogger(__name__)


class GenericDataSet(AbstractVersionedDataSet):
    """``GenericDataSet`` loads/saves data from/to a file using an underlying
    filesystem (e.g.: local, S3, GCS). It uses xarray to handle the file.

    Example adding a catalog entry with
    `YAML API
    <https://kedro.readthedocs.io/en/stable/data/\
        data_catalog.html#using-the-data-catalog-with-the-yaml-api>`_:

    .. code-block:: yaml

        >>> january:
        >>>   type: xarray.GenericDataSet
        >>>   filepath: data/01_raw/weather/january.nc
        >>>   load_args:
        >>>     engine: netcdf4
        >>>     decode_times: True
        >>>   save_args:
        >>>     index: False
        >>>     date_format: "%Y-%m-%d %H:%M"
        >>>
        >>> motorbikes:
        >>>   type: xarray.GenericDataSet
        >>>   filepath: gcs://your_bucket/weater.ya
        >>>   credentials: dev_s3
        >>>


    Example using Python API:
    ::

        >>> from kedro.extras.datasets.xarray import GenericDataSet
        >>> import numpy as np
        >>> import pandas as pd
        >>> import xarray as xr
        >>>
        >>> data = xr.Dataset(
        >>>     {"foo": (("x", "y"), np.random.rand(4, 5))},
        >>>     coords={
        >>>         "x": [10, 20, 30, 40],
        >>>         "y": pd.date_range("2000-01-01", periods=5),
        >>>         "z": ("x", list("abcd")),
        >>>     },
        >>> )
        >>>
        >>> # data_set = GenericDataSet(filepath="gcs://bucket/test.nc")
        >>> data_set = GenericDataSet(filepath="test.nc")
        >>> data_set.save(data)
        >>> reloaded = data_set.load()
        >>> assert data.equals(reloaded)

    """

    DEFAULT_LOAD_ARGS = {}  # type: Dict[str, Any]
    DEFAULT_SAVE_ARGS = {"index": False}  # type: Dict[str, Any]

    # pylint: disable=too-many-arguments
    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
    ) -> None:
        """Creates a new instance of ``GenericDataSet`` pointing to a concrete NetCDF file
        on a specific filesystem.

        Args:
            filepath: Filepath in POSIX format to a NetCDF file prefixed with a protocol like `s3://`.
                If prefix is not provided, `file` protocol (local filesystem) will be used.
                The prefix should be any protocol supported by ``fsspec``.
                Note: `http(s)` doesn't support versioning.
            load_args: Pandas options for loading NetCDF files.
                Here you can find all available arguments:
                https://xarray.pydata.org/xarray-docs/stable/generated/xarray.read_nc.html
                All defaults are preserved.
            save_args: Pandas options for saving NetCDF files.
                Here you can find all available arguments:
                https://xarray.pydata.org/xarray-docs/stable/generated/xarray.DataFrame.to_nc.html
                All defaults are preserved, but "index", which is set to False.
            version: If specified, should be an instance of
                ``kedro.io.core.Version``. If its ``load`` attribute is
                None, the latest version will be loaded. If its ``save``
                attribute is None, save version will be autogenerated.
            credentials: Credentials required to get access to the underlying filesystem.
                E.g. for ``GCSFileSystem`` it should look like `{"token": None}`.
            fs_args: Extra arguments to pass into underlying filesystem class constructor
                (e.g. `{"project": "my-project"}` for ``GCSFileSystem``).
        """
        _fs_args = deepcopy(fs_args) or {}
        _credentials = deepcopy(credentials) or {}

        protocol, path = get_protocol_and_path(filepath, version)
        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)

        self._protocol = protocol
        self._storage_options = {**_credentials, **_fs_args}
        self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        if "storage_options" in self._save_args or "storage_options" in self._load_args:
            logger.warning(
                "Dropping `storage_options` for %s, "
                "please specify them under `fs_args` or `credentials`.",
                self._filepath,
            )
            self._save_args.pop("storage_options", None)
            self._load_args.pop("storage_options", None)

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=self._filepath,
            protocol=self._protocol,
            load_args=self._load_args,
            save_args=self._save_args,
            version=self._version,
        )

    def _load(self) -> xr.Dataset:
        load_path = str(self._get_load_path())
        if self._protocol == "file":
            # file:// protocol seems to misbehave on Windows
            # (<urlopen error file not on local host>),
            # so we don't join that back to the filepath;
            # storage_options also don't work with local paths
            return xr.open_dataset(load_path, **self._load_args)

        load_path = f"{self._protocol}{PROTOCOL_DELIMITER}{load_path}"
        return xr.open_dataset(
            load_path, storage_options=self._storage_options, **self._load_args
        )

    def _save(self, data: xr.DataArray) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)

        buf = BytesIO()
        data.to_nc(path_or_buf=buf, **self._save_args)

        with self._fs.open(save_path, mode="wb") as fs_file:
            fs_file.write(buf.getvalue())

        self._invalidate_cache()

    def _exists(self) -> bool:
        try:
            load_path = get_filepath_str(self._get_load_path(), self._protocol)
        except DataSetError:
            return False

        return self._fs.exists(load_path)

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)

noklam · 2022-06-08T09:37:00Z

@lucasjamar Hey thanks for making time for this! Do you want to make this a PR? It would be easier for me to drop comments on it. :)

jamespolly · 2022-07-06T18:42:12Z

Xarray functionality in the Data Catalog would be a big deal for me. Is there an update on this?

antonymilne · 2022-07-06T19:10:39Z

No, unfortunately not. Feel free to try building on @lucasjamar's start above and raise a PR for it though 🙂

noklam · 2022-07-06T22:07:04Z

At a glance making just NetCDF works is straight forward, but seems @lucasjamar is going with a GenericXarrayDataSet. I wonder if it makes sense to take the pandas approach with xarray.NetCDFDataSet, xarray.PickleDataSet etc.

Here is a quick implementation, mostly just copy from above

class NetCDFDataSet(AbstractVersionedDataSet):
    """``GenericDataSet`` loads/saves data from/to a file using an underlying
    filesystem (e.g.: local, S3, GCS). It uses xarray to handle the file.

    Example adding a catalog entry with
    `YAML API
    <https://kedro.readthedocs.io/en/stable/data/\
        data_catalog.html#using-the-data-catalog-with-the-yaml-api>`_:

    .. code-block:: yaml

        >>> january:
        >>>   type: xarray.GenericDataSet
        >>>   filepath: data/01_raw/weather/january.nc
        >>>   load_args:
        >>>     engine: netcdf4
        >>>     decode_times: True
        >>>   save_args:
        >>>     index: False
        >>>     date_format: "%Y-%m-%d %H:%M"
        >>>
        >>> motorbikes:
        >>>   type: xarray.GenericDataSet
        >>>   filepath: gcs://your_bucket/weater.ya
        >>>   credentials: dev_s3
        >>>


    Example using Python API:
    ::

        >>> from kedro.extras.datasets.xarray import GenericDataSet
        >>> import numpy as np
        >>> import pandas as pd
        >>> import xarray as xr
        >>>
        >>> data = xr.Dataset(
        >>>     {"foo": (("x", "y"), np.random.rand(4, 5))},
        >>>     coords={
        >>>         "x": [10, 20, 30, 40],
        >>>         "y": pd.date_range("2000-01-01", periods=5),
        >>>         "z": ("x", list("abcd")),
        >>>     },
        >>> )
        >>>
        >>> # data_set = GenericDataSet(filepath="gcs://bucket/test.nc")
        >>> data_set = GenericDataSet(filepath="test.nc")
        >>> data_set.save(data)
        >>> reloaded = data_set.load()
        >>> assert data.equals(reloaded)

    """

    DEFAULT_LOAD_ARGS = {}  # type: Dict[str, Any]
    DEFAULT_SAVE_ARGS = {}  # type: Dict[str, Any]

    # pylint: disable=too-many-arguments
    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
    ) -> None:
        """Creates a new instance of ``GenericDataSet`` pointing to a concrete NetCDF file
        on a specific filesystem.

        Args:
            filepath: Filepath in POSIX format to a NetCDF file prefixed with a protocol like `s3://`.
                If prefix is not provided, `file` protocol (local filesystem) will be used.
                The prefix should be any protocol supported by ``fsspec``.
                Note: `http(s)` doesn't support versioning.
            load_args: Pandas options for loading NetCDF files.
                Here you can find all available arguments:
                https://xarray.pydata.org/xarray-docs/stable/generated/xarray.read_nc.html
                All defaults are preserved.
            save_args: Pandas options for saving NetCDF files.
                Here you can find all available arguments:
                https://xarray.pydata.org/xarray-docs/stable/generated/xarray.DataFrame.to_nc.html
                All defaults are preserved, but "index", which is set to False.
            version: If specified, should be an instance of
                ``kedro.io.core.Version``. If its ``load`` attribute is
                None, the latest version will be loaded. If its ``save``
                attribute is None, save version will be autogenerated.
            credentials: Credentials required to get access to the underlying filesystem.
                E.g. for ``GCSFileSystem`` it should look like `{"token": None}`.
            fs_args: Extra arguments to pass into underlying filesystem class constructor
                (e.g. `{"project": "my-project"}` for ``GCSFileSystem``).
        """
        _fs_args = deepcopy(fs_args) or {}
        _fs_open_args_load = _fs_args.pop("open_args_load", {})
        _fs_open_args_save = _fs_args.pop("open_args_save", {})
        _credentials = deepcopy(credentials) or {}

        protocol, path = get_protocol_and_path(filepath, version)
        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)

        self._protocol = protocol
        self._storage_options = {**_credentials, **_fs_args}
        self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        if "storage_options" in self._save_args or "storage_options" in self._load_args:
            logger.warning(
                "Dropping `storage_options` for %s, "
                "please specify them under `fs_args` or `credentials`.",
                self._filepath,
            )
            self._save_args.pop("storage_options", None)
            self._load_args.pop("storage_options", None)

        _fs_open_args_save.setdefault("mode", "wb")
        self._fs_open_args_load = _fs_open_args_load
        self._fs_open_args_save = _fs_open_args_save

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=self._filepath,
            protocol=self._protocol,
            load_args=self._load_args,
            save_args=self._save_args,
            version=self._version,
        )

    def _load(self) -> xr.Dataset:
        load_path = str(self._get_load_path())
        if self._protocol == "file":
            # file:// protocol seems to misbehave on Windows
            # (<urlopen error file not on local host>),
            # so we don't join that back to the filepath;
            # storage_options also don't work with local paths
            return xr.open_dataset(load_path, **self._load_args)

        load_path = f"{self._protocol}{PROTOCOL_DELIMITER}{load_path}"
        return xr.open_dataset(
            load_path, storage_options=self._storage_options, **self._load_args
        )

    def _save(self, data: xr.DataArray) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
            data.to_netcdf(fs_file, **self._save_args)

        self._invalidate_cache()

    def _exists(self) -> bool:
        try:
            load_path = get_filepath_str(self._get_load_path(), self._protocol)
        except DataSetError:
            return False

        return self._fs.exists(load_path)

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)

antonymilne · 2022-07-06T22:24:47Z

Would it make sense to have a file_format argument and then get the right xr.to/from_*** function based on that? I'm not sure how standardised the xarray load/save API is. This is what we do in pandas.GenericDataSet:
https://github.com/kedro-org/kedro/blob/fdc6be73408852b09fcc16370706257b918994d5/kedro/extras/datasets/pandas/generic_dataset.py#L187-L188

noklam · 2022-07-06T22:33:35Z

I am actually not aware of the pandas.GenericDataSet😅 do people actually use that and do we keep both for backward compatiblitity reason? I think it is possible since the API is quite similar.

Btw I saw the invalidate_cache method for some datasets are public but the majority are private, does that make sense if we just remove that and put it in the abstractClass? We have repeated that 71 times😅

antonymilne · 2022-07-07T08:32:27Z

Yeah, pandas.GenericDataSet isn't well known. It was added relatively recently so that we don't have to constantly be catching up with pandas when they add a new format: kedro-org/kedro#987. It's not really designed to replace the pandas.CSVDataSet etc. - just to cater for any new formats where we don't have a dedicated dataset. But at some point it might make sense to deprecate the dedicated ones in favour of the generic.

Btw I saw the invalidate_cache method for some datasets are public but the majority are private, does that make sense if we just remove that and put it in the abstractClass? We have repeated that 71 times😅

Probably yes. The same could probably be said for _exists, _release, the repetition of save_path = get_filepath_str(self._filepath, self._protocol), etc. etc. It would be nice if we could reduce the repeated boiler plate that's in every dataset definition, and I don't think it would be too hard to do. Possibly there's some reason for keeping it, so let's not remove it straight away without getting some more input. I would consider it part of my "fundamental issue" in kedro-org/kedro#1691 - we should consider generally what exactly belongs in the abstract classes vs. the implementations.

noklam · 2022-07-07T09:48:49Z

Not sure if I like the idea of deprecating the dedicated one, but this is not urgent so I will think more about it. I added the dataset related refactoring piece into kedro-org/kedro#1691 in your comments, it's a bit easier to keep track of when we have a dedicated issue for that.

datajoely · 2022-07-07T10:05:36Z

Yeah the main reason why I pushed for pandas.GenericDataSet is because they introduced pd.read_xml and it felt like we were just always in reactive mode and could at least provide a generic way of users leveraging pandas features without us needing to push a release.

The pandas API is also really in consistently, for file paths I had to rely on the assumption that the first positional argument is going to be the path argument since the library uses a different kwarg because each reader/writer was contributed independently.

In the Spark world we're able to only provide a generic version of the dataset since it's much better designed, I think in situations like xarray and dask we can probably do the same.

jamespolly · 2022-07-07T14:36:14Z

I really appreciate the responses. 🙏

Re: the "pandas approach" mentioned by @noklam, I think there are certainly a few xarray-supported file types that could benefit from built-in handling in kedro. NetCDF, Zarr, and GeoTIFF (via rioxarray) specifically. Maybe that supports moving away from a generic dataset, but I also appreciate the difficulty in supporting everything.

In the selfish short term it looks like I can use the implementation provided by @noklam above in combination with these instructions to get done what I need to get done. I can test out these implementations and report back changes I see as necessary.

antonymilne · 2022-07-07T15:03:56Z

Definitely that sounds like a good plan in the short term - you can easily use the custom dataset implementation without it being added to core kedro. Just copy and paste the above into a file and modify the dataset type to point to that class. This would be a great way to test it out and see how well it all works.

jamespolly · 2022-07-15T20:23:24Z

An update and a couple questions about partitioned datasets and lazy evaluation. The implementation discussed above works well. I've only tested it on local file systems.

TL;DR: To not lose advantages of working with xarray we need to concatenate many .nc files into a big dataset and then operate on it as lazily as possible. I want to keep benefits of both xarray and Kedro.

When saving data to the Data Catalog, is the create_partitions approach I'm employing here best/as-intended by Kedro? I'm using a dictionary of {ids: xarray.DataSets}.
When not saving to the Data Catalog and passing node outputs ("in memory") to the inputs of the next node, is this being done lazily or are things being loaded? Dask is the backend for xarray if this matters, and what Kedro means by "in memory" starts to get fuzzy for me with data types supporting lazy eval.

More on this:
When obtaining NetCDF files from a remote source, which are being saved for the first time in a pipeline (e.g., saving to /data/01_raw), it is common (outside of Kedro) to do this with a for loop, saving every iteration. With Kedro handling file I/O in a partitioned dataset, is it best to make a function that builds a dictionary of unique partition ID keys with corresponding xarray.DataSet values?

This approach (example shown below) works, but I'm not sure if it is loading all this data into memory and ruining the lazy evaluation aspect. The only way I know to test this is to htop while increasing the number of datasets I'm downloading to exceed the memory of my machine. Hoping to get some feedback and learn before I do that.

In this pipeline I would have a node defined in pipeline.py as:

node(
    func=create_data_partitions,
    inputs=["params:starttime", "params:endtime"],
    outputs="downloaded_data", # defined in catalog.yml
    name="download_forecast_node",
    )

With the corresponding function definition (which as shown here is a very simplified/broken example to highlight dictionary creation aspect) in nodes.py as:

def create_data_partitions(starttime: Dict, endtime: Dict) -> Dict[str, Any]:
    """Create new partitions and save using PartitionedDataSet.

    returns: Dictionary of the partitions to create.
    """
    duration = endtime - starttime
    part_lkup = {}
    for i in range(duration):
        tmptime = starttime + timedelta(hours=i)
        ds = get_data(tmptime) # downloads a single xarray dataset
        part_id = str(tmptime)+'_downloaded'
        part_lkup[part_id] = ds

    return part_lkup

I've used this pattern in multiple places, and it works, but again it's not clear to me that it is the best way to go.

Another example, when I start a processing pipeline on this downloaded data, I'll open the downloaded data files but concatenate them into a single, large (but not loaded into memory) xarray.DataSet. In doing so (as shown below) I am attempting to do what xr.open_mfdataset will do when given a directory full of .nc files.

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=create_data_partitions,
                inputs=["params:starttime", "params:endtime"],
                outputs="downloaded_data", # defined in catalog.yml
                name="download_forecast_node",
            ),
            node(
                func=concat_partitions,
                inputs="downloaded_data",
                outputs="concat_ds", # not in catalog, in memory, hopefully still lazy and not actually loaded
                name="concat_download_ds_node",
            ),
            node(
                func=cleanup_data,
                inputs="concat_ds",
                outputs="clean_ds", # not in catalog, in memory, hopefully still lazy and not actually loaded
                name="clean_ds_node",
            ),
            node(
                func=create_partitions,
                inputs="clean_ds",
                outputs="forecast_clean",
                name="forecast_clean_node", # defined in catalog.yml
            ),
        ]
    )

Where the concat_partitions node uses a pattern similar to those in the Kedro docs:

def concat_partitions(partitioned_input: Dict[str, Callable[[], Any]], concat_dim: str = 'time') -> xr.Dataset:
    """Concatenate input partitions into one xarray Dataset.

    param: partitioned_input: A dictionary with partition ids as keys and load functions as values.
    param: concat_dim: existing dimension in each netcdf file along which to concatenate all datasets.

    return: xarray Dataset containing all loaded partitions.
    """

    return xr.concat([partitioned_input[i]() for i in sorted(partitioned_input)], dim=concat_dim)

and where the create_partitions node uses my same uncertain/questionable pattern at the heart of this question:

def create_partitions(ds:xr.Dataset) -> Dict[str, Any]:
    """Create new partitions and save using PartitionedDataSet.

    returns: Dictionary of the partitions to create.
    """
    times = ds.time.values
    part_lkup = {}
    for tmptime in times:
        tmp_ds = ds.sel(time=slice(tmptime, tmptime))
        part_id = str(tmptime)[:13]+'_cleaned_data'
        part_lkup[part_id] = tmp_ds

    return part_lkup

datajoely · 2022-07-18T07:18:29Z

@jamespolly thank you for your very detailed write up - @deepyaman do you want to chime in with your thoughts?

NowanIlfideme · 2023-02-16T15:02:24Z

One issue I've run into with xarray.load_dataset() and ds.to_netcdf() functions is that they still don't allow reading/writing from bytes, only from local files. Would be nice if I could just use Kedro to abstract this away. 😄

astrojuanlu · 2023-08-22T16:32:06Z

Talking to some folks working with satellite data, would be cool to have this 🛰️

riley-brady · 2023-09-26T20:21:02Z

Looking to open a PR on this in the next week or so.

noklam · 2023-09-26T20:35:23Z

@riley-brady that's awesome!

galenseilis · 2023-09-26T22:31:13Z

Looking to open a PR on this in the next week or so.

As a PyMC user I am excited by this news!

riley-brady · 2023-09-26T22:35:10Z

Glad to see the excitement! I'm bundling it with a Zarr implementation, since those are commonly used together. We've got a nicely working implementation on our climate team for both, targeting an AWS platform. Let's start with stripping it down to a local-working implementation and then see how we can build it up from there.

Zarr should work out of the box for remote read/write. NetCDF can't be read remotely natively (e.g. from an S3 bucket), so we pull it down to temp storage and then do the read. One could also do something like a kerchunk wrapper to generalize remote reads for NetCDF.

noklam · 2023-09-26T22:35:53Z

I talked to @tgoelles today and he is interested to contribute a dataset for geotiff and NetCDF. I am excited to see more contributions from the scientific community🔥

tgoelles · 2023-09-28T06:17:25Z

@noklam I implemented geotiff for now but the build on github actions fails for windows du to the dependency on GDAL

fork is here: https://github.com/tgoelles/kedro-plugins

Collecting rasterio (from rioxarray>=0.9.0->kedro-datasets==1.7.0)
Downloading rasterio-1.2.10.tar.gz (2.3 MB)
---------------------------------------- 2.3/2.3 MB 36.2 MB/s eta 0:00:00
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
error: subprocess-exited-with-error

Getting requirements to build wheel did not run successfully.
exit code: 1

[2 lines of output]
INFO:root:Building on Windows requires extra options to setup.py to locate needed GDAL files. More information is available in the README.
ERROR: A GDAL API version must be specified. Provide a path to gdal-config using a GDAL_CONFIG environment variable or use a GDAL_VERSION environment variable.
[end of output]

Is there a way to support geotiff only for Ubuntu for now? I don't want to look into windows issues, and I know that GDAL makes installations complex

astrojuanlu · 2023-09-28T06:47:18Z

I can help with GDAL on Windows 👍🏽 go ahead and open the PR!

KikiCS · 2023-09-29T15:19:42Z

I'm also really looking forward to this PR!

riley-brady · 2023-10-02T21:21:27Z

Hi folks! I put a decent stake in the ground here: https://github.com/kedro-org/kedro-plugins/pull/360/files. There's a few TODO's to deal with at the moment.

I have some pressing work deadlines I need to focus on so I might not get back to this for a couple of weeks. But wanted to provide the initial snippet for folks to test out/provide feedback on instead of going silent here.

I'd like to get this implemented with file syncing for load-from-remote, since it's most straight-forward. A future PR could work with kerchunk to allow direct loading from remote storage. This is a really nice toolkit, but requires management of a lot of JSON metadata files that are generated, and which sometimes can be quite slow to generate. It will take a little bit of tweaking to implement this nicely, since the first run would need to generate and cache/store all of the reference JSONs to make future loads much much faster.

riley-brady · 2023-10-13T20:53:58Z

Hi everyone! The PR is fully implemented with testing and ready for review. #360

astrojuanlu · 2023-10-23T17:08:35Z

Of interest for people subscribed to this issue: https://guide.cloudnativegeo.org (source)

riley-brady · 2024-02-28T18:01:21Z

I think this can be closed via #360!

astrojuanlu · 2024-02-28T18:14:41Z

Closed in #360 indeed! 🚀

merelcht added the Community Issue/PR opened by the open-source community label Mar 15, 2022

noklam mentioned this issue Jul 18, 2022

Fix broken documentation of PartitionedDataSet kedro-org/kedro#1710

Merged

5 tasks

lancechua mentioned this issue Oct 16, 2022

Easier CustomDataset Creation kedro-org/kedro#1936

Open

merelcht transferred this issue from kedro-org/kedro Apr 12, 2023

tgoelles mentioned this issue Sep 28, 2023

feat(datasets): Add rioxarray and RasterDataset #355

Merged

4 tasks

riley-brady mentioned this issue Oct 2, 2023

feat(datasets): Add NetCDFDataSet class #360

Merged

4 tasks

astrojuanlu mentioned this issue Oct 18, 2023

Document less conservative Python support policy kedro-org/kedro#2815

Closed

astrojuanlu closed this as completed Feb 28, 2024

astrojuanlu mentioned this issue Mar 12, 2024

Decide on definitions of regular and experimental contributions #583

Closed

Xarray NetCDFDataSet #165

Xarray NetCDFDataSet #165

Comments

lucasjamar commented Mar 15, 2022

Description

Context

Possible Implementation

datajoely commented Mar 15, 2022

antonymilne commented Mar 16, 2022

noklam commented Mar 16, 2022

noklam commented Mar 29, 2022

lucasjamar commented Mar 29, 2022

noklam commented Mar 29, 2022

noklam commented May 19, 2022

lucasjamar commented May 30, 2022

lucasjamar commented Jun 8, 2022 • edited by noklam Loading

noklam commented Jun 8, 2022

jamespolly commented Jul 6, 2022

antonymilne commented Jul 6, 2022

noklam commented Jul 6, 2022

antonymilne commented Jul 6, 2022 • edited Loading

noklam commented Jul 6, 2022

antonymilne commented Jul 7, 2022

noklam commented Jul 7, 2022

datajoely commented Jul 7, 2022

jamespolly commented Jul 7, 2022

antonymilne commented Jul 7, 2022

jamespolly commented Jul 15, 2022

datajoely commented Jul 18, 2022

NowanIlfideme commented Feb 16, 2023

astrojuanlu commented Aug 22, 2023

riley-brady commented Sep 26, 2023

noklam commented Sep 26, 2023

galenseilis commented Sep 26, 2023

riley-brady commented Sep 26, 2023

noklam commented Sep 26, 2023

tgoelles commented Sep 28, 2023 • edited Loading

astrojuanlu commented Sep 28, 2023

KikiCS commented Sep 29, 2023

riley-brady commented Oct 2, 2023

riley-brady commented Oct 13, 2023

astrojuanlu commented Oct 23, 2023

riley-brady commented Feb 28, 2024

astrojuanlu commented Feb 28, 2024

lucasjamar commented Jun 8, 2022 •

edited by noklam

Loading

antonymilne commented Jul 6, 2022 •

edited

Loading

tgoelles commented Sep 28, 2023 •

edited

Loading