
Open Kerchunk refs as Virtual Dataset #119

Open · wants to merge 3 commits into main

Conversation

@norlandrhagen (Collaborator) commented May 16, 2024

Start of PR to address #118.

Lots of open questions!

  • How should we read .parquet files into KerchunkStoreRefs to pass into dataset_from_kerchunk_refs?
  • Round-tripping JSON seems to lose _ARRAY_DIMENSIONS.

Would love some feedback @jsignell!
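For what it's worth, a plain JSON round trip of a refs dict does preserve _ARRAY_DIMENSIONS, since the attribute lives inside the JSON-encoded .zattrs entries of the refs mapping. So the loss presumably happens in how attrs are reconstructed, not in serialization itself. A toy sketch (the refs below are made up for illustration, not the PR's actual data):

```python
# Toy kerchunk-style refs dict (hand-made for illustration):
# _ARRAY_DIMENSIONS lives inside the JSON-encoded .zattrs entries.
import json

refs = {
    "version": 1,
    "refs": {
        "lat/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["lat"]}),
        "lat/0": ["data.nc", 100, 400],
    },
}

# write-then-read round trip through JSON
rt = json.loads(json.dumps(refs))
zattrs = json.loads(rt["refs"]["lat/.zattrs"])
assert "_ARRAY_DIMENSIONS" in zattrs  # survives a plain JSON round trip
```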

@TomNicholas added the "references generation" label (Reading byte ranges from archival files) May 17, 2024
@norlandrhagen (Collaborator, Author) commented May 17, 2024

Update: @TomNicholas and I dug a bit deeper into the failing JSON round trip. On visual inspection the underlying structure and data seem identical, but xarray.testing.assert_equal disagrees.

vds.lat.data.manifest.dict() == rt_vds.lat.data.manifest.dict() asserts to True.


@norlandrhagen (Collaborator, Author) commented

Also, some weird behavior where virtualize.to_kerchunk seems to be adding _ARRAY_DIMENSIONS?



        vds = dataset_from_kerchunk_refs(refs_dict)
        return vds
    elif kerchunk_storage_ftype == ".parquet":
Contributor:

Parquet files are not required to have this suffix; for instance, ".parq" is also very common. I'm not sure there is a better way to tell the type of file, though.


    # Question: How should we read the parquet files
    # into a dict to pass into dataset_from_kerchunk_refs?
    # pandas, pyarrow table, duckdb?
Contributor:

I feel like pandas would be a fine way to get things working, and then you can always switch it out.

    fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
    fpath = fsspec.filesystem(protocol, **storage_options)
    if universal_filepath.is_file():
        fpath = fpath.open(filepath)
Contributor:

I'm having trouble figuring out what motivated these changes.
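My guess (an assumption, not confirmed by the PR author): JSON refs are a single openable file, while parquet refs are a directory of per-variable files, so the filesystem object has to be kept around before deciding whether to call .open(). A standalone sketch of that pattern:

```python
# Sketch of the file-vs-directory split: JSON refs are one file,
# parquet refs are a directory, so build the filesystem first and
# only .open() when the target is actually a file.
import os
import tempfile

import fsspec

d = tempfile.mkdtemp()
json_path = os.path.join(d, "refs.json")
with open(json_path, "w") as fh:
    fh.write("{}")

fs = fsspec.filesystem("file")  # protocol would be inferred from the URL
if fs.isfile(json_path):
    with fs.open(json_path) as f:
        contents = f.read()  # single JSON refs file
else:
    contents = fs.ls(json_path)  # directory of parquet refs
```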

    def test_kerchunk_to_virtual_dataset(netcdf4_file, tmpdir, format):
        vds = open_virtual_dataset(netcdf4_file, indexes={})

    # QUESTION: should these live in a fixture? e.g. kerchunk_ref_fpath_json, kerchunk_ref_fpath_parquet
Contributor:

Eh, you kind of want the original vds as well as the kerchunk refs, so I think it is fine as is.

@TomNicholas (Collaborator) commented

I just tried to merge main into this because @kthyng is interested in picking it up.

> Also, some weird behavior where virtualize.to_kerchunk seems to be adding _ARRAY_DIMENSIONS?

I'm pretty sure I fixed this in #153

EDIT: Looks like I broke something in the defaults for the fsspec reader though, oops.
