Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dataset pickling #7

Merged
merged 2 commits into from
Apr 2, 2024
Merged

Fix dataset pickling #7

merged 2 commits into from
Apr 2, 2024

Conversation

aazuspan
Copy link
Contributor

There was a bug in _load_rasters_to_dataset that caused compute to fail with any Dask scheduler that requires pickling (anything but threads) because the file buffer used in raster loading isn't pickleable. The easy fix is just to load the file path built by resources rather than the file buffer.

For testing, I'm just directly checking that the dataset is pickleable. A more thorough test would be to fully compute the dataset with each scheduler to avoid any other surprise incompatibilities, but that would lead to some very slow tests as well as some complexity around setup/teardown with a LocalCluster (see dask/distributed#3540).

@grovduck let me know what you think about the testing here, and whether we should just bite the bullet and add in compute tests.

Computing a dataset using "processes" or a Client scheduler currently
fails due to a pickling error. This adds a corresponding test that will
need to be fixed.
`PosixPath.open()` returns a `BufferedReader` that is not pickleable,
breaking dask computation with `Client` or "processes" schedulers. This
still uses `resources` to identify data file paths, but opens directly
from the paths instead of file buffers to fix pickling.
@aazuspan aazuspan added the bug Something isn't working label Mar 26, 2024
@aazuspan aazuspan requested a review from grovduck March 26, 2024 19:08
@aazuspan aazuspan self-assigned this Mar 26, 2024
Copy link
Member

@grovduck grovduck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aazuspan, I think I understand this issue and the fix that you've implemented, namely that the path is pickleable, whereas the file handle is not.

Again, I'll yield to your expertise on just this simple "pickleable" test versus the full-blown compute tests. Can you say a bit more about what other types of incompatibilities you might expect with other schedulers? If you did go the full compute route, could you mark the tests as slow and skip them in CI? The issue that you referenced looks pretty thorny, though.

@aazuspan
Copy link
Contributor Author

aazuspan commented Apr 2, 2024

Can you say a bit more about what other types of incompatibilities you might expect with other schedulers?

I wouldn't say I expect other issues, but that's what I would've said before this issue too 😜

If you did go the full compute route, could you mark the tests as slow and skip them in CI?

Yeah, I think this is a good idea.

I'm leaning towards punting on this for now. We'll need to figure out a good way to test deferred Dask results once we get predictions up and running, so I'd suggest we revisit dataset tests once that system's in place.

@grovduck
Copy link
Member

grovduck commented Apr 2, 2024

Good with me!

@aazuspan
Copy link
Contributor Author

aazuspan commented Apr 2, 2024

Thanks for giving it a look!

@aazuspan aazuspan merged commit 270e909 into main Apr 2, 2024
5 checks passed
@aazuspan aazuspan deleted the fix-dataset-pickling branch April 2, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants