Feature request: Retry opening datasets #144

Open
charalamm opened this issue Feb 19, 2024 · 2 comments
Comments

@charalamm

Hello,

We are planning to use odc-stac for some analysis. Our data is on Azure and we access it with the az:// prefix. In every analysis, some file reads fail with network errors, which leaves data missing from the final data structure.

So far I have caught the following errors:

dns_problem_condition = "Could not resolve host" in str(ex)
dns_timeout_condition = "Resolving timed out after" in str(ex)
read_problem_condition = "not recognized as a supported" in str(ex)
write_problem = "Failure writing output to destination" in str(ex)
read_write_problem = "Read or write failed" in str(ex)
broken_pipe = "Broken pipe" in str(ex)

Do you think it would be useful to add a mechanism to retry reading on certain errors? I can work on a PR if you are interested in this feature. Feel free to close this issue if you are not interested.

A possible approach?

Since some of these errors can be legitimate, it should be up to the user to decide whether to retry and on which errors. One option would be to let the user define a list of regexes or strings, which odc-stac would check to decide whether to retry. One problem is that GDAL caches these errors, so it might be necessary to use CPL_VSIL_CURL_NON_CACHED.
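
To make the regex idea concrete, here is a minimal sketch of how such a check might look. It is not part of odc-stac; the names `compile_retry_patterns`, `should_retry` and `RETRY_PATTERNS` are hypothetical, and the patterns are simply a few of the messages listed above.

```python
import re
from typing import Iterable, List, Pattern


def compile_retry_patterns(patterns: Iterable[str]) -> List[Pattern[str]]:
    """Compile user-supplied regexes (or plain strings) once, up front."""
    return [re.compile(p) for p in patterns]


def should_retry(exc: BaseException, patterns: Iterable[Pattern[str]]) -> bool:
    """Return True if the exception message matches any user pattern."""
    message = str(exc)
    return any(p.search(message) for p in patterns)


# Hypothetical user configuration, taken from the error messages above.
RETRY_PATTERNS = compile_retry_patterns(
    [
        r"Could not resolve host",
        r"Resolving timed out after",
        r"Broken pipe",
    ]
)
```

Errors that do not match any pattern, such as a genuinely missing file, would be raised as usual, so the user stays in control of what counts as transient.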

@charalamm charalamm changed the title Retry opening datasets Feature request: Retry opening datasets Feb 19, 2024
@Kirill888
Member

Thanks for raising this @charalamm, better error handling and tracking is certainly needed, see #101. It can be a little bit tricky to support consistently across Dask and direct loads though. Right now, a major refactor of the loading code is taking place to support hyperspectral data sources. As part of that work we are adding an IO driver abstraction that allows users to bring their own loader, mostly to enable efficient access to data sources that rasterio/GDAL struggle with. Once completed, we should be in a much better position to experiment with various error handling approaches and to give library users more control over that aspect of things when they need it.

Initially that would be implemented with various forms of callbacks into user code to make a decision or to keep track of failures. As we develop a better understanding, we will provide non-code mechanisms, like your suggested regex-based matching. My concern is with the rasterio/GDAL boundary: at least in the past, it was not always possible to bubble up GDAL errors into Python code without losing some fidelity in error reporting (just because you see an error printed to stderr doesn't mean Python has access to that same information in the exception object).

In the meantime, have you experimented with settings available within GDAL, things like GDAL_HTTP_MAX_RETRY and others in the GDAL_HTTP* and CPL_VSIL_CURL* families? The fact that you suspect bit errors in the HTTP responses you receive is worrying. There have been cache corruption issues in GDAL in the past, but it could also be in your infrastructure, given that you also observe DNS errors (is this inside k8s?).
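
As a starting point, these options can be set as environment variables before any file is opened. The sketch below is only illustrative: the values are arbitrary, and the container name `my-container` is a placeholder, not something taken from this issue.

```python
import os

# Illustrative values only; tune them for your own network.
gdal_settings = {
    "GDAL_HTTP_MAX_RETRY": "5",          # number of retry attempts
    "GDAL_HTTP_RETRY_DELAY": "2",        # seconds to wait between retries
    "GDAL_HTTP_CONNECTTIMEOUT": "10",    # seconds allowed for connecting
    "GDAL_HTTP_TIMEOUT": "60",           # seconds allowed for a whole request
    # Placeholder path prefix: disables GDAL's caching under this location.
    "CPL_VSIL_CURL_NON_CACHED": "/vsiaz/my-container",
}

# Must run before GDAL opens any dataset in this process.
os.environ.update(gdal_settings)
```

Environment variables are per process, so with Dask the same settings also need to be present in every worker process.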

@charalamm
Author

Hello @Kirill888, thanks for your quick response.

Yes, unfortunately my network is not great.

I have experimented with the GDAL environment variables but did not notice any difference. I think that is because the reads fail with status code 500, or GDAL cannot even connect, so GDAL_HTTP_MAX_RETRY never gets activated. The only thing that worked for me (not with odc-stac but with stackstac) was to disable response caching, catch the errors, and retry from Python; however, I am not sure whether that added any performance overhead.
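
For reference, the Python-side retry looks roughly like the sketch below. It is a simplified illustration rather than the exact code used with stackstac: `read_with_retry`, `is_transient` and `TRANSIENT_MARKERS` are made-up names, `read` stands for whatever callable performs the actual read, and the exponential backoff is an assumption added here.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Substrings treated as transient network failures (from the list above).
TRANSIENT_MARKERS: Tuple[str, ...] = (
    "Could not resolve host",
    "Resolving timed out after",
    "Broken pipe",
)


def is_transient(exc: Exception) -> bool:
    """Return True if the error message looks like a network hiccup."""
    return any(marker in str(exc) for marker in TRANSIENT_MARKERS)


def read_with_retry(
    read: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call read(), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return read()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == attempts or not is_transient(exc):
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

The cost is that each failed read is paid for again in full, plus the sleep between attempts, so the overhead depends on how often the network misbehaves.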
