Identify corrupt data source in odc.stac.load #97

Closed
SandroGroth opened this issue Nov 24, 2022 · 11 comments

Comments

@SandroGroth

Hi all, first of all thanks for this great tool!

I'm currently trying to aggregate raster values of an xarray.Dataset created with odc-stac from several hundred STAC items, similar to this reproducible example:

import pystac_client
import planetary_computer
from odc.stac import load

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1", 
    modifier=planetary_computer.sign_inplace
) 

time_range = "2020-01-01/2020-12-31"
bbox = [-122.2751, 47.5469, -121.9613, 47.7458]

search = catalog.search(collections=["landsat-c2-l2"], bbox=bbox, datetime=time_range)
items = search.get_all_items()

data = load(items, bands=["green"], chunks={"x": 256, "y": 256})
res = data.sum(dim="time").compute()

If, however, one of the many items is corrupt, it is very hard to identify the faulty data source just from the RasterioIOError that gets raised:

# let's break the href of one Landsat item
items[0].assets['green'].href = "https://faulty_url"

Executing the code above then fails with:

Exception has occurred: RasterioIOError
HTTP response code: 404

During handling of the above exception, another exception occurred:

  File "/path/to/file", line 16, in <module>
    res = data.sum(dim="time").compute()

Is there an option to extend the logging in odc.stac.load in order to identify which Item/Asset rasterio wasn't able to open?

@Kirill888
Member

  1. If this is loading data from Planetary Computer, you need to sign URLs for it to work: use patch_url=planetary_computer.sign, see https://odc-stac.readthedocs.io/en/latest/notebooks/stac-load-S2-ms.html#Lazy-load-all-the-bands
  2. To understand what is failing during load, it is probably best to enable logging for the rasterio library, as this is the library we use for loading data.

@SandroGroth
Author

Thanks for the quick response!

I'm currently working with an internally hosted and maintained STAC catalog, where asset hrefs are sometimes broken. That's why it would be handy if, when a RasterioIOError is raised, the error message included the href of the file that could not be opened.

As suggested, I tried to get this information from rasterio by catching the RasterioIOError when it comes up and printing the filename property of the error:

import rasterio

try:
    data = load(items, bands=["green"], chunks={"x": 256, "y": 256})
    res = data.sum(dim="time").compute()
except rasterio.errors.RasterioIOError as e:
    # print the attribute itself, not the literal text "e.filename"
    print(e.filename)

... which unfortunately is None. I looked into the GDAL configuration options, but did not find a logging option that would log the href without producing a ton of messages when opening a larger list of items.

I guess it would be cool if odc.stac.load caught the error as well and additionally logged the href it attempted to open. In theory, something along the lines of:

except rasterio.errors.RasterioIOError as e:
    # log the href together with the original error message, then re-raise
    logger.error(f"Unable to open {asset.href}:\n{e}")
    raise

@Kirill888
Member

e.filename is None, but the exception message should contain the file being loaded:

import rasterio
import logging

logging.basicConfig(level="INFO")
logging.getLogger("rasterio").setLevel(logging.DEBUG)

try:
    x = rasterio.open("bad_tif.tif")
except rasterio.errors.RasterioIOError as e:
    print(f"{e}, {e.filename}")

odc-stac does not capture any rasterio errors; they all bubble up. We probably should have a "continue load even when some files failed to load" mode, with proper error reporting. I also recommend testing things like that without using Dask.

Also, 256px chunks are far too small. I recommend starting with 2048 and only going lower in special situations.

@Kirill888
Member

STAC item and asset information is all gone by the time loading happens inside odc-stac. Pixel reading might be happening on a remote instance (Dask), and STAC items can contain a lot of extra metadata, so we distill it down to essential info only. As a result, there is no way to link a failure back to a specific STAC item or asset at the moment.
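Since the item/asset link is lost by load time, one possible workaround is to probe every href before calling load and record which item each failure belongs to. The following is only a sketch: `find_broken_hrefs` is a hypothetical helper, not part of odc-stac or rasterio, and the opener is passed in so that the exception types to catch are stated explicitly by the caller.

```python
from typing import Callable, Iterable, List, Tuple


def find_broken_hrefs(
    assets: Iterable[Tuple[str, str, str]],
    opener: Callable[[str], object],
    errors: Tuple[type, ...] = (OSError,),
) -> List[Tuple[str, str, str, str]]:
    """Try to open each (item_id, asset_key, href) and return the failures.

    Hypothetical helper for illustration only. Each returned tuple is
    (item_id, asset_key, href, error_message).
    """
    broken = []
    for item_id, asset_key, href in assets:
        try:
            handle = opener(href)
        except errors as exc:
            # Keep the item id next to the href so the failure is traceable.
            broken.append((item_id, asset_key, href, str(exc)))
        else:
            # Close the handle if the opener returned something closeable.
            close = getattr(handle, "close", None)
            if callable(close):
                close()
    return broken


# Possible usage with pystac items and rasterio (assumed, not run here):
#   triples = [(it.id, "green", it.assets["green"].href) for it in items]
#   bad = find_broken_hrefs(
#       triples, rasterio.open, errors=(rasterio.errors.RasterioIOError,)
#   )
```

Note that this opens every file once, serially, so for several hundred remote items it may add noticeable time before the actual load.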

@SandroGroth
Author

Got it! I will activate more detailed rasterio logging if odc.stac.load encounters an exception.

Thanks again for the detailed explanation and keep up the great work!

@idantene

@Kirill888 Bringing this up again (also in the context of #54).
stackstac offers a solution to this with errors_as_nodata (here).

Any chance of implementing something similar in odc-stac?

@Kirill888
Member

Kirill888 commented Jul 28, 2023

@idantene a similar option is available in the current release of odc-stac:

fail_on_error=False,

Failed locations are logged with the Python warning system; see #100 and #99.
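If the failures do arrive as standard `warnings.warn` calls (an assumption based on the comment above, not checked against the odc-stac source), they can be collected with the standard library alone. `collect_warnings` below is a hypothetical helper written for this sketch.

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Run func and return (result, messages of warnings raised meanwhile).

    Hypothetical helper: it only sees warnings emitted in the current
    process while func is running.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including repeats of the same message.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]


# Possible usage (assumed, not run here):
#   data, msgs = collect_warnings(
#       load, items, bands=["green"], fail_on_error=False
#   )
```

When loading lazily with Dask, pixel reads happen at compute time, so it is presumably the `.compute()` call that needs wrapping, and warnings raised on remote workers would not be captured this way.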

@idantene

idantene commented Jul 28, 2023 via email

@Kirill888
Member

odc-stac decides on the structure of the output array at the very start, so storage for all time slices and all bands is "allocated" up front (not really when using Dask, but same idea). A missing or broken file will result in nodata or nan pixels being filled in there. More precisely, the image begins with empty pixels only, then each item contributes its valid pixels (an image might be readable but have no valid data at all, only nodata/nan pixels). In case of overlapping data, "first valid pixel sticks".

So if your problem is due to a broken network connection, for example, you will get back an array full of nodata/nan pixels.

@Kirill888
Member

There is no way to distinguish between pixels that were missing from the original data and pixels that failed to read: both end up with the nodata marker. There is no mask of "observed pixels" being computed either, so we can't distinguish between the following types of missing data:

  • No source image overlaps this pixel
  • Some source images overlap this pixel but have no valid pixel at that location
  • Some source images overlap this pixel, but we don't know if any of them had any valid data here because we failed to read them
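The "first valid pixel sticks" rule and the three indistinguishable cases can be illustrated with a toy NumPy example. This is not odc-stac's actual implementation: `fuse_first_valid`, the nodata value of 0, and the use of `None` to stand for a failed read are all assumptions made for the sketch.

```python
import numpy as np

NODATA = 0  # assumed nodata marker for this toy example


def fuse_first_valid(layers, shape, nodata=NODATA):
    """Fuse layers in order; a pixel keeps the first valid value it receives."""
    out = np.full(shape, nodata, dtype="uint16")  # start with empty pixels
    for layer in layers:
        if layer is None:
            # Stand-in for a source that failed to read: contributes nothing.
            continue
        # Fill only pixels that are still empty and valid in this layer.
        fill = (out == nodata) & (layer != nodata)
        out[fill] = layer[fill]
    return out


a = np.array([5, 0, 0, 0], dtype="uint16")  # valid at pixel 0 only
b = np.array([7, 8, 0, 0], dtype="uint16")  # valid at pixels 0 and 1
fused = fuse_first_valid([a, b, None], shape=(4,))
print(fused.tolist())  # [5, 8, 0, 0]
```

Pixel 0 keeps the value from `a` even though `b` also covers it. Pixels 2 and 3 end up as nodata, and nothing in the output says whether that is because no source was valid there or because the third source failed to read.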

@idantene

Thanks @Kirill888, all of that makes a lot of sense! I wish it was more explicitly mentioned in the documentation, and I still hope to put in a PR for documentation in the future :)
