ValueError: empty range for randrange() #319
Comments
What version of TorchGeo are you using that gets this bug, 0.1.2 or main? Also, can you share a minimal reproducing script? |
# Custom RasterDataset defined simply with the filename_glob
ds = WatchRaster(root=Path('/home/ritwik/dataset/'))
sampler = RandomBatchGeoSampler(ds, size=1024, batch_size=5, length=5 * 5)
dl = DataLoader(ds, batch_sampler=sampler, collate_fn=stack_samples)
fig, axs = plt.subplots(5, 5)
for idx, batch in enumerate(dl):
    for idx_s, image in enumerate(batch['image']):
        image = torch.squeeze(image)
        axs[idx, idx_s].imshow(image, cmap='inferno')
        axs[idx, idx_s].axis('off')
|
Can't reproduce with any of the data I have lying around, and obviously none of our CI tests hit this bug. What CRS/units are your files in? How large is 1024 in your CRS? Are any of your files very small? I can't really debug this if I can't reproduce it. You can try creating an R-tree with the bounding boxes of some fake files, then I wouldn't actually need the data to reproduce. There are some examples of this in our unit tests. |
The trace from above means that |
I can provide data to reproduce this.
DopData and DsmData are two simple custom RasterDatasets with the corresponding filename_glob. Simply put the files starting with "dop" in the "dops" folder and the others in the "dsms" folder. Data removed, see new comment. |
Looks like the UNIX zip tool doesn't support multi-part zip files. Can you upload separate zip files containing only some of the files? Or maybe just the single GeoTIFF needed to reproduce the issue? Also, can you post the code defining |
Here are the files zipped individually. I had to downsample two of the raster files and the error still remains. dsm_data.py:
dop_data.py:
dop10rgbi_32_338_5677_1_nw_0.5.zip |
Thanks, I can reproduce this now. Will let you know when I figure out what's happening here. |
I think there are two issues here:
1 is easy to solve (replace <=/>= with </>), but 2 is a bit harder to solve. It's important to solve 2 (not just 1) because some tiles may have a very small non-zero region of overlap. I'll submit a PR to fix this and test it by this afternoon. Thanks for the bug report! |
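For anyone following along, here is a minimal sketch of the difference between the inclusive (<=/>=) and strict (</>) overlap tests. The `Box` type and both function names are made up for illustration and are not TorchGeo's API; the point is only that two adjacent tiles sharing an edge count as a hit under the inclusive test and not under the strict one.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned 2D box; a hypothetical stand-in for a tile's bounds."""

    minx: float
    maxx: float
    miny: float
    maxy: float


def touches_or_overlaps(a: Box, b: Box) -> bool:
    """Inclusive test (<= / >=): a shared edge or corner counts as a hit."""
    return (a.minx <= b.maxx and a.maxx >= b.minx
            and a.miny <= b.maxy and a.maxy >= b.miny)


def overlaps(a: Box, b: Box) -> bool:
    """Strict test (< / >): only boxes sharing positive area count."""
    return (a.minx < b.maxx and a.maxx > b.minx
            and a.miny < b.maxy and a.maxy > b.miny)


tile_a = Box(0, 100, 0, 100)
tile_b = Box(100, 200, 0, 100)  # shares the edge x = 100 with tile_a

print(touches_or_overlaps(tile_a, tile_b))  # True
print(overlaps(tile_a, tile_b))             # False
```

Note that the strict test also drops genuinely zero-area data such as points, which is the concern raised about point datasets in this thread.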
1 might not actually be that simple. The intersection logic used to decide how two R-trees are combined comes from |
Can we just add a check in the sampler constructor to compare the box size to the area of intersection? |
We could, but what would you do with that information? Particularly in regards to:
If you skip intersections with zero area, you can't use samplers for point data. |
I'm thinking maybe for now just a warning that a user requested a box size that is larger than the area of intersection. In the meantime this will give users a more informative response than the above error. |
I don't think that will help for this particular issue. To help visualize the data that @tritolol shared, we have two datasets: 1 and 2. Both datasets include two tiles, A and B, that are adjacent like so:
Let's say A1 is tile A from dataset 1. In this case, the dataset intersection we compute includes four tiles:
The user didn't request a bounding box larger than the area of intersection because there isn't a single area of intersection, there are multiple. I think it will almost always be the case that some area of intersection in any large dataset will have an area smaller than the requested bounding box. For example, we could also have:
where A and B have slight overlap but not big enough for a bounding box. We could try to fuse these into a single tile C but that won't work for cases like:
|
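A rough sketch of the adjacent-tile case described above, in plain Python rather than with the actual R-tree. The coordinates, the `Box` type and the `intersect` helper are assumptions for illustration only. Two datasets that each contain the same adjacent tiles A and B produce four pairwise hits under an inclusive intersection, and two of them are zero-area slivers along the shared edge.

```python
from itertools import product
from typing import Dict, NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned 2D box; a hypothetical stand-in for a tile's bounds."""

    minx: float
    maxx: float
    miny: float
    maxy: float

    @property
    def area(self) -> float:
        return (self.maxx - self.minx) * (self.maxy - self.miny)


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the shared region, keeping edge-only (zero-area) contacts."""
    box = Box(max(a.minx, b.minx), min(a.maxx, b.maxx),
              max(a.miny, b.miny), min(a.maxy, b.maxy))
    if box.minx > box.maxx or box.miny > box.maxy:
        return None  # the boxes are disjoint
    return box


tile_a = Box(0, 100, 0, 100)
tile_b = Box(100, 200, 0, 100)  # adjacent to tile_a along x = 100
dataset_1 = {"A1": tile_a, "B1": tile_b}
dataset_2 = {"A2": tile_a, "B2": tile_b}

hits: Dict[str, Box] = {}
for (n1, b1), (n2, b2) in product(dataset_1.items(), dataset_2.items()):
    region = intersect(b1, b2)
    if region is not None:
        hits[f"{n1} & {n2}"] = region

for name, region in hits.items():
    print(name, region.area)
# A1 & A2 10000
# A1 & B2 0
# B1 & A2 0
# B1 & B2 10000
```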
Can't we just compare H/W of intersection to H/W of user defined bounding box? I.e. if a user is requesting a 256x256 size box but the intersection height or width is less than the box shape then we should warn the user about this. Sorry haven't had a chance to sit down and look at this in depth. Edit: if there are multiple areas of overlap can we not just filter them prior to sampling? I admit I haven't had an in depth use case for rtree so might be ignorant on how simple or not this might be. |
You could (it's easy to do), but I'm not sure where to do that and how to handle it.
This is the part that doesn't make sense. With the dataset that @tritolol shared, tiles A and B are much larger than the requested bounding box, yet you still end up with regions of overlap with zero area. In general, just about every dataset will encounter this issue. A warning is useless here: we need to avoid crashing, not just warn for every dataset. |
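To make the crash concrete, here is a small sketch of why a zero-area region produces the error in the issue title. It assumes the sampler picks the lower corner of the query box with `random.randrange` over the valid offsets; `random_origin` is a hypothetical helper, not the actual TorchGeo code, and the exact wording of the exception message differs between Python versions.

```python
import random


def random_origin(lo: int, hi: int, size: int) -> int:
    """Hypothetical helper: lower coordinate of a `size`-wide window in [lo, hi]."""
    # randrange(start, stop) needs start < stop, i.e. hi - lo >= size.
    return random.randrange(lo, hi - size + 1)


# A 2048-wide region easily fits a 1024-wide window.
origin = random_origin(0, 2048, 1024)
print(0 <= origin <= 1024)  # True

# A zero-width region (a shared tile edge) leaves no valid origin.
try:
    random_origin(100, 100, 1024)
    crashed = False
except ValueError as err:
    crashed = True
    print(type(err).__name__)  # ValueError
```

Whether a given run crashes then depends only on which region the sampler happens to pick, which would be consistent with the intermittent failures reported here.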
I think there are a few possible places we could address this issue:
GeoDataset
The index is first created when you instantiate a GeoDataset (usually via RasterDataset or VectorDataset). In the case of adjacent tiles, one solution would be to merge those tiles into a single bounding box. However, this won't work since:
So this location won't work.
IntersectionDataset
The first time we recompute the index is when we compute the intersection of two datasets. With the example data above, this is where we get those pesky intersection bounding boxes with zero area. Many users might consider this to be a bug, and it might make sense to remove bounding boxes with zero area. However, there are a couple of problems with this:
So this location won't work either.
GeoSampler
We compute the index that we sample from in the
So this location won't work either.
GeoSampler subclasses
I think this is where we'll have to do things (remove intersection bboxes smaller than the query bbox). This is the first time we know the size of the query bbox, and these samplers are already specific enough that they only work for volumetric/areal data. If someone wants to work with point data, they would already need to create a custom sampler. It's a shame we need to iterate through the R-tree in 4 different places just to get a list of locations to sample from, but I don't know of a different way to do this.
Another question is what to do when the area of intersection is 0 < bbox < size. Do we throw away those small regions of overlap, or do we still support sampling from them? The former is probably more efficient, but the latter may be important for some applications. It isn't clear to me what a good default would be. For example, if I'm using GridGeoSampler, I probably want to make predictions for all regions of data, even if they are smaller than the query bbox. |
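A minimal sketch of the "filter in the sampler subclass" idea, assuming the simpler of the two policies (regions smaller than the query box are thrown away). `Box`, `fits` and `sample_window` are illustrative names and not part of TorchGeo; a real sampler would get its hits from the R-tree index instead of a list.

```python
import random
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned 2D box; a hypothetical stand-in for an index hit."""

    minx: float
    maxx: float
    miny: float
    maxy: float


def fits(box: Box, size: float) -> bool:
    """True if a size x size query box fits inside `box`."""
    return box.maxx - box.minx >= size and box.maxy - box.miny >= size


def sample_window(hits: List[Box], size: float, rng: random.Random) -> Box:
    """Pick a random size x size window from the hits that can hold one."""
    usable = [hit for hit in hits if fits(hit, size)]
    if not usable:
        raise ValueError("no region is large enough for the requested size")
    hit = rng.choice(usable)
    minx = rng.uniform(hit.minx, hit.maxx - size)
    miny = rng.uniform(hit.miny, hit.maxy - size)
    return Box(minx, minx + size, miny, miny + size)


hits = [
    Box(0, 2048, 0, 2048),
    Box(2048, 2048, 0, 2048),   # zero-area sliver between adjacent tiles
    Box(2048, 4096, 0, 2048),
]
rng = random.Random(0)
windows = [sample_window(hits, 1024, rng) for _ in range(100)]
print(all(w.maxx <= 2048 or w.minx >= 2048 for w in windows))  # True
```

Raising only when no region fits keeps the error informative without warning on every dataset. Sampling from regions with 0 < bbox < size would need a different policy (padding or clipping, for example), which this sketch does not attempt.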
I think we also have a case of strong sampler bias. Since we first choose a random tile and then choose a random bounding box, very small regions of intersection will have the same number of hits as large areas. I'm not sure how big of an issue this is. |
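One possible way to reduce that bias, sketched here under assumptions of my own, is to weight the choice of region by its area so that large regions are picked proportionally more often. The region names and areas below are invented for illustration; this is not something the samplers discussed in this thread are stated to do.

```python
import random
from collections import Counter

# Hypothetical overlap regions and their areas (in CRS units squared).
areas = {"large": 10000.0, "small": 100.0}
names = list(areas)
weights = [areas[name] for name in names]

rng = random.Random(0)
draws = 10_000

# Current behaviour as described above: every region is equally likely.
uniform = Counter(rng.choice(names) for _ in range(draws))

# Area-weighted alternative: probability proportional to region area.
weighted = Counter(rng.choices(names, weights=weights, k=draws))

print(uniform["small"] / draws)   # close to 0.5
print(weighted["small"] / draws)  # close to 100 / 10100, about 0.01
```

With uniform tile selection the small region gets about half of all samples even though it covers roughly 1% of the area; with area weighting its share drops to match its size.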
When using RandomBatchGeoSampler, 50% of the time the following error will occur. With no code change, this runs perfectly fine the other 50% of the time.
code:
error: