Zero area intersections in IntersectionDataset result in unexpected dataset lengths #1270

Closed
calebrob6 opened this issue Apr 20, 2023 · 3 comments · Fixed by #1985
Labels
datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets

@calebrob6
Member

calebrob6 commented Apr 20, 2023

Issue

We create an IntersectionDataset like this:

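from torchgeo.datasets import RasterDataset

# Images and masks are stored as separate rasters covering the same tiles
train_image_ds = RasterDataset(
    'data/processed/images/',
)
train_mask_ds = RasterDataset(
    'data/processed/masks/',
)
train_mask_ds.is_image = False  # treat these rasters as masks, not imagery
train_ds = train_image_ds & train_mask_ds  # builds an IntersectionDataset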

Here both train_image_ds and train_mask_ds have length of 22, and cover the exact same spatial areas (i.e. there is a 1-to-1 pairing between a tile in train_image_ds and a tile in train_mask_ds). It looks something like this:

[Image: output]

The issue is that train_ds unexpectedly has a length of 140. Specifically, the merged index has 140 entries, but only 22 of them (as expected) have an area > 0. I'm guessing this is why we filter out intersections with area <= 0 in the samplers, but I don't remember the details!

I recommend that we filter areas of intersection with area 0 when merging datasets.
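
To illustrate the recommendation, here is a minimal, library-free sketch of how adjacent tiles can inflate the merged index and how an area filter brings the count back down. The `Box` class and `merge` function are hypothetical stand-ins written for this example, not torchgeo's `BoundingBox` or `IntersectionDataset`, and the closed-interval overlap test is an assumption about how the spatial index reports hits.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Box:
    """Hypothetical 2D bounding box (not torchgeo's BoundingBox)."""

    minx: float
    maxx: float
    miny: float
    maxy: float

    @property
    def area(self) -> float:
        return (self.maxx - self.minx) * (self.maxy - self.miny)

    def intersects(self, other: "Box") -> bool:
        # Closed-interval test: boxes that only share an edge or a corner
        # still count as intersecting.
        return (
            self.minx <= other.maxx
            and self.maxx >= other.minx
            and self.miny <= other.maxy
            and self.maxy >= other.miny
        )

    def __and__(self, other: "Box") -> "Box":
        # Overlap region; degenerate (zero width or height) for touching boxes.
        return Box(
            max(self.minx, other.minx),
            min(self.maxx, other.maxx),
            max(self.miny, other.miny),
            min(self.maxy, other.maxy),
        )


def merge(boxes_a, boxes_b, drop_zero_area=False):
    """Pairwise intersections, optionally dropping zero-area overlaps."""
    hits = [a & b for a, b in product(boxes_a, boxes_b) if a.intersects(b)]
    if drop_zero_area:
        hits = [h for h in hits if h.area > 0]
    return hits


# A 2x2 grid of adjacent unit tiles, used for both "images" and "masks".
tiles = [Box(x, x + 1, y, y + 1) for x in (0, 1) for y in (0, 1)]

unfiltered = merge(tiles, tiles)
filtered = merge(tiles, tiles, drop_zero_area=True)

print(len(unfiltered), len(filtered))
```

In this toy grid every tile touches every other tile, so the unfiltered merge contains one entry per pair, while the filtered merge keeps only the one-to-one pairings.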

@calebrob6 added the documentation, datasets, and samplers labels and removed the documentation label Apr 20, 2023
@adamjstewart
Collaborator

Related to #737, #319, #376, etc.

The reason for this issue is that rtree considers two bounding boxes to be overlapping even if the area of overlap is 0.

It isn't hard to add a check for this and remove zero-area overlaps from the intersection, or from the sampler. The reason we haven't done this already is that some datasets have 0 area on purpose. We have several point GeoDatasets, including GBIF, iNaturalist, and EDDMapS, and I have plans to add others for air pollution as well. I'm not actively using these datasets, and I'm not even sure if our built-in samplers would be useful for these kinds of datasets, but that's the reason things are the way they are. I would be open to changing this, but would need to think about how else we could use point datasets without 0 area files. We could add a parameter to control this, I suppose.
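
As a rough sketch of the trade-off described here, the snippet below shows why an unconditional area filter would discard point data, and how a switch could keep both behaviors. Everything in it is an assumption made for illustration: boxes are plain `(minx, maxx, miny, maxy)` tuples, and `keep_zero_area` is a hypothetical parameter name, not an existing torchgeo option.

```python
def intersection(a, b):
    """Overlap of two (minx, maxx, miny, maxy) boxes, or None if disjoint."""
    minx, maxx = max(a[0], b[0]), min(a[1], b[1])
    miny, maxy = max(a[2], b[2]), min(a[3], b[3])
    if minx > maxx or miny > maxy:
        return None
    return (minx, maxx, miny, maxy)


def area(box):
    return (box[1] - box[0]) * (box[3] - box[2])


def merge(boxes_a, boxes_b, keep_zero_area=True):
    """Pairwise intersections; `keep_zero_area` is a hypothetical switch."""
    hits = []
    for a in boxes_a:
        for b in boxes_b:
            hit = intersection(a, b)
            if hit is None:
                continue
            if keep_zero_area or area(hit) > 0:
                hits.append(hit)
    return hits


raster_tiles = [(0, 10, 0, 10)]
# Point observations are degenerate boxes: minx == maxx and miny == maxy.
points = [(3, 3, 4, 4), (7, 7, 2, 2), (20, 20, 20, 20)]

with_points = merge(raster_tiles, points, keep_zero_area=True)
without_points = merge(raster_tiles, points, keep_zero_area=False)

print(len(with_points), len(without_points))
```

With the switch on, the two points that fall inside the raster tile survive the merge; with it off, every point hit is dropped because a point's overlap with any raster always has zero area.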

@calebrob6 changed the title from "Zero area intersections in IntersectionDataset" to "Zero area intersections in IntersectionDataset result in unexpected dataset lengths" Apr 20, 2023
@calebrob6
Member Author

calebrob6 commented Apr 20, 2023

Just clarified the title to emphasize the problem: the reported length of the IntersectionDataset does not match the expected length, which is confusing to users.

@adamjstewart adamjstewart self-assigned this Apr 4, 2024
@adamjstewart adamjstewart added this to the 0.6.0 milestone Apr 4, 2024
@adamjstewart
Collaborator

@yichiac has this same problem in his dataset.
