Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store projection metadata in rtree index of GeoDataset #411

Closed
wants to merge 11 commits into from
Closed

Store projection metadata in rtree index of GeoDataset #411

wants to merge 11 commits into from

Conversation

weiji14
Copy link
Contributor

@weiji14 weiji14 commented Feb 17, 2022

Currently the GeoDataset's rtree index stores only the filepath to the file the data was loaded from (e.g. the geotiff or vector file). This pull request expands that to store the projection (CRS) information also. The filepath and crs are stored using a Python dictionary so that more metadata fields can be added in the future.

# Metadata is to be stored in the rtree index using a Python dictionary
GeoDataset.index.insert(
    id=i,
    coordinates=coords,
    obj=dict(filepath=filepath, crs=crs),
)

Note that this PR is mostly standalone and can be merged to handle implementation points 1 and 2 of #409.

Creating a namedtuple called 'GeoMetaData' that stores the original filepath and crs of the geodataset.
@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Feb 17, 2022
@weiji14 weiji14 marked this pull request as ready for review February 18, 2022 04:23
@adamjstewart adamjstewart added this to the 0.3.0 milestone Feb 18, 2022
@calebrob6 calebrob6 closed this Apr 20, 2022
@calebrob6 calebrob6 reopened this Apr 20, 2022
@calebrob6
Copy link
Member

This looks good to me / I don't see why we wouldn't want this.

@weiji14
Copy link
Contributor Author

weiji14 commented Apr 20, 2022

This looks good to me / I don't see why we wouldn't want this.

Yes, I'm hoping that this can get extended to store other bits of metadata in the future (e.g. spatial/radiometric resolution, % cloud cover, etc)! But the key thing is to have a path forward to resolve the big elephant in the room - #278/#409

I can do a rebase/merge from main to bring this branch up to speed, is there anything else in the implementation that you think could be improved?

@adamjstewart
Copy link
Collaborator

If we want the possibility of storing additional metadata we def don't want to use a namedtuple, a dict would be better. Not all datasets will have things like cloud cover.

@weiji14
Copy link
Contributor Author

weiji14 commented Apr 20, 2022

If we want the possibility of storing additional metadata we def don't want to use a namedtuple, a dict would be better. Not all datasets will have things like cloud cover.

Ok, cloud cover was definitely a bad example. But CRS and things like resolution would be mostly universal for raster datasets.

Main difference between a namedtuple/dataclass and dict is the way attributes are accessed.

Namedtuple/dataclass uses dot something (allows tab completion): hit.crs, or could also do hit[0].
Python dictionary uses square brackets: hit["crs"]

@adamjstewart
Copy link
Collaborator

But CRS and things like resolution would be mostly universal for raster datasets.

For RasterDataset yes, but this won't always be true for all GeoDatasets. For example, in #507 I'm adding a custom GeoDataset that loads from a single CSV file (so no filename per entry) and includes point data (so no resolution). VectorDataset doesn't really have resolution either, although we currently hack that in. We could store None/empty-string/0 for these kinds of datasets, but it might be better to just use some kind of dict with optional keys. I wish we could use TypedDict for more things but it currently doesn't support optional keys.

Main difference between a namedtuple/dataclass and dict is the way attributes are accessed.

All of the attribute access is internal to TorchGeo, users will almost never use this themselves. The more important difference to me is whether or not features can be optional and whether or not type hints are supported.

Not trying to shut down any of the ideas here, just playing devil's advocate for how this data structure could fail.

@weiji14
Copy link
Contributor Author

weiji14 commented Apr 20, 2022

But CRS and things like resolution would be mostly universal for raster datasets.

For RasterDataset yes, but this won't always be true for all GeoDatasets. For example, in #507 I'm adding a custom GeoDataset that loads from a single CSV file (so no filename per entry) and includes point data (so no resolution). VectorDataset doesn't really have resolution either, although we currently hack that in. We could store None/empty-string/0 for these kinds of datasets, but it might be better to just use some kind of dict with optional keys. I wish we could use TypedDict for more things but it currently doesn't support optional keys.

Main difference between a namedtuple/dataclass and dict is the way attributes are accessed.

All of the attribute access is internal to TorchGeo, users will almost never use this themselves. The more important difference to me is whether or not features can be optional and whether or not type hints are supported.

Not trying to shut down any of the ideas here, just playing devil's advocate for how this data structure could fail.

Ok, I see where you're coming from now. I had the impression that dataclass attributes could simply use typing.Optional but apparently not (though there's a hacky workaround according to https://stackoverflow.com/questions/70809438/python-dataclasses-with-optional-attributes). So good ol' Python dict it is then! Let me do a bit of refactoring.

@weiji14
Copy link
Contributor Author

weiji14 commented Apr 20, 2022

Oh wait, I just re-read your comment a bit closely, you prefer a TypedDict instead of a regular dict? Let me do another commit. Edit: done at 847998e.

@adamjstewart
Copy link
Collaborator

I don't think TypedDict supports optional keys so a regular dict would be better.

@weiji14
Copy link
Contributor Author

weiji14 commented Apr 20, 2022

I don't think TypedDict supports optional keys so a regular dict would be better.

Ok, and seems like TypedDict isn't available on Python 3.7 either. Reverted in 801d09c.

@adamjstewart
Copy link
Collaborator

Can you rebase to run the new Python 3.10 and minimum version tests?

@adamjstewart
Copy link
Collaborator

Actually, let me see if closing and reopening will run the new tests...

@adamjstewart adamjstewart reopened this Jun 27, 2022
@adamjstewart
Copy link
Collaborator

That ran the new tests, but you'll need to rebase or add a merge commit to incorporate the Sphinx changes to fix the RtD test.

Also, it looks like there is a problem with the new tests we added to test things with the minimum version of our dependencies that we support. Looks like the CRS object isn't pickleable in rasterio 1.0.20, this was fixed in 1.0.21 (rasterio/rasterio@7c6e01f). Can you update the dep in .github/requirements-min.txt to 1.0.21?

Happy to make these changes for you if you're busy but I'll need push access to your branch.

@weiji14
Copy link
Contributor Author

weiji14 commented Jun 27, 2022

After a bit more thought, I think I'll have to agree with the comment made in #409 (comment) that this PR won't make sense unless we sort out the downstream tasks of actually how to make use of the stored CRS information. I'll close this PR as I'm very low on bandwidth for the next two months and won't be able to sort out the merge conflicts anytime soon. Maybe someone can revisit this and/or come up with a better implementation later.

@weiji14 weiji14 closed this Jun 27, 2022
@weiji14 weiji14 deleted the geodataset_multicrs branch June 27, 2022 15:58
@adamjstewart adamjstewart removed this from the 0.3.0 milestone Jul 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants