Store projection metadata in rtree index of GeoDataset #411

weiji14 · 2022-02-17T19:40:47Z

Currently the GeoDataset's rtree index stores only the filepath to the file the data was loaded from (e.g. the geotiff or vector file). This pull request expands that to also store the projection (CRS) information. The filepath and CRS are stored in a Python dictionary so that more metadata fields can be added in the future.

# Metadata is to be stored in the rtree index using a Python dictionary
GeoDataset.index.insert(
    id=i,
    coordinates=coords,
    obj=dict(filepath=filepath, crs=crs),
)

Note that this PR is mostly standalone and can be merged to handle implementation points 1 and 2 of #409.


calebrob6 · 2022-04-20T15:08:25Z

This looks good to me / I don't see why we wouldn't want this.

weiji14 · 2022-04-20T16:36:06Z

> This looks good to me / I don't see why we wouldn't want this.

Yes, I'm hoping that this can get extended to store other bits of metadata in the future (e.g. spatial/radiometric resolution, % cloud cover, etc)! But the key thing is to have a path forward to resolve the big elephant in the room - #278/#409

I can do a rebase/merge from main to bring this branch up to speed, is there anything else in the implementation that you think could be improved?

adamjstewart · 2022-04-20T16:39:29Z

If we want the possibility of storing additional metadata we def don't want to use a namedtuple, a dict would be better. Not all datasets will have things like cloud cover.

weiji14 · 2022-04-20T17:07:51Z

> If we want the possibility of storing additional metadata we def don't want to use a namedtuple, a dict would be better. Not all datasets will have things like cloud cover.

Ok, cloud cover was definitely a bad example. But CRS and things like resolution would be mostly universal for raster datasets.

Main difference between a namedtuple/dataclass and dict is the way attributes are accessed.

Namedtuple/dataclass uses dot notation (allows tab completion): hit.crs. A namedtuple can also be indexed by position, e.g. hit[0].
Python dictionary uses square brackets: hit["crs"]
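
The two access styles can be compared side by side in a minimal sketch. The field values are made up, and a plain string stands in for the CRS object; neither is taken from the PR's code.

```python
from typing import NamedTuple


class GeoMetaData(NamedTuple):
    """Hypothetical record with the two fields discussed in this PR."""

    filepath: str
    crs: str  # assumption: a plain string stands in for a real CRS object


# Namedtuple: attribute access (tab-completable) or positional indexing.
hit_tuple = GeoMetaData(filepath="scene.tif", crs="EPSG:32610")
crs_by_name = hit_tuple.crs
crs_by_position = hit_tuple[1]  # crs is the second field, so index 1

# Dictionary: key access with square brackets.
hit_dict = dict(filepath="scene.tif", crs="EPSG:32610")
crs_by_key = hit_dict["crs"]
```

All three lookups return the same value; the difference is only in how the field is addressed.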

adamjstewart · 2022-04-20T19:46:05Z

> But CRS and things like resolution would be mostly universal for raster datasets.

For RasterDataset yes, but this won't always be true for all GeoDatasets. For example, in #507 I'm adding a custom GeoDataset that loads from a single CSV file (so no filename per entry) and includes point data (so no resolution). VectorDataset doesn't really have resolution either, although we currently hack that in. We could store None/empty-string/0 for these kinds of datasets, but it might be better to just use some kind of dict with optional keys. I wish we could use TypedDict for more things but it currently doesn't support optional keys.

> Main difference between a namedtuple/dataclass and dict is the way attributes are accessed.

All of the attribute access is internal to TorchGeo, users will almost never use this themselves. The more important difference to me is whether or not features can be optional and whether or not type hints are supported.

Not trying to shut down any of the ideas here, just playing devil's advocate for how this data structure could fail.
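
A rough sketch of the failure mode described here, assuming a hypothetical `RasterMeta` record and made-up values: a NamedTuple without defaults rejects an entry that lacks a field, while a plain dict simply omits the key.

```python
from typing import Any, Dict, NamedTuple, Optional


class RasterMeta(NamedTuple):
    """Hypothetical fixed-field record; every field is required."""

    filepath: str
    crs: str
    res: float


# A point dataset read from one CSV has no per-entry filepath or resolution,
# so constructing the fixed-field record fails.
try:
    RasterMeta(crs="EPSG:4326")
    missing_field_error = None
except TypeError as err:
    missing_field_error = err

# A plain dict just leaves the inapplicable keys out.
point_meta: Dict[str, Any] = {"crs": "EPSG:4326"}
res: Optional[float] = point_meta.get("res")  # None when the key is absent
has_filepath = "filepath" in point_meta
```

The trade-off is that the dict gives the type checker nothing to verify per key, which is the type-hint concern raised above.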

weiji14 · 2022-04-20T20:06:09Z

> > But CRS and things like resolution would be mostly universal for raster datasets.
>
> For RasterDataset yes, but this won't always be true for all GeoDatasets. For example, in #507 I'm adding a custom GeoDataset that loads from a single CSV file (so no filename per entry) and includes point data (so no resolution). VectorDataset doesn't really have resolution either, although we currently hack that in. We could store None/empty-string/0 for these kinds of datasets, but it might be better to just use some kind of dict with optional keys. I wish we could use TypedDict for more things but it currently doesn't support optional keys.
>
> > Main difference between a namedtuple/dataclass and dict is the way attributes are accessed.
>
> All of the attribute access is internal to TorchGeo, users will almost never use this themselves. The more important difference to me is whether or not features can be optional and whether or not type hints are supported.
>
> Not trying to shut down any of the ideas here, just playing devil's advocate for how this data structure could fail.

Ok, I see where you're coming from now. I had the impression that dataclass attributes could simply use typing.Optional but apparently not (though there's a hacky workaround according to https://stackoverflow.com/questions/70809438/python-dataclasses-with-optional-attributes). So good ol' Python dict it is then! Let me do a bit of refactoring.
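
One way to read the distinction, as a sketch with hypothetical field names: a dataclass field can be annotated with typing.Optional and given a None default, but the attribute still exists on every instance, whereas a dict key can be absent altogether.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoMetaData:
    """Hypothetical dataclass variant; not the code from this PR."""

    crs: str  # assumption: a plain string stands in for a real CRS object
    filepath: Optional[str] = None  # value may be None, field always exists
    res: Optional[float] = None


csv_entry = GeoMetaData(crs="EPSG:4326")
dict_entry = {"crs": "EPSG:4326"}  # the other keys are simply not there
```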

weiji14 · 2022-04-20T20:25:28Z

Oh wait, I just re-read your comment a bit more closely, you prefer a TypedDict instead of a regular dict? ~~Let me do another commit.~~ Edit: done at 847998e.

adamjstewart · 2022-04-20T20:42:44Z

I don't think TypedDict supports optional keys so a regular dict would be better.

weiji14 · 2022-04-20T21:00:24Z

> I don't think TypedDict supports optional keys so a regular dict would be better.

Ok, and seems like TypedDict isn't available on Python 3.7 either. Reverted in 801d09c.
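
For later readers, a hedged sketch of the option being weighed: typing.TypedDict was added in Python 3.8 (typing_extensions provides a backport), and declaring the class with total=False makes all of its keys optional for the type checker rather than individual ones. The class and key names below are hypothetical and are not the reverted commit's code.

```python
try:
    from typing import TypedDict  # available from Python 3.8
except ImportError:
    from typing_extensions import TypedDict  # backport for Python 3.7


class GeoMetaData(TypedDict, total=False):
    """Hypothetical typed metadata; total=False makes every key optional."""

    filepath: str
    crs: str  # assumption: a plain string stands in for a real CRS object
    res: float


# At runtime this is an ordinary dict; only the type checker sees the schema.
point_meta: GeoMetaData = {"crs": "EPSG:4326"}
```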

adamjstewart · 2022-06-27T00:14:44Z

Can you rebase to run the new Python 3.10 and minimum version tests?

adamjstewart · 2022-06-27T01:01:31Z

Actually, let me see if closing and reopening will run the new tests...

adamjstewart · 2022-06-27T01:13:35Z

That ran the new tests, but you'll need to rebase or add a merge commit to incorporate the Sphinx changes to fix the RtD test.

Also, it looks like there is a problem with the new tests that run against the minimum supported versions of our dependencies. The CRS object isn't pickleable in rasterio 1.0.20; this was fixed in 1.0.21 (rasterio/rasterio@7c6e01f). Can you update the dep in .github/requirements-min.txt to 1.0.21?

Happy to make these changes for you if you're busy but I'll need push access to your branch.

weiji14 · 2022-06-27T15:58:54Z

After a bit more thought, I think I'll have to agree with the comment made in #409 (comment) that this PR won't make sense unless we sort out the downstream task of how to actually make use of the stored CRS information. I'll close this PR as I'm very low on bandwidth for the next two months and won't be able to sort out the merge conflicts anytime soon. Maybe someone can revisit this and/or come up with a better implementation later.

Store crs metadata in rtree index

e6a9fe7

Creating a namedtuple called 'GeoMetaData' that stores the original filepath and crs of the geodataset.

github-actions bot added the datasets Geospatial or benchmark datasets label Feb 17, 2022

weiji14 added 2 commits February 17, 2022 22:28

Fix mypy errors by using typing.NamedTuple class

342f484

Remove unused collections import

cb5144b

weiji14 marked this pull request as ready for review February 18, 2022 04:23

weiji14 mentioned this pull request Feb 18, 2022

GeoDataset: avoid unnecessary reprojection #409

Open

adamjstewart added this to the 0.3.0 milestone Feb 18, 2022

weiji14 added 4 commits February 28, 2022 13:59

Merge branch 'main' into geodataset_multicrs

f302b53

Let datasets get filepath attribute from rtree hit object properly

ca9b3e7

Revert "Let datasets get filepath attribute from rtree hit object pro…

5d9109a

…perly" This reverts commit ca9b3e7.

Let globiomass dataset get filepath attribute from rtree index properly

66a74ce

calebrob6 closed this Apr 20, 2022

calebrob6 reopened this Apr 20, 2022

weiji14 added 2 commits April 20, 2022 16:08

Merge branch 'main' into geodataset_multicrs

e355bb1

Refactor to store filepath and CRS in Python dict instead of dataclass

7f83f9e

Use TypedDict instead of regular Python dict for the GeoMetaData

847998e

Revert "Use TypedDict instead of regular Python dict for the GeoMetaD…

801d09c

…ata" This reverts commit 847998e.

adamjstewart approved these changes Jun 27, 2022

adamjstewart closed this Jun 27, 2022

adamjstewart reopened this Jun 27, 2022

weiji14 closed this Jun 27, 2022

weiji14 deleted the geodataset_multicrs branch June 27, 2022 15:58

adamjstewart removed this from the 0.3.0 milestone Jul 9, 2022

adriantre mentioned this pull request Jun 8, 2023

Stitching: Access patch geo-transform in callback after predict #1407

Open
