Add PreChippedGeoSampler for pre-chipped geospatial datasets #479

Merged
merged 4 commits into from
Apr 5, 2022

Conversation

adamjstewart
Collaborator

Rationale

Many existing VisionDatasets actually contain geospatial metadata. These datasets should be converted to GeoDatasets (#83). However, GeoDatasets are a bit more complicated than VisionDatasets and require a GeoSampler to use. This PR adds a PreChippedGeoSampler to make this transition easier.

Implementation

For VisionDatasets, sampling is quite simple:

from torch.utils.data import DataLoader

train_dataset = ...
train_dataloader = DataLoader(train_dataset, shuffle=True)

test_dataset = ...
test_dataloader = DataLoader(test_dataset)

However, it was much trickier to get the same behavior for GeoDatasets. Previously, a user would need to do something like:

from torch.utils.data import DataLoader
from torchgeo.samplers import GridGeoSampler, RandomGeoSampler

SIZE = 256  # size of each image

train_dataset = ...
train_sampler = RandomGeoSampler(train_dataset, size=SIZE, length=len(train_dataset))
train_dataloader = DataLoader(train_dataset, sampler=train_sampler)

test_dataset = ...
test_sampler = GridGeoSampler(test_dataset, size=SIZE, stride=SIZE)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler)

Crucially, this requires the user to know the size of each image, to explicitly specify the number of images in the train dataset, and to be clever with stride. With this PR, users can instead use:

from torch.utils.data import DataLoader
from torchgeo.samplers import PreChippedGeoSampler

train_dataset = ...
train_sampler = PreChippedGeoSampler(train_dataset, shuffle=True)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler)

test_dataset = ...
test_sampler = PreChippedGeoSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler)

This is almost as simple as VisionDataset sampling and probably about as good as we're going to get.

This may be of interest to @recursix @RitwikGupta @ashnair1. I think #409 is the only remaining bottleneck preventing us from converting more VisionDatasets to GeoDatasets.

@adamjstewart adamjstewart added this to the 0.3.0 milestone Mar 23, 2022
@github-actions github-actions bot added documentation Improvements or additions to documentation samplers Samplers for indexing datasets testing Continuous integration testing labels Mar 23, 2022
@RitwikGupta
Contributor

So how does this work? Are you taking all the pre-chipped GeoTIFFs in a directory and building an R-tree using those extents?

@adamjstewart
Collaborator Author

@RitwikGupta Yes, that's how a VisionDataset would be converted to a GeoDataset. This allows you to do the same things as VisionDataset (sampling image/mask pairs one at a time) but also allows you to restrict to a specific geographic region or combine with other geospatial datasets.
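
To make the idea concrete, here is a minimal, library-free sketch of what a pre-chipped sampler does conceptually: it yields the full extent of each indexed file exactly once per pass, optionally in shuffled order. The `BoundingBox` tuple, the `PreChippedSamplerSketch` class, and the example extents are hypothetical stand-ins for illustration. This is not TorchGeo's actual implementation, which looks extents up in the dataset's R-tree index rather than a plain list.

```python
import random
from typing import Iterable, Iterator, List, NamedTuple, Optional


class BoundingBox(NamedTuple):
    """Hypothetical stand-in for the spatial extent of one chip."""

    minx: float
    maxx: float
    miny: float
    maxy: float


class PreChippedSamplerSketch:
    """Yield each chip's full extent once per pass (illustrative only)."""

    def __init__(
        self,
        extents: Iterable[BoundingBox],
        shuffle: bool = False,
        seed: Optional[int] = None,
    ) -> None:
        self.extents: List[BoundingBox] = list(extents)
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self) -> Iterator[BoundingBox]:
        order = list(range(len(self.extents)))
        if self.shuffle:
            self.rng.shuffle(order)  # new random order on every pass
        for i in order:
            yield self.extents[i]

    def __len__(self) -> int:
        # One sample per file, so no size, length, or stride is needed.
        return len(self.extents)


# Three made-up 256 x 256 chips laid out side by side.
extents = [BoundingBox(x, x + 256, 0, 256) for x in (0, 256, 512)]
sampler = PreChippedSamplerSketch(extents, shuffle=True, seed=0)
boxes = list(sampler)
```

Because every sample is simply a whole file's extent, the user never has to supply a chip size, an epoch length, or a stride.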

@calebrob6
Member

calebrob6 commented Mar 30, 2022

Some benchmarking -- I created a subset of 5000 tiffs from the USAVars dataset (256x256x4 patches in local UTM CRS scattered around the US) and used these with a RasterDataset.

Things:

  • Instantiating the RasterDataset takes ~26 seconds for the 5000 tiffs, which is really slow: it has to open each tiff and read its bounds. At that rate, a dataset of 100k patches would take ~10 minutes.
  • Instantiating the PreChippedGeoSampler takes practically the same time as instantiating a RandomGeoSampler
  • Using the PreChippedGeoSampler gets around 13 batches/sec
  • Using a RandomGeoSampler gets around 14 batches/sec
  • I made a "CustomDataset" (see below). This gets around 28 batches/sec
  • If there is warping (as in the case of this dataset) then images in a batch won't be the same size and the DataLoader will freak out. I expect this will be a sore point with users.
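
import numpy as np
import rasterio
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """Minimal baseline: read each chip straight from disk, with no index or warping."""

    def __init__(self, fns):
        self.fns = fns  # list of paths to the pre-chipped tiffs

    def __len__(self):
        return len(self.fns)

    def __getitem__(self, idx):
        with rasterio.open(self.fns[idx]) as f:
            data = f.read()  # array of shape (bands, height, width)
        return torch.from_numpy(data.astype(np.float32))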
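
As a quick sanity check on the numbers above, here is a back-of-the-envelope calculation. It assumes that indexing time scales linearly with the number of files, which is an assumption and not something measured here.

```python
# Measurements quoted in the list above.
index_seconds = 26.0  # time to instantiate the RasterDataset
num_files = 5_000     # number of tiffs indexed

# Linear extrapolation to a 100k-patch dataset (assumed, not measured).
seconds_per_file = index_seconds / num_files
projected_minutes = seconds_per_file * 100_000 / 60

# Throughput of CustomDataset relative to the PreChippedGeoSampler pipeline.
speedup = 28 / 13

print(f"{seconds_per_file * 1000:.1f} ms per file")
print(f"{projected_minutes:.1f} min for 100k files")
print(f"{speedup:.1f}x faster")
```

This works out to roughly 5 ms per file and a little under 9 minutes for 100k files, in line with the ~10 minute estimate, and puts the plain `CustomDataset` at a bit over twice the throughput.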

calebrob6
calebrob6 previously approved these changes Mar 30, 2022
Member

@calebrob6 calebrob6 left a comment

In terms of pure functionality, works exactly as advertised :)

@adamjstewart
Collaborator Author

> Instantiating the RasterDataset takes ~26 seconds for the 5000 tiffs

Yeah, this is much slower than I expected. For datasets that come with a STAC JSON file we should use this whenever possible.

> PreChippedGeoSampler vs. RandomGeoSampler

Yep, wouldn't expect any difference here. I think a more interesting benchmark would be to convert a VisionDataset to a RasterDataset and compare before and after.

> I made a "CustomDataset" (see below). This gets around 28 batches/sec

Is this just because of warping?

> If there is warping (as in the case of this dataset) then images in a batch won't be the same size and the DataLoader will freak out. I expect this will be a sore point with users.

Let me clarify this in the docs. This will hopefully no longer be an issue with #409.

@calebrob6
Member

> For datasets that come with a STAC JSON file we should use this whenever possible.

Perhaps we can generate/cache this on first run? This is also a problem with the SECO dataset IIRC.
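
One possible shape for such a cache, sketched with only the standard library. The `load_or_build_index` helper, the JSON layout, and the `read_bounds` callback are all hypothetical and not part of TorchGeo. In practice `read_bounds` would open each file (for example with rasterio) and return its extent.

```python
import json
import os
from typing import Callable, Dict, Iterable, List


def load_or_build_index(
    paths: Iterable[str],
    cache_path: str,
    read_bounds: Callable[[str], List[float]],
) -> Dict[str, List[float]]:
    """Return {path: bounds}, reading bounds only for files missing from the cache."""
    index: Dict[str, List[float]] = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            index = json.load(f)

    missing = [p for p in paths if p not in index]
    for p in missing:
        index[p] = list(read_bounds(p))  # the slow step, paid only on first run

    if missing:
        # Persist the updated index so later runs can skip the file reads.
        with open(cache_path, "w") as f:
            json.dump(index, f)
    return index
```

This sketch does no cache invalidation: if a file changes on disk after it has been indexed, the stale bounds are returned, so a real version would likely also record something like a modification time or checksum.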

> I think a more interesting benchmark would be to convert a VisionDataset to a RasterDataset and compare before and after.

This is essentially what I'm doing with the CustomDataset. The __getitem__() in a VisionDataset just has to load the file from disk and convert it to a torch tensor.

@calebrob6 calebrob6 merged commit e8474e4 into main Apr 5, 2022
@calebrob6 calebrob6 deleted the samplers/pre-chipped branch April 5, 2022 16:10
remtav pushed a commit to remtav/torchgeo that referenced this pull request May 26, 2022
…ft#479)

* Add PreChippedGeoSampler for pre-chipped geospatial datasets

* Add shuffle parameter

* Add tests, fix type hints

* Warn about multi-CRS datasets
@adamjstewart adamjstewart mentioned this pull request Jul 11, 2022
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
…ft#479)

* Add PreChippedGeoSampler for pre-chipped geospatial datasets

* Add shuffle parameter

* Add tests, fix type hints

* Warn about multi-CRS datasets