Add PreChippedGeoSampler for pre-chipped geospatial datasets #479

adamjstewart · 2022-03-23T22:21:06Z

Rationale

Many existing VisionDatasets actually contain geospatial metadata. These datasets should be converted to GeoDatasets (#83). However, GeoDatasets are a bit more complicated than VisionDatasets and require a GeoSampler to use. This PR adds a PreChippedGeoSampler to make this transition easier.

Implementation

For VisionDatasets, sampling is quite simple:

from torch.utils.data import DataLoader

train_dataset = ...
train_dataloader = DataLoader(train_dataset, shuffle=True)

test_dataset = ...
test_dataloader = DataLoader(test_dataset)

However, it was much trickier to get the same behavior for GeoDatasets. Previously, a user would need to do something like:

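from torch.utils.data import DataLoader
from torchgeo.samplers import GridGeoSampler, RandomGeoSampler

SIZE = 256  # size of each image

train_dataset = ...
# length must be passed explicitly so that one epoch covers len(train_dataset) samples
train_sampler = RandomGeoSampler(train_dataset, size=SIZE, length=len(train_dataset))
train_dataloader = DataLoader(train_dataset, sampler=train_sampler)

test_dataset = ...
# stride == size so that the grid lands on each image exactly once
test_sampler = GridGeoSampler(test_dataset, size=SIZE, stride=SIZE)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler)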

Crucially, this requires the user to know the size of each image, to explicitly specify the number of images in the train dataset, and to be clever with stride. With this PR, users can instead use:

from torch.utils.data import DataLoader
from torchgeo.samplers import PreChippedGeoSampler

train_dataset = ...
train_sampler = PreChippedGeoSampler(train_dataset, shuffle=True)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler)

test_dataset = ...
test_sampler = PreChippedGeoSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler)

This is almost as simple as VisionDataset sampling and probably about as good as we're going to get.

This may be of interest to @recursix @RitwikGupta @ashnair1. I think #409 is the only remaining bottleneck preventing us from converting more VisionDatasets to GeoDatasets.

RitwikGupta · 2022-03-24T07:29:15Z

So how does this work? Are you taking all the pre-chipped GeoTIFFs in a directory and building an R-tree using those extents?

adamjstewart · 2022-03-24T16:32:50Z

@RitwikGupta Yes, that's how a VisionDataset would be converted to a GeoDataset. This allows you to do the same things as VisionDataset (sampling image/mask pairs one at a time) but also allows you to restrict to a specific geographic region or combine with other geospatial datasets.
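To make the idea concrete, here is a minimal sketch of what a pre-chipped sampler does conceptually: it returns the full extent of every indexed chip exactly once per epoch, optionally restricted to a region of interest and optionally shuffled. This is not the torchgeo implementation. `Box` and `PreChippedSamplerSketch` are hypothetical names, the time dimension is omitted, and a plain list scan stands in for the R-tree query.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for the spatial extent of one chip."""

    minx: float
    maxx: float
    miny: float
    maxy: float

    def intersects(self, other: "Box") -> bool:
        # Two axis-aligned boxes overlap if they overlap on both axes.
        return (
            self.minx <= other.maxx
            and self.maxx >= other.minx
            and self.miny <= other.maxy
            and self.maxy >= other.miny
        )


class PreChippedSamplerSketch:
    """Yield the extent of every chip once per epoch (illustrative only)."""

    def __init__(
        self, boxes: List[Box], roi: Optional[Box] = None, shuffle: bool = False
    ) -> None:
        # A real spatial index would answer this query; a linear scan is
        # used here only to keep the sketch dependency-free.
        self.hits = [b for b in boxes if roi is None or b.intersects(roi)]
        self.shuffle = shuffle

    def __iter__(self) -> Iterator[Box]:
        order = list(self.hits)
        if self.shuffle:
            random.shuffle(order)
        return iter(order)

    def __len__(self) -> int:
        return len(self.hits)
```

Because each yielded box is the chip's own extent, the caller never has to supply a patch size, a length, or a stride.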

calebrob6 · 2022-03-30T03:13:08Z

Some benchmarking -- I created a subset of 5000 tiffs from the USAVars dataset (256x256x4 patches in local UTM CRS scattered around the US) and used these with a RasterDataset.

Observations:

- Instantiating the RasterDataset takes ~26 seconds for the 5000 tiffs (i.e. really slow). It has to open each tiff and read its bounds. This would take ~10 minutes for a dataset of 100k patches.
- Instantiating the PreChippedGeoSampler takes practically the same time as instantiating a RandomGeoSampler.
- Using the PreChippedGeoSampler gets around 13 batches/sec.
- Using a RandomGeoSampler gets around 14 batches/sec.
- I made a "CustomDataset" (see below). This gets around 28 batches/sec.
- If there is warping (as in the case of this dataset) then images in a batch won't be the same size and the DataLoader will freak out. I expect this will be a sore point with users.
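For anyone who wants to reproduce this kind of comparison, throughput figures like the ones above could be measured with a small helper along these lines. It is only a sketch, not the script used for these numbers, and `batches_per_second` and `max_batches` are hypothetical names.

```python
import time
from typing import Iterable


def batches_per_second(loader: Iterable, max_batches: int = 100) -> float:
    """Iterate over up to max_batches batches and return the average rate."""
    count = 0
    start = time.perf_counter()
    for _ in loader:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very coarse clocks.
    return count / elapsed if elapsed > 0 else float("inf")
```

The helper accepts any iterable, so the same call can be pointed at a DataLoader built with either sampler or at one wrapping the CustomDataset.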

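import numpy as np
import rasterio
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """Load each tiff directly from disk, bypassing the spatial index."""

    def __init__(self, fns):
        self.fns = fns  # list of file paths

    def __len__(self):
        return len(self.fns)

    def __getitem__(self, idx):
        # Read all bands as stored on disk, with no warping
        with rasterio.open(self.fns[idx]) as f:
            data = f.read()
        return torch.from_numpy(data.astype(np.float32))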

calebrob6

In terms of pure functionality, works exactly as advertised :)

adamjstewart · 2022-03-30T16:35:26Z

> Instantiating the RasterDataset takes ~26 seconds for the 5000 tiffs

Yeah, this is much slower than I expected. For datasets that come with a STAC JSON file we should use this whenever possible.

> PreChippedGeoSampler vs. RandomGeoSampler

Yep, wouldn't expect any difference here. I think a more interesting benchmark would be to convert a VisionDataset to a RasterDataset and compare before and after.

> I made a "CustomDataset" (see below). This gets around 28 batches/sec

Is this just because of warping?

> If there is warping (as in the case of this dataset) then images in a batch won't be the same size and the DataLoader will freak out. I expect this will be a sore point with users.

Let me clarify this in the docs. This will hopefully no longer be an issue with #409.

calebrob6 · 2022-03-30T17:01:24Z

> For datasets that come with a STAC JSON file we should use this whenever possible.

Perhaps we can generate/cache this on first run? This is also a problem with the SECO dataset IIRC.
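One way the generate-and-cache idea could look is sketched below: the bounds of each file are read once, written to a JSON file, and reused on later runs. This is an illustration under assumptions, not torchgeo code. `load_or_build_bounds`, `cache_file`, and `read_bounds` are hypothetical names, and `read_bounds` stands for whatever expensive call extracts the bounds from a file.

```python
import json
import os
from typing import Callable, Dict, List, Sequence


def load_or_build_bounds(
    paths: Sequence[str],
    cache_file: str,
    read_bounds: Callable[[str], List[float]],
) -> Dict[str, List[float]]:
    """Return {path: bounds}, reading a file only if it is not cached yet."""
    cache: Dict[str, List[float]] = {}
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)

    missing = [p for p in paths if p not in cache]
    for path in missing:
        # read_bounds is the slow step, e.g. opening a file header.
        cache[path] = list(read_bounds(path))

    if missing:
        # Rewrite the cache only when something new was added.
        with open(cache_file, "w") as f:
            json.dump(cache, f)

    return {p: cache[p] for p in paths}
```

Note that this sketch keys the cache on the path alone, so it would not notice a file that changed on disk; a real implementation would probably also need to store the CRS and a modification time.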

> I think a more interesting benchmark would be to convert a VisionDataset to a RasterDataset and compare before and after.

This is essentially what I'm doing with the CustomDataset. The __getitem__() in a VisionDataset just has to load the file from disk and convert it to a torch tensor.

…ft#479) * Add PreChippedGeoSampler for pre-chipped geospatial datasets * Add shuffle parameter * Add tests, fix type hints * Warn about multi-CRS datasets

adamjstewart added 2 commits March 23, 2022 14:31

Add PreChippedGeoSampler for pre-chipped geospatial datasets

552be5d

Add shuffle parameter

1ee5239

adamjstewart added this to the 0.3.0 milestone Mar 23, 2022

github-actions bot added documentation Improvements or additions to documentation samplers Samplers for indexing datasets testing Continuous integration testing labels Mar 23, 2022

Add tests, fix type hints

5cca5f5

calebrob6 previously approved these changes Mar 30, 2022


Warn about multi-CRS datasets

73a3116

adamjstewart dismissed calebrob6’s stale review via 73a3116 March 30, 2022 18:54

adamjstewart requested a review from calebrob6 April 3, 2022 02:14

calebrob6 approved these changes Apr 5, 2022


calebrob6 merged commit e8474e4 into main Apr 5, 2022

calebrob6 deleted the samplers/pre-chipped branch April 5, 2022 16:10

adamjstewart mentioned this pull request Jul 11, 2022

0.3.0 release #664

Merged
