EuroSAT100: add new dataset/datamodule #1130

Merged
merged 5 commits into from
Feb 23, 2023

@adamjstewart (Collaborator) commented Feb 21, 2023

What?

This is a subset of the EuroSAT MS dataset containing only 100 images. Yes, you heard that right, 100 images. This dataset maintains the 60-20-20 train-val-test split from https://arxiv.org/abs/1911.06721. There are 10 images (6 train, 2 val, 2 test) for each of the 10 classes, for a total of 100 images.
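The arithmetic behind those numbers can be checked with a small sketch. The constant and function names below are illustrative only and are not part of torchgeo; the split percentages and class count come from the description above.

```python
# Sketch: per-class and total split sizes for a 60-20-20 split.
# All names here are hypothetical helpers, not torchgeo API.
NUM_CLASSES = 10
IMAGES_PER_CLASS = 10
SPLIT_PERCENT = {"train": 60, "val": 20, "test": 20}


def split_counts(images_per_class, split_percent):
    """Return how many images of one class land in each split."""
    return {
        name: images_per_class * percent // 100
        for name, percent in split_percent.items()
    }


per_class = split_counts(IMAGES_PER_CLASS, SPLIT_PERCENT)
totals = {name: count * NUM_CLASSES for name, count in per_class.items()}

print(per_class)             # {'train': 6, 'val': 2, 'test': 2}
print(totals)                # {'train': 60, 'val': 20, 'test': 20}
print(sum(totals.values()))  # 100
```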

But why?

Our tutorials download a lot of data. So much, in fact, that we fill up the meager 14 GB SSD in our GitHub Actions runners, which has caused our notebook tests to fail for the last year or so. Between NAIP, Chesapeake, Sentinel, EuroSAT, and TropicalCyclone, we're talking several minutes of downloads. The TropicalCyclone download doesn't even work unless you sign up for an API key (#1074). A quick comparison:

Dataset      Compressed   Uncompressed   Download Time
EuroSAT      1.9G         4.7G           3m 7s
EuroSAT100   7.4M         18M            1s

Yes, 1 second, down to the millisecond:

$ time wget https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT100.zip
...
real	0m1.000s
user	0m0.055s
sys	0m0.110s
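
A single `time wget` run is one sample. A rough way to see the spread is to repeat the download a few times and summarize the timings. The sketch below uses only the Python standard library; `fetch`, `time_calls`, and `summarize` are hypothetical helper names, and the numbers will depend on your network.

```python
import statistics
import time
import urllib.request

URL = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT100.zip"


def fetch(url=URL):
    """Download the archive into memory and return its size in bytes."""
    with urllib.request.urlopen(url) as response:
        return len(response.read())


def time_calls(func, repeats=5):
    """Return the wall-clock duration in seconds of each of `repeats` calls."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summarize a list of durations as min, median, and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example usage (requires network access):
# print(summarize(time_calls(fetch)))
```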

Our tutorials should not take several minutes just on data prep, nor should they fill up your hard drive, nor should they require an API key.

Why EuroSAT?

EuroSAT was chosen for the following reasons:

  • Many of our existing tutorials already use EuroSAT, making them easy to port
  • It contains Sentinel-2 images, which our tutorials rely on to show off things like NDVI or pre-trained models
  • It is distributed under a permissive license
  • Images are georeferenced, meaning we could add a GeoDataset version of this someday
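
As an illustration of the NDVI point above, here is a minimal pure-Python sketch of NDVI = (NIR - Red) / (NIR + Red) for a single 13-band pixel. The band ordering in `BANDS` is an assumption about how the multispectral bands are stacked; check it against the dataset you load. `ndvi` and `pixel_ndvi` are hypothetical helpers, not torchgeo functions.

```python
# Assumed Sentinel-2 band order for a 13-band EuroSAT MS image.
BANDS = (
    "B01", "B02", "B03", "B04", "B05", "B06", "B07",
    "B08", "B08A", "B09", "B10", "B11", "B12",
)
RED = BANDS.index("B04")  # red band
NIR = BANDS.index("B08")  # near-infrared band


def ndvi(red, nir):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    # Avoid division by zero for empty (all-zero) pixels.
    return (nir - red) / total if total != 0 else 0.0


def pixel_ndvi(pixel):
    """Compute NDVI from a sequence of 13 band values."""
    return ndvi(pixel[RED], pixel[NIR])


# A made-up vegetated pixel: low red reflectance, high NIR reflectance.
pixel = [0] * len(BANDS)
pixel[RED] = 500
pixel[NIR] = 3500
print(pixel_ndvi(pixel))  # 0.75
```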

The milestone says 0.4.1 but versionadded says 0.5...

Technically, new features shouldn't be added in patch releases. But this may fix our chronically failing notebook tests. So I would like to add this to the next patch release for the sake of testing, but we can just pretend that it was actually added in 0.5.

Hey, you forgot to update the docs!

I'll open a separate PR to add this to the docs that we can save for the 0.5 release.

@adamjstewart added this to the 0.4.1 milestone Feb 21, 2023
@github-actions bot added the datasets and testing labels Feb 21, 2023
@github-actions bot added the datamodules label Feb 21, 2023
@adamjstewart changed the title from "EuroSAT100: add new dataset" to "EuroSAT100: add new dataset/datamodule" Feb 21, 2023
@calebrob6 (Member) commented:

Could not reproduce the 1 second download; it took my laptop 1.971 seconds.

@adamjstewart (Collaborator, Author) commented:

I tried the download several times and it's always between 0.8s and 1.8s on my laptop. Either way, it's a drastic improvement over the previous 3m download for EuroSAT and 10+ min download for Cyclone.

@calebrob6 (Member) commented:

That's more like it! We emphasize test metric distributions in torchgeo!

@calebrob6 merged commit e81af42 into main Feb 23, 2023
@calebrob6 deleted the datasets/eurosat100 branch February 23, 2023 23:26
calebrob6 pushed a commit that referenced this pull request Apr 10, 2023
* EuroSAT100: add new dataset

* Fix type hints

* Add EuroSAT100DataModule

* Isort and test fixes

* Add disclaimer, remove duplicate code
@adamjstewart modified the milestones: 0.4.1, 0.5.0 Apr 11, 2023
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023