Million-AID dataset #455
Conversation
Bleh, I'll fix the mypy issues in another PR; it looks like PyTorch finally added type hints for many of its functions in the latest release that came out yesterday.
@nilsleh can you rebase on main here?
Rebasing isn't necessary, we just need to remove the few remaining type ignores that are causing the tests to fail.
The test set is ~260 GB and split into multiple parts, whereas the train set is a single zip file. I wonder if we even want users to download this in a single sequential process. @adamjstewart
torchgeo/datasets/millionaid.py (outdated)
}
url = {
    "train": "https://eastus1-mediap.svc.ms/transform/zip?cs=fFNQTw",
    "test": "https://eastus1-mediap.svc.ms/transform/zip?cs=fFNQTw",
Does this actually download the test set? These look like the same URL to me. Also looks like the test set is made of multiple parts (e.g. test.zip.001, test.zip.002, etc.). Does extract_archive support this?
That is a good question; I'm not sure whether that will work or not. We've had to hack things to support deflate64-compressed zip files before, so multi-part zip files should be possible somehow.
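Not confirmed for this dataset, but if the test parts are a plain byte-split of one zip file (as the test.zip.001/test.zip.002 naming suggests), a sketch along these lines should work; the paths below are hypothetical:

```python
import glob
import shutil
import zipfile


def reassemble_and_extract(parts_glob: str, combined_path: str, out_dir: str) -> None:
    """Concatenate split archive parts and extract the reassembled zip.

    Assumes the parts are a plain byte-split of a single zip file, so
    concatenating them in order reproduces the original archive.
    """
    parts = sorted(glob.glob(parts_glob))
    with open(combined_path, "wb") as combined:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, combined)
    with zipfile.ZipFile(combined_path) as zf:
        zf.extractall(out_dir)


# Hypothetical usage:
# reassemble_and_extract("test.zip.*", "test.zip", "million_aid/test")
```

If the parts were instead produced by zip's own split-archive mode, they would first need to be merged or repaired with a tool such as 7z or zip -FF before extraction.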
Does this actually download the test set? These look like the same url to me.

If I look in the OneDrive and download the train or test folder, the download links are the same for me for some reason.
@isaaccorley you mentioned that you downloaded the test set and computed its MD5. Where did you download the test set from? Was it multiple parts when you downloaded and checksummed it?
I actually just noticed that test.zip was corrupted and I couldn't unzip it completely, so I think the hash may be incorrect. Going to try and download each of the test files individually.
Was this resolved?
I don't think this was ever resolved, and it's the only thing holding up this PR. If there's a single link for all the data, that's fine: we can download the data with one link, checksum it once, then extract it and extract any zip files it contains. I don't have the bandwidth/storage to download this myself, but can someone investigate and see if it's even possible to download this dataset? If not, we could just remove the download logic until we figure it out.
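For reference, a minimal sketch of that single-link flow, assuming torchgeo's download_and_extract_archive and extract_archive helpers; the URL and MD5 below are placeholders, not the real dataset values:

```python
import glob
import os

from torchgeo.datasets.utils import download_and_extract_archive, extract_archive

root = "data/million_aid"
url = "https://example.com/million_aid.zip"  # placeholder, not the real link
md5 = "0123456789abcdef0123456789abcdef"  # placeholder checksum

# Download the outer archive once, verify its MD5, and extract it.
download_and_extract_archive(url, root, md5=md5)

# If the outer archive itself contains zip files, extract each of those too.
for inner in glob.glob(os.path.join(root, "**", "*.zip"), recursive=True):
    extract_archive(inner, os.path.dirname(inner))
```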
I unfortunately also do not have the bandwidth/storage for this download :/
Did anyone ever reach out to the dataset authors to see if we can rehost a single zip file on something like Zenodo?
Since the download logic doesn't currently work, I'm going to remove it so we can get this into 0.3.0. We can add download support in a future release.
@@ -16,6 +16,7 @@ Dataset,Task,Source,# Samples,# Classes,Size (px),Resolution (m),Bands
`LandCover.ai`_,S,Aerial,"10,674",5,512x512,0.25--0.5,RGB
`LEVIR-CD+`_,CD,Google Earth,985,2,"1,024x1,024",0.5,RGB
`LoveDA`_,S,Google Earth,"5,987",7,"1,024x1,024",0.3,RGB
`Million-AID`_,C,Google Earth,1M,51--73,,0.5--153,RGB
Does anyone know the range of image sizes?
Download logic removed
* millionaid
* test
* separator
* remove type ignore
* type in test
* requested changes
* typos and glob pattern
* task argument description
* add test md5 hash
* Remove download logic
* Type ignore no longer needed

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
This PR adds the Million-AID dataset, which contains one million aerial scene images sourced from Google Earth.
Comments/Questions:
__getitem__ currently returns a variable-length label tensor (see the collate sketch below).
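Because the number of labels can differ between samples, a default DataLoader batch would fail to stack them. Here is a minimal collate sketch, assuming each sample is a dict with an "image" tensor and a 1-d "label" tensor (the key names and padding value are assumptions, not confirmed from the PR):

```python
from typing import Any, Dict, List

import torch


def collate_variable_labels(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stack fixed-size images and pad variable-length label tensors.

    Assumes all "image" tensors share one shape and "label" is a 1-d tensor
    whose length varies per sample; -1 marks padded label positions.
    """
    images = torch.stack([sample["image"] for sample in batch])
    labels = torch.nn.utils.rnn.pad_sequence(
        [sample["label"] for sample in batch], batch_first=True, padding_value=-1
    )
    return {"image": images, "label": labels}


# Hypothetical usage:
# loader = DataLoader(dataset, batch_size=8, collate_fn=collate_variable_labels)
```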
Plot Examples: (example plot images omitted)