Million-AID dataset #455
Conversation
Bleh, I'll fix the mypy issues in another PR; it looks like PyTorch finally added type hints for many of its functions in the latest release that came out yesterday.
@nilsleh can you rebase on main here?
Rebasing isn't necessary, we just need to remove the few remaining type ignores that are causing the tests to fail.
The test set is ~260 GB and split into multiple parts, whereas the train set is a single zip file. I wonder if we even want users to download this in a single sequential process. @adamjstewart
torchgeo/datasets/millionaid.py (outdated)
}
url = {
    "train": "https://eastus1-mediap.svc.ms/transform/zip?cs=fFNQTw",
    "test": "https://eastus1-mediap.svc.ms/transform/zip?cs=fFNQTw",
Does this actually download the test set? These look like the same URL to me. Also looks like the test set is made of multiple parts (e.g. test.zip.001, test.zip.002, etc.). Does extract_archive support this?
That is a good question; I'm not sure whether that will work or not. We've had to hack things to support deflate64-compressed zip files before, so multi-part zip files should be possible somehow.
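Not confirmed for this dataset, but if the test parts are a plain byte-split of one zip file (as the test.zip.001/test.zip.002 naming suggests), a sketch along these lines should work; the paths below are hypothetical:

```python
import glob
import shutil
import zipfile


def reassemble_and_extract(parts_glob: str, combined_path: str, out_dir: str) -> None:
    """Concatenate split archive parts and extract the reassembled zip.

    Assumes the parts are a plain byte-split of a single zip file, so
    concatenating them in order reproduces the original archive.
    """
    parts = sorted(glob.glob(parts_glob))
    with open(combined_path, "wb") as combined:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, combined)
    with zipfile.ZipFile(combined_path) as zf:
        zf.extractall(out_dir)


# Hypothetical usage:
# reassemble_and_extract("test.zip.*", "test.zip", "million_aid/test")
```

If the parts were instead produced by zip's own split-archive mode, they would first need to be merged or repaired with a tool such as 7z or zip -FF before extraction.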
Does this actually download the test set? These look like the same url to me.

If I look in the OneDrive and download the train or test folder, the download links are the same for me for some reason.
@isaaccorley you mentioned that you downloaded the test set and computed its MD5. Where did you download the test set from? Was it multiple parts when you downloaded and checksummed it?
I actually just noticed that test.zip was corrupted and I couldn't unzip it completely, so I think the hash may be incorrect. Going to try and download each of the test files individually.
Was this resolved?
I don't think this was ever resolved, and it's the only thing holding up this PR. If there's a single link for all the data, that's fine: we can download the data with one link, checksum it once, then extract it and extract any zip files it contains. I don't have the bandwidth/storage to download this myself, but can someone investigate and see if it's even possible to download this dataset? If not, we could just remove the download logic until we figure it out.
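For reference, a minimal sketch of that single-link flow, assuming torchgeo's download_and_extract_archive and extract_archive helpers; the URL and MD5 below are placeholders, not the real dataset values:

```python
import glob
import os

from torchgeo.datasets.utils import download_and_extract_archive, extract_archive

root = "data/million_aid"
url = "https://example.com/million_aid.zip"  # placeholder, not the real link
md5 = "0123456789abcdef0123456789abcdef"  # placeholder checksum

# Download the outer archive once, verify its MD5, and extract it.
download_and_extract_archive(url, root, md5=md5)

# If the outer archive itself contains zip files, extract each of those too.
for inner in glob.glob(os.path.join(root, "**", "*.zip"), recursive=True):
    extract_archive(inner, os.path.dirname(inner))
```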
I unfortunately also do not have the bandwidth/storage for this download :/
Did anyone ever reach out to the dataset authors to see if we can rehost a single zip file on something like Zenodo?
Since the download logic doesn't currently work, I'm going to remove it so we can get this into 0.3.0. We can add download support in a future release.
@@ -16,6 +16,7 @@ Dataset,Task,Source,# Samples,# Classes,Size (px),Resolution (m),Bands
`LandCover.ai`_,S,Aerial,"10,674",5,512x512,0.25--0.5,RGB
`LEVIR-CD+`_,CD,Google Earth,985,2,"1,024x1,024",0.5,RGB
`LoveDA`_,S,Google Earth,"5,987",7,"1,024x1,024",0.3,RGB
`Million-AID`_,C,Google Earth,1M,51--73,,0.5--153,RGB
Does anyone know the range of image sizes?
Download logic removed
* millionaid
* test
* separator
* remove type ignore
* type in test
* requested changes
* typos and glob pattern
* task argument description
* add test md5 hash
* Remove download logic
* Type ignore no longer needed

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
This PR adds the Million-AID dataset, which contains one million aerial scene images sourced from Google Earth.
Comments/Questions:
__getitem__ currently returns a variable-length label tensor (see the collate sketch below).
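Because the number of labels can differ between samples, a default DataLoader batch would fail to stack them. Here is a minimal collate sketch, assuming each sample is a dict with an "image" tensor and a 1-d "label" tensor (the key names and padding value are assumptions, not confirmed from the PR):

```python
from typing import Any, Dict, List

import torch


def collate_variable_labels(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stack fixed-size images and pad variable-length label tensors.

    Assumes all "image" tensors share one shape and "label" is a 1-d tensor
    whose length varies per sample; -1 marks padded label positions.
    """
    images = torch.stack([sample["image"] for sample in batch])
    labels = torch.nn.utils.rnn.pad_sequence(
        [sample["label"] for sample in batch], batch_first=True, padding_value=-1
    )
    return {"image": images, "label": labels}


# Hypothetical usage:
# loader = DataLoader(dataset, batch_size=8, collate_fn=collate_variable_labels)
```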
Plot Examples: (example plot images omitted)