After chatting with @favyen2, I had a look at the repo and started playing around. Here are some notes in case they prove useful to update documentation.

Notes

List of remotes

There are 3 Remotes with downloadable data; the two I interacted with are:

- the S3 bucket s3://ai2-public-datasets/satlas/
- the R2 bucket pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev (checked with whois pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev)

Interacting with the S3 bucket
When deciding what to download, I needed to know a priori the size of what I was going to download: partly to make sure I download the right thing, and partly to inform my system choice (i.e. whether to work from my Mac or from a cloud box).
To understand what you have published at a glance (training datasets & model weights) as well as their respective sizes, I ran:
aws s3 ls s3://ai2-public-datasets/satlas/ --human-readable
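To also get a grand total, the AWS CLI can recurse and summarize; a quick sketch using standard aws s3 ls flags:

aws s3 ls s3://ai2-public-datasets/satlas/ --recursive --human-readable --summarize

The last two lines of the output report the total number of objects and their combined size, which answers the "how big is this download" question in one command.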
Interacting with the R2 bucket
Just like for the S3 bucket, I wanted to list the files present in the bucket with their sizes (datasets & model weights), especially as this Remote is apparently used for the fine-tuning tasks that interest me.
R2 is supposed to expose an S3-compatible API (ref).
Unfortunately, I was unable to get anywhere, and I don't get a helpful error message either, so I am stuck:
→ aws s3api list-buckets --endpoint-url https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/
An error occurred () when calling the ListBuckets operation:
→ aws s3api list-objects-v2 --endpoint-url https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev --bucket satlas_explorer_datasets
An error occurred () when calling the ListObjectsV2 operation:
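For what it's worth, my understanding is that pub-*.r2.dev URLs are R2's public development endpoints, which only serve object downloads and don't implement the S3 API; the S3-compatible endpoint lives at the account level. A sketch of what listing would look like, assuming one had the bucket owner's account ID and an R2 API token (both placeholders below):

export AWS_ACCESS_KEY_ID=<R2_ACCESS_KEY_ID>          # placeholder: R2 API token credentials
export AWS_SECRET_ACCESS_KEY=<R2_SECRET_ACCESS_KEY>  # placeholder
aws s3api list-objects-v2 \
  --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com \
  --bucket satlas_explorer_datasets

So anonymous listing against the r2.dev URL was never going to work; only someone with credentials for the bucket can run the command above.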
Minor gitignore tweak
Because the docs expect me to populate a models/ and a vis/ folder, but those are not tracked in git, I ended up adding those two folders to my local gitignore in .git/info/exclude, so that they don't get tracked while not touching the committed .gitignore.

Solar Farm model links?

According to the docs, models/solar_farm/best.pth is one of the artifacts present in the R2 bucket (https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/satlas_explorer_datasets/satlas_explorer_datasets_2023-07-24.tar). Is there a way to download only that file directly and not the rest of the archive?
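Lacking a direct link, one workaround is to stream the tar and stop as soon as the wanted member has been extracted; a sketch assuming GNU tar, and assuming the member path inside the archive matches the docs (worth verifying with tar -t first):

# Member path is an assumption; this still downloads all bytes that precede it in the archive.
curl -s https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/satlas_explorer_datasets/satlas_explorer_datasets_2023-07-24.tar \
  | tar -xf - --occurrence=1 models/solar_farm/best.pth

When tar exits after extracting the member, curl gets a broken pipe and stops the transfer, so at least the trailing part of the archive is never downloaded.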
@louisguitton would it be easier if we just migrated all the files over into the GitHub repo and used either Git LFS or XetHub to host the large files themselves? Then people don't have to juggle interacting with 3 different data sources / hosting providers.
When you run git clone ... or git pull ..., the large files will also appear locally along with the code, while GitHub just sees pointers / hashes. This would also eliminate the need to list models and datasets in the .gitignore file.
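In case it helps the discussion, a minimal sketch of what the Git LFS migration could look like (the tracked patterns and paths are illustrative, not a statement about the repo's actual layout):

git lfs install                        # one-time setup per machine
git lfs track "*.pth" "*.tar"          # illustrative patterns for weights and archives
git add .gitattributes                 # records the LFS tracking rules
git add models/                        # illustrative path for the large files
git commit -m "Host large files via Git LFS"

After this, clones transparently fetch the real files from LFS storage while the git history only stores small pointer files.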