Notes from first-time user #16

louisguitton · 2023-09-19T15:41:12Z

After chatting with @favyen2, I had a look at the repo and started playing around. Here are some notes in case they are useful for updating the documentation.

Notes

List of remotes

There are 3 Remotes with downloadable data:

https://ai2-public-datasets.s3.amazonaws.com/satlas/ which is an AWS S3 bucket
https://huggingface.co/allenai/satlas-pretrain which is on Hugging Face (related to #15, "Consider hosting dataset on Huggingface & source.coop")
https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/ which is a Cloudflare R2 bucket (ref) hosted on GCP (found via whois pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev)

Interacting with the S3 bucket

When deciding what to download, I needed to know the size of each download in advance, partly to make sure I was downloading the right thing, and partly to decide which machine to work from (i.e. my Mac or a cloud box).
To see at a glance what you have published (training datasets and model weights) and their respective sizes, I ran:

aws s3 ls s3://ai2-public-datasets/satlas/ --human-readable
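
For anyone who prefers a script to the CLI, the same size check can be sketched in Python with only the standard library. This is a rough sketch under assumptions: it assumes the bucket allows anonymous listing over HTTPS through the S3 `list-type=2` query, which has not been confirmed for this bucket. The helper names (`list_url`, `parse_listing`, `human_readable`, `fetch_listing`) are hypothetical, and pagination via continuation tokens is left out.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListObjectsV2 responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket_host, prefix):
    """Build a ListObjectsV2 URL (assumes anonymous listing is allowed)."""
    query = urllib.parse.urlencode({"list-type": "2", "prefix": prefix})
    return f"https://{bucket_host}/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML body."""
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def human_readable(num_bytes):
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def fetch_listing(bucket_host, prefix):
    """Fetch and parse one page of results (no pagination handling)."""
    with urllib.request.urlopen(list_url(bucket_host, prefix)) as response:
        return parse_listing(response.read())
```

Calling `fetch_listing("ai2-public-datasets.s3.amazonaws.com", "satlas/")` and passing the summed sizes to `human_readable` would give a total for the first page of results. If the bucket rejects anonymous listing, the CLI command above remains the way to go.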

Interacting with the R2 bucket

Just like for the S3 bucket, I wanted to list the files in the bucket with their sizes (datasets and model weights), especially as this Remote is apparently the one used for the fine-tuning tasks that interest me.

R2 is supposed to expose an S3-compatible API (ref).
Unfortunately, I was unable to get anywhere, and the error messages come back empty, so I am stuck:

→ aws s3api list-buckets --endpoint-url https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/

An error occurred () when calling the ListBuckets operation:

→ aws s3api list-objects-v2 --endpoint-url https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev --bucket satlas_explorer_datasets

An error occurred () when calling the ListObjectsV2 operation:
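
One possible explanation, offered as an assumption rather than a confirmed diagnosis: a pub-….r2.dev address appears to be R2's public object URL rather than its S3 API endpoint, which could be why the S3 calls above return nothing useful. If so, listing is not available there, but a plain HTTP HEAD request against a known object URL should still report its size through the Content-Length header. The sketch below shows that idea; `remote_size` is a hypothetical helper, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.request


def remote_size(url, opener=urllib.request.urlopen):
    """Return the Content-Length (in bytes) reported for url, or None if absent.

    Sends a HEAD request, so no object data is downloaded.
    """
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None
```

For example, `remote_size(...)` called with the full URL of the .tar archive mentioned in the docs would return its size in bytes, provided the server answers HEAD requests and includes a Content-Length header.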

Minor gitignore tweak

The docs expect me to populate a models/ and a vis/ folder, neither of which is tracked in git. I ended up adding both folders to my local exclude file, .git/info/exclude, so that they stay untracked without touching the committed .gitignore.
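
The local-exclude tweak can be scripted as well. The following is a minimal sketch, assuming the script is pointed at the root of a regular (non-bare) clone; `add_local_excludes` is a hypothetical helper name, not something the repo provides. It appends only the patterns that are not already listed, so it is safe to run more than once.

```python
from pathlib import Path


def add_local_excludes(repo_root, patterns):
    """Append patterns missing from .git/info/exclude; return the ones added."""
    git_dir = Path(repo_root) / ".git"
    if not git_dir.is_dir():
        raise FileNotFoundError(f"{git_dir} is not a git directory")

    exclude_file = git_dir / "info" / "exclude"
    exclude_file.parent.mkdir(exist_ok=True)

    text = exclude_file.read_text() if exclude_file.exists() else ""
    existing = text.splitlines()
    missing = [pattern for pattern in patterns if pattern not in existing]

    if missing:
        # Start on a fresh line in case the file lacks a trailing newline.
        separator = "" if text == "" or text.endswith("\n") else "\n"
        exclude_file.write_text(text + separator + "\n".join(missing) + "\n")
    return missing
```

Running `add_local_excludes(".", ["models/", "vis/"])` from the repo root would add both folders to the local exclude file without changing the committed .gitignore.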

Solar Farm model links?

According to the docs, models/solar_farm/best.pth is one of the artifacts in the archive hosted in the R2 bucket (https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/satlas_explorer_datasets/satlas_explorer_datasets_2023-07-24.tar).
Is there a way to download only that file directly, without the rest of the archive?
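
Absent a direct link, one partial workaround is to stream the archive and keep only the wanted member. The sketch below illustrates this with Python's `tarfile` in streaming mode; `extract_one` is a hypothetical helper, and the member path inside the archive is an assumption that should be checked against the real archive. Note the limitation: the bytes up to and including the member still travel over the network, so this saves disk space rather than bandwidth.

```python
import shutil
import tarfile


def extract_one(stream, member_name, dest_path):
    """Copy a single regular file out of a tar stream; return True if found.

    `stream` is any object with a read() method (an open file, an HTTP
    response, ...). Reading stops as soon as the member has been copied.
    """
    # Mode "r|" reads the archive sequentially, without seeking.
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if member.name == member_name and member.isfile():
                source = archive.extractfile(member)
                with open(dest_path, "wb") as out:
                    shutil.copyfileobj(source, out)
                return True
    return False
```

With a network connection, the `stream` argument could be the response object returned by `urllib.request.urlopen` for the archive URL, so the archive itself is never written to disk.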

louisguitton · 2023-09-19T15:44:04Z

Just saw that my solar farm question was answered in #12.

srinify · 2023-12-05T17:26:07Z

@louisguitton would it be easier if we just migrated all the files over into the GitHub repo and used either Git LFS or XetHub to host the large files themselves? Then people don't have to juggle interacting with 3 different data sources / hosting providers.

When you run git clone ... or git pull ..., the large files will also appear locally alongside the code, while GitHub just sees pointers / hashes. This would also eliminate the need to list models and datasets in the .gitignore file.

Proposed here: #25
