Notes from first-time user #16

Open
louisguitton opened this issue Sep 19, 2023 · 2 comments
louisguitton commented Sep 19, 2023

After chatting with @favyen2, I had a look at the repo and started playing around. Here are some notes in case they are useful for updating the documentation.

Notes

List of remotes

There are 3 Remotes with downloadable data:

Interacting with the S3 bucket

When deciding what to download, I needed to know its size in advance, partly to make sure I was downloading the right thing and partly to decide where to work (i.e. on my Mac or on a cloud box).
To see at a glance what you have published (training datasets and model weights) and how large each item is, I ran:

aws s3 ls s3://ai2-public-datasets/satlas/ --human-readable
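The command above lists only the top level. For totals per dataset, one option is to run the same listing with `--recursive` (and without `--human-readable`, so sizes stay in bytes) and sum the sizes per folder. The sketch below assumes the usual `date time size key` layout of `aws s3 ls --recursive` output; the function name `sizes_by_prefix` and its `strip_prefix` parameter are made up for illustration. The CLI also has a `--summarize` flag that prints a grand total, and `--no-sign-request` may help if no AWS credentials are configured.

```python
from collections import defaultdict
from typing import Dict


def sizes_by_prefix(listing: str, strip_prefix: str = "") -> Dict[str, int]:
    """Sum object sizes in bytes per first path component.

    `listing` is the text printed by `aws s3 ls <uri> --recursive`
    (assumed layout: date, time, size in bytes, key).
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in listing.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip blank lines, "PRE dir/" rows and summary lines
        key = parts[3]
        if strip_prefix and key.startswith(strip_prefix):
            key = key[len(strip_prefix):]
        totals[key.split("/", 1)[0]] += int(parts[2])
    return dict(totals)
```

Feeding it the output of `aws s3 ls s3://ai2-public-datasets/satlas/ --recursive` with `strip_prefix="satlas/"` should give one byte total per top-level entry under `satlas/`.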

Interacting with the R2 bucket

As with the S3 bucket, I wanted to list the files in the bucket (datasets and model weights) along with their sizes, especially since this Remote appears to be the one used for the fine-tuning tasks that interest me.

R2 is supposed to expose an S3-compatible API (ref).
Unfortunately, I could not get it to work, and the error messages are empty, so I am stuck:

→ aws s3api list-buckets --endpoint-url  https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/

An error occurred () when calling the ListBuckets operation:
→ aws s3api list-objects-v2 --endpoint-url https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev --bucket satlas_explorer_datasets

An error occurred () when calling the ListObjectsV2 operation:
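A possible explanation, to my understanding of Cloudflare's setup: `r2.dev` public URLs serve individual objects over plain HTTP and do not act as an S3 API endpoint (the S3-compatible API sits on a separate, credentialed, account-scoped endpoint), so listing through them is not expected to work. If the goal is only to know a file's size before downloading, a HEAD request on a known object URL may be enough. This is a minimal sketch using the standard library; `remote_size` and `human_size` are illustrative names, the `User-Agent` value is arbitrary, and it assumes the server returns a `Content-Length` header.

```python
import urllib.request
from typing import Optional


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"  # not reached; keeps type checkers happy


def remote_size(url: str, timeout: float = 30.0) -> Optional[int]:
    """Return the Content-Length reported for `url`, without downloading the body."""
    request = urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": "size-check/0.1"},  # arbitrary value (assumption)
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None
```

For example, `human_size(remote_size(url))` on the archive URL quoted further down in this issue would print its size, provided the header is present.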

Minor gitignore tweak

Because the docs expect me to populate a models/ and a vis/ folder that are not tracked in git, I added both folders to my local ignore list in .git/info/exclude. That keeps them untracked without touching the committed .gitignore.
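For anyone who wants to script that step, here is a small sketch that appends patterns to `.git/info/exclude` only if they are not already there. The helper name `add_local_excludes` is hypothetical, and it assumes a regular clone where `.git` is a directory (not a worktree or submodule).

```python
from pathlib import Path
from typing import List


def add_local_excludes(repo_root: str, patterns: List[str]) -> List[str]:
    """Append missing patterns to .git/info/exclude; return the ones added."""
    git_dir = Path(repo_root) / ".git"
    if not git_dir.is_dir():
        raise FileNotFoundError(f"{git_dir} is not a git directory")
    exclude = git_dir / "info" / "exclude"
    exclude.parent.mkdir(exist_ok=True)
    text = exclude.read_text() if exclude.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [p for p in patterns if p not in present]
    if missing:
        # Start on a fresh line if the file does not end with a newline.
        separator = "" if text == "" or text.endswith("\n") else "\n"
        with exclude.open("a") as handle:
            handle.write(separator + "\n".join(missing) + "\n")
    return missing
```

Calling `add_local_excludes(".", ["models/", "vis/"])` from the repo root reproduces the tweak described above, and running it twice does not duplicate entries.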

Solar Farm model links?

According to the docs, models/solar_farm/best.pth is one of the artifacts present in the R2 bucket (https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/satlas_explorer_datasets/satlas_explorer_datasets_2023-07-24.tar).
Is there a way to download only that file directly and not the rest of the archive?
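One possible workaround, which may differ from what the maintainers recommend, is to stream the archive and write out only the wanted member, so the full .tar never lands on disk. The bytes before that member still travel over the network. In this sketch the function names are illustrative, and the member path inside the archive (`models/solar_farm/best.pth`) is an assumption taken from the docs path, not something I verified.

```python
import shutil
import tarfile
import urllib.request
from typing import BinaryIO, Optional


def extract_member(stream: BinaryIO, suffix: str, dest: str) -> Optional[str]:
    """Copy the first regular file whose name ends with `suffix` to `dest`.

    The archive is read in streaming mode, so `stream` only needs `.read()`.
    Returns the member name, or None if nothing matched.
    """
    with tarfile.open(fileobj=stream, mode="r|*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(suffix):
                source = archive.extractfile(member)
                with open(dest, "wb") as out:
                    shutil.copyfileobj(source, out)
                return member.name  # stop reading once the file is saved
    return None


def fetch_member(url: str, suffix: str, dest: str) -> Optional[str]:
    """Stream a remote tar archive and save a single member locally."""
    with urllib.request.urlopen(url) as response:
        return extract_member(response, suffix, dest)
```

Usage would look like `fetch_member(archive_url, "models/solar_farm/best.pth", "best.pth")`, with `archive_url` set to the .tar URL above.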

@louisguitton
Author

Just saw that my solar farm question was answered in #12.

srinify commented Dec 5, 2023

@louisguitton would it be easier if we just migrated all the files over into the GitHub repo and used either Git LFS or XetHub to host the large files themselves? Then people don't have to juggle interacting with 3 different data sources / hosting providers.

When you run git clone ... or git pull ..., the large files will also appear locally alongside the code, while GitHub only stores pointers / hashes. This would also remove the need to list models and datasets in the .gitignore file.

Proposed here: #25
