
Issue reproducing the GitHub partition #118

Open

osainz59 opened this issue Oct 30, 2023 · 3 comments

Comments


osainz59 commented Oct 30, 2023

Hi there,

I followed the instructions in the GitHub downloader repository and executed the download_repo_text.py script.

I obtained a total of 27,819,203 documents, about half of the count reported here:

return 56626342

I fixed and added some metadata I need for my analyses. In total the file is around 60 GB on disk. I have not run github_reduce.py yet, since the full dataset does not match what the authors reported.

Also, since the links to the GitHub partition are no longer available, I would like to know whether there is anything I can do to obtain the original GitHub data that is in The Pile (hopefully with the correct metadata).

Thank you

@Zengyu-98

Hello, do your documents refer to the total number of individual data items (namely, each JSON line)? I ran into the same problem: our total number of data items is similar to yours.

Is your downloaded data around 800 GB in total? We downloaded The Pile but got only around ~400 GB, which is probably why we ended up with only half of the documents.
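For anyone comparing counts, here is a minimal sketch of how one might tally data items (one JSON object per line) across the downloaded files. It assumes plain .jsonl files under a hypothetical data/ directory; adjust the glob pattern (and add decompression) to match the actual layout that download_repo_text.py produces.

```python
import glob
import json

# Count data items (one JSON object per line) across the downloaded files.
# ASSUMPTION: plain .jsonl files under data/ -- adapt the pattern and add
# decompression to match how download_repo_text.py actually writes its output.
total = 0
for path in glob.glob("data/**/*.jsonl", recursive=True):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # sanity-check that the line is valid JSON
            total += 1

print(f"{total} documents")  # compare against the 56,626,342 reported in the repo
```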


osainz59 commented Nov 6, 2023

Yes, by documents I refer to each data item. I only have the preprocessed data after downloading the GitHub partition, so I do not know the numbers for the complete dataset.

@Zengyu-98

> Yes, by documents I refer to each data item. I only have the preprocessed data after downloading the GitHub partition, so I do not know the numbers for the complete dataset.

How large is your downloaded dataset in total? Mine is ~400 GB, which is also half of what the paper reports (800 GB). I guess that might be the reason.
