Hello, by "documents" do you mean the total number of individual data items (that is, each JSON line)? I ran into the same problem: our total number of data items is similar to yours.
Is your downloaded data around 800 GB? We downloaded the Pile, but our copy is only ~400 GB. Is that perhaps why we see only half of the documents?
Yes, by "documents" I mean each data item. I only have the preprocessed data from downloading the GitHub partition, so I do not know the numbers for the complete dataset.
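To make sure the counts in this thread are comparable, here is a minimal sketch of counting documents as JSON lines. It assumes an uncompressed, UTF-8 `.jsonl` file with one JSON object per non-empty line; the function name `count_jsonl_documents` and the demo file are hypothetical and not part of the downloader repository. A compressed file would need to be decompressed first.

```python
import json
import os
import tempfile


def count_jsonl_documents(path):
    """Count documents in a JSON Lines file (one JSON object per non-empty line).

    Assumption: the file is uncompressed and UTF-8 encoded.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                # json.loads raises ValueError on a malformed line,
                # so a truncated download is caught instead of silently counted.
                json.loads(line)
                count += 1
    return count


if __name__ == "__main__":
    # Small self-contained demo on a temporary file (hypothetical data).
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.jsonl")
        with open(demo_path, "w", encoding="utf-8") as handle:
            handle.write('{"text": "print(1)", "meta": {}}\n')
            handle.write('{"text": "print(2)", "meta": {}}\n')
        print(count_jsonl_documents(demo_path), "documents")
        print(os.path.getsize(demo_path), "bytes on disk")
```

Running the same count on both of our files, together with the size on disk, would show whether we are comparing the same thing.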
How large is your downloaded dataset in total? Mine is ~400 GB, which is also half of the size reported in the paper (800 GB). I guess that might be the reason.
Hi there,
I followed the GitHub downloader repository and executed the download_repo_text.py script.
I obtained a total of 27,819,203 documents, just half of the documents reported here:
the-pile/the_pile/datasets.py
Line 704 in df97f86
I fixed and added some metadata I need for my analyses. In total, the file is around 60 GB on disk. I have not run github_reduce.py yet, as the full dataset does not match what the authors report.
Also, as the links to the GitHub partition are not available anymore, I would like to know if there is something I can do to obtain the original GitHub data that is in The Pile (hopefully with correct metadata).
Thank you