
Issue reproducing the GitHub partition #118

Open

osainz59 opened this issue Oct 30, 2023 · 3 comments

Comments


osainz59 commented Oct 30, 2023

Hi there,

I followed the instructions in the GitHub downloader repository and executed the download_repo_text.py script.

I obtained a total of 27,819,203 documents, about half of the count reported here:

return 56626342

I fixed and added some metadata I need for my analyses. In total the file is around 60 GB on disk. I have not run github_reduce.py yet, since the full dataset does not match what the authors reported.

Also, since the links to the GitHub partition are no longer available, I would like to know whether there is anything I can do to obtain the original GitHub data that is in The Pile (hopefully with the correct metadata).

Thank you

@Zengyu-98

Hello, do your documents refer to the total number of individual data items (namely, each JSON line)? I ran into the same problem: our total number of data items is similar to yours.

Is your downloaded data around 800 GB in total? We downloaded The Pile but got only around ~400 GB, which is probably why we ended up with only half of the documents.
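For anyone comparing counts, here is a minimal sketch of how one might tally data items (one JSON object per line) across the downloaded files. It assumes plain .jsonl files under a hypothetical data/ directory; adjust the glob pattern (and add decompression) to match the actual layout that download_repo_text.py produces.

```python
import glob
import json

# Count data items (one JSON object per line) across the downloaded files.
# ASSUMPTION: plain .jsonl files under data/ -- adapt the pattern and add
# decompression to match how download_repo_text.py actually writes its output.
total = 0
for path in glob.glob("data/**/*.jsonl", recursive=True):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # sanity-check that the line is valid JSON
            total += 1

print(f"{total} documents")  # compare against the 56,626,342 reported in the repo
```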


osainz59 commented Nov 6, 2023

Yes, by documents I refer to each data item. I only have the preprocessed data after downloading the GitHub partition, so I do not know the numbers for the complete dataset.

@Zengyu-98

> Yes, by documents I refer to each data item. I only have the preprocessed data after downloading the GitHub partition, so I do not know the numbers for the complete dataset.

How large is your downloaded dataset in total? Mine is ~400 GB, which is also half of what the paper reports (800 GB). I guess that might be the reason.
