Hi, thanks for the great work on open-source LLMs. I have some confusion when using your data.
The paper says that most of the pre-training data is scraped from Hugging Face, and there is also a section describing the conversion of PDF texts, but no detail is given about where those PDFs come from.
In the processed dataset hosted on Hugging Face, we can't find a clear mapping from each sub-category to its origin. Most records come with metalink='N/A', and some have a metalink that looks wrong, since the size differs noticeably from the corresponding dataset on the Hub (book_math, for example).
Could you provide more details on the data structuring conventions used? Thanks.
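For reference, this is roughly how we checked the metalink field; a minimal sketch that assumes the processed data loads with the `datasets` library, and where the repo ID and field name are placeholders rather than the actual names:

```python
from collections import Counter

from datasets import load_dataset

# Placeholder repo ID; substitute the actual processed dataset on the Hub.
ds = load_dataset("your-org/processed-pretraining-data", split="train", streaming=True)

# Tally metalink values to see how many records lack a usable origin.
counts = Counter()
for i, record in enumerate(ds):
    counts[record.get("metalink", "<missing>")] += 1
    if i >= 100_000:  # sample a slice instead of scanning the full split
        break

for link, n in counts.most_common(20):
    print(f"{n:>8}  {link}")
```

In our sampling, 'N/A' dominates the tally, which is what prompted this question.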