Add n_documents option for HF datasets #316
Conversation
```python
    device="cpu",
    fold_bias=True,
)
```
Could be valuable to add some tests that load and tokenize actual datasets we care about. For instance:

- Manual check: print decoded tokens from running `tokenize_dataset` on the Pile and on TinyStories. Maybe just a simple pytest test whose output you can inspect by using the `-s` flag when running pytest.
- Check that there are no padding tokens.
- Check that EOT tokens exist occasionally.

I don't feel strongly that these tests need to be added if you have done some version of this manually. (A sketch of what such a test could look like is given below.)
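For concreteness, a minimal sketch of what such a test might look like. `tokenize_dataset`, its signature, the import path, and the dataset identifiers are placeholders for illustration, not the repo's confirmed API:

```python
# Hypothetical pytest sketch of the suggested checks; names and signatures
# below are assumptions, not taken from the actual codebase.
import pytest

from my_project.data import tokenize_dataset  # placeholder import path


@pytest.mark.parametrize("dataset_name", ["pile", "tiny_stories"])
def test_tokenize_dataset_output(dataset_name: str) -> None:
    tokens, tokenizer = tokenize_dataset(dataset_name, n_ctx=1024)  # assumed helper and return values

    # Manual check: run `pytest -s` to eyeball the decoded tokens against the raw text.
    print(tokenizer.decode(tokens[0]))

    # There should be no padding tokens anywhere in the tokenized samples.
    if tokenizer.pad_token_id is not None:
        assert (tokens != tokenizer.pad_token_id).all()

    # EOT tokens should appear occasionally, marking document boundaries.
    assert (tokens == tokenizer.eos_token_id).any()
```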
Wrote an assert to ensure that there are exactly `len(dataset)` EOT tokens (and thus no padding tokens). I accidentally called this commit "Ensure tokenizers have eos_token_id".
I did a manual check that the decoded tokens make sense. By eye, the decoded tokens match the raw inputs, with some EOT tokens scattered throughout.
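For reference, a minimal sketch of what that assert might look like; the function name and placement are illustrative, not taken from the actual commit:

```python
import torch


def assert_exactly_one_eot_per_document(
    tokens: torch.Tensor, eos_token_id: int, n_documents: int
) -> None:
    """Check that each document contributed exactly one EOT token.

    If padding had been added (and the pad token is the same as the EOT token,
    as is often the case for GPT-NeoX-style tokenizers), the EOT count would
    exceed the number of documents.
    """
    n_eot = int((tokens == eos_token_id).sum())
    assert n_eot == n_documents, f"Expected {n_documents} EOT tokens, found {n_eot}"
```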
Some other dataset-related improvements that could be added (feel free to skip)
Yeah this would be nice. I haven't looked at "offline mode" before. Not going to look into it now, but made an issue for it.
Several changes since last review. Best for you to have a look. Most notably, as discussed, the
Add n_documents option for HF datasets
Description
How Has This Been Tested?
- `test_data.test_invalid_hf_dataset_config` for checking the pydantic validation of the `HFDatasetConfig` `n_samples` and `seed` arguments.
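A rough sketch of the shape of that test follows; the stand-in `HFDatasetConfig` and its validation rule (requiring `seed` whenever `n_samples` is set) are assumptions for illustration, not the repo's actual config:

```python
# Illustrative stand-in; the real HFDatasetConfig fields and rules may differ.
from typing import Optional

import pytest
from pydantic import BaseModel, ValidationError, model_validator


class HFDatasetConfig(BaseModel):
    """Stand-in config with an assumed n_samples/seed validation rule."""

    name: str
    n_samples: Optional[int] = None
    seed: Optional[int] = None

    @model_validator(mode="after")
    def validate_seed_with_n_samples(self) -> "HFDatasetConfig":
        if self.n_samples is not None and self.seed is None:
            raise ValueError("seed must be provided when n_samples is set")
        return self


def test_invalid_hf_dataset_config() -> None:
    with pytest.raises(ValidationError):
        HFDatasetConfig(name="pile", n_samples=10)  # missing seed -> should fail validation
```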
Does this PR introduce a breaking change?

Yes.
`return_set_n_samples` has changed names to `n_samples`. Also, previously when `return_set_n_samples` was used in a pythia config, it actually specified the number of documents loaded from the dataset. It now specifies the number of `n_ctx`-length samples.

Note that: