
Add n_documents option for HF datasets #316

Merged
danbraunai-apollo merged 13 commits into main from feature/hf-documents
Feb 1, 2024

Conversation

@danbraunai-apollo
Contributor

@danbraunai-apollo commented Jan 30, 2024

Add n_documents option for HF datasets

Description

  • Add n_documents as an attribute of HFDatasetConfig. This governs how many documents are loaded from the dataset before it is split into n_samples samples of length n_ctx.
  • Add a seed attribute to HFDatasetConfig. This governs which random selection of n_ctx-length samples is drawn from the larger set of documents.
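A minimal sketch of the resulting config shape. This is illustrative only: the real HFDatasetConfig is a pydantic model, and everything here other than the n_documents, n_samples, n_ctx, and seed fields is an assumption (including the dataset_name field and the exact validation rules).

```python
# Illustrative sketch: a plain dataclass standing in for the pydantic model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HFDatasetConfig:
    dataset_name: str                   # HF dataset id, e.g. "NeelNanda/pile-10k" (assumed field name)
    n_ctx: int                          # length of each tokenized sample
    n_documents: Optional[int] = None   # documents to load before splitting into samples
    n_samples: Optional[int] = None     # number of n_ctx-length samples to draw
    seed: Optional[int] = None          # seeds the random sample selection

    def __post_init__(self) -> None:
        # Mirrors the kind of check the pydantic validation might perform.
        if self.n_documents is not None and self.n_documents <= 0:
            raise ValueError("n_documents must be positive")
        if self.n_samples is not None and self.n_samples <= 0:
            raise ValueError("n_samples must be positive")
```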

How Has This Been Tested?

  • Add test_data.test_invalid_hf_dataset_config, which checks the pydantic validation of HFDatasetConfig.
  • Add TestTokenizeDataset, which checks various properties of rib.loader.tokenize_dataset; that function now also takes n_samples and seed arguments.

Does this PR introduce a breaking change?

Yes. return_set_n_samples has been renamed to n_samples. Also, when return_set_n_samples was previously used in a pythia config, it actually specified the number of documents loaded from the dataset; it now specifies the number of n_ctx-length samples.

Note that:

  • tinystories has ~235 toks/document.
  • pile-10k, which we use for pythia, has ~1555 toks/document.


Contributor

@nix-apollo Jan 31, 2024


Could be valuable to add some tests that load and tokenize actual datasets we care about. For instance:

  • Manual check: print decoded tokens from running tokenize_dataset on the pile and on tinystories. Maybe just a simple pytest test whose output you can inspect by running pytest with the -s flag.
  • Check that there are no padding tokens.
  • Check that EOT tokens appear occasionally.

I don't feel strongly that these tests need to be added if you have done some version of this manually.
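The padding and EOT checks suggested above could be sketched roughly as follows. This is a hedged sketch, not the repo's actual test: the real tokenize_dataset lives in rib.loader and its output format may differ; here tok_ids is assumed to be a list of token-id lists, one list of length n_ctx per sample.

```python
from typing import List, Optional

def check_tokenized_samples(tok_ids: List[List[int]], eot_id: int, pad_id: Optional[int]) -> None:
    """Assert the properties suggested in review: no padding, occasional EOT tokens."""
    flat = [tok for sample in tok_ids for tok in sample]
    # Documents are concatenated and then split into n_ctx-length samples,
    # so no padding tokens should ever be needed.
    if pad_id is not None and pad_id != eot_id:
        assert pad_id not in flat, "unexpected padding token in samples"
    # EOT tokens should appear occasionally, marking document boundaries.
    assert eot_id in flat, "expected at least one EOT token"
```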

Contributor Author


Wrote an assert to ensure that there are exactly len(dataset) EOT tokens (and thus no padding tokens). I accidentally called this commit Ensure tokenizers have eos_token_id.

I did a manual check that the decoded tokens make sense. By eye, the decoded tokens match the raw inputs, with EOT tokens scattered throughout.
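The assert described above might look roughly like this. It is a sketch under two stated assumptions: the tokenizer appends exactly one EOT token per document, and the tokenized samples have been flattened into a single token stream; the helper name is hypothetical.

```python
from typing import List

def assert_one_eot_per_document(token_stream: List[int], n_documents: int, eot_id: int) -> None:
    """Check there are exactly len(dataset) EOT tokens in the flattened stream.

    If every document contributes exactly one EOT token, a matching count
    implies no padding tokens were inserted anywhere.
    """
    n_eot = token_stream.count(eot_id)
    assert n_eot == n_documents, f"expected {n_documents} EOT tokens, found {n_eot}"
```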

@nix-apollo
Contributor

nix-apollo commented Jan 31, 2024

Some other dataset related improvements that could be added (feel free to skip)

  • Support for the tiny stories test set, which is titled "validation" instead of "test"
  • Caching downloaded data (and models?) in some folder on the ssd drive. Best case we can set transformers to offline mode and shave a second off of imports (nice for fast tests!)

@danbraunai-apollo changed the title from "Add return_set_n_documents option for HF datasets" to "Add n_documents option for HF datasets" on Jan 31, 2024
@danbraunai-apollo
Contributor Author

  • Caching downloaded data (and models?) in some folder on the ssd drive. Best case we can set transformers to offline mode and shave a second off of imports (nice for fast tests!)

Yeah this would be nice. I haven't looked at "offline mode" before. Not going to look into it now, but made an issue for it.

@danbraunai-apollo
Contributor Author

Several changes since the last review. Best for you to have a look. Most notably, as discussed, test_stochastic_basis_tinystories should be made more robust to different seeds and/or n_samples in the dataset config.

@danbraunai-apollo merged commit 7ef52a8 into main Feb 1, 2024
