
Add n_documents option for HF datasets #316

Merged
danbraunai-apollo merged 13 commits into main from feature/hf-documents
Feb 1, 2024

Conversation

@danbraunai-apollo
Contributor

@danbraunai-apollo commented Jan 30, 2024

Add n_documents option for HF datasets

Description

  • Add n_documents as an attribute of HFDatasetConfig. This governs how many documents are loaded from the dataset before it is split into n_samples samples of length n_ctx.
  • Add a seed attribute to HFDatasetConfig. This governs which random selection of n_ctx-length samples is drawn from the larger set of documents.
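A minimal sketch of the resulting config shape. This is illustrative only: the real HFDatasetConfig is a pydantic model, and everything here other than the n_documents, n_samples, n_ctx, and seed fields is an assumption (including the dataset_name field and the exact validation rules).

```python
# Illustrative sketch: a plain dataclass standing in for the pydantic model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HFDatasetConfig:
    dataset_name: str                   # HF dataset id, e.g. "NeelNanda/pile-10k" (assumed field name)
    n_ctx: int                          # length of each tokenized sample
    n_documents: Optional[int] = None   # documents to load before splitting into samples
    n_samples: Optional[int] = None     # number of n_ctx-length samples to draw
    seed: Optional[int] = None          # seeds the random sample selection

    def __post_init__(self) -> None:
        # Mirrors the kind of check the pydantic validation might perform.
        if self.n_documents is not None and self.n_documents <= 0:
            raise ValueError("n_documents must be positive")
        if self.n_samples is not None and self.n_samples <= 0:
            raise ValueError("n_samples must be positive")
```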

How Has This Been Tested?

  • Add test_data.test_invalid_hf_dataset_config, which checks the pydantic validation of HFDatasetConfig.
  • Add TestTokenizeDataset, which checks various properties of rib.loader.tokenize_dataset; that function now also takes n_samples and seed arguments.

Does this PR introduce a breaking change?

Yes. return_set_n_samples has been renamed to n_samples. Also, when return_set_n_samples was previously used in a pythia config, it actually specified the number of documents loaded from the dataset; it now specifies the number of n_ctx-length samples.

Note that:

  • tinystories has ~235 toks/document.
  • pile-10k, which we use for pythia, has ~1555 toks/document.


Contributor

@nix-apollo Jan 31, 2024


Could be valuable to add some tests that load and tokenize actual datasets we care about. For instance:

  • Manual check: print decoded tokens from running tokenize_dataset on the pile and on tinystories. Maybe just a simple pytest test whose output you can inspect by running pytest with the -s flag.
  • Check that there are no padding tokens.
  • Check that EOT tokens appear occasionally.

I don't feel strongly that these tests need to be added if you have done some version of this manually.
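The padding and EOT checks suggested above could be sketched roughly as follows. This is a hedged sketch, not the repo's actual test: the real tokenize_dataset lives in rib.loader and its output format may differ; here tok_ids is assumed to be a list of token-id lists, one list of length n_ctx per sample.

```python
from typing import List, Optional

def check_tokenized_samples(tok_ids: List[List[int]], eot_id: int, pad_id: Optional[int]) -> None:
    """Assert the properties suggested in review: no padding, occasional EOT tokens."""
    flat = [tok for sample in tok_ids for tok in sample]
    # Documents are concatenated and then split into n_ctx-length samples,
    # so no padding tokens should ever be needed.
    if pad_id is not None and pad_id != eot_id:
        assert pad_id not in flat, "unexpected padding token in samples"
    # EOT tokens should appear occasionally, marking document boundaries.
    assert eot_id in flat, "expected at least one EOT token"
```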

Contributor Author


Wrote an assert to ensure that there are exactly len(dataset) EOT tokens (and thus no padding tokens). I accidentally called this commit Ensure tokenizers have eos_token_id.

I did a manual check that the decoded tokens make sense. By eye, the decoded tokens match the raw inputs, with EOT tokens scattered throughout.
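The assert described above might look roughly like this. It is a sketch under two stated assumptions: the tokenizer appends exactly one EOT token per document, and the tokenized samples have been flattened into a single token stream; the helper name is hypothetical.

```python
from typing import List

def assert_one_eot_per_document(token_stream: List[int], n_documents: int, eot_id: int) -> None:
    """Check there are exactly len(dataset) EOT tokens in the flattened stream.

    If every document contributes exactly one EOT token, a matching count
    implies no padding tokens were inserted anywhere.
    """
    n_eot = token_stream.count(eot_id)
    assert n_eot == n_documents, f"expected {n_documents} EOT tokens, found {n_eot}"
```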

@nix-apollo
Contributor

nix-apollo commented Jan 31, 2024

Some other dataset related improvements that could be added (feel free to skip)

  • Support for the tiny stories test set, which is titled "validation" instead of "test"
  • Caching downloaded data (and models?) in some folder on the ssd drive. Best case we can set transformers to offline mode and shave a second off of imports (nice for fast tests!)

@danbraunai-apollo changed the title from "Add return_set_n_documents option for HF datasets" to "Add n_documents option for HF datasets" on Jan 31, 2024
@danbraunai-apollo
Contributor Author

  • Caching downloaded data (and models?) in some folder on the ssd drive. Best case we can set transformers to offline mode and shave a second off of imports (nice for fast tests!)

Yeah this would be nice. I haven't looked at "offline mode" before. Not going to look into it now, but made an issue for it.

@danbraunai-apollo
Contributor Author

Several changes since the last review. Best for you to have a look. Most notably, as discussed, test_stochastic_basis_tinystories should be made more robust to different seeds and/or n_samples in the dataset config.

@danbraunai-apollo merged commit 7ef52a8 into main Feb 1, 2024
