Reading comprehension synthetic data regex improvements (#80)

Fixes #78
Enhances the regexes that post-process LLM output for reading comprehension dataset creation. This improves pubmed 500 dataset generation from yielding only 7% of the dataset to yielding 93% of it.
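As an illustration of the kind of post-processing involved, a regex that extracts question/answer pairs from raw LLM output might look like the sketch below. The `[QUESTION]`/`[ANSWER]` markers, the pattern itself, and the `extract_qa_pairs` helper are all hypothetical; the actual regexes in this PR differ.

```python
import re

# Hypothetical pattern, not the actual regex from this PR.
# Matches "[QUESTION] ... [ANSWER] ..." blocks in LLM output,
# tolerating extra whitespace and case differences in the markers.
QA_PATTERN = re.compile(
    r"\[QUESTION\]\s*(?P<question>.+?)"
    r"\s*\[ANSWER\]\s*(?P<answer>.+?)"
    r"(?=\[QUESTION\]|\Z)",
    re.DOTALL | re.IGNORECASE,
)

def extract_qa_pairs(text):
    """Return (question, answer) tuples extracted from raw LLM output."""
    return [
        (m.group("question").strip(), m.group("answer").strip())
        for m in QA_PATTERN.finditer(text)
    ]
```

Making the markers case-insensitive and whitespace-tolerant is the kind of loosening that can take yield from a small fraction of the dataset to most of it, since LLMs rarely emit the expected delimiters with perfect consistency.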
Changes in this PR:

- Improved the post-processing regexes for extracted reading comprehension texts.
- Added error handling around `tokenizer.apply_chat_template()`. This blew up during development on a few inputs, and it caused the entire data pipeline to fail because of a few bad inputs with `None` as the content. After this fix it will log an exception with a stacktrace but continue with the rest of the dataset, which is a much better user experience. Note that it should never need to kick in as long as the upstream data generation is producing correct inputs.

Validating on the Nuclear patent dataset
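The error-handling change described above can be sketched as follows. The wrapper name `safe_apply_chat_template` and its exact arguments are illustrative, not the actual code in this PR:

```python
import logging

logger = logging.getLogger(__name__)

def safe_apply_chat_template(tokenizer, messages):
    """Apply the chat template, but skip bad inputs (e.g. messages with
    None content) instead of failing the whole data pipeline."""
    try:
        return tokenizer.apply_chat_template(messages, tokenize=False)
    except Exception:
        # Log with a stacktrace and continue with the rest of the dataset.
        logger.exception("Failed to apply chat template; skipping input")
        return None
```

Callers can then filter out the `None` results and keep generating the remaining examples.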
I used a tiny subset of the nuclear patent dataset (nuclear_patents_25_arcee_format.csv) to validate that the changes also work on that dataset; it yields 85% of the dataset.
Here are some example extracted reading comprehension texts from the nuclear patent dataset.
Manual e2e testing
Here are some generated chat completions on the pubmed500 dataset.
Example generated doc 1
Example generated doc 2
This one has a null value, which should probably be fixed.
Example generated doc 3
Example generated doc 4
This one isn't even generating questions; it looks meaningless.
Unit tests
See `tests/datasets/reading_comprehension_generation/test_utils.py`.