When using sentence_dedup to deduplicate Chinese text, I encountered unexpected behavior:
1. I duplicated the same line of text 1000 times to create a dataset (a minimal sketch of building such a file is shown below).
2. With split_sentences=False, deduplication works as expected: only one record remains, which is correct.
3. With split_sentences=True, however, the output still contains 1000 records. Inspecting the text field shows only two unique variations among the 1000 records, so deduplication did not fully complete. I also checked the generated hashes: every document produces the same 35 hashes, so the signature step looks correct and the failure appears to be in the dedup filter step.
Could you please help investigate this issue? Thank you!
Let me know if you need further details or clarification!
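For reference, the test dataset was simply one line repeated 1000 times. Below is a minimal sketch of how such a file could be generated; the file name, sample sentence, and JSON keys (`text`, `id`, which `JsonlReader` reads by default) are illustrative assumptions, not taken from the actual setup:

```python
import json
import os

# Hypothetical reproduction: write one Chinese sentence 1000 times into a
# JSONL file (one {"text": ..., "id": ...} object per line) that the
# JsonlReader(data_folder="demo") stage below can consume.
os.makedirs("demo", exist_ok=True)
line = "这是一个用于测试句子去重的示例文本。"  # illustrative sample text
with open("demo/repro.jsonl", "w", encoding="utf-8") as f:
    for i in range(1000):
        f.write(json.dumps({"text": line, "id": str(i)}, ensure_ascii=False) + "\n")
```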
```python
# imports as in datatrove's sentence deduplication example
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages

# modify sentence dedup hyper params here
sent_dedup_config = SentDedupConfig(
    n_sentences=3,
    split_sentences=True,  # set to False to split on \n instead
    only_dedup_in_index=True,
    min_doc_words=50,
)

FINDER_WORKERS = 10  # this will speed up/parallelize step 2


def run_example():
    # stage 1: compute a hash signature for every sentence span
    pipeline_1 = [
        JsonlReader(data_folder="demo", limit=1000),
        JsonlWriter("intermediate/"),
        SentenceDedupSignature(output_folder="c4/sigs", config=sent_dedup_config, finder_workers=FINDER_WORKERS),
    ]

    # stage 2: find duplicated signatures
    pipeline_2 = [SentenceFindDedups(data_folder="c4/sigs", output_folder="c4/dups", config=sent_dedup_config)]

    # stage 3: filter the duplicated spans out of the documents
    pipeline_3 = [
        JsonlReader(data_folder="intermediate/"),
        SentenceDedupFilter(data_folder="c4/dups", config=sent_dedup_config, language=Languages.mandarin_chinese),
        JsonlWriter("c4/final_output"),  # save the final filtered output to disk
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=1, tasks=1)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=1, tasks=1)
    executor_1.run()
    executor_2.run()
    executor_3.run()
```
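To make the reported hash count easier to reason about: with split_sentences=True the text is first split into individual sentences, and hashes are computed over rolling spans of n_sentences consecutive sentences. The sketch below is only a conceptual illustration of that windowing (plain SHA-1 over directly concatenated sentences; datatrove's actual sentence splitter, normalization, and hash function differ):

```python
import hashlib
from typing import Iterator


def sentence_span_hashes(sentences: list[str], n_sentences: int = 3) -> Iterator[str]:
    """Conceptual illustration (not datatrove's exact implementation):
    hash every rolling window of n_sentences consecutive sentences."""
    for i in range(len(sentences) - n_sentences + 1):
        span = "".join(sentences[i : i + n_sentences])
        yield hashlib.sha1(span.encode("utf-8")).hexdigest()


# A document that splits into 37 sentences yields 37 - 3 + 1 = 35 span
# hashes, which would be consistent with the "35 identical hashes per
# doc" observation above.
sentences = [f"句子{i}。" for i in range(37)]  # dummy stand-in sentences
print(len(list(sentence_span_hashes(sentences))))  # 35
```

Since all 1000 documents produce identical hash sets, the signature stage is behaving consistently; the question is why the filter stage still emits (nearly) all records.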