In the last line of the snippet below, `data.pop("metadata")` can return a value of a type other than `dict`, in which case the `|` merge fails.
File: src/datatrove/pipeline/readers/base.py
```python
def _default_adapter(self, data: dict, path: str, id_in_file: int | str):
    """
    The default data adapter to adapt input data into the datatrove Document format

    Args:
        data: a dictionary with the "raw" representation of the data
        path: file path or source for this sample
        id_in_file: its id in this particular file or source

    Returns:
        a dictionary with text, id, media and metadata fields
    """
    return {
        "text": data.pop(self.text_key, ""),
        "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
        "media": data.pop("media", []),
        "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
    }
```
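One possible defensive variant (a sketch only, not the project's actual fix; `adapt` and its keyword parameters are hypothetical stand-ins for the method above) would wrap a non-dict metadata value under a key before merging, so the union cannot raise:

```python
def adapt(data: dict, path: str, id_in_file, text_key="text", id_key="id"):
    """Sketch of a default adapter that tolerates non-dict metadata columns."""
    metadata = data.pop("metadata", {})
    if not isinstance(metadata, dict):
        # e.g. a dataset that stores metadata as a plain string: keep the
        # value under a "metadata" key instead of merging it directly, so
        # the dict union below always operates on two dicts
        metadata = {"metadata": metadata}
    return {
        "text": data.pop(text_key, ""),
        "id": data.pop(id_key, f"{path}/{id_in_file}"),
        "media": data.pop("media", []),
        "metadata": metadata | data,  # remaining columns go into metadata
    }
```

Whether to wrap the value, parse it (if it is JSON), or raise a clearer error is a design choice for the maintainers; the sketch only shows that the crash is avoidable.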
This happened when I tried to tokenize FineMath, whose `metadata` column has a string type.
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 109, in _run_for_rank
    raise e
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 390, in run
    outputfile: TokenizedFile = self.write_unshuffled(data, unshuf_filename)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 359, in write_unshuffled
    for batch in batched(data, self.batch_size):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/utils/batching.py", line 20, in batched
    while batch := list(itertools.islice(it, n)):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 125, in run
    document = self.get_document_from_dict(line, self.dataset, f"{rank:05d}/{li}")
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 60, in get_document_from_dict
    document = super().get_document_from_dict(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 79, in get_document_from_dict
    parsed_data = self.adapter(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 65, in _default_adapter
    "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
TypeError: unsupported operand type(s) for |: 'str' and 'dict'
```
amangup changed the title from "Bug: Default Adapter assumes type of source data column" to "Bug: Default Adapter assumes type of metadata column in source data" on Jan 25, 2025.