Bug: Default Adapter assumes type of metadata column in source data #328

Open
amangup opened this issue Jan 25, 2025 · 0 comments
amangup commented Jan 25, 2025

In the last line of the method below, data.pop("metadata", {}) can return a value of a type other than dict (the default only applies when the key is missing, not when its value has the wrong type), and the | merge with the remaining data then fails with a TypeError.

File: src/datatrove/pipeline/readers/base.py

    def _default_adapter(self, data: dict, path: str, id_in_file: int | str):
        """
        The default data adapter to adapt input data into the datatrove Document format

        Args:
            data: a dictionary with the "raw" representation of the data
            path: file path or source for this sample
            id_in_file: its id in this particular file or source

        Returns: a dictionary with text, id, media and metadata fields

        """
        return {
            "text": data.pop(self.text_key, ""),
            "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
            "media": data.pop("media", []),
            "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
        }

This happened when I tried to tokenize FineMath, which has a metadata column of string type.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 109, in _run_for_rank
    raise e
  File "/usr/local/lib/python3.10/dist-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 390, in run
    outputfile: TokenizedFile = self.write_unshuffled(data, unshuf_filename)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/tokens/tokenizer.py", line 359, in write_unshuffled
    for batch in batched(data, self.batch_size):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/utils/batching.py", line 20, in batched
    while batch := list(itertools.islice(it, n)):
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 125, in run
    document = self.get_document_from_dict(line, self.dataset, f"{rank:05d}/{li}")
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/huggingface.py", line 60, in get_document_from_dict
    document = super().get_document_from_dict(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 79, in get_document_from_dict
    parsed_data = self.adapter(data, source_file, id_in_file)
  File "/usr/local/lib/python3.10/dist-packages/datatrove/pipeline/readers/base.py", line 65, in _default_adapter
    "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
TypeError: unsupported operand type(s) for |: 'str' and 'dict'
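As a workaround until this is fixed upstream, a custom adapter can be passed to the reader via its adapter argument. The sketch below is my own suggestion, not an official fix; it assumes the reader binds the adapter as a method (so it receives self, and self.text_key / self.id_key are available) and that a string-typed metadata column, as in FineMath, holds JSON. If decoding fails, the raw value is nested under a key so the merge cannot fail:

    import json

    def safe_adapter(self, data: dict, path: str, id_in_file: int | str) -> dict:
        metadata = data.pop("metadata", {})
        if not isinstance(metadata, dict):
            # String-typed metadata: try to decode it as JSON, and fall back to
            # nesting the raw value under a key so the dict merge below can't fail
            try:
                decoded = json.loads(metadata)
            except (TypeError, ValueError):
                decoded = None
            metadata = decoded if isinstance(decoded, dict) else {"metadata": metadata}
        return {
            "text": data.pop(self.text_key, ""),
            "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
            "media": data.pop("media", []),
            "metadata": metadata | data,  # remaining columns still go into metadata
        }

Used as e.g. HuggingFaceDatasetReader(..., adapter=safe_adapter). A proper fix in _default_adapter itself could apply the same isinstance check.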
@amangup changed the title from "Bug: Default Adapter assumes type of source data column" to "Bug: Default Adapter assumes type of metadata column in source data" on Jan 25, 2025