[Bug Fix] update load_dataset to support long data #878

Open: wants to merge 1 commit into base: main

shizhediao
Contributor

Today, while processing ultrachat data, I encountered an error with lmflow’s load_data as follows:

Generating train split: 0 examples [02:27, ? examples/s]
[rank7]: Traceback (most recent call last):
[rank7]:   File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 2013, in _prepare_split_single
[rank7]:     writer.write_table(table)
[rank7]:   File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 584, in write_table
[rank7]:     pa_table = pa_table.combine_chunks()
[rank7]:   File "pyarrow/table.pxi", line 4289, in pyarrow.lib.Table.combine_chunks
[rank7]:   File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
[rank7]:   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
[rank7]: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

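I later found that this was due to the text in ultrachat being very long (over ten rounds of dialogue, with individual dialogues sometimes exceeding 10,000 characters). pyarrow's string type stores 32-bit offsets, so a single array can hold only about 2 GiB of text in total, and combine_chunks overflows that limit when it concatenates the chunks. Therefore, I converted the data type in load_data from string to large_string, which uses 64-bit offsets.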
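To make the limit concrete, here is a rough back-of-the-envelope sketch in plain Python (no pyarrow needed). The helper names are hypothetical and not part of lmflow or pyarrow, and it assumes roughly one byte per character, which only holds for ASCII text. It estimates how many 10,000-character dialogues fit in one array before the final signed offset overflows.

```python
def max_rows_per_array(bytes_per_row: int, offset_bits: int = 32) -> int:
    """Rows of `bytes_per_row` bytes that fit before the last offset overflows.

    Offsets are modeled as signed integers: 32-bit for string,
    64-bit for large_string.
    """
    max_offset = 2 ** (offset_bits - 1) - 1
    return max_offset // bytes_per_row


def offsets_overflow(row_byte_lengths, offset_bits: int = 32) -> bool:
    """True if the cumulative byte length exceeds the largest signed offset."""
    max_offset = 2 ** (offset_bits - 1) - 1
    return sum(row_byte_lengths) > max_offset


# Assumption: each dialogue is about 10,000 bytes (the figure quoted above).
rows_32 = max_rows_per_array(10_000, offset_bits=32)
print(rows_32)                                                   # rows that fit with string
print(offsets_overflow([10_000] * (rows_32 + 1)))                # one more row overflows
print(offsets_overflow([10_000] * (rows_32 + 1), offset_bits=64))  # large_string has room
```

Under these assumptions, a few hundred thousand long dialogues are enough to hit the limit once the chunks are concatenated into one array, which matches the point where the traceback fails.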

Please note: I have not fully tested whether this code might introduce other unforeseen changes, so I suggest not merging it for now. Instead, consider it a marker for future reference if someone encounters a similar issue. It can be merged after someone has thoroughly tested it.

@wheresmyhair mentioned this pull request on Jul 7, 2024