[Bug Fix] update load_dataset to support long data #878

shizhediao · 2024-07-06T22:41:24Z

Today, while processing UltraChat data, I hit the following error in LMFlow's load_data:

Generating train split: 0 examples [02:27, ? examples/s]
[rank7]: Traceback (most recent call last):
[rank7]: File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 2013, in _prepare_split_single
[rank7]: writer.write_table(table)
[rank7]: File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 584, in write_table
[rank7]: pa_table = pa_table.combine_chunks()
[rank7]: File "pyarrow/table.pxi", line 4289, in pyarrow.lib.Table.combine_chunks
[rank7]: File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
[rank7]: File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
[rank7]: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

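I later traced this to the length of the text in UltraChat: conversations run to more than ten rounds, and a single dialogue sometimes exceeds 10,000 characters. pyarrow's default string type uses 32-bit offsets, so when the chunks are combined the accumulated text can exceed what a single string array can address, which raises the overflow error above. To work around it, I changed the data type in load_data from string to large_string, which uses 64-bit offsets.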
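To make the limit concrete, here is a rough back-of-the-envelope sketch in plain Python. It assumes that a string array with 32-bit offsets can address at most 2^31 - 1 bytes of text in total, and it uses a hypothetical corpus of 10,000-character ASCII dialogues; the helper names are made up for illustration and are not part of LMFlow or pyarrow.

```python
INT32_MAX = 2**31 - 1  # largest offset representable with 32-bit offsets


def total_utf8_bytes(texts):
    """Total size in bytes of the UTF-8 encoded texts."""
    return sum(len(t.encode("utf-8")) for t in texts)


def fits_32bit_offsets(total_bytes):
    """True if one string array with 32-bit offsets could hold this much data."""
    return total_bytes <= INT32_MAX


# Hypothetical corpus: dialogues of 10,000 ASCII characters (1 byte each).
bytes_per_dialogue = total_utf8_bytes(["x" * 10_000])
max_dialogues = INT32_MAX // bytes_per_dialogue
print(max_dialogues)  # 214748
```

Under these assumptions, a couple of hundred thousand long dialogues are enough to overflow a single combined string array, even though each individual dialogue is small.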

Please note: I have not fully tested whether this code might introduce other unforeseen changes, so I suggest not merging it for now. Instead, consider it a marker for future reference if someone encounters a similar issue. It can be merged after someone has thoroughly tested it.

update load_dataset to support long data

9091819

wheresmyhair mentioned this pull request Jul 7, 2024

[Roadmap] LMFlow Roadmap #862

Open

31 tasks
