
Couldn't cast array of type fixed_size_list to Sequence(Value(float64)) #6280

Closed
jmif opened this issue Oct 5, 2023 · 4 comments · Fixed by #6283


jmif commented Oct 5, 2023

Describe the bug

I have a dataset with an embedding column; when I try to map that dataset, I get the following exception:

Traceback (most recent call last):
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: Couldn't cast array of type
fixed_size_list<item: float>[2]
to
Sequence(feature=Value(dtype='float32', id=None), length=2, id=None)

Steps to reproduce the bug

Here's a simple repro script:

from datasets import Features, Value, Sequence, ClassLabel, Dataset

dataset_features = Features({
    'text': Value('string'),
    'embedding': Sequence(Value('double'), length=2),
    'categories': Sequence(ClassLabel(names=sorted([
        'one',
        'two',
        'three'
    ]))),
})

dataset = Dataset.from_dict(
    {
        'text': ['A'] * 10000,
        'embedding': [[0.0, 0.1]] * 10000,
        'categories': [[0]] * 10000,
    },
    features=dataset_features
)

def test_mapper(r):
    r['text'] = list(map(lambda t: t + ' b', r['text']))
    return r


dataset = dataset.map(test_mapper, batched=True, batch_size=10, features=dataset_features, num_proc=2)

Removing the embedding column fixes the issue!

Expected behavior

The mapping completes successfully.

Environment info

  • datasets version: 2.14.4
  • Platform: macOS-14.0-arm64-arm-64bit
  • Python version: 3.10.12
  • Huggingface_hub version: 0.17.1
  • PyArrow version: 13.0.0
  • Pandas version: 2.0.3
@mariosasko (Collaborator)

Thanks for reporting! I've opened a PR with a fix.


jmif commented Oct 5, 2023

Thanks for the quick response @mariosasko! I just installed your branch via poetry add 'git+https://github.com/huggingface/datasets#fix-array_values' and I can confirm it works on the example provided.

Follow up question for you, should Nones be supported in these types of features as they are in others?

For example, the following script:

from datasets import Features, Value, Sequence, ClassLabel, Dataset

dataset_features = Features({
    'text': Value('string'),
    'embedding': Sequence(Value('double'), length=2),
    'categories': Sequence(ClassLabel(names=sorted([
        'one',
        'two',
        'three'
    ]))),
})

dataset = Dataset.from_dict(
    {
        'text': ['A'] * 10000,
        "embedding": [None] * 10000, # THIS LINE CHANGED
        'categories': [[0]] * 10000,
    },
    features=dataset_features
)

def test_mapper(r):
    r['text'] = list(map(lambda t: t + ' b', r['text']))
    return r


dataset = dataset.map(test_mapper, batched=True, batch_size=10, features=dataset_features, num_proc=2)

fails with

Traceback (most recent call last):
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3493, in _map_single
    writer.write_batch(batch)
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_writer.py", line 549, in write_batch
    array = cast_array_to_feature(col_values, col_type) if col_type is not None else col_values
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/table.py", line 2160, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
fixed_size_list<item: double>[2]
to
Sequence(feature=Value(dtype='float64', id=None), length=2, id=None)

Ideally we can have empty embedding columns as well!

@mariosasko (Collaborator)

This part of PyArrow is buggy and inconsistent in terms of which features are implemented across the types, so the only option is to operate at the Arrow buffer level to fix issues such as this one.


jmif commented Oct 13, 2023

Ok, can you take a look at the POC I did here? Happy to turn this into an actual PR, but I'd appreciate feedback on the implementation before I take another pass!
