
Couldn't cast array of type fixed_size_list to Sequence(Value(float64)) #6280

Closed
jmif opened this issue Oct 5, 2023 · 4 comments · Fixed by #6283


jmif commented Oct 5, 2023

Describe the bug

I have a dataset with an embedding column; when I try to map that dataset, I get the following exception:

Traceback (most recent call last):
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: Couldn't cast array of type
fixed_size_list<item: float>[2]
to
Sequence(feature=Value(dtype='float32', id=None), length=2, id=None)

Steps to reproduce the bug

Here's a simple repro script:

from datasets import Features, Value, Sequence, ClassLabel, Dataset

dataset_features = Features({
    'text': Value('string'),
    'embedding': Sequence(Value('double'), length=2),
    'categories': Sequence(ClassLabel(names=sorted([
        'one',
        'two',
        'three'
    ]))),
})

dataset = Dataset.from_dict(
    {
        'text': ['A'] * 10000,
        'embedding': [[0.0, 0.1]] * 10000,
        'categories': [[0]] * 10000,
    },
    features=dataset_features
)

def test_mapper(r):
    r['text'] = list(map(lambda t: t + ' b', r['text']))
    return r


dataset = dataset.map(test_mapper, batched=True, batch_size=10, features=dataset_features, num_proc=2)

Removing the embedding column fixes the issue!

Expected behavior

The mapping completes successfully.

Environment info

  • datasets version: 2.14.4
  • Platform: macOS-14.0-arm64-arm-64bit
  • Python version: 3.10.12
  • Huggingface_hub version: 0.17.1
  • PyArrow version: 13.0.0
  • Pandas version: 2.0.3
@mariosasko (Collaborator)

Thanks for reporting! I've opened a PR with a fix.


jmif commented Oct 5, 2023

Thanks for the quick response @mariosasko! I just installed your branch via poetry add 'git+https://github.com/huggingface/datasets#fix-array_values' and I can confirm it works on the example provided.

Follow up question for you, should Nones be supported in these types of features as they are in others?

For example, the following script:

from datasets import Features, Value, Sequence, ClassLabel, Dataset

dataset_features = Features({
    'text': Value('string'),
    'embedding': Sequence(Value('double'), length=2),
    'categories': Sequence(ClassLabel(names=sorted([
        'one',
        'two',
        'three'
    ]))),
})

dataset = Dataset.from_dict(
    {
        'text': ['A'] * 10000,
        "embedding": [None] * 10000, # THIS LINE CHANGED
        'categories': [[0]] * 10000,
    },
    features=dataset_features
)

def test_mapper(r):
    r['text'] = list(map(lambda t: t + ' b', r['text']))
    return r


dataset = dataset.map(test_mapper, batched=True, batch_size=10, features=dataset_features, num_proc=2)

fails with

Traceback (most recent call last):
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3493, in _map_single
    writer.write_batch(batch)
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_writer.py", line 549, in write_batch
    array = cast_array_to_feature(col_values, col_type) if col_type is not None else col_values
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/table.py", line 2160, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
fixed_size_list<item: double>[2]
to
Sequence(feature=Value(dtype='float64', id=None), length=2, id=None)

Ideally we can have empty embedding columns as well!

@mariosasko (Collaborator)

This part of PyArrow is buggy and inconsistent in terms of which features are implemented across the types, so the only option is to operate at the Arrow buffer level to fix issues such as this one.


jmif commented Oct 13, 2023

Ok, can you take a look at the POC I did here? Happy to turn this into an actual PR, but I'd appreciate feedback on the implementation before I take another pass!
