Ray Dataset: Unable to merge: Field <> has incompatible types when reading slightly different schemas #36051
Labels
data: Ray Data-related issues
enhancement: Request for new feature and/or capability
P2: Important issue, but not time-critical
What happened + What you expected to happen
Hi all,
I'm trying to read a folder whose contents have slight schema variations.
I expected the read to work on the files in ./dataset/data{0..N}.json
I'm getting an error of the sort:
(DoRead pid=44337) pyarrow.lib.ArrowInvalid: Unable to merge: Field Records has incompatible types: struct<a: int64, b: string> vs struct<a: int64, b: int64> [repeated 5x across cluster]
I also can't seem to force the schema with explicit_schema; that fails with the error:
(DoRead pid=45091) pyarrow.lib.ArrowInvalid: JSON parse error: Column(/Records/b) changed from string to number in row 0
Is there any workaround for this, or a way to cast after reading?
ty.
regards,c.
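One possible workaround, sketched under assumptions rather than taken from Ray's documentation: rewrite the files beforehand so that Records.b has a single type everywhere, then point the reader at the normalized folder. The sketch assumes JSON Lines input (one object per line) with a top-level Records struct, as the error messages above suggest. The helper names normalize_file and normalize_folder are hypothetical, and only the Python standard library is used.

```python
import json
from pathlib import Path


def normalize_file(src, dst, field="b"):
    """Copy a JSON Lines file, casting Records[field] to str in every row."""
    with Path(src).open() as fin, Path(dst).open("w") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            records = row.get("Records")
            # Only touch rows that actually carry the field; leave nulls alone.
            if isinstance(records, dict) and records.get(field) is not None:
                records[field] = str(records[field])
            fout.write(json.dumps(row) + "\n")


def normalize_folder(src_dir, dst_dir, field="b"):
    """Normalize every *.json file in src_dir into dst_dir; return new paths."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob("*.json")):
        dst = dst_dir / src.name
        normalize_file(src, dst, field=field)
        written.append(dst)
    return written
```

After normalization every file should infer Records as struct<a: int64, b: string>, so the per-file schemas no longer conflict. The cost is an extra pass over the data before reading, which may not be acceptable for very large folders.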
Versions / Dependencies
uname -a
Linux ip-172-31-33-24.sa-east-1.compute.internal 6.1.27-43.48.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 2 04:53:36 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
python3 --version
Python 3.9.16
pip3 freeze | grep -E "ray|pandas|pyarrow"
pandas==2.0.2
pyarrow==12.0.0
ray==2.4.0
Reproduction script
gen dataset
read folder
read folder with explicit_schema
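A minimal sketch of what the three steps above might look like, assuming JSON Lines files with a single Records struct whose field b is a string in some files and an integer in others. The file count, row values, and function names are hypothetical, not the original script. The two read functions need ray and pyarrow installed and correspond to the calls reported as failing above; passing parse_options through read_json to PyArrow's JSON reader is an assumption about how explicit_schema was supplied.

```python
import json
from pathlib import Path


def gen_dataset(folder, n_files=4, rows_per_file=3):
    """gen dataset: write data{i}.json files where Records.b alternates type."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_files):
        path = folder / f"data{i}.json"
        with path.open("w") as fh:
            for a in range(rows_per_file):
                # Even-numbered files store b as a string, odd ones as an int.
                b = str(a) if i % 2 == 0 else a
                fh.write(json.dumps({"Records": {"a": a, "b": b}}) + "\n")
        paths.append(path)
    return paths


def read_folder(folder):
    """read folder: let each file infer its own schema (requires ray)."""
    import ray

    return ray.data.read_json(str(folder)).take_all()


def read_folder_explicit(folder):
    """read folder with explicit_schema (requires ray and pyarrow)."""
    import pyarrow as pa
    import pyarrow.json as pajson
    import ray

    schema = pa.schema(
        [("Records", pa.struct([("a", pa.int64()), ("b", pa.string())]))]
    )
    # Assumption: extra keyword arguments are forwarded to pyarrow.json.read_json.
    options = pajson.ParseOptions(explicit_schema=schema)
    return ray.data.read_json(str(folder), parse_options=options).take_all()
```

Calling gen_dataset("./dataset") and then either read function is the intended order; only gen_dataset runs without Ray.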
Issue Severity
High: It blocks me from completing my task.