Consistent Schema Enforcement #4801
Comments
More context: #4800 (comment)
I wonder if we could have "strict" and "non strict" schema checking -- e.g. for some things like arrow-json where there is a configuration object there is a natural place to add
However, functions like
Perhaps concat, etc. could take an explicit schema; this would also sidestep the issues around an empty slice. I could also be convinced to do away with all the validation and just do explicit validation in the places where it matters to correctness, e.g. parquet and nullability.
https://docs.rs/arrow/latest/arrow/compute/fn.concat_batches.html
pub fn concat_batches<'a>(
    schema: &Arc<Schema, Global>,
    input_batches: impl IntoIterator<Item = &'a RecordBatch>
) -> Result<RecordBatch, ArrowError>
In which case perhaps that is the answer to #4800: just use the provided schema and don't perform any additional validation?
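To make the "explicit schema" idea concrete, here is a minimal sketch. It does not use the arrow crate: `Schema`, `Batch` and `concat_with_schema` are hypothetical stand-ins invented for illustration, and the only check kept is the column count. Under those assumptions it shows how a caller-supplied schema gives a well-defined result even for an empty input slice, and how input batches with different field names are accepted without further validation.

```rust
// Hypothetical, simplified stand-ins; these are NOT the arrow crate's types.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct Batch {
    schema: Schema,
    columns: Vec<Vec<i64>>,
}

/// Concatenate `batches` under the caller-supplied `schema`.
/// The schemas carried by the inputs are ignored; only the column count
/// is checked, since a mismatch there cannot be concatenated at all.
fn concat_with_schema(schema: &Schema, batches: &[Batch]) -> Result<Batch, String> {
    // One empty output column per field, so an empty input still has a schema.
    let mut columns: Vec<Vec<i64>> = vec![Vec::new(); schema.fields.len()];
    for (i, batch) in batches.iter().enumerate() {
        if batch.columns.len() != columns.len() {
            return Err(format!(
                "batch {} has {} columns, expected {}",
                i,
                batch.columns.len(),
                columns.len()
            ));
        }
        for (out, col) in columns.iter_mut().zip(&batch.columns) {
            out.extend_from_slice(col);
        }
    }
    Ok(Batch {
        schema: schema.clone(),
        columns,
    })
}
```

The trade-off in this sketch is that the caller takes responsibility for the schema being right; places where a mismatch affects correctness would still need their own explicit checks.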
There are some field names that are kind of useless (such as those in
Field metadata definitely matters, since that may contain extension type information. I'm not sure about top-level schema metadata. In many cases I think I'd be fine ignoring that by default, or at least I haven't encountered a situation where I really wanted it.
What do you think of removing field names from those types? I find them a bit annoying sometimes. Or is there any place where they matter?
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Comparing primitive arrays for equality, perhaps in the context of a compute kernel, is relatively straightforward. A DataType::Int8 is equal to DataType::Int8 and not equal to DataType::UInt8.
For nested types such as StructArray, ListArray and RecordBatch this gets more complex: how strictly should we enforce that a schema is consistent? Should we allow an array to be of a different type to its schema? What about nullability or metadata?
We currently have a range of approaches:
DataType::equals_datatype, ignoring metadata and field names, but validating nullability
Schema::contains the provided batch schema, this forces nullability and metadata to be a subset
Describe the solution you'd like
I don't really know. Eagerly performing validation can help to catch bugs and issues, but on the flip side it is frustrating to be validating things like field names, metadata, or even nullability that in most cases won't make a difference to correctness.
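For readers comparing the levels of strictness mentioned above, the following is a rough sketch of how they differ. It is not the arrow implementation: `DataType`, `Field`, `relaxed_eq` and `field_contains` are simplified, hypothetical stand-ins. `relaxed_eq` assumes semantics similar to what is described for DataType::equals_datatype (ignore names and metadata, keep nullability), and `field_contains` assumes subset-style semantics similar to what is described for Schema::contains.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins; these are NOT the arrow crate's types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    UInt8,
    List(Box<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: BTreeMap<String, String>,
}

impl Field {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field {
            name: name.to_string(),
            data_type,
            nullable,
            metadata: BTreeMap::new(),
        }
    }
}

/// Relaxed comparison: ignores nested field names and metadata,
/// but still requires nested nullability to match.
fn relaxed_eq(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(x), DataType::List(y)) => {
            x.nullable == y.nullable && relaxed_eq(&x.data_type, &y.data_type)
        }
        (DataType::Struct(xs), DataType::Struct(ys)) => {
            xs.len() == ys.len()
                && xs.iter().zip(ys).all(|(x, y)| {
                    x.nullable == y.nullable && relaxed_eq(&x.data_type, &y.data_type)
                })
        }
        // Primitive types, or mismatched variants: fall back to strict equality.
        _ => a == b,
    }
}

/// Subset-style check: `a` contains `b` if names and types match,
/// `a` is at least as nullable as `b`, and `b`'s metadata is a subset of `a`'s.
fn field_contains(a: &Field, b: &Field) -> bool {
    a.name == b.name
        && a.data_type == b.data_type
        && (a.nullable || !b.nullable)
        && b.metadata.iter().all(|(k, v)| a.metadata.get(k) == Some(v))
}
```

In this sketch two list types whose child fields are named "item" and "element" are unequal under the derived `==` but equal under `relaxed_eq`, which is the kind of difference that in most cases has no bearing on correctness.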
Describe alternatives you've considered
Additional context
#1888
#3226
#4799