Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordBatchWriter only creates stats for the first 32 columns; this prevents calling create_checkpoint. #2745

Closed
adamfaulkner-at opened this issue Aug 8, 2024 · 7 comments
Labels
binding/rust Issues for the Rust crate bug Something isn't working

Comments

@adamfaulkner-at
Copy link
Contributor

Environment

Delta-rs version: 0.18.1

Binding: Rust (?)

Environment: MacOS, Linux

  • Cloud provider: AWS
  • OS: MacOS, Linux
  • Other:

Bug

Hi, RecordBatchWriter passes a constant DEFAULT_NUM_INDEX_COLS (which has value 32) to create_add (here)[https://github.com/delta-io/delta-rs/blob/main/crates/core/src/writer/record_batch.rs#L232]. This prevents us from later calling create_checkpoint, as this function seems to assert that all columns were indexed.

It feels like the number of indexed columns should be configurable at the table level. Alternatively, if we had some way to tell the RecordBatchWriter that we want to index all of the columns, we would be unblocked.

What happened:

RecordBatchWriter produces a commit that only includes the first 32 columns in the table in column stats.

What you expected to happen:

I would expect all columns to be included in column stats, or otherwise I would expect this behavior to be configurable.

At least, I would like to be able to create a checkpoint by calling create_checkpoint. Currently, this crashes with the following:

Failed to convert into Arrow schema: Json error: whilst decoding field 'add': whilst decoding field 'stats_parsed': whilst decoding field 'minValues': Encountered unmasked nulls in non-nullable StructArray child: Field { name: "userPrimaryKey", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }

(userPrimaryKey is the name of one of our table's columns).

How to reproduce it:

Simply write to a delta lake table using RecordBatchWriter and attempt to create a checkpoint by calling create_checkpoint.

More details:

@adamfaulkner-at adamfaulkner-at added the bug Something isn't working label Aug 8, 2024
@sherlockbeard
Copy link
Contributor

maybe duplicate of #2675
you should be able to control column index using configuration={"delta.dataSkippingNumIndexedCols": "2"}, or DataSkippingStatsColumns

Please create a MRE

@ion-elgreco
Copy link
Collaborator

"I would expect all columns to be included in column stats, or otherwise I would expect this behavior to be configurable."

Collecting all columns, will explode the logs and checkpoints.

If you do want this behavior you can set the numIndexedCols to -1

@rtyler rtyler added the binding/rust Issues for the Rust crate label Aug 10, 2024
@adamfaulkner-at
Copy link
Contributor Author

@sherlockbeard

you should be able to control column index using configuration={"delta.dataSkippingNumIndexedCols": "2"}, or DataSkippingStatsColumns

Please create a MRE

I'll give this a shot, thanks for the suggestion. I don't see how this would work, though. If you look at the lines of code that I linked, you'll see that it uses a value of 32 regardless of any configuration. That is what this issue is about.

RecordBatchWriter only takes in a table URI and storage options when it is constructed and doesn't appear to "magically" fetch configuration from somewhere else.

Sorry, what is an MRE?

"I would expect all columns to be included in column stats, or otherwise I would expect this behavior to be configurable."

Collecting all columns, will explode the logs and checkpoints.

If you do want this behavior you can set the numIndexedCols to -1

Just to reiterate, I do not see anywhere in the code that we read this configuration from the RecordBatchWriter in the course of writing a batch of records.

@ion-elgreco
Copy link
Collaborator

My bad got confused here, I am wondering why you are using the RecordBatchWriter though and not the normal write operation?

@sherlockbeard
Copy link
Contributor

Sorry @adamfaulkner-at for confusion i thought it was a direct dataSkippingNumIndexedCols question without reading all . 😭

MRE => Minimal reproducible example

@adamfaulkner-at
Copy link
Contributor Author

Hey, I think my main problem with checkpointing was actually fixed in 0.18.2: #2627

I am sorry, I didn't realize that I wasn't on the most recent version. I'll give this a shot and report back or close this issue.

I am wondering why you are using the RecordBatchWriter though and not the normal write operation?

I did not realize that the normal write operation existed, and RecordBatchWriter seemed like what I wanted. I'll switch to using the write operation.

@adamfaulkner-at
Copy link
Contributor Author

Thanks for your help, this is definitely fixed in 0.18.2!

It would be beneficial if the docs pointed users towards WriteBuilder over RecordBatchWriter, it was not clear to me when I was initially looking that WriteBuilder is the more supported interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants