Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support RecordBatch with zero columns but non zero row count, add field to RecordBatchOptions (#1536) #1552

Merged
merged 5 commits into from
Apr 12, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #1536

Rationale for this change

See ticket

What changes are included in this PR?

Makes it possible to construct a RecordBatch with no columns, but a non-zero row count. I had a brief search in the codebase and couldn't find anything obvious that this would break, but I guess there is only one way to find out 😄

Are there any user-facing changes?

This technically makes a breaking change to RecordBatchOptions as it wasn't marked as non-exhaustive.

schema: &SchemaRef,
columns: &[ArrayRef],
/// if any validation check fails, otherwise returns the created [`RecordBatch`]
fn try_new_impl(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure why we need a separate method for this, but I avoided changing it to keep the diff down

@@ -402,15 +402,20 @@ impl RecordBatch {

/// Options that control the behaviour used when creating a [`RecordBatch`].
#[derive(Debug)]
#[non_exhaustive]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a new option imo shouldn't be a breaking change, this should give us that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe it is time to make these fields non pub and add with_match_field_name(&mut self, val: bool) type accessors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel particularly strongly either way, although have to write and maintain setters and accessors is a bit of a PIA

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although have to write and maintain setters and accessors is a bit of a PIA

The price for backwards compatibility?

Copy link
Contributor Author

@tustvold tustvold Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non_exhaustive should make additive changes non-breaking, which should be good enough hopefully 😀

Happy to change if you feel strongly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly

@alamb alamb added arrow Changes to the arrow crate api-change Changes to the arrow API labels Apr 12, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good 👍

arrow/src/record_batch.rs Show resolved Hide resolved
"must either specify a row count or at least one column".to_string(),
)
})?;

if columns.iter().any(|c| c.len() != row_count) {
return Err(ArrowError::InvalidArgumentError(
"all columns in a record batch must have the same length".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error will now also happen when the specified row count doesn't match the array counts (so the wording might be good to update?)

@@ -243,7 +248,7 @@ impl RecordBatch {
/// # }
/// ```
pub fn num_rows(&self) -> usize {
self.columns[0].data().len()
self.row_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could avoid adding row_count with something like

self
  .columns
  .get(0)
  .map(|col| col.data.len())
  .unwrap_or(0)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would always return 0 if there are no columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- I think I was a little confused -- is the idea that the schema also has 0 fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it's basically a way to support projecting no columns and just getting the row count. This is needed for #1537

@@ -402,15 +402,20 @@ impl RecordBatch {

/// Options that control the behaviour used when creating a [`RecordBatch`].
#[derive(Debug)]
#[non_exhaustive]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe it is time to make these fields non pub and add with_match_field_name(&mut self, val: bool) type accessors?

@@ -402,15 +402,20 @@ impl RecordBatch {

/// Options that control the behaviour used when creating a [`RecordBatch`].
#[derive(Debug)]
#[non_exhaustive]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although have to write and maintain setters and accessors is a bit of a PIA

The price for backwards compatibility?

@alamb alamb merged commit c9549bb into apache:master Apr 12, 2022
@alamb alamb changed the title Create RecordBatch With Non-Zero Row Count But No Columns (#1536) Support RecordBatch with No Columns but Non-Zero Row Count (#1536) Apr 15, 2022
@alamb alamb changed the title Support RecordBatch with No Columns but Non-Zero Row Count (#1536) Support RecordBatch with No Columns but Non-Zero Row Count, add field to RecordBatchOptions (#1536) Apr 15, 2022
@alamb alamb changed the title Support RecordBatch with No Columns but Non-Zero Row Count, add field to RecordBatchOptions (#1536) Support RecordBatch with zero columns but non zero row count, add field to RecordBatchOptions (#1536) Apr 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support RecordBatch with zero columns but non zero row count
2 participants