ARROW-12411: [Rust] Add Builder interface for adding Arrays to RecordBatches #10063
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -93,7 +93,7 @@ impl RecordBatch { | |
Ok(RecordBatch { schema, columns }) | ||
} | ||
|
||
/// Creates a new empty [`RecordBatch`]. | ||
/// Creates a new empty [`RecordBatch`] based on `schema`. | ||
pub fn new_empty(schema: SchemaRef) -> Self { | ||
let columns = schema | ||
.fields() | ||
|
@@ -103,6 +103,56 @@ impl RecordBatch { | |
RecordBatch { schema, columns } | ||
} | ||
|
||
/// Creates a new [`RecordBatch`] with no columns | ||
/// | ||
/// TODO add a code example using `append` | ||
pub fn new() -> Self { | ||
Self { | ||
schema: Arc::new(Schema::empty()), | ||
columns: Vec::new(), | ||
} | ||
} | ||
|
||
/// Appends the `field_values` array to this `RecordBatch` as a | ||
/// field named `field_name`. | ||
/// | ||
/// TODO: code example | ||
/// | ||
/// TODO: on error, can we return `Self` in some meaningful way? | ||
pub fn append(self, field_name: &str, field_values: ArrayRef) -> Result<Self> { | ||
Given that […] Am I correct that if this errors, the underlying […]
let array = Int32Array::from(vec![1, 2, 3]);
let array_ref = Arc::new(array);
// append to existing batch
let mut batch = ...;
// assume that this fails, does array_ref get dropped as we no longer have references to it?
batch = batch.append("ints", array_ref)?;

I guess I was thinking we would leave the "do I want to have access to the […] So like […]
Which I think is a common pattern in the Rust libraries (if a function needs to clone its argument, it takes it by value instead, which gives the caller a chance to avoid an extra copy if they don't need the argument again). But in this case with an […]
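To make the ownership question above concrete, here is a minimal sketch using only the standard library. `Batch` and its `append` are hypothetical stand-ins for `RecordBatch` and the method under review, not the Arrow API. The sketch shows that when a by-value `append` fails, the `Arc` that was moved in is dropped, but the data survives if the caller kept a clone of the `Arc` beforehand.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a record batch: column names plus shared i32 columns.
#[derive(Debug, Default)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Arc<Vec<i32>>>,
}

impl Batch {
    // Takes `self` and `values` by value, mirroring the signature under review.
    fn append(mut self, name: &str, values: Arc<Vec<i32>>) -> Result<Self, String> {
        if let Some(first) = self.columns.first() {
            if first.len() != values.len() {
                // On this path both `self` and `values` are dropped when the
                // function returns; only the reference counts decrease.
                return Err(format!(
                    "expected length {}, field {} had {}",
                    first.len(),
                    name,
                    values.len()
                ));
            }
        }
        self.names.push(name.to_string());
        self.columns.push(values);
        Ok(self)
    }
}

// Returns (did the mismatched append fail?, strong count of the caller's Arc afterwards).
fn failed_append_keeps_caller_clone() -> (bool, usize) {
    let batch = Batch::default()
        .append("a", Arc::new(vec![1, 2, 3]))
        .unwrap();
    let ints = Arc::new(vec![1, 2]);
    // Cloning an Arc only bumps a reference count; the data is not copied.
    let result = batch.append("ints", Arc::clone(&ints));
    (result.is_err(), Arc::strong_count(&ints))
}
```

A caller who does not need the array again simply moves it in and pays nothing extra; a caller who does need it clones the `Arc` first, as in `failed_append_keeps_caller_clone`.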
||
if let Some(col) = self.columns.get(0) { | ||
if col.len() != field_values.len() { | ||
return Err(ArrowError::InvalidArgumentError( | ||
format!("all columns in a record batch must have the same length. expected {}, field {} had {} ", | ||
col.len(), field_name, field_values.len()) | ||
)); | ||
} | ||
} | ||
|
||
let Self { | ||
schema, | ||
mut columns, | ||
} = self; | ||
|
||
// modify the schema we have if possible, otherwise copy | ||
let mut schema = match Arc::try_unwrap(schema) { | ||
Ok(schema) => schema, | ||
Err(shared_schema) => shared_schema.as_ref().clone(), | ||
}; | ||
|
||
let nullable = field_values.null_count() > 0; | ||
There's a limitation here. If the purpose is to create a single record batch, and that batch is used alone through its lifetime, then this is fine; otherwise we might need to take a […] In any case, I think that if someone uses this to create individual record batches, it'd be inefficient.

I agree this would be an important point to clarify in the comments. If you are creating more than one […] If you are creating a single one then this is more convenient.

This isn't an area that I'm super-familiar with, but are there any existing methods or expectations around merging schemas that differ solely on nullability, e.g. merging two schemas together and inferring the nullability of each field by ORing the nullability on each side of the merge? So if either schema declares the field as nullable, we take the resultant schema as nullable. If not, then I'd agree that we probably should keep this explicitly declared and not infer it here.
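The "OR the nullability on each side" idea from the last comment can be sketched as follows. This is an illustration under assumptions: `Field`, `field`, and `merge_fields` are hypothetical simplified stand-ins written for this example, and the sketch does not describe how Arrow itself merges schemas.

```rust
// Hypothetical simplified field: the data type is just a string label here.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

// Small helper for building fields in examples.
fn field(name: &str, data_type: &str, nullable: bool) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
    }
}

// Merge two field lists that may differ only in nullability.
// A merged field is nullable if either side declares it nullable.
fn merge_fields(left: &[Field], right: &[Field]) -> Result<Vec<Field>, String> {
    if left.len() != right.len() {
        return Err(format!(
            "field count mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    left.iter()
        .zip(right)
        .map(|(l, r)| {
            if l.name != r.name || l.data_type != r.data_type {
                return Err(format!("incompatible fields: {:?} vs {:?}", l, r));
            }
            Ok(field(&l.name, &l.data_type, l.nullable || r.nullable))
        })
        .collect()
}
```

With a rule like this, a batch whose column happens to contain no nulls could still be combined with one that does, at the cost of the merged schema being more permissive than either input.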
||
schema.push(Field::new( | ||
field_name, | ||
field_values.data_type().clone(), | ||
nullable, | ||
)); | ||
let schema = Arc::new(schema); | ||
|
||
columns.push(field_values); | ||
|
||
Ok(Self { schema, columns }) | ||
} | ||
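The "modify the schema we have if possible, otherwise copy" step in `append` is a general `Arc` pattern. Below is a minimal standard-library sketch of it, with a `Vec<String>` standing in for the schema; the function names are hypothetical. The second function shows `Arc::make_mut`, which also mutates in place when the `Arc` is uniquely owned and clones the contents otherwise.

```rust
use std::sync::Arc;

// Same shape as the diff: take ownership if we hold the only reference,
// otherwise clone the contents, then push and re-wrap.
fn push_owned_or_cloned(shared: Arc<Vec<String>>, name: &str) -> Arc<Vec<String>> {
    let mut fields = match Arc::try_unwrap(shared) {
        Ok(fields) => fields,
        Err(still_shared) => still_shared.as_ref().clone(),
    };
    fields.push(name.to_string());
    Arc::new(fields)
}

// Alternative using `Arc::make_mut`, which clones only when the Arc is shared.
fn push_with_make_mut(mut shared: Arc<Vec<String>>, name: &str) -> Arc<Vec<String>> {
    Arc::make_mut(&mut shared).push(name.to_string());
    shared
}
```

Either way, other holders of the original `Arc` never observe the mutation, which is why the copy becomes a cost only when the schema is actually shared.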
|
||
/// Validate the schema and columns using [`RecordBatchOptions`]. Returns an error | ||
/// if any validation check fails. | ||
fn validate_new_batch( | ||
|
@@ -245,6 +295,12 @@ impl RecordBatch { | |
} | ||
} | ||
|
||
impl Default for RecordBatch { | ||
Not totally sure about this because it bypasses the invariants of […]
||
fn default() -> Self { | ||
Self::new() | ||
} | ||
} | ||
|
||
/// Options that control the behaviour used when creating a [`RecordBatch`]. | ||
#[derive(Debug)] | ||
pub struct RecordBatchOptions { | ||
|
@@ -337,6 +393,38 @@ mod tests { | |
assert_eq!(5, record_batch.column(1).data().len()); | ||
} | ||
|
||
#[test] | ||
fn create_record_batch_builder() { | ||
let a = Arc::new(Int32Array::from(vec![ | ||
Some(1), | ||
Some(2), | ||
None, | ||
Some(4), | ||
Some(5), | ||
])); | ||
let b = Arc::new(StringArray::from(vec!["a", "b", "c", "d", "e"])); | ||
|
||
let record_batch = RecordBatch::new() | ||
.append("a", a) | ||
.unwrap() | ||
.append("b", b) | ||
.unwrap(); | ||
|
||
let expected_schema = Schema::new(vec![ | ||
Field::new("a", DataType::Int32, true), | ||
Field::new("b", DataType::Utf8, false), | ||
]); | ||
|
||
assert_eq!(record_batch.schema().as_ref(), &expected_schema); | ||
|
||
assert_eq!(5, record_batch.num_rows()); | ||
assert_eq!(2, record_batch.num_columns()); | ||
assert_eq!(&DataType::Int32, record_batch.schema().field(0).data_type()); | ||
assert_eq!(&DataType::Utf8, record_batch.schema().field(1).data_type()); | ||
assert_eq!(5, record_batch.column(0).data().len()); | ||
assert_eq!(5, record_batch.column(1).data().len()); | ||
} | ||
|
||
#[test] | ||
fn create_record_batch_schema_mismatch() { | ||
let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]); | ||
|
I think it's worthwhile to maintain the property that a `RecordBatch` is immutable, so this feels a bit out of place to me.
How about we create a builder that takes an iterator of `(&str, ArrayRef)` instead? The below is what I mean by out of place
There might be a benefit in having the above, so I'm open to convincing :)
This could partially address your question on the TODO about returning `Self` on error.

This makes sense @nevi-me -- I think a separate `RecordBatchBuilder` sounds like a better idea to me. I'll wait a bit for feedback from others, and then give it a shot.
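As a rough illustration of the iterator-based construction suggested above, here is a sketch that builds an immutable batch in one step from `(&str, column)` pairs. Everything in it is an assumption made for the example: `ColumnRef` stands in for `ArrayRef`, and `Batch::try_from_iter` is a hypothetical name, not an existing Arrow function.

```rust
use std::sync::Arc;

// Hypothetical stand-in for `ArrayRef`: a shared column of optional i32 values.
type ColumnRef = Arc<Vec<Option<i32>>>;

// Immutable once built: there are no methods that modify it after construction.
#[derive(Debug)]
struct Batch {
    names: Vec<String>,
    columns: Vec<ColumnRef>,
}

impl Batch {
    // Validates every column length against the first and builds the batch once.
    fn try_from_iter<'a, I>(columns: I) -> Result<Self, String>
    where
        I: IntoIterator<Item = (&'a str, ColumnRef)>,
    {
        let mut names = Vec::new();
        let mut cols: Vec<ColumnRef> = Vec::new();
        for (name, values) in columns {
            if let Some(first) = cols.first() {
                if first.len() != values.len() {
                    return Err(format!(
                        "all columns must have the same length: expected {}, field {} had {}",
                        first.len(),
                        name,
                        values.len()
                    ));
                }
            }
            names.push(name.to_string());
            cols.push(values);
        }
        Ok(Batch { names, columns: cols })
    }

    fn num_columns(&self) -> usize {
        self.columns.len()
    }

    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }
}
```

Because the schema is assembled once from the whole iterator, this shape avoids rebuilding the schema per appended column, and a failed build leaves no partially constructed batch behind.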