Split out arrow-array crate (#2594) #2769

Merged (6 commits) on Sep 26, 2022

@tustvold tustvold commented Sep 22, 2022

Draft as I wish to perform another pass, and double-check the benchmarks

Which issue does this PR close?

Part of #2594

Rationale for this change

Continues the process of splitting apart the crate, so that components can depend on just what they need, compilation parallelizes better, etc.

What changes are included in this PR?

Moves the array, array builders, and record batch definitions into a new arrow-array crate

Are there any user-facing changes?

The deprecated RecordBatch::concat is removed; otherwise there are no breaking changes 🎉

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 22, 2022
@@ -166,19 +166,6 @@ impl<T: PyArrowConvert> PyArrowConvert for Vec<T> {
}
}

impl<T> PyArrowConvert for T
Contributor Author

This blanket impl can no longer be written: the compiler rejects it, complaining that arrow_schema::DataType could later be updated to implement Array + From<ArrayData>, which would then conflict with the impl PyArrowConvert for DataType.

Ultimately this impl is not hugely important, as it is just a case of using make_array and Array::data
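
To illustrate the workaround described above, here is a minimal, std-only sketch. Every type in it (ArrayData, Array, Int32Array, make_array, Convert) is a simplified stand-in written for this example, not the real arrow API, and the string-based export format is purely an assumption for demonstration. The point is only the shape of the approach: the conversion trait is implemented once for the data type, and callers go through Array::data on the way out and make_array on the way back in, instead of relying on a blanket impl over every array type.

```rust
// Stand-ins for illustration only; these are NOT the real arrow types.
#[derive(Debug, Clone, PartialEq)]
struct ArrayData {
    values: Vec<i32>,
}

/// Stand-in for the `Array` trait: every array exposes its underlying data.
trait Array {
    fn data(&self) -> &ArrayData;
}

struct Int32Array {
    data: ArrayData,
}

impl Array for Int32Array {
    fn data(&self) -> &ArrayData {
        &self.data
    }
}

/// Stand-in for `make_array`: rebuild a type-erased array from its data.
fn make_array(data: ArrayData) -> Box<dyn Array> {
    Box::new(Int32Array { data })
}

/// Hypothetical conversion trait playing the role of `PyArrowConvert`.
trait Convert: Sized {
    fn export(&self) -> String;
    fn import(s: &str) -> Option<Self>;
}

// Implemented once for the data type, rather than through a blanket
// `impl<T: Array + From<ArrayData>> Convert for T`.
impl Convert for ArrayData {
    fn export(&self) -> String {
        self.values
            .iter()
            .map(|v| v.to_string())
            .collect::<Vec<_>>()
            .join(",")
    }

    fn import(s: &str) -> Option<Self> {
        // Any element that fails to parse makes the whole import fail.
        let values = s
            .split(',')
            .map(|part| part.parse().ok())
            .collect::<Option<Vec<i32>>>()?;
        Some(ArrayData { values })
    }
}

/// Export via `Array::data`, then import and rebuild via `make_array`.
fn round_trip(array: &dyn Array) -> Option<Box<dyn Array>> {
    let exported = array.data().export();
    ArrayData::import(&exported).map(make_array)
}
```

With this shape, a caller holding any array writes array.data().export() explicitly, which mirrors the array.data().to_pyarrow(py) call that appears later in this PR.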

@tustvold tustvold marked this pull request as ready for review September 23, 2022 17:01
@alamb alamb added the api-change Changes to the arrow API label Sep 24, 2022

alamb commented Sep 24, 2022

marking as api-change due to removal of RecordBatch::concat

@alamb alamb left a comment

This is a pretty epic PR -- I went through it fairly carefully and it looks great to me

/// assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
/// assert_eq!(array.values(), &values);
/// ```
pub type Int8DictionaryArray = DictionaryArray<Int8Type>;
Contributor

these pub types are new, right?

use std::any::Any;

///
/// # Example: Using `collect`
Contributor

💯 for adding basic doc examples to these typedefs

assert!(!as_decimal_array(&array).is_empty());
let result_decimal = as_decimal_array(&array);
assert_eq!(result_decimal, &array);
}
Contributor

Nice

use crate::builder::*;

#[test]
fn test_buffer_builder_availability() {
Contributor

I wonder if this is the kind of thing that should be in a tests-style integration test, to ensure that the types are pub and not pub(crate), for example

fn schema(&self) -> SchemaRef;

/// Reads the next `RecordBatch`.
#[deprecated(
Contributor

🎉 This is another breaking API change (nice cleanup)

@@ -51,7 +51,7 @@ fn double(array: &PyAny, py: Python) -> PyResult<PyObject> {
let array = kernels::arithmetic::add(array, array).map_err(to_py_err)?;

// export
- array.to_pyarrow(py)
+ array.data().to_pyarrow(py)
Contributor

this seems a very reasonable change

(but is it also an API change?)

Contributor Author

Yes, sorry, I made this change after I wrote the PR description.

fn schema(&self) -> SchemaRef;

/// Reads the next `RecordBatch`.
#[deprecated(
Contributor

Oh, whoops -- maybe we should remove this deprecated API (perhaps as a follow on PR)

@alamb alamb changed the title Split out arrow-array Split out arrow-array crate Sep 24, 2022

alamb commented Sep 25, 2022

FWIW it occurs to me we probably need to update the GitHub workflow triggers to reflect this new code organization.

For example:
https://github.com/apache/arrow-rs/blob/master/.github/workflows/arrow.yml#L21-L29

should probably include arrow-array and arrow-buffer

@tustvold
Contributor Author

Running the benchmarks, some of the faster benchmarks do show the odd ~10% regression, but we're talking tens of microseconds here. I'm inclined to think this is not an issue, and if it turns out to be one, we can revisit those kernels.

@tustvold tustvold merged commit 06c204c into apache:master Sep 26, 2022

ursabot commented Sep 26, 2022

Benchmark runs are scheduled for baseline = 6bee576 and contender = 06c204c. 06c204c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@tustvold tustvold changed the title Split out arrow-array crate Split out arrow-array crate (#2594) Sep 26, 2022