ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts #8402

Closed
28 commits
69f73e1
ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts
carols10cents Oct 7, 2020
12e1dda
Change variable name from index_type to key_type
carols10cents Oct 14, 2020
d55c5e9
cargo fmt
carols10cents Oct 14, 2020
bc13e3a
Change an unwrap to an expect
carols10cents Oct 14, 2020
ae28114
Add a let _
carols10cents Oct 14, 2020
f2f4459
Use roundtrip test helper function
carols10cents Oct 14, 2020
30d8843
We need a custom comparison of ArrayData
nevi-me Oct 10, 2020
141f0c6
Improve some variable names
carols10cents Oct 16, 2020
009da8a
Add a test and update comment to explain why it's ok to drop nulls
carols10cents Oct 16, 2020
af1fd17
Support all numeric dictionary key types
shepmaster Oct 16, 2020
60a3852
Serialize unsigned int dictionary index types
carols10cents Oct 19, 2020
8f621d0
Add a failing test for string dictionary indexed by an unsigned int
carols10cents Oct 19, 2020
be62e4a
Extract a method for converting dictionaries
carols10cents Oct 19, 2020
5f330b2
Extract a macro for string dictionary conversion
carols10cents Oct 19, 2020
45600e6
Convert string dictionaries indexed by unsigned integers too
carols10cents Oct 19, 2020
4cde14e
Convert one kind of primitive dictionary
carols10cents Oct 19, 2020
e45265c
Update based on rebase
carols10cents Oct 26, 2020
3d27a0e
cargo fmt
carols10cents Oct 26, 2020
f2f94fd
Complete dictionary support
nevi-me Oct 25, 2020
9d69248
Switch from general_err to unreachable
carols10cents Oct 26, 2020
f3b287d
Change match with one arm to an if let
carols10cents Oct 26, 2020
bb5d5d7
Remove some type aliases and calls to cast
carols10cents Oct 26, 2020
a1c153f
Remove RecordReader cast and the CastRecordReader trait
carols10cents Oct 26, 2020
bfe7669
Remove some more type aliases
carols10cents Oct 26, 2020
e15ecf7
Move the CastConverter code into PrimitiveArrayReader
carols10cents Oct 26, 2020
7e3d54a
Remove now unneeded CastConverter and BoolConverter
carols10cents Oct 26, 2020
c90485d
Remove a resolved TODO
carols10cents Oct 26, 2020
1cc53e4
Change a panic to unreachable
carols10cents Oct 26, 2020
57 changes: 56 additions & 1 deletion rust/arrow/src/array/data.rs
@@ -29,7 +29,7 @@ use crate::util::bit_util;
/// An generic representation of Arrow array data which encapsulates common attributes and
/// operations for Arrow array. Specific operations for different arrays types (e.g.,
/// primitive, list, struct) are implemented in `Array`.
#[derive(PartialEq, Debug, Clone)]
#[derive(Debug, Clone)]
pub struct ArrayData {
/// The data type for this array data
data_type: DataType,
@@ -209,6 +209,61 @@ impl ArrayData {
}
}

impl PartialEq for ArrayData {
fn eq(&self, other: &Self) -> bool {
assert_eq!(
self.data_type(),
other.data_type(),
"Data types not the same"
);
assert_eq!(self.len(), other.len(), "Lengths not the same");
// TODO: when adding tests for this, test that we can compare with arrays that have offsets
assert_eq!(self.offset(), other.offset(), "Offsets not the same");
assert_eq!(self.null_count(), other.null_count());
// compare buffers excluding padding
let self_buffers = self.buffers();
let other_buffers = other.buffers();
assert_eq!(self_buffers.len(), other_buffers.len());
self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
compare_buffer_regions(
s,
self.offset(), // TODO mul by data length
o,
other.offset(), // TODO mul by data len
);
});
// assert_eq!(self.buffers(), other.buffers());

assert_eq!(self.child_data(), other.child_data());
// null arrays can skip the null bitmap, thus only compare if there are no nulls
if self.null_count() != 0 || other.null_count() != 0 {
compare_buffer_regions(
self.null_buffer().unwrap(),
self.offset(),
other.null_buffer().unwrap(),
other.offset(),
)
}
true
}
}

/// A helper to compare buffer regions of 2 buffers.
/// Compares the length of the shorter buffer.
fn compare_buffer_regions(
left: &Buffer,
left_offset: usize,
right: &Buffer,
right_offset: usize,
) {
// for convenience, we assume that the buffer lengths are only unequal if one has padding,
// so we take the shorter length so we can discard the padding from the longer length
let shorter_len = left.len().min(right.len());
let s_sliced = left.bit_slice(left_offset, shorter_len);
let o_sliced = right.bit_slice(right_offset, shorter_len);
assert_eq!(s_sliced, o_sliced);
}

/// Builder for `ArrayData` type
#[derive(Debug)]
pub struct ArrayDataBuilder {
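
Review note: the `PartialEq` impl in this hunk checks every field with `assert_eq!` and then returns `true`, so comparing two unequal values panics rather than returning `false`, and `!=` or `assert_ne!` can never succeed. A minimal sketch of a non-panicking variant follows. It assumes a simplified stand-in type (`SimpleData`, hypothetical) with byte-vector buffers instead of arrow's `Buffer`, and a hypothetical helper `regions_equal`; it ignores offsets, child data, and null bitmaps, so it is not a drop-in replacement for the code in this PR.

```rust
/// Hypothetical, simplified stand-in for `ArrayData`; not the arrow crate's type.
#[derive(Debug, Clone)]
struct SimpleData {
    len: usize,
    null_count: usize,
    buffers: Vec<Vec<u8>>,
}

/// Returns true when the two byte slices agree over the length of the shorter
/// one, treating any extra bytes in the longer slice as padding.
fn regions_equal(left: &[u8], right: &[u8]) -> bool {
    let shorter_len = left.len().min(right.len());
    left[..shorter_len] == right[..shorter_len]
}

impl PartialEq for SimpleData {
    fn eq(&self, other: &Self) -> bool {
        // Every check yields a bool, so a mismatch returns false instead of panicking.
        self.len == other.len
            && self.null_count == other.null_count
            && self.buffers.len() == other.buffers.len()
            && self
                .buffers
                .iter()
                .zip(&other.buffers)
                .all(|(s, o)| regions_equal(s, o))
    }
}
```

With this shape, a test can still use `assert_eq!(a, b)` to get a panic on mismatch, while non-test callers get an ordinary boolean answer.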
30 changes: 23 additions & 7 deletions rust/arrow/src/ipc/convert.rs
@@ -641,17 +641,23 @@ pub(crate) fn get_fb_dictionary<'a: 'b, 'b>(
fbb: &mut FlatBufferBuilder<'a>,
) -> WIPOffset<ipc::DictionaryEncoding<'b>> {
// We assume that the dictionary index type (as an integer) has already been
// validated elsewhere, and can safely assume we are dealing with signed
// integers
// validated elsewhere, and can safely assume we are dealing with integers
let mut index_builder = ipc::IntBuilder::new(fbb);
index_builder.add_is_signed(true);

match *index_type {
Int8 => index_builder.add_bitWidth(8),
Int16 => index_builder.add_bitWidth(16),
Int32 => index_builder.add_bitWidth(32),
Int64 => index_builder.add_bitWidth(64),
Int8 | Int16 | Int32 | Int64 => index_builder.add_is_signed(true),
UInt8 | UInt16 | UInt32 | UInt64 => index_builder.add_is_signed(false),
_ => {}
}

match *index_type {
Int8 | UInt8 => index_builder.add_bitWidth(8),
Int16 | UInt16 => index_builder.add_bitWidth(16),
Int32 | UInt32 => index_builder.add_bitWidth(32),
Int64 | UInt64 => index_builder.add_bitWidth(64),
_ => {}
}

let index_builder = index_builder.finish();

let mut builder = ipc::DictionaryEncodingBuilder::new(fbb);
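
Review note: the two `match` blocks in this hunk derive the signedness and the bit width of the dictionary index type separately, and both fall through silently on `_ => {}`. One possible alternative is a single exhaustive lookup that returns both properties at once. The sketch below assumes a hypothetical `KeyType` enum standing in for the integer variants of arrow's `DataType`, and a hypothetical helper `int_properties`; neither name comes from the PR.

```rust
/// Hypothetical stand-in for the integer variants of arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
}

/// Returns `(is_signed, bit_width)` for a dictionary key type.
/// The match is exhaustive, so adding a variant forces this function to be updated.
fn int_properties(key_type: KeyType) -> (bool, i32) {
    match key_type {
        KeyType::Int8 => (true, 8),
        KeyType::Int16 => (true, 16),
        KeyType::Int32 => (true, 32),
        KeyType::Int64 => (true, 64),
        KeyType::UInt8 => (false, 8),
        KeyType::UInt16 => (false, 16),
        KeyType::UInt32 => (false, 32),
        KeyType::UInt64 => (false, 64),
    }
}
```

The returned pair could then feed the `add_is_signed` and `add_bitWidth` calls shown in the hunk, keeping the two properties in one place.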
@@ -773,6 +779,16 @@ mod tests {
123,
true,
),
Field::new_dict(
"dictionary<uint8, uint32>",
DataType::Dictionary(
Box::new(DataType::UInt8),
Box::new(DataType::UInt32),
),
true,
123,
true,
),
],
md,
);