-
Notifications
You must be signed in to change notification settings - Fork 819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized sorting for ordered dictionaries #1048
Changes from all commits
63f1936
38dfd18
60eca77
cf2aa11
a2ca03e
5c31db2
d84fbb3
25fcee6
d5a0575
789f6d1
6ad616e
743ea19
fc7189e
6ccf7f1
199aaeb
791e863
4482b1b
172752a
a013778
17c20e7
a540f08
10bc713
07389a2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,8 @@ | |
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
use crate::buffer::Buffer; | ||
use crate::compute::{sort_to_indices, take, TakeOptions}; | ||
use std::any::Any; | ||
use std::fmt; | ||
use std::iter::IntoIterator; | ||
|
@@ -27,7 +29,7 @@ use super::{ | |
use crate::datatypes::{ | ||
ArrowDictionaryKeyType, ArrowNativeType, ArrowPrimitiveType, DataType, | ||
}; | ||
use crate::error::Result; | ||
use crate::error::{ArrowError, Result}; | ||
|
||
/// A dictionary array where each element is a single value indexed by an integer key. | ||
/// This is mostly used to represent strings or a limited set of primitive types as integers, | ||
|
@@ -163,6 +165,92 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> { | |
self.is_ordered | ||
} | ||
|
||
/// Returns a DictionaryArray referencing the same data | ||
/// with the [DictionaryArray::is_ordered] flag set to `true`. | ||
/// Note that this does not actually reorder the values in the dictionary. | ||
pub fn as_ordered(&self) -> Self { | ||
Self { | ||
data: self.data.clone(), | ||
values: self.values.clone(), | ||
keys: PrimitiveArray::<K>::from(self.keys.data().clone()), | ||
is_ordered: true, | ||
} | ||
} | ||
|
||
pub fn make_ordered(&self) -> Result<Self> { | ||
let values = self.values(); | ||
if self.is_ordered || values.is_empty() { | ||
Ok(self.as_ordered()) | ||
} else { | ||
// validate up front that we can do conversions from/to usize for the whole range of keys | ||
// this allows using faster unchecked conversions below | ||
K::Native::from_usize(values.len()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 |
||
.ok_or(ArrowError::DictionaryKeyOverflowError)?; | ||
// sort indices are u32 so we cannot sort larger dictionaries | ||
u32::try_from(values.len()) | ||
.map_err(|_| ArrowError::DictionaryKeyOverflowError)?; | ||
|
||
// sort the dictionary values | ||
let sort_indices = sort_to_indices(values, None, None)?; | ||
let sorted_dictionary = take( | ||
values.as_ref(), | ||
&sort_indices, | ||
Some(TakeOptions { | ||
check_bounds: false, | ||
}), | ||
)?; | ||
|
||
// build a lookup table from old to new key | ||
let mut lookup = vec![0; sort_indices.len()]; | ||
sort_indices | ||
.values() | ||
.iter() | ||
.enumerate() | ||
.for_each(|(i, idx)| { | ||
lookup[*idx as usize] = i; | ||
}); | ||
|
||
let mapped_keys_iter = self.keys_iter().map(|opt_key| { | ||
if let Some(key) = opt_key { | ||
// Safety: | ||
// lookup has the same length as the dictionary values | ||
// so if the keys were valid for values they will be valid indices into lookup | ||
unsafe { | ||
debug_assert!(key < lookup.len()); | ||
let new_key = *lookup.get_unchecked(key); | ||
debug_assert!(new_key < values.len()); | ||
K::Native::from_usize(new_key).unwrap_unchecked() | ||
} | ||
} else { | ||
K::default_value() | ||
} | ||
}); | ||
|
||
// Safety: | ||
// PrimitiveIter has a trusted len | ||
let new_key_buffer = | ||
unsafe { Buffer::from_trusted_len_iter(mapped_keys_iter) }; | ||
|
||
// Safety: | ||
// after remapping the keys will be in the same range as before | ||
let new_data = unsafe { | ||
ArrayData::new_unchecked( | ||
self.data_type().clone(), | ||
self.len(), | ||
Some(self.data.null_count()), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It makes sense the nullness is the same before and after sorting. I assume it was faster to do this than to create the new values buffer directly from an iterator of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think I benchmarked this separately, but it should be quite a bit faster. In the most common case of offset being zero this would be zero-copy, while creating a PrimitiveArray from an iterator has some overhead per element, checking whether the buffers need to grow and setting individual bits in the validity bitmap. If there is no validity, the iterator api would also construct one first. I might try looking into the performance of the iterator based apis in a separate issue. |
||
self.data | ||
.null_buffer() | ||
.map(|b| b.bit_slice(self.data.offset(), self.len())), | ||
0, | ||
vec![new_key_buffer], | ||
vec![sorted_dictionary.data().clone()], | ||
) | ||
}; | ||
|
||
Ok(DictionaryArray::from(new_data).as_ordered()) | ||
} | ||
} | ||
|
||
/// Return an iterator over the keys (indexes into the dictionary) | ||
pub fn keys_iter(&self) -> impl Iterator<Item = Option<usize>> + '_ { | ||
self.keys | ||
|
@@ -485,6 +573,36 @@ mod tests { | |
.expect("All null array has valid array data"); | ||
} | ||
|
||
#[test] | ||
fn test_dictionary_make_ordered() { | ||
let test = vec![ | ||
Some("b"), | ||
Some("b"), | ||
None, | ||
Some("d"), | ||
Some("d"), | ||
Some("c"), | ||
Some("a"), | ||
]; | ||
let array: DictionaryArray<Int32Type> = test.into_iter().collect(); | ||
|
||
let ordered = array.make_ordered().unwrap(); | ||
let actual_keys = ordered.keys.iter().collect::<Vec<_>>(); | ||
|
||
let expected_keys = | ||
vec![Some(1), Some(1), None, Some(3), Some(3), Some(2), Some(0)]; | ||
assert_eq!(&expected_keys, &actual_keys); | ||
|
||
let expected_values = StringArray::from(vec!["a", "b", "c", "d"]); | ||
let actual_values = ordered | ||
.values | ||
.as_any() | ||
.downcast_ref::<StringArray>() | ||
.unwrap(); | ||
|
||
assert_eq!(&expected_values, actual_values); | ||
} | ||
|
||
#[test] | ||
fn test_dictionary_iter() { | ||
// Construct a value array | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this API should be marked as
unsafe
as it relies on the user "doing the right thing"?Perhaps we could call it something like
pub unsafe fn as_orderd_unchecked()
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I don't think it would be possible to trigger undefined behavior with this method, it would only sort differently. But it certainly makes an assumption that the compiler can not verify.
Maybe the method could also have a better name, my intention was something like
assume_can_be_sorted_by_keys
. Even if the dictionary is not actually sorted, setting the flag allows useful behavior, like sorting by keys and then usinglexicographical_partition_ranges
with the same key-based comparator.