
Fixed a few errors and sped up estimation by a large factor #15

Open · wants to merge 9 commits into master

Conversation

LucaCappelletti94
Hi @jedisct1 - I was benchmarking this implementation of HyperLogLog alongside a few others, and I noticed a few significant errors that I thought best to fix. Here is what I changed:

  1. Fixed an indexing error in the linear-counting threshold lookup: the `- 4` offset had been forgotten, so the precision exponent and the threshold table were misaligned.

rust-hyperloglog/src/lib.rs

Lines 132 to 134 in bf12a8a

fn get_threshold(p: u8) -> f64 {
    THRESHOLD_DATA[p as usize - 4]
}
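For illustration, a minimal sketch of why the offset matters; the table values below are made up, not the crate's real data:

```rust
// THRESHOLD_DATA holds one entry per precision p starting at p = 4, so the
// lookup must subtract 4. Hypothetical values for p = 4, 5, 6 only:
const THRESHOLD_DATA: [f64; 3] = [10.0, 20.0, 40.0];

fn get_threshold(p: u8) -> f64 {
    // Without `- 4`, p = 4 would read an entry intended for a higher precision.
    THRESHOLD_DATA[p as usize - 4]
}

fn main() {
    assert_eq!(get_threshold(4), 10.0); // first entry corresponds to p = 4
    assert_eq!(get_threshold(6), 40.0);
}
```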

  2. Fixed the out-of-order estimate tables and aligned them with the bias tables; these tests now enforce the expected ordering and lengths:

#[test]
fn test_threshold_data_is_sorted() {
    for i in 1..THRESHOLD_DATA.len() {
        assert!(THRESHOLD_DATA[i - 1] < THRESHOLD_DATA[i]);
    }
}

#[test]
fn test_raw_estimate_data_is_sorted() {
    // Start at 0 so the table for the lowest precision (p = 4) is checked too.
    for i in 0..RAW_ESTIMATE_DATA.len() {
        for j in 1..RAW_ESTIMATE_DATA[i].len() {
            assert!(
                RAW_ESTIMATE_DATA[i][j - 1] < RAW_ESTIMATE_DATA[i][j],
                "precision: {}, value: {}",
                i + 4,
                j
            );
        }
    }
}

#[test]
fn test_estimate_bias_length() {
    assert_eq!(BIAS_DATA.len(), 15);
    for i in 0..BIAS_DATA.len() {
        assert_eq!(BIAS_DATA[i].len(), RAW_ESTIMATE_DATA[i].len());
    }
}

  3. Now that the estimates are sorted, we can use a partition-point search to find the nearest neighbours.

rust-hyperloglog/src/lib.rs

Lines 157 to 181 in f190a71

fn estimate_bias(estimate: f64, p: u8) -> f64 {
    let bias_vector = BIAS_DATA[(p - 4) as usize];
    let estimate_vector = RAW_ESTIMATE_DATA[(p - 4) as usize];
    // Since the estimates are sorted, we can use a partition point to find the nearest neighbors
    let partition_point = estimate_vector.partition_point(|&x| x < estimate);
    let mut min = if partition_point > 6 {
        partition_point - 6
    } else {
        0
    };
    let mut max = core::cmp::min(partition_point + 6, estimate_vector.len());
    while max - min != 6 {
        let (min_val, max_val) = (estimate_vector[min], estimate_vector[max - 1]);
        if 2.0 * estimate - min_val > max_val {
            min += 1;
        } else {
            max -= 1;
        }
    }
    (min..max).map(|i| bias_vector[i]).sum::<f64>() / 6.0
}
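The window-shrinking logic can be exercised in isolation. Here is a toy sketch on a made-up sorted table; `nearest_six` is a hypothetical helper written for illustration, not part of the PR:

```rust
// Given a sorted estimate table, pick the 6 entries nearest to `estimate`:
// start around the partition point, then shrink the window from whichever
// end is farther away. This mirrors the loop inside estimate_bias.
fn nearest_six(estimates: &[f64], estimate: f64) -> (usize, usize) {
    let pp = estimates.partition_point(|&x| x < estimate);
    let mut min = pp.saturating_sub(6);
    let mut max = pp.saturating_add(6).min(estimates.len());
    while max - min != 6 {
        let (lo, hi) = (estimates[min], estimates[max - 1]);
        // Drop the endpoint whose value is farther from `estimate`.
        if 2.0 * estimate - lo > hi {
            min += 1;
        } else {
            max -= 1;
        }
    }
    (min, max)
}

fn main() {
    let table: Vec<f64> = (1..=20).map(f64::from).collect();
    // 10.3 falls between 10.0 and 11.0; the window covers values 8.0..=13.0.
    assert_eq!(nearest_six(&table, 10.3), (7, 13));
}
```

Because the table is sorted, the partition point lands in O(log n) and the window shrinks at most 6 steps, versus scanning every entry to find the nearest neighbours.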

Other changes are mostly aesthetic to make the code, in my opinion, more readable.

Do note that this pull request needs the siphash pull request to be merged first.

Cheers!

// Before:
let sip = &mut self.sip.clone();
value.hash(sip);
// After:
let mut sip = self.sip;
value.hash(&mut sip);
let x = sip.finish();


Are you sure this is correct? I'm under the impression that SipHasher13 maintains state that isn't cleared by finish, meaning that future identical values will hash differently.

Owner


I think let mut sip = self.sip; implicitly clones the value.


Ah, right, it implements Copy.

Author


Yeah, although I think replacing it with a generic that implements the Hasher trait would be cleaner.
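A rough sketch of that suggestion, with illustrative names only (not the crate's actual API): parameterise the struct over a `BuildHasher` and build a fresh hasher per value, which also sidesteps the state question without relying on `Copy`.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hypothetical generic wrapper; field and type names are illustrative.
struct HyperLogLog<B: BuildHasher> {
    build_hasher: B,
    // ...registers elided...
}

impl<B: BuildHasher> HyperLogLog<B> {
    fn hash_value<V: Hash>(&self, value: &V) -> u64 {
        // A fresh hasher per value, so no state leaks between insertions.
        let mut hasher = self.build_hasher.build_hasher();
        value.hash(&mut hasher);
        hasher.finish()
    }
}

fn main() {
    let hll = HyperLogLog { build_hasher: RandomState::new() };
    // The same value hashes identically within one instance.
    assert_eq!(hll.hash_value(&"abc"), hll.hash_value(&"abc"));
}
```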
