
Fixed a few errors and sped up estimation by a large factor #15

Open · wants to merge 9 commits into master

Conversation

LucaCappelletti94
Hi @jedisct1 - I was benchmarking this implementation of HyperLogLog alongside a few others, and I noticed a few significant errors that I thought best to fix. Here is what I changed:

  1. Fixed an indexing error in the linear-counting threshold lookup: the `- 4` offset had been forgotten, so the precision exponent and the threshold table were misaligned.

rust-hyperloglog/src/lib.rs

Lines 132 to 134 in bf12a8a

fn get_threshold(p: u8) -> f64 {
    THRESHOLD_DATA[p as usize - 4]
}
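For illustration, a minimal sketch of why the offset matters; the table values below are made up, not the crate's real data:

```rust
// THRESHOLD_DATA holds one entry per precision p starting at p = 4, so the
// lookup must subtract 4. Hypothetical values for p = 4, 5, 6 only:
const THRESHOLD_DATA: [f64; 3] = [10.0, 20.0, 40.0];

fn get_threshold(p: u8) -> f64 {
    // Without `- 4`, p = 4 would read an entry intended for a higher precision.
    THRESHOLD_DATA[p as usize - 4]
}

fn main() {
    assert_eq!(get_threshold(4), 10.0); // first entry corresponds to p = 4
    assert_eq!(get_threshold(6), 40.0);
}
```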

  2. Fixed the out-of-order estimate tables and aligned them with the bias tables; these tests now enforce the expected ordering and lengths:

#[test]
fn test_threshold_data_is_sorted() {
    for i in 1..THRESHOLD_DATA.len() {
        assert!(THRESHOLD_DATA[i - 1] < THRESHOLD_DATA[i]);
    }
}

#[test]
fn test_raw_estimate_data_is_sorted() {
    // Start at 0 so the table for the lowest precision (p = 4) is checked too.
    for i in 0..RAW_ESTIMATE_DATA.len() {
        for j in 1..RAW_ESTIMATE_DATA[i].len() {
            assert!(
                RAW_ESTIMATE_DATA[i][j - 1] < RAW_ESTIMATE_DATA[i][j],
                "precision: {}, value: {}",
                i + 4,
                j
            );
        }
    }
}

#[test]
fn test_estimate_bias_length() {
    assert_eq!(BIAS_DATA.len(), 15);
    for i in 0..BIAS_DATA.len() {
        assert_eq!(BIAS_DATA[i].len(), RAW_ESTIMATE_DATA[i].len());
    }
}

  3. Now that the estimates are sorted, we can use a partition-point search to find the nearest neighbours.

rust-hyperloglog/src/lib.rs

Lines 157 to 181 in f190a71

fn estimate_bias(estimate: f64, p: u8) -> f64 {
    let bias_vector = BIAS_DATA[(p - 4) as usize];
    let estimate_vector = RAW_ESTIMATE_DATA[(p - 4) as usize];
    // Since the estimates are sorted, we can use a partition point to find the nearest neighbors
    let partition_point = estimate_vector.partition_point(|&x| x < estimate);
    let mut min = if partition_point > 6 {
        partition_point - 6
    } else {
        0
    };
    let mut max = core::cmp::min(partition_point + 6, estimate_vector.len());
    while max - min != 6 {
        let (min_val, max_val) = (estimate_vector[min], estimate_vector[max - 1]);
        if 2.0 * estimate - min_val > max_val {
            min += 1;
        } else {
            max -= 1;
        }
    }
    (min..max).map(|i| bias_vector[i]).sum::<f64>() / 6.0
}
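The window-shrinking logic can be exercised in isolation. Here is a toy sketch on a made-up sorted table; `nearest_six` is a hypothetical helper written for illustration, not part of the PR:

```rust
// Given a sorted estimate table, pick the 6 entries nearest to `estimate`:
// start around the partition point, then shrink the window from whichever
// end is farther away. This mirrors the loop inside estimate_bias.
fn nearest_six(estimates: &[f64], estimate: f64) -> (usize, usize) {
    let pp = estimates.partition_point(|&x| x < estimate);
    let mut min = pp.saturating_sub(6);
    let mut max = pp.saturating_add(6).min(estimates.len());
    while max - min != 6 {
        let (lo, hi) = (estimates[min], estimates[max - 1]);
        // Drop the endpoint whose value is farther from `estimate`.
        if 2.0 * estimate - lo > hi {
            min += 1;
        } else {
            max -= 1;
        }
    }
    (min, max)
}

fn main() {
    let table: Vec<f64> = (1..=20).map(f64::from).collect();
    // 10.3 falls between 10.0 and 11.0; the window covers values 8.0..=13.0.
    assert_eq!(nearest_six(&table, 10.3), (7, 13));
}
```

Because the table is sorted, the partition point lands in O(log n) and the window shrinks at most 6 steps, versus scanning every entry to find the nearest neighbours.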

Other changes are mostly aesthetic to make the code, in my opinion, more readable.

Do note that this pull request needs the siphash pull request to be merged first.

Cheers!

// Before:
let sip = &mut self.sip.clone();
value.hash(sip);
// After:
let mut sip = self.sip;
value.hash(&mut sip);
let x = sip.finish();


Are you sure this is correct? I'm under the impression that SipHasher13 maintains state that isn't cleared by finish, meaning that future identical values will hash differently.

Owner


I think let mut sip = self.sip; implicitly clones the value.


Ah, right, it implements Copy.

Author


Yeah, although I think replacing it with a generic that implements the Hasher trait would be cleaner.
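A rough sketch of that suggestion, with illustrative names only (not the crate's actual API): parameterise the struct over a `BuildHasher` and build a fresh hasher per value, which also sidesteps the state question without relying on `Copy`.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hypothetical generic wrapper; field and type names are illustrative.
struct HyperLogLog<B: BuildHasher> {
    build_hasher: B,
    // ...registers elided...
}

impl<B: BuildHasher> HyperLogLog<B> {
    fn hash_value<V: Hash>(&self, value: &V) -> u64 {
        // A fresh hasher per value, so no state leaks between insertions.
        let mut hasher = self.build_hasher.build_hasher();
        value.hash(&mut hasher);
        hasher.finish()
    }
}

fn main() {
    let hll = HyperLogLog { build_hasher: RandomState::new() };
    // The same value hashes identically within one instance.
    assert_eq!(hll.hash_value(&"abc"), hll.hash_value(&"abc"));
}
```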
