use fst for sstable index #2268

trinity-1686a · 2023-11-22T18:08:33Z

Makes loading an sstable index basically free. Previously you had to decode a data structure, which could take a non-negligible amount of time: in practice ~200ms for a 125M-term table with keys encoding JSON paths.


sstable/README.md

sstable/src/dictionary.rs

sstable/src/sstable_index.rs

fulmicoton · 2023-11-24T14:20:10Z

sstable/src/sstable_index.rs

+    // we actually add some correcting factor to have proper rounding, not truncation.
+
+    let denominator = (min_slop_idx + max_slop_idx) as u64;
+    let final_slop = ((min_slop_val + max_slop_val + denominator / 2) / denominator) as u32;


Suggested change

let final_slop = ((min_slop_val + max_slop_val + denominator / 2) / denominator) as u32;

let final_slop = ((min_slop_val + max_slop_val + denominator / 2) / denominator) as u32;

Would using a fractional slope help here? (Not a float, just a fixed-point value with, for instance, 4 bits for the fractional part.)

let final_slop = ((min_slop_val + max_slop_val + denominator / 2) * 16 / denominator) as u32;

That code is enough to make sure we round in the right direction. I think using fixed-point integers wouldn't really help, as we would still need a similar trick when collapsing the fixed-point integers to whole numbers.
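For readers following the thread, the rounding trick can be sketched in isolation. This is a minimal illustration, not code from the PR: `div_round` is a hypothetical helper name, and it assumes a non-zero denominator and no overflow in the addition.

```rust
/// Integer division that rounds to nearest instead of truncating.
/// Hypothetical helper for illustration; assumes `denominator > 0`
/// and that `numerator + denominator / 2` does not overflow.
fn div_round(numerator: u64, denominator: u64) -> u64 {
    // Adding half the denominator before dividing turns truncation into rounding.
    (numerator + denominator / 2) / denominator
}

/// Small demonstration: plain division truncates, `div_round` rounds.
fn demo_div_round() -> (u64, u64) {
    (7 / 2, div_round(7, 2))
}
```

Plain `7 / 2` truncates to 3, while `div_round(7, 2)` gives 4.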

I meant: wouldn't a fractional slope make the prediction more accurate for high indexes, and eventually help with the compression?

as we would still need to use a similar trick when collapsing the fixed point integers to whole numbers

I suspect you forgot there is a multiplication in between.

truncateroundorwhatever(slope*idx)
vs
idx * truncateroundorwhatever(slope)
will actually give you very different precision.

I would not be surprised if you could shave off one or two bits in your encoding.
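The difference between rounding before and after the multiplication can be sketched as follows. This is an idealized example on perfectly linear data, with hypothetical function names and a 4-bit fractional part taken from the suggestion above; it says nothing about the gain on a real dataset.

```rust
/// Number of fractional bits in the fixed-point slope (assumption: 4, as suggested above).
const FRAC_BITS: u32 = 4;

/// Returns (whole-number slope, fixed-point slope) for a line reaching `total` after `steps`.
/// Both are rounded to nearest. Assumes `steps > 0` and no overflow.
fn slopes(total: u64, steps: u64) -> (u64, u64) {
    let int_slope = (total + steps / 2) / steps;
    let fixed_slope = ((total << FRAC_BITS) + steps / 2) / steps;
    (int_slope, fixed_slope)
}

/// idx * round(slope): the rounding error of the slope is multiplied by idx.
fn predict_int_slope(int_slope: u64, idx: u64) -> u64 {
    idx * int_slope
}

/// round-ish(slope * idx): multiply first, drop the fractional bits last.
fn predict_fixed_slope(fixed_slope: u64, idx: u64) -> u64 {
    (idx * fixed_slope) >> FRAC_BITS
}
```

With `total = 2080` and `steps = 200` (a true slope of 10.4), the whole-number slope is 10 and predicts 2000 at index 200, an error of 80. The fixed-point slope is 166 (10.375) and predicts 2075, an error of 5, so the stored correction would need fewer bits in this contrived case.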

For the record: I tried it, and on my sample dataset the gain was about 0.1%. Either I did something wrong that I can't figure out, or the impact is negligible.

sstable/src/sstable_index.rs

fulmicoton · 2023-11-24T14:36:07Z

sstable/src/sstable_index.rs

+
+        let cmp = cmp_fn(mid);
+
+        if cmp == Less {


nitpick: match would be nice here.

This implementation was in part copied from libcore. It said the following about that comparison:

// The reason why we use if/else control flow rather than match
// is because match reorders comparison operations, which is perf sensitive.
// This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra.

I am seeing the same assembly code with match. Maybe the compiler improved?
Anyway, no need to fix.
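For context, a binary search driven by a comparator closure with the if/else form might look like the sketch below. It is a simplified illustration in the spirit of the libcore routine, not the PR's code; the function name and signature are assumptions.

```rust
use std::cmp::Ordering::{self, Greater, Less};

/// Binary search over the indexes `0..len`.
/// `cmp_fn(i)` compares the element at index `i` against the target.
/// Returns `Ok(index)` on a match, `Err(insertion_point)` otherwise.
fn binary_search_by_idx(
    len: usize,
    mut cmp_fn: impl FnMut(usize) -> Ordering,
) -> Result<usize, usize> {
    let mut left = 0;
    let mut right = len;
    while left < right {
        let mid = left + (right - left) / 2;
        let cmp = cmp_fn(mid);
        // if/else chain rather than `match`, following the libcore comment quoted above.
        if cmp == Less {
            left = mid + 1;
        } else if cmp == Greater {
            right = mid;
        } else {
            return Ok(mid);
        }
    }
    Err(left)
}
```

Taking an index-based closure instead of a slice lets the caller search data that is not stored as a plain array, which is presumably why the comparator receives `mid` here.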

sstable/src/sstable_index.rs

fulmicoton · 2023-11-24T14:40:06Z

sstable/src/sstable_index.rs

+    }
+
+    fn flush_block(&mut self) -> io::Result<()> {
+        let ref_block_addr = self.block_addrs[0].clone();


Why is it non-empty?

The caller makes sure of that, but I changed it so it's always safe to call.
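One way to make such a method safe on an empty buffer is to replace the indexing with `first()` and return early. The struct below is a hypothetical, heavily simplified stand-in (plain `u64` addresses, an in-memory `flushed` list), not the writer from the PR.

```rust
use std::io;

/// Hypothetical stand-in for the writer discussed above.
struct BlockWriterSketch {
    block_addrs: Vec<u64>,
    flushed: Vec<Vec<u64>>,
}

impl BlockWriterSketch {
    fn flush_block(&mut self) -> io::Result<()> {
        // `first()` returns None on an empty buffer, so there is no panicking `[0]`.
        let Some(&ref_block_addr) = self.block_addrs.first() else {
            return Ok(());
        };
        // Store addresses relative to the first one.
        // Assumption: addresses are non-decreasing, so the subtraction cannot underflow.
        let deltas: Vec<u64> = self
            .block_addrs
            .iter()
            .map(|addr| addr - ref_block_addr)
            .collect();
        self.flushed.push(deltas);
        self.block_addrs.clear();
        Ok(())
    }
}
```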

fulmicoton · 2023-11-24T14:41:04Z

sstable/src/sstable_index.rs

+                .map(|block| block.byte_range.start as u64)
+                .chain(std::iter::once(last_block_addr.byte_range.end as u64))
+                .enumerate()
+                .skip(1),


How do we know block addrs contains more than one element?

We don't need to: find_best_slope accepts an empty iterator.
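To illustrate why an empty iterator is harmless, here is a loose sketch of a slope finder that combines the points with the smallest and largest value/index ratio, using the rounding formula from the earlier hunk. `best_slope_sketch`, `ratio_less`, and `slope_for_blocks` are hypothetical names, and the real `find_best_slope` may differ in signature and behavior.

```rust
/// True if a.val / a.idx < b.val / b.idx, compared by cross-multiplication
/// in u128 to avoid floats and overflow. Points are (idx, val).
fn ratio_less(a: (u64, u64), b: (u64, u64)) -> bool {
    (a.1 as u128) * (b.0 as u128) < (b.1 as u128) * (a.0 as u128)
}

/// Hypothetical sketch, not the PR's `find_best_slope`.
/// Returns 0 when the iterator is empty.
fn best_slope_sketch(points: impl Iterator<Item = (usize, u64)>) -> u32 {
    let mut extremes: Option<((u64, u64), (u64, u64))> = None;
    for (idx, val) in points {
        let pt = (idx as u64, val);
        extremes = Some(match extremes {
            None => (pt, pt),
            Some((min_pt, max_pt)) => (
                if ratio_less(pt, min_pt) { pt } else { min_pt },
                if ratio_less(max_pt, pt) { pt } else { max_pt },
            ),
        });
    }
    let Some((min_pt, max_pt)) = extremes else {
        return 0; // empty iterator: nothing to approximate
    };
    let denominator = min_pt.0 + max_pt.0;
    if denominator == 0 {
        return 0; // only reachable if index 0 was not skipped by the caller
    }
    ((min_pt.1 + max_pt.1 + denominator / 2) / denominator) as u32
}

/// Mirrors the iterator chain in the hunk above: block starts, then the final end,
/// enumerated, with index 0 skipped.
fn slope_for_blocks(starts: &[u64], end: u64) -> u32 {
    best_slope_sketch(
        starts
            .iter()
            .copied()
            .chain(std::iter::once(end))
            .enumerate()
            .skip(1),
    )
}
```

In this sketch the `chain(once(..))` always yields at least one item, and `skip(1)` may leave nothing behind, which is exactly the case the finder has to tolerate.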

sstable/src/sstable_index.rs

sstable/README.md

* read path for new fst based index
* implement BlockAddrStoreWriter
* extract slop/derivation computation
* use better linear approximator and allow negative correction to approximator
* document format and reorder some fields
* optimize single block sstable size
* plug backward compat

trinity-1686a added 11 commits November 14, 2023 10:09

read path for new fst based index

97299b3

implement BlockAddrStoreWriter

fbecaba

new format with linear approx

50fe733

omit initial block range end save 8byte per 256 block

e6f1062

extract slop/derivation computation

b5c878a

use better linear approximator and allow negative correction to appro…

142cdeb

…ximator

document format and reorder some fields

bf07fed

optimize single block sstable size

2e4ff9e

plug backward compat

fd441da

remove debug prints

21181b4

handle errors and comment unwraps

b670a6e

trinity-1686a requested a review from fulmicoton November 22, 2023 18:08

trinity-1686a added 2 commits November 22, 2023 19:10

Merge branch 'main' into trinity--sstable-index-fst

5ddca48

fix sstable size in columnar tests

915d60b

trinity-1686a mentioned this pull request Nov 22, 2023

multilayer sstable #2246

Closed