[Question] Query panics on very large dataset #146
Comments
Is it possible to show the exact commands you used to generate the fst? |
I did not use the commandline tool to generate the fst. I used my code. |
After your question I tried using the CLI tool to generate the fst. After 12 hours it was still not finished, and the partial file fails with the same error message. If I use a smaller subset of the input, the queries do work. |
After running your steps (thank you for the excellent repro!) I was actually able to reproduce this problem precisely:
It looks like this is going to require some investigation to fix. I'm not sure when I'll have time for that.
By default, As far as partial files... A partial FST cannot be used. It needs to finish building first. That it fails with the same error is quite weird and perhaps a clue. |
Yeah, after a very brief look, it looks like something has resulted in the FST being corrupted somehow. And if shrinking the input makes the bug go away, it means getting to the bottom of this is going to require very carefully tracing each step through the FST and figuring out the first mis-step and then tying that back to something that went wrong during construction. (Unless FST traversal is the problem, but that seems less likely.) So... yeah this is going to be a bear to track down unfortunately. The next step here for me would be to write a program that reproduces the problem, but only uses the raw |
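Might you mean something like the below with that?
use std::fs::File;
use std::io::{stdout, Write};

use fst::Map;
use memmap2::Mmap; // assumption: `memmap::Mmap` if the older crate is in use

#[test]
fn walk_to_first_final_and_follow_one_key() -> Result<(), Box<dyn std::error::Error>> {
    let mmap = unsafe { Mmap::map(&File::open("pwned-passwords.fst")?)? };
    let map = Map::new(mmap)?;
    let mut out = stdout().lock();
    let fst = map.as_fst();
    // Start at the root and always follow the first transition.
    let mut node = fst.root();
    writeln!(out, "{:?}", &node)?;
    while !node.is_final() {
        let transition = node.transitions().next();
        writeln!(out, "{:?}", transition)?;
        if let Some(transition) = transition {
            node = fst.node(transition.addr);
        } else {
            // A non-final node without transitions would otherwise loop forever.
            break;
        }
    }
    Ok(())
}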
Looks like the very first node is already corrupted. Or is that the very last node written to disc? Or did you mean to rewrite the generation to use the |
I would say we have found one dataset, that reliably generates one of the cases mentioned in the comment Lines 383 to 387 in 3bb9796
To me it looks like there is a quite simple sanity check which could be used to change the panic with this message to a more helpful error message: the Also one correction: If I run the above test on the fst that was aborted during generation I get the same wording and backtrace of the error message but different values for Btw the command |
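A rough sketch of what such a sanity check could look like, assuming it boils down to comparing an address read from the file against the length of the data. The names `check_addr` and `AddrOutOfBounds` are made up for illustration and are not part of the fst crate; the numbers are the ones from the panic message in this issue.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; not part of the fst crate.
#[derive(Debug, PartialEq)]
struct AddrOutOfBounds {
    addr: u64,
    len: u64,
}

impl fmt::Display for AddrOutOfBounds {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "possibly corrupt or unfinished FST: address {} is outside the {} bytes of data",
            self.addr, self.len
        )
    }
}

impl std::error::Error for AddrOutOfBounds {}

/// Returns the address unchanged if it lies inside the data,
/// or an error instead of letting a later slice index panic.
fn check_addr(addr: u64, len: u64) -> Result<u64, AddrOutOfBounds> {
    if addr < len {
        Ok(addr)
    } else {
        Err(AddrOutOfBounds { addr, len })
    }
}

fn main() {
    // Values taken from the panic message reported in this issue.
    match check_addr(17_000_494_749_432_868_067, 32_379_027_443) {
        Ok(addr) => println!("address {addr} is in bounds"),
        Err(e) => println!("error: {e}"),
    }
}
```

Whether a check like this is cheap enough for the hot path of a query is a separate question; it might only make sense once, when the file is opened.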
I just realized I forgot to call |
Ahhhh. Yeah I used your program. I clearly did not scrutinize your code closely enough. I'm also attempting to generate the fst via It'd also be nice to make the failure modes better here. I know that reading an fst has some sanity checks for returning an error if it thinks the fst is corrupt, but it looks like this one slipped through. |
I guess the To be honest, I completely forgot about the |
I think you always get an invalid fst. Actually I might have generated the smaller one using your cli tool Sadly I had a similar problem a while ago, but from the API design perspective: My type needs to perform a final set of IO operations to produce a valid file. So my options were:
I opted for a custom Do you know of a way to make such an API harder to misuse? |
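For illustration, here is one possible shape for such an API, as a minimal sketch using only the standard library. `RecordWriter`, `FOOTER` and `is_complete` are hypothetical names, and this is neither the fst crate's design nor necessarily the approach described above. It combines a consuming `finish`, a `#[must_use]` hint, and a reader-side check for a trailer.

```rust
use std::io::{self, Write};

/// Hypothetical trailer that marks a completely written file.
const FOOTER: &[u8; 4] = b"DONE";

/// Hypothetical writer whose output is only valid after `finish`.
/// Note: `#[must_use]` only warns when a value is discarded right away;
/// it does not catch a writer that is bound to a variable and then dropped.
#[must_use = "call `finish` to write the footer, otherwise the output is invalid"]
struct RecordWriter<W: Write> {
    inner: W,
}

impl<W: Write> RecordWriter<W> {
    fn new(inner: W) -> Self {
        RecordWriter { inner }
    }

    fn write_record(&mut self, data: &[u8]) -> io::Result<()> {
        self.inner.write_all(data)
    }

    /// Consumes the writer, so nothing can be appended after the footer.
    fn finish(mut self) -> io::Result<W> {
        self.inner.write_all(FOOTER)?;
        self.inner.flush()?;
        Ok(self.inner)
    }
}

/// Reader-side check: data without the footer is treated as unfinished.
fn is_complete(bytes: &[u8]) -> bool {
    bytes.ends_with(FOOTER)
}

fn main() -> io::Result<()> {
    let mut w = RecordWriter::new(Vec::new());
    w.write_record(b"hello")?;
    let bytes = w.finish()?;
    println!("complete: {}", is_complete(&bytes));
    println!("truncated: {}", is_complete(&bytes[..bytes.len() - 1]));
    Ok(())
}
```

The reader-side check is arguably the more robust half: the compiler cannot force a call to `finish`, but the reader can at least refuse a file that was never finished instead of failing later in some unrelated place.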
After adding the call to The question why it does not achieve any significant compression does remain. |
I try to run a StartsWith query against a large generated FST and it panics with the following message:
thread 'main' panicked at 'index out of bounds: the len is 32379027443 but the index is 17000494749432868067', /home/jakob/.cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.4.7/src/raw/node.rs:302:17
The len matches the filesize of the fst.
The code for the query is:
So basically the sample on a different data set and not much more.
There were no errors reported by generating the fst. To generate the same fst one can follow the procedure below:
Oh btw. I have no clue about finite automata and FSTs, but found the blogpost fascinating, so I wanted to play with them a little.