Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Query panics on very large dataset #146

Closed
jakob-ledermann opened this issue Sep 10, 2022 · 12 comments
Closed

[Question] Query panics on very large dataset #146

jakob-ledermann opened this issue Sep 10, 2022 · 12 comments
Labels

Comments

@jakob-ledermann
Copy link

jakob-ledermann commented Sep 10, 2022

I try to run a StartsWith Query against a large generated FST and it panics with the following message thread 'main' panicked at 'index out of bounds: the len is 32379027443 but the index is 17000494749432868067', /home/jakob/.cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.4.7/src/raw/node.rs:302:17

The len matches the filesize of the fst.

The code for the query is:

use fst::{
    automaton::{AlwaysMatch, Str},
    Automaton, IntoStreamer, Map, Streamer,
};
use memmap::Mmap;
use std::fs::File;

pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mmap = unsafe { Mmap::map(&File::open("pwned-passwords.fst")?)? };
    let map = Map::new(mmap)?;

    for arg in std::env::args() {
        let query = Str::new(&arg).starts_with();
        let mut results = map.search(query).into_stream();
        while let Some((key, value)) = results.next() {
            println!(
                "{}: {:>20}",
                unsafe { std::str::from_utf8_unchecked(key) },
                value
            );
        }
    }

    Ok(())
}

So basically the sample on a different data set and not much more.

There were no errors reported by generating the fst. To generate the same fst one can follow the procedure below:

  1. download the v8 Version of Pwned-Passwords SHA1 (17GB compressed)
  2. Unpack this file and reorder it (unfortunately the correctly ordered file does not come with the values) (about 34GiB)
  3. Generate an fst from the reorderd file (the fst is 30.1 GiB, which feels a little high)

Oh btw. I have no clue about finite automata and FSTs, but found the blogpost fascinating, so I wanted to play with them a little.

@BurntSushi
Copy link
Owner

Is it possible to show the exact commands you used to generate the fst?

@jakob-ledermann
Copy link
Author

I did not use the commandline tool to generate the fst. I used my code.
The same repository contains also the full code for the query.

@jakob-ledermann
Copy link
Author

After your question I tried using the cli tool to generate the fst. After 12 hours it was still not finished, the partial file failes with the same error message. If I use a smaller subset of the input, the queries do work.

@BurntSushi
Copy link
Owner

After running your steps (thank you for the excellent repro!) I was actually able to reproduce this problem precisely:

$ ./target/release/query   
thread 'main' panicked at 'index out of bounds: the len is 32379027443 but the index is 17000494749432868067', /home/andrew/.cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.4.7/src/raw/node.rs:302:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

It looks like this is going to require some investigation to fix. I'm not sure when I'll have time for that.

After your question I tried using the cli tool to generate the fst. After 12 hours it was still not finished, the partial file failes with the same error message. If I use a smaller subset of the input, the queries do work.

By default, fst set will sort the input. It does it pretty naively. Since you already wrote a program to sort it (and it is much less naive than what fst set will do), it would be better to use that as input and pass the --sorted flag.

As far as partial files... A partial FST cannot be used. It needs to finish building first. That it fails with the same error is quite weird and perhaps a clue.

@BurntSushi BurntSushi added the bug label Sep 12, 2022
@BurntSushi
Copy link
Owner

BurntSushi commented Sep 12, 2022

Yeah, after a very brief look, it looks like something has resulted in the FST being corrupted somehow. And if shrinking the input makes the bug go away, it means getting to the bottom of this is going to require very carefully tracing each step through the FST and figuring out the first mis-step and then tying that back to something that went wrong during construction. (Unless FST traversal is the problem, but that seems less likely.)

So... yeah this is going to be a bear to track down unfortunately.

The next step here for me would be to write a program that reproduces the problem, but only uses the raw Node APIs. That way it will be easier to scrutinize where things go wrong.

@jakob-ledermann
Copy link
Author

The next step here for me would be to write a program that reproduces the problem, but only uses the raw Node APIs. That way it will be easier to scrutinize where things go wrong.

Might you mean something like the below with that?

#[test]
fn walk_to_first_final_and_follow_one_key() -> Result<(), Box<dyn std::error::Error>> {
    let mmap = unsafe { Mmap::map(&File::open("pwned-passwords.fst")?)? };
    let map = Map::new(mmap)?;

    let mut out = stdout().lock();
    let fst = map.as_fst();
    let mut node = fst.root();
    writeln!(out, "{:?}", &node)?;
    while !node.is_final() {
        let transition = node.transitions().take(1).next();
        writeln!(out, "{:?}", transition)?;
        if let Some(transition) = transition {
            node = fst.node(transition.addr)
        }
    }

    Ok(())
}
thread 'walk_to_first_final_and_follow_one_key' panicked at 'index out of bounds: the len is 32379027443 but the index is 17000494749432868067', C:\Users\Jakob\.cargo\registry\src\github.com-1ecc6299db9ec823\fst-0.4.7\src\raw\node.rs:302:17
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library\std\src\panicking.rs:584
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library\core\src\panicking.rs:142
   2: core::panicking::panic_bounds_check
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library\core\src\panicking.rs:84
   3: enum$<fst::raw::node::State>::new
             at C:\Users\Jakob\.cargo\registry\src\github.com-1ecc6299db9ec823\fst-0.4.7\src\raw\node.rs:302
   4: fst::raw::node::Node::new
             at C:\Users\Jakob\.cargo\registry\src\github.com-1ecc6299db9ec823\fst-0.4.7\src\raw\node.rs:60
   5: fst::raw::FstRef::node
             at C:\Users\Jakob\.cargo\registry\src\github.com-1ecc6299db9ec823\fst-0.4.7\src\raw\mod.rs:766
   6: fst::raw::FstRef::root
             at C:\Users\Jakob\.cargo\registry\src\github.com-1ecc6299db9ec823\fst-0.4.7\src\raw\mod.rs:761
   7: fst::raw::Fst<memmap::Mmap>::root<memmap::Mmap>
             at C:\Users\Jakob\.cargo\registry\src\github.com-1ecc6299db9ec823\fst-0.4.7\src\raw\mod.rs:612
   8: query::walk_to_first_final_and_follow_one_key
             at .\src\fst_query.rs:35
   9: query::iterate_all_keys::closure$0
             at .\src\fst_query.rs:29
  10: core::ops::function::FnOnce::call_once<query::iterate_all_keys::closure_env$0,tuple$<> >
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f\library\core\src\ops\function.rs:248
  11: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library\core\src\ops\function.rs:248
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test walk_to_first_final_and_follow_one_key ... FAILED

Looks like the very first node is already corrupted. Or is that the very last node written to disc?

Or did you mean to rewrite the generation to use the Node API directly and scrutinize that?

@jakob-ledermann
Copy link
Author

I would say we have found one dataset, that reliably generates one of the cases mentioned in the comment

fst/src/raw/mod.rs

Lines 383 to 387 in 3bb9796

// If this check passes, it is still possible that the FST is invalid
// but probably unlikely. If this check reports a false positive, then
// the program will probably panic. In the worst case, the FST will
// operate but be subtly wrong. (This would require the bytes to be in
// a format expected by an FST, which is incredibly unlikely.)

To me it looks like there is a quite simple sanity check which could be used to change the panic with this message to a more helpful error message: the len and root_addr are not within the bounds of data.

Also one correction: If I run the above test on the fst that was aborted during generation I get the same wording and backtrace of the error message but different values for len and index. I'm not sure why I missed that difference yesterday.

Btw the command fst verify detects both files (original where I did not see any errors during generation and aborted one) as faulty.

@jakob-ledermann
Copy link
Author

I just realized I forgot to call finish on the builder while generating.
I will close this issue, if the fst works after I regenerate it.
@BurntSushi: Could you elaborate how you tested to reproduce the problem? Did you use my code to generate the set or did you use the cli tool?

@BurntSushi
Copy link
Owner

Ahhhh. Yeah I used your program. I clearly did not scrutinize your code clearly enough.

I'm also attempting to generate the fst via fst set. So we should both be able to confirm that not calling finish() is indeed the problem. It really kind of has to be to be honest. If you don't call finish it seems like you'd almost be guaranteed to get a corrupt fst. It is interesting that you were able to generate fsts from smaller inputs without calling finish though.

It'd also be nice to make the failure modes better here. I know that reading an fst has some sanity checks for returning an error if it thinks the fst is corrupt, but it looks like this one slipped through.

@BurntSushi
Copy link
Owner

I guess the Fst::verify routine is really indeed the first thing to try. But it doesn't tell you much other than that the FST is indeed trivially corrupt.

To be honest, I completely forgot about the Fst::verify API. I should have suggested that right away. If that fails, then it has to imply something is deeply deeply wrong. That probably would have led me to scrutinizing your fst_generate.rs program more carefully.

@jakob-ledermann
Copy link
Author

If you don't call finish it seems like you'd almost be guaranteed to get a corrupt fst. It is interesting that you were able to generate fsts from smaller inputs without calling finish though.

I think you always get an invalid fst. Actually I might have generated the smaller one using your cli tool fst-bin after I patched it to configure the delimiter of csv::ReaderBuilder for maps. (My values are colon separated and not comma separated and I was not smart or patient enough to replace the separator in the sort step)

Sadly I had a simliar problem a while ago, but from the API design perspective:

My type needs to perform a final set of IO operations to produce a valid file. So my options were:

  1. the user has nothing to prevent him from forgetting to call finish (which is always a bug, so clearly should be a compiler error if possible)
  2. the user has no way to detect the IO Errors that might happen during the final operations

I opted for a custom Drop implementation and option 2 at the time, as forgetting to call the final method would always end in an invalid file.

Do you know of a way to make such an API harder to misuse?
Unfortunately #[must_use] currently does not seem to be usable for this (or at least I don't see how).

@jakob-ledermann
Copy link
Author

After adding the call to finish the generated fst can be queried.

The question why it does not achieve any significant compression does remain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants