memory blows up in LG determinization #357
Comments
Determinization of largish graphs will tend to require a lot of memory. How much did the server have?
256 GB memory server
Hm, OK, that's a lot. You might want to do the same thing with OpenFST, that should clarify things a bit.
... and you need to be careful about which way around it is... determinization is with respect to the primary labels (i.e. the ilabels). The disambig symbols need to be on "that side", or determinization would loop forever.
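(Not from the original thread: a minimal sketch, assuming an icefall-style setup where LG comes from compile_hlg.py and tokens.txt has one "symbol id" pair per line, of how one could check that the token-level disambiguation symbols really sit on the ilabel side of LG before calling k2.determinize. The function name and example path are made up for illustration.)

```python
import k2

def check_disambig_side(LG: k2.Fsa, tokens_txt: str) -> None:
    """Hypothetical helper: count arcs whose ilabel is a disambig symbol (#0, #1, ...)."""
    with open(tokens_txt) as f:
        sym2id = {line.split()[0]: int(line.split()[1]) for line in f if line.strip()}
    first_disambig = sym2id["#0"]  # all #N symbols are assumed to have ids >= this

    # Determinization is with respect to the ilabels, so the disambig symbols must
    # appear here; if they were only on the olabels, determinization (whether
    # k2.determinize or fstdeterminizestar) could loop forever / blow up in memory.
    n = int((LG.labels >= first_disambig).sum())
    print(f"{n} arcs carry a disambig symbol on the input side of LG")

# e.g. check_disambig_side(LG, "data/data_eval1/lang_bpe_500/tokens.txt")
```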
I can see the disambiguation symbols in tokens.txt and lexicon_disambig.txt:
grep "#1" data/data_eval1/lang_bpe_500/lexicon_disambig.txt | wc -l
The "#0" symbol is in the word symbol table, and in G.fst the "#0" is on the input side of G and "eps" on the output side, as far as I know.
The only modification to compile_hlg.py is that the G is called "bg" rather than "3" (it's a bigram).
I did it with Kaldi/OpenFST and mkgraph and everything is fast and doesn't take much memory (but I'm using chain left biphones, not BPE).
As far as I know, I always use the same chain in k2-icefall for the lang/graph build; it is usually very fast, and this is the first time LG determinization fails.
Hm, to help us debug this perhaps you could dump the graph just before determinization to OpenFST format, discard the olabels, and try to determinize with fstdeterminize?
To convert graphs in k2 to OpenFST format, you may find the following repo helpful. |
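(The repository link did not survive the copy. Purely as an illustration of the idea, and not that repo's code, here is a sketch of dumping a k2 Fsa in OpenFst text format so that it can be compiled with fstcompile and fed to fstdeterminize/fstdeterminizestar. It assumes LG.arcs.values() exposes [src, dst, label, ...] per arc and LG.scores holds the log-scores, with the OpenFst cost being the negated score; verify these details against your k2 version.)

```python
import k2

def fsa_to_openfst_text(fsa: k2.Fsa, out_path: str) -> None:
    """Sketch: write an acceptor (olabels discarded) in OpenFst/AT&T text format."""
    arcs = fsa.arcs.values()        # assumed layout: one row per arc, [src, dst, label, ...]
    scores = fsa.scores             # float log-scores; OpenFst cost = -score
    final_state = fsa.shape[0] - 1  # k2 convention: the last state is the final state
    with open(out_path, "w") as f:
        for i in range(arcs.shape[0]):   # slow pure-Python loop; fine for debugging
            src, dst, label = int(arcs[i, 0]), int(arcs[i, 1]), int(arcs[i, 2])
            if label == -1:              # k2 marks arcs entering the final state with -1
                label = 0                # map to epsilon; finality is declared below
            f.write(f"{src} {dst} {label} {-float(scores[i])}\n")
        f.write(f"{final_state} 0\n")    # final state, zero final cost

# Then, roughly:  fstcompile --acceptor LG.txt | fstdeterminizestar --use-log=true
```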
Thanks, I have dumped LG before determinization.
It has been running for about 10 hours now, though memory consumption is very low.
OK, so that suggests that it is not determinizable. One thing you could do is send fstdeterminizestar a signal SIGUSR1, e.g.
fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.acceptor.fst [ Stack-Trace: ]
What are 500, 8 and 7 in words.txt and phones.txt or bpe_pieces.txt or whatever they are?
... also please show any pronunciations that seem like they may be relevant.
500 is "#0" in tokens.txt it's actually a word that is also a BPE token, i.e. it's pronunciation is also "+BREATH+" it's an additional user-defined label in the bpe model (I have several of those, indeed); I use this same BPE model for a system with a reduced lexicon of 45k words in decoding and HLG compilation and WER are fine |
OK, so I'm assuming that unk and breath are simple as far as L.fst is concerned. There may be something weird going on in G.fst. I'm particularly concerned about what happens in the unigram state w.r.t. these symbols. I think what's happening is, first it's taking symbol
That is the first few lines of tokens.txt; as you can see, there are additional user-defined symbols (besides ).
This is the first few lines of words.txt.
Btw, in this other system tokens.txt is the same (the model used in training) and the words.txt is different; here I have no problem in HLG compilation (results are also good).
Can you please clarify what the 7th and 8th lines of tokens.txt are, and which of the systems is the one you have a problem with?
there's no "+BREATH+" in the language model usually these words-tokens/phones I modelled on kaldi by adding inter-words "silence arc" in make-lexicon or whatever I actually realized that 7, 8, 500 appears both as tokens and words in that debug trace, if I understand correctly btw, if +BREATH+ is not in G.fst, am I supposed to ever see it in LG?? |
Oh, sorry, I did not want to add confusion, but the tokens.txt is common to all systems (it's the words.txt and G that change).
Perhaps something went weird with a mismatch between words.txt and tokens.txt, and you had things mapped to unk when you converted the G.fst to integers, because some characters were OOV. Notice that 7 and 8 are both ilabels and olabels. That is hard to make sense of with the tokens.txt and words.txt that you have shown, unless things were mapped to OOV. You cannot map unknown tokens to OOV when creating G.fst.
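(A sketch of the kind of check being described here, not code from the thread: verify that every word in the ARPA LM's unigram section is present in words.txt, so that nothing is silently mapped to unk when G.fst is converted to integer labels. File names follow the ones used in this issue.)

```python
def check_arpa_words(arpa_path: str, words_txt: str) -> None:
    """Hypothetical validation: report LM words missing from words.txt."""
    with open(words_txt, encoding="utf-8") as f:
        known = {line.split()[0] for line in f if line.strip()}

    missing = set()
    in_unigrams = False
    with open(arpa_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            if in_unigrams:
                if line.startswith("\\"):        # next n-gram section begins
                    break
                fields = line.split()
                if len(fields) >= 2:
                    word = fields[1]             # ARPA unigram line: "logprob word [backoff]"
                    if word in ("<s>", "</s>"):  # sentence boundaries are handled separately
                        continue
                    if word not in known:
                        missing.add(word)
    if missing:
        print(f"{len(missing)} LM words not in words.txt, e.g. {sorted(missing)[:10]}")
```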
... oh, wait... I forgot, I think I asked you to create an acceptor by discarding olabels. But with fstdeterminizestar you can keep the olabels, and this gives better debug info.
fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.transducer.fst [ Stack-Trace: ]
I can definitely check if I have OOV tokens within the words in this words.txt; but I actually do not understand why the mapping to the OOV token should happen during G.fst creation; I guess it's supposed to happen during L creation.
Need to look at words with ids 1 and 299978, and what their pronunciations in L.fst are. These seem to both have the same token sequence.
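(Not from the thread: a quick way to do that lookup, assuming the file names used earlier in this issue; the word ids are mapped back to symbols via words.txt and the matching lexicon lines are printed so the two token sequences can be compared.)

```python
def show_pronunciations(word_ids, words_txt, lexicon_txt):
    """Hypothetical helper: print the lexicon entries of the given word ids."""
    id2word = {}
    with open(words_txt, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                id2word[int(parts[1])] = parts[0]

    targets = {id2word[i] for i in word_ids if i in id2word}
    with open(lexicon_txt, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] in targets:
                print(line.rstrip())

# e.g. show_pronunciations([1, 299978],
#                          "data/data_eval1/lang_bpe_500/words.txt",
#                          "data/data_eval1/lang_bpe_500/lexicon.txt")
```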
Oh, $ and € symbols
ok, I might have screwed up somewhere during those stages, I guess
We should consider creating some kind of validation setup that can detect this.
Yes, I will create one to check OOV tokens in the lexicon.txt
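(For reference, something along these lines would do it; this is only a sketch, not the actual check that was added: flag any pronunciation token in lexicon.txt that does not appear in tokens.txt.)

```python
def find_oov_tokens(lexicon_txt: str, tokens_txt: str) -> None:
    """Hypothetical validation: report lexicon entries whose tokens are missing from tokens.txt."""
    with open(tokens_txt, encoding="utf-8") as f:
        known = {line.split()[0] for line in f if line.strip()}

    bad = 0
    with open(lexicon_txt, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            parts = line.split()
            if len(parts) < 2:
                continue
            word, pron = parts[0], parts[1:]
            oov = [t for t in pron if t not in known]
            if oov:
                bad += 1
                print(f"line {lineno}: word '{word}' has OOV tokens {oov}")
    print(f"{bad} lexicon entries contain OOV tokens")
```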
Thanks, that was indeed the problem.
Hi
I've not been able to compile the HLG: memory blows up during LG determinization; I had to stop it manually after a while (almost 2 hours) to avoid consuming the whole server memory.
Here is the logging:
2022-05-09 16:54:26,004 INFO [compile_hlg.py:73] Building ctc_topo. max_token_id: 499
2022-05-09 16:54:26,082 INFO [compile_hlg.py:82] Loading G.bg.fst.txt
2022-05-09 16:54:32,011 INFO [compile_hlg.py:93] Intersecting L and G
2022-05-09 16:54:35,137 INFO [compile_hlg.py:95] LG shape: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:97] Connecting LG
2022-05-09 16:54:35,137 INFO [compile_hlg.py:99] LG shape after k2.connect: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:101] <class 'torch.Tensor'>
2022-05-09 16:54:35,137 INFO [compile_hlg.py:102] Determinizing LG
the ARPA size is just 67M, but the lexicon contains about 300k words (BPE has 500 tokens)
this has been so far the biggest lexicon I have used to build a graph in k2-icefall
in other runs, I used much bigger language models but smaller lexicons
are there requirements for graph construction?
thanks in advance