memory blows up in LG determinization #357
Comments
Determinization of largish graphs will tend to require a lot of memory. How much did the server have?
256 GB memory server
Hm, OK, that's a lot. You might want to do the same thing with OpenFST, that should clarify things a bit.
... and you need to be careful about which way around it is... determinization is with respect to the primary labels (i.e. the ilabels). The disambig symbols need to be on "that side", or determinization would loop forever.
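(Not from the original thread: a minimal sketch, assuming an icefall-style setup where LG comes from compile_hlg.py and tokens.txt has one "symbol id" pair per line, of how one could check that the token-level disambiguation symbols really sit on the ilabel side of LG before calling k2.determinize. The function name and example path are made up for illustration.)

```python
import k2

def check_disambig_side(LG: k2.Fsa, tokens_txt: str) -> None:
    """Hypothetical helper: count arcs whose ilabel is a disambig symbol (#0, #1, ...)."""
    with open(tokens_txt) as f:
        sym2id = {line.split()[0]: int(line.split()[1]) for line in f if line.strip()}
    first_disambig = sym2id["#0"]  # all #N symbols are assumed to have ids >= this

    # Determinization is with respect to the ilabels, so the disambig symbols must
    # appear here; if they were only on the olabels, determinization (whether
    # k2.determinize or fstdeterminizestar) could loop forever / blow up in memory.
    n = int((LG.labels >= first_disambig).sum())
    print(f"{n} arcs carry a disambig symbol on the input side of LG")

# e.g. check_disambig_side(LG, "data/data_eval1/lang_bpe_500/tokens.txt")
```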
I can see the disambiguation symbols in tokens.txt and lexicon_disambig.txt:
grep "#1" data/data_eval1/lang_bpe_500/lexicon_disambig.txt | wc -l
The "#0" symbol is in the word symbol table, and in G.fst the "#0" is on the input side of G and "eps" on the output side, as far as I know.
The only modification to compile_hlg.py is that the G is called "bg" rather than "3" (it's a bigram).
I did it with Kaldi/OpenFST and mkgraph and everything is fast and doesn't take much memory (but I'm using chain left biphones, not BPE).
As far as I know, I always use the same chain in k2-icefall for the lang/graph build; it is usually very fast, and this is the first time LG determinization fails.
Hm, to help us debug this perhaps you could dump the graph just before determinization to OpenFST format, discard the olabels, and try to determinize with fstdeterminize?
To convert graphs in k2 to OpenFST format, you may find the following repo helpful. |
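(The repository link did not survive the copy. Purely as an illustration of the idea, and not that repo's code, here is a sketch of dumping a k2 Fsa in OpenFst text format so that it can be compiled with fstcompile and fed to fstdeterminize/fstdeterminizestar. It assumes LG.arcs.values() exposes [src, dst, label, ...] per arc and LG.scores holds the log-scores, with the OpenFst cost being the negated score; verify these details against your k2 version.)

```python
import k2

def fsa_to_openfst_text(fsa: k2.Fsa, out_path: str) -> None:
    """Sketch: write an acceptor (olabels discarded) in OpenFst/AT&T text format."""
    arcs = fsa.arcs.values()        # assumed layout: one row per arc, [src, dst, label, ...]
    scores = fsa.scores             # float log-scores; OpenFst cost = -score
    final_state = fsa.shape[0] - 1  # k2 convention: the last state is the final state
    with open(out_path, "w") as f:
        for i in range(arcs.shape[0]):   # slow pure-Python loop; fine for debugging
            src, dst, label = int(arcs[i, 0]), int(arcs[i, 1]), int(arcs[i, 2])
            if label == -1:              # k2 marks arcs entering the final state with -1
                label = 0                # map to epsilon; finality is declared below
            f.write(f"{src} {dst} {label} {-float(scores[i])}\n")
        f.write(f"{final_state} 0\n")    # final state, zero final cost

# Then, roughly:  fstcompile --acceptor LG.txt | fstdeterminizestar --use-log=true
```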
Thanks, I have dumped LG before determinization.
It has been running for about 10 hours now, though memory consumption is very low.
OK, so that suggests that it is not determinizable. One thing you could do is send fstdeterminizestar a signal SIGUSR1, e.g.
fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.acceptor.fst [ Stack-Trace: ]
What are 500, 8 and 7 in words.txt and phones.txt or bpe_pieces.txt or whatever they are?
... also please show any pronunciations that seem like they may be relevant.
500 is "#0" in tokens.txt it's actually a word that is also a BPE token, i.e. it's pronunciation is also "+BREATH+" it's an additional user-defined label in the bpe model (I have several of those, indeed); I use this same BPE model for a system with a reduced lexicon of 45k words in decoding and HLG compilation and WER are fine |
OK, so I'm assuming that unk and breath are simple as far as L.fst is concerned. There may be something weird going on in G.fst. I'm particularly concerned about what happens in the unigram state w.r.t. these symbols. I think what's happening is, first it's taking symbol
That is the first few lines of tokens.txt; as you can see, there are additional user-defined symbols (besides ).
This is the first few lines of words.txt.
Btw, in this other system tokens.txt is the same (the model used in training) and the words.txt is different; here I have no problem in HLG compilation (results are also good).
Can you please clarify what the 7th and 8th lines of tokens.txt are, and which of the systems is the one you have a problem with?
there's no "+BREATH+" in the language model usually these words-tokens/phones I modelled on kaldi by adding inter-words "silence arc" in make-lexicon or whatever I actually realized that 7, 8, 500 appears both as tokens and words in that debug trace, if I understand correctly btw, if +BREATH+ is not in G.fst, am I supposed to ever see it in LG?? |
Oh, sorry, I did not want to add confusion, but the tokens.txt is common to all systems (it's the words.txt and G that change).
Perhaps something went weird with a mismatch between words.txt and tokens.txt, and you had things mapped to unk when you converted the G.fst to integers, because some characters were OOV. Notice that 7 and 8 are both ilabels and olabels. That is hard to make sense of with the tokens.txt and words.txt that you have shown, unless things were mapped to OOV. You cannot map unknown tokens to OOV when creating G.fst.
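(A sketch of the kind of check being described here, not code from the thread: verify that every word in the ARPA LM's unigram section is present in words.txt, so that nothing is silently mapped to unk when G.fst is converted to integer labels. File names follow the ones used in this issue.)

```python
def check_arpa_words(arpa_path: str, words_txt: str) -> None:
    """Hypothetical validation: report LM words missing from words.txt."""
    with open(words_txt, encoding="utf-8") as f:
        known = {line.split()[0] for line in f if line.strip()}

    missing = set()
    in_unigrams = False
    with open(arpa_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            if in_unigrams:
                if line.startswith("\\"):        # next n-gram section begins
                    break
                fields = line.split()
                if len(fields) >= 2:
                    word = fields[1]             # ARPA unigram line: "logprob word [backoff]"
                    if word in ("<s>", "</s>"):  # sentence boundaries are handled separately
                        continue
                    if word not in known:
                        missing.add(word)
    if missing:
        print(f"{len(missing)} LM words not in words.txt, e.g. {sorted(missing)[:10]}")
```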
... oh, wait... I forgot, I think I asked you to create an acceptor by discarding olabels. But with fstdeterminizestar you can keep the olabels, and this gives better debug info.
fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.transducer.fst [ Stack-Trace: ]
I can definitely check if I have OOV tokens within the words in this words.txt; but I actually do not understand why the mapping to the OOV token should happen during G.fst creation; I guess it's supposed to happen during L creation.
Need to look at words with ids 1 and 299978, and what their pronunciations in L.fst are. These seem to both have the same token sequence.
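(Not from the thread: a quick way to do that lookup, assuming the file names used earlier in this issue; the word ids are mapped back to symbols via words.txt and the matching lexicon lines are printed so the two token sequences can be compared.)

```python
def show_pronunciations(word_ids, words_txt, lexicon_txt):
    """Hypothetical helper: print the lexicon entries of the given word ids."""
    id2word = {}
    with open(words_txt, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                id2word[int(parts[1])] = parts[0]

    targets = {id2word[i] for i in word_ids if i in id2word}
    with open(lexicon_txt, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] in targets:
                print(line.rstrip())

# e.g. show_pronunciations([1, 299978],
#                          "data/data_eval1/lang_bpe_500/words.txt",
#                          "data/data_eval1/lang_bpe_500/lexicon.txt")
```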
Oh, $ and € symbols
ok, I might have screwed up somewhere during those stages, I guess
We should consider creating some kind of validation setup that can detect this.
Yes, I will create one to check OOV tokens in the lexicon.txt
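(For reference, something along these lines would do it; this is only a sketch, not the actual check that was added: flag any pronunciation token in lexicon.txt that does not appear in tokens.txt.)

```python
def find_oov_tokens(lexicon_txt: str, tokens_txt: str) -> None:
    """Hypothetical validation: report lexicon entries whose tokens are missing from tokens.txt."""
    with open(tokens_txt, encoding="utf-8") as f:
        known = {line.split()[0] for line in f if line.strip()}

    bad = 0
    with open(lexicon_txt, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            parts = line.split()
            if len(parts) < 2:
                continue
            word, pron = parts[0], parts[1:]
            oov = [t for t in pron if t not in known]
            if oov:
                bad += 1
                print(f"line {lineno}: word '{word}' has OOV tokens {oov}")
    print(f"{bad} lexicon entries contain OOV tokens")
```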
Thanks, that was indeed the problem.
Hi
I've not been able to compile the HLG: memory blows up during LG determinization; I had to stop it manually after a while (almost 2 hours) to avoid consuming the whole server memory.
Here is the logging:
2022-05-09 16:54:26,004 INFO [compile_hlg.py:73] Building ctc_topo. max_token_id: 499
2022-05-09 16:54:26,082 INFO [compile_hlg.py:82] Loading G.bg.fst.txt
2022-05-09 16:54:32,011 INFO [compile_hlg.py:93] Intersecting L and G
2022-05-09 16:54:35,137 INFO [compile_hlg.py:95] LG shape: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:97] Connecting LG
2022-05-09 16:54:35,137 INFO [compile_hlg.py:99] LG shape after k2.connect: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:101] <class 'torch.Tensor'>
2022-05-09 16:54:35,137 INFO [compile_hlg.py:102] Determinizing LG
the ARPA size is just 67M, but the lexicon contains about 300k words (BPE has 500 tokens)
this has been so far the biggest lexicon I have used to build a graph in k2-icefall
in other runs, I used much bigger language models but smaller lexicons
are there requirements for graph construction?
thanks in advance