memory blows up in LG determinization #357

Open
armusc opened this issue May 10, 2022 · 28 comments
@armusc
Contributor

armusc commented May 10, 2022

Hi

I've not been able to compile the HLG: memory blows up during LG determinization. I had to stop it manually after a while (almost 2 hours) to avoid consuming the whole server memory.
Here is the logging:
2022-05-09 16:54:26,004 INFO [compile_hlg.py:73] Building ctc_topo. max_token_id: 499
2022-05-09 16:54:26,082 INFO [compile_hlg.py:82] Loading G.bg.fst.txt
2022-05-09 16:54:32,011 INFO [compile_hlg.py:93] Intersecting L and G
2022-05-09 16:54:35,137 INFO [compile_hlg.py:95] LG shape: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:97] Connecting LG
2022-05-09 16:54:35,137 INFO [compile_hlg.py:99] LG shape after k2.connect: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:101] <class 'torch.Tensor'>
2022-05-09 16:54:35,137 INFO [compile_hlg.py:102] Determinizing LG

the ARPA size is just 67 MB, but the lexicon contains about 300k words (the BPE model has 500 tokens)

this is so far the biggest lexicon I have used to build a graph in k2/icefall
in other runs, I used much bigger language models but smaller lexicons
are there memory requirements for graph construction?

thanks in advance

@danpovey
Collaborator

Determinization of largish graphs will tend to require a lot of memory. How much did the server have?

@armusc
Contributor Author

armusc commented May 10, 2022

256 GB memory server
size of L_disambig.pt => 36 MB ~300K words
size of G_3_gram.pt => 59 MB

@danpovey
Collaborator

Hm, OK, that's a lot. You might want to try the same thing with OpenFST; that should clarify things a bit.
Please show the exact script. If you remove the disambig symbols too soon, the determinization would never complete.
You need to have those '#0' disambig symbols in G, plus lexical disambig symbols in L_disambig.

@danpovey
Collaborator

... and you need to be careful about which way around it is... determinization is with respect to the primary labels (i.e. the ilabels). The disambig symbols need to be on "that side", or determinization would loop forever.
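To make the prefix/duplicate-pronunciation issue concrete: below is a minimal, simplified sketch (not icefall's actual code; `prepare_lang.py` in icefall does the real work, and the function name here is hypothetical) of how lexical disambig symbols #1, #2, ... get appended to pronunciations that are duplicated or that are a prefix of another pronunciation, so that L becomes determinizable:

```python
from collections import defaultdict

def add_disambig_symbols(lexicon):
    """Append #1, #2, ... to pronunciations that are duplicated or are a
    proper prefix of another pronunciation.  `lexicon` is a list of
    (word, tokens) pairs, where tokens is a tuple of token strings.
    Returns (new_lexicon, max_disambig_id_used)."""
    # Count how many times each pronunciation occurs.
    count = defaultdict(int)
    for _, tokens in lexicon:
        count[tokens] += 1
    # Collect every proper prefix of every pronunciation.
    prefixes = set()
    for _, tokens in lexicon:
        for i in range(1, len(tokens)):
            prefixes.add(tokens[:i])
    last_used = defaultdict(int)  # pronunciation -> last disambig index used
    result, max_disambig = [], 0
    for word, tokens in lexicon:
        if count[tokens] > 1 or tokens in prefixes:
            last_used[tokens] += 1
            n = last_used[tokens]
            max_disambig = max(max_disambig, n)
            result.append((word, tokens + (f"#{n}",)))
        else:
            result.append((word, tokens))
    return result, max_disambig
```

Without the appended #N symbols, two words with the same token sequence give the determinizer two output strings for one input string, and it can never finish.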

@armusc
Contributor Author

armusc commented May 10, 2022

I can see the disambiguation symbols in tokens.txt and lexicon_disambig.txt
tail -2 data/data_eval1/lang_bpe_500/tokens.txt
#0 500
#1 501

grep "#1" data/data_eval1/lang_bpe_500/lexicon_disambig.txt | wc -l
67347
L_disambig is generated by lexicon_to_fst_no_sil and saved afterwards

the "#0" symbol in is the word symbol table and in the G.fst
grep "#0" data/data_eval1/lang_bpe_500/words.txt
#0 299979
grep -w 299979 data/data_eval1/lm/G.bg.fst.txt | wc -l
299974

the "#0" is on the input side of G and "eps" on the output side
grep -w 299979 data/data_eval1/lm/G.bg.fst.txt | head -2
742 0 299979 0 3.2241
1 0 299979 0 0.0241321

as far as I know the only modification to compile_hlg.py is that the G is called "bg" rather than "3" (it's a bigram)
I can see that Linv.pt is only used to recover token and word symbol table

I did it with Kaldi/OpenFST and mkgraph, and everything is fast and doesn't take much memory (but I'm using chain left biphones, not BPE)

as far as I know, I always use the same pipeline in k2/icefall for lang/graph building; it's usually very fast, and this is the first time LG determinization fails

@danpovey
Collaborator

Hm, to help us debug this perhaps you could dump the graph just before determinization to OpenFST format, discard the olabels, and try to determinize with fstdeterminize?

@csukuangfj
Collaborator

To convert graphs in k2 to OpenFST format, you may find the following repo helpful.
https://github.com/csukuangfj/kaldifst/blob/master/kaldifst/python/kaldifst/utils/k2_converter.py
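For intuition, the conversion essentially boils down to printing arcs in OpenFST's text (AT&T) format with the score sign flipped, since k2 uses log-probability scores while OpenFST weights are costs. A hedged sketch, assuming a flat `(src, dst, ilabel, olabel, score)` arc list rather than k2's real API (see the linked repo for the actual converter):

```python
def k2_arcs_to_openfst_text(arcs, final_state):
    """Render k2-style arcs as OpenFST text format.

    `arcs` is a list of (src, dst, ilabel, olabel, score) tuples -- a
    hypothetical flat representation, not k2's real API.  k2 stores
    log-probability scores, while OpenFST weights are costs, so the
    score is negated."""
    lines = [f"{s} {d} {i} {o} {-w:.6g}" for s, d, i, o, w in arcs]
    lines.append(str(final_state))  # final state, implicit weight 0
    return "\n".join(lines) + "\n"
```

The resulting text can then be compiled with `fstcompile` and fed to `fstdeterminizestar`.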

@armusc
Contributor Author

armusc commented May 12, 2022

Thanks

I have dumped LG before determinization:

logging.info("Connecting LG")
LG = k2.connect(LG)
logging.info(f"LG shape after k2.connect: {LG.shape}")

#MODIF
torch.save(LG.as_dict(), f"{lang_dir}/LG_before_determinize.pt")
#END MODIF

1. I used kaldifst and k2_converter to convert this FST into a StdVectorFst as an acceptor: _k2_acceptor_to_openfst(fsa)

2. I then used fstdeterminizestar as is done in mkgraph:
   fstdeterminizestar --use-log=true lang_bpe_500/LG_before_determinize.acceptor.fst

It has been running for about 10 hours now, though memory consumption is very low

@danpovey
Collaborator

OK, so that suggests that it is not determinizable. One thing you could do is send fstdeterminizestar a SIGUSR1 signal, e.g.
kill -SIGUSR1 <pid>
The program prints out some debug info if you do that, so we can find out why it's not determinizable.

@armusc
Contributor Author

armusc commented May 12, 2022

fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.acceptor.fst
WARNING (fstdeterminizestar[5.5.1005-c8674]:Debug():fstext/determinize-star-inl.h:1074) Debug function called (probably SIGUSR1 caught)
ERROR (fstdeterminizestar[5.5.1005-c8674]:Debug():fstext/determinize-star-inl.h:1129) Traceback follows in format ilabel (olabel olabel) ilabel (olabel) ... : 500 ( 500 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) ...
[the pair "8 ( 8 ) 7 ( 7 )" repeats for the remainder of the traceback]

[ Stack-Trace: ]
/opt/shared/kaldi/bin/../lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x999) [0x7f6ea08239c9]
fstdeterminizestar() [0x424870]
fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Debug()+0x4d5) [0x42fed5]
fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Determinize(bool*)+0x51e) [0x43cc1e]
fstdeterminizestar(bool fst::DeterminizeStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >(fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >&, fst::MutableFst<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >::Arc>, float, bool, int, bool)+0x400) [0x43d090]
fstdeterminizestar(fst::DeterminizeStarInLog(fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl > > > >, float, bool, int)+0x107) [0x43d2b7]
fstdeterminizestar(main+0xab0) [0x4241f0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f6e99b1309b]
fstdeterminizestar() [0x424752]

@danpovey
Collaborator

What are 500, 8 and 7 in words.txt and phones.txt or bpe_pieces.txt or whatever they are?

@danpovey
Collaborator

.. also please show any pronunciations that seem like they may be relevant.
It's odd that the same ilabels and olabels show up (8 and 7).

@armusc
Contributor Author

armusc commented May 12, 2022

500 is "#0" in tokens.txt
it's a bpe with vocab size 500, so it's always in that position for every system that uses a bpe with that vocab size
7 is "<unk"> in tokens.txt
8 is "+BREATH+" in words.txt

it's actually a word that is also a BPE token, i.e. its pronunciation is also "+BREATH+"; it's an additional user-defined label in the BPE model (I have several of those, indeed). I use this same BPE model for a system with a reduced lexicon of 45k words in decoding and HLG compilation, and the WERs are fine.

@danpovey
Collaborator

OK, so I'm assuming that unk and breath are simple as far as L.fst is concerned. There may be something weird going on in G.fst. I'm particularly concerned about what happens in the unigram state w.r.t. these symbols. I think what's happening is, first it's taking symbol #0, meaning it's backing off from the BOS history state, and from then it's taking unk and then breath. Please figure out, in G.fst, what sequences of states there are that only involve these symbols. E.g. you can compose G.fst with an FST that accepts 500, then (7 8)*, and we can see what states remain.
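A minimal sketch of building such a filter, assuming we just emit it in OpenFST text format (the helper name is made up); the result can be compiled with `fstcompile --acceptor` and composed with G via `fstcompose`:

```python
def filter_acceptor_text(first_sym, loop_syms):
    """Emit OpenFST text for an acceptor matching `first_sym (loop_syms)*`,
    e.g. 500 then (7 8)* as suggested above.  State 0 reads the first
    symbol; states 1..n cycle through the loop symbols, and state 1 is
    final so that zero or more full repetitions are accepted."""
    lines = [f"0 1 {first_sym}"]
    n = len(loop_syms)
    for i, sym in enumerate(loop_syms):
        # Arc from state 1+i to the next loop state (wrapping back to 1).
        lines.append(f"{1 + i} {1 + (i + 1) % n} {sym}")
    lines.append("1")  # final state
    return "\n".join(lines) + "\n"
```

Composing G with this acceptor keeps only the paths of G that consist of #0 followed by repetitions of the suspicious symbol pair, which makes the offending states easy to inspect.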

@armusc
Contributor Author

armusc commented May 12, 2022

these are the first few lines of tokens.txt
<blk> 0
<sos/eos> 1
!SIL 2
+CONV+ 3
+BREATH+ 4
+NOISE+ 5
+FW+ 6
<unk> 7

as you can see, there are additional user-defined symbols (besides <unk>)

this is the first few lines of words.txt
<eps> 0
$ 1
% 2
&Co 3
&P 4
&newlin 5
&oelig 6
's 7
+BREATH+ 8
+CONV+ 9
+FW+ 10
+NOISE+ 11
wc -l data/data_eval1/lang_bpe_500/words.txt
299982 data/data_eval1/lang_bpe_500/words.txt

btw, in this other system, tokens.txt is the same (the model used in training) and words.txt is
<eps> 0
%POUR-CENT 1
&ET-COMMERCIAL 2
+BREATH+ 3
+CONV+ 4
+FW+ 5
+NOISE+ 6
-adjoint 7
-ce 8
-ci 9
wc -l data/data_eval2/lang_bpe_500/words.txt
45748 data/data_eval2/lang_bpe_500/words.txt

here I have no problem in HLG compilation (results are also good)

@danpovey
Collaborator

Can you please clarify what the 7th and 8th lines of tokens.txt are, and which of the systems is the one you have a problem with?

@armusc
Contributor Author

armusc commented May 12, 2022

OK, so I'm assuming that unk and breath are simple as far as L.fst is concerned. There may be something weird going on in G.fst. I'm particularly concerned about what happens in the unigram state w.r.t. these symbols. I think what's happening is, first it's taking symbol #0, meaning it's backing off from the BOS history state, and from then it's taking unk and then breath. Please figure out, in G.fst, what sequences of states there are that only involve these symbols. E.g. you can compose G.fst with an FST that accepts 500, then (7 8)*, and we can see what states remain.

there's no "+BREATH+" in the language model
so there's no "8" in the G.fst

usually I modelled these word-tokens/phones in Kaldi by adding inter-word "silence arcs" in make-lexicon or similar
I used them to model short acoustic events with no significant linguistic meaning that shouldn't be modelled by a word LM (in this case, they won't be output in the final transcription, but that's no big deal)

I actually realized that 7, 8, 500 appear both as tokens and words in that debug trace, if I understand correctly:
500 is "45t" in words.txt and "#0" in tokens.txt
7 is "'s" in words.txt and "<unk>" in tokens.txt
8 is "▁" in tokens.txt and "+BREATH+" in words.txt

btw, if +BREATH+ is not in G.fst, am I supposed to ever see it in LG??

@armusc
Contributor Author

armusc commented May 12, 2022

Can you please clarify what the 7th and 8th lines of tokens.txt are, and which of the systems is the one you have a problem with?

oh, sorry, I did not want to add confusion
head data/data_eval1/lang_bpe_500/tokens.txt
<blk> 0
<sos/eos> 1
!SIL 2
+CONV+ 3
+BREATH+ 4
+NOISE+ 5
+FW+ 6
<unk> 7
▁ 8
' 9

but the tokens.txt is common to all systems (it's the words.txt and G that change)
the system where HLG compilation fails is indicated as eval1, it's the one with 299982 words

@danpovey
Collaborator

danpovey commented May 12, 2022

Perhaps something went weird with a mismatch between words.txt and tokens.txt, and you had things mapped to unk when you converted the G.fst to integers, because some characters were OOV. Notice that 7 and 8 are both ilabels and olabels. That is hard to make sense of with the tokens.txt and words.txt that you have shown, unless things were mapped to OOV. You cannot map unknown tokens to OOV when creating G.fst.
So you have
's 7
+BREATH+ 8
and:
<unk> 7
▁ 8
which doesn't make much sense to me.

@danpovey
Collaborator

... oh, wait.. I forgot, I think I asked you to create an acceptor by discarding olabels. But with fstdeterminizestar you can keep the olabels, and this gives better debug info.

@armusc
Contributor Author

armusc commented May 12, 2022

fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.transducer.fst
ERROR (fstdeterminizestar[5.5.1005-c8674]:AddOneElement():fstext/determinize-star-inl.h:791) FST was not functional -> not determinizable.
First string: 1
Second string: 299978

[ Stack-Trace: ]
/opt/shared/kaldi/bin/../lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x999) [0x7f69cd7399c9]
fstdeterminizestar() [0x424870]
fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::EpsilonClosure::AddOneElement(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element const&, fst::LogWeightTpl const&)+0x2ec) [0x434dcc]
fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::EpsilonClosure::GetEpsilonClosure(std::vector<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element, std::allocator<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element> > const&, std::vector<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element, std::allocator<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element> >)+0x4f3) [0x43bab3]
fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Determinize(bool*)+0x14e) [0x43c84e]
fstdeterminizestar(bool fst::DeterminizeStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >(fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >&, fst::MutableFst<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >::Arc>, float, bool, int, bool)+0x400) [0x43d090]
fstdeterminizestar(fst::DeterminizeStarInLog(fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl > > > >, float, bool, int)+0x107) [0x43d2b7]
fstdeterminizestar(main+0xab0) [0x4241f0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f69c6a2909b]
fstdeterminizestar() [0x424752]

@armusc
Contributor Author

armusc commented May 12, 2022

Perhaps something went weird with a mismatch between words.txt and tokens.txt, and you had things mapped to unk when you converted the G.fst to integers, because some characters were OOV. Notice that 7 and 8 are both ilabels and olabels. That is hard to make sense of with the tokens.txt and words.txt that you have shown, unless things were mapped to OOV. You cannot map unknown tokens to OOV when creating G.fst. So you have "'s 7", "+BREATH+ 8" and "<unk> 7", "▁ 8", which doesn't make much sense to me.

I can definitely check whether I have OOV tokens within the words in this words.txt; but I actually do not understand why the OOV-token mapping should happen during G.fst creation; I guess it's supposed to happen during L creation
I'll check if I see some connection by looking at possible OOV tokens

@danpovey
Collaborator

Need to look at words with ids 1 and 299978, and what their pronunciations in L.fst are. These seem to both have the same token sequence.
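One way to hunt for such pairs offline is to group lexicon entries by token sequence and report groups with more than one word. Homophones are harmless when disambig symbols are present, but identical sequences caused by OOV tokens collapsing to <unk> are exactly the failure mode here. A hypothetical sketch:

```python
from collections import defaultdict

def words_sharing_pronunciations(lexicon):
    """Group words by their token sequence, ignoring disambig symbols
    (#1, #2, ...), and return the groups containing more than one word.
    `lexicon` is an iterable of (word, token_list) pairs.  Words whose
    pronunciations collapse to the same sequence in LG make the
    transducer non-functional, as fstdeterminizestar reported."""
    by_pron = defaultdict(list)
    for word, tokens in lexicon:
        pron = tuple(t for t in tokens if not t.startswith("#"))
        by_pron[pron].append(word)
    return {p: ws for p, ws in by_pron.items() if len(ws) > 1}
```

Running this over lexicon_disambig.txt would immediately surface the two word ids the error message names.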

@armusc
Contributor Author

armusc commented May 12, 2022

Need to look at words with ids 1 and 299978, and what their pronunciations in L.fst are. These seem to both have the same token sequence.

Oh, $ and € symbols
they are tokenized as
$ ▁ $
€ ▁ €
in lexicon_disambig.txt
but those symbols do not exist in tokens.txt
I'm going to look at the lexicon creation; I'm pretty sure there is a mapping to unk somewhere. It's not the first time I have had OOV tokens.

@armusc
Contributor Author

armusc commented May 12, 2022

ok, I might have screwed up somewhere during those stages, I guess
I'll let you know if everything works then

@danpovey
Collaborator

We should consider creating some kind of validation setup that can detect this.

@csukuangfj
Collaborator

We should consider creating some kind of validation setup that can detect this.

Yes, I will create one to check for OOV tokens in lexicon.txt
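A sketch of what such a validation pass might look like (a hypothetical helper, not the eventual icefall check): flag every lexicon entry that uses a token missing from tokens.txt, since those entries get silently collapsed to <unk> downstream:

```python
def find_oov_tokens(lexicon_lines, token_symbols):
    """Report lexicon entries using tokens absent from tokens.txt.

    `lexicon_lines` are lines of "word token token ..."; `token_symbols`
    is the set of valid token strings.  Returns a list of
    (word, [offending tokens]) pairs."""
    known = set(token_symbols)
    bad = []
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, tokens = fields[0], fields[1:]
        oov = [t for t in tokens if t not in known]
        if oov:
            bad.append((word, oov))
    return bad
```

In this issue, such a check would have flagged the $ and € entries before HLG compilation was ever attempted.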

@armusc
Contributor Author

armusc commented May 12, 2022

Thanks

that was indeed the problem
