How to figure out which state represents which history in FST text file? #31

Open
Banaf89 opened this issue Apr 24, 2023 · 1 comment

Banaf89 commented Apr 24, 2023

I am using the kaldilm library to convert ARPA files to FST text format in order to build an n-gram language model with FSTs. In the ARPA file, you can see each n-gram in full along with its probabilities. In the FST text format, however, you only see numbers. It is fairly easy to work out which number represents which n-gram when the data is limited, but as the amount of data grows, it becomes harder to tell which state represents which history (of previous words).

One possible solution would be to perform DFS on the graph and label each state according to the previous states and the arcs between them, but it would take a long time when the data is large (and hence, the model is large as well). It would be much easier if I had a dictionary that shows what each state number in the final FST text file represents.
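The traversal idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not kaldilm's own method: it uses a breadth-first pass rather than DFS, and it assumes the FST was written with symbols in the usual OpenFst transducer text layout (`src dst ilabel olabel [weight]` for arcs, `state [weight]` for final states), that backoff arcs carry `<eps>` or `#0` as their input label, that the start state stands for the history `<s>`, and that every backoff arc drops exactly one word from the history. The names `parse_fst_text`, `history_length`, and `label_states` are hypothetical helpers invented for this example.

```python
from collections import deque


def parse_fst_text(lines):
    """Parse OpenFst-style text lines into (start_state, arcs)."""
    arcs = {}  # src -> list of (dst, ilabel)
    start = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if start is None:
            start = fields[0]  # start state is the source on the first line
        if len(fields) >= 4:  # arc line: src dst ilabel olabel [weight]
            src, dst, ilabel = fields[0], fields[1], fields[2]
            arcs.setdefault(src, []).append((dst, ilabel))
        # shorter lines mark final states and carry no history information
    return start, arcs


def history_length(state, backoff_dst):
    """Count backoff hops until a state with no backoff arc is reached."""
    n = 0
    while state in backoff_dst and n <= len(backoff_dst):  # bound guards cycles
        state = backoff_dst[state]
        n += 1
    return n


def label_states(lines, start_history=("<s>",), backoff_labels=("<eps>", "#0")):
    """Return {state: tuple of history words}, assuming the layout above."""
    start, arcs = parse_fst_text(lines)
    backoff_dst = {
        src: dst
        for src, out in arcs.items()
        for dst, label in out
        if label in backoff_labels
    }
    histories = {start: tuple(start_history)}
    queue = deque([start])
    while queue:
        src = queue.popleft()
        for dst, label in arcs.get(src, []):
            if dst in histories:
                continue
            if label in backoff_labels:
                hist = histories[src][1:]  # backing off drops the oldest word
            else:
                hist = histories[src] + (label,)  # a word arc appends the word
            n = history_length(dst, backoff_dst)
            histories[dst] = hist[-n:] if n else ()  # keep the last n words
            queue.append(dst)
    return histories


# Tiny hand-written bigram-style FST, for illustration only.
example_fst = [
    "0 1 <eps> <eps> 0.5",
    "0 2 a a 0.1",
    "1 2 a a 1.0",
    "1 3 b b 1.2",
    "1 4 </s> </s> 2.0",
    "2 1 <eps> <eps> 0.3",
    "2 3 b b 0.2",
    "3 1 <eps> <eps> 0.4",
    "3 4 </s> </s> 0.7",
    "4",
]

for state, hist in sorted(label_states(example_fst).items()):
    print(state, " ".join(hist) if hist else "(empty history)")
```

Each state is queued once and each arc is examined once, so the cost grows roughly linearly with the number of arcs, plus a short backoff walk per newly labeled state. One caveat of this sketch: any state without an outgoing backoff arc, such as a terminal end-of-sentence state, is reported with an empty history, so it is indistinguishable from the unigram state unless handled separately.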

Note: I know there is an argument called --keep-symbols, but it only stores information about the arcs, whereas I am interested in knowing what each state represents.

Is there a way to figure out which state represents which history in the FST text file? Thank you for your help.

@csukuangfj
Owner

> Is there a way to figure out which state represents which history in the FST text file? Thank you for your help.

That information is an implementation detail and is accessible only from C++.
Please read the C++ code if you want to learn more.

HistoryMap history_;
