Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexicon: don't add phoneme-inventory tag when lexicon doesn't have phonemes. #547

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Icemole
Copy link
Collaborator

@Icemole Icemole commented Oct 10, 2024

This change might break other setups, so it's fine if you don't want to merge this.

This change might break other setups, so it's fine if you don't want to merge this.
@albertz
Copy link
Member

albertz commented Oct 10, 2024

This change might break other setups, so it's fine if you don't want to merge this.

I mean, this is the relevant question, right? I don't know the answer. But it all depends on the answer. If there are jobs where this would change the behavior, then we cannot merge it, we would need to introduce this as a new flag for the job. If there are no jobs where this would change sth, then we can merge this. But I don't know.

@JackTemaki
Copy link
Contributor

JackTemaki commented Oct 10, 2024

The question I rather have is: Does this give any benefit somewhere? So is there something not working when it exists but is empty?

@Icemole
Copy link
Collaborator Author

Icemole commented Oct 10, 2024

Some context on this decision: for our ASR pipelines at AppTek we're starting to provide extra lexica, which assumes that there's no phoneme inventory tag at all on such extra lexica. However, when we find a phoneme tag, even if it's empty, we crash with an assertion error.

We could also fix this on our end, but I found it pretty weird that the phoneme-inventory tag was being generated at all when there were no phonemes, so I wanted to get some discussion going with respect to this.

@JackTemaki
Copy link
Contributor

Then it is fine for me, at i6 we usually do not have lexica without an inventory, and I do not see where something would crash if it is not there or cause new behavior if it was empty anyway.

Still, we should get a lot of people to look at this to be on the safe side.

@albertz
Copy link
Member

albertz commented Oct 10, 2024

I think specifically people who use lexica should look into this. That's not me, but mostly our RASR users. Maybe @Marvin84 @SimBe195.

@albertz albertz requested review from Marvin84 and SimBe195 and removed request for albertz October 10, 2024 13:39
@Marvin84
Copy link
Contributor

we're starting to provide extra lexica, which assumes that there's no phoneme inventory tag at all on such extra lexica

I am not sure if I understand your problem. Are you saying that you have an external lexicon and you cannot extend your original lexicon with the additional external one? And you need to use multiple lexica at the same time?

AFAIK, the main use of the phoneme inventory within rasr is to have the order of phonemes, which will have an effect on the state tyings that are done within rasr, if you are not using any external state tying file.

when there were no phonemes

I also don't understand this statement. And you are also saying that your external lexicon does not have any phoneme inventory? Or are you saying that you have two lexica that do not share the same token inventory?
I personally don't see why we should give an option to a lexicon to live without its token inventory. The definition of a lexicon assumes that you have certain base tokens (in this case phonemes) and you provide the mapping of your words to a sequence of these tokens. I assume that this is the reason an assertion exists for this.

@sarahberanek
Copy link
Collaborator

sarahberanek commented Oct 10, 2024

Hi Tina

we're starting to provide extra lexica, which assumes that there's no phoneme inventory tag at all on such extra lexica

I am not sure if I understand your problem. Are you saying that you have an external lexicon and you cannot extend your original lexicon with the additional external one? And you need to use multiple lexica at the same time?

Yes exactly at AppTek we offer the option to add an additional lexicon at runtime. So we do not want to append the main one. The lexicons are created with the same g2p as the main one with a seperate API - so the phoneme set is the same - but RASR crashs if <phoneme></phoneme> is given in the extra lexicon because it reads the phonemes from the main one and does not expect phonemes in the 2nd one. As far as I remember it is not crashing because its empty.

We have a method to remove the phoneme inventory element in apptek_asr but Nahuels implementation here is much more elegant for the feature of extra lexicons / custom words
So having said that, if this PR is rejected there is a solution to avoid that in the lab. (In the API Steve is creating the extra lexicon without phoneme entries in the first place - so we try to mimic production here (ok and rasr crashes if not :-D )

@Marvin84
Copy link
Contributor

Thanks @sarahberanek for the explanation, now it is clear.
I personally don't think the change of Nahuel should affect any of our setups, since we generally have the base case where we have phonemes. But I would wait also for @SimBe195 's opinion on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants