id2lang in tokenization_xlm.py should be int, and removing hardcoding #6734

Closed
stas00 opened this issue Aug 25, 2020 · 1 comment · Fixed by #7034

stas00 commented Aug 25, 2020

In https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_xlm.py#L78 we have:

        "id2lang": {"0": "de", "1": "en"},
        "lang2id": {"de": 0, "en": 1},

and then:

        lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):
            Dictionary mapping languages string identifiers to their IDs.
        id2lang (:obj:`Dict[int, str]`, `optional`, defaults to :obj:`None`):

So it should be:

        "id2lang": {0: "de", 1: "en"},
        "lang2id": {"de": 0, "en": 1},

All other entries need this change too.

The problem seems to have gone undetected until now because these dicts were only used to count the number of languages.

I need to pass src/tgt languages to the tokenizer I'm porting from fairseq, so I was looking at how to do that, and id2lang seems to fit the purpose. But I actually need to look them up by int id, which is how I noticed the problem.
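To illustrate the mismatch, here is a minimal sketch using a copy of the entry quoted above. The variable names are mine and are not taken from the library:

```python
# Copy of the hardcoded entry quoted above, with string keys.
id2lang_str_keys = {"0": "de", "1": "en"}
lang2id = {"de": 0, "en": 1}

lang_id = lang2id["en"]  # an int, as the docstring says

# Looking up by int id fails because the keys are strings.
try:
    lang = id2lang_str_keys[lang_id]
except KeyError:
    lang = None

# With int keys the round trip works.
id2lang_int_keys = {0: "de", 1: "en"}
round_trip = id2lang_int_keys[lang_id]
```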

I'm also not sure why we need to hardcode the reversal when it can be done in one line of code, which would also remove the need for this assertion code:

        self.lang2id = lang2id
        self.id2lang = id2lang
        if lang2id is not None and id2lang is not None:
            assert len(lang2id) == len(id2lang)
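As a rough sketch of the one-line reversal I have in mind (assuming the ids in lang2id are unique, which they should be):

```python
lang2id = {"de": 0, "en": 1}

# Derive the reverse mapping instead of hardcoding it.
# The keys come out as ints because they are the values of lang2id.
id2lang = {lang_id: lang for lang, lang_id in lang2id.items()}
```

Since id2lang is derived from lang2id, the two are the same length by construction as long as the ids are unique, so the length check would no longer be needed.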

Further we don't even need to hardcode the ids. Replace:

       "id2lang": {0: "de", 1: "en"},

with:

       "id2lang": ["de", "en"]

So all we need is one of the two entries; the two lookup dicts can then be generated on the fly.

And since it's no longer id2lang semantically, probably renaming it to just langs would be more appropriate.
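Roughly, and assuming the ids are simply the positions in the list, generating both lookups from a langs entry could look like this (a sketch of the proposal, not existing library code):

```python
# Hypothetical "langs" entry replacing the hardcoded id2lang/lang2id pair.
langs = ["de", "en"]

# The position in the list is the language id.
id2lang = dict(enumerate(langs))
lang2id = {lang: lang_id for lang_id, lang in enumerate(langs)}
```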

I think I will use this approach regardless of the outcome of this issue.

Thanks.

stas00 changed the title from "id2lang in tokenization_xlm.py should be int" to "id2lang in tokenization_xlm.py should be int, and removing hardcoding" on Aug 25, 2020

stas00 commented Sep 10, 2020

I'm not sure whether the suggested rewrite that removes all those numbers is desirable; perhaps it's important to see them. So I left that alone and only fixed the keys of id2lang to be int.

#7034
