You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):
Dictionary mapping languages string identifiers to their IDs.
id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):
The problem hasn't been detected until now since they were used to only count the number of languages it seems.
I need to pass src/tgt languages to the tokenizer I'm porting from fairseq, so I was looking at how to do that and id2lang seems to fit the purpose. But I actually need to look them up by int id, that's how I saw the problem.
But I'm also not sure why do we need to hardcode the reversal, when it can be done in 1 line of code? Which would also remove this assertion code:
self.lang2id = lang2id
self.id2lang = id2lang
if lang2id is not None and id2lang is not None:
assert len(lang2id) == len(id2lang)
Further we don't even need to hardcode the ids. Replace:
"id2lang": {0: "de", 1: "en"},
with:
"id2lang": ["de", "en"]
So all we need is one of the two entries, and now generate the 2 lookup dicts on the fly.
And since it's no longer id2lang semantically, probably renaming it to just langs would be more appropriate.
I think I will use this approach regardless of the outcome of this issue.
Thanks.
The text was updated successfully, but these errors were encountered:
stas00
changed the title
id2lang in tokenization_xlm.py should be int
id2lang in tokenization_xlm.py should be int, and removing hardcoding
Aug 25, 2020
Not sure if the suggested rewrite to remove all those numbers is desirable - perhaps it's important to see those numbers, so I left it alone and just fixed the keys of id2lang to be int.
In https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_xlm.py#L78 we have:
and then:
So it should be:
All other entries need this change too.
The problem hasn't been detected until now since they were used to only count the number of languages it seems.
I need to pass src/tgt languages to the tokenizer I'm porting from fairseq, so I was looking at how to do that and
id2lang
seems to fit the purpose. But I actually need to look them up byint
id, that's how I saw the problem.But I'm also not sure why do we need to hardcode the reversal, when it can be done in 1 line of code? Which would also remove this assertion code:
Further we don't even need to hardcode the ids. Replace:
with:
So all we need is one of the two entries, and now generate the 2 lookup dicts on the fly.
And since it's no longer
id2lang
semantically, probably renaming it to justlangs
would be more appropriate.I think I will use this approach regardless of the outcome of this issue.
Thanks.
The text was updated successfully, but these errors were encountered: