I am training a UNIGRAM model on a large corpus in the tsv "sentence \tab frequency" format. The input is highly structured: the alphabet consists of 8K characters and every word has length 4. Even the number of possible trigrams over the first three symbols is in the billions, yet the number of resulting seed sentencepieces is only ~30M. Because of this, I get a much lower compression rate on the corpus than BPE with the same vocabulary size.
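To make the scale concrete (my own arithmetic, not trainer output):

```python
# Size of the candidate space, purely combinatorial (illustrative arithmetic).
alphabet_size = 8_000
possible_trigrams = alphabet_size ** 3      # prefixes over the first three symbols
print(f"{possible_trigrams:,}")             # 512,000,000,000 -- hundreds of billions
# ...yet the trainer settles on only ~30M seed sentencepieces.
```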
In addition, I constructed an explicit seed_sentencepieces_file in the (seed \tab freq) format, which ended up containing 458M seed sentencepieces. With seed_sentencepiece_size=500_000_000, the trainer appears to load them fine, reporting the correct number of initialized sentencepieces via this line. But immediately afterwards, the trainer dies without printing an error.
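For clarity, the file is plain "piece \tab frequency" lines; a minimal sketch of how I build it (the pieces and counts below are made-up placeholders, not entries from my corpus):

```python
# Illustrative only: the seed_sentencepieces_file layout is one "piece\tfrequency" per line.
# These pieces and counts are placeholders, not taken from the real corpus.
seeds = [("ab", 120_000), ("abc", 45_000), ("abcd", 9_000)]
with open("seeds.tsv", "w", encoding="utf-8") as f:
    for piece, freq in seeds:
        f.write(f"{piece}\t{freq}\n")
```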
I suspect it silently OOMs, because the same seed_sentencepieces_file trains successfully with seed_sentencepiece_size=100_000_000.
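My rough back-of-the-envelope for why 500M seeds could exceed RAM (the per-entry cost is a guess, not taken from the trainer's actual data structures):

```python
# Very rough memory estimate for holding the seed table in RAM; the per-entry
# overhead is an assumed average (UTF-8 piece + score + container bookkeeping).
num_seeds = 458_000_000
bytes_per_seed = 100                                       # assumption, not measured
total_gib = num_seeds * bytes_per_seed / 2**30
print(f"~{total_gib:.0f} GiB for the seed table alone")    # ~43 GiB, before EM working memory
```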
Here is the train config:
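The essential shape of it, with placeholder paths and sizes where the exact values don't matter (this assumes a sentencepiece build that exposes seed_sentencepieces_file as a trainer flag):

```python
# Sketch of the training call; input path, model_prefix, and vocab_size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.tsv",                      # "sentence\tfrequency" lines
    input_format="tsv",
    model_prefix="unigram_model",            # placeholder
    model_type="unigram",
    vocab_size=64_000,                       # placeholder, same size as the BPE baseline
    seed_sentencepieces_file="seeds.tsv",    # the explicit 458M-entry seed file
    seed_sentencepiece_size=500_000_000,
    train_extremely_large_corpus=True,       # assumption: enabled for a corpus of this size
)
```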
Relevant log piece:
Any ideas why the resulting number of seed sentencepieces is so low?