I am training a UNIGRAM model on a large corpus in the tsv "sentence \tab frequency" format. The input is highly structured: the alphabet consists of 8K characters and every word has length 4. Even the number of possible trigrams over the first three symbols is in the billions, yet the number of resulting seed sentencepieces is only ~30M. Because of this, I get a much lower compression rate on the corpus than BPE with the same vocabulary size.
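To make the scale concrete (my own arithmetic, not trainer output):

```python
# Size of the candidate space, purely combinatorial (illustrative arithmetic).
alphabet_size = 8_000
possible_trigrams = alphabet_size ** 3      # prefixes over the first three symbols
print(f"{possible_trigrams:,}")             # 512,000,000,000 -- hundreds of billions
# ...yet the trainer settles on only ~30M seed sentencepieces.
```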
In addition, I constructed an explicit seed_sentencepieces_file in the (seed \tab freq) format, which ended up containing 458M seed sentencepieces. With seed_sentencepiece_size=500_000_000, the trainer appears to load them fine, reporting the correct number of initialized sentencepieces via this line. But immediately afterwards, the trainer dies without printing an error.
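For clarity, the file is plain "piece \tab frequency" lines; a minimal sketch of how I build it (the pieces and counts below are made-up placeholders, not entries from my corpus):

```python
# Illustrative only: the seed_sentencepieces_file layout is one "piece\tfrequency" per line.
# These pieces and counts are placeholders, not taken from the real corpus.
seeds = [("ab", 120_000), ("abc", 45_000), ("abcd", 9_000)]
with open("seeds.tsv", "w", encoding="utf-8") as f:
    for piece, freq in seeds:
        f.write(f"{piece}\t{freq}\n")
```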
I suspect it silently OOMs, because the same seed_sentencepieces_file trains successfully with seed_sentencepiece_size=100_000_000.
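My rough back-of-the-envelope for why 500M seeds could exceed RAM (the per-entry cost is a guess, not taken from the trainer's actual data structures):

```python
# Very rough memory estimate for holding the seed table in RAM; the per-entry
# overhead is an assumed average (UTF-8 piece + score + container bookkeeping).
num_seeds = 458_000_000
bytes_per_seed = 100                                       # assumption, not measured
total_gib = num_seeds * bytes_per_seed / 2**30
print(f"~{total_gib:.0f} GiB for the seed table alone")    # ~43 GiB, before EM working memory
```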
Here is the train config:
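The essential shape of it, with placeholder paths and sizes where the exact values don't matter (this assumes a sentencepiece build that exposes seed_sentencepieces_file as a trainer flag):

```python
# Sketch of the training call; input path, model_prefix, and vocab_size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.tsv",                      # "sentence\tfrequency" lines
    input_format="tsv",
    model_prefix="unigram_model",            # placeholder
    model_type="unigram",
    vocab_size=64_000,                       # placeholder, same size as the BPE baseline
    seed_sentencepieces_file="seeds.tsv",    # the explicit 458M-entry seed file
    seed_sentencepiece_size=500_000_000,
    train_extremely_large_corpus=True,       # assumption: enabled for a corpus of this size
)
```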
Relevant log piece:
Any ideas why the resulting number of seed sentencepieces is so low?