@kpu I want to train a character n-gram model for the Bangla language. I have preprocessed my corpus so that it looks like the following; here is a small demo:
| অ প ে ক ্ ষ া | ক র ত ে ন | উ প ভ ো গ | ক র ত ে ন | ত া র | উ জ ্ জ ্ ব ল | উ প স ্ থ ি ত ি | এ ই | স র ক া র | ল ু ট ে র া | ত ো ষ ণ ক া র ী ঃ | র ু ম ি ন | ফ া র হ া ন া | হ ্ য া ঁ | আ প ন া র | র ে জ ি স ্ ট ্ র ে শ ন | ফ র ্ ম | স ম ্ প ন ্ ন | ক র া র | প র | প র ি ব র ্ ত ন | ক র া | স ম ্ ভ ব | স া ধ া র ণ ত | ন ি শ ্ চ ি ত ক র ণ | এ ব ং | চ া ল া ন |......
Here I have appended all the meaningful sentences in my dataset, one after another, in a single line of a .txt file. All the characters in each word of each sentence have been space-separated, and word boundaries are represented with a |. The training data is around 7 GB, which is quite large for text.
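Roughly, the preprocessing looks like this (a minimal sketch only, not my exact script; raw_corpus.txt is a placeholder name for the raw one-sentence-per-line file):

```python
# Sketch: turn a one-sentence-per-line Bangla corpus into the character-level
# format shown above, with space-separated characters and "|" as the word
# boundary, all on a single output line.
with open("raw_corpus.txt", encoding="utf-8") as fin, \
     open("path_to_my_preprocessed_text_corpus.txt", "w", encoding="utf-8") as fout:
    fout.write("|")
    for line in fin:
        for word in line.split():
            # " ".join(word) space-separates the Unicode code points of the word
            fout.write(" " + " ".join(word) + " |")
    fout.write("\n")
```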
I want to train a 6-gram model using the command:
./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
A demo sample of how the path_to_my_preprocessed_text_corpus.txt file looks is shown above.
Running the command:
./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
gives the following error:
=== 1/5 Counting and sorting n-grams ===
Reading /home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/our_data/train_processed_char_level_git_data_proper_nouns_ai4bharat.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 2832827518 types 64
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:768 2:6642523648 3:12454731776 4:19927572480 5:29061042176 6:39855144960
/home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 2 because we didn't observe any 1-grams with adjusted count 1; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback
Aborted (core dumped)
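As far as I understand, the discounts in step 2 are estimated from counts-of-counts (how many n-grams have adjusted count exactly 1, 2, 3, 4), and the error says there are no 1-grams with adjusted count 1. A tiny illustration of why that happens on a character-level corpus (my own sketch using raw counts, not KenLM code; KenLM actually uses adjusted counts for the lower orders, but the picture is the same with only 64 types):

```python
# Count unigram (character) frequencies, then the counts-of-counts.
# With ~64 token types and ~2.8 billion tokens, no token occurs exactly once,
# so n1 == 0 and the Kneser-Ney discounts (which depend on ratios like
# n1 / (n1 + 2 * n2)) cannot be estimated from the data.
from collections import Counter

token_counts = Counter()
with open("path_to_my_preprocessed_text_corpus.txt", encoding="utf-8") as f:
    # Note: the preprocessed file is one huge line, so this loads it all into
    # memory; it is only an illustration, not a production script.
    for line in f:
        token_counts.update(line.split())

count_of_counts = Counter(token_counts.values())
print("types:", len(token_counts))
print("n1 (tokens seen exactly once):", count_of_counts[1])
print("n2 (tokens seen exactly twice):", count_of_counts[2])
```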
But when I run the training using the same command with --discount_fallback added, the error no longer occurs and training starts. The command with --discount_fallback is:
./kenlm/build/bin/lmplz -o 6 --discount_fallback --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
My question is: why does this happen, and if I train with --discount_fallback, will there be anything wrong with the model?

Originally posted by @amitbcp in #302 (comment)
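For what it's worth, a quick way to sanity-check the finished model is to load the ARPA file with the kenlm Python bindings (assuming they are installed) and compare log10 scores of a plausible and an implausible character sequence:

```python
# Sanity check of the trained character-level model (assumes the kenlm Python
# bindings are installed; the paths and example strings are placeholders).
import kenlm

model = kenlm.Model("./saved_lm/6gram_model.arpa")

# Queries must use the same tokenization as training: space-separated
# characters with "|" marking word boundaries.
plausible = "| ক র ত ে ন |"       # a word from the demo data
implausible = "| ঁ ঁ ঁ ঁ ঁ |"     # an unlikely character sequence
print(model.score(plausible, bos=True, eos=True))    # log10 probability
print(model.score(implausible, bos=True, eos=True))  # expected to be much lower
```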