I am facing an error in the pre-GCM stage while running the script on a large set of Hindi-English parallel sentences; the error log is below. I am also facing an issue in the alignment stage, where some sentences are being dropped because of a similar error.
```
Error in line 300000
||| <sentence of length 300>
__main__: INFO: 2022-01-04 17:24:56,088: Parsing sentences: 0, 499
Traceback (most recent call last):
  File "pre_gcm.py", line 204, in <module>
    main()
  File "pre_gcm.py", line 174, in main
    output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
  File "pre_gcm.py", line 174, in <listcomp>
    output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 115, in parse
    return list(self.parse_sents([sentence]))[0]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 137, in parse_sents
    for parse_raw, tags_raw, sentence in self._batched_parsed_raw(self._nltk_process_sents(sents)):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/base_parser.py", line 342, in _batched_parsed_raw
    for sentence, datum in sentence_data_pairs:
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 89, in _nltk_process_sents
    sentence = nltk.word_tokenize(sentence, self._tokenizer_lang)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
```
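Reading the traceback, the failure is not in pre_gcm.py or benepar itself but inside NLTK's punkt sentence tokenizer: `_match_potential_end_contexts` takes `split[-1]` on an empty list for certain inputs. This looks like a recent punkt regression, so upgrading NLTK inside the gcm environment may fix it outright. If not, one workaround is to guard the parse call so a single bad sentence is skipped instead of aborting the entire run. Below is a minimal sketch that rewrites the comprehension from pre_gcm.py line 174 as a guarded loop; `target_s` and `berkeley_parser` are the names from the traceback, the model name is an assumption, and the placeholder line for skipped sentences is just one way to keep the target file line-aligned with the source side.

```python
import logging

import benepar  # the nltk_plugin path in the traceback suggests the old benepar 0.1.x API

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Model name is an assumption; reuse whatever pre_gcm.py already loads.
# The model must have been fetched once via benepar.download("benepar_en2").
berkeley_parser = benepar.Parser("benepar_en2")

# Stand-in for the target-side sentence list that pre_gcm.py builds in main().
target_s = ["This is a short example sentence."]

output = []
for i, sentence in enumerate(target_s):
    try:
        parse = str(berkeley_parser.parse(sentence))
        output.append("(ROOT " + " ".join(parse.split()) + ")\n")
    except IndexError:
        # NLTK's punkt tokenizer raises IndexError on some inputs
        # (split[-1] on an empty list in _match_potential_end_contexts).
        logger.warning("Skipping sentence %d: punkt tokenizer failed", i)
        output.append("\n")  # placeholder to keep line alignment with the source side
```

Logging the skipped indices also makes it easy to drop the corresponding sentence pairs before the alignment stage, which may be related to the sentences you see disappearing there.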