Error in Pre GCM Stage #9


Open
AnshulP10 opened this issue Jan 4, 2022 · 0 comments

I am facing an error in the pre_gcm stage while running the script on a large set of Hindi-English parallel sentences. Please find the error log below. I am also facing an issue in the alignment stage, where some sentences are not being considered due to an error.

```
Error in line 300000
||| <sentence of length 300>
__main__: INFO: 2022-01-04 17:24:56,088: Parsing sentences: 0, 499
Traceback (most recent call last):
  File "pre_gcm.py", line 204, in <module>
    main()
  File "pre_gcm.py", line 174, in main
    output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
  File "pre_gcm.py", line 174, in <listcomp>
    output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 115, in parse
    return list(self.parse_sents([sentence]))[0]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 137, in parse_sents
    for parse_raw, tags_raw, sentence in self._batched_parsed_raw(self._nltk_process_sents(sents)):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/base_parser.py", line 342, in _batched_parsed_raw
    for sentence, datum in sentence_data_pairs:
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 89, in _nltk_process_sents
    sentence = nltk.word_tokenize(sentence, self._tokenizer_lang)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
```
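The innermost frame shows the crash happens inside NLTK's punkt sentence tokenizer (`_match_potential_end_contexts`), not in pre_gcm.py itself, so one pathological sentence aborts the whole batch at line 174. For reference, a defensive sketch that keeps the run going when a single sentence fails to parse, logging which inputs were skipped; `parse_with_skips` and `toy_parse` are illustrative stand-ins, not functions from this repo:

```python
# Hypothetical sketch: parse sentence-by-sentence and collect failures
# instead of letting one bad input raise out of the list comprehension.
def parse_with_skips(parse_fn, sentences):
    """Return (parses, skipped), where skipped holds (index, error) pairs."""
    parses, skipped = [], []
    for i, sentence in enumerate(sentences):
        try:
            parses.append(parse_fn(sentence))
        except IndexError as err:  # e.g. the punkt failure in the log above
            skipped.append((i, err))
    return parses, skipped

# Stand-in parser that mimics the failure mode on a degenerate input.
def toy_parse(sentence):
    if not sentence.strip():
        raise IndexError("list index out of range")
    return "(ROOT " + sentence + ")"

parses, skipped = parse_with_skips(toy_parse, ["hello world", "   ", "ok"])
# parses keeps the two good sentences; skipped records index 1.
```

The skipped indices could then be cross-checked against the alignment stage, since sentences dropped here would also be missing there.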