🔥 🐛 MLTextTranslatorPipelineItem -- causes issues with the indices in translation (batching mode) #146

Closed
Tracked by #131
nicolay-r opened this issue Jul 7, 2024 · 1 comment
Labels
bug Something isn't working

nicolay-r commented Jul 7, 2024

Using the following script:

#!/bin/bash
python3 -m arelight.run.infer \
	--sampling-framework "arekit" \
	--ner-model-name "ner_ontonotes_bert_mult" \
	--ner-types "ORG|PERSON|LOC|GPE" \
	--terms-per-context 50 \
	--sentence-parser "nltk:russian" \
	--text-b-type "nli_m" \
	--tokens-per-context 128 \
	--bert-framework "opennre" \
	--batch-size 10 \
	--inference-writer "sqlite3" \
	--stemmer "mystem" \
	--pretrained-bert "DeepPavlov/rubert-base-cased" \
	--bert-torch-checkpoint "ra4-rsr1_DeepPavlov-rubert-base-cased_cls.pth.tar" \
	--backend "d3js_graphs" \
	-o "output" \
	--from-files "$1" \
	--log-file "arelight.log.txt"
	# To enable translation, add a trailing backslash to the line above and uncomment:
	#--translate-framework googletrans \
	#--translate-text "en:ru"

Uncommenting the last two lines to enable the translator leads to incorrect indexing.
googletrans version:
googletrans==3.1.0a0

The text parts that were translated earlier need to be split into words.
For example, here is the list in which we look up the positions of the indexed named entities:

['Он командовал 50 000 солдатами ', <arelight.pipelines.items.entity.IndexedEntity object at 0x7f6346931760>, ', дислоцированными в провинциях ', <arelight.pipelines.items.entity.IndexedEntity object at 0x7f63469310d0>]
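
The splitting step could look roughly like the sketch below. It is not the actual arelight fix: the `IndexedEntity` class here is a hypothetical stand-in for `arelight.pipelines.items.entity.IndexedEntity`, and `split_text_parts` is an illustrative helper name. The idea is only that string parts become individual words while entity objects pass through untouched, so entity positions are counted in terms rather than in text chunks.

```python
class IndexedEntity:
    """Hypothetical stand-in for arelight's IndexedEntity (assumption)."""

    def __init__(self, value):
        self.value = value


def split_text_parts(parts):
    """Split every string part into words; keep non-string items as they are."""
    terms = []
    for part in parts:
        if isinstance(part, str):
            # Whitespace split; leading/trailing spaces produce no empty terms.
            terms.extend(part.split())
        else:
            terms.append(part)
    return terms


e1, e2 = IndexedEntity("E1"), IndexedEntity("E2")
parts = ["Он командовал 50 000 солдатами ", e1,
         ", дислоцированными в провинциях ", e2]
terms = split_text_parts(parts)
```

With the list from this issue, the entities move from chunk positions 1 and 3 to term positions 5 and 10.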
nicolay-r added a commit that referenced this issue Jul 15, 2024
nicolay-r commented Jul 15, 2024

The quick fix needs refactoring: adapt a flattened iterator and set up a separator for terms.

def flatten(xss):
    return [x for xs in xss for x in xs]
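
One possible reading of "separator for terms" is sketched below. It is an assumption, not the code from the referenced commits: `SEPARATOR`, `flatten_with_separator` and `unflatten` are hypothetical names. A sentinel is appended after each inner list while flattening, so the flat stream can be regrouped into the original batch layout afterwards.

```python
SEPARATOR = object()  # hypothetical sentinel marking the end of one inner list


def flatten_with_separator(xss, sep=SEPARATOR):
    """Flatten a list of lists, appending `sep` after each inner list."""
    flat = []
    for xs in xss:
        flat.extend(xs)
        flat.append(sep)
    return flat


def unflatten(flat, sep=SEPARATOR):
    """Rebuild the list of lists from a stream produced by flatten_with_separator."""
    groups, current = [], []
    for x in flat:
        if x is sep:
            groups.append(current)
            current = []
        else:
            current.append(x)
    return groups


batch = [["a", "b"], ["c"], []]
restored = unflatten(flatten_with_separator(batch))
```

Unlike the plain `flatten`, this keeps the boundaries between inner lists, including empty ones, so per-sentence indices can be recomputed after the batch is processed.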

nicolay-r added a commit that referenced this issue Jul 16, 2024