language detection for documents that have multiple languages #185

crayzsociety · 2018-10-05T07:29:40Z

Hi,
we have a problem with language detection for documents that contain multiple languages. Could you help us with this?

ahmetaa · 2018-10-05T08:21:16Z

Please describe the problem with a small code example and the document in question (or a public text, if the document contains private data). Also keep in mind that language detection can make mistakes.

crayzsociety · 2018-10-05T08:29:03Z

LanguageIdentifier lid = LanguageIdentifier.fromInternalModels();
String result = lid.identify("Ahmet eve gitti.Ayşe has gone");

In this example we are trying to determine that the document contains both Turkish and English text. Zemberek gives us a single language; is there a way to tell that there is more than one?

ahmetaa · 2018-10-05T09:44:08Z

There are a few possible ways to do this. You can split the paragraph into sentences and run detection separately on each sentence longer than a certain number of characters. Then, with a few rules on top, you can work out roughly which languages are used.
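A minimal sketch of the sentence-by-sentence approach, assuming a naive punctuation-based splitter and a pluggable detector. The class name `PerSentenceLanguages`, the `minLength` threshold, and the toy detector in `main` are all hypothetical; in real use something like `lid::identify` from the snippet earlier in the thread could be passed as the detector, and the final "rule" here (summing characters per language) is just one possible choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class PerSentenceLanguages {

    // Naive splitter: breaks after '.', '!' or '?'.
    // A proper sentence extractor would handle abbreviations and numbers better.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.split("(?<=[.!?])")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    // Runs the detector on every sentence of at least minLength characters
    // and sums the number of characters attributed to each language.
    static Map<String, Integer> countByLanguage(
            String text, Function<String, String> identify, int minLength) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : splitSentences(text)) {
            if (sentence.length() < minLength) {
                continue; // too short to classify reliably
            }
            counts.merge(identify.apply(sentence), sentence.length(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy stand-in detector for demonstration only (an assumption, not Zemberek).
        Function<String, String> toy = s -> s.contains("has") ? "en" : "tr";
        // \u015F is the letter s-cedilla, as in the example from the thread.
        String text = "Ahmet eve gitti.Ay\u015Fe has gone";
        System.out.println(countByLanguage(text, toy, 10));
    }
}
```

A language could then be reported as present when its character count passes some threshold; that threshold would need tuning on real documents.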

Alternatively, if you know for certain that only a specific few languages can occur, you can try the containsLanguage method for each of them.
However, things like the number of characters belonging to each language in the text affect the success rate. You will need to experiment.
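The per-language check could be sketched roughly as below. The `BiPredicate` is a hypothetical stand-in for a call to `containsLanguage`; the exact parameters of that method are not shown in this thread, so they should be checked against the Zemberek source before wiring it in. The class and method names here are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

class CandidateLanguageCheck {

    // Returns the candidate languages for which the supplied check reports
    // that the text contains that language. The check takes (text, languageId).
    static List<String> presentLanguages(
            String text, List<String> candidates, BiPredicate<String, String> contains) {
        List<String> found = new ArrayList<>();
        for (String language : candidates) {
            if (contains.test(text, language)) {
                found.add(language);
            }
        }
        return found;
    }
}
```

Because the check runs once per candidate language, the cost grows with the size of the candidate list.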

Finally, Müge mentioned that the polyglot tool can do this if the text is split into lines.

ahmetaa · 2018-10-05T09:50:51Z

I should also point out that the approaches mentioned above will slow processing down considerably.
I am leaving this issue open; perhaps a dedicated method can be written for this.

ahmetaa · 2018-10-05T09:59:03Z

You can take a look at polyglot; it is apparently a library built specifically for mixed-language text.
https://github.com/saffsd/polyglot

crayzsociety · 2018-10-05T11:49:16Z

Thanks.

ahmetaa changed the title ~~language detection for documents has multiple languages~~ language detection for documents that have multiple languages Oct 15, 2018
