Replace cld2 to whatlanggo #47

mosuka · 2020-04-09T02:21:42Z

I would like to replace cld2 with whatlanggo, as cld2 seems to be archived and no longer maintained.
What do you think about this?

mschoch · 2020-04-09T04:18:02Z

Can you share how you're using this? We also plan to deprecate all of blevex soon, because no one is maintaining it.

As for detectlang in particular, we originally thought it might be useful to do text analysis based on the language that was detected. But bleve offers no way to make use of this info, because the mappings are static and the language isn't detected until index time. So, can you explain how you're using it?
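
To illustrate the static-mapping point, here is a minimal sketch using only the standard library. It is not bleve's API: `staticMapping`, `detectLang`, and `analyzerFor` are hypothetical names, and the detector is a toy that only checks for Hiragana or Katakana. It shows the analyzer being fixed per field before any document is seen, so a language detected at index time has nothing to feed into.

```go
package main

import "fmt"

// staticMapping is a hypothetical stand-in for an index mapping: the analyzer
// for each field is chosen once, before any document is seen.
type staticMapping map[string]string

// detectLang is a toy detector (an assumption for illustration only): it
// reports "ja" when the text contains Hiragana or Katakana, otherwise "en".
func detectLang(text string) string {
	for _, r := range text {
		if r >= 0x3040 && r <= 0x30FF {
			return "ja"
		}
	}
	return "en"
}

// analyzerFor returns the analyzer the mapping fixed for the field, together
// with the language detected at index time. The detected language is computed
// but cannot influence which analyzer is returned.
func analyzerFor(m staticMapping, field, text string) (analyzer, detected string) {
	return m[field], detectLang(text)
}

func main() {
	m := staticMapping{"body": "en"}
	for _, text := range []string{"hello world", "こんにちは世界"} {
		a, d := analyzerFor(m, "body", text)
		fmt.Printf("detected=%s analyzer=%s\n", d, a)
	}
}
```

In this sketch the second document is detected as Japanese but is still handed the analyzer configured for the field, which is the situation described above.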

mosuka · 2020-04-09T04:50:38Z

I don't actually use this feature (detect_lang_filter), but I am actively trying to support Blevex.
I knew that CLD2 was deprecated and suggested replacing it with an alternative library. Users of my product have long been asking to replace CLD2 with something that doesn't depend on cgo, so this time I sent you a PR.

mosuka · 2020-04-09T04:55:33Z

BTW, you mentioned deprecating all of blevex soon. Will the Japanese tokenizer also be deprecated?

mschoch · 2020-04-09T19:47:26Z

So, unless we can better understand that the detect_lang filter has some actual use, I would prefer to get rid of it, rather than change which library it uses.

The Japanese tokenizer is the only thing in blevex that I think makes sense to save. Most likely it would move to be its own top-level module. Do you think there is anything else of value in blevex?

mosuka · 2020-04-10T16:03:53Z

Yes, I think you're right about detectlang. I can't think of a good use case either...

Basically, I'd like to keep the language analysis modules, for example icu, lang, and stemmer. It would be helpful if you could keep these so that I can support as many languages as Lucene does.

mschoch · 2020-04-10T17:11:50Z

Does the icu tokenizer still work? Does it work with recent versions of ICU, or only with some specific old ones? It hasn't been touched for 5 years, and it was difficult to get working back then, so I'd be surprised if it does.

I believe all the languages supported by libstemmer (using cgo) are also supported by our pure Go snowball stemmers: https://github.com/blevesearch/snowballstem

The only 2 languages not covered there are Japanese, which we plan to continue supporting, and Thai, which uses a dictionary-based tokenizer as part of ICU. So it seems like Thai is the only language we would lose support for. Are you aware of any alternative tokenizers for Thai?

mosuka · 2020-04-11T07:24:42Z

I was not aware of the existence of snowballstem. With this, I don't need to use libstemmer. Thank you for letting me know!

How about this for a Thai tokenizer?
https://github.com/veer66/mapkha

Replace cld2 to whatlanggo

c096960

mosuka mentioned this pull request Apr 9, 2020

replace cgo's idea mosuka/blast#53

Open

Format

8d1869e
