Overview
As of eland-8.7.0, models that use BertJapaneseTokenizer (such as cl-tohoku/bert-base-japanese-v2) can't be uploaded with the eland_import_hub_model command. The error message says:
This PR adds BertJapaneseTokenizer to SUPPORTED_TOKENIZERS to avoid this error.
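The effect of the change can be sketched roughly as follows, assuming the support check is an isinstance test against a tuple of tokenizer classes. The two classes here are empty stand-ins rather than the real transformers classes, and check_tokenizer and its error text are hypothetical, not eland's actual code.

```python
# Empty stand-ins for the real tokenizer classes (assumption: illustration only).
class BertTokenizer:
    pass


class BertJapaneseTokenizer:
    pass


# Hypothetical "before" and "after" versions of the supported-tokenizer tuple.
SUPPORTED_BEFORE = (BertTokenizer,)
SUPPORTED_AFTER = (BertTokenizer, BertJapaneseTokenizer)


def check_tokenizer(tokenizer, supported):
    """Return the tokenizer if its class is supported, else raise TypeError.

    Hypothetical helper; the message text is illustrative, not eland's.
    """
    if not isinstance(tokenizer, supported):
        names = ", ".join(cls.__name__ for cls in supported)
        raise TypeError(
            f"unsupported tokenizer {type(tokenizer).__name__}; expected one of: {names}"
        )
    return tokenizer


if __name__ == "__main__":
    tok = BertJapaneseTokenizer()
    try:
        check_tokenizer(tok, SUPPORTED_BEFORE)
    except TypeError as err:
        print("before:", err)
    print("after:", type(check_tokenizer(tok, SUPPORTED_AFTER)).__name__)
```

With a tuple-based check like this, supporting a new tokenizer class only requires adding it to the tuple, which is the shape of the change in this PR.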
Test
I manually tested this change by using:
Scripts:
Then I verified the model in Kibana.
At least it works without any errors.
Notes
When I checked the trained model via the API, the response was:
Here we can see that tokenization is "bert", so Elasticsearch will use BertTokenizer.
With BertTokenizer, pre-tokenization is done by WhitespaceTokenizer.
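To illustrate why whitespace pre-tokenization is a concern for Japanese, here is a minimal sketch. It uses Python's str.split as a simplified stand-in for a whitespace tokenizer; it is not Elasticsearch's WhitespaceTokenizer, and the sample sentences are my own.

```python
def whitespace_pretokenize(text):
    """Split on runs of whitespace (simplified stand-in for a whitespace tokenizer)."""
    return text.split()


# English words are separated by spaces, so the split yields one piece per word.
english = "I live in Tokyo"

# Japanese is normally written without spaces between words,
# so the whole sentence comes back as a single piece.
japanese = "東京都に住んでいます"

if __name__ == "__main__":
    print(len(whitespace_pretokenize(english)))
    print(len(whitespace_pretokenize(japanese)))
```

Because the Japanese sentence stays in one piece, the WordPiece step has to work on the unsegmented string instead of on words produced by a morphological analyzer.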
Ideally we should have BertJapaneseTokenizer on the Elasticsearch side as well.
I believe the pre-tokenizer for WordPieceAnalyzer for Japanese should be JapaneseAnalyzer from Kuromoji, but it is part of the Japanese (kuromoji) analysis plugin, so it is not available by default. This area requires separate work.