
Add BertJapaneseTokenizer support #529

Closed

Conversation

@daixque (Contributor) commented Apr 18, 2023

Overview

As of eland 8.7.0, a model that uses BertJapaneseTokenizer (such as cl-tohoku/bert-base-japanese-v2) cannot be uploaded with the eland_import_hub_model command.

The error message says:

TypeError: Tokenizer type BertJapaneseTokenizer(name_or_path='cl-tohoku/bert-base-japanese-v2', vocab_size=32768, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True) not supported, must be one of: <class 'transformers.models.bart.tokenization_bart.BartTokenizer'>, <class 'transformers.models.bert.tokenization_bert.BertTokenizer'>, <class 'transformers.models.distilbert.tokenization_distilbert.DistilBertTokenizer'>, <class 'transformers.models.dpr.tokenization_dpr.DPRContextEncoderTokenizer'>, <class 'transformers.models.dpr.tokenization_dpr.DPRQuestionEncoderTokenizer'>, <class 'transformers.models.electra.tokenization_electra.ElectraTokenizer'>, <class 'transformers.models.mobilebert.tokenization_mobilebert.MobileBertTokenizer'>, <class 'transformers.models.mpnet.tokenization_mpnet.MPNetTokenizer'>, <class 'transformers.models.retribert.tokenization_retribert.RetriBertTokenizer'>, <class 'transformers.models.roberta.tokenization_roberta.RobertaTokenizer'>, <class 'transformers.models.squeezebert.tokenization_squeezebert.SqueezeBertTokenizer'>

This PR adds BertJapaneseTokenizer to SUPPORTED_TOKENIZERS to avoid this error.
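
For context, a minimal sketch of what the change looks like; the module path and the neighbouring entries are illustrative (taken from the error message above), and only the marked line is new:

# e.g. in eland/ml/pytorch/transformers.py (path assumed for illustration)
import transformers

SUPPORTED_TOKENIZERS = (
    transformers.BartTokenizer,
    transformers.BertTokenizer,
    transformers.BertJapaneseTokenizer,  # <-- the line added by this PR
    transformers.DistilBertTokenizer,
    transformers.ElectraTokenizer,
    transformers.MobileBertTokenizer,
    transformers.MPNetTokenizer,
    transformers.RobertaTokenizer,
    # ... remaining tokenizers from the error message above
)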

Test

I manually tested this change by using:

  • ESS 8.7.0
  • Python 3.9.16 on Colab notebook

Scripts:

# install dependencies
!pip install torch==1.11
!pip install transformers
!pip install sentence_transformers
!pip install fugashi
!pip install ipadic
!pip install unidic_lite

# install modified eland
!git clone https://github.com/daixque/eland.git
!cd eland; pwd; ls; git checkout add_BertJapaneseTokenizer_support; pip install .

# upload the model which uses BertJapaneseTokenizer
!eland_import_hub_model \
--url https://user:password@myess:9243 \
--hub-model-id cl-tohoku/bert-base-japanese-v2 \
--task-type text_embedding \
--start

Then I verified the model in Kibana.

(screenshot: the imported model shown in Kibana)

At least it works without any errors.
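
For reference, one optional way to sanity-check the deployed model from Python (not part of the test above); this sketch assumes elasticsearch-py 8.x, which exposes the _ml/trained_models/<model_id>/_infer endpoint as ml.infer_trained_model:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://user:password@myess:9243")

# Send one document to the deployed text_embedding model.
resp = es.ml.infer_trained_model(
    model_id="cl-tohoku__bert-base-japanese-v2",
    docs=[{"text_field": "これはテストです。"}],
)

# For a text_embedding model the vector should appear under
# inference_results[0].predicted_value (field names may differ slightly by version).
print(resp["inference_results"][0]["predicted_value"][:5])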

Notes

When I checked the trained model via the API, the response was:

{
  "count": 1,
  "trained_model_configs": [
    {
      "model_id": "cl-tohoku__bert-base-japanese-v2",
      "model_type": "pytorch",
      "created_by": "api_user",
      "version": "8.7.0",
      "create_time": 1681702816443,
      "model_size_bytes": 0,
      "estimated_operations": 0,
      "license_level": "platinum",
      "description": "Model cl-tohoku/bert-base-japanese-v2 for task type 'text_embedding'",
      "tags": [],
      "input": {
        "field_names": [
          "text_field"
        ]
      },
      "inference_config": {
        "text_embedding": {
          "vocabulary": {
            "index": ".ml-inference-native-000001"
          },
          "tokenization": {
            "bert": {
              "do_lower_case": false,
              "with_special_tokens": true,
              "max_sequence_length": 512,
              "truncate": "first",
              "span": -1
            }
          }
        }
      },
      "location": {
        "index": {
          "name": ".ml-inference-native-000001"
        }
      }
    }
  ]
}

Here we can see that the tokenization is "bert", so Elasticsearch will use BertTokenizer, and with BertTokenizer the pre-tokenization is done by a whitespace tokenizer. Ideally we should have BertJapaneseTokenizer on the Elasticsearch side as well.

I believe the pre-tokenizer for the WordPiece analyzer for Japanese should be the JapaneseAnalyzer from Kuromoji, but that is part of the Japanese (kuromoji) analysis plugin, so it is not available by default. That area requires separate work.

@davidkyle (Member) commented:

Hi @daixque, thanks for the PR! It is fantastic that you have this Japanese BERT model working in Elastic.

> Ideally we should have BertJapaneseTokenizer on the Elasticsearch side as well.

Yes, Elasticsearch should implement the same MeCab tokeniser that the model was trained with:

https://huggingface.co/cl-tohoku/bert-base-japanese-v2#tokenization

As you noted, the default whitespace tokeniser is used instead. Do you know how this affects the quality of the results?

@daixque (Contributor, Author) commented Apr 19, 2023

@davidkyle On the trained model side, MeCab is used as the pre-tokenizer, so Japanese sentences are tokenized differently than in the training phase. As a result, many sub-words are not found in the vocabulary during inference on Elasticsearch and end up being tokenized as [UNK]. So the results won't be exactly as expected, but some words are still tokenized correctly, so it is still much better than the unsupported state.
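
As a rough illustration of that effect (a sketch, not part of the PR; loading the same vocabulary into a plain BertTokenizer only approximates whitespace pre-tokenization and is not literally what Elasticsearch does):

from transformers import BertJapaneseTokenizer, BertTokenizer

hub_id = "cl-tohoku/bert-base-japanese-v2"
text = "日本語の文章をトークン化します。"  # "Tokenize a Japanese sentence."

# What the model was trained with: MeCab word segmentation, then WordPiece.
mecab_tokenizer = BertJapaneseTokenizer.from_pretrained(hub_id)
print(mecab_tokenizer.tokenize(text))

# Same WordPiece vocabulary without MeCab segmentation: unsegmented runs of
# Japanese text often fail to match the vocabulary and fall back to [UNK].
plain_tokenizer = BertTokenizer.from_pretrained(hub_id)
print(plain_tokenizer.tokenize(text))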

@davidkyle (Member) commented:

> Note that to use the BertJapaneseTokenizer, MeCab must be installed on your system.
>
> MeCab requires the following additional packages:

The Eland eland_import_hub_model script needs to use the BertJapaneseTokenizer to create the inputs for tracing the model into TorchScript format. There are no extra dependencies on the Elasticsearch server; the Japanese tokenizer is only required when tracing the model (Eland uploads the traced model to Elasticsearch).
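
As a concrete illustration of that client-side requirement (a minimal sketch), loading the tokenizer during tracing is what pulls in fugashi and a MeCab dictionary such as unidic_lite or ipadic, which is why those packages are installed in the Colab script above:

from transformers import AutoTokenizer

# Loading this tokenizer requires fugashi and a MeCab dictionary locally;
# Elasticsearch itself never needs them.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")
print(type(tokenizer).__name__)   # expected: BertJapaneseTokenizer
print(tokenizer.tokenize("日本語のテキストをトークン化します。"))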

@davidkyle (Member) left a review comment:

LGTM

@daixque (Contributor, Author) commented Jun 12, 2023

This is no longer needed; please merge #534 instead.
