
Add BertJapaneseTokenizer support #529

Closed

Conversation

@daixque (Contributor) commented Apr 18, 2023

Overview

As of eland 8.7.0, a model that uses BertJapaneseTokenizer (such as cl-tohoku/bert-base-japanese-v2) cannot be uploaded with the eland_import_hub_model command.

The error message says:

TypeError: Tokenizer type BertJapaneseTokenizer(name_or_path='cl-tohoku/bert-base-japanese-v2', vocab_size=32768, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True) not supported, must be one of: <class 'transformers.models.bart.tokenization_bart.BartTokenizer'>, <class 'transformers.models.bert.tokenization_bert.BertTokenizer'>, <class 'transformers.models.distilbert.tokenization_distilbert.DistilBertTokenizer'>, <class 'transformers.models.dpr.tokenization_dpr.DPRContextEncoderTokenizer'>, <class 'transformers.models.dpr.tokenization_dpr.DPRQuestionEncoderTokenizer'>, <class 'transformers.models.electra.tokenization_electra.ElectraTokenizer'>, <class 'transformers.models.mobilebert.tokenization_mobilebert.MobileBertTokenizer'>, <class 'transformers.models.mpnet.tokenization_mpnet.MPNetTokenizer'>, <class 'transformers.models.retribert.tokenization_retribert.RetriBertTokenizer'>, <class 'transformers.models.roberta.tokenization_roberta.RobertaTokenizer'>, <class 'transformers.models.squeezebert.tokenization_squeezebert.SqueezeBertTokenizer'>

This PR adds BertJapaneseTokenizer to SUPPORTED_TOKENIZERS to avoid this error.
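
For context, a minimal sketch of what the change looks like; the module path and the neighbouring entries are illustrative (taken from the error message above), and only the marked line is new:

# e.g. in eland/ml/pytorch/transformers.py (path assumed for illustration)
import transformers

SUPPORTED_TOKENIZERS = (
    transformers.BartTokenizer,
    transformers.BertTokenizer,
    transformers.BertJapaneseTokenizer,  # <-- the line added by this PR
    transformers.DistilBertTokenizer,
    transformers.ElectraTokenizer,
    transformers.MobileBertTokenizer,
    transformers.MPNetTokenizer,
    transformers.RobertaTokenizer,
    # ... remaining tokenizers from the error message above
)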

Test

I manually tested this change by using:

  • ESS 8.7.0
  • Python 3.9.16 on Colab notebook

Scripts:

# install dependencies
!pip install torch==1.11
!pip install transformers
!pip install sentence_transformers
!pip install fugashi
!pip install ipadic
!pip install unidic_lite

# install modified eland
!git clone https://github.com/daixque/eland.git
!cd eland; pwd; ls; git checkout add_BertJapaneseTokenizer_support; pip install .

# upload the model which uses BertJapaneseTokenizer
!eland_import_hub_model \
--url https://user:password@myess:9243 \
--hub-model-id cl-tohoku/bert-base-japanese-v2 \
--task-type text_embedding \
--start

Then I verified the model in Kibana.

(screenshot: the imported model shown in Kibana)

At least it works without any errors.
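
For reference, one optional way to sanity-check the deployed model from Python (not part of the test above); this sketch assumes elasticsearch-py 8.x, which exposes the _ml/trained_models/<model_id>/_infer endpoint as ml.infer_trained_model:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://user:password@myess:9243")

# Send one document to the deployed text_embedding model.
resp = es.ml.infer_trained_model(
    model_id="cl-tohoku__bert-base-japanese-v2",
    docs=[{"text_field": "これはテストです。"}],
)

# For a text_embedding model the vector should appear under
# inference_results[0].predicted_value (field names may differ slightly by version).
print(resp["inference_results"][0]["predicted_value"][:5])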

Notes

When I checked the trained model via the API, the response was:

{
  "count": 1,
  "trained_model_configs": [
    {
      "model_id": "cl-tohoku__bert-base-japanese-v2",
      "model_type": "pytorch",
      "created_by": "api_user",
      "version": "8.7.0",
      "create_time": 1681702816443,
      "model_size_bytes": 0,
      "estimated_operations": 0,
      "license_level": "platinum",
      "description": "Model cl-tohoku/bert-base-japanese-v2 for task type 'text_embedding'",
      "tags": [],
      "input": {
        "field_names": [
          "text_field"
        ]
      },
      "inference_config": {
        "text_embedding": {
          "vocabulary": {
            "index": ".ml-inference-native-000001"
          },
          "tokenization": {
            "bert": {
              "do_lower_case": false,
              "with_special_tokens": true,
              "max_sequence_length": 512,
              "truncate": "first",
              "span": -1
            }
          }
        }
      },
      "location": {
        "index": {
          "name": ".ml-inference-native-000001"
        }
      }
    }
  ]
}

Here we can see that the tokenization is "bert", so Elasticsearch will use BertTokenizer, and with BertTokenizer the pre-tokenization is done by a whitespace tokenizer. Ideally we should have BertJapaneseTokenizer on the Elasticsearch side as well.

I believe the pre-tokenizer for the WordPiece analyzer for Japanese should be the JapaneseAnalyzer from Kuromoji, but that is part of the Japanese (kuromoji) analysis plugin, so it is not available by default. That area requires separate work.

@davidkyle (Member) commented:

Hi @daixque, thanks for the PR! It is fantastic that you have this Japanese BERT model working in Elastic.

> Ideally we should have BertJapaneseTokenizer on the Elasticsearch side as well.

Yes, Elasticsearch should implement the same MeCab tokeniser that the model was trained with:

https://huggingface.co/cl-tohoku/bert-base-japanese-v2#tokenization

As you noted, the default whitespace tokeniser is used instead. Do you know how this affects the quality of the results?

@daixque (Contributor, Author) commented Apr 19, 2023

@davidkyle On the trained model side, MeCab is used as the pre-tokenizer, so Japanese sentences are tokenized differently than in the training phase. As a result, many sub-words are not found in the vocabulary during inference on Elasticsearch and end up being tokenized as [UNK]. So the results won't be exactly as expected, but some words are still tokenized correctly, so it is still much better than the unsupported state.
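
As a rough illustration of that effect (a sketch, not part of the PR; loading the same vocabulary into a plain BertTokenizer only approximates whitespace pre-tokenization and is not literally what Elasticsearch does):

from transformers import BertJapaneseTokenizer, BertTokenizer

hub_id = "cl-tohoku/bert-base-japanese-v2"
text = "日本語の文章をトークン化します。"  # "Tokenize a Japanese sentence."

# What the model was trained with: MeCab word segmentation, then WordPiece.
mecab_tokenizer = BertJapaneseTokenizer.from_pretrained(hub_id)
print(mecab_tokenizer.tokenize(text))

# Same WordPiece vocabulary without MeCab segmentation: unsegmented runs of
# Japanese text often fail to match the vocabulary and fall back to [UNK].
plain_tokenizer = BertTokenizer.from_pretrained(hub_id)
print(plain_tokenizer.tokenize(text))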

@davidkyle (Member) commented:

> Note that to use the BertJapaneseTokenizer, MeCab must be installed on your system.
>
> MeCab requires the following additional packages:

The Eland eland_import_hub_model script needs to use the BertJapaneseTokenizer to create the inputs for tracing the model into TorchScript format. There are no extra dependencies on the Elasticsearch server; the Japanese tokenizer is only required when tracing the model (Eland uploads the traced model to Elasticsearch).
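
As a concrete illustration of that client-side requirement (a minimal sketch), loading the tokenizer during tracing is what pulls in fugashi and a MeCab dictionary such as unidic_lite or ipadic, which is why those packages are installed in the Colab script above:

from transformers import AutoTokenizer

# Loading this tokenizer requires fugashi and a MeCab dictionary locally;
# Elasticsearch itself never needs them.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")
print(type(tokenizer).__name__)   # expected: BertJapaneseTokenizer
print(tokenizer.tokenize("日本語のテキストをトークン化します。"))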

@davidkyle (Member) left a review comment:

LGTM

@daixque (Contributor, Author) commented Jun 12, 2023

This is no longer needed; please merge #534 instead.
