add japanese bert pretrained model #118

Draft · wants to merge 6 commits into master
Conversation

@gorogoroyasu

In order to use the bert-base-japanese-whole-word-masking model, I installed transformers independently and fixed a few parts of the code.

@congdoanit98

> In order to use the bert-base-japanese-whole-word-masking model, I installed transformers independently and fixed a few parts of the code.

Hi guys, can you tell me how to get the new BERT pre-trained model? Thank you.

@beanandrew

Hello, thanks for your contribution. I notice that you didn't change some of the functions that preprocess the English datasets in the data_builder file, and you swapped in the multilingual model for the old one, so I guess you trained your Japanese model on the English datasets. Is my guess correct? Looking forward to your reply. Thank you!

@gorogoroyasu
Author

@congdoanit98
Sorry for the late reply. I downloaded it from this link:
https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking

@beanandrew
Yeah, actually the work was not finished, and I had to close this PR... Anyway, I will try to answer your questions.

> you didn't change some of the functions that preprocess the English datasets

You're correct, but I used a Japanese dataset.
I wrote some code that does the same thing as steps 1 to 4 in Japanese (it simply generates the JSON files), preprocessed my dataset with it, and then fed the result into step 5. Note that you have to tokenize the Japanese text with the Japanese tokenizer (BertJapaneseTokenizer) from huggingface/transformers.
After that, I changed the pretrained-model loading from bert-base-uncased to cl-tohoku/bert-base-japanese-whole-word-masking and ran the training.
You also have to add these tokens ("[unused0]", "[unused1]", "[unused2]", "[unused3]", "[unused4]", "[unused5]", "[unused6]") to token.txt.
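Roughly, the tokenizer and model switch looks like the sketch below. This is written with current transformers class names, so the exact code in my branch may differ, and the MeCab dependencies (e.g. fugashi/ipadic) are assumed to be installed.

from transformers import BertJapaneseTokenizer, BertModel

# Load the Japanese tokenizer and pretrained weights instead of bert-base-uncased.
model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# MeCab word segmentation followed by WordPiece, then ids from the Japanese vocab.
tokens = tokenizer.tokenize('吾輩は猫である。')
ids = tokenizer.convert_tokens_to_ids(tokens)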

I think those were all the changes.
If the above does not work, don't hesitate to ask me.

Thanks.

gorogoroyasu marked this pull request as draft on January 12, 2021, 10:56
@beanandrew

Thanks for your reply.
I am also trying to transfer this work to another language. Even though I finished steps 1 to 4 with my own code and only use the format_to_bert function for step 5, I noticed that some functions still need to be changed.
For example, the _rouge_clean() function in src/data_builder.py, shown below, is used in step 5 to strip punctuation from sentences. It does this by removing every character that is not a-z, A-Z, 0-9, or a space, which means Japanese text is removed entirely and sent_labels comes back as an empty list.

def _rouge_clean(s):
    return re.sub(r'[^a-zA-Z0-9 ]', '', s) 
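To illustrate the point (a quick check I added here, not code from the repo), the cleaner strips a Japanese sentence down to nothing while leaving English mostly intact:

import re

def _rouge_clean(s):
    return re.sub(r'[^a-zA-Z0-9 ]', '', s)

print(_rouge_clean('これはテストです。'))  # -> '' : every character is removed
print(_rouge_clean('This is a test.'))     # -> 'This is a test'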

I am still changing the code and haven't run experiments yet, so I want to know how you solved problems like this, or whether, after your experiments, these problems turned out not to affect the results.
Hoping for your reply, thank you!

@gorogoroyasu
Author

gorogoroyasu commented Jan 12, 2021

Well, you are correct again..
I didn't pay attention to the _rouge_clean function or some of the others in format_to_bert, but it does seem critical to the generated dataset.

I ran my experiments with the original step 5 code, and I could generate some summaries with reasonable results.
However, the code you mentioned seems to have a big impact on the EXT result, which can lead to a bad EXTABS result.
I have to modify it and re-run the experiments.
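Something along these lines might work (just my rough guess, not tested; the Unicode ranges are an assumption and may need adjusting for your corpus):

import re

def _rouge_clean(s):
    # Keep ASCII letters, digits, and spaces, plus Hiragana, Katakana,
    # and the common CJK ideograph range, instead of ASCII only.
    return re.sub(r'[^a-zA-Z0-9 \u3040-\u30ff\u4e00-\u9fff]', '', s)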

I hope this information will support your work.

@beanandrew

Thanks for your reply, your answer helps me a lot!
By the way, I want to ask you a question about the convert_tokens_to_ids() function of the BertTokenizer class in src/others/tokenization.py.

    def convert_tokens_to_ids(self, tokens):
        """Converts a sequence of tokens into ids using the vocab."""
        ids = []

        for token in tokens:
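            # tokens listed in never_split (e.g. [CLS], [SEP]) are skipped here and never get an id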
            if(token in self.never_split):
                continue
            else:
                ids.append(self.vocab[token])
        return ids

When I debug the code in step 5, I notice that because of this code, tokens like [CLS] and [SEP] are skipped, and that causes a "CUDAType Error" when I train on the preprocessed dataset. So I changed the code as follows, and the error no longer appears.

    def convert_tokens_to_ids(self, tokens):
        """Converts a sequence of tokens into ids using the vocab."""
        ids = []

        for token in tokens:
            ids.append(self.vocab[token])
        return ids

I want to know: did you run into errors like this when you used the original code, or is this just my own problem?
Also, if I want to test my model in another language with -mode test, should I make any other changes to the code?
Hoping for your reply. Thank you!

@gorogoroyasu
Author

Hmm, I couldn't find the code you mentioned on the master branch. Anyway, your change seems to work beautifully.
Please check it again:
https://github.com/nlpyang/PreSumm/blame/master/src/others/tokenization.py#L108

I remember that I commented out the code below (the ROUGE score calculator), because I couldn't solve the errors coming from the pyrouge library.
https://github.com/nlpyang/PreSumm/blame/master/src/models/predictor.py#L188

Instead of the pyrouge library, I used a different one. I changed the code to export the summarized text, and after all the results were written, I evaluated the ROUGE score separately. If you can install the pyrouge library correctly, I don't think you need to worry about this.
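As a rough sketch of that offline evaluation (the rouge-score package and the file names here are just examples, not necessarily what I actually used; for Japanese the text has to be pre-tokenized into space-separated tokens first):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)

# Hypothetical output files: one candidate summary and one reference per line.
with open('results.candidate', encoding='utf-8') as cand, \
        open('results.gold', encoding='utf-8') as gold:
    scores = [scorer.score(ref.strip(), hyp.strip()) for hyp, ref in zip(cand, gold)]

print('ROUGE-1 F1:', sum(s['rouge1'].fmeasure for s in scores) / len(scores))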

Additionally, the vocabulary size is very important. In the Japanese token.txt I couldn't find anything like [unused0], so I extended token.txt to add those tokens. I think you already know what I mean, but just to make sure :)
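For reference, extending the vocab file is basically this (the file name follows this PR; if the vocab actually grows, the model's embedding matrix may also need resizing, e.g. with resize_token_embeddings in transformers):

# Append the placeholder tokens used by PreSumm to the Japanese vocab file.
with open('token.txt', 'a', encoding='utf-8') as f:
    for i in range(7):
        f.write('[unused{}]\n'.format(i))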

@beanandrew

Thank you very much for your quick reply!
I followed your advice and checked the master branch, and the code was not there. I also checked other branches and my own repo, and it was nowhere to be found... Maybe I copied a wrong version of the project and that led to this confusing error.
Your change adding the [unused0] to [unused6] tokens is very useful; before I saw your work I didn't know how to solve this problem and had just roughly added [unused0] to the vocab... Thanks for your work!
Now I can finally run experiments on my datasets with your help. I will contact you if I have some new findings~
