add japanese bert pretrained model #118

Draft · wants to merge 6 commits into master
Conversation

@gorogoroyasu

In order to use the bert-base-japanese-whole-word-masking model, I installed transformers independently and fixed a few parts of the code.

@congdoanit98

> In order to use the bert-base-japanese-whole-word-masking model, I installed transformers independently and fixed a few parts of the code.

Hi guys, can you tell me how to get the new BERT pre-trained model? Thank you.

@beanandrew

Hello, thanks for your contribution. I notice that you didn't change some of the functions that preprocess the English datasets in the data_builder file, and you swapped in the multilingual model for the old one, so I guess you trained your Japanese model on the English datasets. Is my guess correct? Looking forward to your reply. Thank you!

@gorogoroyasu
Author

@congdoanit98
Sorry for the late reply. I downloaded it from this link:
https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking

@beanandrew
Yeah, actually the work was not finished, and I had to close this PR... Anyway, I will try to answer your questions.

> you didn't change some of the functions that preprocess the English datasets

You're correct, but I used a Japanese dataset.
I wrote some code that does the same thing as steps 1 to 4 in Japanese (it simply generates the JSON files), preprocessed my dataset with it, and then fed the result into step 5. Note that you have to tokenize the Japanese text with the Japanese tokenizer (BertJapaneseTokenizer) from huggingface/transformers.
After that, I changed the pretrained-model loading from bert-base-uncased to cl-tohoku/bert-base-japanese-whole-word-masking and ran the training.
You also have to add these tokens ("[unused0]", "[unused1]", "[unused2]", "[unused3]", "[unused4]", "[unused5]", "[unused6]") to token.txt.
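Roughly, the tokenizer and model switch looks like the sketch below. This is written with current transformers class names, so the exact code in my branch may differ, and the MeCab dependencies (e.g. fugashi/ipadic) are assumed to be installed.

from transformers import BertJapaneseTokenizer, BertModel

# Load the Japanese tokenizer and pretrained weights instead of bert-base-uncased.
model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# MeCab word segmentation followed by WordPiece, then ids from the Japanese vocab.
tokens = tokenizer.tokenize('吾輩は猫である。')
ids = tokenizer.convert_tokens_to_ids(tokens)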

I think those were all the changes.
If the above does not work, don't hesitate to ask me.

Thanks.

gorogoroyasu marked this pull request as draft on January 12, 2021, 10:56
@beanandrew

Thanks for your reply.
I am also trying to transfer this work to another language. Even though I finished steps 1 to 4 with my own code and only use the format_to_bert function for step 5, I noticed that some functions still need to be changed.
For example, the _rouge_clean() function in src/data_builder.py, shown below, is used in step 5 to strip punctuation from sentences. It does this by removing every character that is not a-z, A-Z, 0-9, or a space, which means Japanese text is removed entirely and sent_labels comes back as an empty list.

def _rouge_clean(s):
    return re.sub(r'[^a-zA-Z0-9 ]', '', s) 
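To illustrate the point (a quick check I added here, not code from the repo), the cleaner strips a Japanese sentence down to nothing while leaving English mostly intact:

import re

def _rouge_clean(s):
    return re.sub(r'[^a-zA-Z0-9 ]', '', s)

print(_rouge_clean('これはテストです。'))  # -> '' : every character is removed
print(_rouge_clean('This is a test.'))     # -> 'This is a test'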

I am still changing the code and haven't run experiments yet, so I want to know how you solved problems like this, or whether, after your experiments, these problems turned out not to affect the results.
Hoping for your reply, thank you!

@gorogoroyasu
Author

gorogoroyasu commented Jan 12, 2021

Well, you are correct again..
I didn't pay attention to the _rouge_clean function or some of the others in format_to_bert, but it does seem critical to the generated dataset.

I ran my experiments with the original step 5 code, and I could generate some summaries with reasonable results.
However, the code you mentioned seems to have a big impact on the EXT result, which can lead to a bad EXTABS result.
I have to modify it and re-run the experiments.
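Something along these lines might work (just my rough guess, not tested; the Unicode ranges are an assumption and may need adjusting for your corpus):

import re

def _rouge_clean(s):
    # Keep ASCII letters, digits, and spaces, plus Hiragana, Katakana,
    # and the common CJK ideograph range, instead of ASCII only.
    return re.sub(r'[^a-zA-Z0-9 \u3040-\u30ff\u4e00-\u9fff]', '', s)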

I hope this information will support your work.

@beanandrew

Thanks for your reply, your answer helps me a lot!
By the way, I want to ask you a question about the convert_tokens_to_ids() function of the BertTokenizer class in src/others/tokenization.py.

    def convert_tokens_to_ids(self, tokens):
        """Converts a sequence of tokens into ids using the vocab."""
        ids = []

        for token in tokens:
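            # tokens listed in never_split (e.g. [CLS], [SEP]) are skipped here and never get an id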
            if(token in self.never_split):
                continue
            else:
                ids.append(self.vocab[token])
        return ids

When I debug the code in step 5, I notice that because of this code, tokens like [CLS] and [SEP] are skipped, and that causes a "CUDAType Error" when I train on the preprocessed dataset. So I changed the code as follows, and the error no longer appears.

    def convert_tokens_to_ids(self, tokens):
        """Converts a sequence of tokens into ids using the vocab."""
        ids = []

        for token in tokens:
            ids.append(self.vocab[token])
        return ids

I want to know: did you run into errors like this when you used the original code, or is this just my own problem?
Also, if I want to test my model in another language with -mode test, should I make any other changes to the code?
Hoping for your reply. Thank you!

@gorogoroyasu
Author

Hmm, I couldn't find the code you mentioned on the master branch. Anyway, your change seems to work beautifully.
Please check it again:
https://github.com/nlpyang/PreSumm/blame/master/src/others/tokenization.py#L108

I remember that I commented out the code below (the ROUGE score calculator), because I couldn't solve the errors coming from the pyrouge library.
https://github.com/nlpyang/PreSumm/blame/master/src/models/predictor.py#L188

Instead of the pyrouge library, I used a different one. I changed the code to export the summarized text, and after all the results were written, I evaluated the ROUGE score separately. If you can install the pyrouge library correctly, I don't think you need to worry about this.
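As a rough sketch of that offline evaluation (the rouge-score package and the file names here are just examples, not necessarily what I actually used; for Japanese the text has to be pre-tokenized into space-separated tokens first):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)

# Hypothetical output files: one candidate summary and one reference per line.
with open('results.candidate', encoding='utf-8') as cand, \
        open('results.gold', encoding='utf-8') as gold:
    scores = [scorer.score(ref.strip(), hyp.strip()) for hyp, ref in zip(cand, gold)]

print('ROUGE-1 F1:', sum(s['rouge1'].fmeasure for s in scores) / len(scores))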

Additionally, the vocabulary size is very important. In the Japanese token.txt I couldn't find anything like [unused0], so I extended token.txt to add those tokens. I think you already know what I mean, but just to make sure :)
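For reference, extending the vocab file is basically this (the file name follows this PR; if the vocab actually grows, the model's embedding matrix may also need resizing, e.g. with resize_token_embeddings in transformers):

# Append the placeholder tokens used by PreSumm to the Japanese vocab file.
with open('token.txt', 'a', encoding='utf-8') as f:
    for i in range(7):
        f.write('[unused{}]\n'.format(i))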

@beanandrew

Thank you very much for your quick reply!
I followed your advice and checked the master branch, and the code was not there. I also checked other branches and my own repo, and it was nowhere to be found... Maybe I copied a wrong version of the project and that led to this confusing error.
Your change adding the [unused0] to [unused6] tokens is very useful; before I saw your work I didn't know how to solve this problem and had just roughly added [unused0] to the vocab... Thanks for your work!
Now I can finally run experiments on my datasets with your help. I will contact you if I have some new findings~
