Add mT5 #98

haileyschoelkopf · 2021-11-02T00:14:46Z

add mT5 model (using a checkpoint fine-tuned on the XLSum dataset.)

Ready to merge, but still todo:

possibly adding the rest of the 101 languages that mT5-base was trained on to supported languages, instead of just including the languages in XLSum as supported languages (~45 languages)

niansong1996 · 2021-11-06T20:06:03Z

Thanks a lot for the PR, Nick!

I haven't had time to review everything yet, but I will do so ASAP.

One thing I noticed is that we don't have the mBART model listed in the README.md tables of supported models. Can you add it together with mT5 in this PR? Thanks!

haileyschoelkopf · 2021-11-08T19:04:34Z

Sure, I can add documentation for this and mBART in the PR!

niansong1996 · 2021-11-11T18:24:32Z

README.md

@@ -235,7 +238,7 @@ print(corpus)
 ```

 ### Loading a custom dataset
-You can use load custom data using the `CustomDataset` class that puts the data in the SummerTime dataset Class
+You can usecustom data using the `CustomDataset` class that loads the data in the SummerTime dataset Class


Ooops, missed a space here

niansong1996

Looks good. One slight issue is that a lot of the changes seem to be duplicates of #96, presumably because of the delay in my reviews on that branch. Sorry about that.

Let's try to merge #96 first and pull from main for this branch.

niansong1996 · 2021-11-11T18:26:22Z

summertime/model/single_doc/bart_model.py

@@ -22,6 +22,8 @@ def __init__(self, device="cpu"):
    def summarize(self, corpus, queries=None):
        self.assert_summ_input_type(corpus, queries)

+        self.assert_summ_input_language(corpus, queries)


I have made a comment about this in #96. After the refactoring on that branch, let's merge it to main and pull main into this branch so it's fixed here automatically as well.

niansong1996 · 2021-11-11T21:13:34Z

Okay, now that #96 is merged, should we rebase this branch on main or pull from main?

haileyschoelkopf · 2021-11-11T23:19:35Z

@niansong1996 This PR should be all set for review now!

niansong1996 · 2021-11-12T21:25:23Z

summertime/model/base_model.py

@@ -86,6 +86,16 @@ def generate_basic_description(cls) -> str:

        return basic_description

+    # TODO nick: implement this function eventually!


Should this be in base_model.py or in the multilingual_model?

Oh, I see that you are adding the function of returning "english" for non-multilingual models. Okay, then this is good.
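The behaviour described here, where non-multilingual models simply report English while multilingual models check the input language, could look roughly like the sketch below. The class names, the `supported_languages` list, and `toy_detect_language` are hypothetical stand-ins for illustration, not the code in this PR; only the method name `assert_summ_input_language` and its `(corpus, queries)` arguments come from the diff above.

```python
from typing import List, Optional


def toy_detect_language(text: str) -> str:
    # Placeholder heuristic for illustration only, not a real language identifier.
    return "spanish" if any(ch in text for ch in "ñ¿¡") else "english"


class ToyBaseModel:
    """Hypothetical stand-in for a monolingual base model."""

    is_multilingual = False

    @classmethod
    def assert_summ_input_language(
        cls, corpus: List[str], queries: Optional[List[str]] = None
    ) -> str:
        # Non-multilingual models skip detection and always report English.
        return "english"


class ToyMultilingualModel(ToyBaseModel):
    """Hypothetical stand-in for a multilingual model such as mT5 or mBART."""

    is_multilingual = True
    supported_languages = ["english", "spanish"]  # illustrative subset

    @classmethod
    def assert_summ_input_language(
        cls, corpus: List[str], queries: Optional[List[str]] = None
    ) -> str:
        # Detect the language of the first document and reject unsupported ones.
        language = toy_detect_language(corpus[0])
        if language not in cls.supported_languages:
            raise ValueError(f"Unsupported input language: {language}")
        return language
```

Keeping the default in the base class means every model can call the same method from `summarize` without first checking whether it is multilingual.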

niansong1996 · 2021-11-12T21:31:45Z

summertime/model/single_doc/multilingual/mt5_model.py

+    is_neural = True
+    is_multilingual = True
+
+    lang_tag_dict = {


Hmm, if all the keys and values are the same, why is it a dict and not a list? I understand that you want to maintain some consistency across the different multilingual models. If it's not an mBART-specific thing, then it may be better to store them in a list and initialize the dict from that list. That leaves less room for error.

Yes, I was using a dict just to stay consistent with mBART. I can change this dictionary to be initialized from a list, though.
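A minimal sketch of the list-based initialization being discussed, assuming the keys and values really are identical. The language names below are an illustrative subset, not the actual contents of `lang_tag_dict` in `mt5_model.py`.

```python
# Illustrative subset only; the real model supports many more languages.
SUPPORTED_LANGUAGES = ["english", "spanish", "french"]

# Build the identity mapping from the list so each name is typed only once.
lang_tag_dict = {language: language for language in SUPPORTED_LANGUAGES}
```

Code that already indexes `lang_tag_dict[language]`, as the mBART model does, keeps working unchanged, while a mistyped value can no longer drift away from its key.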

niansong1996 · 2021-11-12T21:35:41Z

summertime/model/single_doc/multilingual/mbart_model.py

            "Weaknesses: \n - High memory usage"
            "Initialization arguments: \n "
            "- `device = 'cpu'` specifies the device the model is stored on and uses for computation. "
-            "Use `device='gpu'` to run on an Nvidia GPU."
+            "Use `device='cuda'` to run on an Nvidia GPU."


Oh, no. I think this typo is actually common across all our models... Good catch!

But do you mind fixing the others as well?

niansong1996 · 2021-11-12T21:36:03Z

summertime/model/single_doc/multilingual/mbart_model.py

@@ -93,7 +93,7 @@ def summarize(self, corpus, queries=None):

        encoded_summaries = self.model.generate(
            **batch,
-            decoder_start_token_id=self.tokenizer.lang_code_to_id[lang_code],
+            forced_bos_token_id=self.tokenizer.lang_code_to_id[lang_code],


Hmm, what's the difference?

I don't think there is a difference. I will switch back to decoder_start_token_id because its name is more self-explanatory, imo.
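For context on the question above: as the Hugging Face `generate` documentation describes them, `decoder_start_token_id` is the token the decoder input begins with, while `forced_bos_token_id` is the token forced as the first generated token after that start token. The toy loop below only illustrates where each id lands in the output sequence. `toy_generate`, the fake next-token function, and the token ids are all hypothetical, and it makes no claim about which option produces better summaries for a given mBART checkpoint.

```python
from typing import Callable, List, Optional


def toy_generate(
    next_token_fn: Callable[[List[int]], int],
    decoder_start_token_id: int,
    forced_bos_token_id: Optional[int] = None,
    max_new_tokens: int = 3,
) -> List[int]:
    # The decoder input always begins with the start token.
    sequence = [decoder_start_token_id]
    for step in range(max_new_tokens):
        if step == 0 and forced_bos_token_id is not None:
            # The first *generated* token is overridden by the forced id.
            token = forced_bos_token_id
        else:
            token = next_token_fn(sequence)
        sequence.append(token)
    return sequence


EOS_ID = 2     # hypothetical id of the default start token
LANG_ID = 100  # hypothetical id of a language code token


def fake_model(sequence: List[int]) -> int:
    # Stand-in for a real model: just returns the previous id plus one.
    return sequence[-1] + 1


# Language code used as the decoder start token.
as_start = toy_generate(fake_model, decoder_start_token_id=LANG_ID)

# Language code forced as the first generated token after the start token.
as_forced = toy_generate(
    fake_model, decoder_start_token_id=EOS_ID, forced_bos_token_id=LANG_ID
)
```

In this sketch the language token sits at position 0 in `as_start` and at position 1 in `as_forced`, so the two arguments place it differently even though both put it near the front.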

niansong1996

Great work! Left a few comments.

A more general question: did you write model-specific tests (for mBART and mT5) in tests/model_test.py? You can find more templates and examples in that file.

haileyschoelkopf · 2021-11-13T19:34:05Z

@niansong1996 should be ready for merge! I have written generic tests for multilingual models (using a Spanish language instance from MLSum) but have not written any specific tests as was done for HMNet. Will do that in another PR though!

niansong1996 · 2021-11-13T22:32:35Z

Awesome! Merging this PR now.

add mt5 model

90850b3

haileyschoelkopf changed the title ~~Nick/mt5~~ Add mT5 Nov 2, 2021

reformatting

e594938

NickSchoelkopf and others added 8 commits November 8, 2021 15:23

merge with main

30cfff2

add rest of mt5 languages to dict

1415080

use download caching

7e6b81c

reformatting

d751db8

start on readme edits

897ca45

finish first draft of multilingual model documentation

bf35ca3

reformatting

68a8635

Merge branch 'main' into nick/mT5

5528e40

haileyschoelkopf requested a review from niansong1996 November 9, 2021 19:23

niansong1996 reviewed Nov 11, 2021

niansong1996 mentioned this pull request Nov 11, 2021

Add translation pipeline model #110

Merged

NickSchoelkopf added 2 commits November 11, 2021 18:14

Merge branch 'main' into nick/mT5

51be8db

[skip-ci] fix readme typo

92ad57f

haileyschoelkopf requested a review from niansong1996 November 11, 2021 23:19

fix mex additional merge conflict

ce859d6

niansong1996 reviewed Nov 12, 2021

fix changes for merge

32a9042

niansong1996 merged commit db7b2ad into main Nov 13, 2021

haileyschoelkopf deleted the nick/mT5 branch November 14, 2021 20:43
