Add REALM #13292
Conversation
Hi @qqaatw, thank you so much for putting effort into this PR and providing pre-trained REALM models through the PyTorch Transformers API. I am wondering whether your REALM models in PyTorch can reproduce Table 2 of the original paper? Alternatively, have you verified that the TensorFlow pre-trained model produces the same embeddings as your converted PyTorch models for an arbitrary input sequence? Thanks again for the awesome work!
Hello @OctoberChang, thanks for your reply! This is my first time trying to port a model from TensorFlow, so I may need some time to clarify the structure and behavior of the original model. Currently, the retriever part has been successfully converted. Regarding your concerns, I've verified the retriever's behavior by feeding the same inputs to the TensorFlow and PyTorch models respectively and checking that their outputs are nearly identical. Sadly, I may not have enough resources to run those ablation experiments for now, but I think the results can be reproduced as long as the PyTorch model's behavior is nearly identical to that of the TensorFlow model.
Awesome! Looking forward to this PR and the pre-trained REALM models in PyTorch Transformers!
The reason I didn't add
Therefore, I think moving the fine-tuning code to the research_projects folder or making it a new model would be more appropriate.
Both sound appealing! Let me know if I can help with this.
This is really great work @qqaatw . Realm is a very difficult model and you've done an outstanding job here!
I've mostly left nits, but I would like to discuss two final things here cc @sgugger @LysandreJik @patil-suraj :
- a) I think we should break up the abstraction dependency on BertTokenizer
- b) I'm not sure whether it's a good idea to have a batch_encode_candidates method
Apart from this the PR looks good to me!
@lhoestq - could you also take a final in-depth review here? :-)
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION

    def batch_encode_candidates(self, text, **kwargs):
Should we maybe instead overwrite the __call__ method here? Why do we have to enforce the max-length padding strategy here?
IMO this should be handled in the __call__ method, because this is not aligned with the tokenizer API. For example, see the LUKETokenizer, which overrides __call__ to encode multiple texts:
https://github.com/huggingface/transformers/blob/master/src/transformers/models/luke/tokenization_luke.py#L262
@patrickvonplaten The reason for requiring the max-length padding strategy is that a single input can include many candidates. In the following example, each sample has two candidates, and the two samples form a batch. The goal of this function is to produce an encoded output of shape (batch_size, num_candidates, max_length) for RealmScorer.
# batch_size = 2, num_candidates = 2
batch_texts = [
    ["Hello world!", "Nice to meet you!"],   # First sample
    ["The cute cat.", "The adorable dog."],  # Second sample
]
tokenized_texts = tokenizer.batch_encode_candidates(batch_texts, max_length=10, return_tensors="pt")
print(tokenized_texts.input_ids.shape)  # (2, 2, 10)
Therefore, if we don't adopt the max-length strategy and use the longest strategy instead, then based on the current logic of batch_encode_candidates, the first sample is fed into __call__ first, producing an output of shape (num_candidates, longest_length_of_first_sample). We then do the same for the second sample and get an output of shape (num_candidates, longest_length_of_second_sample). These two outputs cannot be stacked into a batch tensor because their last dimensions differ. We could, however, design more sophisticated logic to support the longest strategy (I can do that in this PR or in a follow-up PR).
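(For context, the current logic is roughly a flatten / encode / reshape pattern along these lines - a simplified sketch, not the exact method in this PR, and the standalone helper signature is an assumption:)

def batch_encode_candidates_sketch(tokenizer, batch_texts, max_length, **kwargs):
    # Flatten the (batch_size, num_candidates) nested lists into one flat list,
    # encode everything with the fixed max-length padding strategy, then reshape
    # back so each sample keeps its candidates grouped together.
    batch_size = len(batch_texts)
    num_candidates = len(batch_texts[0])
    flat_texts = [text for candidates in batch_texts for text in candidates]
    encoded = tokenizer(
        flat_texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
        **kwargs,
    )
    # Each value has shape (batch_size * num_candidates, max_length) before the reshape.
    return {key: value.view(batch_size, num_candidates, -1) for key, value in encoded.items()}

With this shape convention, the max-length strategy is what guarantees that every candidate in every sample pads to the same last dimension.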
@patrickvonplaten, @patil-suraj - This function could be integrated into __call__, but IMO keeping a separate function is clearer and less ambiguous for users: they use batch_encode_candidates when they want to encode samples containing candidates for RealmScorer, and otherwise use the regular __call__ to encode.
I'm ok with overriding __call__ if you think this way is more aligned and intuitive, though.
["[CLS]", "test", "question", "[SEP]", "this", "is", "the", "fourth", "record", "[SEP]"], | ||
) | ||
|
||
def test_block_has_answer(self): |
clean!
self.assertEqual([[-1], [6]], start_pos)
self.assertEqual([[-1], [7]], end_pos)

def test_save_load_pretrained(self):
very nice!
I can definitely help with the uploading part once the PR is merged :-)
Fantastic work @qqaatw, really great!
I have pretty much the same comments as Patrick about the tokenizer abstraction and the batch_encode_candidates method, and left a few more nits :)
Thanks a lot for adding this model.
logger = logging.get_logger(__name__)


def convert_tfrecord_to_np(block_records_path, num_block_records):
is this function used anywhere?
No - good point though! I thought we could leave it, since it shows how to convert between the original records and the "new" numpy format, but I would also be ok with deleting it.
I use this function in the checkpoint conversion script.
As Patrick said, it shows how tf records are converted, and deleting it would be ok too.
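For reference, the conversion is roughly along these lines (a sketch assuming the evidence blocks are stored as raw byte records in a TFRecord file; not necessarily identical to the script in this PR):

import numpy as np
import tensorflow as tf

def convert_tfrecord_to_np_sketch(block_records_path, num_block_records):
    # Each record in the TFRecord file is one raw evidence block (bytes).
    # Read them all and materialize them as a single numpy object array,
    # which can then be saved with np.save for the PyTorch retriever.
    dataset = tf.data.TFRecordDataset(block_records_path, buffer_size=512 * 1024 * 1024)
    dataset = dataset.batch(num_block_records, drop_remainder=True)
    blocks = next(dataset.take(1).as_numpy_iterator())
    return np.asarray(blocks, dtype=object)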
methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as
interpretability and modularity.*
One more nit for a follow-up PR: since this is a complex model, it would be awesome to add a usage example, either here or in the research_projects dir like RAG.
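For what it's worth, such a usage example could look roughly like the following once the checkpoints are uploaded (the hub identifier and the exact output handling here are assumptions, not final):

from transformers import RealmForOpenQA, RealmRetriever, RealmTokenizer

# Hypothetical checkpoint name; the final hub identifiers are decided at upload time.
checkpoint = "google/realm-orqa-nq-openqa"

retriever = RealmRetriever.from_pretrained(checkpoint)
tokenizer = RealmTokenizer.from_pretrained(checkpoint)
model = RealmForOpenQA.from_pretrained(checkpoint, retriever=retriever)

question = "Who is the pioneer in modern computer science?"
question_ids = tokenizer([question], return_tensors="pt")
answer_ids = tokenizer(
    ["alan mathison turing"],
    add_special_tokens=False,
    return_token_type_ids=False,
    return_attention_mask=False,
).input_ids

# The model retrieves evidence blocks, reads them, and predicts an answer span.
reader_output, predicted_answer_ids = model(**question_ids, answer_ids=answer_ids, return_dict=False)
predicted_answer = tokenizer.decode(predicted_answer_ids)
print(predicted_answer)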
Amazing job! Thanks for converting the record blocks to numpy format, it's more convenient this way. Later we can even memory-map the numpy file to avoid filling up the RAM :)
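(As a rough sketch of that idea - assuming the blocks end up in a fixed-dtype array on disk, since pickled object arrays can't be memory-mapped:)

import numpy as np

# "block_records.npy" is a placeholder filename for the converted records.
# mmap_mode="r" keeps the data on disk and pages it in on demand instead of
# loading the whole array into RAM; it only works for fixed-dtype arrays.
block_records = np.load("block_records.npy", mmap_mode="r")
print(block_records.shape)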
I think this PR is more or less ready to be merged. @sgugger @LysandreJik could you give it a quick look maybe?
I think this is just missing a few more code examples, but otherwise good to go!
        Whether or not the evidence block has answer(s).

    Returns:
    """
Same here.
Users will not directly use this model, as it is an internal model of RealmForOpenQA.
LGTM, thanks a lot for your impressive contribution @qqaatw!
All test failures are due to the latest release of Tokenizers. A quick rebase on master should get rid of them.
Thanks for all your work on this!
# [batch_size, joint_seq_len]
marginal_gold_log_probs = joint_gold_log_prob.logsumexp(1)
# []
masked_lm_loss = -torch.nansum(torch.sum(marginal_gold_log_probs * mlm_mask) / torch.sum(mlm_mask))
@qqaatw You used nansum in your PR with REALM. nansum is only available in torch>=1.7.0, but transformers, per the settings in setup.py (https://github.com/huggingface/transformers/blob/main/setup.py#L155), has to support torch>=1.0, so your PR breaks compatibility with torch>=1.0,<1.7.0. Please consider a new PR with compatibility improvements.
@sgugger Mentioning you as the merger of this PR. Please consider adding some CI/CD track to test the torch compatibility of transformers.
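(For reference, one possible backward-compatible replacement is a NaN-masked sum along these lines - a sketch, not necessarily what a follow-up PR should do:)

import torch

def nansum_compat(tensor):
    # torch.nansum equivalent for older torch versions: zero out NaN entries before summing.
    return torch.where(torch.isnan(tensor), torch.zeros_like(tensor), tensor).sum()

# e.g. masked_lm_loss = -nansum_compat(torch.sum(marginal_gold_log_probs * mlm_mask) / torch.sum(mlm_mask))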
While the library as a whole relies on PyTorch (actually >=1.3, not just 1.0), each model can have more specific requirements. Some require other dependencies (for instance Tapas requires torch-scatter) or newer versions of PyTorch.
The fact that this one requires PyTorch >= 1.7 should be properly documented, however, if anyone wants to make a PR :-)
What does this PR do?
This PR adds REALM.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Closes #3497