Add REALM #13292

Merged: 114 commits into huggingface:master on Jan 18, 2022

Conversation


@qqaatw qqaatw commented Aug 27, 2021

What does this PR do?

This PR adds REALM.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Closes #3497

@antoniolanza1996 antoniolanza1996 mentioned this pull request Aug 27, 2021
@OctoberChang

Hi @qqaatw,

Thank you so much for putting effort into this PR and providing pre-trained REALM models through the PyTorch Transformers API.

I am wondering whether your PyTorch REALM models can reproduce Table 2 of the original paper?

Alternatively, have you verified that the original TensorFlow pre-trained model produces the same embeddings as your converted PyTorch models given an arbitrary input sequence?

Thanks again for the awesome work!

qqaatw commented Aug 28, 2021

Hello @OctoberChang, thanks for your reply!

This is my first time trying to port a model from TensorFlow, so I may need some time to work out the structure and behavior of the original model. Currently, the retriever part has been successfully converted.

Regarding your concerns, I've verified the retriever's behavior by feeding the same inputs to the TensorFlow and PyTorch models respectively and checking that their outputs are nearly identical. Sadly, for now I may not have enough resources to complete those ablation experiments, but I think the results can be reproduced as long as the PyTorch model's behavior is nearly the same as that of the TensorFlow model.

@OctoberChang

Awesome! Looking forward to this PR and the pre-trained REALM models in PyTorch Transformers!

qqaatw commented Sep 1, 2021

The reason I didn't add RealmForQuestionAnswering is as follows:

  1. The fine-tuning code lives in another project, ORQA, which has its own paper.
  2. The architecture of the fine-tuned models is not compatible with the existing question answering architecture in transformers.

Therefore, I think placing the fine-tuning code in the research_projects folder or making it a new model would be more appropriate.

qqaatw commented Jan 6, 2022

We could then also think a bit about how best to demo this model :-) Maybe in a Space (depending on how big the required RAM is). Also, I think a blog post could be a great idea.

Both sound appealing! Let me know how I can help with this.

@patrickvonplaten patrickvonplaten left a comment

This is really great work @qqaatw. REALM is a very difficult model and you've done an outstanding job here!

I've mostly left nits, but I would like to discuss two final things here cc @sgugger @LysandreJik @patil-suraj :

  • a) I think we should break up the abstraction dependency on BertTokenizer
  • b) I'm not sure whether it's a good idea to have a batch_encode_candidates method

Apart from this the PR looks good to me!

@lhoestq - could you also take a final in-depth review here? :-)

max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION

def batch_encode_candidates(self, text, **kwargs):

should we maybe instead overwrite the __call__ method here? Why do we have to enforce the max-length padding strategy here?

IMO this should be handled in the __call__ method, because this is not aligned with the tokenizer API.
For example, see LukeTokenizer, which overrides __call__ to encode multiple texts:
https://github.com/huggingface/transformers/blob/master/src/transformers/models/luke/tokenization_luke.py#L262
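For illustration, here is a minimal sketch of what such an override could look like (the class name and branching logic are assumptions for this sketch, not the tokenizer added in this PR):

from transformers import BertTokenizer

class CandidateTokenizer(BertTokenizer):
    # Hypothetical subclass: accept a batch of candidate lists directly in __call__.
    def __call__(self, text, **kwargs):
        # If `text` is a list of candidate lists, encode each sample separately
        # with max-length padding so the per-sample outputs can later be stacked.
        if isinstance(text, list) and len(text) > 0 and isinstance(text[0], list):
            kwargs.setdefault("padding", "max_length")
            return [BertTokenizer.__call__(self, sample, **kwargs) for sample in text]
        # Otherwise fall back to the regular single-text / batch-of-texts behavior.
        return super().__call__(text, **kwargs)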

@qqaatw qqaatw Jan 9, 2022

@patrickvonplaten The reason for requiring the max-length padding strategy is that a single input can include many candidates. In the following example, each sample has two candidates, and two samples make up a batch. The goal of this function is to produce an encoded output of shape (batch_size, num_candidates, max_length) for RealmScorer.

# batch_size = 2, num_candidates = 2
# (`tokenizer` is assumed to be an already-loaded REALM tokenizer instance)
batch_texts = [
    ["Hello world!", "Nice to meet you!"],  # First sample
    ["The cute cat.", "The adorable dog."],  # Second sample
]

tokenized_texts = tokenizer.batch_encode_candidates(batch_texts, max_length=10, return_tensors="pt")

print(tokenized_texts.input_ids.shape)  # (2, 2, 10)

Therefore, if we don't adopt the max-length strategy and use the longest strategy instead, then with the current logic of batch_encode_candidates the first sample is fed into __call__ first, producing an output of shape (num_candidates, longest_length_of_first_sample). We then do the same for the second sample and get an output of shape (num_candidates, longest_length_of_second_sample). These two outputs cannot be stacked into a batch tensor because their sizes on the last axis differ, as the sketch below shows. We could, however, design more sophisticated logic to accommodate the longest strategy (I can do it in this PR or in a follow-up PR).
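A quick sketch of the issue (assuming `tokenizer` is the same tokenizer instance as in the snippet above):

first_sample = tokenizer(["Hello world!", "Nice to meet you!"], padding="longest", return_tensors="pt")
second_sample = tokenizer(["The cute cat.", "The adorable dog."], padding="longest", return_tensors="pt")

print(first_sample.input_ids.shape)   # (2, longest_length_of_first_sample)
print(second_sample.input_ids.shape)  # (2, longest_length_of_second_sample)

# torch.stack([first_sample.input_ids, second_sample.input_ids]) fails whenever the
# two longest lengths differ, so the samples cannot be combined into one batch tensor.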

@patrickvonplaten, @patil-suraj - This function could be integrated into __call__, but IMO keeping a separate function would be clearer and less ambiguous for users: they would use batch_encode_candidates when they want to encode samples containing candidates for RealmScorer, and the regular __call__ otherwise.

I'm OK with overriding __call__ if you think that way is more aligned and intuitive, though.

["[CLS]", "test", "question", "[SEP]", "this", "is", "the", "fourth", "record", "[SEP]"],
)

def test_block_has_answer(self):

clean!

self.assertEqual([[-1], [6]], start_pos)
self.assertEqual([[-1], [7]], end_pos)

def test_save_load_pretrained(self):

very nice!

@patrickvonplaten

Hey @patrickvonplaten,

Currently all the tests pass, and the new block_records handling seems to be working well. Thank you for making these changes. Now I think we should upload the .npy file to every related model repo, right? Could you help with this, as the file is so huge that it might take a long time to upload from my side? (It should be uploaded to the repos with the openqa suffix, I think.)

Also, I've written a standalone checkpoint conversion script here as well as some instructions in the readme. Do you think we should just provide a link to my repo in the model docs, or should the script be placed into transformers?

By the way, I'm not sure how to preview the generated docs website (I see this feature is under development in CONTRIBUTING.md).

I can definitely help with the uploading part once the PR is merged :-)

@patil-suraj patil-suraj left a comment

Fantastic work @qqaatw, really great!

I have pretty much the same comments as Patrick about tokenizer abstraction and the batch_encode_candidates method. And left a few more nits :)

Thanks a lot for adding this model.

logger = logging.get_logger(__name__)


def convert_tfrecord_to_np(block_records_path, num_block_records):

is this function used anywhere?

Good point! I thought we could leave it since it shows how to convert the original records to the "new" NumPy format, but I would also be OK with deleting it.

I use this function in the checkpoint conversion script.

As Patrick said, it shows how the TF records are converted, and deleting it would be OK too.
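For context, a rough sketch of what such a conversion could look like (the buffer size, paths, and record count below are placeholders, not necessarily what the script in this PR uses):

import numpy as np
import tensorflow as tf

def convert_tfrecord_to_np(block_records_path, num_block_records):
    # Read the raw evidence blocks from the TFRecord file and materialize them
    # as a single NumPy array of byte strings.
    blocks_dataset = tf.data.TFRecordDataset(block_records_path, buffer_size=512 * 1024 * 1024)
    blocks_dataset = blocks_dataset.batch(num_block_records, drop_remainder=True)
    return next(blocks_dataset.take(1).as_numpy_iterator())

blocks = convert_tfrecord_to_np("blocks.tfr", num_block_records=13353718)  # placeholder path and count
np.save("block_records.npy", blocks)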

Comment on lines +34 to +36
methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as
interpretability and modularity.*

One more nit for a follow-up PR.

Since this is a complex model, it would be awesome to add a usage example either here or in the research_projects dir, like RAG.
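For reference, a very rough sketch of what such a usage example might look like (the checkpoint name, the answer_ids handling, and the output field are assumptions on my part and would need to be checked against the final API):

from transformers import RealmForOpenQA, RealmRetriever, RealmTokenizer

checkpoint = "google/realm-orqa-nq-openqa"  # assumed name of a fine-tuned openqa checkpoint

retriever = RealmRetriever.from_pretrained(checkpoint)
tokenizer = RealmTokenizer.from_pretrained(checkpoint)
model = RealmForOpenQA.from_pretrained(checkpoint, retriever=retriever)

question_ids = tokenizer(["Who is the pioneer in modern computer science?"], return_tensors="pt")
answer_ids = tokenizer(
    ["alan mathison turing"],
    add_special_tokens=False,
    return_token_type_ids=False,
    return_attention_mask=False,
).input_ids

outputs = model(**question_ids, answer_ids=answer_ids)
print(tokenizer.decode(outputs.predicted_answer_ids))  # decode the predicted answer span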

lhoestq commented Jan 10, 2022

Amazing job! Thanks for converting the record blocks to NumPy format, it's more convenient this way. Later we could even allow memory-mapping the NumPy file to avoid filling up the RAM :)
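A minimal sketch of the memory-mapping idea (the file name is a placeholder, and this assumes the blocks are stored with a fixed-width dtype, since object arrays saved with allow_pickle=True cannot be memory-mapped):

import numpy as np

# mmap_mode="r" maps the file read-only, so blocks are paged in on access
# instead of the whole array being loaded into RAM up front.
block_records = np.load("block_records.npy", mmap_mode="r")
print(block_records.shape)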

@patrickvonplaten

I think this PR is more or less ready to be merged. @sgugger @LysandreJik could you give it a quick look maybe?

@sgugger sgugger left a comment

I think this is just missing a few more code examples, but otherwise good to go!

Whether or not the evidence block has answer(s).

Returns:
"""

Same here.

Users will not use this model directly, as it is an internal model of RealmForOpenQA.

@LysandreJik LysandreJik left a comment

LGTM, thanks a lot for your impressive contribution @qqaatw!

sgugger commented Jan 17, 2022

All test failures are due to the latest release of Tokenizers. A quick rebase on master should get rid of them.

@sgugger sgugger merged commit 22454ae into huggingface:master Jan 18, 2022

sgugger commented Jan 18, 2022

Thanks for all your work on this!

@qqaatw qqaatw mentioned this pull request Jan 18, 2022
# [batch_size, joint_seq_len]
marginal_gold_log_probs = joint_gold_log_prob.logsumexp(1)
# []
masked_lm_loss = -torch.nansum(torch.sum(marginal_gold_log_probs * mlm_mask) / torch.sum(mlm_mask))

@qqaatw You used nansum in your REALM PR. nansum is only available in torch>=1.7.0, but transformers, per the settings in setup.py (https://github.com/huggingface/transformers/blob/main/setup.py#L155), has to support torch>=1.0, so your PR breaks compatibility with torch>=1.0,<1.7.0. Please consider a new PR with compatibility improvements.
@sgugger Mentioning you as the merger of the PR. Please consider adding some CI/CD track to test the torch compatibility of transformers.
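For reference, a sketch of one possible backward-compatible alternative (just an illustration, not a change made in this PR):

import torch

def nansum_compat(tensor):
    # Same result as torch.nansum on torch < 1.7: treat NaN entries as zero before summing.
    return torch.where(torch.isnan(tensor), torch.zeros_like(tensor), tensor).sum()

# e.g. masked_lm_loss = -nansum_compat(torch.sum(marginal_gold_log_probs * mlm_mask) / torch.sum(mlm_mask))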

While the library as a whole relies on PyTorch (actually >=1.3, not just 1.0), each model can have more specific requirements. Some require other dependencies (for instance, Tapas requires torch-scatter) or newer versions of PyTorch.

The fact that this one requires PyTorch >= 1.7 should be properly documented, however, if anyone wants to make a PR :-)

Labels: WIP
Projects: none yet
Successfully merging this pull request may close these issues: REALM
8 participants