Add REALM #13292
Conversation
Hi @qqaatw, thank you so much for putting effort into this PR and providing pre-trained REALM models through the PyTorch Transformers API. I am wondering whether your REALM models in PyTorch can reproduce Table 2 of the original paper? Alternatively, have you verified that the TensorFlow pre-trained model produces the same embeddings as your converted PyTorch models for an arbitrary input sequence? Thanks again for the awesome work!
Hello @OctoberChang, thanks for your reply! This is my first time trying to port a model from TensorFlow, so I may need some time to clarify the structure and behavior of the original model. Currently, the retriever part has been successfully converted. Regarding your concerns, I've verified the retriever's behavior by feeding the same inputs to the TensorFlow and PyTorch models respectively and checking that their outputs are nearly identical. Sadly, I may not have enough resources to run those ablation experiments for now, but I think the results can be reproduced as long as the PyTorch model's behavior is nearly identical to that of the TensorFlow model.
Awesome! Looking forward to this PR and the pre-trained REALM models in PyTorch Transformers!
The reason I didn't add
Therefore, I think moving the fine-tuning code to the research_projects folder or making it a new model would be more appropriate.
Both sound appealing! Let me know if I can help with this.
This is really great work @qqaatw . Realm is a very difficult model and you've done an outstanding job here!
I've mostly left nits, but I would like to discuss two final things here cc @sgugger @LysandreJik @patil-suraj :
- a) I think we should break up the abstraction dependency on BertTokenizer
- b) I'm not sure whether it's a good idea to have a batch_encode_candidates method
Apart from this the PR looks good to me!
@lhoestq - could you also take a final in-depth review here? :-)
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION

    def batch_encode_candidates(self, text, **kwargs):
Should we maybe instead overwrite the __call__ method here? Why do we have to enforce the max-length padding strategy here?
IMO this should be handled in the __call__ method, because this is not aligned with the tokenizer API. For example, see the LUKETokenizer, which overrides __call__ to encode multiple texts:
https://github.com/huggingface/transformers/blob/master/src/transformers/models/luke/tokenization_luke.py#L262
@patrickvonplaten The reason for requiring the max-length padding strategy is that a single input can include many candidates. In the following example, each sample has two candidates, and the two samples form a batch. The goal of this function is to produce an encoded output of shape (batch_size, num_candidates, max_length) for RealmScorer.
# batch_size = 2, num_candidates = 2
batch_texts = [
    ["Hello world!", "Nice to meet you!"],   # First sample
    ["The cute cat.", "The adorable dog."],  # Second sample
]
tokenized_texts = tokenizer.batch_encode_candidates(batch_texts, max_length=10, return_tensors="pt")
print(tokenized_texts.input_ids.shape)  # (2, 2, 10)
Therefore, if we don't adopt the max-length strategy and use the longest strategy instead, then based on the current logic of batch_encode_candidates, the first sample is fed into __call__ first, producing an output of shape (num_candidates, longest_length_of_first_sample). We then do the same for the second sample and get an output of shape (num_candidates, longest_length_of_second_sample). These two outputs cannot be stacked into a batch tensor because their last dimensions differ. We could, however, design more sophisticated logic to support the longest strategy (I can do that in this PR or in a follow-up PR).
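(For context, the current logic is roughly a flatten / encode / reshape pattern along these lines - a simplified sketch, not the exact method in this PR, and the standalone helper signature is an assumption:)

def batch_encode_candidates_sketch(tokenizer, batch_texts, max_length, **kwargs):
    # Flatten the (batch_size, num_candidates) nested lists into one flat list,
    # encode everything with the fixed max-length padding strategy, then reshape
    # back so each sample keeps its candidates grouped together.
    batch_size = len(batch_texts)
    num_candidates = len(batch_texts[0])
    flat_texts = [text for candidates in batch_texts for text in candidates]
    encoded = tokenizer(
        flat_texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
        **kwargs,
    )
    # Each value has shape (batch_size * num_candidates, max_length) before the reshape.
    return {key: value.view(batch_size, num_candidates, -1) for key, value in encoded.items()}

With this shape convention, the max-length strategy is what guarantees that every candidate in every sample pads to the same last dimension.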
@patrickvonplaten, @patil-suraj - This function could be integrated into __call__, but IMO keeping a separate function is clearer and less ambiguous for users: they use batch_encode_candidates when they want to encode samples containing candidates for RealmScorer, and otherwise use the regular __call__ to encode.
I'm ok with overriding __call__ if you think this way is more aligned and intuitive, though.
["[CLS]", "test", "question", "[SEP]", "this", "is", "the", "fourth", "record", "[SEP]"], | ||
) | ||
|
||
def test_block_has_answer(self): |
clean!
self.assertEqual([[-1], [6]], start_pos)
self.assertEqual([[-1], [7]], end_pos)

def test_save_load_pretrained(self):
very nice!
I can definitely help with the uploading part once the PR is merged :-)
Fantastic work @qqaatw, really great!
I have pretty much the same comments as Patrick about the tokenizer abstraction and the batch_encode_candidates method, and left a few more nits :)
Thanks a lot for adding this model.
logger = logging.get_logger(__name__)


def convert_tfrecord_to_np(block_records_path, num_block_records):
is this function used anywhere?
No - good point though! I thought we could leave it, since it shows how to convert between the original records and the "new" numpy format, but I would also be ok with deleting it.
I use this function in the checkpoint conversion script.
As Patrick said, it shows how tf records are converted, and deleting it would be ok too.
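For reference, the conversion is roughly along these lines (a sketch assuming the evidence blocks are stored as raw byte records in a TFRecord file; not necessarily identical to the script in this PR):

import numpy as np
import tensorflow as tf

def convert_tfrecord_to_np_sketch(block_records_path, num_block_records):
    # Each record in the TFRecord file is one raw evidence block (bytes).
    # Read them all and materialize them as a single numpy object array,
    # which can then be saved with np.save for the PyTorch retriever.
    dataset = tf.data.TFRecordDataset(block_records_path, buffer_size=512 * 1024 * 1024)
    dataset = dataset.batch(num_block_records, drop_remainder=True)
    blocks = next(dataset.take(1).as_numpy_iterator())
    return np.asarray(blocks, dtype=object)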
methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as
interpretability and modularity.*
One more nit for a follow-up PR: since this is a complex model, it would be awesome to add a usage example, either here or in the research_projects dir like RAG.
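For what it's worth, such a usage example could look roughly like the following once the checkpoints are uploaded (the hub identifier and the exact output handling here are assumptions, not final):

from transformers import RealmForOpenQA, RealmRetriever, RealmTokenizer

# Hypothetical checkpoint name; the final hub identifiers are decided at upload time.
checkpoint = "google/realm-orqa-nq-openqa"

retriever = RealmRetriever.from_pretrained(checkpoint)
tokenizer = RealmTokenizer.from_pretrained(checkpoint)
model = RealmForOpenQA.from_pretrained(checkpoint, retriever=retriever)

question = "Who is the pioneer in modern computer science?"
question_ids = tokenizer([question], return_tensors="pt")
answer_ids = tokenizer(
    ["alan mathison turing"],
    add_special_tokens=False,
    return_token_type_ids=False,
    return_attention_mask=False,
).input_ids

# The model retrieves evidence blocks, reads them, and predicts an answer span.
reader_output, predicted_answer_ids = model(**question_ids, answer_ids=answer_ids, return_dict=False)
predicted_answer = tokenizer.decode(predicted_answer_ids)
print(predicted_answer)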
Amazing job! Thanks for converting the record blocks to numpy format, it's more convenient this way. Later we can even memory-map the numpy file to avoid filling up the RAM :)
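(As a rough sketch of that idea - assuming the blocks end up in a fixed-dtype array on disk, since pickled object arrays can't be memory-mapped:)

import numpy as np

# "block_records.npy" is a placeholder filename for the converted records.
# mmap_mode="r" keeps the data on disk and pages it in on demand instead of
# loading the whole array into RAM; it only works for fixed-dtype arrays.
block_records = np.load("block_records.npy", mmap_mode="r")
print(block_records.shape)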
I think this PR is more or less ready to be merged. @sgugger @LysandreJik could you give it a quick look maybe?
I think this is just missing a few more code examples, but otherwise good to go!
        Whether or not the evidence block has answer(s).

    Returns:
    """
Same here.
Users will not directly use this model, as it is an internal model of RealmForOpenQA.
LGTM, thanks a lot for your impressive contribution @qqaatw!
All test failures are due to the latest release of Tokenizers. A quick rebase on master should get rid of them.
Thanks for all your work on this!
# [batch_size, joint_seq_len]
marginal_gold_log_probs = joint_gold_log_prob.logsumexp(1)
# []
masked_lm_loss = -torch.nansum(torch.sum(marginal_gold_log_probs * mlm_mask) / torch.sum(mlm_mask))
@qqaatw You used nansum in your PR with REALM. nansum is only available in torch>=1.7.0, but transformers, per the settings in setup.py (https://github.com/huggingface/transformers/blob/main/setup.py#L155), has to support torch>=1.0, so your PR breaks compatibility with torch>=1.0,<1.7.0. Please consider a new PR with compatibility improvements.
@sgugger Mentioning you as the merger of this PR. Please consider adding some CI/CD track to test the torch compatibility of transformers.
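(For reference, one possible backward-compatible replacement is a NaN-masked sum along these lines - a sketch, not necessarily what a follow-up PR should do:)

import torch

def nansum_compat(tensor):
    # torch.nansum equivalent for older torch versions: zero out NaN entries before summing.
    return torch.where(torch.isnan(tensor), torch.zeros_like(tensor), tensor).sum()

# e.g. masked_lm_loss = -nansum_compat(torch.sum(marginal_gold_log_probs * mlm_mask) / torch.sum(mlm_mask))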
While the library as a whole relies on PyTorch (actually >=1.3, not just 1.0), each model can have more specific requirements. Some require other dependencies (for instance Tapas requires torch-scatter) or newer versions of PyTorch.
The fact that this one requires PyTorch >= 1.7 should be properly documented, however, if anyone wants to make a PR :-)
What does this PR do?
This PR adds REALM.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Closes #3497