Fix most of the tokenizer tests. #41
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks a lot for all this work @lockon-n! I can take a deeper look on Thursday, is that OK?
Thanks @NielsRogge and @SaulLu! @lockon-n will work with you on this PR.
Thank you so much for working on adding MarkupLM 🤗 🙌 ! Integrating the tokenizer and understanding the tests is not a simple thing.
I left a lot of questions in the comments to make sure I understood the motivation behind your changes.
@@ -999,6 +999,34 @@ def post_processor(self):
            ],
        )

class MarkupLMConverter(Converter):
Do you remember why Roberta's converter didn't work?
I compared Roberta's converter to Bert's; it seems Roberta's does not add the unknown token to the BPE model.
Thanks! Yes, that's plausible, as Roberta's is a byte-level BPE (it can't produce unknown tokens). Can MarkupLM's tokenizer produce unknown tokens?
Well, we assume MarkupLM's tokenizer is exactly the same as Roberta's, so unknown tokens shouldn't exist. The only problem is that this test was originally designed for LayoutLM, which is based on Bert's tokenizer. Maybe we can ignore it.
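As context for why a byte-level BPE never needs an unknown token: every character decomposes into UTF-8 bytes, and all 256 possible byte values are base symbols of the vocabulary, so any input is representable. A minimal sketch (this helper is hypothetical, not part of the converter code):

```python
def byte_level_units(text: str) -> list[str]:
    # Any string decomposes into UTF-8 bytes; a byte-level BPE's base
    # vocabulary covers all 256 byte values, so no input is ever "unknown".
    return [f"0x{b:02x}" for b in text.encode("utf-8")]

# Even a character outside any word-level vocabulary maps to known byte symbols.
assert byte_level_units("é") == ["0xc3", "0xa9"]
```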
What is the test causing this issue?
It's `test_rust_and_python_full_tokenizers`. You can see in the `setUp()` of `MarkupLMTokenizationTest` that we have an unknown token for the temporary slow tokenizer; however, when using the default RoBERTa converter, the unknown token is removed while building the fast tokenizer. Then `ids` and `rust_ids` in lines 1122/1123 (test_tokenization_markuplm.py) are not the same (the first token of `ids` is `'<unk>'`, while `rust_ids` does not have it).
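The mismatch can be sketched with a toy pair of encoders (hypothetical names; the real slow and fast tokenizers are far more involved): the slow side maps out-of-vocabulary tokens to `<unk>`, while a fast tokenizer whose BPE model was built without an unknown token silently drops them, so the id sequences diverge.

```python
VOCAB = {"<unk>": 0, "this": 1, "is": 2}

def encode_slow(tokens):
    # slow tokenizer: OOV tokens fall back to <unk>
    return [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]

def encode_fast_no_unk(tokens):
    # fast tokenizer built without an unk token: OOV tokens are dropped
    return [VOCAB[t] for t in tokens if t in VOCAB]

ids = encode_slow(["oov", "this", "is"])           # [0, 1, 2]
rust_ids = encode_fast_no_unk(["oov", "this", "is"])  # [1, 2]
assert ids != rust_ids  # mirrors the test_rust_and_python_full_tokenizers failure
```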
overflowing_xpath_tags_seq = pair_xpath_tags_seq[-window_len:]
overflowing_xpath_subs_seq = pair_xpath_subs_seq[-window_len:]
Great catch!
text = [['this', 'is', 'the'], ["how", "are", "you"]]
xpaths = [["html/body"] * 4, ["html/body"] * 3]
I think I'm missing something here: I see 3 nodes in the `['this', 'is', 'the']` list. Why do we have 4 values in the corresponding xpaths list?
This should be a typo, sorry about that.
No problem! If a mismatch between the number of text nodes and xpath entries is not supposed to happen, shouldn't an error be raised? (By the way, I wonder how the tokenizer currently handles this difference.)
I think raising an error is enough, or something like asserting `len(text) == len(xpaths)`. In fact, we did not consider such a difference when designing the relevant module, but I guess xpaths longer than the text may simply be ignored.
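A minimal guard along those lines (hypothetical helper name; the actual tokenizer may validate differently or not at all):

```python
def check_nodes_and_xpaths(nodes, xpaths):
    # Raise early instead of silently ignoring extra xpath entries.
    if len(nodes) != len(xpaths):
        raise ValueError(
            f"nodes and xpaths must have the same length, "
            f"got {len(nodes)} nodes and {len(xpaths)} xpaths"
        )

check_nodes_and_xpaths(["this", "is", "the"], ["html/body"] * 3)  # OK
```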
inputs = new_tokenizer(text, xpaths=xpaths)
self.assertEqual(len(inputs["input_ids"]), 2)
decoded_input = new_tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
expected_result = "this is the"
expected_result = "thisisthe"  # original expected result "this is the" seems to contradict the Roberta-based tokenizer
This behaviour led me at the time to investigate how tokenization was done in the original repository.
For context, in a Roberta tokenizer spaces are transformed into `Ġ` characters and not removed during tokenization. With Roberta's tokenizer, the tokenization of the individual words `'this'`, `'is'`, and `'the'` would be:
- with `add_prefix_space=False`: `['this']`, `['is']`, and `['the']`
- with `add_prefix_space=True`: `['Ġthis']`, `['Ġis']`, and `['Ġthe']`
The spaces being lost and the decoder not artificially adding spaces explains the current behaviour of the decode
method. Nevertheless, this led me to 2 questions:
- In the original code, did the text in each node start with a space (even if this space was artificially added with an `add_prefix_space` argument)? In other words, was the tokenization of the nodes `[['this'], ['is'], ['the']]` or `[['Ġthis'], ['Ġis'], ['Ġthe']]`?
- In the case where the text in each node does not start with a space, do we need a custom decoder to reintroduce spaces (is this possible?)? To answer this question, I think the most important thing is to know whether MarkupLM can solve generative tasks.
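The decoding behaviour described above can be sketched with a toy byte-level decoder, where `Ġ` is the space marker (assumption: decoding is plain concatenation, as in a Roberta-style decoder, with no artificial spaces added):

```python
def toy_decode(tokens):
    # Byte-level BPE decoding is concatenation; Ġ stands in for a space.
    return "".join(tokens).replace("Ġ", " ")

assert toy_decode(["this", "is", "the"]) == "thisisthe"      # no space markers
assert toy_decode(["this", "Ġis", "Ġthe"]) == "this is the"  # markers preserved
```

This is why the expected result in the test becomes `"thisisthe"` when the per-node tokenization carries no `Ġ` prefixes.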
These questions led me to propose a new integration test in PR #37, which is unfortunately based on a very old version of the library and therefore does not make the PR readable now. To create this integration test I investigated the tokenization produced by the authors of MarkupLM on the downstream WebSRC task. It seems to me that the integration test I proposed to add was this one:
transformers/tests/test_tokenization_markuplm.py
Lines 2196 to 2251 in 36dcacc
def test_question_websrc(self):
    # This test aims at verifying that we obtain the same tokenization as the one defined by the authors of
    # MarkupLM for the first example of the WebSRC downstream task
    pretrained_name = "microsoft/markuplm-base"
    self.maxDiff = None
    # fmt: off
    # Tokenizer's inputs
    question = "What's the name of the university?"
nodes = ['World', 'League', 'University', 'of', 'Isfahan', 'Iran', 'University', 'of', 'Isfahan,', 'Isfahan', 'Province,', 'Esfahan,', 'Daneshgah', 'Street,', 'Iran', 'World', 'rank', '591', 'Country', 'rank', '7', 'Foundation', 'year:', '1946', 'Short', 'name:', 'UI', 'Type:', 'Public', 'Students:', '12900', 'Faculty:', '640', 'Students/Faculty', 'Ratio:', '20', ':', '1', 'Web-site:', 'ui.ac.ir', 'Region:', 'Asia', 'Location:', 'Isfahan', 'Rankings', 'Rank', 'Score', 'World', 'University', 'Ranking', '591', '40.272', 'Teaching', 'Ranking', '384', '54.226', 'Research', 'Ranking', '642', '30.792', 'International', 'Diversity', 'Ranking', '802', '10.971', 'Financial', 'Sustainability', 'Ranking', '641', '43.231', 'Additional', 'Rankings', 'Rank', 'Score', 'Reputation', 'Rankings', '552', '33.828', 'Academic', 'Rankings', '654', '38.879', 'no', 'yes'] | |
tags = [['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'p'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'p'], ['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'a', 'div', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 
'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 
216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216]] | |
    xpaths = ["/" + "/".join(tag) if isinstance(tag[0], str) else "" for tag in tags]
    max_length = 384
    # Targets
tokens_target = ['<s>', 'What', "'s", 'Ġthe', 'Ġname', 'Ġof', 'Ġthe', 'Ġuniversity', '?', '</s>', 'World', 'League', 'University', 'of', 'Is', 'f', 'ahan', 'Iran', 'University', 'of', 'Is', 'f', 'ahan', ',', 'Is', 'f', 'ahan', 'Prov', 'ince', ',', 'Es', 'f', 'ahan', ',', 'Dan', 'esh', 'g', 'ah', 'Street', ',', 'Iran', 'World', 'rank', '591', 'Country', 'rank', '7', 'Found', 'ation', 'year', ':', '19', '46', 'Short', 'name', ':', 'UI', 'Type', ':', 'Public', 'Students', ':', '12', '900', 'Fac', 'ulty', ':', '640', 'Students', '/', 'Fac', 'ulty', 'Rat', 'io', ':', '20', ':', '1', 'Web', '-', 'site', ':', 'ui', '.', 'ac', '.', 'ir', 'Region', ':', 'Asia', 'Location', ':', 'Is', 'f', 'ahan', 'Rank', 'ings', 'Rank', 'Score', 'World', 'University', 'R', 'anking', '591', '40', '.', '272', 'Te', 'aching', 'R', 'anking', '384', '54', '.', '226', 'Research', 'R', 'anking', '642', '30', '.', '792', 'International', 'D', 'iversity', 'R', 'anking', '802', '10', '.', '9', '71', 'Financial', 'S', 'ustain', 'ability', 'R', 'anking', '641', '43', '.', '231', 'Additional', 'Rank', 'ings', 'Rank', 'Score', 'Rep', 'utation', 'Rank', 'ings', '552', '33', '.', '8', '28', 'Ac', 'ademic', 'Rank', 'ings', '654', '38', '.', '8', '79', 'no', 'yes', '</s>'] + ['<pad>'] * 216 | |
input_ids_target = [0, 2264, 18, 5, 766, 9, 5, 2737, 116, 2, 10988, 17608, 29972, 1116, 6209, 506, 10048, 21336, 29972, 1116, 6209, 506, 10048, 6, 6209, 506, 10048, 35746, 15062, 6, 15286, 506, 10048, 6, 25887, 4891, 571, 895, 24265, 6, 21336, 10988, 40081, 40284, 36735, 40081, 406, 29991, 1258, 180, 35, 1646, 3761, 34256, 13650, 35, 18157, 40118, 35, 22649, 29413, 35, 1092, 7784, 46166, 38431, 35, 31419, 29413, 73, 46166, 38431, 40855, 1020, 35, 844, 35, 134, 27521, 12, 11953, 35, 3371, 4, 1043, 4, 853, 43575, 35, 20892, 46571, 35, 6209, 506, 10048, 46052, 1033, 46052, 36088, 10988, 29972, 500, 20327, 40284, 1749, 4, 30687, 16215, 13341, 500, 20327, 35324, 4283, 4, 29190, 27004, 500, 20327, 36159, 541, 4, 39739, 29743, 495, 31104, 500, 20327, 37674, 698, 4, 466, 5339, 35028, 104, 26661, 4484, 500, 20327, 39138, 3897, 4, 29949, 8665, 46052, 1033, 46052, 36088, 22026, 31320, 46052, 1033, 35544, 3103, 4, 398, 2517, 26145, 41551, 46052, 1033, 38138, 3170, 4, 398, 5220, 2362, 10932, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] | |
    pad_xpath_tag = 216
    pad_xpath_tags = [pad_xpath_tag] * 50
xpath_tags_seq_target = [pad_xpath_tags] * 10 + [[109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 
50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + 
[pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] 
+ [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 
50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40] + [pad_xpath_tags]* 219 | |
    # xpath_subs_seq_target = # to add
    # token_type_ids_target = # to add
    # attention_mask_target = # to add
    # fmt: on
    with self.subTest(f"Python version ({pretrained_name})"):
        tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name)
        encoding_p = tokenizer_p(
            question,
            nodes,
            xpaths=xpaths,
            padding="max_length",
            max_length=max_length,
        )
        tokens_p = tokenizer_p.convert_ids_to_tokens(encoding_p["input_ids"])
        self.assertListEqual(tokens_p, tokens_target)
        self.assertListEqual(encoding_p["input_ids"], input_ids_target)
        self.assertListEqual(encoding_p["xpath_tags_seq"], xpath_tags_seq_target)
        # add the other keys
    with self.subTest(f"Rust version ({pretrained_name})"):
        tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name)
        encoding_r = tokenizer_r(
            question,
            nodes,
            xpaths=xpaths,
            padding="max_length",
            max_length=max_length,
        )
        tokens_r = tokenizer_r.convert_ids_to_tokens(encoding_r["input_ids"])
        self.assertListEqual(tokens_r, tokens_target)
        self.assertListEqual(encoding_r["input_ids"], input_ids_target)
        self.assertListEqual(encoding_r["xpath_tags_seq"], xpath_tags_seq_target)
        # add the other keys
In addition to the problems I had identified in the PR message, the targets defined in the test allowed me to conclude that there was no space added at the beginning of each node (which would answer my first question above). But you would surely know better than I what you did in your original code and what should be our ground truth!
For question 1, the tokenization should be `[['this'], ['is'], ['the']]`. For question 2, MarkupLM currently cannot handle generative tasks, and I think a custom decoder is not necessary.
Frankly speaking, we did not think much about the "prefix space". For each node, we first call `strip()` to remove possible leading/trailing spaces; all that is left is to use a default Roberta tokenizer to tokenize the span and generate tokens.
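The per-node preprocessing described here can be sketched as follows (hypothetical helper name; assumes the behaviour exactly as stated by the authors):

```python
def preprocess_node(node_text: str) -> str:
    # Strip possible leading/trailing spaces before handing the span
    # to a default Roberta tokenizer; no prefix space is reintroduced.
    return node_text.strip()

assert preprocess_node("  this ") == "this"
```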
Very clear to me!
As MarkupLM can't do generative tasks, I think we should emit a warning when the `decode` method is called, saying that decoding is experimental and subject to change.
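Such a warning could look like this (a sketch using a standard `warnings.warn`; the function name is hypothetical, and in the real tokenizer the warning would live inside the `decode` method itself):

```python
import warnings

def decode_with_warning(inner_decode, token_ids):
    # Hypothetical wrapper around an underlying decode function.
    warnings.warn(
        "MarkupLM decoding is experimental and subject to change: "
        "spaces between nodes are not reconstructed."
    )
    return inner_decode(token_ids)
```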
Sure, that's exactly what we want :)
# Ensure consistency
output_txt = tokenizer.decode(toks_ids, clean_up_tokenization_spaces=False)
# an extra blank will cause inconsistency: ["a","b",] & "a b"
'''
if " " not in output_txt and len(toks_ids) > 1:
    output_txt = (
        tokenizer.decode([toks_ids[0]], clean_up_tokenization_spaces=False)
        + " "
        + tokenizer.decode(toks_ids[1:], clean_up_tokenization_spaces=False)
    )
'''
if with_prefix_space:
    output_txt = " " + output_txt
nodes = output_txt.split(" ")
xpaths = ["html/body" for i in range(len(nodes))]
output_ids = tokenizer.encode(nodes, xpaths=xpaths, add_special_tokens=False)

return nodes, xpaths, output_ids
For this automatic script, I wonder if what is currently generated is the most useful. Currently `nodes` and `xpaths` are both composed of a single element, `['lowerstidner']` and `['html/body']` respectively. Wouldn't it be more useful to have an example here with multiple nodes and xpaths?
I'm a bit confused by this question; could you give more details?
Since in most use cases the input will have several nodes, I think we should also have several nodes in the example returned here. Indeed, by not having several nodes one might not be able to spot potential problems.
For example: `['lowerstidner', 'lower', 'lowersti']` and `['html/body', 'html/body/div', 'html/body/div/table']`.
Yes, having multiple nodes is more common, and it would be better to have an example like that. But I think it makes no difference if we only consider the xpath converter part.
I remember that the corresponding test in LayoutLM can generate multiple nodes, which may also be a 'bert2roberta' problem.
It is possible; I think in this case it is ok to override this method to propose a sufficiently adversarial example. 😊
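Overriding the helper could look like this (a sketch using the example values from the discussion above; in the test class this would be a method on `MarkupLMTokenizationTest` taking `self`):

```python
def get_nodes_and_xpaths():
    # Multi-node adversarial example suggested in the review,
    # with nested xpaths of different depths.
    nodes = ["lowerstidner", "lower", "lowersti"]
    xpaths = ["html/body", "html/body/div", "html/body/div/table"]
    return nodes, xpaths

nodes, xpaths = get_nodes_and_xpaths()
assert len(nodes) == len(xpaths) == 3
```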
@@ -2216,7 +2215,7 @@ def test_markuplm_integration_test(self):
        nodes, xpaths = self.get_nodes_and_xpaths()

        # fmt: off
expected_results = {'input_ids': [101, 1037, 6881, 2135, 3231, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'bbox': [[0, 0, 0, 0], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} # noqa: E231 | |||
expected_results = {'input_ids': [0, 42891, 8331, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'xpath_tags_seq': [[216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [109, 25, 50, 120, 50, 178, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [109, 25, 50, 120, 50, 178, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 
216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 
216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216]], 'xpath_subs_seq': [[1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [109, 25, 50, 120, 50, 178, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [109, 25, 50, 120, 50, 178, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001]], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} |
Thanks for generating the right targets! Out of curiosity, what was your workflow for generating these targets (so that I know how to do it if we want to redo it for other examples in the future)?
As you can see, the original expected results are for LayoutLM (e.g., bbox), not MarkupLM.
To generate them, just use a RoBERTa tokenizer to tokenize the input and apply some padding. The xpath tags/subscripts parts come from the get_xpath_seq method in MarkupLMTokenizer.
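For illustration, here is a minimal sketch of what a get_xpath_seq-style helper could look like. The tag vocabulary here is a toy one, and the pad ids (216 for tags, 1001 for subscripts) and the max depth of 50 are assumptions read off the expected targets above, not the checked-in implementation.

```python
import re

# Toy tag vocabulary; the real tokenizer ships a much larger one.
TAGS = ["html", "body", "div", "li", "span"]
TAG2ID = {t: i for i, t in enumerate(TAGS)}
UNK_TAG_ID = 215    # assumed id for unknown tags
PAD_TAG_ID = 216    # assumed pad id, matching the targets above
PAD_SUBS_ID = 1001  # assumed pad subscript, matching the targets above
MAX_DEPTH = 50      # assumed max xpath depth (each row above has 50 entries)

def get_xpath_seq(xpath):
    """Split an xpath like '/html/body/li[1]' into parallel tag-id and
    subscript lists, padded to MAX_DEPTH."""
    tags, subs = [], []
    for unit in xpath.split("/"):
        if not unit:
            continue
        m = re.match(r"([a-zA-Z0-9]+)(?:\[(\d+)\])?", unit)
        tags.append(TAG2ID.get(m.group(1), UNK_TAG_ID))
        subs.append(int(m.group(2)) if m.group(2) else 0)
    pad = MAX_DEPTH - len(tags)
    return tags + [PAD_TAG_ID] * pad, subs + [PAD_SUBS_ID] * pad

tags, subs = get_xpath_seq("/html/body/div/li[1]/div/span")
```

On this toy vocabulary, the xpath above yields six real (tag, subscript) pairs followed by 44 padding entries.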
Thank you very much for your feedback. I wonder if there might be a way to get the targets out of the original code.
The reason I'm wondering is that while investigating the WebSRC downstream task, I noticed that the template (i.e., the tokens added at the beginning of a sentence, at the separation of two sentences, and at the end of a sentence) for pairs of sentences (questions and textual nodes) was not identical to Roberta's. From what I understand:
- Roberta's template:
'<s>', tokens from sentence 1, '</s>', '</s>', tokens from sentence 2, '</s>'
- WebSRC MarkupLM's template:
'<s>', tokens from sentence 1, '</s>', tokens from sentence 2, '</s>'
But maybe in the original code several templates were used depending on the task, or I didn't understand something...
So, in this example, I wonder if the targets for get_question_nodes_and_xpaths and get_question_nodes_and_xpaths_batch are indeed representative of the modeling proposed in MarkupLM.
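Expressed as build_inputs_with_special_tokens-style helpers over token-id lists, the two templates differ only in the duplicated separator. This is just a sketch; the ids 0 and 2 are RoBERTa's usual '<s>' and '</s>' ids.

```python
CLS, SEP = 0, 2  # RoBERTa's '<s>' and '</s>' ids

def roberta_pair(ids_a, ids_b):
    # '<s>' A '</s>' '</s>' B '</s>': two consecutive separators
    return [CLS] + ids_a + [SEP, SEP] + ids_b + [SEP]

def websrc_markuplm_pair(ids_a, ids_b):
    # '<s>' A '</s>' B '</s>': a single separator between segments
    return [CLS] + ids_a + [SEP] + ids_b + [SEP]
```

For a toy pair of one-token segments, the first template yields a six-id sequence and the second a five-id one.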
Oh, I get your point.
The current targets are wrong; what we want is WebSRC's template, which means the element related to the second '</s>' should be removed and a padding element should be added in 'input_ids', 'xpath_tags_seq', 'xpath_subs_seq' and 'attention_mask'. We leave 'token_type_ids' alone as it is always all-zero. Sorry for this mistake.
The source code for WebSRC is BERT-style, and the early version of MarkupLM was also BERT-style. In other words, there is only one sep token in between. This behavior was kept after we switched to RoBERTa's tokenizer. I have to admit this was an oversight on our part, because we did not know at the time that the standard RoBERTa tokenizer inserts two consecutive sep tokens. As a result, our pre-training was also conducted in this manner (one sep), so the template in fine-tuning should be of the same kind.
To fix this issue, it is enough to remove the second '</s>' and its pad entries in the xpath tags/subs in MarkupLMTokenizer.
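A minimal sketch of that fix as a post-hoc pass (the function name and the sep id are illustrative; the real fix belongs in MarkupLMTokenizer's template rather than in post-processing):

```python
SEP = 2  # assumed '</s>' id

def drop_second_sep(input_ids, xpath_tags_seq, xpath_subs_seq):
    """Remove the second of two consecutive sep tokens, together with its
    aligned xpath rows, turning RoBERTa's pair template into the
    single-sep one MarkupLM was pre-trained with."""
    for i in range(1, len(input_ids)):
        if input_ids[i] == SEP and input_ids[i - 1] == SEP:
            return (
                input_ids[:i] + input_ids[i + 1:],
                xpath_tags_seq[:i] + xpath_tags_seq[i + 1:],
                xpath_subs_seq[:i] + xpath_subs_seq[i + 1:],
            )
    return input_ids, xpath_tags_seq, xpath_subs_seq
```

Dropping the aligned rows keeps 'input_ids', 'xpath_tags_seq' and 'xpath_subs_seq' the same length, which is what the model expects.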
Thank you very much for the confirmation! There are really no worries!
Unfortunately this means that for the fast tokenizer we have to wait for a new feature that should arrive soon in the tokenizers library, to be able to combine two post-processors: the one for the byte-level and the one for the template, which is not RoBERTa's.
@SaulLu Hi there. I have tried my best to answer the questions. Feel free to raise more lol.
@SaulLu More feedback updated. Looking for your reply (and, possibly, more questions). 😄
Hi @NielsRogge , I think MarkupLM is almost done, except that MarkupLMTokenizer inserts two '</s>' tokens between the question and the page context, so we just need to delete the second one. But the fast tokenizer may have to wait for some new features, as @SaulLu said.
Hi @SaulLu and @NielsRogge, how is the progress on the fast tokenizer?
Hi @NielsRogge and @lockon-n and @wolfshow , As the feature for tokenizers fast is not ready yet, what we can do is to add the fast version of MarkupLM with partial availability: there is only one case that will not work. I have opened a PR showing this case here: lockon-n#1 I think this is a good compromise to add MarkupLM to transformers. If we want to use this workaround, the
proposal of a fix for the MarkupLM fast tokenizer
Hi @NielsRogge , I have merged the PR proposed by @SaulLu . I think MarkupLM is ready for transformers now.
Great! 🙌 I just think we need to add an integration test for the fast tokenizer too and change the template for the slow tokenizer with the question 😄 (and add the warning discussed in this message: #41 (comment)) |
…late in slow tokenizer
@SaulLu The issues have been fixed in the latest update. The integration test for the fast tokenizer fails (the double-sep problem) when encoding text pairs; is that expected? I have set trim_offsets to False by default.
def test_markuplm_integration_test(self):
    tokenizer_p = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
    # no fast tokenizer for now
    # tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
    tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
@lockon-n , can you re-try with this change? 🙏
- tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
+ tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base", from_slow=True)
If it solves the issue, it's just because the tokenizer.json file at https://huggingface.co/microsoft/markuplm-base needs to be updated
Still fails :(
Oh 😞
Could this be coming from the xpath_tags_seq list?
I have trouble remembering how the nodes and xpaths need to be tied. In particular, I don't remember why line 102 is xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"] and not xpaths = ["/html/body/div/li[1]/div/span"].
transformers/tests/models/markuplm/test_tokenization_markuplm.py
Lines 99 to 112 in 485b2ca
def get_question_nodes_and_xpaths(self):
    question = "what's his name?"
    nodes = ["hello world"]
    xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"]
    return question, nodes, xpaths

def get_question_nodes_and_xpaths_batch(self):
    questions = ["what's his name?", "how is he called?"]
    nodes = [["hello world", "running"], ["hello my name is bob"]]
    xpaths = [
        ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"],
        ["/html/body/div/li[2]/div/span"],
    ]
This seems to be a typo, but it does not lead to mistakes. The second xpath is discarded during processing, I guess.
tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
oh I'm sorry, the change on this file slipped through the cracks when I committed the files... but indeed this is the fix!
@SaulLu I rewrote the converter; this solves the double-sep problem, i.e., the tokenizer.post_processor one. However, the fast tokenizer seems to tokenize "what's his name?" as ["what","'s","his","name","?"], while the slow tokenizer and the ground truth give ["what","'s","Ġhis","Ġname","?"]. Could this be the space problem caused by trim_offsets?
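For context on the 'Ġ' above: byte-level BPE pre-tokenization marks a word-initial space with 'Ġ' (U+0120), so a word loses its marker only if the space information is dropped before the model sees it. A toy sketch of just the marking step (it deliberately ignores the punctuation splitting a real byte-level pre-tokenizer also performs):

```python
def byte_level_words(text, add_prefix_space=False):
    """Toy byte-level pre-tokenization: split on spaces and prefix each
    space-preceded word with the 'Ġ' marker."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    out, word = [], ""
    for ch in text:
        if ch == " ":
            if word:
                out.append(word)
            word = "\u0120"  # 'Ġ'
        else:
            word += ch
    if word:
        out.append(word)
    return out
```

With this marking, "his" only becomes "Ġhis" when the preceding space survives, which is why a tokenizer that strips the space produces the bare form.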
@SaulLu I rewrite the converter, this solve the double problem. i.e., the tokenizer.post_processor().
🙌
However, the tokenizerfast seems to tokenize "what's his name?" as ["what","'s","his","name","?"] while the slow tokenizer gives ["what","'s","Ġhis","Ġname","?"]. Seems the space problem caused by trim_offsets?
Can you share the snippet you used to see that please?
Run test_markuplm_integration_test(), and you can compare the difference between encoding_p['input_ids'] and encoding_r['input_ids'] at lines 2291 & 2292. I converted the ids into tokens and got the results above.
Sorry for not using an easier way, like a code snippet.
I see, it's in the integration test
It can do the job 💪
So... we reach a release-ready version?
I think so! @NielsRogge
Thanks so much for your effort!!!! 👏 👏 It was a really nice experience to collaborate with such a nice team, and exciting to see our model appear in this famous repo (a common dream for NLPers, I guess 😆)
tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
- tokenizer.post_processor = processors.TemplateProcessing(
-     single="<s> $A </s>",
-     pair="<s> $A </s> $B </s>",
-     special_tokens=[
-         ("<s>", ot.convert_tokens_to_ids("<s>")),
-         ("</s>", ot.convert_tokens_to_ids("</s>")),
-     ],
- )
+ cls = str(self.original_tokenizer.cls_token)
+ sep = str(self.original_tokenizer.sep_token)
+ cls_token_id = self.original_tokenizer.cls_token_id
+ sep_token_id = self.original_tokenizer.sep_token_id
+ tokenizer.post_processor = processors.TemplateProcessing(
+     single=f"{cls} $A {sep}",
+     pair=f"{cls} $A {sep} $B {sep}",
+     special_tokens=[
+         (cls, cls_token_id),
+         (sep, sep_token_id),
+     ],
+ )
Ok, awesome work :) can you confirm I can merge this now?
I think yes @NielsRogge
yes! 🙌
So I've merged + rebased with the main branch of Transformers, seems like there are 4 tests to be fixed. Running
returns:
@NielsRogge Hi, I have modified the code to tackle these tests. Should another PR be raised?
Yes, you can do that. Note that I've rebased the
What does this PR do?
I have fixed most of the tokenizer tests except test_maximum_encoding_length_pair_input. These tests seem to be copied from LayoutLMv2, but LayoutLMv2 is based on the BERT tokenizer, so some of them may not be suitable for MarkupLM, which is based on the RoBERTa tokenizer. I tried my best to fix both the test part and the tokenizer part, with the necessary annotations left. However, test_maximum_encoding_length_pair_input is beyond my ability; I only fixed the slow tokenizer part of that test. I will be very pleased if this commit is useful!
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?
Who can review?
@NielsRogge Feel free to ask questions or give suggestions for changes.