
Fix most of the tokenizer tests. #41

Merged

Conversation

@lockon-n commented Jun 8, 2022

What does this PR do?

I have fixed most of the tokenizer tests except test_maximum_encoding_length_pair_input.
These tests seem to have been copied from LayoutLMv2, but LayoutLMv2 is based on the BERT tokenizer, so some of them may not be suitable for MarkupLM, which is based on the RoBERTa tokenizer. I tried my best to fix both the test part and the tokenizer part, leaving the necessary comments. However, test_maximum_encoding_length_pair_input is beyond my ability; I only fixed the slow tokenizer part of this test.
I will be very pleased if this commit is useful!

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@NielsRogge Feel free to ask questions or give suggestions for changes.

@HuggingFaceDocBuilderDev commented Jun 8, 2022

The documentation is not available anymore as the PR was closed or merged.

@NielsRogge (Owner)

cc @SaulLu. For information, @lockon-n is the original author of MarkupLM. This means we probably only have a single test to fix before MarkupLM is finished.

@SaulLu commented Jun 14, 2022

Thanks a lot for all this work @lockon-n! I can take a deeper look on Thursday, is that ok?

@wolfshow

Thanks @NielsRogge and @SaulLu! @lockon-n will work with you on this PR.

@SaulLu commented Jun 20, 2022

@lockon-n @wolfshow, I'm really sorry, some emergencies crept into my schedule last week and I didn't have time to look at your PR. I'm putting it on my schedule for tomorrow; sorry for any problems this may cause you.

@SaulLu left a comment

Thank you so much for working on adding MarkupLM 🤗 🙌 ! Integrating the tokenizer and understanding the tests is not a simple thing.

I left a lot of questions in the comments to make sure I understood the motivation behind your changes.

@@ -999,6 +999,34 @@ def post_processor(self):
],
)

class MarkupLMConverter(Converter):
@SaulLu:

Do you remember why Roberta's converter didn't work?

@lockon-n (Author):

I compared RoBERTa's converter to BERT's; it seems RoBERTa's does not add the unknown token to the BPE model.
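
For illustration, a minimal sketch of the converter change implied here, using the tokenizers BPE API (only the BPE-model part is shown; the pre-tokenizer/decoder/post-processor pieces are omitted, and this is an assumption rather than the exact PR code):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers.convert_slow_tokenizer import Converter

class MarkupLMConverter(Converter):
    def converted(self) -> Tokenizer:
        ot = self.original_tokenizer  # the slow MarkupLMTokenizer being converted
        return Tokenizer(
            BPE(
                vocab=ot.encoder,
                merges=list(ot.bpe_ranks.keys()),
                dropout=None,
                continuing_subword_prefix="",
                end_of_word_suffix="",
                fuse_unk=False,
                unk_token=str(ot.unk_token),  # the piece the default RoBERTa converter omits
            )
        )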

@SaulLu:

Thanks! Yes, that's possible, as RoBERTa's is a byte-level BPE (we can't have unknown tokens). Can MarkupLM's tokenizer produce unknown tokens?

@lockon-n (Author):

Well, we assume MarkupLM's tokenizer is exactly the same as RoBERTa's, so unknown tokens shouldn't exist. So the only problem is that this test was originally designed for LayoutLM, which is based on BERT's tokenizer. Maybe we can ignore it.

@SaulLu:

Which test is causing this issue?

@lockon-n (Author) commented Jun 27, 2022

It's test_rust_and_python_full_tokenizers.

As you can see in the setUp() of MarkupLMTokenizationTest, we have an unknown token for the temporary slow tokenizer; however, when using the default RoBERTa converter, the unknown token is removed when building the fast tokenizer. Then ids and rust_ids in lines 1122/1123 (test_tokenization_markuplm.py) are not the same (the first token of ids is '<unk>', while rust_ids does not have it).
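
To make the failure concrete, here is a rough sketch of the comparison that test performs (method names like get_tokenizer/get_rust_tokenizer follow the transformers tokenizer-test mixins; the exact body of the real test may differ):

def test_rust_and_python_full_tokenizers(self):
    tokenizer = self.get_tokenizer()            # slow tokenizer, set up with an <unk> token
    rust_tokenizer = self.get_rust_tokenizer()  # fast tokenizer, built via the converter

    nodes, xpaths = self.get_nodes_and_xpaths()
    ids = tokenizer.encode(nodes, xpaths=xpaths, add_special_tokens=False)
    rust_ids = rust_tokenizer.encode(nodes, xpaths=xpaths, add_special_tokens=False)
    # with the default RoBERTa converter the slow ids start with the <unk> id
    # while rust_ids do not, so this assertion fails
    self.assertListEqual(ids, rust_ids)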

Comment on lines +1353 to +1354
overflowing_xpath_tags_seq = pair_xpath_tags_seq[-window_len:]
overflowing_xpath_subs_seq = pair_xpath_subs_seq[-window_len:]
@SaulLu:

Great catch!

Comment on lines +1394 to +1395
text = [['this', 'is', 'the'], ["how", "are", "you"]]
xpaths = [["html/body"] * 4, ["html/body"] * 3]
@SaulLu:

I think I'm missing something here: I see 3 nodes in the ['this', 'is', 'the'] list. Why do we have 4 values in the corresponding xpaths list?

@lockon-n (Author):

This should be a typo, sorry about that.

@SaulLu:

No problem! If a case where the number of text nodes and xpath nodes differs is not supposed to happen, shouldn't an error be raised? (BTW, I wonder how the tokenizer currently handles this difference.)

@lockon-n (Author):

I think raising an error is enough, or something like an assert len(text) == len(xpaths). In fact, we did not consider such a difference when designing the relevant module, but extra xpaths beyond the length of text are probably just ignored.

inputs = new_tokenizer(text, xpaths=xpaths)
self.assertEqual(len(inputs["input_ids"]), 2)
decoded_input = new_tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
expected_result = "this is the"
expected_result = "thisisthe"  # the original expected result "this is the" seems to contradict the RoBERTa-based tokenizer
@SaulLu:

This behaviour led me at the time to investigate how tokenization was done in the original repository.

For context, in a RoBERTa tokenizer spaces are transformed into Ġ characters and not removed from the tokenization. With RoBERTa's tokenizer, the tokenization of the individual words 'this', 'is', and 'the' would be:

  • with add_prefix_space=False: ['this'], ['is'], and ['the']
  • with add_prefix_space=True: ['Ġthis'], ['Ġis'], and ['Ġthe']
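
For a quick sanity check of this behaviour, one could run the standard roberta-base tokenizer (illustrative snippet, not from this PR):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("is"))                         # ['is']
print(tok.tokenize("is", add_prefix_space=True))  # ['Ġis']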

The spaces being lost and the decoder not artificially adding spaces explain the current behaviour of the decode method. Nevertheless, this led me to 2 questions:

  1. in the original code, did the text in each node start with a space (even if this space was artificially added with an add_prefix_space argument)? In other words, was the tokenization of the nodes [['this'], ['is'], ['the']] or was it [['Ġthis'], ['Ġis'], ['Ġthe']]?
  2. in the case where the text in each node does not start with a space, do we need a custom decoder to reintroduce spaces (is this possible?)? To answer this question I think the most important thing is to know whether MarkupLM can solve generative tasks.

These questions led me to propose a new integration test in PR #37, which is unfortunately based on a very old version of the library and is therefore hard to read now. To create this integration test I investigated the tokenization produced by the authors of MarkupLM on the downstream task of WebSRC. It seems to me that the integration test I proposed to add was this one:

def test_question_websrc(self):
    # This test aims at verifying that we obtain the same tokenization as the one defined by the authors of
    # MarkupLM for the first example of the WebSRC downstream task
    pretrained_name = "microsoft/markuplm-base"
    self.maxDiff = None
    # fmt: off
    # Tokenizer's inputs
    question = "What's the name of the university?"
nodes = ['World', 'League', 'University', 'of', 'Isfahan', 'Iran', 'University', 'of', 'Isfahan,', 'Isfahan', 'Province,', 'Esfahan,', 'Daneshgah', 'Street,', 'Iran', 'World', 'rank', '591', 'Country', 'rank', '7', 'Foundation', 'year:', '1946', 'Short', 'name:', 'UI', 'Type:', 'Public', 'Students:', '12900', 'Faculty:', '640', 'Students/Faculty', 'Ratio:', '20', ':', '1', 'Web-site:', 'ui.ac.ir', 'Region:', 'Asia', 'Location:', 'Isfahan', 'Rankings', 'Rank', 'Score', 'World', 'University', 'Ranking', '591', '40.272', 'Teaching', 'Ranking', '384', '54.226', 'Research', 'Ranking', '642', '30.792', 'International', 'Diversity', 'Ranking', '802', '10.971', 'Financial', 'Sustainability', 'Ranking', '641', '43.231', 'Additional', 'Rankings', 'Rank', 'Score', 'Reputation', 'Rankings', '552', '33.828', 'Academic', 'Rankings', '654', '38.879', 'no', 'yes']
tags = [['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'p'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'p'], ['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'a', 'div', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 
'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216]]
xpaths = ["/" + "/".join(tag) if isinstance(tag[0], str) else "" for tag in tags ]
max_length = 384
# Targets
tokens_target = ['<s>', 'What', "'s", 'Ġthe', 'Ġname', 'Ġof', 'Ġthe', 'Ġuniversity', '?', '</s>', 'World', 'League', 'University', 'of', 'Is', 'f', 'ahan', 'Iran', 'University', 'of', 'Is', 'f', 'ahan', ',', 'Is', 'f', 'ahan', 'Prov', 'ince', ',', 'Es', 'f', 'ahan', ',', 'Dan', 'esh', 'g', 'ah', 'Street', ',', 'Iran', 'World', 'rank', '591', 'Country', 'rank', '7', 'Found', 'ation', 'year', ':', '19', '46', 'Short', 'name', ':', 'UI', 'Type', ':', 'Public', 'Students', ':', '12', '900', 'Fac', 'ulty', ':', '640', 'Students', '/', 'Fac', 'ulty', 'Rat', 'io', ':', '20', ':', '1', 'Web', '-', 'site', ':', 'ui', '.', 'ac', '.', 'ir', 'Region', ':', 'Asia', 'Location', ':', 'Is', 'f', 'ahan', 'Rank', 'ings', 'Rank', 'Score', 'World', 'University', 'R', 'anking', '591', '40', '.', '272', 'Te', 'aching', 'R', 'anking', '384', '54', '.', '226', 'Research', 'R', 'anking', '642', '30', '.', '792', 'International', 'D', 'iversity', 'R', 'anking', '802', '10', '.', '9', '71', 'Financial', 'S', 'ustain', 'ability', 'R', 'anking', '641', '43', '.', '231', 'Additional', 'Rank', 'ings', 'Rank', 'Score', 'Rep', 'utation', 'Rank', 'ings', '552', '33', '.', '8', '28', 'Ac', 'ademic', 'Rank', 'ings', '654', '38', '.', '8', '79', 'no', 'yes', '</s>'] + ['<pad>'] * 216
input_ids_target = [0, 2264, 18, 5, 766, 9, 5, 2737, 116, 2, 10988, 17608, 29972, 1116, 6209, 506, 10048, 21336, 29972, 1116, 6209, 506, 10048, 6, 6209, 506, 10048, 35746, 15062, 6, 15286, 506, 10048, 6, 25887, 4891, 571, 895, 24265, 6, 21336, 10988, 40081, 40284, 36735, 40081, 406, 29991, 1258, 180, 35, 1646, 3761, 34256, 13650, 35, 18157, 40118, 35, 22649, 29413, 35, 1092, 7784, 46166, 38431, 35, 31419, 29413, 73, 46166, 38431, 40855, 1020, 35, 844, 35, 134, 27521, 12, 11953, 35, 3371, 4, 1043, 4, 853, 43575, 35, 20892, 46571, 35, 6209, 506, 10048, 46052, 1033, 46052, 36088, 10988, 29972, 500, 20327, 40284, 1749, 4, 30687, 16215, 13341, 500, 20327, 35324, 4283, 4, 29190, 27004, 500, 20327, 36159, 541, 4, 39739, 29743, 495, 31104, 500, 20327, 37674, 698, 4, 466, 5339, 35028, 104, 26661, 4484, 500, 20327, 39138, 3897, 4, 29949, 8665, 46052, 1033, 46052, 36088, 22026, 31320, 46052, 1033, 35544, 3103, 4, 398, 2517, 26145, 41551, 46052, 1033, 38138, 3170, 4, 398, 5220, 2362, 10932, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    pad_xpath_tag = 216
    pad_xpath_tags = [pad_xpath_tag] * 50
xpath_tags_seq_target = [pad_xpath_tags] * 10 + [[109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 
25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] 
+ [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40] + [pad_xpath_tags]* 219
    # xpath_subs_seq_target = # to add
    # token_type_ids_target = # to add
    # attention_mask_target = # to add
    # fmt: on
    with self.subTest(f"Python version ({pretrained_name})"):
        tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name)
        encoding_p = tokenizer_p(
            question,
            nodes,
            xpaths=xpaths,
            padding="max_length",
            max_length=max_length,
        )
        tokens_p = tokenizer_p.convert_ids_to_tokens(encoding_p["input_ids"])
        self.assertListEqual(tokens_p, tokens_target)
        self.assertListEqual(encoding_p["input_ids"], input_ids_target)
        self.assertListEqual(encoding_p["xpath_tags_seq"], xpath_tags_seq_target)
        # add the other keys
    with self.subTest(f"Rust version ({pretrained_name})"):
        tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name)
        encoding_r = tokenizer_r(
            question,
            nodes,
            xpaths=xpaths,
            padding="max_length",
            max_length=max_length,
        )
        tokens_r = tokenizer_r.convert_ids_to_tokens(encoding_r["input_ids"])
        self.assertListEqual(tokens_r, tokens_target)
        self.assertListEqual(encoding_r["input_ids"], input_ids_target)
        self.assertListEqual(encoding_r["xpath_tags_seq"], xpath_tags_seq_target)
        # add the other keys

In addition to the problems I had identified in the PR message, the targets defined in the test allowed me to conclude that there was no space added at the beginning of each node (which would answer my first question above). But you would surely know better than I what you did in your original code and what should be our ground truth!

@lockon-n (Author):

For question 1, the tokenization should be [['this'], ['is'], ['the']]. For question 2, MarkupLM cannot currently handle generative tasks, and I think a custom decoder is not necessary.

Frankly speaking, we did not think much about the "prefix space". For each node, we first call strip() to remove possible leading/trailing spaces, and all that is left is to use a default RoBERTa tokenizer to tokenize the span and generate tokens.
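
A hypothetical helper summarizing that per-node flow (the function name is made up for illustration):

from transformers import RobertaTokenizer

def tokenize_node(tokenizer, node_text):
    # strip possible leading/trailing spaces, then apply the plain RoBERTa tokenizer
    return tokenizer.tokenize(node_text.strip())

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenize_node(tok, " this "))  # ['this'] -- no Ġ prefix, the surrounding spaces are gone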

@SaulLu:

Very clear to me!

As MarkupLM can't do generative tasks, I think we should emit a warning message when the decode method is called, saying that decoding is experimental and subject to change.
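
A sketch of what such a warning could look like (the subclassing and wording here are assumptions, not the final PR code):

import logging

from transformers import RobertaTokenizer

logger = logging.getLogger(__name__)

class MarkupLMTokenizer(RobertaTokenizer):  # simplified for illustration
    def decode(self, token_ids, **kwargs):
        logger.warning(
            "Decoding is experimental and subject to change: spaces between nodes are "
            "stripped during encoding, so decode() may not round-trip the input."
        )
        return super().decode(token_ids, **kwargs)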

@lockon-n (Author):

Sure, that's exactly what we want :)

Comment on lines 1662 to 1678
# Ensure consistency
output_txt = tokenizer.decode(toks_ids, clean_up_tokenization_spaces=False)
# an extra blank will cause inconsistency: ["a","b",] & "a b"
'''
if " " not in output_txt and len(toks_ids) > 1:
    output_txt = (
        tokenizer.decode([toks_ids[0]], clean_up_tokenization_spaces=False)
        + " "
        + tokenizer.decode(toks_ids[1:], clean_up_tokenization_spaces=False)
    )
'''
if with_prefix_space:
    output_txt = " " + output_txt
nodes = output_txt.split(" ")
xpaths = ["html/body" for i in range(len(nodes))]
output_ids = tokenizer.encode(nodes, xpaths=xpaths, add_special_tokens=False)

return nodes, xpaths, output_ids
@SaulLu:

For this automatic script, I wonder if what is currently generated is the most useful. Currently nodes and xpaths are both composed of a single element, ['lowerstidner'] and ['html/body'] respectively. Wouldn't it be more useful to have an example here with multiple nodes and xpaths?

@lockon-n (Author):

I'm quite confused by this question; could you give more details?

@SaulLu:

Since in most use cases the input will have several nodes, I think we should also have several nodes in the example returned here. Indeed, by not having several nodes one might not be able to spot potential problems.

For example: ['lowerstidner', 'lower', 'lowersti'] and ['html/body', 'html/body/div', 'html/body/div/table']

@lockon-n (Author):

Yes, having multiple nodes is more common, and it's better to have an example like that. But I think it makes no difference if we only consider the xpath converter part.

I remember that the corresponding test in LayoutLM can generate multiple nodes, so this may also be a 'bert2roberta' problem.

@SaulLu:

It is possible, I think in this case it is ok to override this method to propose a sufficiently adversarial example. 😊

@@ -2216,7 +2215,7 @@ def test_markuplm_integration_test(self):
nodes, xpaths = self.get_nodes_and_xpaths()

# fmt: off
expected_results = {'input_ids': [101, 1037, 6881, 2135, 3231, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'bbox': [[0, 0, 0, 0], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} # noqa: E231
expected_results = {'input_ids': [0, 42891, 8331, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'xpath_tags_seq': [[216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [109, 25, 50, 120, 50, 178, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [109, 25, 50, 120, 50, 178, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 
216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216]], 'xpath_subs_seq': [[1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [109, 25, 50, 120, 50, 178, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [109, 25, 50, 120, 50, 178, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001]], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
@SaulLu:

Thanks for generating the right targets! Out of curiosity, what was your workflow for generating these targets (so that I know how to do it if we want to redo this for other examples in the future)?

@lockon-n (Author) commented Jun 22, 2022

As you can see, the original expected results are for LayoutLM (e.g., bbox), not MarkupLM.
To generate them, just use a RoBERTa tokenizer to tokenize the inputs and apply some padding. The xpath tags/subscripts part comes from the get_xpath_seq method in MarkupLMTokenizer.
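
For illustration, a condensed sketch of that workflow (the single-</s> template and the truncated node list are assumptions based on the targets above):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")  # MarkupLM reuses this vocabulary

question = "What's the name of the university?"
nodes = ["World", "League", "University"]  # truncated; the real test uses the full node list

question_ids = tok(question, add_special_tokens=False)["input_ids"]
node_ids = [i for node in nodes for i in tok(node, add_special_tokens=False)["input_ids"]]

# <s> question </s> nodes </s>, then pad to max_length
max_length = 384
input_ids_target = [tok.cls_token_id] + question_ids + [tok.sep_token_id] + node_ids + [tok.sep_token_id]
input_ids_target += [tok.pad_token_id] * (max_length - len(input_ids_target))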

@SaulLu commented Jun 27, 2022

Thank you very much for your feedback. I wonder if there might be a way to get the targets out of the original code.

The reason I'm wondering this is that, while investigating the WebSRC downstream task, I noticed that the template (= how special tokens are added at the beginning of a sentence, at the separation of two sentences, and at the end of a sentence) for the pair of sentences (question and textual nodes) was not identical to Roberta's. From what I understand:

  • Roberta's template: '<s>', tokens from sentence 1, '</s>', '</s>', tokens from sentence 2, '</s>'
  • WebSRC MarkupLM's template: '<s>', tokens from sentence 1, '</s>', tokens from sentence 2, '</s>'
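
RoBERTa's pair template can be checked directly with the standard roberta-base tokenizer (illustrative snippet, not from this PR):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
ids = tok("a", "b")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['<s>', 'a', '</s>', '</s>', 'b', '</s>']  <- two consecutive '</s>' between the pair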

But maybe in the original code several templates were also used depending on the task, or I didn't understand something...

So, in this example I wonder if the target for get_question_nodes_and_xpaths and get_question_nodes_and_xpaths_batch is indeed a target representative of the modeling proposed in MarkupLM.

@lockon-n (Author):

Oh I get your point.

The current targets are wrong; what we want is WebSRC's template, which means the element related to the second </s> should be removed and a padding element should be added in 'input_ids', 'xpath_tags_seq', 'xpath_subs_seq' and 'attention_mask'. We leave 'token_type_ids' alone as it is always all-zero. Sorry for this mistake.

The source code for WebSRC is BERT-style, and the early version of MarkupLM was also BERT-style. In other words, there is only one sep token in between. This behaviour was kept after we switched to RoBERTa's tokenizer. I have to admit this was an oversight on our part: we did not know at the time that the standard RoBERTa tokenizer inserts two consecutive sep tokens. As a result, our pre-training was also conducted in this manner (one sep), so the template in fine-tuning should be of the same kind.

To fix this issue, simply removing the second </s> (and its padding entry in the xpath tags/subs) in MarkupLMTokenizer is enough.
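
A minimal sketch of that fix, following the transformers slow-tokenizer convention of overriding build_inputs_with_special_tokens (illustrative rather than the final PR code):

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
    # MarkupLM pre-training used a single </s> between the pair, unlike RoBERTa's </s></s>
    return (
        [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        + token_ids_1 + [self.sep_token_id]
    )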

@SaulLu:

Thank you very much for the confirmation! There are really no worries!

Unfortunately, this means that for the fast tokenizer we have to wait for a new feature that should arrive soon in the tokenizers library, so that we can combine two post-processors: the byte-level one and the template one (which is not the one from RoBERTa).

@lockon-n (Author)

@SaulLu Hi there. I have tried my best to answer the questions. Feel free to raise more lol.

@lockon-n (Author)

@SaulLu More feedback updated. Looking forward to your reply (and, possibly, more questions). 😄

@NielsRogge (Owner)

Hi @lockon-n @SaulLu, what's the status of MarkupLM? It seems like only a small number of things need to be resolved to finish the model.

@lockon-n (Author) commented Aug 29, 2022

Hi @NielsRogge, I think MarkupLM is almost done, except that MarkupLMTokenizer will insert two </s> tokens between the question and the page context, so we just need to delete the second one. But the fast tokenizer may need to wait for some new features, as @SaulLu said.

@wolfshow commented Sep 9, 2022

Hi @SaulLu and @NielsRogge, how is the fast tokenizer progressing?

@SaulLu commented Sep 9, 2022

Hi @NielsRogge, @lockon-n and @wolfshow,

As the feature for fast tokenizers is not ready yet, what we can do is add the fast version of MarkupLM with partial availability: the only case that will not work is if trim_offsets is set to True (which is currently the default behavior, but I don't think it's really necessary). As soon as the feature is completed in tokenizers, we can add this functionality back.

I have opened a PR showing this case here: lockon-n#1

I think this is a good compromise to add MarkupLM to transformers. If we want to use this workaround, the tokenizer.json file on the Hub will need to be updated too.

proposal of a fix for the MarkupLM fast tokenizer
@lockon-n (Author) commented Sep 9, 2022

Hi @NielsRogge, I have merged the PR proposed by @SaulLu. I think MarkupLM is ready for transformers now.

@SaulLu commented Sep 9, 2022

Great! 🙌

I just think we also need to add an integration test for the fast tokenizer and change the template for the slow tokenizer with the question 😄 (and add the warning discussed in this message: #41 (comment))

@lockon-n (Author) commented Sep 9, 2022

@SaulLu Issues have been fixed in the latest update. The integration test for the fast tokenizer fails (the double </s> problem) with text-pair encoding; is that expected? I have set trim_offsets to False by default.

def test_markuplm_integration_test(self):

    tokenizer_p = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
    # no fast tokenizer for now
    # tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
    tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
@SaulLu:

@lockon-n , can you re-try with this change? 🙏

Suggested change
tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base", from_slow=True)

If it solves the issue, it's just because the tokenizer.json file at https://huggingface.co/microsoft/markuplm-base needs to be updated

@lockon-n (Author):

It still fails :(

@SaulLu:

Oh 😞

Could this be coming from the xpath_tags_seq list?

I have trouble remembering how the nodes and xpaths need to be tied. In particular, I don't remember why line 102 is xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"] and not xpaths = ["/html/body/div/li[1]/div/span"]

def get_question_nodes_and_xpaths(self):
    question = "what's his name?"
    nodes = ["hello world"]
    xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"]
    return question, nodes, xpaths

def get_question_nodes_and_xpaths_batch(self):
    questions = ["what's his name?", "how is he called?"]
    nodes = [["hello world", "running"], ["hello my name is bob"]]
    xpaths = [
        ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"],
        ["/html/body/div/li[2]/div/span"],
    ]

@lockon-n (Author):

This seems to be a typo, but it does not lead to mistakes. The second xpath is discarded during processing, I guess.

Comment on lines 1030 to 1038

tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
@SaulLu:

Oh, I'm sorry, the change on this file slipped through the cracks when I committed the files... but indeed this is the fix!

@lockon-n (Author) commented Sep 9, 2022

@SaulLu I rewrote the converter, and this solves the double </s> problem (i.e., the tokenizer.post_processor). However, the fast tokenizer seems to tokenize "what's his name?" as ["what","'s","his","name","?"], while the slow tokenizer and the ground truth give ["what","'s","Ġhis","Ġname","?"]. Could this be the space problem caused by trim_offsets?

@SaulLu:

@SaulLu I rewrote the converter, and this solves the double </s> problem (i.e., the tokenizer.post_processor).

🙌

However, the fast tokenizer seems to tokenize "what's his name?" as ["what","'s","his","name","?"], while the slow tokenizer and the ground truth give ["what","'s","Ġhis","Ġname","?"]. Could this be the space problem caused by trim_offsets?

Can you share the snippet you used to see that, please?

@lockon-n (Author) commented Sep 9, 2022

Run test_markuplm_integration_test() and compare the difference between encoding_p['input_ids'] and encoding_r['input_ids'] in lines 2291 and 2292. I converted the ids into tokens and got the results above.
Sorry for not using an easier way, like a code snippet.

@SaulLu:

I see, it's in the integration test

@SaulLu:

It can do the job 💪

@lockon-n (Author):

So... have we reached a release-ready version?

@SaulLu:

I think so! @NielsRogge

@lockon-n (Author):

Thanks so much for your effort!!!! 👏 👏 It was a really nice experience to collaborate with such a nice team, and we are excited to see our model appear in this famous repo (a common dream for NLPers, I guess 😆).

@wolfshow:

Fantastic! @SaulLu @lockon-n

Comment on lines 1031 to 1038
tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
@SaulLu:

Suggested change
tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
cls = str(self.original_tokenizer.cls_token)
sep = str(self.original_tokenizer.sep_token)
cls_token_id = self.original_tokenizer.cls_token_id
sep_token_id = self.original_tokenizer.sep_token_id
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"{cls} $A {sep}",
    pair=f"{cls} $A {sep} $B {sep}",
    special_tokens=[
        (cls, cls_token_id),
        (sep, sep_token_id),
    ],
)

@NielsRogge (Owner)

Ok, awesome work :) can you confirm I can merge this now?

@lockon-n (Author)

I think yes @NielsRogge

@SaulLu commented Sep 12, 2022

can you confirm I can merge this now?

yes! 🙌

@NielsRogge NielsRogge merged commit a555fd7 into NielsRogge:modeling_markuplm_bis Sep 12, 2022
@NielsRogge (Owner) commented Sep 12, 2022

So I've merged and rebased with the main branch of Transformers; it seems there are 4 tests left to fix. Running

RUN_SLOW=yes pytest tests/models/markuplm/test_tokenization_markuplm.py

returns:

FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_max_length_equal - AssertionError: 508 != 509
FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_num_special_tokens_to_add_equal - AssertionError: 4 != 3
FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_padding - AssertionError: Sequences differ: [1, 1[52 chars] 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_padding_warning_message_fast_tokenizer - ValueError: Nodes must be of type `List[str]` ...

@lockon-n (Author)

@NielsRogge Hi, I have modified the code to fix these tests. Should I open another PR?

@NielsRogge (Owner)

Yes, you can do that. Note that I've rebased the modeling_markuplm_bis branch with the main branch of Transformers.

@SaulLu commented Sep 21, 2022

FYI: huggingface#19139 (review)
