
Fix most of the tokenizer tests. #41

Merged

Conversation

@lockon-n commented Jun 8, 2022

What does this PR do?

I have fixed most of the tokenizer tests except test_maximum_encoding_length_pair_input.
These tests seem to have been copied from LayoutLMv2, but LayoutLMv2 is based on the BERT tokenizer, so some of them may not be suitable for MarkupLM, which is based on the RoBERTa tokenizer. I tried my best to fix both the test part and the tokenizer part, leaving the necessary comments. However, test_maximum_encoding_length_pair_input is beyond my ability; I only fixed the slow tokenizer part of this test.
I will be very pleased if this commit is useful!

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@NielsRogge Feel free to ask questions or give suggestions for changes.

@HuggingFaceDocBuilderDev commented Jun 8, 2022

The documentation is not available anymore as the PR was closed or merged.

@NielsRogge (Owner)

cc @SaulLu. For information, @lockon-n is the original author of MarkupLM. This means we probably only have a single test to fix before MarkupLM is finished.

@SaulLu commented Jun 14, 2022

Thanks a lot for all this work @lockon-n! I can take a deeper look on Thursday, is that ok?

@wolfshow

Thanks @NielsRogge and @SaulLu! @lockon-n will work with you on this PR.

@SaulLu commented Jun 20, 2022

@lockon-n @wolfshow, I'm really sorry, some emergencies crept into my schedule last week and I didn't have time to look at your PR. I'm putting it on my schedule for tomorrow; sorry for any problems this may cause you.

@SaulLu left a comment

Thank you so much for working on adding MarkupLM 🤗 🙌 ! Integrating the tokenizer and understanding the tests is not a simple thing.

I left a lot of questions in the comments to make sure I understood the motivation behind your changes.

@@ -999,6 +999,34 @@ def post_processor(self):
],
)

class MarkupLMConverter(Converter):
@SaulLu:

Do you remember why Roberta's converter didn't work?

@lockon-n (Author):

I compared RoBERTa's converter to BERT's; it seems RoBERTa's does not add the unknown token to the BPE model.
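
For illustration, a minimal sketch of the converter change implied here, using the tokenizers BPE API (only the BPE-model part is shown; the pre-tokenizer/decoder/post-processor pieces are omitted, and this is an assumption rather than the exact PR code):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers.convert_slow_tokenizer import Converter

class MarkupLMConverter(Converter):
    def converted(self) -> Tokenizer:
        ot = self.original_tokenizer  # the slow MarkupLMTokenizer being converted
        return Tokenizer(
            BPE(
                vocab=ot.encoder,
                merges=list(ot.bpe_ranks.keys()),
                dropout=None,
                continuing_subword_prefix="",
                end_of_word_suffix="",
                fuse_unk=False,
                unk_token=str(ot.unk_token),  # the piece the default RoBERTa converter omits
            )
        )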

@SaulLu:

Thanks! Yes, that's possible, as RoBERTa's is a byte-level BPE (we can't have unknown tokens). Can MarkupLM's tokenizer produce unknown tokens?

@lockon-n (Author):

Well, we assume MarkupLM's tokenizer is exactly the same as RoBERTa's, so unknown tokens shouldn't exist. So the only problem is that this test was originally designed for LayoutLM, which is based on BERT's tokenizer. Maybe we can ignore it.

@SaulLu:

Which test is causing this issue?

@lockon-n (Author) commented Jun 27, 2022

It's test_rust_and_python_full_tokenizers.

As you can see in the setUp() of MarkupLMTokenizationTest, we have an unknown token for the temporary slow tokenizer; however, when using the default RoBERTa converter, the unknown token is removed when building the fast tokenizer. Then ids and rust_ids in lines 1122/1123 (test_tokenization_markuplm.py) are not the same (the first token of ids is '<unk>', while rust_ids does not have it).
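
To make the failure concrete, here is a rough sketch of the comparison that test performs (method names like get_tokenizer/get_rust_tokenizer follow the transformers tokenizer-test mixins; the exact body of the real test may differ):

def test_rust_and_python_full_tokenizers(self):
    tokenizer = self.get_tokenizer()            # slow tokenizer, set up with an <unk> token
    rust_tokenizer = self.get_rust_tokenizer()  # fast tokenizer, built via the converter

    nodes, xpaths = self.get_nodes_and_xpaths()
    ids = tokenizer.encode(nodes, xpaths=xpaths, add_special_tokens=False)
    rust_ids = rust_tokenizer.encode(nodes, xpaths=xpaths, add_special_tokens=False)
    # with the default RoBERTa converter the slow ids start with the <unk> id
    # while rust_ids do not, so this assertion fails
    self.assertListEqual(ids, rust_ids)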

Comment on lines +1353 to +1354
overflowing_xpath_tags_seq = pair_xpath_tags_seq[-window_len:]
overflowing_xpath_subs_seq = pair_xpath_subs_seq[-window_len:]
@SaulLu:

Great catch!

Comment on lines +1394 to +1395
text = [['this', 'is', 'the'], ["how", "are", "you"]]
xpaths = [["html/body"] * 4, ["html/body"] * 3]
@SaulLu:

I think I'm missing something here: I see 3 nodes in the ['this', 'is', 'the'] list. Why do we have 4 values in the corresponding xpaths list?

@lockon-n (Author):

This should be a typo, sorry about that.

@SaulLu:

No problem! If a case where the number of text nodes and xpath nodes differs is not supposed to happen, shouldn't an error be raised? (BTW, I wonder how the tokenizer currently handles this difference.)

@lockon-n (Author):

I think raising an error is enough, or something like an assert len(text) == len(xpaths). In fact, we did not consider such a difference when designing the relevant module, but extra xpaths beyond the length of text are probably just ignored.

inputs = new_tokenizer(text, xpaths=xpaths)
self.assertEqual(len(inputs["input_ids"]), 2)
decoded_input = new_tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
expected_result = "this is the"
expected_result = "thisisthe"  # the original expected result "this is the" seems to contradict the RoBERTa-based tokenizer
@SaulLu:

This behaviour led me at the time to investigate how tokenization was done in the original repository.

For context, in a RoBERTa tokenizer spaces are transformed into Ġ characters and not removed from the tokenization. With RoBERTa's tokenizer, the tokenization of the individual words 'this', 'is', and 'the' would be:

  • with add_prefix_space=False: ['this'], ['is'], and ['the']
  • with add_prefix_space=True: ['Ġthis'], ['Ġis'], and ['Ġthe']
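
For a quick sanity check of this behaviour, one could run the standard roberta-base tokenizer (illustrative snippet, not from this PR):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("is"))                         # ['is']
print(tok.tokenize("is", add_prefix_space=True))  # ['Ġis']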

The spaces being lost and the decoder not artificially adding spaces explain the current behaviour of the decode method. Nevertheless, this led me to 2 questions:

  1. in the original code, did the text in each node start with a space (even if this space was artificially added with an add_prefix_space argument)? In other words, was the tokenization of the nodes [['this'], ['is'], ['the']] or was it [['Ġthis'], ['Ġis'], ['Ġthe']]?
  2. in the case where the text in each node does not start with a space, do we need a custom decoder to reintroduce spaces (is this possible?)? To answer this question I think the most important thing is to know whether MarkupLM can solve generative tasks.

These questions led me to propose a new integration test in PR #37, which is unfortunately based on a very old version of the library and is therefore hard to read now. To create this integration test I investigated the tokenization produced by the authors of MarkupLM on the downstream task of WebSRC. It seems to me that the integration test I proposed to add was this one:

def test_question_websrc(self):
    # This test aims at verifying that we obtain the same tokenization as the one defined by the authors of
    # MarkupLM for the first example of the WebSRC downstream task
    pretrained_name = "microsoft/markuplm-base"
    self.maxDiff = None
    # fmt: off
    # Tokenizer's inputs
    question = "What's the name of the university?"
nodes = ['World', 'League', 'University', 'of', 'Isfahan', 'Iran', 'University', 'of', 'Isfahan,', 'Isfahan', 'Province,', 'Esfahan,', 'Daneshgah', 'Street,', 'Iran', 'World', 'rank', '591', 'Country', 'rank', '7', 'Foundation', 'year:', '1946', 'Short', 'name:', 'UI', 'Type:', 'Public', 'Students:', '12900', 'Faculty:', '640', 'Students/Faculty', 'Ratio:', '20', ':', '1', 'Web-site:', 'ui.ac.ir', 'Region:', 'Asia', 'Location:', 'Isfahan', 'Rankings', 'Rank', 'Score', 'World', 'University', 'Ranking', '591', '40.272', 'Teaching', 'Ranking', '384', '54.226', 'Research', 'Ranking', '642', '30.792', 'International', 'Diversity', 'Ranking', '802', '10.971', 'Financial', 'Sustainability', 'Ranking', '641', '43.231', 'Additional', 'Rankings', 'Rank', 'Score', 'Reputation', 'Rankings', '552', '33.828', 'Academic', 'Rankings', '654', '38.879', 'no', 'yes']
tags = [['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'p'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'p'], ['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'div'], ['html', 'body', 'div', 'div', 'div', 'div', 'div', 'span'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'a', 'div', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dd'], ['html', 'body', 'div', 'div', 'div', 'div', 'dl', 'dt'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'th'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 
'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], ['html', 'body', 'div', 'div', 'div', 'div', 'table', 'tbody', 'tr', 'td'], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216]]
xpaths = ["/" + "/".join(tag) if isinstance(tag[0], str) else "" for tag in tags ]
max_length = 384
# Targets
tokens_target = ['<s>', 'What', "'s", 'Ġthe', 'Ġname', 'Ġof', 'Ġthe', 'Ġuniversity', '?', '</s>', 'World', 'League', 'University', 'of', 'Is', 'f', 'ahan', 'Iran', 'University', 'of', 'Is', 'f', 'ahan', ',', 'Is', 'f', 'ahan', 'Prov', 'ince', ',', 'Es', 'f', 'ahan', ',', 'Dan', 'esh', 'g', 'ah', 'Street', ',', 'Iran', 'World', 'rank', '591', 'Country', 'rank', '7', 'Found', 'ation', 'year', ':', '19', '46', 'Short', 'name', ':', 'UI', 'Type', ':', 'Public', 'Students', ':', '12', '900', 'Fac', 'ulty', ':', '640', 'Students', '/', 'Fac', 'ulty', 'Rat', 'io', ':', '20', ':', '1', 'Web', '-', 'site', ':', 'ui', '.', 'ac', '.', 'ir', 'Region', ':', 'Asia', 'Location', ':', 'Is', 'f', 'ahan', 'Rank', 'ings', 'Rank', 'Score', 'World', 'University', 'R', 'anking', '591', '40', '.', '272', 'Te', 'aching', 'R', 'anking', '384', '54', '.', '226', 'Research', 'R', 'anking', '642', '30', '.', '792', 'International', 'D', 'iversity', 'R', 'anking', '802', '10', '.', '9', '71', 'Financial', 'S', 'ustain', 'ability', 'R', 'anking', '641', '43', '.', '231', 'Additional', 'Rank', 'ings', 'Rank', 'Score', 'Rep', 'utation', 'Rank', 'ings', '552', '33', '.', '8', '28', 'Ac', 'ademic', 'Rank', 'ings', '654', '38', '.', '8', '79', 'no', 'yes', '</s>'] + ['<pad>'] * 216
input_ids_target = [0, 2264, 18, 5, 766, 9, 5, 2737, 116, 2, 10988, 17608, 29972, 1116, 6209, 506, 10048, 21336, 29972, 1116, 6209, 506, 10048, 6, 6209, 506, 10048, 35746, 15062, 6, 15286, 506, 10048, 6, 25887, 4891, 571, 895, 24265, 6, 21336, 10988, 40081, 40284, 36735, 40081, 406, 29991, 1258, 180, 35, 1646, 3761, 34256, 13650, 35, 18157, 40118, 35, 22649, 29413, 35, 1092, 7784, 46166, 38431, 35, 31419, 29413, 73, 46166, 38431, 40855, 1020, 35, 844, 35, 134, 27521, 12, 11953, 35, 3371, 4, 1043, 4, 853, 43575, 35, 20892, 46571, 35, 6209, 506, 10048, 46052, 1033, 46052, 36088, 10988, 29972, 500, 20327, 40284, 1749, 4, 30687, 16215, 13341, 500, 20327, 35324, 4283, 4, 29190, 27004, 500, 20327, 36159, 541, 4, 39739, 29743, 495, 31104, 500, 20327, 37674, 698, 4, 466, 5339, 35028, 104, 26661, 4484, 500, 20327, 39138, 3897, 4, 29949, 8665, 46052, 1033, 46052, 36088, 22026, 31320, 46052, 1033, 35544, 3103, 4, 398, 2517, 26145, 41551, 46052, 1033, 38138, 3170, 4, 398, 5220, 2362, 10932, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    pad_xpath_tag = 216
    pad_xpath_tags = [pad_xpath_tag] * 50
xpath_tags_seq_target = [pad_xpath_tags] * 10 + [[109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 148] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50, 178] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 50] + [pad_xpath_tag] * 43, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 
25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 0, 50, 52] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 42] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 51, 52] + [pad_xpath_tag] * 42, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] 
+ [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 197] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40, [109, 25, 50, 50, 50, 50, 189, 190, 201, 191] + [pad_xpath_tag] * 40] + [pad_xpath_tags]* 219
    # xpath_subs_seq_target = # to add
    # token_type_ids_target = # to add
    # attention_mask_target = # to add
    # fmt: on
    with self.subTest(f"Python version ({pretrained_name})"):
        tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name)
        encoding_p = tokenizer_p(
            question,
            nodes,
            xpaths=xpaths,
            padding="max_length",
            max_length=max_length,
        )
        tokens_p = tokenizer_p.convert_ids_to_tokens(encoding_p["input_ids"])
        self.assertListEqual(tokens_p, tokens_target)
        self.assertListEqual(encoding_p["input_ids"], input_ids_target)
        self.assertListEqual(encoding_p["xpath_tags_seq"], xpath_tags_seq_target)
        # add the other keys
    with self.subTest(f"Rust version ({pretrained_name})"):
        tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name)
        encoding_r = tokenizer_r(
            question,
            nodes,
            xpaths=xpaths,
            padding="max_length",
            max_length=max_length,
        )
        tokens_r = tokenizer_r.convert_ids_to_tokens(encoding_r["input_ids"])
        self.assertListEqual(tokens_r, tokens_target)
        self.assertListEqual(encoding_r["input_ids"], input_ids_target)
        self.assertListEqual(encoding_r["xpath_tags_seq"], xpath_tags_seq_target)
        # add the other keys

In addition to the problems I had identified in the PR message, the targets defined in the test allowed me to conclude that there was no space added at the beginning of each node (which would answer my first question above). But you would surely know better than I what you did in your original code and what should be our ground truth!

@lockon-n (Author):

For question 1, the tokenization should be [['this'], ['is'], ['the']]. For question 2, MarkupLM cannot currently handle generative tasks, and I think a custom decoder is not necessary.

Frankly speaking, we did not think much about the "prefix space". For each node, we first call strip() to remove possible leading/trailing spaces, and all that is left is to use a default RoBERTa tokenizer to tokenize the span and generate tokens.
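
A hypothetical helper summarizing that per-node flow (the function name is made up for illustration):

from transformers import RobertaTokenizer

def tokenize_node(tokenizer, node_text):
    # strip possible leading/trailing spaces, then apply the plain RoBERTa tokenizer
    return tokenizer.tokenize(node_text.strip())

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenize_node(tok, " this "))  # ['this'] -- no Ġ prefix, the surrounding spaces are gone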

@SaulLu:

Very clear to me!

As MarkupLM can't do generative tasks, I think we should emit a warning message when the decode method is called, saying that decoding is experimental and subject to change.
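
A sketch of what such a warning could look like (the subclassing and wording here are assumptions, not the final PR code):

import logging

from transformers import RobertaTokenizer

logger = logging.getLogger(__name__)

class MarkupLMTokenizer(RobertaTokenizer):  # simplified for illustration
    def decode(self, token_ids, **kwargs):
        logger.warning(
            "Decoding is experimental and subject to change: spaces between nodes are "
            "stripped during encoding, so decode() may not round-trip the input."
        )
        return super().decode(token_ids, **kwargs)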

@lockon-n (Author):

Sure, that's exactly what we want :)

Comment on lines 1662 to 1678
# Ensure consistency
output_txt = tokenizer.decode(toks_ids, clean_up_tokenization_spaces=False)
# an extra blank will cause inconsistency: ["a","b",] & "a b"
'''
if " " not in output_txt and len(toks_ids) > 1:
    output_txt = (
        tokenizer.decode([toks_ids[0]], clean_up_tokenization_spaces=False)
        + " "
        + tokenizer.decode(toks_ids[1:], clean_up_tokenization_spaces=False)
    )
'''
if with_prefix_space:
    output_txt = " " + output_txt
nodes = output_txt.split(" ")
xpaths = ["html/body" for i in range(len(nodes))]
output_ids = tokenizer.encode(nodes, xpaths=xpaths, add_special_tokens=False)

return nodes, xpaths, output_ids
@SaulLu:

For this automatic script, I wonder if what is currently generated is the most useful. Currently nodes and xpaths are both composed of a single element, ['lowerstidner'] and ['html/body'] respectively. Wouldn't it be more useful to have an example here with multiple nodes and xpaths?

@lockon-n (Author):

I'm quite confused by this question; could you give more details?

@SaulLu:

Since in most use cases the input will have several nodes, I think we should also have several nodes in the example returned here. Indeed, by not having several nodes one might not be able to spot potential problems.

For example: ['lowerstidner', 'lower', 'lowersti'] and ['html/body', 'html/body/div', 'html/body/div/table']

@lockon-n (Author):

Yes, having multiple nodes is more common, and it's better to have an example like that. But I think it makes no difference if we only consider the xpath converter part.

I remember that the corresponding test in LayoutLM can generate multiple nodes, so this may also be a 'bert2roberta' problem.

@SaulLu:

It is possible, I think in this case it is ok to override this method to propose a sufficiently adversarial example. 😊

@@ -2216,7 +2215,7 @@ def test_markuplm_integration_test(self):
nodes, xpaths = self.get_nodes_and_xpaths()

# fmt: off
expected_results = {'input_ids': [101, 1037, 6881, 2135, 3231, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'bbox': [[0, 0, 0, 0], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} # noqa: E231
expected_results = {'input_ids': [0, 42891, 8331, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'xpath_tags_seq': [[216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [109, 25, 50, 120, 50, 178, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [109, 25, 50, 120, 50, 178, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 
216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216], [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216]], 'xpath_subs_seq': [[1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [109, 25, 50, 120, 50, 178, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [109, 25, 50, 120, 50, 178, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001], [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001]], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
@SaulLu:

Thanks for generating the right targets! Out of curiosity, what was your workflow for generating these targets (so that I know how to do it if we want to redo this for other examples in the future)?

@lockon-n (Author) commented Jun 22, 2022

As you can see, the original expected results are for LayoutLM (e.g., bbox), not MarkupLM.
To generate them, just use a RoBERTa tokenizer to tokenize the inputs and apply some padding. The xpath tags/subscripts part comes from the get_xpath_seq method in MarkupLMTokenizer.
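
For illustration, a condensed sketch of that workflow (the single-</s> template and the truncated node list are assumptions based on the targets above):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")  # MarkupLM reuses this vocabulary

question = "What's the name of the university?"
nodes = ["World", "League", "University"]  # truncated; the real test uses the full node list

question_ids = tok(question, add_special_tokens=False)["input_ids"]
node_ids = [i for node in nodes for i in tok(node, add_special_tokens=False)["input_ids"]]

# <s> question </s> nodes </s>, then pad to max_length
max_length = 384
input_ids_target = [tok.cls_token_id] + question_ids + [tok.sep_token_id] + node_ids + [tok.sep_token_id]
input_ids_target += [tok.pad_token_id] * (max_length - len(input_ids_target))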

@SaulLu commented Jun 27, 2022

Thank you very much for your feedback. I wonder if there might be a way to get the targets out of the original code.

The reason I'm wondering this is that, while investigating the WebSRC downstream task, I noticed that the template (= how special tokens are added at the beginning of a sentence, at the separation of two sentences, and at the end of a sentence) for the pair of sentences (question and textual nodes) was not identical to Roberta's. From what I understand:

  • Roberta's template: '<s>', tokens from sentence 1, '</s>', '</s>', tokens from sentence 2, '</s>'
  • WebSRC MarkupLM's template: '<s>', tokens from sentence 1, '</s>', tokens from sentence 2, '</s>'
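
RoBERTa's pair template can be checked directly with the standard roberta-base tokenizer (illustrative snippet, not from this PR):

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
ids = tok("a", "b")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['<s>', 'a', '</s>', '</s>', 'b', '</s>']  <- two consecutive '</s>' between the pair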

But maybe in the original code several templates were also used depending on the task, or I didn't understand something...

So, in this example I wonder if the target for get_question_nodes_and_xpaths and get_question_nodes_and_xpaths_batch is indeed a target representative of the modeling proposed in MarkupLM.

@lockon-n (Author):

Oh I get your point.

The current targets are wrong; what we want is WebSRC's template, which means the element related to the second </s> should be removed and a padding element should be added in 'input_ids', 'xpath_tags_seq', 'xpath_subs_seq' and 'attention_mask'. We leave 'token_type_ids' alone as it is always all-zero. Sorry for this mistake.

The source code for WebSRC is BERT-style, and the early version of MarkupLM was also BERT-style. In other words, there is only one sep token in between. This behaviour was kept after we switched to RoBERTa's tokenizer. I have to admit this was an oversight on our part: we did not know at the time that the standard RoBERTa tokenizer inserts two consecutive sep tokens. As a result, our pre-training was also conducted in this manner (one sep), so the template in fine-tuning should be of the same kind.

To fix this issue, simply removing the second </s> (and its padding entry in the xpath tags/subs) in MarkupLMTokenizer is enough.
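
A minimal sketch of that fix, following the transformers slow-tokenizer convention of overriding build_inputs_with_special_tokens (illustrative rather than the final PR code):

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
    # MarkupLM pre-training used a single </s> between the pair, unlike RoBERTa's </s></s>
    return (
        [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        + token_ids_1 + [self.sep_token_id]
    )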

@SaulLu:

Thank you very much for the confirmation! There are really no worries!

Unfortunately, this means that for the fast tokenizer we have to wait for a new feature that should arrive soon in the tokenizers library, so that we can combine two post-processors: the byte-level one and the template one (which is not the one from RoBERTa).

@lockon-n (Author)

@SaulLu Hi there. I have tried my best to answer the questions. Feel free to raise more lol.

@lockon-n (Author)

@SaulLu More feedback updated. Looking forward to your reply (and, possibly, more questions). 😄

@NielsRogge (Owner)

Hi @lockon-n @SaulLu, what's the status of MarkupLM? It seems like only a small number of things need to be resolved to finish the model.

@lockon-n (Author) commented Aug 29, 2022

Hi @NielsRogge, I think MarkupLM is almost done, except that MarkupLMTokenizer will insert two </s> tokens between the question and the page context, so we just need to delete the second one. But the fast tokenizer may need to wait for some new features, as @SaulLu said.

@wolfshow commented Sep 9, 2022

Hi @SaulLu and @NielsRogge, how is the fast tokenizer progressing?

@SaulLu commented Sep 9, 2022

Hi @NielsRogge, @lockon-n and @wolfshow,

As the feature for fast tokenizers is not ready yet, what we can do is add the fast version of MarkupLM with partial availability: the only case that will not work is if trim_offsets is set to True (which is currently the default behavior, but I don't think it's really necessary). As soon as the feature is completed in tokenizers, we can add this functionality back.

I have opened a PR showing this case here: lockon-n#1

I think this is a good compromise to add MarkupLM to transformers. If we want to use this workaround, the tokenizer.json file on the Hub will need to be updated too.

proposal of a fix for the MarkupLM fast tokenizer
@lockon-n (Author) commented Sep 9, 2022

Hi @NielsRogge, I have merged the PR proposed by @SaulLu. I think MarkupLM is ready for transformers now.

@SaulLu commented Sep 9, 2022

Great! 🙌

I just think we also need to add an integration test for the fast tokenizer and change the template for the slow tokenizer with the question 😄 (and add the warning discussed in this message: #41 (comment))

@lockon-n (Author) commented Sep 9, 2022

@SaulLu Issues have been fixed in the latest update. The integration test for the fast tokenizer fails (the double </s> problem) with text-pair encoding; is that expected? I have set trim_offsets to False by default.

def test_markuplm_integration_test(self):

    tokenizer_p = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
    # no fast tokenizer for now
    # tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
    tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
@SaulLu:

@lockon-n , can you re-try with this change? 🙏

Suggested change
tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
tokenizer_r = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base", from_slow=True)

If it solves the issue, it's just because the tokenizer.json file at https://huggingface.co/microsoft/markuplm-base needs to be updated

@lockon-n (Author):

It still fails :(

@SaulLu:

Oh 😞

Could this be coming from the xpath_tags_seq list?

I have trouble remembering how the nodes and xpaths need to be tied. In particular, I don't remember why line 102 is xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"] and not xpaths = ["/html/body/div/li[1]/div/span"]

def get_question_nodes_and_xpaths(self):
    question = "what's his name?"
    nodes = ["hello world"]
    xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"]
    return question, nodes, xpaths

def get_question_nodes_and_xpaths_batch(self):
    questions = ["what's his name?", "how is he called?"]
    nodes = [["hello world", "running"], ["hello my name is bob"]]
    xpaths = [
        ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span"],
        ["/html/body/div/li[2]/div/span"],
    ]

@lockon-n (Author):

This seems to be a typo, but it does not lead to mistakes. The second xpath is discarded during processing, I guess.

Comment on lines 1030 to 1038

tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
@SaulLu:

Oh, I'm sorry, the change on this file slipped through the cracks when I committed the files... but indeed this is the fix!

@lockon-n (Author) commented Sep 9, 2022

@SaulLu I rewrote the converter, and this solves the double </s> problem (i.e., the tokenizer.post_processor). However, the fast tokenizer seems to tokenize "what's his name?" as ["what","'s","his","name","?"], while the slow tokenizer and the ground truth give ["what","'s","Ġhis","Ġname","?"]. Could this be the space problem caused by trim_offsets?

@SaulLu:

@SaulLu I rewrote the converter, and this solves the double </s> problem (i.e., the tokenizer.post_processor).

🙌

However, the fast tokenizer seems to tokenize "what's his name?" as ["what","'s","his","name","?"], while the slow tokenizer and the ground truth give ["what","'s","Ġhis","Ġname","?"]. Could this be the space problem caused by trim_offsets?

Can you share the snippet you used to see that, please?

@lockon-n (Author) commented Sep 9, 2022

Run test_markuplm_integration_test() and compare the difference between encoding_p['input_ids'] and encoding_r['input_ids'] in lines 2291 and 2292. I converted the ids into tokens and got the results above.
Sorry for not using an easier way, like a code snippet.

@SaulLu:

I see, it's in the integration test

@SaulLu:

It can do the job 💪

@lockon-n (Author):

So... have we reached a release-ready version?

@SaulLu:

I think so! @NielsRogge

@lockon-n (Author):

Thanks so much for your effort!!!! 👏 👏 It was a really nice experience to collaborate with such a nice team, and we are excited to see our model appear in this famous repo (a common dream for NLPers, I guess 😆).

@wolfshow:

Fantastic! @SaulLu @lockon-n

Comment on lines 1031 to 1038
tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
@SaulLu:

Suggested change
tokenizer.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", ot.convert_tokens_to_ids("<s>")),
        ("</s>", ot.convert_tokens_to_ids("</s>")),
    ],
)
cls = str(self.original_tokenizer.cls_token)
sep = str(self.original_tokenizer.sep_token)
cls_token_id = self.original_tokenizer.cls_token_id
sep_token_id = self.original_tokenizer.sep_token_id
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"{cls} $A {sep}",
    pair=f"{cls} $A {sep} $B {sep}",
    special_tokens=[
        (cls, cls_token_id),
        (sep, sep_token_id),
    ],
)

@NielsRogge (Owner)

Ok, awesome work :) can you confirm I can merge this now?

@lockon-n (Author)

I think yes @NielsRogge

@SaulLu commented Sep 12, 2022

can you confirm I can merge this now?

yes! 🙌

@NielsRogge NielsRogge merged commit a555fd7 into NielsRogge:modeling_markuplm_bis Sep 12, 2022
@NielsRogge (Owner) commented Sep 12, 2022

So I've merged and rebased with the main branch of Transformers; it seems there are 4 tests left to fix. Running

RUN_SLOW=yes pytest tests/models/markuplm/test_tokenization_markuplm.py

returns:

FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_max_length_equal - AssertionError: 508 != 509
FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_num_special_tokens_to_add_equal - AssertionError: 4 != 3
FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_padding - AssertionError: Sequences differ: [1, 1[52 chars] 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
FAILED tests/models/markuplm/test_tokenization_markuplm.py::MarkupLMTokenizationTest::test_padding_warning_message_fast_tokenizer - ValueError: Nodes must be of type `List[str]` ...

@lockon-n (Author)

@NielsRogge Hi, I have modified the code to fix these tests. Should I open another PR?

@NielsRogge (Owner)

Yes, you can do that. Note that I've rebased the modeling_markuplm_bis branch with the main branch of Transformers.

@SaulLu commented Sep 21, 2022

FYI: huggingface#19139 (review)
