add a test to match author's tokenization #37

SaulLu · 2022-04-13T08:05:24Z

What does this PR do?

This PR adds a test verifying that we can reproduce the tokenization produced by the authors of MarkupLM on the WebSRC downstream task.

To design this test, I ran the authors' code on the WebSRC downstream task and isolated:

- the xpaths, in the `get_xpath_and_treeid4tokens` method:
  https://github.com/microsoft/unilm/blob/e4929f812398207b7fefb4dda6e9debcb8ce34b9/markuplm/examples/fine_tuning/run_websrc/utils.py#L256-L284
- the other targets, in the feature creation stage (the `convert_examples_to_features` method):
  https://github.com/microsoft/unilm/blob/e4929f812398207b7fefb4dda6e9debcb8ce34b9/markuplm/examples/fine_tuning/run_websrc/utils.py#L664-L684
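A comparison of this kind can be sketched as a field-by-field diff between the values isolated from the authors' code and the values our tokenizer produces. This is only an illustrative sketch: the helper names `first_mismatch` and `diff_features`, the field names `input_ids` and `xpath_tags_seq`, and the toy values are assumptions, not the code of the test in this PR.

```python
from typing import Dict, Optional, Sequence, Union


def first_mismatch(expected: Sequence, actual: Sequence) -> Optional[int]:
    """Return the index of the first differing element, or None if both are equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    # Same common prefix but different lengths: they diverge where the shorter one ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def diff_features(
    expected: Dict[str, Sequence], actual: Dict[str, Sequence]
) -> Dict[str, Union[int, str]]:
    """Map each mismatching field to its first divergent index (or "missing")."""
    report: Dict[str, Union[int, str]] = {}
    for name, exp_values in expected.items():
        if name not in actual:
            report[name] = "missing"
            continue
        idx = first_mismatch(exp_values, actual[name])
        if idx is not None:
            report[name] = idx
    return report


# Toy values (hypothetical): the reference would come from the authors' code,
# the produced values from the tokenizer under test.
reference = {"input_ids": [0, 10, 11, 2], "xpath_tags_seq": [[1], [2], [2], [1]]}
produced = {"input_ids": [0, 10, 11, 1], "xpath_tags_seq": [[1], [2], [2], [1]]}
report = diff_features(reference, produced)
```

Reporting the first divergent index per field, rather than a plain equality failure, makes it easier to tell a template problem (early divergence) from a truncation or padding problem (divergence near the end).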

This test already reveals some differences:

- With the slow tokenizer, the sentence-pair template is not the same: the authors have `<s> question </s> content </s>`, whereas we have `<s> question </s></s> content`.
- The slow tokenizer loses the last tokens and replaces them with pad tokens.
- The fast tokenizer distorts the tokenization of the question.
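To make the first difference concrete, here is a minimal sketch of the two pair layouts built from plain token lists. It does not call either tokenizer: `build_pair` and the toy tokens are hypothetical, and the closing `</s>` appended to both layouts is an assumption made for the illustration.

```python
from typing import List, Sequence


def build_pair(
    question: Sequence[str],
    content: Sequence[str],
    n_sep: int,
    bos: str = "<s>",
    eos: str = "</s>",
) -> List[str]:
    """Build `<s> question </s>{n_sep times} content </s>` as a token list."""
    return [bos] + list(question) + [eos] * n_sep + list(content) + [eos]


question = ["what", "is", "it"]
content = ["a", "page"]

# Layout reported for the authors' code: a single separator between the segments.
authors_layout = build_pair(question, content, n_sep=1)
# Layout reported for our slow tokenizer: a doubled separator between the segments.
double_sep_layout = build_pair(question, content, n_sep=2)

# The doubled separator makes the sequence one token longer.
extra = len(double_sep_layout) - len(authors_layout)
```

Because the doubled separator adds one token, every content token sits one position later than in the authors' layout, so any per-token target aligned by position would have to shift with it.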

SaulLu · 2022-04-13T08:06:17Z

@NielsRogge, I can't add you as a reviewer on this PR 🙂. I would love to have your review on it.

Niels Rogge and others added 15 commits February 18, 2022 12:52

First draft

64defb2

Make basic test work

eb97daf

Fix most tokenizer tests

ae8e93b

More improvements

a062287

Make more tests pass

48668b2

Fix more tests

a6ee5f6

Fix some code quality

f728bc3

Improve truncation

e3463ca

Implement feature extractor

8b3b7c9

Improve feature extractor and add tests

aeb7c55

Improve feature extractor tests

1053726

Fix pair_input test partly

bd9e9b8

Add fast tokenizer

387e52d

Improve implementation

fd1db5e

add a test to match author's tokenization

36dcacc

NielsRogge force-pushed the modeling_markuplm_bis branch from fd1db5e to f05b861 Compare June 6, 2022 13:24

SaulLu mentioned this pull request Jun 21, 2022

Fix most of the tokenizer tests. #41

Merged

5 tasks

NielsRogge force-pushed the modeling_markuplm_bis branch 2 times, most recently from 47c172c to c79adea Compare September 15, 2022 07:37

NielsRogge force-pushed the modeling_markuplm_bis branch 3 times, most recently from 5387164 to 738a608 Compare September 29, 2022 09:32
