Add UDOP #22940

NielsRogge · 2023-04-22T15:47:25Z

What does this PR do?

This PR adds UDOP as described in Unifying Vision, Text, and Layout for Universal Document Processing.

The model can be seen as an encoder-decoder Transformer with LayoutLMv3 as encoder and a T5 text decoder.

Fixes #20650

To do:

fix tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default
include pytesseract decodings in processor test
check forward signature of the model as we can't change this afterwards
update organization to microsoft, replace ArthurZ/udop everywhere by an official UDOP checkpoint

HuggingFaceDocBuilderDev · 2023-04-22T16:32:13Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

plamb-viso · 2023-05-01T18:11:33Z

hi @NielsRogge thank you for pushing this PR. I haven't had the chance to try yet, but I'm curious if you have an example or have tried to perform a torch.jit.trace or onnx conversion on UDOP yet? I know with the previous PR that was where I got stuck.

dtiarks · 2023-05-03T09:40:34Z

@plamb-viso My impression was always that tracing Encoder-Decoder models (e.g. BART) works fine but exporting to ONNX is challenging using jit.trace. There's a research example for BART on how to do that: Bart + Beam Search to ONNX

I think this part of the reason the ONNX export is now offloaded into optimum: #14222 (comment)

plamb-viso · 2023-05-08T13:57:13Z

Just want to make sure with the UdopProcessor that we need to manually add the task to each input string. For e.g. if I'm doing document classification, I need to add document classification. and [0,0,0,0] to my words and bboxes before they go through the processor

For e.g.:

        prompt_text = ['document', 'classification.']
        prompt_boxes = [[0,0,0,0],[0,0,0,0]]
        processor.tokenizer(text=prompt_text, boxes=prompt_boxes)

And prepend these input_ids/boxes to the input_ids/boxes that come out of the processor

(Note that i am using apply_ocr=False)

plamb-viso · 2023-05-08T15:58:57Z

Also curious how we should encode the label of a training example. Is it a part of the inputs to UdopProcessor?

The I-Code example appears to do it like this

src/transformers/models/udop/tokenization_udop.py

plamb-viso · 2023-05-12T15:28:45Z

thanks @dtiarks looks like a key component of that script is the BartBeamSearchGenerator which allows you to convert it to torchscript. Will UDOP have something like this?

I tried some of the naive steps I tried in this comment for tracing this new UDOP PR. Looks like the same issues remain. Curious if we'll get a test/example of tracing/compiling/onnx exporting the model either here or in optimum?

EDIT just a naive try at onnx export in optimum:
KeyError: "udop is not supported yet.

And just for completeness, a torch.onnx.export gives:

RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":621, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case.  Argument types: Tensor, bool, int, int, Device, bool, NoneType,

Candidates:
	aten::full_like(Tensor self, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
	aten::full_like.out(Tensor self, Scalar fill_value, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)

regisss · 2023-05-12T16:48:36Z

@plamb-viso Here is the guide to add ONNX export support for a new architecture in Optimum: https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute
Feel free to open a PR there and we'll help you if you encounter any issue 🙂

ArthurZucker

I just looked at the tokenization, I didn't check the original codebase, but I think we should not touch the convert_to_token_id function. My reasoning is that this function should usually be called after tokenize, and only on tokens that are part of the original vocab.
Thus I also think that the addition of special tokens when initialising a model should not rely on this functions (special tokens are not part of the vocab, but the additional_special_tokens setter uses it. I'll see if this is something that can be easily adapted

src/transformers/models/udop/tokenization_udop.py

ArthurZucker

Some of the failing tests are also due to the typos and wrong slow to fast converter

src/transformers/convert_slow_tokenizer.py

Jordy-VL · 2023-06-06T13:48:00Z

Highly anticipating this release! :) Keep up the great work

github-actions · 2023-06-30T15:02:35Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

plamb-viso · 2023-06-30T15:17:27Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Definitely still highly interested in this work

dtiarks · 2023-06-30T17:44:33Z

@ArthurZucker does #24565 fix the remaining issues of this PR?

ArthurZucker · 2023-07-01T07:26:41Z

not sure it does no! The added tokens was the issue if I remember correctly

dtiarks · 2023-07-02T11:23:43Z

Ok. The question is how we can move this PR forward? @plamb-viso, @Jordy-VL, I (and probably others) are still definitely interested in this.

@NielsRogge are you aware of other issues blocking this PR or do you have other priorities at the moment?

ArthurZucker · 2023-07-03T06:39:42Z

My current priority is #24629, then it will be the tokenizer PR which seems to be the last blocking factor. In the mean time I think that it should be good to get all the tests green and ask for a review to make it ready for a final one! The tokenizer can be updated after wards 🤗 sorry for the wait 😓

dtiarks · 2023-07-03T08:45:20Z

No worries @ArthurZucker ☺️. My comment was not meant to push anyone. I was just interested if I could contribute to speed up the process.

NielsRogge · 2023-07-03T13:16:57Z

@ArthurZucker the tokenizer is the only thing left to make all tests green. The PR is ready other than that. The only issue that is remaining are the sentinel tokens that the UDOP author defined (T5 has 100 of them, UDOP a lot more). Those are actually only relevant during pre-training, not during fine-tuning. Hence the model is already perfectly usable.

I can only assign core maintainers for review when the CI is more or less green, so will do that once the tokenizer issue is fixed.

AleRosae · 2023-07-04T10:09:20Z

Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curios on how can we approach a token-classification task with a encoder-decoder architecture such as UDOP :)

Jordy-VL · 2023-07-04T10:57:45Z

Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curios on how can we approach a token-classification task with a encoder-decoder architecture such as UDOP :)

You can already check pix2struct ;)

ArthurZucker · 2023-07-05T02:15:52Z

Ok! Let me have a second look at the tokenizer then! There are quite a few issues currently with spm and AddedToken being taken care of!

ArthurZucker

I'm pushing the changes need, will let you fix the tests. Overall, extra id things should be handled with extra care see #24622.
Otherwise pushed a tokenizer here:

src/transformers/models/udop/tokenization_udop.py

src/transformers/convert_slow_tokenizer.py

src/transformers/models/udop/tokenization_udop_fast.py

src/transformers/models/udop/tokenization_udop.py

ArthurZucker · 2023-07-05T06:24:49Z

You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.

sromoam · 2023-07-17T16:15:34Z

Eagerly anticipating this PR being merged. Is there any information on priority of this work and rough timelines? Thank you @ArthurZucker and @NielsRogge for your great work.

ArthurZucker

Thanks!

.circleci/create_circleci_config.py

src/transformers/models/udop/modeling_udop.py

NielsRogge · 2024-03-04T17:51:27Z

It's finally here folks!

3 checkpoints are on the hub:

https://huggingface.co/microsoft/udop-large
https://huggingface.co/microsoft/udop-large-512
https://huggingface.co/microsoft/udop-large-512-300k (this is the best one)

plamb-viso · 2024-03-04T17:56:15Z

incredible, thank you. Hope you're able to do one of your amazing tutorials at some point as well

NielsRogge · 2024-03-04T18:23:34Z

Oh yes I'll upload the notebooks now ;) https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP

ageorgiev97 · 2024-03-05T10:31:29Z

Thanks a lot for the amazing work! Is there any chance that you upload question and answering fine tuning? Or give some ideas how it should be done.

felixvor · 2024-03-06T09:43:48Z

Thank you very much for your work! We are very excited to experiment with UDOP!

plamb-viso · 2024-03-07T14:32:33Z

For anyone following this thread, I looked today and saw that @NielsRogge added some UDOP tutorials: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP

* First draft * More improvements * More improvements * More fixes * Fix copies * More improvements * More fixes * More improvements * Convert checkpoint * More improvements, set up tests * Fix more tests * Add UdopModel * More improvements * Fix equivalence test * More fixes * Redesign model * Extend conversion script * Use real inputs for conversion script * Add image processor * Improve conversion script * Add UdopTokenizer * Add fast tokenizer * Add converter * Update README's * Add processor * Add fully fledged tokenizer * Add fast tokenizer * Use processor in conversion script * Add tokenizer tests * Fix one more test * Fix more tests * Fix tokenizer tests * Enable fast tokenizer tests * Fix more tests * Fix additional_special_tokens of fast tokenizer * Fix tokenizer tests * Fix more tests * Fix equivalence test * Rename image to pixel_values * Rename seg_data to bbox * More renamings * Remove vis_special_token * More improvements * Add docs * Fix copied from * Update slow tokenizer * Update fast tokenizer design * Make text input optional * Add first draft of processor tests * Fix more processor tests * Fix decoder_start_token_id * Fix test_initialization * Add integration test * More improvements * Improve processor, add test * Add more copied from * Add more copied from * Add more copied from * Add more copied from * Remove print statement * Update README and auto mapping * Delete files * Delete another file * Remove code * Fix test * Fix docs * Remove asserts * Add doc tests * Include UDOP in exotic model tests * Add expected tesseract decodings * Add sentencepiece * Use same design as T5 * Add UdopEncoderModel * Add UdopEncoderModel to tests * More fixes * Fix fast tokenizer * Fix one more test * Remove parallelisable attribute * Fix copies * Remove legacy file * Copy from T5Tokenizer * Fix rebase * More fixes, copy from T5 * More fixes * Fix init * Use ArthurZ/udop for tests * Make all model tests pass * Remove UdopForConditionalGeneration from auto mapping * Fix more tests * fixups * more fixups * fix the tokenizers * remove un-necessary changes * nits * nits * replace truncate_sequences_boxes with truncate_sequences for fix-copies * nit current path * add a test for input ids * ids that we should get taken from c9f7a32 * nits converting * nits * apply ruff * nits * nits * style * fix slow order of addition * fix udop fast range as well * fixup * nits * Add docstrings * Fix gradient checkpointing * Update code examples * Skip tests * Update integration test * Address comment * Make fixup * Remove extra ids from tokenizer * Skip test * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update year * Address comment * Address more comments * Address comments * Add copied from * Update CI * Rename script * Update model id * Add AddedToken, skip tests * Update CI * Fix doc tests * Do not use Tesseract for the doc tests * Remove kwargs * Add original inputs * Update casting * Fix doc test * Update question * Update question * Use LayoutLMv3ImageProcessor * Update organization * Improve docs * Update forward signature * Make images optional * Remove deprecated device argument * Add comment, add add_prefix_space * More improvements * Remove kwargs --------- Co-authored-by: ArthurZucker <arthur.zucker@gmail.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

plamb-viso · 2024-03-12T17:10:32Z

Noting here that if you try to load the non-fast version of the tokenizer this way:

proc = UdopProcessor.from_pretrained(MODEL_KEY, use_fast=False)

When you go to tokenize labels, it calls base tokenization methods that miss the udop-boxed based methods and produces this error:

TypeError: UdopTokenizer.truncate_sequences() missing 1 required positional argument: 'token_boxes'

Same thing happens if you attempt this:

tok = UdopTokenizer.from_pretrained(MODEL_KEY)
    img = LayoutLMv3ImageProcessor.from_pretrained(MODEL_KEY)
    proc = UdopProcessor(img, tok)

For anyone else that encounters this, you can get the effect of the non-fast tokenizer by setting the env var TOKENIZERS_PARALLELISM=false (i believe)

ArthurZucker · 2024-03-25T06:49:13Z

Could you open a saperate issue @plamb-viso ?
To get a slow tokenizer you just need to use UdopTokenizer, but if you set use_fast = False in the UdopProcessor call to from pretrained that might also do the trick

cc @NielsRogge

* First draft * More improvements * More improvements * More fixes * Fix copies * More improvements * More fixes * More improvements * Convert checkpoint * More improvements, set up tests * Fix more tests * Add UdopModel * More improvements * Fix equivalence test * More fixes * Redesign model * Extend conversion script * Use real inputs for conversion script * Add image processor * Improve conversion script * Add UdopTokenizer * Add fast tokenizer * Add converter * Update README's * Add processor * Add fully fledged tokenizer * Add fast tokenizer * Use processor in conversion script * Add tokenizer tests * Fix one more test * Fix more tests * Fix tokenizer tests * Enable fast tokenizer tests * Fix more tests * Fix additional_special_tokens of fast tokenizer * Fix tokenizer tests * Fix more tests * Fix equivalence test * Rename image to pixel_values * Rename seg_data to bbox * More renamings * Remove vis_special_token * More improvements * Add docs * Fix copied from * Update slow tokenizer * Update fast tokenizer design * Make text input optional * Add first draft of processor tests * Fix more processor tests * Fix decoder_start_token_id * Fix test_initialization * Add integration test * More improvements * Improve processor, add test * Add more copied from * Add more copied from * Add more copied from * Add more copied from * Remove print statement * Update README and auto mapping * Delete files * Delete another file * Remove code * Fix test * Fix docs * Remove asserts * Add doc tests * Include UDOP in exotic model tests * Add expected tesseract decodings * Add sentencepiece * Use same design as T5 * Add UdopEncoderModel * Add UdopEncoderModel to tests * More fixes * Fix fast tokenizer * Fix one more test * Remove parallelisable attribute * Fix copies * Remove legacy file * Copy from T5Tokenizer * Fix rebase * More fixes, copy from T5 * More fixes * Fix init * Use ArthurZ/udop for tests * Make all model tests pass * Remove UdopForConditionalGeneration from auto mapping * Fix more tests * fixups * more fixups * fix the tokenizers * remove un-necessary changes * nits * nits * replace truncate_sequences_boxes with truncate_sequences for fix-copies * nit current path * add a test for input ids * ids that we should get taken from c9f7a32 * nits converting * nits * apply ruff * nits * nits * style * fix slow order of addition * fix udop fast range as well * fixup * nits * Add docstrings * Fix gradient checkpointing * Update code examples * Skip tests * Update integration test * Address comment * Make fixup * Remove extra ids from tokenizer * Skip test * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update year * Address comment * Address more comments * Address comments * Add copied from * Update CI * Rename script * Update model id * Add AddedToken, skip tests * Update CI * Fix doc tests * Do not use Tesseract for the doc tests * Remove kwargs * Add original inputs * Update casting * Fix doc test * Update question * Update question * Use LayoutLMv3ImageProcessor * Update organization * Improve docs * Update forward signature * Make images optional * Remove deprecated device argument * Add comment, add add_prefix_space * More improvements * Remove kwargs --------- Co-authored-by: ArthurZucker <arthur.zucker@gmail.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

NielsRogge force-pushed the add_udop branch from 8a9df08 to 17ef831 Compare May 3, 2023 15:58

SaulLu reviewed May 11, 2023

View reviewed changes

src/transformers/models/udop/tokenization_udop.py Outdated Show resolved Hide resolved

ArthurZucker reviewed May 23, 2023

View reviewed changes

src/transformers/models/udop/tokenization_udop.py Outdated Show resolved Hide resolved

src/transformers/models/udop/tokenization_udop.py Outdated Show resolved Hide resolved

ArthurZucker reviewed May 24, 2023

View reviewed changes

src/transformers/convert_slow_tokenizer.py Outdated Show resolved Hide resolved

ArthurZucker mentioned this pull request May 31, 2023

🚨🚨 🚨🚨 [Tokenizer] attemp to fix add_token issues🚨🚨 🚨🚨 #23909

Merged

NielsRogge force-pushed the add_udop branch from 9627742 to 8170a2c Compare July 2, 2023 17:26

NielsRogge force-pushed the add_udop branch from 8170a2c to 5e49335 Compare July 3, 2023 08:22

ArthurZucker reviewed Jul 5, 2023

View reviewed changes

AleRosae mentioned this pull request Jul 18, 2023

Generating bounding boxes with UDOP microsoft/i-Code#87

Open

NielsRogge added 10 commits February 19, 2024 16:51

Merge remote-tracking branch 'upstream/main' into add_udop

6741df4

Fix doc test

e47585b

Update question

081773d

Update question

a1bc1ba

Use LayoutLMv3ImageProcessor

084d358

Update organization

aa1780e

Improve docs

0655050

Update forward signature

ec4f8ea

Make images optional

a89f8e4

Remove deprecated device argument

5216563

ArthurZucker self-requested a review February 27, 2024 06:54

ArthurZucker approved these changes Feb 28, 2024

View reviewed changes

.circleci/create_circleci_config.py Show resolved Hide resolved

src/transformers/models/udop/modeling_udop.py Show resolved Hide resolved

NielsRogge added 5 commits February 29, 2024 08:51

Merge remote-tracking branch 'upstream/main' into add_udop

171469b

Add comment, add add_prefix_space

b59e4e7

More improvements

430f377

Remove kwargs

7e68ced

Merge remote-tracking branch 'upstream/main' into add_udop

0b7ee3c

NielsRogge merged commit 836921f into huggingface:main Mar 4, 2024
23 checks passed

khipp mentioned this pull request Mar 10, 2024

Add missing localized READMEs to the copies check #29575

Merged

1 task

NielsRogge mentioned this pull request Mar 11, 2024

Special tokens of UDOP aren't encoded/decoded correctly #29591

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add UDOP #22940

Add UDOP #22940

NielsRogge commented Apr 22, 2023 •

edited

Loading

HuggingFaceDocBuilderDev commented Apr 22, 2023

plamb-viso commented May 1, 2023

dtiarks commented May 3, 2023 •

edited

Loading

plamb-viso commented May 8, 2023 •

edited

Loading

plamb-viso commented May 8, 2023

plamb-viso commented May 12, 2023 •

edited

Loading

regisss commented May 12, 2023

ArthurZucker left a comment

ArthurZucker left a comment

Jordy-VL commented Jun 6, 2023

github-actions bot commented Jun 30, 2023

plamb-viso commented Jun 30, 2023

dtiarks commented Jun 30, 2023

ArthurZucker commented Jul 1, 2023

dtiarks commented Jul 2, 2023

ArthurZucker commented Jul 3, 2023

dtiarks commented Jul 3, 2023

NielsRogge commented Jul 3, 2023 •

edited

Loading

AleRosae commented Jul 4, 2023

Jordy-VL commented Jul 4, 2023

ArthurZucker commented Jul 5, 2023

ArthurZucker left a comment

ArthurZucker commented Jul 5, 2023

sromoam commented Jul 17, 2023

ArthurZucker left a comment

NielsRogge commented Mar 4, 2024

plamb-viso commented Mar 4, 2024

NielsRogge commented Mar 4, 2024 •

edited

Loading

ageorgiev97 commented Mar 5, 2024

felixvor commented Mar 6, 2024

plamb-viso commented Mar 7, 2024

plamb-viso commented Mar 12, 2024 •

edited

Loading

ArthurZucker commented Mar 25, 2024

Add UDOP #22940

Add UDOP #22940

Conversation

NielsRogge commented Apr 22, 2023 • edited Loading

What does this PR do?

HuggingFaceDocBuilderDev commented Apr 22, 2023

plamb-viso commented May 1, 2023

dtiarks commented May 3, 2023 • edited Loading

plamb-viso commented May 8, 2023 • edited Loading

plamb-viso commented May 8, 2023

plamb-viso commented May 12, 2023 • edited Loading

regisss commented May 12, 2023

ArthurZucker left a comment

Choose a reason for hiding this comment

ArthurZucker left a comment

Choose a reason for hiding this comment

Jordy-VL commented Jun 6, 2023

github-actions bot commented Jun 30, 2023

plamb-viso commented Jun 30, 2023

dtiarks commented Jun 30, 2023

ArthurZucker commented Jul 1, 2023

dtiarks commented Jul 2, 2023

ArthurZucker commented Jul 3, 2023

dtiarks commented Jul 3, 2023

NielsRogge commented Jul 3, 2023 • edited Loading

AleRosae commented Jul 4, 2023

Jordy-VL commented Jul 4, 2023

ArthurZucker commented Jul 5, 2023

ArthurZucker left a comment

Choose a reason for hiding this comment

ArthurZucker commented Jul 5, 2023

sromoam commented Jul 17, 2023

ArthurZucker left a comment

Choose a reason for hiding this comment

NielsRogge commented Mar 4, 2024

plamb-viso commented Mar 4, 2024

NielsRogge commented Mar 4, 2024 • edited Loading

ageorgiev97 commented Mar 5, 2024

felixvor commented Mar 6, 2024

plamb-viso commented Mar 7, 2024

plamb-viso commented Mar 12, 2024 • edited Loading

ArthurZucker commented Mar 25, 2024

NielsRogge commented Apr 22, 2023 •

edited

Loading

dtiarks commented May 3, 2023 •

edited

Loading

plamb-viso commented May 8, 2023 •

edited

Loading

plamb-viso commented May 12, 2023 •

edited

Loading

NielsRogge commented Jul 3, 2023 •

edited

Loading

NielsRogge commented Mar 4, 2024 •

edited

Loading

plamb-viso commented Mar 12, 2024 •

edited

Loading