Add UDOP #22940
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Hi @NielsRogge, thank you for pushing this PR. I haven't had the chance to try it yet, but I'm curious if you have an example or have tried to perform a trace/export of the model.
@plamb-viso My impression was always that tracing encoder-decoder models (e.g. BART) works fine, but exporting to ONNX is challenging using jit.trace. There's a research example for BART on how to do that: Bart + Beam Search to ONNX. I think this is part of the reason the ONNX export is now offloaded to Optimum: #14222 (comment)
Just want to make sure, with the UdopProcessor, that we need to manually add the task to each input string. E.g. if I'm doing document classification, I need to add a task prompt like:

```python
prompt_text = ['document', 'classification.']
prompt_boxes = [[0, 0, 0, 0], [0, 0, 0, 0]]
processor.tokenizer(text=prompt_text, boxes=prompt_boxes)
```

And prepend these input_ids/boxes to the input_ids/boxes that come out of the processor. (Note that I am using apply_ocr=False.)
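For anyone following along, here is a minimal sketch of the prepending approach described above (my own assumption of how the pieces fit together, not confirmed in this thread; it assumes apply_ocr=False and that the tokenizer returns a bbox key like other LayoutLM-style tokenizers):

```python
import torch

# Encode the document itself; with apply_ocr=False, words/boxes are supplied manually.
doc_encoding = processor(images=image, text=words, boxes=boxes, return_tensors="pt")

# Encode the task prompt separately, without adding special tokens yet.
prompt_encoding = processor.tokenizer(
    text=["document", "classification."],
    boxes=[[0, 0, 0, 0], [0, 0, 0, 0]],
    add_special_tokens=False,
    return_tensors="pt",
)

# Prepend the prompt ids/boxes to the document ids/boxes.
input_ids = torch.cat([prompt_encoding["input_ids"], doc_encoding["input_ids"]], dim=1)
bbox = torch.cat([prompt_encoding["bbox"], doc_encoding["bbox"]], dim=1)
```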
Also curious how we should encode the label of a training example. Is it part of the inputs to the processor? The I-Code example appears to do it like this.
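On the labels question, a hedged sketch of the T5-style approach (this assumes the processor supports text_target the way other seq2seq processors in transformers do, which is not confirmed at this point in the thread):

```python
# Encode the target text as labels; mask pad tokens with -100 so the
# cross-entropy loss ignores them.
encoding = processor(
    images=image,
    text=words,
    boxes=boxes,
    text_target="scientific report",
    return_tensors="pt",
)
encoding["labels"][encoding["labels"] == processor.tokenizer.pad_token_id] = -100
```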
Thanks @dtiarks, looks like a key component of that script is the BartBeamSearchGenerator, which allows you to convert it to torchscript. Will UDOP have something like this? I tried some of the naive steps from this comment for tracing this new UDOP PR; looks like the same issues remain. Curious if we'll get a test/example of tracing/compiling/ONNX-exporting the model either here or in Optimum? EDIT: just a naive try at ONNX export in Optimum. And just for completeness, the resulting RuntimeError:

```
RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":621, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case. Argument types: Tensor, bool, int, int, Device, bool, NoneType,

Candidates:
    aten::full_like(Tensor self, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
    aten::full_like.out(Tensor self, Scalar fill_value, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)
```
@plamb-viso Here is the guide to add ONNX export support for a new architecture in Optimum: https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute
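Following that guide, the contribution essentially boils down to registering an ONNX config that declares the model's inputs and their dynamic axes. A rough sketch of what that could look like for UDOP (the base class choice and the axes here are my assumptions and would need verification against Optimum's current API):

```python
from typing import Dict

from optimum.exporters.onnx.base import OnnxSeq2SeqConfigWithPast


class UdopOnnxConfig(OnnxSeq2SeqConfigWithPast):
    @property
    def inputs(self) -> Dict[str, Dict[int, str]]:
        # Map each model input to its dynamic axes.
        return {
            "input_ids": {0: "batch_size", 1: "sequence_length"},
            "attention_mask": {0: "batch_size", 1: "sequence_length"},
            "bbox": {0: "batch_size", 1: "sequence_length"},
            "pixel_values": {0: "batch_size", 1: "num_channels", 2: "height", 3: "width"},
        }
```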
I just looked at the tokenization. I didn't check the original codebase, but I think we should not touch the convert_to_token_id function. My reasoning is that this function should usually be called after tokenize, and only on tokens that are part of the original vocab. Thus I also think that the addition of special tokens when initialising a model should not rely on this function (special tokens are not part of the vocab, but the additional_special_tokens setter uses it). I'll see if this is something that can be easily adapted.
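For context, the contract being referred to is the standard two-step tokenizer API (a schematic example):

```python
# tokenize() maps raw text to vocab tokens; convert_tokens_to_ids() should
# then only ever see tokens that exist in the original vocab.
tokens = tokenizer.tokenize("hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)
```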
Some of the failing tests are also due to typos and a wrong slow-to-fast converter.
Highly anticipating this release! :) Keep up the great work
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Definitely still highly interested in this work.
@ArthurZucker does #24565 fix the remaining issues of this PR?
Not sure it does, no! The added tokens were the issue, if I remember correctly.
Ok. The question is how we can move this PR forward. @plamb-viso, @Jordy-VL, I (and probably others) are still definitely interested in this. @NielsRogge are you aware of other issues blocking this PR, or do you have other priorities at the moment?
My current priority is #24629, then it will be the tokenizer PR, which seems to be the last blocking factor. In the meantime I think it should be good to get all the tests green and ask for a review to make it ready for a final one! The tokenizer can be updated afterwards 🤗 sorry for the wait 😓
No worries @ArthurZucker
@ArthurZucker the tokenizer is the only thing left to make all tests green; the PR is ready other than that. The only remaining issue is the sentinel tokens that the UDOP authors defined (T5 has 100 of them, UDOP a lot more). Those are actually only relevant during pre-training, not during fine-tuning, hence the model is already perfectly usable. I can only assign core maintainers for review when the CI is more or less green, so I will do that once the tokenizer issue is fixed.
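For readers unfamiliar with the sentinel tokens under discussion: these are the T5-style <extra_id_N> placeholders used by the span-corruption pretraining objective, which is why they don't matter for fine-tuning. A quick illustration with T5 itself:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
# T5 reserves 100 sentinel tokens, <extra_id_0> ... <extra_id_99>, at the top
# of the vocab. During pretraining, masked spans are replaced by sentinels:
#   input:  "The <extra_id_0> walks in <extra_id_1> park"
#   target: "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"
print(tokenizer.convert_tokens_to_ids("<extra_id_0>"))  # 32099
```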
Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curious how we can approach a token-classification task with an encoder-decoder architecture such as UDOP :)
You can already check pix2struct ;)
Ok! Let me have a second look at the tokenizer then! There are quite a few issues with it currently.
You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.
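A sketch of what manually adding the tokens looks like with the public API (the token string here is hypothetical, not UDOP's actual sentinel format):

```python
from transformers import AddedToken

# Register extra special tokens after init, then resize embeddings to match.
tokenizer.add_special_tokens(
    {"additional_special_tokens": [AddedToken("<my_sentinel_0>", lstrip=True)]}
)
model.resize_token_embeddings(len(tokenizer))
```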
Eagerly anticipating this PR being merged. Is there any information on the priority of this work and rough timelines? Thank you @ArthurZucker and @NielsRogge for your great work.
Thanks!
It's finally here folks! 3 checkpoints are on the hub:
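A quick usage sketch now that checkpoints are up (this mirrors the document-QA pattern from the model docs; the checkpoint name and exact processor signature are assumptions worth double-checking):

```python
from transformers import AutoProcessor, UdopForConditionalGeneration
from PIL import Image

processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

image = Image.open("document.png").convert("RGB")
words = ["7", "June", "2023"]  # OCR words
boxes = [[100, 80, 130, 110], [140, 80, 230, 110], [240, 80, 320, 110]]  # 0-1000 normalized

question = "Question answering. What is the date on the form?"
encoding = processor(images=image, text=question, text_pair=words, boxes=boxes, return_tensors="pt")
outputs = model.generate(**encoding, max_new_tokens=20)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```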
Incredible, thank you. Hope you're able to do one of your amazing tutorials at some point as well.
Oh yes, I'll upload the notebooks now ;) https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP
Thanks a lot for the amazing work! Is there any chance that you upload a question answering fine-tuning example? Or give some ideas on how it should be done.
Thank you very much for your work! We are very excited to experiment with UDOP!
For anyone following this thread, I looked today and saw that @NielsRogge added some UDOP tutorials: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP
* First draft * More improvements * More improvements * More fixes * Fix copies * More improvements * More fixes * More improvements * Convert checkpoint * More improvements, set up tests * Fix more tests * Add UdopModel * More improvements * Fix equivalence test * More fixes * Redesign model * Extend conversion script * Use real inputs for conversion script * Add image processor * Improve conversion script * Add UdopTokenizer * Add fast tokenizer * Add converter * Update README's * Add processor * Add fully fledged tokenizer * Add fast tokenizer * Use processor in conversion script * Add tokenizer tests * Fix one more test * Fix more tests * Fix tokenizer tests * Enable fast tokenizer tests * Fix more tests * Fix additional_special_tokens of fast tokenizer * Fix tokenizer tests * Fix more tests * Fix equivalence test * Rename image to pixel_values * Rename seg_data to bbox * More renamings * Remove vis_special_token * More improvements * Add docs * Fix copied from * Update slow tokenizer * Update fast tokenizer design * Make text input optional * Add first draft of processor tests * Fix more processor tests * Fix decoder_start_token_id * Fix test_initialization * Add integration test * More improvements * Improve processor, add test * Add more copied from * Add more copied from * Add more copied from * Add more copied from * Remove print statement * Update README and auto mapping * Delete files * Delete another file * Remove code * Fix test * Fix docs * Remove asserts * Add doc tests * Include UDOP in exotic model tests * Add expected tesseract decodings * Add sentencepiece * Use same design as T5 * Add UdopEncoderModel * Add UdopEncoderModel to tests * More fixes * Fix fast tokenizer * Fix one more test * Remove parallelisable attribute * Fix copies * Remove legacy file * Copy from T5Tokenizer * Fix rebase * More fixes, copy from T5 * More fixes * Fix init * Use ArthurZ/udop for tests * Make all model tests pass * Remove UdopForConditionalGeneration from auto mapping * Fix more tests * fixups * more fixups * fix the tokenizers * remove un-necessary changes * nits * nits * replace truncate_sequences_boxes with truncate_sequences for fix-copies * nit current path * add a test for input ids * ids that we should get taken from c9f7a32 * nits converting * nits * apply ruff * nits * nits * style * fix slow order of addition * fix udop fast range as well * fixup * nits * Add docstrings * Fix gradient checkpointing * Update code examples * Skip tests * Update integration test * Address comment * Make fixup * Remove extra ids from tokenizer * Skip test * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update year * Address comment * Address more comments * Address comments * Add copied from * Update CI * Rename script * Update model id * Add AddedToken, skip tests * Update CI * Fix doc tests * Do not use Tesseract for the doc tests * Remove kwargs * Add original inputs * Update casting * Fix doc test * Update question * Update question * Use LayoutLMv3ImageProcessor * Update organization * Improve docs * Update forward signature * Make images optional * Remove deprecated device argument * Add comment, add add_prefix_space * More improvements * Remove kwargs --------- Co-authored-by: ArthurZucker <arthur.zucker@gmail.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Noting here that if you try to load the non-fast version of the tokenizer this way:

When you go to tokenize labels, it calls base tokenization methods that miss the UDOP box-based methods and produces this error:

Same thing happens if you attempt this:

```python
tok = UdopTokenizer.from_pretrained(MODEL_KEY)
img = LayoutLMv3ImageProcessor.from_pretrained(MODEL_KEY)
proc = UdopProcessor(img, tok)
```

For anyone else that encounters this, you can get the effect of the non-fast tokenizer by setting the env var
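For reference, the path that does work here is building the processor around the fast tokenizer, which AutoProcessor picks by default (checkpoint name assumed):

```python
from transformers import AutoProcessor

# AutoProcessor loads UdopTokenizerFast by default, which carries the
# box-aware tokenization methods the slow path is missing above.
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
```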
Could you open a separate issue @plamb-viso? cc @NielsRogge
What does this PR do?
This PR adds UDOP as described in Unifying Vision, Text, and Layout for Universal Document Processing.
The model can be seen as an encoder-decoder Transformer with LayoutLMv3 as encoder and a T5 text decoder.
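To make the encoder-decoder structure concrete, a minimal sketch of a bare forward pass through UdopModel (tensor contents are dummies; the shapes follow the LayoutLM convention of one 0-1000 box per token and a 224x224 image, assumptions worth verifying against the final docs):

```python
import torch
from transformers import UdopModel

model = UdopModel.from_pretrained("microsoft/udop-large")

batch, seq_len = 1, 8
input_ids = torch.randint(0, 100, (batch, seq_len))          # text tokens
bbox = torch.zeros(batch, seq_len, 4, dtype=torch.long)      # layout: one box per token
pixel_values = torch.randn(batch, 3, 224, 224)               # document image
decoder_input_ids = torch.zeros(batch, 1, dtype=torch.long)  # decoder start token

outputs = model(
    input_ids=input_ids,
    bbox=bbox,
    pixel_values=pixel_values,
    decoder_input_ids=decoder_input_ids,
)
print(outputs.last_hidden_state.shape)
```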
Fixes #20650
To do:
- tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default
- Transfer checkpoints to microsoft, replace ArthurZ/udop everywhere by an official UDOP checkpoint