
Add UDOP #22940

Merged
merged 159 commits into from
Mar 4, 2024

Conversation

NielsRogge
Contributor

@NielsRogge NielsRogge commented Apr 22, 2023

What does this PR do?

This PR adds UDOP as described in Unifying Vision, Text, and Layout for Universal Document Processing.

The model can be seen as an encoder-decoder Transformer with LayoutLMv3 as encoder and a T5 text decoder.

Fixes #20650

To do:

  • fix tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default
  • include pytesseract decodings in processor test
  • check forward signature of the model as we can't change this afterwards
  • update organization to microsoft, replace ArthurZ/udop everywhere by an official UDOP checkpoint

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@plamb-viso

Hi @NielsRogge, thank you for pushing this PR. I haven't had the chance to try it yet, but I'm curious whether you have an example or have tried to perform a torch.jit.trace or ONNX conversion on UDOP yet? I know with the previous PR that was where I got stuck.

@dtiarks

dtiarks commented May 3, 2023

@plamb-viso My impression was always that tracing encoder-decoder models (e.g. BART) works fine, but exporting to ONNX via jit.trace is challenging. There's a research example for BART on how to do that: Bart + Beam Search to ONNX

I think this is part of the reason the ONNX export is now offloaded to Optimum: #14222 (comment)

@plamb-viso

plamb-viso commented May 8, 2023

Just want to confirm that with the UdopProcessor we need to manually add the task prompt to each input string. E.g. if I'm doing document classification, I need to add document classification. and [0,0,0,0] to my words and bboxes before they go through the processor.

E.g.:

        prompt_text = ['document', 'classification.']
        prompt_boxes = [[0,0,0,0],[0,0,0,0]]
        processor.tokenizer(text=prompt_text, boxes=prompt_boxes)

And then prepend these input_ids/boxes to the input_ids/boxes that come out of the processor.

(Note that I am using apply_ocr=False.)
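The prepend step described above can be sketched in plain Python. This is a hypothetical illustration: the token ids below are invented, and `prepend_prompt` is not part of the transformers API; real ids would come from `processor.tokenizer(text=prompt_text, boxes=prompt_boxes)`.

```python
# Hypothetical sketch of prepending a task prompt to already-encoded inputs.
# The ids below are made up for illustration; real ids come from the tokenizer.

def prepend_prompt(prompt_ids, prompt_boxes, input_ids, bboxes):
    """Prepend prompt token ids and their [0, 0, 0, 0] boxes to encoded inputs."""
    return prompt_ids + input_ids, prompt_boxes + bboxes

prompt_ids = [1845, 20, 5]                      # pretend ids for "document classification."
prompt_boxes = [[0, 0, 0, 0]] * len(prompt_ids)  # prompt tokens get dummy boxes

doc_ids = [100, 200]                             # pretend ids for the document words
doc_boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]

ids, boxes = prepend_prompt(prompt_ids, prompt_boxes, doc_ids, doc_boxes)
# ids   -> [1845, 20, 5, 100, 200]
# boxes -> [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 2, 3, 4], [5, 6, 7, 8]]
```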

@plamb-viso

Also curious how we should encode the label of a training example. Is it a part of the inputs to UdopProcessor?

The I-Code example appears to do it like this

@plamb-viso

plamb-viso commented May 12, 2023

Thanks @dtiarks, it looks like a key component of that script is the BartBeamSearchGenerator, which allows you to convert it to TorchScript. Will UDOP have something like this?

I tried some of the naive steps I tried in this comment for tracing this new UDOP PR. Looks like the same issues remain. Curious if we'll get a test/example of tracing/compiling/onnx exporting the model either here or in optimum?

EDIT: just a naive try at ONNX export in optimum:
KeyError: "udop is not supported yet."

And just for completeness, a torch.onnx.export gives:

RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":621, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case.  Argument types: Tensor, bool, int, int, Device, bool, NoneType,

Candidates:
	aten::full_like(Tensor self, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
	aten::full_like.out(Tensor self, Scalar fill_value, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)

@regisss
Contributor

regisss commented May 12, 2023

@plamb-viso Here is the guide to add ONNX export support for a new architecture in Optimum: https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute
Feel free to open a PR there and we'll help you if you encounter any issue 🙂

Collaborator

@ArthurZucker ArthurZucker left a comment

I just looked at the tokenization; I didn't check the original codebase, but I think we should not touch the convert_to_token_id function. My reasoning is that this function should usually be called after tokenize, and only on tokens that are part of the original vocab.
Thus I also think that the addition of special tokens when initialising a model should not rely on this function (special tokens are not part of the vocab, but the additional_special_tokens setter uses it). I'll see if this is something that can be easily adapted.

Collaborator

@ArthurZucker ArthurZucker left a comment

Some of the failing tests are also due to typos and a wrong slow-to-fast converter.

@Jordy-VL

Jordy-VL commented Jun 6, 2023

Highly anticipating this release! :) Keep up the great work

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@plamb-viso

> This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
>
> Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Definitely still highly interested in this work

@dtiarks

dtiarks commented Jun 30, 2023

@ArthurZucker does #24565 fix the remaining issues of this PR?

@ArthurZucker
Collaborator

Not sure it does, no! The added tokens were the issue, if I remember correctly.

@dtiarks

dtiarks commented Jul 2, 2023

Ok. The question is how we can move this PR forward. @plamb-viso, @Jordy-VL, I (and probably others) are still definitely interested in this.

@NielsRogge are you aware of other issues blocking this PR or do you have other priorities at the moment?

@ArthurZucker
Collaborator

My current priority is #24629, then it will be the tokenizer PR, which seems to be the last blocking factor. In the meantime I think it would be good to get all the tests green and ask for a review to make it ready for a final one! The tokenizer can be updated afterwards 🤗 Sorry for the wait 😓

@dtiarks

dtiarks commented Jul 3, 2023

No worries @ArthurZucker ☺️. My comment was not meant to push anyone; I was just wondering whether I could contribute to speed up the process.

@NielsRogge
Contributor Author

NielsRogge commented Jul 3, 2023

@ArthurZucker the tokenizer is the only thing left to make all tests green; the PR is ready other than that. The only remaining issue is the sentinel tokens that the UDOP authors defined (T5 has 100 of them, UDOP a lot more). Those are actually only relevant during pre-training, not during fine-tuning, so the model is already perfectly usable.

I can only assign core maintainers for review when the CI is more or less green, so will do that once the tokenizer issue is fixed.
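For reference, T5-style sentinel tokens are plain string tokens of the form `<extra_id_N>`. A minimal sketch of how such a set is enumerated (the additional sentinels UDOP defines beyond T5's 100 are not reproduced here):

```python
# Sketch: enumerate T5-style sentinel token strings.
# T5 defines 100 of these; UDOP defines additional ones,
# which this sketch does not attempt to reproduce.

def sentinel_tokens(n):
    return [f"<extra_id_{i}>" for i in range(n)]

t5_sentinels = sentinel_tokens(100)
# t5_sentinels[0]  -> "<extra_id_0>"
# t5_sentinels[-1] -> "<extra_id_99>"
```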

@AleRosae

AleRosae commented Jul 4, 2023

Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curious how we can approach a token-classification task with an encoder-decoder architecture such as UDOP :)

@Jordy-VL

Jordy-VL commented Jul 4, 2023

> Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curious how we can approach a token-classification task with an encoder-decoder architecture such as UDOP :)

You can already check pix2struct ;)

@ArthurZucker
Collaborator

Ok! Let me have a second look at the tokenizer then! There are quite a few issues with spm and AddedToken currently being taken care of!

Collaborator

@ArthurZucker ArthurZucker left a comment

I'm pushing the changes needed and will let you fix the tests. Overall, extra-id tokens should be handled with extra care, see #24622.
Otherwise I pushed a tokenizer here:

@ArthurZucker
Collaborator

You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.

@sromoam

sromoam commented Jul 17, 2023

Eagerly anticipating this PR being merged. Is there any information on priority of this work and rough timelines? Thank you @ArthurZucker and @NielsRogge for your great work.

@ArthurZucker ArthurZucker self-requested a review February 27, 2024 06:54
Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks!

@NielsRogge NielsRogge merged commit 836921f into huggingface:main Mar 4, 2024
23 checks passed
@NielsRogge
Contributor Author

It's finally here folks!

3 checkpoints are on the hub:

@plamb-viso

Incredible, thank you. Hope you're able to do one of your amazing tutorials at some point as well.

@NielsRogge
Contributor Author

NielsRogge commented Mar 4, 2024

Oh yes I'll upload the notebooks now ;) /~https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP

@ageorgiev97

Thanks a lot for the amazing work! Is there any chance you could upload a question-answering fine-tuning example, or give some ideas on how it should be done?

@felixvor

felixvor commented Mar 6, 2024

Thank you very much for your work! We are very excited to experiment with UDOP!

@plamb-viso

For anyone following this thread, I looked today and saw that @NielsRogge added some UDOP tutorials: /~https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP

@plamb-viso

plamb-viso commented Mar 12, 2024

Noting here that if you try to load the non-fast version of the tokenizer this way:

proc = UdopProcessor.from_pretrained(MODEL_KEY, use_fast=False)

When you go to tokenize labels, it calls base tokenization methods that bypass the UDOP box-based overrides and produces this error:

TypeError: UdopTokenizer.truncate_sequences() missing 1 required positional argument: 'token_boxes'

Same thing happens if you attempt this:

tok = UdopTokenizer.from_pretrained(MODEL_KEY)
img = LayoutLMv3ImageProcessor.from_pretrained(MODEL_KEY)
proc = UdopProcessor(img, tok)

For anyone else who encounters this, you can get the effect of the non-fast tokenizer by setting the env var TOKENIZERS_PARALLELISM=false (I believe).

@ArthurZucker
Collaborator

Could you open a separate issue, @plamb-viso?
To get a slow tokenizer you just need to use UdopTokenizer, but if you set use_fast=False in the UdopProcessor call to from_pretrained, that might also do the trick.

cc @NielsRogge

Successfully merging this pull request may close these issues.

[New Model] UDOP: Unifying Vision, Text, and Layout for Universal Document Processing