Add MaxViT model #912
Quite related to #911
Hey all, I can work on this. :)
I'd gladly also port this from the official repo to here. :)
@ayulockin @innat Ideally we would like to have SwinTransformer first: |
@tanzhenyu In the meantime, I think it's also OK to start working on the basic components like window partition, grid attention, trail-dense, etc. cc @ayulockin @DavidLandup0
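For context, window partitioning (used for MaxViT's block attention and Swin's windowed attention) can be sketched roughly as below; the function name and signature are illustrative, not a final KerasCV API:

```python
import tensorflow as tf

def window_partition(x, window_size):
    """Split a (batch, height, width, channels) feature map into
    non-overlapping windows. Illustrative sketch only; assumes height
    and width are divisible by window_size."""
    batch, height, width, channels = tf.unstack(tf.shape(x))
    x = tf.reshape(
        x,
        (
            batch,
            height // window_size,
            window_size,
            width // window_size,
            window_size,
            channels,
        ),
    )
    # Bring the two window axes together, then fold them into the batch
    # dimension: (batch * num_windows, window_size, window_size, channels).
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))
    return tf.reshape(x, (-1, window_size, window_size, channels))
```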
Creating a pull request later today with layers for patching, MLP heads, linear projections, etc. We can use those to build a ViT and then extend it to Swin and other transformers for vision. A rough draft for ViT will be coming in with the basic layers. Would you prefer a PR for the components, and then a PR for ViT on a different branch instead? @tanzhenyu @innat
Not that we need to do the same, but in the meantime I will also take a look at the modularization used by the quite popular Hugging Face Transformers API.
Given we don't anticipate the need to expose components such as linear projections as public APIs, either creating a single PR or multiple PRs sounds good to me.
Sure! I'm packaging them into one PR as a draft overview, just to check whether the general structure is okay. It'd be unwise to work more on it if major changes need to be done. I'm testing out a rough idea and will push it in later for a cursory look/review :) The idea was to build blocks that we can reuse for most transformer-based models. Currently, building a ViT with it looks like this:
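(The snippet originally attached to this comment was not preserved in the thread. Roughly, composing a ViT out of such reusable blocks could look like the sketch below; the layer names `PatchingAndEmbedding` and `TransformerEncoder` and their arguments are assumptions based on the proposed components, not a final API.)

```python
from tensorflow import keras
from keras_cv import layers as cv_layers  # assumes the proposed layers live in keras_cv.layers

inputs = keras.Input(shape=(224, 224, 3))
# Hypothetical patching + linear projection layer producing a token sequence.
x = cv_layers.PatchingAndEmbedding(project_dim=768, patch_size=16)(inputs)
# Stack of reusable transformer encoder blocks.
for _ in range(12):
    x = cv_layers.TransformerEncoder(project_dim=768, num_heads=12, mlp_dim=3072)(x)
x = keras.layers.LayerNormalization(epsilon=1e-6)(x)
x = keras.layers.GlobalAveragePooling1D()(x)
outputs = keras.layers.Dense(1000, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```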
I think adding basic components as you mentioned should be the way to go. KerasCV's aim is to provide components for industrial adoption of research. I think instead of focusing on models (ViT, Swin, etc.) we should scope the transformers for vision such that we can build fundamental blocks.
That is why I've suggested exploring the Hugging Face Transformers modules. It is probably not the best modularization we could achieve, but they have already accumulated quite a relevant list of transformer architectures in the library. I don't know whether it is production-level or not, but it is at least partially validated by the number of models.
This seems to be concise. My only question here is whether the … And yes, I agree with @bhack and @ayulockin that we should take a look at HF's implementation and make sure we're providing enough modularization.
Yes, this is another very important point, already discussed, to minimize (future?) duplication with KerasNLP.
As ViTs are finished, I'll be working on this one now ;)
Hey, @DavidLandup0, I would love to collaborate on this with you. :) I was waiting for the ViT to be added so I could build on top of it from a design point of view. Since you have worked on it, collaborating with you would be a great learning experience. :)
@ayulockin just to note, MAXIM is welcome too. Most of the official code (JAX) + weights was ported to Keras, here.
If nobody else signs up for it by the time MaxViT is done, I'd gladly hop onto MAXIM too :)
Since MaxViT uses MBConvs, which we have in EfficientNets, and which originated in MobileNets, we'll have three architectures reusing the same blocks. Additionally, having them as a layer would let users build networks with them themselves for edge/mobile applications. I think we should have … Can I separate it into a layer and refactor EfficientNets in preparation for MaxViT? @tanzhenyu @LukeWood @bhack
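As a rough illustration of what pulling MBConv out into a reusable building block could look like (the function name, arguments, and defaults below are assumptions, not the existing EfficientNet code):

```python
from tensorflow import keras

def mbconv_block(x, filters, expansion=4, kernel_size=3, strides=1, se_ratio=0.25):
    """Sketch of an MBConv block: expand -> depthwise conv -> SE -> project.
    Illustrative only; the real blocks also include dropout/stochastic depth."""
    input_filters = x.shape[-1]
    residual = x

    # 1x1 expansion convolution.
    x = keras.layers.Conv2D(input_filters * expansion, 1, padding="same", use_bias=False)(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation("gelu")(x)

    # Depthwise convolution.
    x = keras.layers.DepthwiseConv2D(kernel_size, strides=strides, padding="same", use_bias=False)(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation("gelu")(x)

    # Squeeze-and-excitation.
    se = keras.layers.GlobalAveragePooling2D(keepdims=True)(x)
    se = keras.layers.Conv2D(max(1, int(input_filters * se_ratio)), 1, activation="gelu")(se)
    se = keras.layers.Conv2D(input_filters * expansion, 1, activation="sigmoid")(se)
    x = keras.layers.Multiply()([x, se])

    # 1x1 projection back down to the output width.
    x = keras.layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = keras.layers.BatchNormalization()(x)

    # Residual connection when the shapes line up.
    if strides == 1 and input_filters == filters:
        x = keras.layers.Add()([x, residual])
    return x
```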
I can work on MAXIM!!
Yep, it'd be great to reuse both MBConv and SE
Done in a new PR :)
Go ahead!
Here is a quick update on the work done so far, done in collaboration with @DavidLandup0 :) We have almost all the components, and we have stacked them together to build a barebone model. @DavidLandup0, do you have anything more to add? cc: @innat @bhack @tanzhenyu
Thanks for tagging, and awesome work on … Since we should package the components for review first, it's enough to have a rough model for the first PR to prove that they work and to assess their usage. I'll do the … It'd be a good idea to see if we can generalize the existing TransformerEncoder to be shared between ViTs and MaxViTs, since they're not too different (and allow the type of multi-head attention to be changed). The main counterargument is that it already has quite a few arguments, so a general encoder with many more might not be very user-friendly. Thoughts?
Generally, this could be an indirect signal that it requires a base class.
For reference, this is the constructor:
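(The constructor itself was not captured in the thread; the signature below is a best-effort reconstruction of a ViT-style TransformerEncoder, with argument names and defaults that are assumptions rather than the exact KerasCV code.)

```python
from tensorflow import keras

class TransformerEncoder(keras.layers.Layer):
    """Sketch of the encoder constructor; names and defaults are assumptions."""

    def __init__(
        self,
        project_dim,
        num_heads,
        mlp_dim,
        mlp_dropout=0.1,
        attention_dropout=0.1,
        activation="gelu",
        layer_norm_epsilon=1e-6,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.project_dim = project_dim
        self.num_heads = num_heads
        self.mlp_dim = mlp_dim
        self.mlp_dropout = mlp_dropout
        self.attention_dropout = attention_dropout
        self.activation = activation
        self.layer_norm_epsilon = layer_norm_epsilon
```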
Though, because of the defaults, usage can be as simple as:
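(Again a sketch under the assumed signature above: only the dimensions and head count need to be passed, with the dropout rates, activation, and epsilon falling back to their defaults.)

```python
# `patch_embeddings` stands in for a (batch, num_tokens, project_dim) tensor.
encoded = TransformerEncoder(project_dim=768, num_heads=8, mlp_dim=3072)(patch_embeddings)
```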
Now, I remember KerasNLP having this same issue. We might not be able to have a fully general TransformerEncoder for all cases, so it might be better to do them separately? In the case of MaxViT, it's one extra arg that simply defines:
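(A hedged sketch of what that extra argument would control, namely which attention mechanism gets wired into the encoder block. The helper name and flag are placeholders, not an agreed API.)

```python
from tensorflow import keras

def make_attention(project_dim, num_heads, use_relative_attention=False):
    """Hypothetical helper: the extra argument picks the attention layer."""
    if use_relative_attention:
        # MaxViT's block/grid relative self-attention would be built here;
        # it is not a stock Keras layer, so this branch is left as a stub.
        raise NotImplementedError("relative attention not sketched here")
    return keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=project_dim // num_heads
    )
```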
So it's a small change. The question is mainly for work down the line when we might need to support more options. |
I am in favour of a separate TransformerEncoder. It allows for speedy implementation, since vision transformers evolve rapidly. The counterargument is that we could implement a handful of vision transformers first and then try to build a unified transformer encoder by introducing a base class.
Great progress! The breakdown of those components sounds good to me. @vztu @Yinxiaoli can you comment here? Re David's question -- I think it'd be nice to have a transformer encoder that accepts different attention mechanisms, though we don't have plans to move relative attention to core Keras yet -- maybe later, given there are so many different attentions out there. If MaxViT can reuse the encoder, that'd be great; the core value of KerasCV is always to provide generic components.
Short Description
Multi-Axis Vision Transformer: MaxViT is a family of hybrid (CNN + ViT) image classification models that achieves better performance across the board, in both parameter and FLOPs efficiency, than both SoTA ConvNets and Transformers.
Papers
https://arxiv.org/abs/2204.01697
Existing Implementations
Official Implementation:
Google, TensorFlow 2 (Keras): /~https://github.com/google-research/maxvit
cc. @Yinxiaoli @vztu