Add MaxViT model #912
Quite related to #911
Hey all, I can work on this. :)
I'd gladly also port this from the official repo to here. :)
@ayulockin @innat Ideally we would like to have SwinTransformer first: |
@tanzhenyu In the meantime, I think it's also OK to start working on the basic components like window partition, grid attention, trail-dense, etc. cc @ayulockin @DavidLandup0
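For context, window partitioning (used for MaxViT's block attention and Swin's windowed attention) can be sketched roughly as below; the function name and signature are illustrative, not a final KerasCV API:

```python
import tensorflow as tf

def window_partition(x, window_size):
    """Split a (batch, height, width, channels) feature map into
    non-overlapping windows. Illustrative sketch only; assumes height
    and width are divisible by window_size."""
    batch, height, width, channels = tf.unstack(tf.shape(x))
    x = tf.reshape(
        x,
        (
            batch,
            height // window_size,
            window_size,
            width // window_size,
            window_size,
            channels,
        ),
    )
    # Bring the two window axes together, then fold them into the batch
    # dimension: (batch * num_windows, window_size, window_size, channels).
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))
    return tf.reshape(x, (-1, window_size, window_size, channels))
```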
Creating a pull request later today with layers for patching, MLP heads, linear projections, etc. We can use those to build a ViT and then extend it to Swin and other transformers for vision. A rough draft for ViT will be coming in with the basic layers. Would you prefer a PR for the components, and then a PR for ViT on a different branch instead? @tanzhenyu @innat
Not that we need to do the same, but in the meantime I will also take a look at the modularization used by the quite popular Hugging Face Transformers API.
Given we don't anticipate the need to expose components such as linear projections as public APIs, either creating a single PR or multiple PRs sounds good to me.
Sure! I'm packaging them into one PR as a draft overview, just to check whether the general structure is okay. It'd be unwise to work more on it if major changes need to be done. I'm testing out a rough idea and will push it in later for a cursory look/review :) The idea was to build blocks that we can reuse for most transformer-based models. Currently, building a ViT with it looks like this:
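(The snippet originally attached to this comment was not preserved in the thread. Roughly, composing a ViT out of such reusable blocks could look like the sketch below; the layer names `PatchingAndEmbedding` and `TransformerEncoder` and their arguments are assumptions based on the proposed components, not a final API.)

```python
from tensorflow import keras
from keras_cv import layers as cv_layers  # assumes the proposed layers live in keras_cv.layers

inputs = keras.Input(shape=(224, 224, 3))
# Hypothetical patching + linear projection layer producing a token sequence.
x = cv_layers.PatchingAndEmbedding(project_dim=768, patch_size=16)(inputs)
# Stack of reusable transformer encoder blocks.
for _ in range(12):
    x = cv_layers.TransformerEncoder(project_dim=768, num_heads=12, mlp_dim=3072)(x)
x = keras.layers.LayerNormalization(epsilon=1e-6)(x)
x = keras.layers.GlobalAveragePooling1D()(x)
outputs = keras.layers.Dense(1000, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```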
I think adding basic components as you mentioned should be the way to go. KerasCV's aim is to provide components for industrial adoption of research. I think instead of focusing on models (ViT, Swin, etc.) we should scope the transformers for vision such that we can build fundamental blocks.
That is why I've suggested exploring the Hugging Face Transformers modules. It is probably not the best modularization we could achieve, but they have already accumulated quite a relevant list of transformer architectures in the library. I don't know whether it is production-level or not, but it is at least partially validated by the number of models.
This seems to be concise. My only question here is whether the … And yes, I agree with @bhack and @ayulockin that we should take a look at HF's implementation and make sure we're providing enough modularization.
Yes, this is another very important point, already discussed, to minimize (future?) duplication with KerasNLP.
As ViTs are finished, I'll be working on this one now ;)
Hey, @DavidLandup0, I would love to collaborate on this with you. :) I was waiting for the ViT to be added so I could build on top of it from a design point of view. Since you have worked on it, collaborating with you would be a great learning experience. :)
@ayulockin just to note, MAXIM is welcome too. Most of the official code (JAX) + weights was ported to Keras, here.
If nobody else signs up for it by the time MaxViT is done, I'd gladly hop onto MAXIM too :)
Since MaxViT uses MBConvs, which we have in EfficientNets, and which originated in MobileNets, we'll have three architectures reusing the same blocks. Additionally, having them as a layer would let users build networks with them themselves for edge/mobile applications. I think we should have … Can I separate it into a layer and refactor EfficientNets in preparation for MaxViT? @tanzhenyu @LukeWood @bhack
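As a rough illustration of what pulling MBConv out into a reusable building block could look like (the function name, arguments, and defaults below are assumptions, not the existing EfficientNet code):

```python
from tensorflow import keras

def mbconv_block(x, filters, expansion=4, kernel_size=3, strides=1, se_ratio=0.25):
    """Sketch of an MBConv block: expand -> depthwise conv -> SE -> project.
    Illustrative only; the real blocks also include dropout/stochastic depth."""
    input_filters = x.shape[-1]
    residual = x

    # 1x1 expansion convolution.
    x = keras.layers.Conv2D(input_filters * expansion, 1, padding="same", use_bias=False)(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation("gelu")(x)

    # Depthwise convolution.
    x = keras.layers.DepthwiseConv2D(kernel_size, strides=strides, padding="same", use_bias=False)(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation("gelu")(x)

    # Squeeze-and-excitation.
    se = keras.layers.GlobalAveragePooling2D(keepdims=True)(x)
    se = keras.layers.Conv2D(max(1, int(input_filters * se_ratio)), 1, activation="gelu")(se)
    se = keras.layers.Conv2D(input_filters * expansion, 1, activation="sigmoid")(se)
    x = keras.layers.Multiply()([x, se])

    # 1x1 projection back down to the output width.
    x = keras.layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = keras.layers.BatchNormalization()(x)

    # Residual connection when the shapes line up.
    if strides == 1 and input_filters == filters:
        x = keras.layers.Add()([x, residual])
    return x
```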
I can work on MAXIM!!
Yep, it'd be great to reuse both MBConv and SE
Done in a new PR :)
Go ahead!
Here is a quick update on the work done so far, done in collaboration with @DavidLandup0 :) We have almost all the components, and we have stacked them together to build a barebone model. @DavidLandup0, do you have anything more to add? cc: @innat @bhack @tanzhenyu
Thanks for tagging, and awesome work on … Since we should package the components for review first, it's enough to have a rough model for the first PR to prove that they work and to assess their usage. I'll do the … It'd be a good idea to see if we can generalize the existing TransformerEncoder to be shared between ViTs and MaxViTs, since they're not too different (and allow the type of multi-head attention to be changed). The main counterargument is that it already has quite a few arguments, so a general encoder with many more might not be very user-friendly. Thoughts?
Generally, this could be an indirect signal that it requires a base class.
For reference, this is the constructor:
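(The constructor itself was not captured in the thread; the signature below is a best-effort reconstruction of a ViT-style TransformerEncoder, with argument names and defaults that are assumptions rather than the exact KerasCV code.)

```python
from tensorflow import keras

class TransformerEncoder(keras.layers.Layer):
    """Sketch of the encoder constructor; names and defaults are assumptions."""

    def __init__(
        self,
        project_dim,
        num_heads,
        mlp_dim,
        mlp_dropout=0.1,
        attention_dropout=0.1,
        activation="gelu",
        layer_norm_epsilon=1e-6,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.project_dim = project_dim
        self.num_heads = num_heads
        self.mlp_dim = mlp_dim
        self.mlp_dropout = mlp_dropout
        self.attention_dropout = attention_dropout
        self.activation = activation
        self.layer_norm_epsilon = layer_norm_epsilon
```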
Though, because of the defaults, usage can be as simple as:
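(Again a sketch under the assumed signature above: only the dimensions and head count need to be passed, with the dropout rates, activation, and epsilon falling back to their defaults.)

```python
# `patch_embeddings` stands in for a (batch, num_tokens, project_dim) tensor.
encoded = TransformerEncoder(project_dim=768, num_heads=8, mlp_dim=3072)(patch_embeddings)
```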
Now, I remember KerasNLP having this same issue. We might not be able to have a fully general TransformerEncoder for all cases, so it might be better to do them separately? In the case of MaxViT, it's one extra arg that simply defines:
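(A hedged sketch of what that extra argument would control, namely which attention mechanism gets wired into the encoder block. The helper name and flag are placeholders, not an agreed API.)

```python
from tensorflow import keras

def make_attention(project_dim, num_heads, use_relative_attention=False):
    """Hypothetical helper: the extra argument picks the attention layer."""
    if use_relative_attention:
        # MaxViT's block/grid relative self-attention would be built here;
        # it is not a stock Keras layer, so this branch is left as a stub.
        raise NotImplementedError("relative attention not sketched here")
    return keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=project_dim // num_heads
    )
```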
So it's a small change. The question is mainly for work down the line when we might need to support more options. |
I am in favour of a separate TransformerEncoder. It allows for speedy implementation, since vision transformers evolve rapidly. The counterargument is that we could implement a handful of vision transformers first and then try to build a unified transformer encoder by introducing a base class.
Great progress! The breakdown of those components sounds good to me. @vztu @Yinxiaoli can you comment here? Re David's question -- I think it'd be nice to have a transformer encoder that accepts different attention mechanisms, though we don't have plans to move relative attention to core Keras yet -- maybe later, given there are so many different attentions out there. If MaxViT can reuse the encoder, that'd be great; the core value of KerasCV is always to provide generic components.
Short Description
Multi-Axis Vision Transformer: MaxViT is a family of hybrid (CNN + ViT) image classification models that achieves better performance across the board, in both parameter and FLOPs efficiency, than both SoTA ConvNets and Transformers.
Papers
https://arxiv.org/abs/2204.01697
Existing Implementations
Official Implementation:
Google, TensorFlow 2 (Keras): /~https://github.com/google-research/maxvit
cc. @Yinxiaoli @vztu