Support mistral 7 b #443
Conversation
Thanks a lot for doing this! I left some small comments, but this isn't a full review (I haven't looked at exactly how you implemented grouped query attention), so I'll leave that for someone else to finish off.
Just to flag: I'll review this once the previous PR on attention goes through, as we'll need to adjust for a few conflicts with it.
Hey @Felhof, nice work on this, and the abstract attention class approach is great (we're planning on doing this for a bunch of the components). Would you be able to fix the merge conflicts first, and then I can do a full review? There are a few things to adjust, as the last PR to go through had some attention component changes.
Hey! I'm curious what the status of this PR is? A few of my MATS scholars want to use Mistral. Can they just check out this PR?
That worked for me, modulo the comment I made above about using the right version of transformers (and the tokenizer being disgusting).
@alan-cooney @neelnanda-io @Felhof I want to run an activation patching experiment on llama-70b, so I'm going to check out this branch. Edit: I couldn't get it to work with some other dependencies I had, but I'll try again later this week.
@alan-cooney @Felhof What's the status of this PR? I'm not sure which of you it's blocked on. Either way, I hear that people have been able to check out this branch and get Mistral working, so thanks a lot for the work up to that point!
The branch has been working for a while but needs approval to be merged into main :) I made sure it's up-to-date with main again.
@alan-cooney Ping on review :)
Sorry folks!! I'm on it now
Nice! Thanks for adding this; the approach is great.
A general point (no need to change anything now, but worth being aware of) is that we're trying to type all methods and also add docstrings to them (Google style). If you have a chance it would be good to do this here as well, e.g. to explain the interleave approach that is used.
Thanks for the review Alan! I have added better typing and documentation and removed the demo.
Thanks! And thanks for adding this!
Woot! Really glad this got merged in! Thanks for adding it @Felhof and sorry for the long delay
Description
This PR closes #387 by adding support for Mistral and implementing Grouped Query Attention.
Mistral 7B
The weights from Huggingface's Mistral-7B-v0.1 can now be loaded into a HookedTransformer (a rough loading sketch is shown below).
Note that Mistral is only supported in transformers >= 4.34, and hence Python >= 3.8 is required to use it.
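For illustration, loading might look like the following minimal sketch. The model identifier "mistral-7b" is an assumption on my part; check the names registered in loading_from_pretrained for the exact identifier added by this PR.

```python
# Minimal sketch; the model name "mistral-7b" is assumed, not confirmed by this PR text.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("mistral-7b")

# Run a prompt through the hooked model and get logits.
logits = model("The quick brown fox jumps over the lazy dog", return_type="logits")
print(logits.shape)  # [batch, position, d_vocab]
```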
The demo notebook Mistral.ipynb features a comparison of Huggingface's Mistral with the HookedTransformer implementation. I tested it on the same prompts that were used in the Llama demo. The differences in the resulting logits were around 0.01, which is slightly higher than what Llama2 gets. This may be due to a known issue with rotary embeddings which also affects Llama2 and Pythia.
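For context, a comparison of that kind can be sketched roughly as below. This is not the notebook's exact code: the prompt is a placeholder and the TransformerLens model name is assumed.

```python
# Rough sketch of a Huggingface vs. HookedTransformer logit comparison;
# prompt and model names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformer_lens import HookedTransformer

prompt = "The capital of Germany is"
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
hf_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tl_model = HookedTransformer.from_pretrained("mistral-7b")  # assumed name

tokens = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    hf_logits = hf_model(tokens).logits
    tl_logits = tl_model(tokens, return_type="logits")

# Per the description, the maximum difference was around 0.01.
print((hf_logits - tl_logits).abs().max())
```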
Grouped Query Attention
Mistral utilizes Grouped Query Attention (GQA), which is not used by any other model supported by TransformerLens and had to be implemented. I added a new class GroupedQueryAttention and an abstract base class AbstractAttention, which now contains the functionality common to both attention classes. The difference between Attention and GroupedQueryAttention is how they handle the key and value projections, as in GQA groups of queries share the same keys and values (see image below). This mostly affects the internal workings of the class. To avoid confusion, to not break existing code interacting with attention, and to make the design of future code easier, public attributes such as W_K and W_V have the same shape for both classes. This is achieved by hiding the underlying GQA parameters behind a property that expands them using torch.repeat_interleave. A GQA block should behave the same as a regular Attention block whose weights are the result of applying torch.repeat_interleave to the GQA block's weights. There is a unit test that confirms this.
Type of change
Please delete options that are not relevant.
Checklist: