How should we handle matrix ABIs? #133144

Open

workingjubilee opened this issue Nov 17, 2024 · 5 comments
Labels

  • A-ABI: Area: Concerning the application binary interface (ABI)
  • E-needs-investigation: Call for participation: this issue needs some investigation to determine current status
  • O-AArch64: Armv8-A or later processors in AArch64 mode
  • O-PowerPC: Target: PowerPC processors
  • O-x86_64: Target: x86-64 processors (like x86_64-*)
  • T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue

Comments

@workingjubilee
Member

workingjubilee commented Nov 17, 2024

Some CPU architectures have developed "matrix extensions". These are sometimes equivalent to "vectors, but bigger" in terms of how the ABI should be handled (reusing the same architectural state, thus having similar concerns). But not always! They may use entirely different architectural state, usually entirely "caller-save" (i.e. always "volatile" or "call-clobbered").

AArch64

Scalable Matrix Extensions

PowerPC

MMA

x86

AMX


@programmerjake
Member

afaik PowerPC MMA doesn't change the ABI: #131800 (comment)

@workingjubilee
Member Author

workingjubilee commented Nov 18, 2024

It is good this issue is about handling ABIs rather than merely describing them, then? Specifically, if we want to avoid involving this state in our ABIs, we need to adopt the same bans.

@RalfJung
Member

How does LLVM even represent these types in function signatures?

Sounds to me like this will require repr(matrix) and corresponding dedicated logic everywhere?

@workingjubilee
Member Author

workingjubilee commented Nov 18, 2024

I'm not sure if there's much in common that would justify repr(matrix). Each ISA might just require boutique handling here. But I am still trying to understand how Power ISA's MMA, Arm's Scalable Matrix Extensions, and x86's AMX tiles work, and how we will want to represent them.

My current understanding is:

PowerISA's Matrix Multiply Assist

  • __vector_pair and __vector_quad are the relevant types
  • __vector_quad represents the accumulator register

C Interop

  • According to clang the __vector_quad type should never be passed anywhere?

Intrinsics

  • The __vector_quad type is always handled by-pointer.
  • The __vector_pair type seems to be defined as opaque(?) yet is sometimes passed by-value to intrinsics. (Both conventions are sketched below.)
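
Putting those two bullets together, here is a hedged sketch of the C-side builtin usage in GCC/Clang (assuming -mcpu=power10 -mmma; builtin names are from the GCC MMA documentation, and this is an illustration rather than a definitive recipe):

```c
#include <altivec.h>

// The accumulator (__vector_quad) only ever appears behind a pointer,
// while the __vector_pair operand is passed by value.
void ger_step(double *out, __vector_pair ab, vector unsigned char c) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);            // zero the accumulator (by pointer)
    __builtin_mma_xvf64gerpp(&acc, ab, c);    // acc += outer product (pair by value)
    __builtin_mma_disassemble_acc(out, &acc); // copy out the four 128-bit rows
}
```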

Arm Scalable Matrix Extensions

It is almost more like a dedicated thread-local allocation... the "ZA array"... that gets reinterpreted or examined along various dimensions. Then you set the CPU into Matrix Math... sorry, "Arm Streaming SVE"... state, and Big Array Math happens, accumulating into the ZA array. The Big Array Math, however, is expressible as vector operations that just might use a different size than the normal Arm SVE operations, which is why it's "Streaming SVE": the model is "matrix math is mostly a pile of vector operations, done really fast". This does remove the ability to use some of the more complicated Arm SVE2 operations while in streaming mode.

C Interop

  • SME2: there is probably an assumption about what state the ZA array is in on procedure entry/exit, likely "none, that's caller-saved"
  • SME: there is probably an assumption about whether the CPU is in the "Matrix Math" ("Streaming") or "Vector Math" ("Non-Streaming") state on procedure entry/exit, and it is probably the "Vector Math" ("Non-Streaming") state
  • otherwise, it basically just seems to use the same vector registers, so there's that mercy (ACLE's spellings for these assumptions are sketched below)
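
For what it's worth, ACLE makes these entry/exit assumptions explicit through function attributes rather than leaving them to a single ABI default. A hedged sketch of the C spellings (attribute names from the ACLE SME spec, as implemented in recent Clang; treat this as illustrative):

```c
#include <arm_sme.h>

// Callee runs in streaming mode and both reads and writes ZA; the caller
// has to arrange that, rather than relying on an implicit default.
void accumulate_tile(const float *a, const float *b)
    __arm_streaming __arm_inout("za");

// A function that creates fresh ZA state of its own. Unannotated functions
// are assumed not to touch ZA at all, matching the "caller-saved" guess.
void kernel(const float *a, const float *b) __arm_new("za") {
    // ... ZA is valid here ...
}
```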

x86 AMX Tiles

The tiles seem to be more "classic" registers, but use an interesting API. They are also "shape-changing" in a way. I assume @sayantn knows more about this.

C Interop

  • there is probably an assumption about what state the tiles are in on procedure entry/exit (also probably "none, caller-saved")
  • there is probably an assumption about what shape the tiles are in on procedure entry/exit

Intrinsics

  • The __tile1024i type seems to be passed both by-value and handled by-pointer, for a typical signature looking like this (Clang's C counterpart is sketched below):

```rust
fn some_tile_intrinsic(dst: &mut __tile1024i, src_a: __tile1024i, src_b: __tile1024i)
```
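
For comparison, the C-side wrapper in Clang's amxintrin.h has the same shape: destination by pointer, sources by value, with the tile's dimensions carried inside the struct. A sketch, assuming Clang with -mamx-tile -mamx-int8:

```c
#include <immintrin.h>

// Each __tile1024i value carries both the tile contents and its shape;
// only the destination is passed by pointer.
void mac_tiles(__tile1024i *dst, __tile1024i a, __tile1024i b) {
    __tile_dpbssd(dst, a, b); // dst += a * b (i8 dot products accumulated as i32)
}
```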

@sayantn
Contributor

sayantn commented Dec 16, 2024

For AMX, the tile registers are nothing complicated - just plain old registers (with an 8192-bit size, so the feature is not enabled by default on Linux). The interesting bit is the __tile1024i type - which is a tile register (which Clang and GCC represent as i32x256) plus its shape (a pair of u16s: rows and bytes per row).

Take for example the instruction TDPBSSD.

Intel lists 2 intrinsics. The _tile variation takes the 3 tmm register numbers as input; e.g. if you want tmm0 * tmm1 in tmm2, you write _tile_dpbssd(2, 0, 1). Clang (and also Rust) directly calls llvm.x86.tdpbssd.

The __tile (2 underscores) variation, on the other hand, takes 3 __tile1024i operands. Clang uses the LLVM intrinsic llvm.x86.tdpbssd.internal, which has signature (i16, i16, i16, tmm, tmm, tmm). Clang unwraps the __tile1024i struct and passes the args to the intrinsic.

Also, the type for tmm in LLVM is called llvm_x86amx_ty - and the autoupgrade script supports bitcasts to and from i32x256 - but for some reason, this doesn't work from Rust.
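
Concretely, under those assumptions (Clang with -mamx-tile -mamx-int8, behaving as described above), the two variations look like this side by side:

```c
#include <immintrin.h>

void register_number_form(void) {
    // _tile variation: operands are literal tmm register numbers, lowered
    // straight to llvm.x86.tdpbssd. Computes tmm2 += tmm0 * tmm1.
    _tile_dpbssd(2, 0, 1);
}

void value_form(__tile1024i *dst, __tile1024i a, __tile1024i b) {
    // __tile variation: the structs are unwrapped into shape and tile
    // operands for llvm.x86.tdpbssd.internal, and the register allocator
    // picks the tmm registers.
    __tile_dpbssd(dst, a, b);
}
```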
