How should we handle matrix ABIs? #133144

Open

workingjubilee opened this issue Nov 17, 2024 · 5 comments
Labels

  • A-ABI: Area: Concerning the application binary interface (ABI)
  • E-needs-investigation: Call for participation: this issue needs some investigation to determine current status
  • O-AArch64: Armv8-A or later processors in AArch64 mode
  • O-PowerPC: Target: PowerPC processors
  • O-x86_64: Target: x86-64 processors (like x86_64-*)
  • T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue

Comments

@workingjubilee
Member

workingjubilee commented Nov 17, 2024

Some CPU architectures have developed "matrix extensions". These are sometimes equivalent to "vectors, but bigger" in terms of how the ABI should be handled (reusing the same architectural state, thus having similar concerns). But not always! They may use entirely different architectural state, usually entirely "caller-save" (i.e. always "volatile" or "call-clobbered").

AArch64

Scalable Matrix Extensions

PowerPC

MMA

x86

AMX


@programmerjake
Member

afaik PowerPC MMA doesn't change the ABI: #131800 (comment)

@workingjubilee
Member Author

workingjubilee commented Nov 18, 2024

It is good this issue is about handling ABIs rather than merely describing them, then? Specifically, if we want to avoid involving this state in our ABIs, we need to adopt the same bans.

@RalfJung
Member

How does LLVM even represent these types in function signatures?

Sounds to me like this will require repr(matrix) and corresponding dedicated logic everywhere?

@workingjubilee
Member Author

workingjubilee commented Nov 18, 2024

I'm not sure if there's much in common that would justify repr(matrix). Each ISA might just require boutique handling here. But I am still trying to understand how Power ISA's MMA, Arm's Scalable Matrix Extensions, and x86's AMX tiles work, and how we will want to represent them.

My current understanding is:

PowerISA's Matrix Multiply Assist

  • __vector_pair and __vector_quad are the relevant types
  • __vector_quad represents the accumulator register

C Interop

  • According to clang the __vector_quad type should never be passed anywhere?

Intrinsics

  • The __vector_quad type is always handled by-pointer.
  • The __vector_pair type seems to be defined as opaque(?) yet is sometimes passed by-value to intrinsics. (Both conventions are sketched below.)
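
Putting those two bullets together, here is a hedged sketch of the C-side builtin usage in GCC/Clang (assuming -mcpu=power10 -mmma; builtin names are from the GCC MMA documentation, and this is an illustration rather than a definitive recipe):

```c
#include <altivec.h>

// The accumulator (__vector_quad) only ever appears behind a pointer,
// while the __vector_pair operand is passed by value.
void ger_step(double *out, __vector_pair ab, vector unsigned char c) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);            // zero the accumulator (by pointer)
    __builtin_mma_xvf64gerpp(&acc, ab, c);    // acc += outer product (pair by value)
    __builtin_mma_disassemble_acc(out, &acc); // copy out the four 128-bit rows
}
```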

Arm Scalable Matrix Extensions

It is almost more like a dedicated thread-local allocation... the "ZA array"... that gets reinterpreted or examined along various dimensions. Then you set the CPU into Matrix Math... sorry, "Arm Streaming SVE"... state, and Big Array Math happens, accumulating into the ZA array. The Big Array Math, however, is expressible as vector operations that just might use a different size than the normal Arm SVE operations, which is why it's "Streaming SVE": the model is "matrix math is mostly a pile of vector operations, done really fast". This does remove the ability to use some of the more complicated Arm SVE2 operations while in streaming mode.

C Interop

  • SME2: there is probably an assumption about what state the ZA array is in on procedure entry/exit, likely "none, that's caller-saved"
  • SME: there is probably an assumption about whether the CPU is in the "Matrix Math" ("Streaming") or "Vector Math" ("Non-Streaming") state on procedure entry/exit, and it is probably the "Vector Math" ("Non-Streaming") state
  • otherwise, it basically just seems to use the same vector registers, so there's that mercy (ACLE's spellings for these assumptions are sketched below)
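
For what it's worth, ACLE makes these entry/exit assumptions explicit through function attributes rather than leaving them to a single ABI default. A hedged sketch of the C spellings (attribute names from the ACLE SME spec, as implemented in recent Clang; treat this as illustrative):

```c
#include <arm_sme.h>

// Callee runs in streaming mode and both reads and writes ZA; the caller
// has to arrange that, rather than relying on an implicit default.
void accumulate_tile(const float *a, const float *b)
    __arm_streaming __arm_inout("za");

// A function that creates fresh ZA state of its own. Unannotated functions
// are assumed not to touch ZA at all, matching the "caller-saved" guess.
void kernel(const float *a, const float *b) __arm_new("za") {
    // ... ZA is valid here ...
}
```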

x86 AMX Tiles

The tiles seem to be more "classic" registers, but use an interesting API. They are also "shape-changing" in a way. I assume @sayantn knows more about this.

C Interop

  • there is probably an assumption about what state the tiles are in on procedure entry/exit (also probably "none, caller-saved")
  • there is probably an assumption about what shape the tiles are in on procedure entry/exit

Intrinsics

  • The __tile1024i type seems to be passed both by-value and handled by-pointer, for a typical signature looking like this (Clang's C counterpart is sketched below):

```rust
fn some_tile_intrinsic(dst: &mut __tile1024i, src_a: __tile1024i, src_b: __tile1024i)
```
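
For comparison, the C-side wrapper in Clang's amxintrin.h has the same shape: destination by pointer, sources by value, with the tile's dimensions carried inside the struct. A sketch, assuming Clang with -mamx-tile -mamx-int8:

```c
#include <immintrin.h>

// Each __tile1024i value carries both the tile contents and its shape;
// only the destination is passed by pointer.
void mac_tiles(__tile1024i *dst, __tile1024i a, __tile1024i b) {
    __tile_dpbssd(dst, a, b); // dst += a * b (i8 dot products accumulated as i32)
}
```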

@sayantn
Contributor

sayantn commented Dec 16, 2024

For AMX, the tile registers are nothing complicated - just plain old registers (with an 8192-bit size, so the feature is not enabled by default on Linux). The interesting bit is the __tile1024i type - which is a tile register (which Clang and GCC represent as i32x256) plus its shape (a pair of u16s: rows and bytes per row).

Take for example the instruction TDPBSSD.

Intel lists 2 intrinsics. The _tile variation takes the 3 tmm register numbers as input; e.g. if you want tmm0 * tmm1 in tmm2, you write _tile_dpbssd(2, 0, 1). Clang (and also Rust) directly calls llvm.x86.tdpbssd.

The __tile (2 underscores) variation, on the other hand, takes 3 __tile1024i operands. Clang uses the LLVM intrinsic llvm.x86.tdpbssd.internal, which has signature (i16, i16, i16, tmm, tmm, tmm). Clang unwraps the __tile1024i struct and passes the args to the intrinsic.

Also, the type for tmm in LLVM is called llvm_x86amx_ty - and the autoupgrade script supports bitcasts to and from i32x256 - but for some reason, this doesn't work from Rust.
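
Concretely, under those assumptions (Clang with -mamx-tile -mamx-int8, behaving as described above), the two variations look like this side by side:

```c
#include <immintrin.h>

void register_number_form(void) {
    // _tile variation: operands are literal tmm register numbers, lowered
    // straight to llvm.x86.tdpbssd. Computes tmm2 += tmm0 * tmm1.
    _tile_dpbssd(2, 0, 1);
}

void value_form(__tile1024i *dst, __tile1024i a, __tile1024i b) {
    // __tile variation: the structs are unwrapped into shape and tile
    // operands for llvm.x86.tdpbssd.internal, and the register allocator
    // picks the tmm registers.
    __tile_dpbssd(dst, a, b);
}
```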
