RFC: Add a scalable representation to allow support for scalable vectors #3268

Open
wants to merge 4 commits into
base: master

JamieCunliffe

@JamieCunliffe JamieCunliffe commented May 19, 2022

A proposal to add an additional representation to be used with simd to allow for scalable vectors to be used.

Rendered

@JamieCunliffe JamieCunliffe changed the title RFC: Add a scalable representation to allow support for scalable vectors. RFC: Add a scalable representation to allow support for scalable vectors May 19, 2022
@ehuss ehuss added T-lang Relevant to the language team, which will review and decide on the RFC. A-simd SIMD related proposals & ideas labels May 19, 2022
@Amanieu
Member

Amanieu commented May 25, 2022

I think a more general definition of an "opaque" type would be useful. This is a type which can exist in a register but not in memory, specifically:

  • It can be used as a function parameter or return value.
  • It can be used as the type of a local variable.
  • (Possible extension) you can make a struct consisting only of opaque types. The struct itself acts like an opaque type.
  • You can't have a pointer to an opaque type since it doesn't exist in memory.

Other than ARM and RISC-V scalable vectors, this would also be useful to represent reference types in WebAssembly. These are opaque references to objects which can only be used as local variables or function arguments and can't be written to WebAssembly memory.

@tschuett

ARM SVE uses svfloat64x2_t. Vectors are multiples of 128 bits. I don't know what RISC-V uses.

f64xN is in the Portable packed SIMD vector types RFC.

@boomshroom

I noticed that seeing the vector length pseudoregister at runtime was considered undefined behavior. RISC-V, rather than masking out elements that aren't used, seems to primarily focus on setting the VL register, which is an actual register that needs to be modified when switching between different vector types. It also lets you change the actual "register size" by grouping together multiple physical registers, which is used either to save instructions or to facilitate type conversions (i.e. casting from a u16 vector to a u32 vector puts the result across 2 contiguous vector registers, which can then be used as though they're one register).

@JamieCunliffe
Author

@boomshroom
I'm not too familiar with RISC-V. The reason I said changing VL at runtime is undefined is that LLVM considers vscale to be a runtime constant and, as far as I'm aware, considers changing vscale to be undefined behaviour.

"That vscale is constant -- that the number of elements in a scalable vector does not change during program execution -- is baked into the accepted scalable vector type proposal from top to bottom and in fact was one of the conditions for its acceptance" - https://lists.llvm.org/pipermail/llvm-dev/2019-October/135560.html

It might just be a case of changing the wording so that it's more clear that causing vscale to change is the undefined behaviour. On RISC-V, I think vscale corresponds to VLMAX rather than VL. If that seems reasonable then I can update the RFC accordingly.

@Amanieu
I think we would have to be careful with the wording here, "This is a type which can exist in a register but not in memory" could be a little confusing as the SVE types can spill to the stack for instance.

Just to be clear though, are you asking me to transform this into a more general RFC for opaque types, or just mention them?

@tschuett

tschuett commented Jun 7, 2022

ARM offers ACLEs, which can read the vscale. I have an array of floats, then I read them with ACLE SVE. Do SVE types ever exist in memory or only in registers?

@Amanieu
Member

Amanieu commented Jun 7, 2022

I don't think this needs to be a general RFC on opaque types, but more details on how scalable vectors differ from normal types would be nice to have.

@tschuett

tschuett commented Jun 7, 2022

There are SVE registers. The calling convention can probably pass scalable vectors on the stack. Then it will be vscale * 1 bytes. It has to be a fixed size.

@tschuett

tschuett commented Jun 7, 2022

If you have too much time, you can actually play with a SVE box:
https://github.com/aws/aws-graviton-getting-started
The other option is a Fujitsu box. It is a harder problem to get access.

@tschuett

tschuett commented Jun 7, 2022

One selling point of SVE is: if you use ARM ACLE SVE intrinsics and you follow the rules, then your program will run on 256-bit and 2048-bit hardware. ARM SVE are plain Cray vectors. I believe the RISC-V scalable vectors are more elaborate.

@clarfonthey

I'm honestly a bit confused by this RFC. I understand the benefits of SVE and what it is, but I'm not 100% sure what it's asking.

Specifically, it seems like it's suggesting stabilising #[repr(simd)] for scalable vectors, which… I don't think is stabilised or will ever be stabilised for fixed-size vectors? Is it suggesting to add specific ARM-specific intrinsics in core::arch? How would this be added to std::simd when that gets stabilised?

Like, I'm sold on the idea of having scalable vectors in stdlib, but unsure about both what the RFC is proposing, and the potential implementation.

@tschuett

tschuett commented Jun 8, 2022

>  wc -l arm_sve.h
24043 arm_sve.h

@eddyb
Member

eddyb commented Jun 8, 2022

I think a more general definition of an "opaque" type would be useful. This is a type which can exist in a register but not in memory, specifically:

  • It can be used as a function parameter or return value.
  • It can be used as the type of a local variable.
  • (Possible extension) you can make a struct consisting only of opaque types. The struct itself acts like an opaque type.
  • You can't have a pointer to an opaque type since it doesn't exist in memory.

Other than ARM and RISC-V scalable vectors, this would also be useful to represent reference types in WebAssembly. These are opaque references to objects which can only be used as local variables or function arguments and can't be written to WebAssembly memory.

@Amanieu Mostly agree with #3268 (comment), just had a couple notes:

  • "opaque" feels ambiguous with e.g. extern { type } and similar existing FFI concepts
    • ironically, they're opposites, because extern { type } is "always behind a pointer" (i.e. data in memory), while this other concept is "never in memory"/always-by-value
    • free bikeshed material: "value-only types", "exotic types" (too vague?), "memoryless types"
    • however, there is an interesting connection: if we consider a Sized/DynSized/Pointee hierarchy, then the straightforward thing to do is have such types be !Pointee (which also implies they can't be used in ADTs without making the ADTs !Pointee as well, forcing FCA(first-class aggregates)/early SROA(scalar replacement of aggregates))
  • more than just/on top of externref in wasm, upcoming GC proposals would have entire hierarchies of types that it would be nice to have access to
    • unlike miri/CHERI, wasm wants to keep linear memory a plain array of bytes so all the GC allocations are completely separate - great design, but if we don't want LLVM/linker-level errors about how they got misused, we do need robust high-level support
    • long-term, GC-only wasm (w/o linear memory) could serve as a building block for some very interesting things (been thinking about it a lot in the context of GraalVM / Truffle, which today is built on Java bytecode)
  • Rust-GPU/rustc_codegen_spirv exposes several SPIR-V types that are effectively high-level abstract handles to GPU resources (buffers, textures, various aspects of raytracing, etc.), and while SPIR-V is inconsistent about how it deals with them (e.g. whether a pointer is required/allowed/disallowed), it would be great to hide a lot of it from the Rust code
    • OTOH long-term we may end up having good enough capabilities in rewriting memory-heavy code to memory-less code that we may not want to limit the user, and if we'd be comfortable with erroring in our equivalent of LTO (instead of on the original generic Rust code), then a lot of this probably doesn't matter as much

@workingjubilee
Member

@tschuett This is an RFC, not IRC. Please only leave productive comments that advance the state of the conversation instead of non-contributing allusions that have no clear meaning. I can't even tell if your remark is critical or supportive.

@tschuett

tschuett commented Jun 8, 2022

Sorry for my misbehaviour. I am supportive of adding scalable vectors to Rust. Because of type inference you cannot see that the pred variable is a predicate.

@tschuett

tschuett commented Jun 8, 2022

The real question is whether you want to make scalable vectors target-dependent (SVE, RISC-V).
I still like this f64xN: scalable vectors of f64. rustc or LLVM can make it target-dependent:
https://github.com/gnzlbg/rfcs/blob/ppv/text/0000-ppv.md#unresolved-questions

@programmerjake
Member

The real question is whether you want to make scalable vectors target-dependent (SVE, RISC-V).

Imho scalable vectors should be target independent, the compiler backend will simply pick a suitable constant for vscale at compile time if not otherwise supported.

@tschuett

tschuett commented Jun 8, 2022

Note that vscale is a LLVM thing and should not be part of the RFC. LLVM assumes the vscale is an unknown but constant value during the execution of the program. The real value is hardware dependent.

@programmerjake
Member

Note that vscale is a LLVM thing and should not be part of the RFC.

I think it should not be dismissed just because it's a LLVM thing: every other compiler will have a similar constant simply because they need to represent scalable vectors as some multiple of an element count, that multiple is vscale.

Also, there should be variants for vectors like llvm's <vscale x 4 x f32>, not just <vscale x f32>, especially because fixed-length vector architectures are likely to pick 1 as vscale and vectors should be more than 1 element for efficiency.

https://reviews.llvm.org/D53695

Legalization

To legalize a scalable vector IR type to SelectionDAG types, the same procedure
is used as for fixed-length vectors, with one minor difference:

  • If the target does not support scalable vectors, the runtime multiple is
    assumed to be a constant '1' and the scalable flag is dropped. Legalization
    proceeds as normal after this.
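
The lane-count arithmetic behind `<vscale x 4 x f32>` can be sketched as follows. This is only an illustration: the function names `lanes` and `sve_vscale` are hypothetical, and the sketch assumes the SVE convention that vscale is the register width divided by 128 bits.

```rust
/// Number of lanes in an LLVM-style `<vscale x MUL x T>` vector.
/// `vscale` is the runtime multiple; `mul` is the compile-time minimum lane count.
pub const fn lanes(vscale: usize, mul: usize) -> usize {
    vscale * mul
}

/// Assumption for this sketch: on SVE the register width is a multiple of
/// 128 bits and vscale is that width divided by 128.
pub const fn sve_vscale(register_bits: usize) -> usize {
    register_bits / 128
}
```

With these definitions, a `<vscale x 4 x f32>` has 8 lanes on 256-bit hardware, and 4 lanes on a fixed-length target where the backend picks vscale = 1. A bare `<vscale x f32>` would have a single lane in the latter case, which is the efficiency concern raised above.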

@tschuett

tschuett commented Jun 9, 2022

Do you want to expose this in Rust or should it be an implementation detail of the compiler?

@programmerjake
Member

Do you want to expose this in Rust or should it be an implementation detail of the compiler?

imho @rust-lang/project-portable-simd should expose scalable vector types with vscale, an additional multiplier, and an element type -- perhaps by exposing a wrapper struct that also contains the number of valid elements (like ArrayVec::len -- VL for RISC-V V and SimpleV) rather than the underlying compiler type.

@programmerjake
Member

programmerjake commented Jun 9, 2022

One important thing that imho this RFC needs to be usable by portable-simd is for the element type and the multiplier to be able to be generics:

#[repr(simd, scalable(MUL))]
struct ScalableVector<T, const MUL: usize>([T; 0]);

portable-simd's exposed wrapper type might be:

pub struct ScalableSimd<T, const MUL: usize>
where
    T: ElementType,
    ScalableMul<MUL>: SupportedScalableMul,
{
    len: u32, // exposed as usize, but realistically u32 is big enough
    value: ScalableVector<T, MUL>,
}

@tschuett

tschuett commented Jun 9, 2022

How about this notation (without the 4):

#[repr(simd, scalable)]
#[derive(Clone, Copy)]
pub struct svfloat32_t {
    _ty: [f32; 0],
}

It is a target-independent scalable vector of f32. If you need len(), then it will tell you the number of f32 elements in the vector.

@JamieCunliffe
Author

MUL would be known at compile time and it's being constrained to a valid value by the traits, so I don't see a reason we couldn't have something like that. Having said that, I'm not yet fully sure of the implications of allowing a repr to depend on a const generic parameter as part of it though.

@tschuett
The RFC gives details as to why this takes a parameter, but without this parameter rustc would need to know about the SVE and RISC-V types (and any other future scalable SIMD extensions that might be created) to be able to emit the correct types to the compiler backend. For example, with SVE and LLVM you can't just use vscale x i64; the SVE intrinsics would be expecting a vscale x 2 x i64.

My intention was that the feature proposed by this RFC would be target independent, and the rustc implementation would be target independent.
The bit that would then make it target dependent would be stdarch which would be able to expose a set of types and intrinsics that are architecture (and compiler backend) specific, like currently exists for SIMD.

@tschuett

Honestly my RISC-V knowledge is limited. If you say that MUL is 4, then you make it target-dependent. It most likely only works for SVE. What if a new scalable ISA comes along in the future that requires 8? How can your representation with integers be target-independent?

I agree with your vscale vector examples.

Maybe you can query LLVM for information about targets.

@tschuett

For reference, IBM is also working on a scalable vector ISA:
https://libre-soc.org/openpower/sv/svp64/
https://libre-soc.org/openpower/sv/overview/

`Sized` (or both). Once returning of unsized is allowed this part of the rule
would be superseded by that mechanism. It's worth noting that, if any other
types are created that are `Copy` but not `Sized` this rule would apply to
those.
Member

Remember that Rust has generics, so I can e.g. write a function fn foo<T: Copy>(x: &T) -> T. The RFC seems to say this is allowed, because the return type is Copy. But for most types T and most ABIs this can't be implemented.

You can't just say in a sentence that you allow unsized return values. That's a major language feature that needs significant design work on its own.

I think what you actually want is some extremely special cases where specifically these scalable vector types are allowed as return values, but in a non-compositional way. There is no precedent for anything like this in Rust so it needs to be fairly carefully described and discussed.
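
A minimal sketch of the generic function mentioned above, to make the concern concrete. The body is an assumption (the comment only gives the signature); the point is that it compiles today because every `T: Copy` is also `Sized`.

```rust
/// Compiles on stable Rust because `T` carries an implicit `Sized` bound
/// (and `Copy: Clone`, `Clone: Sized` besides), so the size of the return
/// value is known at compile time for every instantiation.
pub fn foo<T: Copy>(x: &T) -> T {
    // Reading through the reference duplicates the value bit-for-bit.
    *x
}
```

`foo(&5u32)` and `foo(&[1u8, 2, 3])` both work because their sizes are statically known. A type that is `Copy` but whose size is only known at runtime would have to flow through the same generic code, which is where the ABI question arises.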

RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 27, 2024
… r=Amanieu

Stabilize Ratified RISC-V Target Features

Stabilization PR for the ratified RISC-V target features. This stabilizes some of the target features tracked by #44839. This is also a part of #114544 and eventually needed for the RISC-V part of rust-lang/rfcs#3268.

There is a similar PR for the stdarch crate which can be found at rust-lang/stdarch#1476.

This was briefly discussed on Zulip
(https://rust-lang.zulipchat.com/#narrow/stream/250483-t-compiler.2Frisc-v/topic/Stabilization.20of.20RISC-V.20Target.20Features/near/394793704).

Specifically, this PR stabilizes the:
* Atomic Instructions (A) on v2.0
* Compressed Instructions (C) on v2.0
* ~Double-Precision Floating-Point (D) on v2.2~
* ~Embedded Base (E) (Given as `RV32E` / `RV64E`) on v2.0~
* ~Single-Precision Floating-Point (F) on v2.2~
* Integer Multiplication and Division (M) on v2.0
* ~Vector Operations (V) on v1.0~
* Bit Manipulations (B) on v1.0 listed as `zba`, `zbc`, `zbs`
* Scalar Cryptography (Zk) v1.0.1 listed as `zk`, `zkn`, `zknd`, `zkne`, `zknh`, `zkr`, `zks`, `zksed`, `zksh`, `zkt`, `zbkb`, `zbkc`, `zbkx`
* ~Double-Precision Floating-Point in Integer Register (Zdinx) on v1.0~
* ~Half-Precision Floating-Point (Zfh) on v1.0~
* ~Minimal Half-Precision Floating-Point (Zfhmin) on v1.0~
* ~Single-Precision Floating-Point in Integer Register (Zfinx) on v1.0~
* ~Half-Precision Floating-Point in Integer Register (Zhinx) on v1.0~
* ~Minimal Half-Precision Floating-Point in Integer Register (Zhinxmin) on v1.0~

r? `@Amanieu`
@RalfJung
Member

I wonder if the proposal for "claimable" types with automatic claim can be used to overcome the issue of Copy: Sized? We'd still need to introduce a new category of "types that are unsized but can anyway be passed to and from functions", but maybe we don't have to break Copy: Sized...

@Amanieu
Member

Amanieu commented Jun 27, 2024

The current plan in the implementation PR (rust-lang/rust#118917) is for scalable vector types to not implement either Copy or Sized but to instead specifically allow these types to be used as local variables and function arguments/return values.

My understanding is that this RFC is going to be rewritten to match the new implementation plan.

@RalfJung
Member

That sounds potentially quite hacky... but in the end it'll be up to @rust-lang/types to decide whether that is acceptable.

An interesting part of this will be properly working out the MIR semantics, ideally by implementing them in the interpreter.

@eddyb
Member

eddyb commented Nov 11, 2024

Citing myself from a comment on the draft PR (rust-lang/rust#118917 (comment)):

Based on rust-lang/rust#46571 (comment) (by @kennytm, responding to @Amanieu, on the matter of e.g. size_of_val on SVE types), I can probably expand on my comment above (rust-lang/rust#118917 (comment)):

The impression I'm getting is that the intent here is to make something similar to !Sized (and more specifically, some DynSized proposals) but "by-value, not in memory"... except even "by-value DynSized" is still in-memory, just allowing variable-level "move" operations (growing the stack dynamically and copying the contents, or even taking advantage of pass-by-ref).

But with a type that is meant to be register-only, wouldn't it make more sense to have it outside of the DynSized/Sized spectrum? How would you even place such a value in memory without the dependent typing to be able to ensure it's never accessed with different sizes? (unless this allows ARM SVE but rules out RISC-V's V extension, which would seem like quite the overspecialization IMO)

Relying on something like !Pointee to 100% rule out such types from memory (including being borrowed etc.) seems much better, and we can keep ?Pointee perma-unstable if only intrinsics might need it (if at all) - and that seems to be what I was saying even back in #3268 (comment).

@RalfJung
Member

But with a type that is meant to be register-only, wouldn't it make more sense to have it outside of the DynSized/Sized spectrum? How would you even place such a value in memory without the dependent typing to be able to ensure it's never accessed with different sizes? (unless this allows ARM SVE but rules out RISC-V's V extension, which would seem like quite the overspecialization IMO)

Could someone explain the key difference between ARM SVE and RISC-V Vectors here?

@Amanieu
Member

Amanieu commented Nov 11, 2024

Could someone explain the key difference between ARM SVE and RISC-V Vectors here?

From the language's point of view they can be treated mostly identically. The platform-specific intrinsics expose types whose size is only known at runtime and can be computed by reading a CPU register (vl on AArch64, vlenb on RISC-V).

@RalfJung
Member

RalfJung commented Nov 11, 2024 via email

@michaelmaitland

michaelmaitland commented Nov 11, 2024

Could someone explain the key difference between ARM SVE and RISC-V Vectors here?

From the language's point of view they can be treated mostly identically. The platform-specific intrinsics expose types whose size is only known at runtime and can be computed by reading a CPU register (vl on AArch64, vlenb on RISC-V).

Also to be clear, this is not a real register, right? It is a per-CPU constant that can be inspected?

On RISC-V there is the read only vlenb register. It is a per-CPU constant that can be inspected. It holds the value VLEN/8, which is the vector register length in bytes. The vlenb register is not the same thing as the read-only vl register. We use the vset{i}vl{i} instructions to change the read only vl register. These instructions take a requested vector length as immediate or GPR register, and the hardware responds via a GPR with the (frequently smaller) number of elements that the hardware will handle per iteration (stored in vl). While vl may be set as VLEN, it can take on other values.

@Amanieu
Member

Amanieu commented Nov 11, 2024

Could someone explain the key difference between ARM SVE and RISC-V Vectors here?

From the language's point of view they can be treated mostly identically. The platform-specific intrinsics expose types whose size is only known at runtime and can be computed by reading a CPU register (vl on AArch64, vlenb on RISC-V).

Also to be clear, this is not a real register, right? It is a per-CPU constant that can be inspected?

On RISC-V there is the read only vlenb register. It is a per-CPU constant that can be inspected. It holds the value VLEN/8, which is the vector register length in bytes. The vlenb register is not the same thing as the read-only vl register. While vl may be set as VLEN, it can take on other values. We use the vset{i}vl{i} instructions to change the read only vl register. These instructions take a requested vector length as immediate or GPR register, and the hardware responds via a GPR with the (frequently smaller) number of elements that the hardware will handle per iteration (stored in vl).

To clarify, on RISC-V the size of vector registers is indicated by vlenb which is a constant. This is also what determines the size of the language-level scalable vector types.

vl on RISC-V is separate: it can be changed at runtime and controls the number of elements within a vector that subsequent vector instructions operate on (elements above vl are effectively ignored). It does not affect the size of vector registers or the size of language-level scalable vector types.

I hope this clarifies some of the confusion around vector lengths in RISC-V.
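
The distinction between vlenb and vl can be modelled in a few lines. This is a sketch, not the specification: the `Rvv` struct and its method names are hypothetical, and `vsetvli` here assumes the simplest policy the hardware may use, vl = min(AVL, VLMAX), with an integer LMUL.

```rust
/// Hypothetical model of one RVV implementation; names are illustrative.
pub struct Rvv {
    /// Vector register width in bits (VLEN); fixed for the implementation.
    pub vlen_bits: usize,
}

impl Rvv {
    /// Value of the read-only `vlenb` CSR: VLEN / 8, the register size in bytes.
    pub fn vlenb(&self) -> usize {
        self.vlen_bits / 8
    }

    /// VLMAX = LMUL * VLEN / SEW: the most elements one instruction can process.
    pub fn vlmax(&self, sew_bits: usize, lmul: usize) -> usize {
        lmul * self.vlen_bits / sew_bits
    }

    /// Models `vsetvli`: takes the requested element count (AVL) and returns
    /// the granted `vl`. Assumption: the simple policy min(AVL, VLMAX).
    pub fn vsetvli(&self, avl: usize, sew_bits: usize, lmul: usize) -> usize {
        avl.min(self.vlmax(sew_bits, lmul))
    }
}
```

For `Rvv { vlen_bits: 512 }`, `vlenb()` is always 64, while `vsetvli(5, 8, 1)` grants 5 and `vsetvli(100, 8, 1)` grants 64. Only the second quantity varies between calls; the first is what sizes the language-level types.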

@RalfJung
Member

RalfJung commented Nov 11, 2024 via email

@eddyb
Member

eddyb commented Nov 11, 2024

To clarify, on RISC-V the size of vector registers is indicated by vlenb which is a constant. This is also what determines the size of the language-level scalable vector types.
vl on RISC-V is separate: it can be changed at runtime and controls the number of elements within a vector that subsequent vector instructions operate on (elements above vl are effectively ignored). It does not affect the size of vector registers or the size of language-level scalable vector types.

Wait, so they exposed types to C whose implicit copies can cause the equivalent of context-save/restore operations? I wasn't even aware there were vl/vtype-agnostic instructions but:

(from https://github.com/riscvarchive/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf)

7.9. Vector Load/Store Whole Register Instructions

  • Note
    These instructions are intended to be used to save and restore vector registers when the type or length of the current contents of the vector register is not known, or where modifying vl and vtype would be costly. Examples include compiler register spills, vector function calls where values are passed in vector registers, interrupt handlers, and OS context switches. Software can determine the number of bytes transferred by reading the vlenb register.

...
The instructions operate with an effective vector length, evl=NFIELDS*VLEN/EEW, regardless of current settings in vtype and vl.

On wide enough vector machines, accidentally triggering this functionality seems to be able to easily increase the amount of data transferred by e.g. an order of magnitude.
Also, the semantics are those of context-switching, i.e. preserving hardware state when crossing between mutually-untrusted domains, not intra-domain dataflow.

That said, I concede RVV's vlenb can be considered a "load-time constant" (or whatever else we want to call these things), comparable to e.g. the virtual address of a static (w/ ASLR), even if I find exposing that architectural value as a property of data types, still somewhat odd.


To turn this around, you could argue you can do the same thing on x86: run cpuid once and compute the largest SIMD size out of SSE2/AVX/AVX512, to use as the size of a type which preserves anything in those registers, however big they happen to be.
Reminds me of some of the P-vs-E core debacles where they didn't realize all cores must present the same feature set and the scheduler can't soundly migrate processes or even run threads on a mix of different cores with different feature sets (without some kind of application opt-in, which I've yet to see materialize so far).

Anyway, it's a neat trick, and if C intrinsics used it first that removes a lot of non-language-design concerns (like what values are not allowed to change during program execution, or between its threads etc.), I'm just mildly skeptical it's also a good idea without e.g. separate types for "SSA-only vl+vtype-dependent vector value" vs "hardware register save/restore" (the latter arguably doesn't even need a type, just a way to read vlenb and then reading/writing the register file can just use raw pointers).

(also just saw rust-lang/rust#46571 (comment) which talks about even allowing size_of::<svfloat32_t>() - probably worth discussing some of this off of GitHub so I can get an idea of how far the various plans extend etc. - last time I left a comment, I forgot to contact anyone like I meant to, IIRC)

@RalfJung
Member

@eddyb sorry I am completely confused by this comment. Could you take 5 steps back and explain what this means for Rust with some context? :)

@eddyb
Member

eddyb commented Nov 11, 2024

@eddyb sorry I am completely confused by this comment. Could you take 5 steps back and explain what this means for Rust with some context? :)

Say you have a RVV implementation with a register file comparable to AVX512 (vlenb=64 bytes per register), or even several times wider (but 64 is already a good example).

The main benefit of the dynamic vector length is that you can do very simple vector loops (without scalar versions of those loops for elements that "don't fit", like SIMD architectures tend to need).
That means you can e.g. implement memcmp such that memcmp(a, b, 5) and memcmp(a, b, 64) both run the same instructions on the example AVX512-like RVV impl (and it's one iteration of a loop, with e.g. memcmp(a, b, 65) hitting two iterations etc.).

However, this is only a clear win (over a scalar loop) if e.g. memcmp(a, b, 5) sets vl+vtype such that only 5 bytes are read and operated on, and the remaining 59 bytes in each register (for the example AVX512-like RVV impl) are as inert as possible (there's also some power draw considerations, clock gating, etc. - but that's too into the weeds).
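
The loop shape described above can be simulated in scalar Rust. This is a rough sketch under stated assumptions: `vla_eq` is a hypothetical name, it only tests equality rather than implementing full memcmp ordering, and `(n - offset).min(vlmax)` stands in for what a `vsetvli` instruction would grant on real hardware.

```rust
/// Compares the first `n` bytes of `a` and `b` with a simulated vector unit
/// that handles at most `vlmax` bytes per iteration.
/// Returns (bytes are equal, number of loop iterations executed).
pub fn vla_eq(a: &[u8], b: &[u8], n: usize, vlmax: usize) -> (bool, usize) {
    assert!(vlmax > 0 && n <= a.len() && n <= b.len());
    let mut offset = 0;
    let mut iterations = 0;
    while offset < n {
        // Stand-in for vsetvli: request the remaining bytes and get back
        // at most vlmax of them. No scalar tail loop is needed.
        let vl = (n - offset).min(vlmax);
        iterations += 1;
        if a[offset..offset + vl] != b[offset..offset + vl] {
            return (false, iterations);
        }
        offset += vl;
    }
    (true, iterations)
}
```

With `vlmax = 64`, lengths 5 and 64 both take one iteration and length 65 takes two, matching the memcmp example. In the 5-byte case only 5 of the 64 bytes per register are active.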

The fact (unknown to me before today) that the C types you need to use to refer to the hardware registers (i.e. connecting vector-producing instructions to vector-consuming instructions), can also be stored to memory, means that, with a Rust version of those types and intrinsics, you're one implicit borrow away (e.g. a &self method instead of a self one) from saving the full e.g. 64 bytes to the stack, regardless of how much is actually in use (in most cases I expect MIR+LLVM optimizations to hide the issue, but still).

And this gets worse the larger the implementation choice of vector width, even if appropriate usage (keeping everything in registers) doesn't see a penalty.

I don't expect such accidents in C functions like memcpy/memset/memcmp/etc., even when specialized for RVV hardware, but a more bespoke algorithm using the RVV intrinsics directly has to tiptoe around it and make sure no local variables end up in memory (and AFAICT this is worsened by having it spread across multiple functions).

Arguably this is still an issue even when most/all of the width of the hardware registers is in use, but in a different scenario, e.g. if you have some arithmetic-heavy computation that can remain in registers (for a lot of operations, relative to initial/final memory accesses), that's very different from causing additional memory traffic e.g. for each of those arithmetic operations (OTOH, that's closer to how we optimize local variables to use registers instead of the stack, so it might not be a realistic concern).


If you're familiar with ArrayVec<T, N>, perhaps a comparison can be made to it: the intended usage of hardware like RVV is comparable to declaring e.g. several ArrayVec<T, N> variables (i.e. [MaybeUninit<T>; N] with a safe API tracking the initialized length), but only ever initializing (and accessing) a number of elements that can be on average much smaller than N.

Then the "the C intrinsics types can also be stored to memory" choice is like implementing Copy such that you are now copying the full N elements, even if most are uninitialized (vs something like clone_from, that can avoid the full memcpy, which might be what register-to-register copying looks like in realistic RVV impls).
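
A small sketch of that analogy. Everything here is hypothetical: `Reg` is a stand-in for a vector register, it uses zero-initialized `u8` storage instead of `MaybeUninit<T>` to stay in safe code, and the returned byte counts are a model of memory traffic, not a measurement.

```rust
/// Hypothetical stand-in for a vector register: N bytes of storage,
/// of which only the first `len` are in use.
#[derive(Clone, Copy)]
pub struct Reg<const N: usize> {
    pub data: [u8; N],
    pub len: usize,
}

impl<const N: usize> Reg<N> {
    pub fn from_slice(src: &[u8]) -> Self {
        assert!(src.len() <= N);
        let mut data = [0u8; N];
        data[..src.len()].copy_from_slice(src);
        Reg { data, len: src.len() }
    }

    /// Bytes moved when the whole register is stored (what `Copy` or a
    /// spill does): always N, regardless of `len`.
    pub fn spill_bytes(&self) -> usize {
        std::mem::size_of::<[u8; N]>()
    }

    /// Length-aware copy in the spirit of `clone_from`: moves only the
    /// active prefix and returns how many bytes that was.
    pub fn copy_active_into(&self, dst: &mut Self) -> usize {
        dst.data[..self.len].copy_from_slice(&self.data[..self.len]);
        dst.len = self.len;
        self.len
    }
}
```

For a 64-byte `Reg` holding 5 active bytes, `spill_bytes()` reports 64 while `copy_active_into` moves 5, which is the order-of-magnitude gap discussed earlier.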


The context-switching aspect I mentioned can be useful on its own, in the sense that the kernel doesn't have to care about how you were using the register file, and can just save & restore it in bulk (at some cost, but context-switching is already not "free", and ideally not something a thread willingly triggers itself often).
But for that side of things you don't need a language-level type, or even the intrinsics, it's more that the C implementation of intrinsics is piggy-backing on the same functionality AIUI.


Anyway, I really don't have a strong preference here, and besides being surprised by the RVV C intrinsics deciding to include a footgun (which is the only reason I had to look up the RVV specs and reply at all), I had mostly hoped to end up with some kind of solution that could also cover functionality that can't fall back to some excessive worst-case memory size (such as wasm externref - I mentioned more in #3268 (comment)), but that doesn't seem on the table.


If that plan goes forward as described (and if I've understood everything correctly, as a bystander), "load/exec/run-time constant" (or w/e a good name for the concept is) means:

  • consteval can't reason about the value at all, it "hasn't been chosen yet"
  • tools like cargo miri test could randomize it alongside all the other randomized aspects, such that trying different seeds could stress-test the correctness of some algorithm across possible architectural choices
    • if the value wasn't guaranteed in some way, each new use of the vector intrinsics could randomize the emulated hardware width (as long as it was locally consistent)
    • on the other end of the spectrum, emulating "unlimited vectors" (or rather whatever the spec maximum is) - and esp. if it wasn't just for RISC-V - can allow an interpreter (like miri) to expose bulk operations more interesting than memcpy/memset, even using a mix of rayon and host SIMD to maximize throughput, but that kind of situation is rare outside of scientific compute or some graphics etc.

I really didn't plan to write more than a couple sentences about any of this, and I'm sure there's far more qualified people to discuss ARM SVE and RVV with, feel free to hide my comments if that makes more sense etc.

@RalfJung
Member

That was quite helpful, thanks. :) (The RFC sadly lacks a lot of context, but I already left comments about that a while ago.)

@hkratz

hkratz commented Nov 12, 2024

As an aside: Actually in RISC-V one can choose to combine up to 8 registers into register groups on which the individual instructions operate, so if vlenb=64 and you set vlmul=8 instead of 32 vector registers of 64 bytes each you are now working with four registers of 512 bytes each. This is fully exposed in the C intrinsics (and the vector types). Careless use can lead to quite a lot of expensive spilling.
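
To make the arithmetic concrete, here is a small sketch. The helper `register_groups` is hypothetical (it is not part of any RVV intrinsics API) and it only covers integer LMUL values, not the fractional ones; the 32-register count comes from the RVV architecture.

```rust
/// Number of architectural vector registers in RVV.
const NUM_VREGS: usize = 32;

/// Hypothetical helper: for a given VLEN in bytes (`vlenb`) and an integer
/// LMUL, returns (number of usable register groups, bytes per group).
fn register_groups(vlenb: usize, lmul: usize) -> (usize, usize) {
    assert!(matches!(lmul, 1 | 2 | 4 | 8), "integer LMUL must be 1, 2, 4 or 8");
    // Grouping `lmul` registers divides the register count and
    // multiplies the size of each operand by the same factor.
    (NUM_VREGS / lmul, vlenb * lmul)
}
```

With `register_groups(64, 8)` this gives 4 groups of 512 bytes, versus 32 registers of 64 bytes at LMUL=1, which is why a handful of live LMUL=8 values is enough to force spills.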

@eddyb
Member

eddyb commented Nov 12, 2024

> As an aside: Actually in RISC-V one can choose to combine up to 8 registers into register groups on which the individual instructions operate,

I was wondering about LMUL (as one more factor in making the "shape" of the active vector register file dynamic), but didn't mention it since vlmul being "just" a field of vtype means it's also going to be ignored by the instructions meant to save/restore the whole register file (and when used for "individual registers" I expect them to behave like LMUL=1 or w/e).

> so if vlenb=64 and you set vlmul=8 instead of 32 vector registers of 64 bytes each you are now working with four registers of 512 bytes each. This is fully exposed in the C intrinsics (and the vector types). Careless use can lead to quite a lot of expensive spilling.

Ohh, I see, so in your example, the relevant types behave (memory-wise) like [_; 8] of the (already themselves large IMO) vlenb-sized base ones, amplifying the issue further.

That's arguably another layer of surprise, since my expectation was that LMUL applied to operations, not types, but realistically, at the programming language level, vtype has to be driven by static types, and only vl gets the "pseudo-dependent-typing" behavior.


I used to think misuse of such "your non-dependent typesystem cannot express the runtime relationships" intrinsics would result in (unfortunate) late compilation errors (comparable to some aspects of inline asm!), but I can see how a memory-heavy fallback would be attractive to anyone trying to avoid introducing high-level language features (which might not be enough anyway).

(While you can use something like indexing's "generative lifetime" trick to ensure values created with different vl values can't be used together, you would still have no real way to statically enforce exclusive usage of the vector unit, unless you ban function calls or do whole-program analysis, neither of which fits within a typesystem, maybe effect system at most - oh and I forgot the even simpler "running out of vector registers", you might as well force the intrinsics/types to use register numbers at that point)
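
For readers unfamiliar with the "generative lifetime" trick, a minimal sketch of the idea follows. It is loosely modeled on the branding pattern and is not the `indexing` crate's actual API; `Brand`, `Vl`, `Vector`, `with_vl`, and `splat_add_sum` are all hypothetical names, and a plain `Vec<f32>` stands in for a vector register.

```rust
use std::marker::PhantomData;

/// Invariant lifetime used purely as a compile-time brand.
#[derive(Clone, Copy)]
struct Brand<'id>(PhantomData<fn(&'id ()) -> &'id ()>);

/// A vector length tied to one brand (hypothetical stand-in for `vl`).
struct Vl<'id> {
    len: usize,
    brand: Brand<'id>,
}

/// A vector that remembers, in its type, which `Vl` created it.
struct Vector<'id> {
    lanes: Vec<f32>,
    _brand: Brand<'id>,
}

/// Runs `f` with a fresh brand. Because `f` must work for every `'id`,
/// each call's brand is distinct and cannot escape through `R`.
fn with_vl<R>(len: usize, f: impl for<'id> FnOnce(Vl<'id>) -> R) -> R {
    f(Vl { len, brand: Brand(PhantomData) })
}

impl<'id> Vl<'id> {
    /// Creates a vector of `len` lanes, all set to `x`, under this brand.
    fn splat(&self, x: f32) -> Vector<'id> {
        Vector { lanes: vec![x; self.len], _brand: self.brand }
    }
}

impl<'id> Vector<'id> {
    /// Lane-wise add; only accepts a vector carrying the same brand.
    fn add(&self, other: &Vector<'id>) -> Vector<'id> {
        let lanes = self.lanes.iter().zip(&other.lanes).map(|(a, b)| a + b).collect();
        Vector { lanes, _brand: self._brand }
    }
}

/// Sums `x + y` splatted across `len` lanes, all under a single brand.
fn splat_add_sum(len: usize, x: f32, y: f32) -> f32 {
    with_vl(len, |vl| vl.splat(x).add(&vl.splat(y)).lanes.iter().sum::<f32>())
}
```

Nesting two `with_vl` calls and passing a vector from the outer one to `add` on a vector from the inner one is rejected by the borrow checker, since the two brands cannot be unified. As the comment above notes, this says nothing about exclusive use of the vector unit or about running out of registers.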

@RalfJung
Member

But the summary is, "a type whose size is a runtime constant determined when the program starts" (and references to this type can be thin) is sufficient and reasonable for both the ARM and RISC-V variant of scalable vectors?

@oli-obk
Contributor

oli-obk commented Nov 12, 2024

> But the summary is, "a type whose size is a runtime constant determined when the program starts" (and references to this type can be thin) is sufficient and reasonable for both the ARM and RISC-V variant of scalable vectors?

that's how I understood it. So imo an MVP could be to not allow them on the stack at all, but require them to be heap allocated for now. This likely defeats their purpose, as every operation first needs to move the data from the heap into registers, but it's an incremental way forward that requires few language features.

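#![feature(extern_types)] // nightly-only
extern "C" {
    type ScalableVector;
}

impl ScalableVector {
    /// Number of f32 lanes; a runtime constant of the hardware.
    pub fn len() -> usize {
        todo!("sufficiently advanced assembly")
    }
    /// Reinterprets a heap buffer as the opaque vector type.
    /// Assumes `data.len() == Self::len()`.
    pub fn new(data: Vec<f32>) -> Box<Self> {
        // `Self` is an extern type, so the pointer is thin and carries no length.
        // Caveat: dropping this Box cannot recover the allocation's real layout.
        unsafe { Box::from_raw(Box::into_raw(data.into_boxed_slice()) as *mut f32 as *mut Self) }
    }
}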

Ideally we'd have a central place that defines ScalableVector (potentially with a generic param for the element type? that needs a lang feature tho, but I don't see why extern types can't be generic).

@programmerjake
Member

programmerjake commented Nov 12, 2024

> But the summary is, "a type whose size is a runtime constant determined when the program starts" (and references to this type can be thin) is sufficient and reasonable for both the ARM and RISC-V variant of scalable vectors?

well, as long as no one makes a processor where some cores have different vlen than others...like what happened with icache line size, causing lots of pain for Mono

@eddyb
Member

eddyb commented Nov 12, 2024

> But the summary is, "a type whose size is a runtime constant determined when the program starts" (and references to this type can be thin) is sufficient and reasonable for both the ARM and RISC-V variant of scalable vectors?

Sufficient? Yes. Whether "necessary" or "reasonable" is debatable (arguably not necessary if most usage can avoid memory altogether, i.e. only appear in expression output and local variable types). If nothing else works, sure, it's an "acceptable compromise" at best.

> So imo an MVP could be to not allow them on the stack at all, but require them to be heap allocated for now.

This feels very strange from the "these are not real data types, just abstract tokens to connect operations and allow the backend to do register allocation etc." point of view I started from, but I could see how one could arrive there starting at "these are weird [u8], [u8; dyn N] for an N that doesn't change during execution" instead.

In practice, heap allocation doesn't really make sense to bring up, and making size_of/size_of_val work for these types is itself secondary to whether they can be placed in memory at all, which is an extra choice on top of using them to abstractly represent registers.


> well, as long as no one makes a processor where some cores have different vlen than others...like what happened with icache line size

I've brought this up already, TL;DR is that the C intrinsics requiring vlenb invariance during execution already rules out that problem in practice (unless the execution environment looks like "Rust without C", in which case they can just as well disable the Rust RVV intrinsics).

@oli-obk
Contributor

oli-obk commented Nov 12, 2024

> This feels very strange from the "these are not real data types, just abstract tokens to connect operations and allow the backend to do register allocation etc." point of view I started from

well, we can either

  • have a ScalableVectorBox<f32> that the codegen backend will read into registers and deallocate wherever it can, and force that to happen at function boundaries, so the ABI is the magic register token ABI (please not, but it is an option)
  • then introduce a purely token type that you described, and have methods to convert between the two.

This way we can start out either with the always-heap version or the always-externref version and then figure out the other side. The backend can choose to spill registers to stack if it wants, but that's a backend thing we don't have to care about.

Having an explicit separation allows us to avoid the weird semantics of "externref up to a point then randomly you can take addresses of it".

@RalfJung
Member

RalfJung commented Nov 12, 2024 via email

@eddyb
Member

eddyb commented Nov 12, 2024

I'm sorry for stirring this back up just days too early, I suggest interested parties wait for @JamieCunliffe and/or @davidtwco to announce the next version of this RFC, or at least something more concrete.

@davidtwco
Member

We’ve opened #3729 that would provide a way for scalable vectors to work without special cases in the type system.

@michaelmaitland

> We’ve opened #3729 that would provide a way for scalable vectors to work without special cases in the type system.

Strong +1 to this approach as it pertains to implementation of scalable vectors.

Labels
A-simd SIMD related proposals & ideas T-lang Relevant to the language team, which will review and decide on the RFC. T-types Relevant to the types team, which will review and decide on the RFC.