Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] [Pre-RFC] ACLE in Rust #184

Closed
wants to merge 1 commit into from
Closed

[DO NOT MERGE] [Pre-RFC] ACLE in Rust #184

wants to merge 1 commit into from

Conversation

japaric
Copy link
Member

@japaric japaric commented Aug 20, 2018

This is a pre-RFC of an RFC meant for rust-lang/rfcs. The reasons why I'm
putting it up as a PR in this repository are that (a) PRs are nice for reviewing
these kind of documents and that (b) all ARM developers should be interested
in this and good chunk of the people subscribed to this repo are ARM developers.

The rationale for this RFC is in #63 but the TL;DR is that we want to be able to
use instructions like ISB, WFI and MRS BASEPRI on stable w/o depending on an
external assembler like arm-none-eabi-gcc.

The RFC is still incomplete but I'd like to get feedback on the proposed API for
memory barriers, hints and system register access.

Some issues I see with ACLE as a Cortex-M person:

  • No intrinsic for the BKPT instruction. There's an intrinsic for the DBG
    instruction but according to my tests that's treated as a NOP by the debugger,
    i.e. it can't be used as a hardware breakpoint.

  • No intrinsics for the CPSID and CPSIE instructions. These are used to
    implement the {disable,enable}_interrupt API in the cortex-m crate.
    However, the are intrinsics to write to PRIMASK. I think it should be possible to
    implement {disable,enable}_interrupt by writing to PRIMASK, but I think that
    implementation would result in more instructions being emitted.

Rendered version

cc @rust-embedded/cortex-m @parched @gnzlbg @hannobraun @paoloteti @andre-richter @wizofe

@japaric japaric requested review from dylanmckay, jcsoo and a team as code owners August 20, 2018 22:34
@adamgreig
Copy link
Member

Since CMSIS is also a vendor-published standard containing intrinsics, can we bring in BKPT from CMSIS? It also contains __enable_irq and __disable_irq (cpsie i, cpsid i) and __enable_fault_irq, __disable_fault_irq (cpsie f, cpsid f).

I can completely see the reasoning for implementing ACLE in core::arch::arm since it it's both lower level and more generic across more ARM families, but for Cortex-M targets we still really want intrinsics for BKPT and CPSID/CPSIE which CMSIS provides. Many (all?) of the rest of CMSIS could be implemented on top of ACLE so can live in cortex-m, but the ones that can't perhaps should go in with this RFC.

@gnzlbg
Copy link

gnzlbg commented Aug 21, 2018

Many (all?) of the rest of CMSIS could be implemented on top of ACLE so can live in cortex-m, but the ones that can't perhaps should go in with this RFC.

Does that mean that BKPT and the other intrinsics cannot be implemented on top of ACLE ?

@adamgreig
Copy link
Member

Does that mean that BKPT and the other intrinsics cannot be implemented on top of ACLE ?

Yes, see ACLE 12.1.5:

This release of ACLE does not define how to invoke a SVC (supervisor call), BKPT (breakpoint) and other related functionality.

@paoloteti
Copy link

paoloteti commented Aug 21, 2018

There are 3 worlds:

  1. pre-ACLE 1.0 (armcc v4 and v5)
  2. ACLE 1.0 and 1.1 (armcc v6)
  3. ACLE 2.0 (>= armcc v6.0)

armvcc v4/v5 support __breakpoint(), __disable_irq() and __enable_irq()

armcc v6 use a compatibility include file (arm_compat.h) to support few intrinsic not part of ACLE 1.0

So to enable __breakpoint() on armcc v6 you need to include arm_acle.h and arm_compat.h

We can have the same approach on Rust. That is more similar to a ACLE + some legacy intrinsic,
than "implement some CMSIS"

Also pay attention that:

__disable_irq() sets the I bit in the CPSR and returns the previous I bit value.
__disable_interrupt() set the PRIMASK for Cortex-M parts and sets the I and F bit in the CPSR for ARM parts.

@paoloteti
Copy link

Here some intrinsics that are supported in ARM Compiler 6 using <arm_compat.h>

@adamgreig
Copy link
Member

That is more similar to a ACLE + some legacy intrinsic, than "implement some CMSIS"

No objection to this way of describing it, though I think the end result is the same, but I note we require vendor-published specifications for intrinsics in core::arch, which CMSIS is, and which legacy intrinsics built in to ARMCC but not actually part of a specification may not be. If it is suitable to just implement the old ARMCC intrinsics that's fine, we can then do all of CMSIS on top of those in cortex-m.

__disable_irq() sets the I bit in the CPSR and returns the previous I bit value.
__disable_interrupt() set the PRIMASK for Cortex-M parts and sets the I and F bit in the CPSR for ARM parts.

Would you implement both then? In CMSIS for Cortex-M specifically there's no __disable_interrupt, only the irq and fault_irq variants, which are specified to use cpsid i and cpsid f respectively. We can already write to PRIMASK with the intrinsics from ACLE specified in the RFC, which could be used to disable interrupts, but that won't emit cpsid i.

@adamgreig
Copy link
Member

We'd be putting the intrinsics from ACLE into core::arch::arm so anyone could use them directly from core on stable; for many users they will do this directly e.g. for wfi(), nop(), etc. I don't think we'll be implementing the rest of ACLE anyway, though you could imagine writing an ACLE crate to provide the attributes via proc macros or something. From our point of view ACLE is just the specification to justify which intrinsics we put into core::arch::arm and what they do.

cortex-m can then use these (unsafe) intrinsics in its cortex_m::asm module to provide (safe) wrappers (in reality just marking many of these functions as safe I expect).

Right now cortex-m already implements several CMSIS functions, so I don't know if we'd want to move them into a new cmsis crate or just keep them there. I think that's out of scope for this RFC though.

@thejpster
Copy link
Contributor

I'm OK with core::arch::arm effectively becoming our ARLE. ACLE is a good starting point, but if it doesn't have everything we need, let's add more.

Copy link
Member

@hannobraun hannobraun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double-checked the memory barrier stuff against the ARM documentation and found some things that didn't quite add up. See comments for details. I only glanced over the rest of the document (ran out of time).

Please note that I know very little about ACLE, so I might be missing some context here and could be totally wrong.


[ACLE]: #acle

[ARM C Language Extensions Q2 2018](https://silver.arm.com/download/ARM_and_AMBA_Architecture/AR580-DA-70000-r0p0-06rel0/DDI0403E_c_armv7m_arm.pdf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This links to the ARMv7-M Architecture Reference Manual, not ACLE.

I believe this is the correct link:
https://developer.arm.com/products/software-development-tools/compilers/arm-compiler-5/docs/101028/latest/1-preface

// SY, LD, ST, ISH, ISHLD, ISHST, NSH, NSHST, OSH, OSHLD, OSHST
//
// Only SY implements the `Isb` trait
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this comment applies to the complete code example above)

You marked some of those parameters as being only present on aarch64, but according to this code, all others are available everywhere. This doesn't look quite right to me:

  1. I could find nothing in the ACLE docs on why the parameters you marked as #[cfg(target_arch = "aarch64")] should be marked such.
  2. For all three instructions, only the SY parameter is supported on Cortex-M. See ARMv6-M Architecture Reference Manual pages 121, 122, 124; ARMv7-M Architecture Reference Manual pages 236, 237, 241; ARMv8-M Architecture Reference Manual pages 422, 423, 431.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could find nothing in the ACLE docs on why the parameters you marked as #[cfg(target_arch = "aarch64")] should be marked such.

indeed. This piece of information is in the armasm Reference Guide. Though my main motivation for excluding those is that LLVM will reject asm!("DMB LD") when compiling for ARMv7 and older so this has to be done otherwise we can't implement the ACLE API.

For all three instructions, only the SY parameter is supported on Cortex-M.

Yes, the other parameters are reserved. We could conditionally exclude them from the API but AIUI ACLE doesn't exclude them. Also, LLVM is happy to accept e.g. DMB ST; it will simply be executed as DMB SY by the target system -- though the document you linked says the software shouldn't rely on this behavior.

#[cfg(not(/* like above */)]
unsafe fn dmb(&self) {
// No-op but still a compiler barrier because of "memory"
asm!("" : : : "memory" : "volatile");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above quote from ACLE states:

They may be no-ops (i.e. generate no code, but possibly act as a code motion barrier in compilers) on targets where the relevant instructions do not exist, but only if the property they guarantee would have held anyway.

Do we know that the property the instructions guarantee will hold with this implementation? I'm not saying they don't, just asking whether anyone has thought through this. I'm not familiar with the architectures in question.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am rather new to ACLE and CMSIS so please ignore if I am just about to show my ignorance ;)

I really wonder if the target gating here is a boon or a bane.

If I am a programmer knowledgeable enough that I want to insert something as low level as a DMB, I will probably anyways work with the architecture reference manual of my target CPU, and can just double check what my options are for DMB and then write something that explicitly emits what I want.

With a target-gated __dmb() in std, I additionally have to cross-check if Rust's implementation of it does for my architecture what I expect it to do (in contrast to doing the target gating in my own code).

I know that portability is why ACLE and CMSIS exist. Are they used often in the wild?
Because I wonder if we can be sure that there aren't cases where ACLE assumptions result in unwanted behavior for this or that target. Just thinking out loud... 😕

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a target-gated __dmb() in std, I additionally have to cross-check if Rust's implementation of it does for my architecture what I expect it to do (in contrast to doing the target gating in my own code).

The answer is in the ACLE spec:

The intrinsics in this section are available for all targets. They may be no-ops (i.e. generate no code, but possibly act as a code motion barrier in compilers) on targets where the relevant instructions do not exist, but only if the property they guarantee would have held anyway.

@japaric japaric mentioned this pull request Aug 21, 2018
1 task
@japaric japaric added the RFC label Aug 21, 2018
@japaric japaric modified the milestones: 2018 edition, RC1 Aug 21, 2018
> portable across compilers, and across Arm architecture variants, while
> exploiting the advanced features of the Arm architecture.

## Memory barriers
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole chapter screams "memory model". cc @ralfj @ubsan @rkruppe

/// specified scope) before memory accesses issued after the DMB.
///
/// For example, DMB should be used between storing data, and updating a flag
/// variable that makes that data available to another core.
Copy link

@gnzlbg gnzlbg Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't atomics enough for this operation? Or is this just implemented as an atomic operation ? (if so, it should state which one exactly).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that atomic::fence produces a DMB instruction (and that all non-relaxed atomic ops involve atomic fences) so I checked what that function produces for different arguments and different compilation targets:

For aarch64-unknown-linux-gnu, atomic::fence(Ordering::Acquire) produces DMB ISHLD and the other 3 (excluding Ordering::Relaxed) produce DMB ISH

For thumbv7m-none-eabi and similar, atomic::fence(*) (any ordering but relaxed) produces DMB SY.

Why aren't atomics enough for this operation?

The API in std::sync::atomic doesn't seem to cover all the possible ways one can use the DMB instruction. The aarch64-linux target in particular seems to be geared towards homogeneous multicore systems therefore it uses the ISH variants; my understanding is that heterogeneous multicore systems (Cortex-A + GPU, or Cortex-A + Cortex-M) and embedded devices (a core and peripherals connected to the same data bus) need the stronger SY (full system) variants for proper synchronization.

Or is this just implemented as an atomic operation ?

These will be implemented using inline assembly, not atomic::fence or any other LLVM (fence) intrinsic.

if so, it should state which one exactly

At the beginning of this comment I wrote what atomic::fence maps to today, but I don't know if it's guaranteed those mappings will hold in the future, or if changing the codegen options changes the mappings. My understanding is that replacing DMB ISH with DMB SY is valid because the later has stronger guarantees so that would be a valid change for LLVM to make in the future.

So I wouldn't write that "DMB ISH maps to atomic::fence(Ordering::SeqCst)" in the docs because I'm not sure it will always hold.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aarch64-linux target in particular seems to be geared towards homogeneous multicore systems therefore it uses the ISH variants; my understanding is that heterogeneous multicore systems (Cortex-A + GPU, or Cortex-A + Cortex-M) and embedded devices (a core and peripherals connected to the same data bus) need the stronger SY (full system) variants for proper synchronization.

I believe big.LITTLE systems are reasonably common on high-end mobile SoCs that run Linux, so I would be surprised if the usual atomics weren't enough for those sorts of systems. However, it sounds quite plausible that you'd need a stronger barrier for things like DMA transfers or memory shared with a GPU.

At the beginning of this comment I wrote what atomic::fence maps to today, but I don't know if it's guaranteed those mappings will hold in the future, or if changing the codegen options changes the mappings.

There are multiple "minimal" compilation strategies for atomics that are sound (see https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), so the mapping could theoretically change even without "useless" strengthenings like DMB ISH -> DMB SY. One practical constraints, though, is that all code working on the same model needs to be compiled with the same mapping (again see the link).

I'd think the more important knowledge for a programmer is the opposite direction: which uses of these intrinsics subsume which atomics? For example, in which cases can atomic::fence(...); __dmb(...); be collapsed to just __dmb(..);? I don't have a good answer for that either, but at least in principle it can be answered by correlating the formal definitions of the relevant ISA memory model (note: there are multiple substantially different ones even for ARMv8) and language memory model.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@japaric

The API in std::sync::atomic doesn't seem to cover all the possible ways one can use the DMB instruction.

YES! And probably it can't. For example DMB OSH can be used only if "DMA is Bufferable" (CONFIG_ARM_DMA_MEM_BUFFERABLE on Linux) OR in SMP mode, DMB SY otherwise. But only an OS or a bare-metal applications can have such infos.

So like in the implementation of memory barriers these intrinsics will do
nothing on *some* sub-architectures.

## System register access
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are systems registers? Particularly, which operations modify them?

Is LLVM allowed to assume when calling an extern function that the contents of system registers won't change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System registers can be one of (this list may not be exhaustive): status register, program counter, link register, the stack pointer and special registers present in the Cortex-M and Cortex-R variants.

The status register can change whenever the processor executes a instruction as it contain flags that indicate if the last operation result in a zero, or if it overflowed, etc. The program counter changes every time the processor executes a instruction. The link register will change every time there's a function call via the BL (branch link) or BLX (branch link exchange) instruction. The compiler will generate instructions that change the stack pointer to allocate space for stack variables.

I know of no practical use case for doing RMW (read-modify-write) operations on those registers exactly because compiler optimizations make the contents of these registers unpredictable, if that's what you are concerned of.

There are use cases for reading things like the stack pointer and the linker register and then restoring them at a later time to implement context switching but those routines have to be written in assembly to work properly (you can't piece together intrinsics and get the behavior you want).

There is a use case for reading registers like the stack pointer for debugging / diagnostics purposes.

The Cortex-M / Cortex-R special register include registers like BASEPRI, PRIMASK, FAULTMASK. These are used to create critical sections and you will be doing reads and writes to these. The writes need to behave as compiler barriers ("memory" clobber in asm!) to get the correct behavior.

Is LLVM allowed to assume when calling an extern function that the contents of system registers won't change?

If by "extern" you mean FFI then the answer is no. As I mention in the beginning even single instructions will change some of these registers.

Copy link

@gnzlbg gnzlbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of comments, where I wonder basically how some of these intrinsics (barriers, writes to system registers) interact with Rust's / LLVM memory model.

With respect to memory barriers, their specification just reads too much like "memory model"-speak in the context of atomics. If they just perform atomic operations, they should just state exactly which atomic operation they perform. If they differ from the existing atomic operations, they should state exactly how do they differ, and what they allow that the current atomic operations do not. I am not opposed to adding aliases for already existing atomic operations, but if that's the case, I'd prefer these to just link to those operations appropriately so that we don't have to maintain memory model minutiae in this part of the std library. If these are new atomic operations that are currently not available in Rust, I'd suggest involving the memory model working group in this issue and their precise specification.

With respect to writes to system registers, I wonder if they can actually work correctly. On x86, for some registers, they do not - no idea about ARM but the RFC should convince me that this is the case. That is, that if I read a value from a system register, modify it, and write it back, that LLVM hasn't written something else to the system register while I was modifying it, and that my modification won't introduce undefined behavior. Sadly, I don't know what these systems registers are, or which state they store, and explaining all of it might not be relevant for the whole RFC, but maybe somebody can answer in a comment, and a small summary can be included in the final RFC.

@gnzlbg
Copy link

gnzlbg commented Aug 22, 2018

Since CMSIS is also a vendor-published standard containing intrinsics, can we bring in BKPT from CMSIS? It also contains __enable_irq and __disable_irq (cpsie i, cpsid i) and __enable_fault_irq, __disable_fault_irq (cpsie f, cpsid f).

IIUC in the last discussion, CMSIS isn't an ARM C API, but a hardware abstraction layer that builds on top of ACLE. The consensus was that it belongs in a separate crate, and some people were against exposing multiple ways of doing the same thing in std::arch (CMSIS re-exports some of the ACLE functionality under different names).

If there are some CMSIS APIs that cannot be implemented on top of ACLE, it would be nice to figure out which ones they are, and why they cannot be implemented. This will be easier once ACLE is implemented, and a cmsis crate is implemented on top. Based on the result, we should reconsider the decision of also exposing CMSIS from std::arch. I personally have no issues with exporting duplicated functionality: if this is what the vendors do, then that's just how it is.

It might be, however, worth it to leave CMSIS out for now, and focus on implementing ACLE in std::arch first. We can always add the CMSIS intrinsics later, and stabilize them independently, and with the proof that the cmsis crate cannot be implemented on top of ACLE, then doing so should be pretty uncontroversial.

@therealprof
Copy link
Contributor

@gnzlbg

IIUC in the last discussion, CMSIS isn't an ARM C API, but a hardware abstraction layer that builds on top of ACLE. The consensus was that it belongs in a separate crate, and some people were against exposing multiple ways of doing the same thing in std::arch (CMSIS re-exports some of the ACLE functionality under different names).

CMSIS is wild potpourri of a lot of different things, CMSIS-CORE is indeed a Cortex-M specific C API. ACLE is a set of compiler intrinsics and macros to allow easier access to various processing units (e.g. DSP, SIMD, Secure Elements...).

The only things CMSIS-CORE (and a few more building blocks like CMSIS-RTOS and CMSIS-DRIVER) and ACLE have in common is that they're really only relevant to C/C++ programming languages and in my opinion do not make a whole lot of sense in the Rust world at all.

@gnzlbg
Copy link

gnzlbg commented Aug 22, 2018

and in my opinion do not make a whole lot of sense in the Rust world at all.

What do you think would make more sense?

@therealprof
Copy link
Contributor

@gnzlbg Well, we need the assembler intrinsics so it's certainly a good idea to check CMSIS-CORE and ACLE for the supplied ones and pick the important ones.

Anything higher level and even the weird naming is irrelevant for Rust. After all CMSIS-CORE and ACLE are ARMs attempt to provide a framework for lowish-level developments and different compiler vendors; we already have those frameworks with cortex-m, cortex-rt and cortex-rtfm and there's only one Rust compiler. Also let's not forget that CMSIS is ARM only and we certainly want to cater for other architectures as well.

@paoloteti
Copy link

The only things CMSIS-CORE (and a few more building blocks like CMSIS-RTOS and CMSIS-DRIVER) and ACLE have in common is that they're really only relevant to C/C++ programming languages

The key point is that the goal here is to avoid asm on upper-level crates, move to stable etc.
In this sense I'm not at all against extra stuff on (rust)-ACLE and I agree that naming conv. is not relevant at all, but I still have some concern.

But usually when a spec. leave something out (for example __breakpoint()) that was part of armcc v5 (pre-ACLE) and now 'legacy' or armcc v6 (ACLE 1.0) I always suspect that there is a reason and in general the answer is: "lack of hw support in the future", "future silicon design direction" or just "discourage use of something" or "this is not the rigth place". (I think there is a reason for __breakpoint())

Also ACLE deprecate some 'dsp' intrinsics on Cortex-A in favour of similar 'neon' intrinsics tha CMSIS still support. This is a clear "future silicon design direction" case. Also there is nothing for Cortex-R on CMSIS and brand-new Cortex-R.

In a sense ACLE dictate "future direction" instead CMSIS is just a wrapper around different compilers and do not forget that CMSIS was born before ACLE.

@paoloteti
Copy link

paoloteti commented Aug 23, 2018

I try to explain why __breakpoint()/BKPT is not in ACLE.

Mainly a debugger insert a BKPT (aka sw breakpoint) in this way:

  1. Read memory location and save actual instruction.
  2. Write BKPT instruction to the memory location.
  3. Restore PC etc etc

Of course cache coherency issues might arise when writing a BKPT instruction, so some debugger
flush the cache or use CP14 to control under the hood the cache behaviour in debug mode or read-back the BKPT.

In general on Cortex-R/A the debug software must carefully program certain debug events to prevent the
processor from entering an unrecoverable state.

Also BKPT have a different behaviour on Prefect Abort events if processor is or not in debug mode, so debugger control under the hood the DBGDSCR register in this case.

On Cortex-R/A a Prefetch Abort or Data Abort handler must check the value of the CP15 Fault Status Register too.
On Cortex-R in lockstep mode BKPT is a mess and unpredictable without a professional debugger doing extra work for you.

So BKPT/__breakpoint() makes sense only on small Cortex-M CPUs with an SVD interface, otherwise the answer is "let your professional debugger do the job for you (for Cortex-M too!)".

@japaric
Copy link
Member Author

japaric commented Aug 31, 2018

I have put up an implementation of the proposed API in the arle repository.

There's one (*) deviation from the RFC / ACLE:

If a non-SIMD instruction is not available for the target architecture then
the API won't be available for that architecture. ACLE says that the API should
be available but that the operation should be NOP at runtime at least in the
cases of hints. This felt unrusty to me so I went with the "if it's not going to
work then it shouldn't compile" approach.

To check that you can implement the CMSIS-CORE API with this I made a cmsis
crate that build most of the CMSIS-CORE API (mainly the part that we actually
use in practice) using only arle. The only function that I couldn't implement
was __BKPT -- why that cannot be implemented has already been mentioned in
this thread. Also the 4 function listed below differ in the instruction that the
compiler emits when comparing to a proper CMSIS implementation to my cmsis
crate but I think the implementations are still semantically equivalent.

  • disable_fault_irq
  • disable_irq
  • enable_fault_irq
  • enable_irq

I have left out all the DSP / SIMD32 intrinsics as they are not priority right
now. I have at least indicated on which architecture these intrinsics are
available and the list of intrinsics mentioned in ACLE.

Finally, to actually implement this I had to whitelist a few more target
features because I need more fine grained control over conditional compilation.
I checked that the implementation doesn't make LLVM error when compiling the
crate for the older architectures that rustc supports (v4T, v5TE). This is the
list of target features that need to be whitelisted:

  • aclass, i.e. ARMv7-A
  • v5te, ARMv5TE instructions
  • v6k, ARMv6K instructions
  • v6t2, ARMv6T2 instructions

The source code of arle contains comments that explain how these target
features are used to conditionally compile code. The crate level
documentation
has a summary about the 8 different ARM instruction sets
that LLVM supports (most of them are compatible as in one is a subset of the
other) and covers which instruction sets are available to built-in targets like
thumbv7m-non-eabi.

Also API docs:

(*) OK, I lied. There's actually one more minor differences:

  • ACLE uses functions like __arm_rsr for register access. Since users writing
    idiomatic Rust will write arm::__arm_rsr() the arm_ part seemed redundant
    to me so I dropped it from these functions.

@gnzlbg does this seem like a reasonable implementation / API to you?

@sekineh
Copy link

sekineh commented Sep 2, 2018

Don't we have a plan to support 'ANGEL' interface which can be used in QEMU -semihosting environment?

I'm trying to write my own using asm but if crate support is available it would be useful.
FYI, my purpose is just to quit QEMU cleanly in a CI test.

jamesmunns/irr-embedded-2018#4 (comment)

Angel Interface

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/Bgbjjgij.html

Update

Currently the crate cortex-m-semihosting provides support for the above angel interface.
/~https://github.com/rust-embedded/cortex-m-semihosting/blob/master/src/lib.rs

@gnzlbg
Copy link

gnzlbg commented Sep 3, 2018

If a non-SIMD instruction is not available for the target architecture then
the API won't be available for that architecture. ACLE says that the API should
be available but that the operation should be NOP at runtime at least in the
cases of hints. This felt unrusty to me so I went with the "if it's not going to
work then it shouldn't compile" approach.

I think this is fine, and we can always change that in a backwards compatible way to provide these as NOP if this for whatever reason turns out to be problematic.

ACLE uses functions like __arm_rsr for register access. Since users writing
idiomatic Rust will write arm::_arm_rsr() the arm part seemed redundant
to me so I dropped it from these functions

In my experience what people do is write use core::arch::arm::* directly since often you need many intrinsics and not just on. This change can be done consistently, and seems perfectly justifiable to me though, so I don't see any issues with it.


So I finally managed to review the whole crate. I think you can pretty much send it as a PR (the arle part) to stdsimd as is. I wrote some notes about the things that would have to be done, but are pretty much just nitpicks that we can resolve on the PR or during the RFC. I think this is at an state where we can just merge this, and then incrementally improve on what we have.

@parched
Copy link

parched commented Sep 3, 2018

Hi @japaric thanks for writing this RFC, I am happy to see progress being made on this front 😄. My apologies for not replying sooner, I wanted to talk to my colleagues at Arm who write the ACLE first to see if they had anything to add or any recommendations.

First some general notes

  • Arm generally prefers cross architecture portable solutions over adding intrinsics. For example, C11 atomics and in rust std::sync::atomic::spin_loop_hint is the portable YIELD.
  • ACLE were written for C and have some legacy, so should only be used as a guideline. Anything that can be improved or done better in rust should be.

Some feedback specific to the RFC

  • I think the __/__arm_ prefixes can be dropped, that's just a C-ism, they are all going in the std::arch::arm module anyway so there's no chance of conflict.
  • The barrier and hint intrinsics don't need to be marked unsafe, only the system register access ones need to be unsafe.
  • To increase the portability, don't make the barrier argument availability conditional on target, just upgrade to next strongest when not available.
  • How about using an enum for the barrier parameter, rather than the specialization of some generic parameter? This will require a match for now as Rust can't yet enforce compile time arguments, but that match should be removed by the compiler when it's inlined. And hopefully be backwards compatible with const arguments in the future. What do you think @gnzlbg?
  • yield isn't needed becuase it's already available as std::sync::atomic::spin_loop_hint.
  • Could wfi and nop be portable functions instead like above? Most architectures have equivalent instructions, and those that don't could just do nothing/compiler barrier.
  • Unfortunately rsr/wsr must take a &str parameter to allow for arbitrary "S_x_x_x_x" register names in future. For now, that means a match on the value to make a constant to pass to LLVM, but that should be fine given the limited number of named registers required for this initial M profile support. In future when compile time arguments are supported the match can be removed and all strings allowed.
  • Implementation of these intrinsics should probably use llvm.{arm,aarch64}.{hint,dmb,dsb.isb} and llvm.{read,write}_register rather than inline assembly as this gives LLVM (potentially) better understanding of what is going on.
  • I would just put all the intrinsics in std::arch::arm (gated on target in the
    future if needed) rather than duplicating them in std::arch::aarch64.
  • You probably don't need to bother with the 32-bit SIMD intrinsics for now, they can come later with the Advanced SIMD intrinsics if required, to keep the scope of the RFC small. Or does someone require them now, @paoloteti maybe?
  • I'm curious, what's your use case for the BKPT intrinsic?

@paoloteti
Copy link

@parched

Or does someone require them now, @paoloteti maybe?

dsp/simd32 are already implemented (I sent 4-5 PRs in the past). Following this RFC We need to reorganize the code a bit, so I agree to keep them in this RFC.

@gnzlbg
Copy link

gnzlbg commented Sep 4, 2018

@parched

Anything that can be improved or done better in rust should be.

I pretty much agree with everything you said, and I think there should be a crate built on stable Rust on top of std::arch that provides a safe ergonomic interface around the intrinsics.

The situation is pretty much the following: If we add ACLE intrinsics to core::arch as is without "innovating" we can pretty much stabilize those instantly with a tiny RFC, and build everything else in stable Rust 6 weeks later. If we want to design a new safe ergonomic interface we have to battle test the design on nightly, that is, wait till it gets some production use and we are sure its right (it took ~1 year to realize that the x86 intrinsics old design wasn't going to cut it). It will also need an RFC explaining and motivating the new design, comparing it to ACLE, etc. Depending on the number of intrinsics and APIs in the RFC, reviewing might take a while.

So at this point I still think that the best is to just implement ACLE as close to the C spec as possible, even if it is horrible to use, to make the RFC process go as smoothly as possible, and just build a better API on top in a separate crate on crates.io that just doesn't have to go through the process and that we can evolve over time (hopefully ARM can contribute its experience to that crate).

This might sound like we are choosing an imperfect solution for std::arch because of "politics", but as long as we can still build a "perfect" crate on top of it I'd like to think of this as optimizing to arrive at the end result much faster (instead of potentially delaying this for years) while minimizing risks.


@japaric I am going to post some of @parched comments in the PR in stdsimd so that we can discuss how to "fix" them (or whether they can be fixed there).

@japaric japaric removed this from the RC1 milestone Sep 4, 2018
@japaric
Copy link
Member Author

japaric commented Sep 4, 2018

(As per this comment #63 (comment) I'm going to remove this from the edition milestone).

@adamgreig adamgreig mentioned this pull request Nov 14, 2018
@ketsuban
Copy link

rust-console could use a way to emit a swi instruction without inline assembly; the GBA/NDS firmware uses it to provide some utility functions such as decompression and waiting for interrupts.

@eddyp
Copy link

eddyp commented Jan 9, 2019

This release of ACLE does not define how to invoke a SVC (supervisor call), BKPT (breakpoint) and other related functionality.

Please note that SVC, depending on the service call handler/application, would need to return a u32 type or something a little bit more heavy. Ideally, it should probably be better if SVC can be declared with a customizable return type.

@eddyp
Copy link

eddyp commented Jan 10, 2019

SVC, depending on the service call handler/application, would need to return a u32 type or something a little bit more heavy

The ARM AAPCS (ARM IHI 0042F) says the returning convention for a function is to use registers the following way:

The manner in which a result is returned from a function is determined by the type of that result.
For the base standard:

  • A Half-precision Floating Point Type is returned in the least significant 16 bits of r0.
  • A Fundamental Data Type that is smaller than 4 bytes is zero- or sign-extended to a word and returned in r0.
  • A word-sized Fundamental Data Type (e.g., int, float) is returned in r0.
  • A double-word sized Fundamental Data Type (e.g., long long, double and 64-bit containerized vectors) is returned in r0 and r1.
  • A 128-bit containerized vector is returned in r0-r3.
  • A Composite Type not larger than 4 bytes is returned in r0. The format is as if the result had been stored in memory at a word-aligned address and then loaded into r0 with an LDR instruction. Any bits in r0 that lie outside the bounds of the result have unspecified values.
  • A Composite Type larger than 4 bytes, or whose size cannot be determined statically by both caller and callee, is stored in memory at an address passed as an extra argument when the function was called (§5.5, rule A.4). The memory to be used for the result may be modified at any point during the function call.

So, to me it looks like the SVC intrinsic should be able to specify a return type, which, on the back end, for reasons of interoperability with C code, should follow these conventions.

Note that even if, in theory and for optimization reasons, one could extend this to other registers, that would not work since only r0-r3, r12, lr, pc and xPSR are saved on the exception stack frame, except, maybe abusing r12, but that's probably too much trouble for little gain.

@Disasm
Copy link
Member

Disasm commented Feb 26, 2019

@japaric Does this Pre-RFC still need some work to do?

@gnzlbg
Copy link

gnzlbg commented Feb 26, 2019

This might need to be updated with what is implemented on nightly. Also for an RFC, this goes into a lot of detail about each of the APIs. If one checks the std::arch RFC, when the x86 intrinsics were stabilized, the RFC just specified from which headers they come from, how were the C intrinsics mapped to Rust (which programmatic rules were used), how is their API validated, how are they tested etc. IIRC some critical intrinsics that might interact weirdly with the memory model were then covered (e.g. non-temporal load/stores), but 99% of the intrinsics were not critical. So something like this might be enough for ACLE as well. Might be worth it to mention the ACLE / CMSIS distinction, target features involved, which can be detected at run-time, which are compile-time only, etc, since the RFC would also need to stabilize those.

@japaric
Copy link
Member Author

japaric commented Feb 27, 2019

We are unlikely to propose this RFC as it is so I'm going to close this PR. Let's continue the discussion on stabilization in #63.

@japaric japaric closed this Feb 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.