Skip to content

Instruction Set

Jeff Bush edited this page Oct 31, 2018 · 19 revisions

Introduction

This processor uses a custom load/store instruction set. It has features to support parallel compute kernels like a GPU, including a wide vector unit and hardware multithreading, but also has general purpose instructions to enable more flexible use cases. Unlike most modern processors, which have separate sets of vector and scalar arithmetic instructions, this uses a unified set of operations that can take vectors, scalars, or mix of them as operands.

This architecture has 32 scalar registers and 32 vector registers. Vector registers have 16 lanes. Each scalar register or vector lane is 32-bits wide and can contain a single precision floating point value or an integer. How they are interpreted is based on the instruction. Each hardware thread also has a 32-bit program counter register, but it is not accessible by software.

All instructions that take vectors support predication, which allows them to control which lanes of the vector the instruction updates. Compute kernels that use SIMD parallelism can be use this to support branch divergence, where vector lanes take different branches.

Instructions are 32-bit fixed length and must be four byte aligned in memory. There are four main instruction types: arithmetic, memory access, branch, and cache control, which the next sections describe in more detail.

Arithmetic

Arithmetic instructions operate only on register values and do not support memory operands. An arithmetic instruction specifies a destination register and one or two source registers. For example, to add an integer in scalar register 2 to scalar register 3 and store the result in register 1:

add_i s1, s2, s3

This architecture does not have separate integer and floating point register files. Depending on the operation, an instruction may treat register values as floating point values, two's complement signed integers, or unsigned integers. Floating point values use the IEEE 754 "binary32" format. Vector operations compute each of the 16 lanes independently. Vector registers have a 'v' prefix. This adds vector register 2 to 3 and puts the result in register 1:

add_i v1, v2, v3

An instruction may mix scalar and vector registers. The scalar value is duplicated to all 16 lanes.

add_i v4, v5, s2

A vector instruction may specify a mask register, which controls which lanes in the destination register it updates. When present, it is the second parameter in assembly form. It is one of the scalar registers.

add_i_mask v1, s2, v2, v3

Each bit in the low 16 bits of the mask corresponds to one lane in the vector register. The least significant bit corresponds to lane 0. A one bit in the mask writes the value to the lane. A zero bit does not: the lane retains the old value.

For example:

instruction-set/predication.png

A vector comparison sets each bit in the low 16 bits of the destination corresponding to result of the comparison of the associated lanes, with the same format as a masked instruction. For example:

cmpgt_i s1, v1, v2

Programs can use the results of comparisons as masks for parallel execution. For example, say we have this high level construct:

if (a > b)
   b = a - c
else
   a = b - c

The following runs 16 instances of this in parallel (with each vector lane representing a single instance of the program):

cmpgt_i    s1, v1, v2      # s1 stores the result of the comparison v1 > v2
mul_i_mask v3, s1, v2, v4  # The first part of if clause: only cases where compare is true
xor        s1, s1, -1      # Invert comparison
mul_i_mask v2, s1, v3, v5  # Else clause

This runs both clauses of the if construct, using masks to control which lanes it writes results to. Programs can nest conditionals by using multiple mask registers.

A scalar comparison sets the destination register to 0xffff if the comparison matches, 0 otherwise.

Register/Register Encoding

instruction-set/fmtr.png
  • fmt - Describes operand types and mask usage. Using a format not in this list raises an invalid instruction fault.

    Value

    dest

    src1

    src2

    Masked

    000

    Scalar

    Scalar

    Scalar

    N

    001

    Vector

    Vector

    Scalar

    N

    010

    Vector

    Vector

    Scalar

    Y

    100

    Vector

    Vector

    Vector

    N

    101

    Vector

    Vector

    Vector

    Y

  • mask - When the fmt field indicates a mask is present, the processor fetches the value from this register. The low 16 bits of the register specify which lanes to update. A comparison instruction does not use a mask register.

  • dest - The instruction stores its result in this register. The 'fmt' field determines the type (vector or scalar), which is the same as op1 type, except for a comparison instruction. A comparison always has a scalar destination.

  • src1 - The instruction fetches a value from this register and uses it as the first source operand. The fmt field indicates whether this is a vector or scalar register.

  • src2 - The instruction fetches a value from this register and uses it as the second source operand. Unary operations use src2 as their operand.

  • opcode - Arithmetic operation. The implementation does not raise a fault if an instruction has an invalid opcode

    Opcode

    Mnemonic

    Description

    000000

    or

    Bitwise Logical Or

    000001

    and

    Bitwise logical And

    000010

    syscall

    Raise system call trap. 5

    000011

    xor

    Bitwise logical exclusive or

    000101

    add_i

    Integer addition

    000110

    sub_i

    Integer subtraction

    000111

    mull_i

    Low 32 bits of integer multiplication

    001000

    mulh_u

    Upper 32 bits of unsigned integer multiplication

    001001

    ashr

    Shift right (signed) 1

    001010

    shr

    Shift right (unsigned) 1

    001011

    shl

    Shift left 1

    001100

    clz

    Count leading zeros 6 2

    001101

    shuffle

    Shuffle vector elements. The second argument is a vector where each lane is an index into the first vector.

    001110

    ctz

    Count trailing zeros 6 2

    001111

    move

    Copy. Uses op2 as the result. Supports immediate and register transfers.

    010000

    cmpeq_i

    Compare and set if integers equal

    010001

    cmpne_i

    Integer not equal

    010010

    cmpgt_i

    Integer greater (signed)

    010011

    cmpge_i

    Integer greater or equal (signed)

    010100

    cmplt_i

    Integer less than (signed)

    010101

    cmple_i

    Integer less than or equal (signed)

    010110

    cmpgt_u

    Integer greater

    010111

    cmpge_u

    Integer greater or equal

    011000

    cmplt_u

    Integer less than

    011001

    cmple_u

    Integer less than or equal

    011010

    getlane

    Copy vector lane into scalar register

    011011

    ftoi

    Convert floating point value to signed integer 6

    011100

    reciprocal

    Ra ~= 1.0 / rB. Floating point reciprocal estimate with 6 bits of precision. 6

    011101

    sext8

    Sign extend an 8 bit value to 32 bits 6

    011110

    sext16

    Sign extend a 16 bit value to 32 bits 6

    011111

    mulh_i

    Upper 32 bits of integer multiplication, sign extended values

    100000

    add_f

    Floating point add 3

    100001

    sub_f

    Floating point subtract 3

    100010

    mul_f

    Floating point multiply 3

    101010

    itof

    Convert signed integer to floating point 6

    101100

    cmpgt_f

    Compare floating point numbers and and set if greater than 4

    101101

    cmpge_f

    Floating point greater or equal 4

    101110

    cmplt_f

    Floating point less than 4

    101111

    cmple_f

    Floating point less than or equal 4

    110000

    cmpeq_f

    Floating point equal 4

    110001

    cmpne_f

    Floating point not equal 4

    111110

    break

    Raise breakpoint trap. Does not take any operands

  1. Only uses the low 5 bits of the shift amount, ignoring high bits
  2. If the the parameter is zero, this returns 32.
  3. Floating point operations always use round toward nearest, tie to even. If either operand is NaN, the result is always encoded as 0x7fffffff. IEEE 754 recommends (but does not require) propagating the NaN operand (the significand can be any non-zero value), but it is simpler to use a consistent value.
  4. Floating point comparisons return false if either operand is NaN.
  5. System calls must use the immediate encoding, format 0. The immediate operand represents a syscall number, which is copied into the 'syscall index' control register when the instruction executes.
  6. Unary Operation

Immediate Encoding

This format encodes a small constant integer in the instruction as the second parameter.

add_i s1, s2, 27
add_i v1, v2, 13
instruction-set/fmti.png
  • fmt - describes operand types.

    Value

    dest

    src1

    Masked

    Encoding

    00

    Scalar

    Scalar

    N

    B

    01

    Vector

    Vector

    N

    B

    10

    Scalar

    n/a

    N

    C

    11

    Vector

    Vector

    Y

    A

  • mask, dest, src1 - Same as register arithmetic, described above

  • imm - Immediate operand. Processor sign extends this value and uses it as the second operand. If this instruction does not have a mask, it uses the alternative format with a larger immediate field.

  • opcode - Immediate instructions share the same opcodes as register arithmetic instructions, but only encode the low 5 bits.

The 'move' instruction can load a small constant into a register:

move s0, 17

'movehi' is a special instruction that uses format 2 with the 'move' opcode. It copies the 19 bit immediate value to the most significant 19 bits of the destination register and sets the least significant 13 bits of the destination register to zero. It can be followed by an 'or' instruction to copy a 32-bit constant into a register.

The assembler expands the 'li' pseudo instruction to a movehi/or pair that copies a 32-bit constant into a register. For example the assembler expands this:

li s0, 0x12345678

To this:

movehi s0, 0x91A2
or s0, s0, 0x1678

The lea pseudo instruction moves the absolute address of a label into a register:

lea s0, my_label

The assembler converts this to a movehi/or combination like 'li', but also emits ELF relocations so the linker/loader adjusts the values depending on where the label ends up in memory.

The nop (no operation) instruction is encoded as 00000000. This disassembles to 'or s0, s0, 0'--which has no effect--but hardware detects this specific encoding and can optimize handling of it.

Memory access

A memory instruction moves a value between a register and a memory location. It computes the memory address using a base register and a constant offset. Scalar memory operations can access 8, 16, or 32 bit values. The base address and offset are always multiples of bytes regardless of the access size, but these instructions raise an alignment fault if an access address is not aligned on its size. Memory accesses use little-endian byte order. A load instruction that is smaller than 32 bits and has the 's' suffix sign extends the result. An instructions with the 'u' suffix zero extends it.

This is the syntax:

load_32 s1, (s2)   # load the value from address s2 into s1
load_32 s1, 12(s2) # load the value from address s2 + 12 into s1
load_s8 s1, (s2)   # Load and sign extend byte from addr s2 into s1
load_u16 s1, (s2)  # Load and zero extend short from addr s2 into s1
store_32 s1, (s2)  # Store the 32-bit value from s1 into addr s2
store_8 s1, (s2)   # Store the byte from s1 into addr s2

Memory accesses in the physical address range 0xffff0000 to 0xffffffff perform memory mapped device IO. These must be 32-bit scalar load/stores, aligned on a 32-bit boundary as normal loads. If these are not, behavior is undefined.

There are two vector addressing modes. A block load uses a scalar base pointer and loads 16 contiguous 32-bit values. It adds an optional offset to the base pointer. If the computed address is not aligned on a multiple of the vector size (64 bytes), the instruction raises an alignment fault. For example:

load_v v1, 4(s6)

Which is equivalent to this pseudocode (assuming bit 0 of the mask is the least significant bit)

for lane = 0 to 15
    if mask[lane] or instruction not masked
        dest[lane] = memory[pointerReg + lane - 4 + immediate offset]

A scatter/gather memory access uses each lane of a source vector as a pointer for the same lane in the src/dest register. It may add an immediate offset to each one. It only loads 32-bit values. If a computed lane address is not aligned on a 32-bit boundary, the instruction raises an alignment fault. It may also specify a mask:

load_gath v1, 12(v2)
load_gath v1, s1, 12(v2)

It is equivalent to:

for lane = 0 to 15
    if mask[lane] or instruction not masked
         dest[lane] = memory[pointerReg[lane] + immediate offset]

Encoding

There are two formats of memory access instructions: one with a mask, one without. The op field encodes whether the mask field (which is only used for vector operations) is present. An instructions that has a mask field has the _masked suffix on its mnemonic.

instruction-set/fmtm.png
  • L - If this is one, load from memory to a register, otherwise store from a register to memory. For control register accesses, a load indicates a transfer from a control register to a general purpose register, and vice versa.
  • dest/src – For loads, write the fetched value into this register. For stores, store this value from this register to memory. A synchronized store also updates this register to indicate success.
  • ptr – For control register accesses, this is the index of the control register. Other memory transfers use this register value to compute addresses based on the access type (described below).
  • mask - Controls which lanes to transfer. Lanes disabled via the mask do not generate memory access cycles.
  • offset – Offset in bytes to add to pointer (in a manner dependent on addressing mode). The processor sign extends this before addition, allowing negative and positive offsets.
  • op – Encodes the access type. Using an op not in this list raises an invalid instruction fault.

Atomic Memory Operations

Special load/store instructions enable multiprocessor synchronization (Some other architectures refer to these as load linked/store conditional). Programs can use these to implement atomic arithmetic and spinlocks. When the processor executes a synchronized load, the L2 cache records the address. The following synchronized store checks if any other threads have written to that cache line since the load. If one has, the instruction doesn't update the cache and writes 0 to the 'source' register. Otherwise, it updates the cache and writes 1 to the source register. The mnemonics have a _sync suffix:

load_sync s0, 12(s1)
store_sync s3, (s1)

In this example, the store_sync instruction updates s3 as a side effect, depending on success. Synchronized accesses are 32-bits wide.

Here is an example of code that performs sync_fetch_and_add. The address of the memory location is in s0. When this returns, the register s3 contains the return value (previous value before the increment):

loop:  load_sync s1, (s0)     # Load current value
       add_i s2, s1, 1        # Compute incremented value
       move s3, s1            # Save previous value
       store_sync s2, (s0)    # Try to write back increment value
       bz s2, loop            # If store_sync failed, retry

Consistency Model

This system uses a relaxed memory consistency model. Although it guarantees read-to-read and write-to-write ordering between threads, it does not guarantee write-to-read ordering. This is sometimes referred to as Processor Consistency.

For example, given the following code:

volatile int a = 0;
volatile int b = 0;

void core1()
{
    a = 1;
    printf("b = %d", b);
}

int core2()
{
    b = 1;
    printf("a = %d", a);
}

At first glance, it might not seem possible for this to output “a = 0 b = 0,” because at least one assignment must occur before either of the prints. But this can happen because this design can reorder stores after loads.

Reordering only occurs *between- hardware threads. We guarantee that threads always see *their own- stores in order. For example, the following program would never print “a = 0.”

int a = 0;

void core1()
{
    a = 1;
    printf("a = %d", a);
}

When needed, software can force write-to-read ordering using the membar instruction, which the section on cache control below describes. The **sync_synchronize intrinsic emits this from C. For example, the following code always prints at least one of the variables as 1:

volatile int a = 0;
volatile int b = 0;

void core1()
{
    a = 1;
    **sync_synchronize();
    printf("b = %d", b);
}

int core2()
{
    b = 1;
    **sync_synchronize();
    printf("a = %d", a);
}

Control Registers

Although these instructions use the memory instruction encoding, they do not access memory. The getcr instruction transfers a value from a control registers to a general purpose register. The setcr instruction does the opposite.

setcr s10, 12
getcr s8, 30

In the table below, registers marked 'thread' have a separate instance per thread. Those marked 'shared' have a single instance per core that is visible to all thread (but not by threads on other cores).

Accessing any control register (read or write) when the processor is not in supervisor mode raises a privileged operation fault. Although it might seem useful to allow some of these registers to be accessible from user mode, doing so would violate Popek & Goldberg virtualization requirements because these are "behavior sensitive instructions."

Accessing an index not in this list has undefined behavior.

Index Read/Write Shared/Per-Thread Description
0 r thread Global Thread ID 1
1 rw shared Exception Handler Address:sup:2
2 rw thread Trap PC:sup:2
3 r thread Trap Cause 2
4 rw thread Processor flags 4
5 r thread Access address of last memory trap 2
6 r shared Count of processor cycles since reset
7 rw shared TLB miss handler address (physical address) 3
8 rw thread Saved flags value 2
9 rw thread Current address space ID 3
10 rw thread Current page directory base 6
11 rw thread Scratchpad register 0 5
12 rw thread Scratchpad register 1
13 rw thread Subcycle of instruction that a trap interrupted (for scatter/gather memory access)
14 rw thread Interrupt Enable 2
15 w thread Interrupt Ack 2
16 r thread Interrupt Pending 2
17 rw shared Interrupt Trigger mode 2
18 rw shared Bi-directional JTAG data transfer
19 r thread Index of last syscall instruction.
  1. Bits 31-2 are the core index, bits 1-0 are the thread ID (assuming four threads per core).

  2. See Trap/Interrupt Handling section.

  3. See Virtual Memory section.

  4. Flags

    Bit

    Usage

    0

    Interrupt enable. The processor does not dispatch interrupts on the thread while this is 0.

    1

    MMU enable. When this is one, the TLB performs address translations for the current thread. If not, physical addresses are the same as virtual addresses.

    2

    Supervisor mode. This enables access to memory pages with the supervisor bit set and system instructions described elsewhere in this document (eret, setcr, etc.)

  5. Available to stash general purpose register values temporarily during the interrupt service routine.

  6. Since TLB misses are handled by software, this is just a convenient way to pass the page directory pointer to the TLB miss handler routine. However, if a hardware page walker were implemented, it would read from this register.

Branch

A branch instruction causes the program counter to jump to another address. A conditional branch uses the value in source scalar register to determine whether to take the branch.

b label
b s0
bnz s0, label
bz s0, label
call label
call s0
eret

Encoding

Executing a branch instruction with an operation that is not in this list raises an invalid instruction fault.

'ret' is a pseudo instruction that expands to a indirect branch to the return address register (b ra).

Cache Control

itlbinsert s0, s1    (where s0 is the virtual address and s1 is the TLB contents)
dtlbinsert s0, s1
tlbinval s0
tlbinvalall
dinvalidate s0
iinvalidate s0
dflush s0
membar

Encoding

instruction-set/fmtc.png
  • ptr – Base address, scalar register. For TLB insertion, this is the virtual address.
  • tlb entry - Index of register that contains 32 bit TLB contents (described in the the

Virtual Memory section) for dtlb/itlbinsert - op - cache operation. mode in the table below indicates when it is legal to use this instruction. S is supervisor only, U/S is user or supervisor mode. If the program attempts to use a supervisor only instruction in user mode, it raises a privileged operation fault.

  1. dflush, dinvalidate, and iinvalidate do not wait for the operation to complete, but a membar instruction issued after one of these instructions does.
  2. A TLB instruction only updates entries on the core it is issued on. An invalidate instruction only affects the entry for the current ASID, unless the page at the address is global, in which case it invalidates it for all ASIDs. It can take a few cycles for the new translation to take effect because of pipeline delay, but the updates are not queued and membar does not wait for the update to complete. The Virtual Memory section describes this in more detail.

Other notes - If a program dynamically generates code (for example, JIT compilation), it must execute the iinvalidate instruction to reload instruction cache lines, as the processor does not do this automatically as a side effect of memory stores. - A program should execute dinvalidate to reload data that an external device or processor DMAed into system memory. - If a processor must execute code DMAed or loaded into memory by an external device or host processor, it must execute dinvalidate (to reload the L2 cache) followed by iinvalidate (to reload the icache).

Trap/Interrupt Handling

A trap may occur for several reasons:

  • A instruction faults by performing an invalid action, like an unaligned memory access.
  • An interrupt occurs when an external device requests access. These occur at an instruction boundary. The processor only dispatches an interrupt if the thread has enabled it in the processor flags control register.
  • The processor executes a syscall or breakpoint instruction. These switch from user to supervisor mode and jump into kernel code.

Traps are precise in this architecture: the processor guarantees that all instructions before the instruction at which the trap occurred have completed, and all instruction after that instruction have not. This allows the thread to resume properly after the trap handler software routine returns.

When a trap occurs, the processor performs the following actions:

  1. Copy the contents of the processor flags control register into the saved flags control register
  2. Copy the contents of the current program counter into the trap program counter control register
  3. For a multi-cycle instruction like scatter/gather memory accesses, copy the current subcycle into the subcycle control register.
  4. Disable interrupts
  5. Set the value of the trap cause control register
  6. Set the mode to supervisor in the processor flags control register.
  7. Set the program counter to the address in the exception handler address control register

The processor handles a TLB miss exception a bit differently, described in the Virtual Memory section

The trap cause control register has the following format:

instruction-set/trap-cause.png
  • C Cache type. Only valid for memory access traps (IDs 5-10 below)

    • 0 - Instruction Cache
    • 1 - Data Cache
  • S Store. Only valid for memory access traps.

    • 0 - load operation caused trap
    • 1 - store operation caused trap
  • Type of trap:

    ID

    Description

    0

    No trap (reset)

    1

    Illegal instruction

    2

    Privileged operation in user mode (setcr, eret, itlbinsert, dtlbinsert)

    3

    External interrupt

    4

    syscall (system call) instruction executed

    5

    Unaligned memory access

    6

    Page fault (page not present)

    7

    TLB miss

    8

    Write to read only page

    9

    Attempt to access supervisor page from user process

    10

    Attempt to fetch instruction from a page without the execute bit set

    11

    breakpoint instruction executed

When the trap handler completes, it can use the eret instruction to return. This performs the following actions:

  1. Copy the contents of the saved flags control register back to the processor flags control register (which may switch out of supervisor mode and re-enable interrupts).
  2. If this is a multi-cycle instruction, restore the subcycle from the subcycle control register.
  3. Jump to the address in the trap PC control register.

Nyuzi supports two levels of nested traps, which allows handling a TLB miss while in the middle of another trap service routine. When a trap occurs, it saves the following trap specific control registers. The eret instruction restores them:

Index Description
2 Trap PC
3 Reason last trap occurred
5 Access address of last memory trap
8 Saved flags value
11 Scratchpad register 0
12 Scratchpad register 1
13 Subcycle of instruction that a trap interrupted

There are four control registers (14-17) that configure external interrupt handling. In each register, there is one bit per external interrupt, with bit 0 corresponding to interrupt source 0.

  • Interrupt Mask: Configure which external interrupt sources cause an interrupt for the each thread. A 1 bit enables.
  • Interrupt Ack: This is only valid for edge triggered interrupts. For each set bit, this resets the latched interrupt for the corresponding source. An ISR must do this before returning for an edge triggered interrupt.
  • Interrupt Pending: Indicates which interrupts are pending. The software routine can iterate through the bits and handle the interrupts in turn.
  • Interrupt Trigger Mode: A 1 bit indicates the interrupt is level triggered and is be raised if the interrupt input is in the high state. A 0 bit indicates the interrupt is edge triggered and is latched when the signal transitions from low to high.

Virtual Memory

Each thread can independently enable virtual address translation for itself setting the MMU enable bit in the processor flags control register. When the processor boots, translation is disabled for all threads.

Cores have separate translation lookaside buffers (TLBs) for the L1 instruction and data caches. Each core's TLBs are shared by all threads on it, but independent of other cores. Each TLB entry includes:

  • Physical page number
  • A set of flag bits
  • An address space ID

Flag bits include:

  • G: Global If this is set, ignore the address space ID when performing TLB lookup. The page appears in all address spaces (see below).
  • S: Supervisor Thread can only access page if it is in supervisor mode. If it accesses this page when not in supervisor mode, it raises a supervisor memory access fault.
  • X: Executable If a thread attempts to execute an instruction from a page that does not have this bit set, it raises an execute fault.
  • W: Writeable If this is not set on a data TLB entry and the thread attempts to write, it raises a write protect fault.
  • P: Present If this is not set in the TLB entry, accessing the page raises a page fault.

TLBs are managed by software. If a thread has enabled the MMU and accesses an address that doesn't have a valid TLB entry, it raises a TLB miss trap. An interrupt service routine reads the page table and inserts an entry into the TLB using the itlbinsert and dtlbinsert instructions, which use the following format for TLB contents:

instruction-set/tlb-format.png

A TLB miss behaves like a normal trap, except:

  1. It disables MMU translation for the trapping thread.
  2. Instead of jumping to the normal trap handler (which is a virtual address), it jumps to the value contained in the TLB trap handler control register, which is a physical address.

The TLB miss handler runs with the MMU disabled to avoid a nested TLB miss occurring in the middle of it, which would cause an infinite loop. This also allows it to access page tables without needing to map them into virtual memory.

Address Space Identifiers

Each entry in the TLB has an address space identifier (ASID). Each thread also stores an ASID in a control register. The hardware uses these in two ways:

  • When looking up an address in the TLB, if the global flag is not set on the TLB entry, the processor only matches it if its ASID matches the ASID control register.
  • The itlbinsert and dtlbinsert instructions copy the ASID from the control register into the TLB entry.

The ASID has two purposes:

  1. Allow threads on the same core to run in different address spaces. Since all threads share the TLB, the same virtual address may have different physical mappings for each thread.
  2. As an optimization to allow context switches without having to flush the entire TLB.

The ASID is 8 bits, which allows up to 256 virtual address spaces. If there are more than that, an operating system may dynamically reassign them, but must execute a full TLB flush instruction (tlbinvalall) when switching address spaces to invalidate other mappings for that ASID.

Clone this wiki locally