Instruction Set

Introduction

This processor uses a custom load/store instruction set. It has features to support parallel compute kernels like a GPU, including a wide vector unit and hardware multithreading, but also has general purpose instructions to enable more flexible use cases. Unlike most modern processors, which have separate sets of vector and scalar arithmetic instructions, this one uses a unified set of operations that can take vectors, scalars, or a mix of them as operands.

This architecture has 32 scalar registers and 32 vector registers. Vector registers have 16 lanes. Each scalar register or vector lane is 32 bits wide and can contain a single precision floating point value or an integer. The instruction determines how the value is interpreted. Each hardware thread also has a 32-bit program counter register, but it is not accessible by software.

All instructions that take vectors support predication, which allows them to control which lanes of the vector the instruction updates. Compute kernels that use SIMD parallelism can use this to support branch divergence, where vector lanes take different branches.

Instructions are 32-bit fixed length and must be four byte aligned in memory. There are four main instruction types: arithmetic, memory access, branch, and cache control, which the next sections describe in more detail.

Arithmetic

Arithmetic instructions operate only on register values and do not support memory operands. An arithmetic instruction specifies a destination register and one or two source registers. For example, to add an integer in scalar register 2 to scalar register 3 and store the result in register 1:

add_i s1, s2, s3

This architecture does not have separate integer and floating point register files. Depending on the operation, an instruction may treat register values as floating point values, two's complement signed integers, or unsigned integers. Floating point values use the IEEE 754 "binary32" format. Vector operations compute each of the 16 lanes independently. Vector registers have a 'v' prefix. This adds vector register 2 to 3 and puts the result in register 1:

add_i v1, v2, v3

An instruction may mix scalar and vector registers. The scalar value is duplicated to all 16 lanes.

add_i v4, v5, s2

A vector instruction may specify a mask register, which controls which lanes in the destination register it updates. When present, it is the second parameter in assembly form. It is one of the scalar registers.

add_i_mask v1, s2, v2, v3

Each bit in the low 16 bits of the mask corresponds to one lane in the vector register. The least significant bit corresponds to lane 0. A one bit in the mask writes the value to the lane. A zero bit does not: the lane retains the old value.

For example, a mask value of 0x0005 updates only lanes 0 and 2. The other 14 lanes keep their previous values.
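
The lane update rule can be modeled in C. This is a minimal sketch rather than a description of the hardware; the names NUM_LANES and add_i_mask are chosen here for illustration, and a vector register is modeled as a plain array.

```c
#include <stdint.h>

#define NUM_LANES 16

/* Model of a masked vector add. A lane of dest is written only when the
   corresponding mask bit is 1; all other lanes keep their old value. */
void add_i_mask(uint32_t dest[NUM_LANES], uint32_t mask,
                const uint32_t src1[NUM_LANES],
                const uint32_t src2[NUM_LANES])
{
    for (int lane = 0; lane < NUM_LANES; lane++) {
        if ((mask >> lane) & 1u) /* bit 0 corresponds to lane 0 */
            dest[lane] = src1[lane] + src2[lane];
    }
}
```

Only the low 16 bits of the mask are examined, so any higher bits are ignored.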

A vector comparison sets each of the low 16 bits of the destination according to the result of comparing the associated lanes, using the same bit layout as a mask. For example:

cmpgt_i s1, v1, v2

Programs can use the results of comparisons as masks for parallel execution. For example, say we have this high level construct:

if (a > b)
   b = a - c
else
   a = b - c

The following runs 16 instances of this in parallel (with each vector lane representing a single instance of the program):

cmpgt_i    s1, v1, v2      # s1 = per-lane result of a > b (a in v1, b in v2, c in v3)
sub_i_mask v2, s1, v1, v3  # If clause: b = a - c, only in lanes where the compare is true
xor        s1, s1, -1      # Invert comparison
sub_i_mask v1, s1, v2, v3  # Else clause: a = b - c, in the remaining lanes

This runs both clauses of the if construct, using masks to control which lanes it writes results to. Programs can nest conditionals by using multiple mask registers.
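
The same idea can be checked on a host machine with a small C model. This is a sketch under assumptions: the function names are invented for illustration, vector registers are modeled as arrays, and the subtraction wraps on overflow as 32-bit hardware arithmetic would.

```c
#include <stdint.h>

#define NUM_LANES 16

/* Vector signed compare: bit N of the result is set when a[N] > b[N]. */
uint32_t cmpgt_i(const int32_t a[NUM_LANES], const int32_t b[NUM_LANES])
{
    uint32_t mask = 0;
    for (int lane = 0; lane < NUM_LANES; lane++)
        if (a[lane] > b[lane])
            mask |= 1u << lane;
    return mask;
}

/* Masked subtract: only lanes whose mask bit is 1 are written. */
void sub_i_mask(int32_t dest[NUM_LANES], uint32_t mask,
                const int32_t src1[NUM_LANES], const int32_t src2[NUM_LANES])
{
    for (int lane = 0; lane < NUM_LANES; lane++)
        if ((mask >> lane) & 1u)
            dest[lane] = (int32_t)((uint32_t)src1[lane] - (uint32_t)src2[lane]);
}

/* Runs "if (a > b) b = a - c; else a = b - c;" on all 16 lanes at once. */
void run_if_else(int32_t a[NUM_LANES], int32_t b[NUM_LANES],
                 const int32_t c[NUM_LANES])
{
    uint32_t mask = cmpgt_i(a, b);
    sub_i_mask(b, mask, a, c); /* if clause, lanes where a > b */
    mask ^= 0xffffffffu;       /* invert the comparison */
    sub_i_mask(a, mask, b, c); /* else clause, remaining lanes */
}
```

The else clause reads b only in lanes the if clause did not write, so executing both clauses back to back gives the same result as a per-lane branch.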

A scalar comparison sets the destination register to 0xffff if the comparison matches, 0 otherwise.

Register/Register Encoding

fmt - Describes operand types and mask usage. Using a format not in this list raises an invalid instruction fault.

Value	dest	src1	src2	Masked
000	Scalar	Scalar	Scalar	N
001	Vector	Vector	Scalar	N
010	Vector	Vector	Scalar	Y
100	Vector	Vector	Vector	N
101	Vector	Vector	Vector	Y

mask - When the fmt field indicates a mask is present, the processor fetches the value from this register. The low 16 bits of the register specify which lanes to update. A comparison instruction does not use a mask register.
dest - The instruction stores its result in this register. The 'fmt' field determines the type (vector or scalar), which is the same as the src1 type, except for a comparison instruction. A comparison always has a scalar destination.
src1 - The instruction fetches a value from this register and uses it as the first source operand. The fmt field indicates whether this is a vector or scalar register.
src2 - The instruction fetches a value from this register and uses it as the second source operand. Unary operations use src2 as their operand.

opcode - Arithmetic operation. The implementation does not raise a fault if an instruction has an invalid opcode.

Opcode	Mnemonic	Description
000000	or	Bitwise Logical Or
000001	and	Bitwise logical And
000010	syscall	Raise system call trap. ⁵
000011	xor	Bitwise logical exclusive or
000101	add_i	Integer addition
000110	sub_i	Integer subtraction
000111	mull_i	Low 32 bits of integer multiplication
001000	mulh_u	Upper 32 bits of unsigned integer multiplication
001001	ashr	Shift right (signed) ¹
001010	shr	Shift right (unsigned) ¹
001011	shl	Shift left ¹
001100	clz	Count leading zeros ⁶ ²
001101	shuffle	Shuffle vector elements. The second argument is a vector where each lane is an index into the first vector.
001110	ctz	Count trailing zeros ⁶ ²
001111	move	Copy. Uses src2 as the result. Supports immediate and register transfers.
010000	cmpeq_i	Compare and set if integers equal
010001	cmpne_i	Integer not equal
010010	cmpgt_i	Integer greater (signed)
010011	cmpge_i	Integer greater or equal (signed)
010100	cmplt_i	Integer less than (signed)
010101	cmple_i	Integer less than or equal (signed)
010110	cmpgt_u	Integer greater (unsigned)
010111	cmpge_u	Integer greater or equal (unsigned)
011000	cmplt_u	Integer less than (unsigned)
011001	cmple_u	Integer less than or equal (unsigned)
011010	getlane	Copy vector lane into scalar register
011011	ftoi	Convert floating point value to signed integer ⁶
011100	reciprocal	Ra ~= 1.0 / rB. Floating point reciprocal estimate with 6 bits of precision. ⁶
011101	sext8	Sign extend an 8 bit value to 32 bits ⁶
011110	sext16	Sign extend a 16 bit value to 32 bits ⁶
011111	mulh_i	Upper 32 bits of integer multiplication, sign extended values
100000	add_f	Floating point add ³
100001	sub_f	Floating point subtract ³
100010	mul_f	Floating point multiply ³
101010	itof	Convert signed integer to floating point ⁶
101100	cmpgt_f	Compare floating point numbers and set if greater than ⁴
101101	cmpge_f	Floating point greater or equal ⁴
101110	cmplt_f	Floating point less than ⁴
101111	cmple_f	Floating point less than or equal ⁴
110000	cmpeq_f	Floating point equal ⁴
110001	cmpne_f	Floating point not equal ⁴
111110	break	Raise breakpoint trap. Does not take any operands

¹ Only uses the low 5 bits of the shift amount, ignoring high bits.
² If the parameter is zero, this returns 32.
³ Floating point operations always use round toward nearest, tie to even. If either operand is NaN, the result is always encoded as 0x7fffffff. IEEE 754 recommends (but does not require) propagating the NaN operand (the significand can be any non-zero value), but it is simpler to use a consistent value.
⁴ Floating point comparisons return false if either operand is NaN.
⁵ System calls must use the immediate encoding, format 0. The immediate operand represents a syscall number, which is copied into the 'syscall index' control register when the instruction executes.
⁶ Unary operation.
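
The notes on shift amounts and zero inputs can be expressed as a short C model. This is an illustrative sketch: the function names mirror the mnemonics but are otherwise hypothetical, and the code models the documented results, not how the hardware computes them.

```c
#include <stdint.h>

/* shl: only the low 5 bits of the shift amount are used. */
uint32_t shl(uint32_t value, uint32_t amount)
{
    return value << (amount & 31u);
}

/* ashr: signed shift right, filling vacated bits with the sign bit. */
int32_t ashr(int32_t value, uint32_t amount)
{
    uint32_t n = amount & 31u;
    uint32_t bits = (uint32_t)value >> n;
    if (value < 0 && n != 0)
        bits |= ~(0xffffffffu >> n); /* replicate the sign bit */
    return (int32_t)bits;
}

/* clz: count leading zeros; a zero input returns 32. */
uint32_t clz(uint32_t value)
{
    uint32_t count = 0;
    while (count < 32 && !(value & (0x80000000u >> count)))
        count++;
    return count;
}

/* ctz: count trailing zeros; a zero input returns 32. */
uint32_t ctz(uint32_t value)
{
    uint32_t count = 0;
    while (count < 32 && !(value & (1u << count)))
        count++;
    return count;
}
```

Because the shift amount is masked, a shift by 33 behaves like a shift by 1.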

Immediate Encoding

This format encodes a small constant integer in the instruction as the second parameter.

add_i s1, s2, 27
add_i v1, v2, 13

fmt - describes operand types.

Value	dest	src1	Masked	Encoding
00	Scalar	Scalar	N	B
01	Vector	Vector	N	B
10	Scalar	n/a	N	C
11	Vector	Vector	Y	A

mask, dest, src1 - Same as register arithmetic, described above
imm - Immediate operand. The processor sign extends this value and uses it as the second operand. If the instruction does not have a mask, it uses an alternative format with a larger immediate field.
opcode - Immediate instructions share the same opcodes as register arithmetic instructions, but only encode the low 5 bits.

The 'move' instruction can load a small constant into a register:

move s0, 17

'movehi' is a special instruction that uses format 2 with the 'move' opcode. It copies the 19 bit immediate value to the most significant 19 bits of the destination register and sets the least significant 13 bits of the destination register to zero. It can be followed by an 'or' instruction to copy a 32-bit constant into a register.

The assembler expands the 'li' pseudo instruction to a movehi/or pair that copies a 32-bit constant into a register. For example the assembler expands this:

li s0, 0x12345678

To this:

movehi s0, 0x91A2
or s0, s0, 0x1678
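
The split the assembler performs can be sketched in C. The type and function names here are hypothetical, not part of the real assembler. The sketch also assumes that the immediate field of the 'or' instruction is wide enough to hold a 13-bit value without sign extension turning it negative.

```c
#include <stdint.h>

/* Hypothetical container for the two immediates of a movehi/or pair. */
typedef struct {
    uint32_t movehi_imm; /* upper 19 bits of the constant */
    uint32_t or_imm;     /* lower 13 bits of the constant */
} li_pair;

/* Split a 32-bit constant the way the li expansion does. */
li_pair split_li(uint32_t value)
{
    li_pair pair;
    pair.movehi_imm = value >> 13;
    pair.or_imm = value & 0x1fffu;
    return pair;
}

/* Model executing the pair: movehi zeroes the low 13 bits, or fills them. */
uint32_t apply_li(li_pair pair)
{
    uint32_t reg = pair.movehi_imm << 13;
    reg |= pair.or_imm;
    return reg;
}
```

For 0x12345678 this yields the immediates 0x91A2 and 0x1678, matching the expansion shown above.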

The lea pseudo instruction moves the absolute address of a label into a register:

lea s0, my_label

The assembler converts this to a movehi/or combination like 'li', but also emits ELF relocations so the linker/loader adjusts the values depending on where the label ends up in memory.

The nop (no operation) instruction is encoded as 00000000. This disassembles to 'or s0, s0, 0'--which has no effect--but hardware detects this specific encoding and can optimize handling of it.

Memory access

A memory instruction moves a value between a register and a memory location. It computes the memory address using a base register and a constant offset. Scalar memory operations can access 8, 16, or 32 bit values. The base address and offset are always in units of bytes regardless of the access size, but these instructions raise an alignment fault if an access address is not aligned on its size. Memory accesses use little-endian byte order. A load instruction that is smaller than 32 bits and has the 's' suffix sign extends the result. An instruction with the 'u' suffix zero extends it.

This is the syntax:

load_32 s1, (s2)   # load the value from address s2 into s1
load_32 s1, 12(s2) # load the value from address s2 + 12 into s1
load_s8 s1, (s2)   # Load and sign extend byte from addr s2 into s1
load_u16 s1, (s2)  # Load and zero extend short from addr s2 into s1
store_32 s1, (s2)  # Store the 32-bit value from s1 into addr s2
store_8 s1, (s2)   # Store the byte from s1 into addr s2
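
The extension and byte order rules can be illustrated with a small C model. It assumes memory is a plain byte array indexed by address; the function names follow the mnemonics above, and is_aligned is a hypothetical helper for the alignment rule.

```c
#include <stdint.h>

/* load_32: assemble four bytes in little-endian order. */
uint32_t load_32(const uint8_t *memory, uint32_t addr)
{
    return (uint32_t)memory[addr]
         | ((uint32_t)memory[addr + 1] << 8)
         | ((uint32_t)memory[addr + 2] << 16)
         | ((uint32_t)memory[addr + 3] << 24);
}

/* load_u16: the upper 16 bits of the result are zero. */
uint32_t load_u16(const uint8_t *memory, uint32_t addr)
{
    return (uint32_t)memory[addr] | ((uint32_t)memory[addr + 1] << 8);
}

/* load_s8: bit 7 of the byte is copied into the upper 24 bits. */
int32_t load_s8(const uint8_t *memory, uint32_t addr)
{
    uint8_t value = memory[addr];
    return (value & 0x80) ? (int32_t)value - 256 : (int32_t)value;
}

/* An access is aligned when its address is a multiple of its size in bytes. */
int is_aligned(uint32_t addr, uint32_t size_in_bytes)
{
    return (addr % size_in_bytes) == 0;
}
```

An access for which is_aligned returns 0 corresponds to the case where the processor raises an alignment fault.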

Memory accesses in the physical address range 0xffff0000 to 0xffffffff perform memory mapped device IO. These must be 32-bit scalar loads or stores, aligned on a 32-bit boundary like normal accesses. Otherwise, behavior is undefined.

There are two vector addressing modes. A block load uses a scalar base pointer and loads 16 contiguous 32-bit values. It adds an optional offset to the base pointer. If the computed address is not aligned on a multiple of the vector size (64 bytes), the instruction raises an alignment fault. For example:

load_v v1, 4(s6)

This is equivalent to the following pseudocode (assuming bit 0 of the mask is the least significant bit):

for lane = 0 to 15
    if mask[lane] or instruction not masked
        dest[lane] = memory[pointerReg + lane * 4 + immediate offset]

A scatter/gather memory access uses each lane of a source vector as a pointer for the same lane in the src/dest register. It may add an immediate offset to each one. It only loads 32-bit values. If a computed lane address is not aligned on a 32-bit boundary, the instruction raises an alignment fault. It may also specify a mask:

load_gath v1, 12(v2)
load_gath_mask v1, s1, 12(v2)

It is equivalent to:

for lane = 0 to 15
    if mask[lane] or instruction not masked
         dest[lane] = memory[pointerReg[lane] + immediate offset]
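
Both addressing modes can be modeled in C for experimentation. This is a simplified sketch: memory is a byte array, an unmasked instruction is modeled by passing a mask of 0xffff, a return value of 0 stands in for an alignment fault, and disabled lanes are assumed to be skipped entirely. It does not model a gather that faults partway through.

```c
#include <stdint.h>

#define NUM_LANES 16

/* Read a little-endian 32-bit value from a byte array. */
uint32_t read_32(const uint8_t *memory, uint32_t addr)
{
    return (uint32_t)memory[addr]
         | ((uint32_t)memory[addr + 1] << 8)
         | ((uint32_t)memory[addr + 2] << 16)
         | ((uint32_t)memory[addr + 3] << 24);
}

/* Block load: 16 contiguous 32-bit values from base + offset. */
int load_v(uint32_t dest[NUM_LANES], uint32_t mask, const uint8_t *memory,
           uint32_t base, int32_t offset)
{
    uint32_t addr = base + (uint32_t)offset;
    if (addr % 64 != 0)
        return 0; /* not aligned on the vector size */
    for (int lane = 0; lane < NUM_LANES; lane++)
        if ((mask >> lane) & 1u)
            dest[lane] = read_32(memory, addr + (uint32_t)lane * 4);
    return 1;
}

/* Gather: each lane of pointers supplies the address for the same lane. */
int load_gath(uint32_t dest[NUM_LANES], uint32_t mask, const uint8_t *memory,
              const uint32_t pointers[NUM_LANES], int32_t offset)
{
    for (int lane = 0; lane < NUM_LANES; lane++) {
        if (!((mask >> lane) & 1u))
            continue;
        uint32_t addr = pointers[lane] + (uint32_t)offset;
        if (addr % 4 != 0)
            return 0; /* lane address not aligned on 32 bits */
        dest[lane] = read_32(memory, addr);
    }
    return 1;
}
```

Note that the block load checks alignment of the computed address, so a non-zero offset is only valid when base plus offset is a multiple of 64.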

Encoding

There are two formats of memory access instructions: one with a mask, one without. The op field encodes whether the mask field (which is only used for vector operations) is present. An instruction that has a mask field has the _mask suffix on its mnemonic.

L - If this is one, load from memory to a register, otherwise store from a register to memory. For control register accesses, a load transfers a value from a control register to a general purpose register, and a store transfers it in the other direction.
dest/src – For loads, write the fetched value into this register. For stores, store this value from this register to memory. A synchronized store also updates this register to indicate success.
ptr – For control register accesses, this is the index of the control register. Other memory transfers use this register value to compute addresses based on the access type (described below).
mask - Controls which lanes to transfer. Lanes disabled via the mask do not generate memory access cycles.
offset – Offset in bytes to add to pointer (in a manner dependent on addressing mode). The processor sign extends this before addition, allowing negative and positive offsets.
op – Encodes the access type. Using an op not in this list raises an invalid instruction fault.

Atomic Memory Operations

Special load/store instructions enable multiprocessor synchronization (Some other architectures refer to these as load linked/store conditional). Programs can use these to implement atomic arithmetic and spinlocks. When the processor executes a synchronized load, the L2 cache records the address. The following synchronized store checks if any other threads have written to that cache line since the load. If one has, the instruction doesn't update the cache and writes 0 to the 'source' register. Otherwise, it updates the cache and writes 1 to the source register. The mnemonics have a _sync suffix:

load_sync s0, 12(s1)
store_sync s3, (s1)

In this example, the store_sync instruction updates s3 as a side effect, depending on success. Synchronized accesses are 32 bits wide.

Here is an example of code that performs __sync_fetch_and_add. The address of the memory location is in s0. When this returns, the register s3 contains the return value (previous value before the increment):

loop:  load_sync s1, (s0)     # Load current value
       add_i s2, s1, 1        # Compute incremented value
       move s3, s1            # Save previous value
       store_sync s2, (s0)    # Try to write back increment value
       bz s2, loop            # If store_sync failed, retry
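
For comparison, the same retry pattern can be written in portable C11 with a weak compare-and-exchange, which is allowed to fail spuriously in the same way a conditional store can. This is a host-side analog for illustration only. It is not Nyuzi code, and how a compiler would lower it to load_sync/store_sync is not described here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add increment to *location and return the previous value. */
uint32_t fetch_and_add(_Atomic uint32_t *location, uint32_t increment)
{
    uint32_t previous = atomic_load(location); /* like load_sync */

    /* Like store_sync followed by bz: if another thread changed the value,
       the exchange fails, previous is refreshed with the current value,
       and the loop retries with a newly computed sum. */
    while (!atomic_compare_exchange_weak(location, &previous,
                                         previous + increment)) {
    }

    return previous;
}
```

As in the assembly version, the value returned is the one observed by the attempt that finally succeeded.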

Consistency Model

This system uses a relaxed memory consistency model. Although it guarantees read-to-read and write-to-write ordering between threads, it does not guarantee write-to-read ordering. This is sometimes referred to as Processor Consistency.

For example, given the following code:

volatile int a = 0;
volatile int b = 0;

void core1()
{
    a = 1;
    printf("b = %d", b);
}

void core2()
{
    b = 1;
    printf("a = %d", a);
}

At first glance, it might not seem possible for this to output “a = 0 b = 0,” because at least one assignment must occur before either of the prints. But this can happen because this design can reorder stores after loads.

Reordering only occurs between hardware threads. We guarantee that threads always see their own stores in order. For example, the following program would never print “a = 0.”

int a = 0;

void core1()
{
    a = 1;
    printf("a = %d", a);
}

When needed, software can force write-to-read ordering using the membar instruction, which the section on cache control below describes. The __sync_synchronize intrinsic emits this from C. For example, the following code always prints at least one of the variables as 1:

volatile int a = 0;
volatile int b = 0;

void core1()
{
    a = 1;
    __sync_synchronize();
    printf("b = %d", b);
}

void core2()
{
    b = 1;
    __sync_synchronize();
    printf("a = %d", a);
}

Control Registers

Although these instructions use the memory instruction encoding, they do not access memory. The getcr instruction transfers a value from a control register to a general purpose register. The setcr instruction does the opposite.

setcr s10, 12
getcr s8, 30

In the table below, registers marked 'thread' have a separate instance per thread. Those marked 'shared' have a single instance per core that is visible to all threads on that core (but not to threads on other cores).

Accessing any control register (read or write) when the processor is not in supervisor mode raises a privileged operation fault. Although it might seem useful to allow some of these registers to be accessible from user mode, doing so would violate Popek & Goldberg virtualization requirements because these are "behavior sensitive instructions."

Accessing an index not in this list has undefined behavior.

Index	Read/Write	Shared/Per-Thread	Description
0	r	thread	Global Thread ID ¹
1	rw	shared	Exception Handler Address ²
2	rw	thread	Trap PC ²
3	r	thread	Trap Cause ²
4	rw	thread	Processor flags ⁴
5	r	thread	Access address of last memory trap ²
6	r	shared	Count of processor cycles since reset
7	rw	shared	TLB miss handler address (physical address) ³
8	rw	thread	Saved flags value ²
9	rw	thread	Current address space ID ³
10	rw	thread	Current page directory base ⁶
11	rw	thread	Scratchpad register 0 ⁵
12	rw	thread	Scratchpad register 1
13	rw	thread	Subcycle of instruction that a trap interrupted (for scatter/gather memory access)
14	rw	thread	Interrupt Enable ²
15	w	thread	Interrupt Ack ²
16	r	thread	Interrupt Pending ²
17	rw	shared	Interrupt Trigger mode ²
18	rw	shared	Bi-directional JTAG data transfer
19	r	thread	Index of last syscall instruction.

¹ Bits 31-2 are the core index, bits 1-0 are the thread ID (assuming four threads per core).
² See Trap/Interrupt Handling section.
³ See Virtual Memory section.
⁴ Flags:

Bit	Usage
0	Interrupt enable. The processor does not dispatch interrupts on the thread while this is 0.
1	MMU enable. When this is one, the TLB performs address translations for the current thread. If not, physical addresses are the same as virtual addresses.
2	Supervisor mode. This enables access to memory pages with the supervisor bit set and system instructions described elsewhere in this document (eret, setcr, etc.)

⁵ Available to stash general purpose register values temporarily during the interrupt service routine.
⁶ Since TLB misses are handled by software, this is just a convenient way to pass the page directory pointer to the TLB miss handler routine. However, if a hardware page walker were implemented, it would read from this register.
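
The bit layouts of the global thread ID and processor flags registers can be decoded with a few C helpers. This is a sketch: the macro and function names are invented for illustration, and the thread ID split assumes four threads per core as stated above.

```c
#include <stdint.h>

/* Bit positions in the processor flags control register (index 4). */
#define FLAG_INTERRUPT_ENABLE (1u << 0)
#define FLAG_MMU_ENABLE       (1u << 1)
#define FLAG_SUPERVISOR       (1u << 2)

/* Bits 31-2 of the global thread ID (control register 0). */
uint32_t core_index(uint32_t global_thread_id)
{
    return global_thread_id >> 2;
}

/* Bits 1-0 of the global thread ID. */
uint32_t local_thread_id(uint32_t global_thread_id)
{
    return global_thread_id & 3u;
}

/* Non-zero when the flags value has supervisor mode set. */
int is_supervisor(uint32_t flags)
{
    return (flags & FLAG_SUPERVISOR) != 0;
}
```

For example, a global thread ID of 13 decodes to thread 1 on core 3.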

Branch

A branch instruction causes the program counter to jump to another address. A conditional branch uses the value in a source scalar register to determine whether to take the branch.

b label
b s0
bnz s0, label
bz s0, label
call label
call s0
eret

Encoding

Executing a branch instruction with an operation that is not in this list raises an invalid instruction fault.

'ret' is a pseudo instruction that expands to an indirect branch to the return address register (b ra).

Cache Control

itlbinsert s0, s1    (where s0 is the virtual address and s1 is the TLB contents)
dtlbinsert s0, s1
tlbinval s0
tlbinvalall
dinvalidate s0
iinvalidate s0
dflush s0
membar

Encoding

ptr – Base address, scalar register. For TLB insertion, this is the virtual address.
tlb entry - Index of the register that contains the 32-bit TLB contents (described in the Virtual Memory section) for dtlbinsert/itlbinsert.
op - Cache operation. The mode in the table below indicates when it is legal to use this instruction: S is supervisor only, U/S is user or supervisor mode. If a program attempts to use a supervisor-only instruction in user mode, it raises a privileged operation fault.

dflush, dinvalidate, and iinvalidate do not wait for the operation to complete, but a membar instruction issued after one of these instructions does.
A TLB instruction only updates entries on the core it is issued on. An invalidate instruction only affects the entry for the current ASID, unless the page at the address is global, in which case it invalidates it for all ASIDs. It can take a few cycles for the new translation to take effect because of pipeline delay, but the updates are not queued and membar does not wait for the update to complete. The Virtual Memory section describes this in more detail.

Other notes:

If a program dynamically generates code (for example, JIT compilation), it must execute the iinvalidate instruction to reload instruction cache lines, as the processor does not do this automatically as a side effect of memory stores.
A program should execute dinvalidate to reload data that an external device or processor DMAed into system memory.
If a processor must execute code DMAed or loaded into memory by an external device or host processor, it must execute dinvalidate (to reload the L2 cache) followed by iinvalidate (to reload the icache).

Trap/Interrupt Handling

A trap may occur for several reasons:

An instruction faults by performing an invalid action, like an unaligned memory access.
An interrupt occurs when an external device requests access. These occur at an instruction boundary. The processor only dispatches an interrupt if the thread has enabled it in the processor flags control register.
The processor executes a syscall or breakpoint instruction. These switch from user to supervisor mode and jump into kernel code.

Traps are precise in this architecture: the processor guarantees that all instructions before the instruction at which the trap occurred have completed, and all instructions after that instruction have not. This allows the thread to resume properly after the trap handler software routine returns.

When a trap occurs, the processor performs the following actions:

Copy the contents of the processor flags control register into the saved flags control register
Copy the contents of the current program counter into the trap program counter control register
For a multi-cycle instruction like scatter/gather memory accesses, copy the current subcycle into the subcycle control register.
Disable interrupts
Set the value of the trap cause control register
Set the mode to supervisor in the processor flags control register.
Set the program counter to the address in the exception handler address control register

The processor handles a TLB miss exception a bit differently, as described in the Virtual Memory section.

The trap cause control register has the following format:

C Cache type. Only valid for memory access traps (IDs 5-10 below)
- 0 - Instruction Cache
- 1 - Data Cache
S Store. Only valid for memory access traps.
- 0 - load operation caused trap
- 1 - store operation caused trap

Type of trap:

ID	Description
0	No trap (reset)
1	Illegal instruction
2	Privileged operation in user mode (setcr, eret, itlbinsert, dtlbinsert)
3	External interrupt
4	syscall (system call) instruction executed
5	Unaligned memory access
6	Page fault (page not present)
7	TLB miss
8	Write to read only page
9	Attempt to access supervisor page from user process
10	Attempt to fetch instruction from a page without the execute bit set
11	breakpoint instruction executed

When the trap handler completes, it can use the eret instruction to return. This performs the following actions:

Copy the contents of the saved flags control register back to the processor flags control register (which may switch out of supervisor mode and re-enable interrupts).
If this is a multi-cycle instruction, restore the subcycle from the subcycle control register.
Jump to the address in the trap PC control register.
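
The trap entry and eret sequences can be summarized as a small state machine in C. This is a sketch under assumptions: the struct and function names are invented, the exception handler address (shared per core in hardware) is stored in the per-thread struct for simplicity, the cause is stored as a bare trap ID, and neither TLB miss handling nor nested traps are modeled.

```c
#include <stdint.h>

#define FLAG_INTERRUPT_ENABLE (1u << 0)
#define FLAG_SUPERVISOR       (1u << 2)

/* Hypothetical per-thread state; comments give control register indices. */
typedef struct {
    uint32_t pc;
    uint32_t subcycle;
    uint32_t handler_address; /* 1 */
    uint32_t trap_pc;         /* 2 */
    uint32_t trap_cause;      /* 3 */
    uint32_t flags;           /* 4 */
    uint32_t saved_flags;     /* 8 */
    uint32_t saved_subcycle;  /* 13 */
} thread_state;

/* Steps the processor performs when a trap occurs, in the order listed. */
void take_trap(thread_state *t, uint32_t cause)
{
    t->saved_flags = t->flags;
    t->trap_pc = t->pc;
    t->saved_subcycle = t->subcycle;
    t->flags &= ~FLAG_INTERRUPT_ENABLE; /* disable interrupts */
    t->trap_cause = cause;
    t->flags |= FLAG_SUPERVISOR;        /* enter supervisor mode */
    t->pc = t->handler_address;
}

/* Steps performed by the eret instruction. */
void eret(thread_state *t)
{
    t->flags = t->saved_flags; /* may leave supervisor mode, re-enable interrupts */
    t->subcycle = t->saved_subcycle;
    t->pc = t->trap_pc;
}
```

Because the flags are saved before they are modified, eret restores both the previous privilege level and the previous interrupt enable state in one step.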

Nyuzi supports two levels of nested traps, which allows handling a TLB miss while in the middle of another trap service routine. When a trap occurs, it saves the following trap specific control registers. The eret instruction restores them:

Index	Description
2	Trap PC
3	Reason last trap occurred
5	Access address of last memory trap
8	Saved flags value
11	Scratchpad register 0
12	Scratchpad register 1
13	Subcycle of instruction that a trap interrupted

There are four control registers (14-17) that configure external interrupt handling. In each register, there is one bit per external interrupt, with bit 0 corresponding to interrupt source 0.

Interrupt Enable: Configures which external interrupt sources cause an interrupt for each thread. A 1 bit enables the source.
Interrupt Ack: This is only valid for edge triggered interrupts. For each set bit, this resets the latched interrupt for the corresponding source. An ISR must do this before returning for an edge triggered interrupt.
Interrupt Pending: Indicates which interrupts are pending. The software routine can iterate through the bits and handle the interrupts in turn.
Interrupt Trigger Mode: A 1 bit indicates the interrupt is level triggered and is raised while the interrupt input is in the high state. A 0 bit indicates the interrupt is edge triggered and is latched when the signal transitions from low to high.
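
An interrupt service routine might combine these registers as in the following C sketch. The helper names are hypothetical, and masking the pending bits with the enable bits is a choice made by this sketch, since the text does not say whether the pending register already reflects the enable setting.

```c
#include <stdint.h>

/* Sources the routine should service: pending and enabled for this thread. */
uint32_t sources_to_service(uint32_t pending, uint32_t enable)
{
    return pending & enable;
}

/* Lowest-numbered source in the set, or -1 if the set is empty. */
int next_source(uint32_t sources)
{
    for (int source = 0; source < 32; source++)
        if ((sources >> source) & 1u)
            return source;
    return -1;
}

/* Value to write to the Interrupt Ack register: only edge triggered
   sources (trigger mode bit 0) among those serviced need an ack. */
uint32_t ack_mask(uint32_t serviced, uint32_t trigger_mode)
{
    return serviced & ~trigger_mode;
}
```

A routine built on these would call next_source repeatedly, clearing each handled bit, and write the ack value before returning.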

Virtual Memory

Each thread can independently enable virtual address translation for itself by setting the MMU enable bit in the processor flags control register. When the processor boots, translation is disabled for all threads.

Cores have separate translation lookaside buffers (TLBs) for the L1 instruction and data caches. Each core's TLBs are shared by all threads on it, but independent of other cores. Each TLB entry includes:

Physical page number
A set of flag bits
An address space ID

Flag bits include:

G: Global. If this is set, ignore the address space ID when performing TLB lookup. The page appears in all address spaces (see below).
S: Supervisor. A thread can only access the page if it is in supervisor mode. If it accesses this page when not in supervisor mode, it raises a supervisor memory access fault.
X: Executable. If a thread attempts to execute an instruction from a page that does not have this bit set, it raises an execute fault.
W: Writeable. If this is not set on a data TLB entry and the thread attempts to write, it raises a write protect fault.
P: Present. If this is not set in the TLB entry, accessing the page raises a page fault.

TLBs are managed by software. If a thread has enabled the MMU and accesses an address that doesn't have a valid TLB entry, it raises a TLB miss trap. An interrupt service routine reads the page table and inserts an entry into the TLB using the itlbinsert and dtlbinsert instructions, which use the following format for TLB contents:

A TLB miss behaves like a normal trap, except:

It disables MMU translation for the trapping thread.
Instead of jumping to the normal trap handler (which is a virtual address), it jumps to the value contained in the TLB trap handler control register, which is a physical address.

The TLB miss handler runs with the MMU disabled to avoid a nested TLB miss occurring in the middle of it, which would cause an infinite loop. This also allows it to access page tables without needing to map them into virtual memory.

Address Space Identifiers

Each entry in the TLB has an address space identifier (ASID). Each thread also stores an ASID in a control register. The hardware uses these in two ways:

When looking up an address in the TLB, if the global flag is not set on the TLB entry, the processor only matches it if its ASID matches the ASID control register.
The itlbinsert and dtlbinsert instructions copy the ASID from the control register into the TLB entry.

The ASID has two purposes:

Allow threads on the same core to run in different address spaces. Since all threads share the TLB, the same virtual address may have different physical mappings for each thread.
As an optimization to allow context switches without having to flush the entire TLB.

The ASID is 8 bits, which allows up to 256 virtual address spaces. If there are more than that, an operating system may dynamically reassign them, but must execute a full TLB flush instruction (tlbinvalall) when switching address spaces to invalidate other mappings for that ASID.
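
The ASID matching rule can be sketched in C. The struct layout and function names are hypothetical and do not reflect the hardware TLB entry format; pages are identified by an opaque virtual page number so that no page size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of one TLB entry. */
typedef struct {
    uint32_t virtual_page;
    uint32_t physical_page;
    uint8_t asid;  /* 8 bits, so up to 256 address spaces */
    bool global;   /* G flag: match regardless of ASID */
} tlb_entry;

/* Model of itlbinsert/dtlbinsert: the entry takes the thread's current ASID. */
tlb_entry tlb_insert(uint32_t virtual_page, uint32_t physical_page,
                     bool global, uint8_t current_asid)
{
    tlb_entry entry = { virtual_page, physical_page, current_asid, global };
    return entry;
}

/* Lookup rule: a non-global entry matches only for its own ASID. */
bool tlb_entry_matches(const tlb_entry *entry, uint32_t virtual_page,
                       uint8_t current_asid)
{
    if (entry->virtual_page != virtual_page)
        return false;
    return entry->global || entry->asid == current_asid;
}
```

Two threads with different ASIDs can therefore hold entries for the same virtual page without interfering, while a global entry is visible to both.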
