Anti RISC-V instruction encoding

Copyright © Goran Dakov 2024

Document License Terms

1. This document cannot be used to implement a pointer bounds checking architecture version of a computer processor that bypasses the requirement of a free software only microkernel. It is not mandatory for it to not support kernel servers, as long as protections apply, which can be implemented by guarding a physical memory map pointer from being stored to normal paged memory. The physical map pointer store protection would require the implementation of a sub-kernel mode which would also protect critical CSR from being over-written.

2. The micro-kernel must have accurate source available requirement in its license and require that the modules have accurate source code available, and it doesn’t allow modules that violate this requirement.

3. The micro-kernel must not have a mandatory debugging protection from regular users of the operating system modules, as long as the users do the debugging themselves.

4. User identifier functionality should be built into the micro-kernel and it shouldn’t prevent a user from debugging code that runs under the same user identifier.

5. If you use the Document to implement the processor architecture, you might not sue the author of the Document for patent infringement due to the implementation of the author and his licensees if any.

6. You acknowledge that the pointer bounds checking architecture is not necessarily safe from electromagnetic interference, even if coming from the processor core itself. This is especially true if implemented with large data buses that aren’t dual rail per bit with one rail the inverse of the other. Only dynamic logic can be safe in this regard if a large number of cores is required at the same time as a high clock speed.

End of Document License Terms


Preface


RISC-V claims to have a fusable instruction set, but it has several flaws.

Example 1:

Fusion requires an extra write port for the register write.

Fusion can require 3 instructions to access a field of a large structure.


This can be resolved by dropping the 16 bit instruction set, which allows easy fusion of two 32 bit instructions, and by sacrificing unaligned structure field access except through fused instructions.


Bit ordering in the document

Bit fields are specified from large to small, but the encoding is little endian within the bit fields.

Encoding of Load/Store:

14 bit immediate

1 travelling stop bit

1 bit indicating store

1 bit indicating subdomain selection (further sub-selected by lower address bits for the FPU)

3x 4 bit register fields: base, index, source/target data

2 bits power 2 scale

1 bit indicating whether it is load/store instruction.
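The field list above adds up to exactly 32 bits (14 + 1 + 1 + 1 + 12 + 2 + 1). The following is a minimal sketch of how such a word could be split and reassembled. It assumes that the fields are packed from the most significant bit downward in the order they are listed, and that the three register fields appear in the order base, index, data; the names `LoadStore`, `decode_load_store` and `encode_load_store` are hypothetical and not part of the specification.

```python
from typing import NamedTuple

# Field names and widths, most significant field first.
# ASSUMPTION: the packing order mirrors the order of the field list in the text.
FIELDS = (
    ("imm14", 14),
    ("stop", 1),
    ("is_store", 1),
    ("subdomain", 1),
    ("base", 4),
    ("index", 4),
    ("data", 4),
    ("scale_log2", 2),
    ("is_load_store", 1),
)


class LoadStore(NamedTuple):
    imm14: int
    stop: int
    is_store: int
    subdomain: int
    base: int
    index: int
    data: int
    scale_log2: int
    is_load_store: int


def decode_load_store(word: int) -> LoadStore:
    """Split a 32 bit word into the load/store fields."""
    values = []
    shift = 32
    for _name, width in FIELDS:
        shift -= width
        values.append((word >> shift) & ((1 << width) - 1))
    return LoadStore(*values)


def encode_load_store(fields: LoadStore) -> int:
    """Pack the fields back into a 32 bit word, checking each field width."""
    word = 0
    for (name, width), value in zip(FIELDS, fields):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word
```

Under these assumptions the lowest bit of the word is the load/store indicator, so the two scale bits plus that bit occupy the same position as a 3 bit major opcode would.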

The double load/store domain is aligned to 32 bits rather than 64 so that a 128 bit load can act as a load pair.

Loads/stores of 256 bits and larger are required to be aligned even if there is the extra bit.

Domains of FPU: single, double, int and custom (e.g. extended precision but can be other)

r15 is zero if used as an index register.

If r15 is used as the source/target data register, the instruction is a fused load/store prefix for the follow-up instruction, and the forwarded r15 value might not be preserved after that instruction executes.

The store prefix is for load-modify-store instructions.

If r15 is used as base register, it is replaced by r31, the stack pointer.
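The three special meanings of r15 in a load/store can be summarised in a small helper. This is only a sketch of the rules stated above; the function name, the result type and the use of `None` for "no real register" are hypothetical choices made for illustration.

```python
from typing import NamedTuple, Optional

R15 = 15
STACK_POINTER = 31  # r31, as stated in the text


class ResolvedOperands(NamedTuple):
    base: int                # architectural register used as the base
    index: Optional[int]     # None: the index contributes zero
    data: Optional[int]      # None: no ordinary data register (fused prefix)
    is_fused_prefix: bool


def resolve_load_store_operands(base: int, index: int, data: int) -> ResolvedOperands:
    """Apply the r15 special cases to the three 4 bit register fields."""
    effective_base = STACK_POINTER if base == R15 else base
    effective_index = None if index == R15 else index
    is_fused_prefix = data == R15
    effective_data = None if is_fused_prefix else data
    return ResolvedOperands(effective_base, effective_index,
                            effective_data, is_fused_prefix)
```

For example, a field triple of (15, 15, 15) resolves to a stack-pointer-relative access with no index that acts as a prefix for the next instruction.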


Encoding of Immediate format:

14 bit immediate sign extended

3 bit opcode

3x 4 bit register fields (the index register field is merged with the immediate by the ALU).

If index register is r15 it is still merged.

Register numbers r8-r14 are reserved for other instructions. Implied merge register is r15.
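A minimal sketch of decoding the immediate format follows. It assumes the same most-significant-first packing as the field list (14 bit immediate, 3 bit opcode, three 4 bit register fields) and assumes that the remaining 3 low bits hold the major opcode; these bit positions and the helper names are illustrative assumptions, not statements from the specification.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def decode_immediate_format(word: int) -> dict:
    """Split a 32 bit word into immediate-format fields (assumed layout)."""
    return {
        "imm": sign_extend(word >> 18, 14),   # 14 bit immediate, sign extended
        "opcode": (word >> 15) & 0x7,          # 3 bit opcode
        "regs": ((word >> 11) & 0xF,           # three 4 bit register fields,
                 (word >> 7) & 0xF,            # in encoded order
                 (word >> 3) & 0xF),
        "major": word & 0x7,                   # ASSUMPTION: major opcode in low bits
    }
```

With a 14 bit signed immediate the representable range is -8192 to 8191.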

For additions other than the 16 bit one, r14 selects subtract and r13 selects add with overflow reported as the LT condition. For 64 bit addition, r12 also selects add with carry out reported as inverted LT, r11 selects add with carry in set and carry out written to the extra flag bit, and r10 selects add with the carry out ORed with the extra carry bit and reported as the inverted LT condition.

For the and instruction, r14 copies the flags into the target register, and r13 copies the second source register or immediate into the flags.

ALU and FPU instructions zero r15 after they use it, except for the load upper instructions.

3 bit MAJOR opcode, 4 used for load, 1 used for immediate

OPCODES:

value             meaning
add               64 bit addition
add32             32 bit addition zero extended
and               64 bit and
or                64 bit or
xor               64 bit xor
add32s            16 bit sign extend addition with 16 bit overflow as LT condition
movq              Move instruction, sub opcode applies in src reg
small_imm_subset  Small immediate, imm7 in upper and 7 bit opcode


small_imm_opcode:

3 bit op (redundant op for shift instruction with encoding as in small_imm_subset, but with other codes)

1 bit alt-imm

1 bit alt-imm-val

2 bits condition (le,lt,eq,ne)

movq sub-opcode:

movq
movl
movb         Imm enc=reverse subtract rs1=lower rT=upper
movw         Imm enc=shift insn but using upper registers
movslq       Imm enc=mul16sx; also index=r14 for mul16sx
movsbq       Imm enc=mulq; also index=r14 for mulq reg; if r13 index it is hmulq
movswq       Imm enc=mull; also for reg-reg if index is r13, it is movswl otherwise movswq; if r14, it is mull
movsbl       Imm enc=mullq; if index is r14, then reg-reg is mullq; if index is r13, it is himulq
Cmpq.s       index=r14 → double lower for reg-reg
Cmpq.s.swp   index=r14 → double lower for reg-reg
Cmpq.u.      index=r14 → double lower compare for unord (reg-reg)
Cmpq.u.swp
Cmpl.s       index=r14 → single lower for reg-reg
Cmpl.s.swp   index=r14 → single lower for reg-reg
Cmpl.u.      index=r14 → single lower compare for unord (reg-reg)
Cmpl.u.swp



Shift encoding

Same as small_imm_opcode, but with index register field indicating as follows:

shll
shrl
sarl
shrq
sarq
shlq
reserved              others
subl                  (for register-register, but still implemented)
subls                 (for register-register, but still implemented)
andl                  Zero extended
orl                   Zero extended
xorl                  Zero extended
nxorq
nxorl
Test for pointer bit  Set the ne condition for the pointer bit if pointer bit is implemented. Imm7 has to be zero.
bswapl                Swap int bytes

Index register is r15

immediate is fused at the same offset as for large immediate

If the 20 bits contain 1001, 1010 or 1100 in the lower nibble, it is a shift and and, where each byte in the upper 16 bits is a byte in the mask and the index bit matches the beginning. Byte masks are handled on an odd-even basis rather than on a shift-the-whole-thing basis.

64 bit upper bit deposit helper shift and and is not cheap to implement and should not be implemented in the future.

The shift operations are zero or sign extended depending on bit 6 of the short immediate. It is ignored for 64 bit shift unless there is a 128 bit alu, but this is unlikely to be done. Sign extension versions can be added in the future.

If there is 128 bit alu, then one of sign or zero extend adds has to be removed.

Register-Register ALU:

Lower 4 bits of immediate are rs2.

Next 3 bits are upper bits of the register file. (The forward register is not included in register-register opcodes. It is reserved and must be r15.)

Next 7 bits all zeroes for the base integer opcodes when they exist
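The three sub-fields of the 14 bit immediate in the register-register form (4 + 3 + 7 bits) can be pulled apart as in the sketch below. The function names are hypothetical, and how the 3 upper bits are distributed over the register fields is not modelled because the text does not specify it.

```python
def decode_reg_reg_immediate(imm14: int) -> tuple:
    """Return (rs2, upper_bits, extra7) from the 14 bit immediate field."""
    imm14 &= 0x3FFF
    rs2 = imm14 & 0xF                 # lower 4 bits: rs2
    upper_bits = (imm14 >> 4) & 0x7   # next 3 bits: upper register-file bits
    extra7 = (imm14 >> 7) & 0x7F      # next 7 bits: zero for base integer opcodes
    return rs2, upper_bits, extra7


def is_base_integer_encoding(imm14: int) -> bool:
    """True when the extra 7 bits are all zero, as required for base integer opcodes."""
    return decode_reg_reg_immediate(imm14)[2] == 0
```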


Register-Register FPU:



Same as register-register ALU, but:

Forward bits are concatenated with opcode to form 7 bit opcode.

The extra 7 bits are used with one opcode bit for double permute instruction and a single permute instruction also uses another bit to indicate whether to swap sub registers in a double sized register pair.

This is 3 bit opcode zero but it can have more instructions

For other opcodes, the extra 7 bits contain 3 rounding mode bits, 2 criss-cross bits, and 2 subdomain selection bits (lower and upper).

Zero domain instructions are no-operations.

The domains are 128 bits wide.

Loads of 128 bits or less copy the data into both domains.

A 256 bit load writes separate data into each domain.

The criss-cross bits only criss-cross the second source operand; there is no support for quad operand processing.

This is because it would result in dependency accumulation or require sub-splitting into 4 domains, which is considered too complicated.

It would also require an enormous number of scheduler ports.

It is assumed that there will be half as many scheduler ports per FPU domain as there are for the ALU, and that the ALU ports could be split half-and-half across the FPU ports (no excessive usage of scheduler ports that way).

15 data types:

single, double, byte, word, int and 64 bit int, custom (e.g. extended)

Thus there are 14 opcodes per data type:

fpu                     alu
add                     and
sub                     or
mul                     xor
mul-pre-fma             and_not
add-lower-fma           add
sub-lower-fma           sub
nmul-pre-fma            Linear search mask upper bit (lower domain only and no criss cross allowed, but can be implied by the decode to do the linear search on a pair of 64 bit registers), index in flags register
Packed add              Pcmp le
Packed sub              Pcmp lt
Packed mul              Pcmp eq
Packed mul-pre-fma      Pcmp ne
Packed add-lower-fma    OP_7_A
Packed sub-lower-fma    OP_7_B
Packed nmul-pre-fma     OP_7_C



Another 7 opcodes without a type are possible:

opcode     meaning
cvtlsu
cvtqdu
cvtsl
cvtdq
cvtlss
cvtqds
reserved   Could be packed convert but it is too expensive



OP_7_A is reserved for double function table lookup and update and one single correction

OP_7_B is reserved for double function correction and 3 single corrections

OP_7_C is for and, or, xor, and_not and four double comparisons as per ALU condition codes

12 opcode zero instructions remain:

7 of them can be used for the same operations as OP_7_C but with single precision

1 opcode for jump indirect

1 opcode for return from call (address=rs1, others reserved)

1 opcode for jump unconditional with a 24 bit offset (broken in two parts)

2 opcodes for jump unconditional aligned (28 bit offset)

Final major opcode:

OP 0: compare and jump

index register field: 2 bit comparison criterion, 2 bit size (bwlq)

The immediate is the same as in the immediate encoding but left shifted by 2 bits.

Rs1 and data become rs1 and rs2.
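As an illustration of the compare and jump fields, the sketch below decodes the 4 bit index register field and the 14 bit immediate. The numbering of the conditions follows the order (le, lt, eq, ne) given for small_imm_opcode and the sizes follow b, w, l, q; both numberings, the placement of the criterion in the upper two bits of the field, and the function name are assumptions made for this example only.

```python
CONDITIONS = ("le", "lt", "eq", "ne")   # ASSUMED value order 0..3
SIZE_BITS = (8, 16, 32, 64)             # b, w, l, q; ASSUMED value order 0..3


def decode_compare_and_jump(index_field: int, imm14: int) -> tuple:
    """Return (condition, operand size in bits, signed jump offset)."""
    condition = CONDITIONS[(index_field >> 2) & 0x3]
    size = SIZE_BITS[index_field & 0x3]
    imm = imm14 & 0x3FFF
    signed_imm = (imm ^ 0x2000) - 0x2000   # sign extend from 14 bits
    offset = signed_imm << 2               # left shifted by 2 bits
    return condition, size, offset
```

Because the immediate is shifted left by 2, the offsets reachable by this form span -32768 to 32764 in steps of 4.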

OP 2,6:

JALR: 28 bit offset and shl 16 for 64 byte aligned function (6 is for odd 32 byte address)

Link register r1

1,3,4,5: conditional jump, 28 bit offset not shl

7: OP_4_BIT

encoding                         value
OP_CSR                           rs1=read write jump return from csr, others reserved (jump and return only from designated csr)
Test imm                         2 bit condition, 2 bit size
LUI                              20 bit immediate shl 14
reserved
reserved
reserved
reserved
reserved
reserved
reserved
Reserved for ptr_bnd_per_bucket  If implementing pointer bit security with floating point bounds
User extension
User extension
Reserved for ptr_bnd_register    Imm bits all zeroes except lowest 4, which is rs2
cloop                            Source register=cond, then exit=1/iterate=0, then whether to decrement the target register before comparing to zero.
Shift and Add                    Lowest 4 bits of imm are rs2.

For cloop, if the lowest of the 14 bits are not zero, they are taken as the “cleave” count, the count of instructions that will be executed before control reaches the destination. If the condition fails, they are not executed. The cleaved instructions can’t contain any branch instructions, or unpredictable things might happen. Security checks shouldn’t be bypassed by the incorrect execution of the cleaved instructions.

The cloop condition is anded with the cloop register having the value zero after the optional decrement.

For Shift and Add, the two bits after rs2 select the rs2 multiplier (x1, x-1, x2, x2) and the next 2 bits select the rs1 multiplier (x1, x4, x8, x8). Only 64 bit shift and add is supported.

Other immediate bits are zero (other values are reserved and might be allowed in the future as extensions).
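The Shift and Add selection can be modelled as a pair of small lookup tables. The sketch below takes the multiplier lists exactly as written in the text and assumes that the result is the sum of the two scaled operands, truncated to 64 bits; that combining rule and the function name are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1
RS2_FACTORS = (1, -1, 2, 2)   # selected by the two bits after the rs2 field
RS1_FACTORS = (1, 4, 8, 8)    # selected by the following two bits


def shift_and_add(rs1_value: int, rs2_value: int, imm14: int) -> int:
    """Model a 64 bit shift-and-add given register values and the immediate."""
    if imm14 >> 8:
        raise ValueError("reserved immediate bits must be zero")
    rs2_select = (imm14 >> 4) & 0x3
    rs1_select = (imm14 >> 6) & 0x3
    # ASSUMPTION: result = rs1 * factor1 + rs2 * factor2, wrapped to 64 bits.
    result = rs1_value * RS1_FACTORS[rs1_select] + rs2_value * RS2_FACTORS[rs2_select]
    return result & MASK64
```

With rs1 scaled by 4 and rs2 scaled by -1, for instance, register values 3 and 5 give 3*4 - 5 = 7.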



Allocation buckets are presumed if the pointer bit extension is implemented.

Allocation buckets have 5 bit exponent, 7 bit size, and use the upper 20 bits as exponent, lower 7 bits, upper 7 bits, and lower 7th bit to determine which bound we are on.

Pointer bits require special handling in main memory: they are either stored as out-of-band bits or kept in a separate region.

Minimum allocation bucket size 32 bytes.

Ptr_bnd_register might be implemented as micro-code.

Source register is the size of the region.

The address must be bucket aligned.

Stack frames can be allocated per function entry sequence and set bounds on the local variables as a whole.

Structures and variables passed by value on the stack would have to be passed via a pointer that has bounds set.

The pointer bit security option requires that there be a physical memory mappable micro-kernel in the BIOS which will have a license that prohibits non free modules and cannot be disabled.

A register and return address save area is accessed via the stack pointer.

The register file format for the FPU has an extra exponent bit, that is the inverse of the topmost exponent bit if the numbers are not denormal.

There are 2 type tag bits per 64 bit entry.

Those are used for FMA operation on the single data type to indicate rounding information from the lower fma add to get correct round to tie.

Exceptional conditions that are not handled by hardware:

Denormal IEEE result: can trap to mask the significand in software

Condition where the forwarding of fma bits could trigger a wrong round to tie, which is done if the upper half add is aligned with the addend and the lower part of the multiply is exact power of two.

The “extra” exponent bit from single precision is used by double and extended precision FMA instead of the type tags.

The “extra” exponent bit is inverted before and after a FPU logical operation.

The existence of trap-on-denormal for FPU logical operation is an optional extension, to allow accurate IEEE double logical operations.

Upon logical false, the rs1 register is copied to the result rather than keeping the old value.

Branch instruction behaviour

Branch instructions of the conditional kind trap if they are taken and the offset is zero. The offset is from the beginning of the next instruction.
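The two rules above (offset relative to the next instruction, and a trap on a taken branch with zero offset) can be expressed as a short model. The sketch assumes 4 byte instructions and a byte-granular offset; the exception class and function name are hypothetical.

```python
INSTRUCTION_BYTES = 4  # ASSUMPTION: every instruction is 32 bits wide


class ZeroOffsetBranchTrap(Exception):
    """Raised when a conditional branch is taken with a zero offset."""


def next_pc(pc: int, offset: int, taken: bool) -> int:
    """Return the address of the next instruction after a conditional branch."""
    fall_through = pc + INSTRUCTION_BYTES
    if not taken:
        return fall_through
    if offset == 0:
        raise ZeroOffsetBranchTrap(f"taken branch at {pc:#x} has zero offset")
    # The offset is measured from the beginning of the next instruction.
    return fall_through + offset
```

In this model a branch at 0x1000 with offset 16 lands at 0x1014, and a not-taken branch with zero offset simply falls through without trapping.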

An implementation might hard-code the branches above 4 in a 32 byte bundle as non-predicted but it must still execute them somehow. It is acceptable to execute them via an exception.

ALU behaviour

ALU operations other than compare and test do not set the condition codes.

Upon logical false, the rs1 register is copied to the result rather than keeping the old value.

Load-Store unit

Unaligned loads allowed except for bigger than 128 bit units.

Unaligned sequential loads and stores might fault or stall the LSU when they are not aligned on 4 bytes. Thus two adjacent 16 bit stores will only use 1 store unit if they are done one after the other even if there are 2 store units.

Two adjacent loads of 16 bit values will replay the load, but it might reduce latency as well as throughput. This can happen even if those are not adjacent in the code.

The same can happen with up to 4 adjacent byte loads and up to 4 adjacent stores.

An OoO implementation should support at least one load and one store of byte and word data, but it is not required to support more even if it has more load and store units.

Big endian byte ordering is the only supported byte ordering.

However, the registers are little endian and the shift operations operate as if they are little endian, rather than slicing the bytes.

Integer loads are zero extended rather than sign extended.
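A small model of an integer load under these two rules (big endian memory order, zero extension) might look as follows. Memory is represented as a Python `bytes` object and the function name is hypothetical; alignment restrictions are not modelled.

```python
def load_unsigned(memory: bytes, address: int, size: int) -> int:
    """Load `size` bytes at `address` as a big endian, zero extended integer."""
    chunk = memory[address:address + size]
    if address < 0 or len(chunk) != size:
        raise IndexError("load outside of memory")
    # Big endian: the byte at the lowest address is the most significant.
    # The result is never negative, which corresponds to zero extension.
    return int.from_bytes(chunk, "big")
```

A byte load of 0xFF therefore yields 255 rather than -1, and a 16 bit load of the bytes 0x12, 0x34 yields 0x1234.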

Condition code register format

{Signed Compare, Carry, Overflow, sign, zero,1 bit stored but unused}

The Carry flag is not inverted.

FPU compares are as if unsigned.

Overflow compare compares in eq bits.

Unordered result is not equal, but whether it is less than is undefined due to condition register swapping to achieve more conditions for conditional execution.

The Carry bit is set for unordered.
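The six-field condition code register format can be packed into and unpacked from a small integer. The sketch assumes one bit per field and that the fields are ordered from the most significant bit downward as listed in the format; the field names and helper functions are hypothetical.

```python
# One bit per field, most significant first (ASSUMED ordering and widths).
FLAG_NAMES = ("signed_compare", "carry", "overflow", "sign", "zero", "unused")


def pack_flags(**flags: int) -> int:
    """Build the 6 bit condition code value from named flag bits."""
    unknown = set(flags) - set(FLAG_NAMES)
    if unknown:
        raise KeyError(f"unknown flag(s): {sorted(unknown)}")
    value = 0
    for name in FLAG_NAMES:
        value = (value << 1) | (1 if flags.get(name) else 0)
    return value


def unpack_flags(value: int) -> dict:
    """Split a 6 bit condition code value into named flag bits."""
    top = len(FLAG_NAMES) - 1
    return {name: (value >> (top - i)) & 1 for i, name in enumerate(FLAG_NAMES)}
```

Under this layout an unordered FPU compare, which sets the Carry bit, would be seen as `unpack_flags(value)["carry"] == 1`.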