“Anti RISC-V instruction encoding”
Copyright © Goran Dakov 2024
Document License Terms
1. This document cannot be used to implement a pointer bounds checking architecture version of a computer processor that bypasses the requirement of a free software only microkernel. It is not mandatory for it to not support kernel servers, as long as protections apply, which can be implemented by guarding a physical memory map pointer from being stored to normal paged memory. The physical map pointer store protection would require the implementation of a sub-kernel mode which would also protect critical CSR from being over-written.
2. The micro-kernel must have an accurate-source-available requirement in its license, must require that the modules have accurate source code available, and must not allow modules that violate this requirement.
3. The micro-kernel must not have a mandatory debugging protection from regular users of the operating system modules, as long as the users do the debugging themselves.
4. User identifier functionality should be built into the micro-kernel and it shouldn’t prevent a user from debugging code that runs under the same user identifier.
5. If you use the Document to implement the processor architecture, you may not sue the author of the Document for patent infringement due to the implementation of the author and his licensees, if any.
6. You acknowledge that the pointer bounds checking architecture is not necessarily safe from electromagnetic interference, even if coming from the processor core itself. This is especially true if implemented with large data buses that aren’t dual rail per bit with one rail the inverse of the other. Only dynamic logic can be safe in this regard if a large number of cores is required at the same time as a high clock speed.
End of Document License Terms
RISC-V claims to have a fusable instruction set, but it has several flaws.
Example 1:
Fusion requires an extra write port for register write.
Fusion can require 3 instructions for large size structure field access.
This can be resolved by dumping the 16 bit instruction set, to allow easy fusion of two 32 bit instructions and sacrificing unaligned structure field access except by fused instructions.
Bit fields are specified from large to small, but the encoding is little endian within the bit fields.
14 bit immediate
1 travelling stop bit
1 bit indicating store
1 bit indicating subdomain selection (further sub-selected by lower address bits for the FPU)
3x 4 bit register fields: base, index, source/target data
2 bit power-of-2 scale
1 bit indicating whether it is a load/store instruction.
Double load/store domain is aligned to 32 bits rather than 64 so that a 128 bit load can do load pair.
256 bit and bigger loads/stores are required to be aligned even if there is the extra bit.
Domains of FPU: single, double, int and custom (e.g. extended precision but can be other)
r15 is zero if used as an index register.
If r15 is used as the source/target register for data, the instruction is a fused load/store prefix of the follow-up instruction, and r15 might not be preserved after that instruction executes.
The store prefix is for load-modify-store instructions.
If r15 is used as base register, it is replaced by r31, the stack pointer.
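The field list and the r15 rules above can be sketched as a small decoder. This is only an illustration under stated assumptions: the text does not give bit positions, so the sketch assumes the first listed field occupies the most significant bits and that the three register fields appear in the order base, index, data. The names `LOAD_STORE_FIELDS`, `unpack_load_store` and `describe_r15_roles` are hypothetical.

```python
# Field widths in the order the text lists them (14+1+1+1+4+4+4+2+1 = 32).
# ASSUMPTION: the first listed field sits in the most significant bits.
LOAD_STORE_FIELDS = [
    ("imm14", 14), ("stop", 1), ("store", 1), ("subdomain", 1),
    ("base", 4), ("index", 4), ("data", 4), ("scale", 2), ("is_load_store", 1),
]

def unpack_load_store(word):
    """Split a 32 bit word into the load/store fields (hypothetical layout)."""
    fields = {}
    shift = 32
    for name, width in LOAD_STORE_FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields

def describe_r15_roles(fields):
    """Apply the r15 rules from the text to the three register fields."""
    return {
        # r15 as base is replaced by r31, the stack pointer.
        "base": "r31 (stack pointer)" if fields["base"] == 15 else f"r{fields['base']}",
        # r15 as index reads as zero.
        "index": "zero" if fields["index"] == 15 else f"r{fields['index']}",
        # r15 as data marks a fused load/store prefix.
        "data": "fused prefix" if fields["data"] == 15 else f"r{fields['data']}",
    }
```

The widths sum to 32, which is consistent with the list describing one full instruction word; the actual bit placement may differ from this sketch.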
14 bit immediate sign extended
3 bit opcode
3x 4 bit register field (index register field is merged with immediate by the alu).
If index register is r15 it is still merged.
Register numbers r8-r14 are reserved for other instructions. Implied merge register is r15.
For non-16-bit addition, r14 is subtract and r13 is add with overflow as the LT condition. For 64 bit addition there are also r12, which is add with carry out as inverted LT; r11, which is add with carry in set and with carry out set in the extra flag bit; and r10, which is add with the carry out ORed with the extra carry bit as the inverted LT condition.
For the and instruction, if r14 is used, the flags are copied into the target register. If r13 is used, the second source register or immediate is copied into the flags.
ALU and FPU instructions zero r15 after they use it, except the load upper instructions.
3 bit MAJOR opcode, 4 used for load, 1 used for immediate
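The 14 bit immediate in this format is sign extended. A minimal sketch of that step follows; the helper name `sign_extend` is hypothetical, and how the index register field is merged with the immediate is not modelled because the text does not define it.

```python
def sign_extend(value, bits=14):
    """Interpret the low `bits` bits of value as a two's complement number."""
    mask = (1 << bits) - 1
    value &= mask
    sign = 1 << (bits - 1)
    # Flipping and then subtracting the sign bit maps 0x2000..0x3FFF to -8192..-1.
    return (value ^ sign) - sign
```

With 14 bits the representable range is -8192 to 8191.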
OPCODES:
value | meaning
add | 64 bit addition
add32 | 32 bit addition zero extended
and | 64 bit and
or | 64 bit or
xor | 64 bit xor
add32s | 16 bit sign extend addition with 16 bit overflow as LT condition
movq | Move instruction, sub opcode applies in src reg
small_imm_subset | Small immediate, imm7 in upper and 7 bit opcode
3 bit op (redundant op for shift instruction with encoding as in small_imm_subset, but with other codes)
1 bit alt-imm
1 bit alt-imm-val
2 bits condition (le,lt,eq,ne)
movq |
movl |
movb | Imm enc=reverse subtract rs1=lower rT=upper
movw | Imm enc=shift insn but using upper registers
movslq | Imm enc=mul16sx; also index=r14 for mul16sx
movsbq | Imm enc=mulq; also index=r14 for mulq reg; if r13 index it is hmulq
movswq | Imm enc=mull; Also for reg-reg if index is r13, it is movswl otherwise movswq; if r14, it is mull
movsbl | imm enc=mullq; if index is r14, then reg reg is mullq; if index is r13, it is himulq
Cmpq.s | index=r14 → double lower for reg-reg
Cmpq.s.swp | index=r14 → double lower for reg-reg
Cmpq.u. | index=r14 → double lower compare for unord (reg-reg)
Cmpq.u.swp |
Cmpl.s | index=r14 → single lower for reg-reg
Cmpl.s.swp | index=r14 → single lower for reg-reg
Cmpl.u. | index=r14 → single lower compare for unord (reg-reg)
Cmpl.u.swp |
Same as small_imm_opcode, but with index register field indicating as follows:
shll |
shrl |
sarl |
shrq |
sarq |
shlq |
reserved | others
subl | (for register-register, but still implemented)
subls | (for register-register, but still implemented)
andl | Zero extended
orl | Zero extended
xorl | Zero extended
nxorq |
nxorl |
Test for pointer bit | Set the ne condition for the pointer bit if pointer bit is implemented. Imm7 has to be zero.
bswapl | Swap int bytes
Index register is r15
immediate is fused at the same offset as for large immediate
If the lower nibble of the 20 bits is 1001, 1010 or 1100, the instruction is a shift and and, where each byte in the upper 16 bits is a byte in the mask and the index bit matches the beginning. Byte masks are handled on an odd-even basis rather than by shifting the whole value.
64 bit upper bit deposit helper shift and and is not cheap to implement and should not be implemented in the future.
The shift operations are zero or sign extended depending on bit 6 of the short immediate. It is ignored for 64 bit shift unless there is a 128 bit alu, but this is unlikely to be done. Sign extension versions can be added in the future.
If there is 128 bit alu, then one of sign or zero extend adds has to be removed.
Lower 4 bits of immediate are rs2.
Next 3 bits are upper bits of the register file. (The forward register is not included in register-register opcodes. It is reserved and must be r15.)
Next 7 bits are all zeroes for the base integer opcodes when they exist.
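The split of the 14 bit immediate field in the register-register form can be sketched as follows. The bit positions follow the text directly (lowest 4 bits, next 3 bits, next 7 bits); the function and variable names are hypothetical.

```python
def split_reg_reg_imm(imm14):
    """Split the 14 bit immediate field of a register-register instruction."""
    rs2 = imm14 & 0xF                   # lower 4 bits: rs2
    upper_reg_bits = (imm14 >> 4) & 0x7  # next 3 bits: upper bits of the register file
    ext7 = (imm14 >> 7) & 0x7F          # next 7 bits: zero for base integer opcodes
    return rs2, upper_reg_bits, ext7

def is_base_integer_form(imm14):
    """True when the top 7 bits are all zero, as required for base integer opcodes."""
    return split_reg_reg_imm(imm14)[2] == 0
```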
Same as register-register ALU, but:
Forward bits are concatenated with opcode to form 7 bit opcode.
The extra 7 bits are used with one opcode bit for double permute instruction and a single permute instruction also uses another bit to indicate whether to swap sub registers in a double sized register pair.
This is 3 bit opcode zero but it can have more instructions
For other opcodes, the extra 7 bits contain 3 rounding mode bits, 2 criss-cross bits, and 2 subdomain selection bits (lower and upper).
Zero domain instructions are no-operations.
The domains are 128 bits wide.
128 bit and less wide loads copy the data in both domains.
256 bit load writes separate data in each domain.
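The two load rules above can be modelled roughly as below. This is a sketch under assumptions the text does not settle: the register is modelled as a (lower, upper) pair of integers, the first 16 bytes of a 256 bit load are assumed to go to the lower domain, and bytes are read big endian. The name `load_into_domains` is hypothetical.

```python
def load_into_domains(data):
    """Return (lower_domain, upper_domain) for a load of len(data) bytes."""
    if len(data) <= 16:
        # 128 bit and narrower loads copy the same data into both domains.
        value = int.from_bytes(data, "big")
        return value, value
    if len(data) == 32:
        # 256 bit load: separate data per domain.
        # ASSUMPTION: the first 16 bytes land in the lower domain.
        return int.from_bytes(data[:16], "big"), int.from_bytes(data[16:], "big")
    raise ValueError("only loads up to 128 bits, or exactly 256 bits, are modelled")
```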
The criss-cross bits only criss-cross the second source operand, no support for quad operand processing.
This is because quad operand processing would result in dependency accumulation or require sub-splitting into 4 domains, which is considered too complicated.
It would also require an enormous number of scheduler ports.
It is assumed that each FPU domain will have half as many scheduler ports as the ALU, and that the ALU ports could be split half-and-half between the FPU ports (no excessive usage of scheduler ports that way).
15 data types:
single, double, byte, word, int and 64 bit int, custom (e.g. extended)
Thus there are 14 opcodes per data type:
fpu | alu
add | and
sub | or
mul | xor
mul-pre-fma | and_not
add-lower-fma | add
sub-lower-fma | sub
nmul-pre-fma | Linear search mask upper bit (lower domain only and no criss cross allowed, but can be implied by the decode to do the linear search on a pair of 64 bit registers) index in flags register
Packed add | Pcmp le
Packed sub | Pcmp lt
Packed mul | Pcmp eq
Packed mul-pre-fma | Pcmp ne
Packed add-lower-fma | OP_7_A
Packed sub-lower-fma | OP_7_B
Packed nmul-pre-fma | OP_7_C
Another 7 opcodes without type are possible:
opcode | meaning
cvtlsu |
cvtqdu |
cvtsl |
cvtdq |
cvtlss |
cvtqds |
reserved | Could be packed convert but it is too expensive
OP_7_A is reserved for double function table lookup and update and one single correction
OP_7_B is reserved for double function correction and 3 single corrections
OP_7_C is for and, or, xor, and_not and four double comparisons as per ALU condition codes
12 opcode zero instructions remain:
7 of them can be used for the same as OP_7_C but with single precision
1 opcode jump indirect
1 opcode return from call (address=rs1, others reserved)
1 opcode jump unconditional 24 bit offset (broken in two parts)
2 opcodes jump unconditional aligned (28 bit offset)
OP 0: compare and jump
index register field: 2 bit comparison criterion, 2 bit size (bwlq)
The immediate is the same as in the immediate encoding, but left shifted by 2 bits.
Rs1 and data become rs1 and rs2.
OP 2,6:
JALR: 28 bit offset and shl 16 for 64 byte aligned function (6 is for odd 32 byte address)
Link register r1
1,3,4,5: conditional jump, 28 bit offset not shl
7: OP_4_BIT
encoding | value
OP_CSR | rs1=read write jump return from csr, others reserved (jump and return only from designated csr)
Test imm | 2 bit condition, 2 bit size
LUI | 20 bit immediate shl 14
reserved |
reserved |
reserved |
reserved |
reserved |
reserved |
reserved |
Reserved for ptr_bnd_per_bucket | If implementing pointer bit security with floating point bounds
User extension |
User extension |
Reserved for ptr_bnd_register | Imm bits all zeroes except lowest 4, which is rs2
cloop | Source register=cond, then exit=1/iterate=0, then whether to decrement the target register before comparing to zero. If the lowest of the 14 bits are not zero, they are taken as the “cleave” count: the count of instructions that will be executed before control reaches the destination. If the condition fails, they are not executed. The cleaved instructions can’t contain any branch instructions, or unpredictable things might happen. Security checks shouldn’t be bypassed by the incorrect execution of the cleaved instructions. The condition is anded with the cloop register being of value zero after the optional decrement.
Shift and Add | Lowest 4 bits of imm are rs2. The next two bits select for rs2 x1 x-1 x2 x2, the next 2 bits select for rs1 x1 x4 x8 x8. Only 64 bit shift and add is supported. Other immediate bits are zero (other values are reserved and might be allowed as extensions in the future).
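The Shift and Add entry can be illustrated with a small model. The factor tables are copied from the text in the order listed; that selector values 0 to 3 map onto them in that order, and that the result is the sum of the two scaled operands wrapped to 64 bits, are assumptions of this sketch. The names `shift_and_add`, `RS1_FACTORS` and `RS2_FACTORS` are hypothetical.

```python
MASK64 = (1 << 64) - 1
RS2_FACTORS = (1, -1, 2, 2)  # as listed in the text; selector order is assumed
RS1_FACTORS = (1, 4, 8, 8)   # as listed in the text; selector order is assumed

def shift_and_add(rs1_value, regs, imm14):
    """Model of Shift and Add: scaled rs1 plus scaled rs2, wrapped to 64 bits."""
    if imm14 >> 8:
        # Other immediate bits are zero; other values are reserved.
        raise ValueError("reserved immediate bits set")
    rs2 = imm14 & 0xF            # lowest 4 bits: rs2 register number
    rs2_sel = (imm14 >> 4) & 0x3  # next two bits: rs2 factor
    rs1_sel = (imm14 >> 6) & 0x3  # next two bits: rs1 factor
    total = rs1_value * RS1_FACTORS[rs1_sel] + regs[rs2] * RS2_FACTORS[rs2_sel]
    return total & MASK64
```

For example, with rs1 holding 10, r3 holding 3, the rs1 selector at 2 and the rs2 selector at 1, this model computes 10 x 8 - 3.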
Allocation buckets are presumed if the pointer bit extension is implemented.
Allocation buckets have 5 bit exponent, 7 bit size, and use the upper 20 bits as exponent, lower 7 bits, upper 7 bits, and lower 7th bit to determine which bound we are on.
Pointer bit requires shenanigans with the main memory to either store them as out-of-band bits or use a separate region for them.
The minimum allocation bucket size is 32 bytes.
Ptr_bnd_register might be implemented as micro-code.
Source register is the size of the region.
The address must be bucket aligned.
Stack frames can be allocated per function entry sequence and set bounds on the local variables as a whole.
Structures and variables passed by value on the stack would have to be passed via a pointer that has bounds set.
The pointer bit security option requires that there be a physical memory mappable micro-kernel in the BIOS which will have a license that prohibits non free modules and cannot be disabled.
A register and return address save area is accessed via the stack pointer.
The register file format for the FPU has an extra exponent bit, that is the inverse of the topmost exponent bit if the numbers are not denormal.
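As an illustration of the extra exponent bit, the sketch below derives it for an IEEE double. Where the bit is stored in the register file is not stated in the text, so the sketch only returns the bit next to the 11 bit exponent. Treating zero as "not denormal" is an assumption, and the function name is hypothetical.

```python
import struct

def double_exponent_with_extra_bit(x):
    """Return (extra_bit, exponent) for a Python float viewed as an IEEE double.

    extra_bit is None for denormals, where the text does not define it.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0 and mantissa != 0:
        return None, exponent  # denormal: left undefined in this sketch
    top = (exponent >> 10) & 1
    return top ^ 1, exponent   # inverse of the topmost exponent bit
```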
There are 2 type tag bits per 64 bit entry.
Those are used for FMA operation on the single data type to indicate rounding information from the lower fma add to get correct round to tie.
Exceptional conditions that are not handled by hardware:
Denormal IEEE result, can trap to mask the significand in software
Condition where the forwarding of fma bits could trigger a wrong round to tie, which is done if the upper half add is aligned with the addend and the lower part of the multiply is an exact power of two.
The “extra” exponent bit from single precision is used by double and extended precision FMA instead of the type tags.
The “extra” exponent bit is inverted before and after a FPU logical operation.
The existence of trap-on-denormal for FPU logical operation is an optional extension, to allow accurate IEEE double logical operations.
Upon logical false, the rs1 register is copied to the result rather than keeping the old value.
Branch instructions of the conditional kind trap if they are taken and the offset is zero. The offset is from the beginning of the next instruction.
An implementation might hard-code the branches above 4 in a 32 byte bundle as non-predicted but it must still execute them somehow. It is acceptable to execute them via an exception.
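The conditional branch rule (offset relative to the beginning of the next instruction, trap when a taken branch has a zero offset) can be sketched as below. It assumes 4 byte instructions and an offset already expressed in bytes; neither the offset unit nor the trap mechanism is specified here, so `ZeroOffsetBranchTrap` and `conditional_branch_next_pc` are hypothetical names.

```python
INSN_BYTES = 4  # ASSUMPTION: every instruction is 32 bits wide

class ZeroOffsetBranchTrap(Exception):
    """Stands in for the trap raised by a taken branch with a zero offset."""

def conditional_branch_next_pc(pc, offset, taken):
    """Return the next program counter for a conditional branch at pc."""
    next_pc = pc + INSN_BYTES
    if not taken:
        return next_pc
    if offset == 0:
        raise ZeroOffsetBranchTrap(hex(pc))
    # The offset is counted from the beginning of the next instruction.
    return next_pc + offset
```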
ALU operations other than compare and test do not set the condition codes.
Unaligned loads are allowed except for units bigger than 128 bits.
Unaligned sequential loads and stores might fault or stall the LSU when they are not aligned on 4 bytes. Thus two adjacent 16 bit stores will only use 1 store unit if they are done one after the other even if there are 2 store units.
Two adjacent loads of 16 bit values will replay the load, but it might reduce latency as well as throughput. This can happen even if those are not adjacent in the code.
The same can happen with up to 4 adjacent byte loads and up to 4 adjacent stores.
An OoO implementation should support at least one load and one store of byte and word data, but it is not required to support more even if it has more load and store units.
Big endian byte ordering is the only supported byte ordering.
However, the registers are little endian and the shift operations operate as if they are little endian, rather than slicing the bytes.
Integer loads zero extend rather than sign extend.
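A short sketch of what big endian memory with zero extending integer loads means for a loaded value. Memory is modelled as a Python `bytes` object and the helper name `load_unsigned` is hypothetical.

```python
def load_unsigned(memory, address, size):
    """Load `size` bytes at `address`, big endian, zero extended into the register."""
    # The byte at the lowest address becomes the most significant byte.
    # int.from_bytes is unsigned by default, so the result is zero extended.
    return int.from_bytes(memory[address:address + size], "big")
```

A byte of 0x80 therefore loads as 128, not as -128, and the two bytes 0x12 0x34 load as 0x1234.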
{Signed Compare, Carry, Overflow, sign, zero,1 bit stored but unused}
The Carry flag is not inverted.
FPU compares are as if unsigned.
Overflow compare compares in eq bits.
Unordered result is not equal, but whether it is less than is undefined due to condition register swapping to achieve more conditions for conditional execution.
The Carry bit is set for unordered.