Designing a machine langue for a "next gen" cpu.

ratchetfreak

#7480

July 5, 2016

As a bit of a thought exercise I started wondering what the new design of machine language would be if we could dump the current "legacy" of x86/x64 and C's domination on the machine language design.

With C's domination I mean that char is always 8 bits (technically it doesn't really need to be but everyone assumes it) the requirement that a char* exists, that any pointer can be converted to it and back and that all pointers are the same size.

These things require that CPUs have byte operations and that all pointers are granular to a single byte even if alignment restrictions would disallow that.

However per-byte operations are only used (of the top of my head) in 5 locations; the machine code itself, string processing and the trio of IO, compression and encryption. Everything else can easily work with 32 or 64 bit integers.

Unless I'm very much mistaken these per-byte algorithms could easily function if all per-byte operations were simd over a 32 bit register (turning off the carry after 8 bits). The code would become more complicated (as simd tends to do) but the naive solution of splatting out each byte over its own 32-bit register would still work.

That would be the first set of operations: compare, integer addition and subtract with full carry, without carry on bit 8, 16 and 24 and without carry on bit 16. Multiply and divide would be restricted to 32 bit granularity.

The real bonus would be that the deference operation can be 4 byte granular. Which means that the 32 bit memory space is 16 gigs and would have bought a few more years before the switch to 64 bit.

The only problem remaining is the machine code, 4 bytes for each operand would be massive ad a bit of a waste. 16 bit operands are a bit more sane but would require some extra consideration with jumps. One solution is to have the program pointer be 33 bits and to only allow absolute jumps to be to even locations and relative jumps remaining arbitrary.

A simpler solution would be to just disallow code to be in the top 8 gigs of virtual memory, the program counter just cannot point to the top half of the memory. This requires more consideration when needing to do run-time code generation and where code can be loaded in.

I do not think that such a machine would be compliant to C. Although if you set the size of char to be 32 bits it would still work. Though it would require rewriting all programs that assume the size of a char to be 8 bits (which is everything).

Comments?

Mārtiņš Možeiko

#7487

July 5, 2016

While thinking like this from software point of view is nothing wrong, you really need to have hardware design skills/experience to reason about new ISA in any successful way. Having efficient CPU hardware puts restrictions on ISA you design.

ratchetfreak

#7490

July 5, 2016

As for designing the the actual machine code: I know that the opcode is separated in some chunks getting sent to what are essentially multiplexers and other bits being a set of flags for the circuitry actually doing the operation. (overly simplified but I'm not a hardware designer).

Maybe but seeing the awkward machine code emitted for common operations and the opcode range dedicated to stuff that's not use all that often anymore which screw over the rest of the commands forcing them to use multi-byte opcodes.

CS is full of historical decisions that looked like a small one but end up defining future development or hamstringing most of the field. I mean just look at the web...

Here's an old ABI design decision that kinda screws over boolean logic: true is represented by 1. C spec says that any non-zero value is true however the ABI says that if you return a boolean and the value is true then it must be 1.

To use that boolean to select one of two values the true-is-1 standard forces either a branch, a multiply or some kind of bit splat to use it as a mask. I mean simd operation have all bits set as true value t use as a mask.

Andrew Bromage

#7527

July 8, 2016

Every so often someone comes up with the idea of implementing a new ISA without byte addressing. MIPS did it, ARM did it, and Alpha did it. In each case, within a couple of ISA revisions, byte (and other small word) addressing was added in.

Strictly speaking, nobody needs byte addressing for general-purpose code, in the sense that it can be emulated using word addressing and manipulation instructions, and that's what the ISA designers figured. There are essentially two reasons why byte addressing was added back in all of these architectures:

Device driver performance suffered measurably. It turns out that whatever software people think of it, hardware designers insist on creating devices that require uncached byte addressing.
Using instruction cache effectively turns out to be very important. While text processing exists, programs will want to manipulate bytes. Requiring more instructions to do this means using more instruction cache to do the same job.

Incidentally, the instruction cache issue is also the reason why x86-64 still uses variable-length instruction encodings which go down to one byte in size. Decoding these instructions uses a crapload of power (which is why your smart phone doesn't have a x86 family CPU), but because more common instructions are shorter, you end up transferring less data between RAM and instruction cache for the same program.

By all means try, but that's what you're up against.

ratchetfreak

#7530

July 8, 2016

Pseudonym
Every so often someone comes up with the idea of implementing a new ISA without byte addressing. MIPS did it, ARM did it, and Alpha did it. In each case, within a couple of ISA revisions, byte (and other small word) addressing was added in.

Strictly speaking, nobody needs byte addressing for general-purpose code, in the sense that it can be emulated using word addressing and manipulation instructions, and that's what the ISA designers figured. There are essentially two reasons why byte addressing was added back in all of these architectures:

Device driver performance suffered measurably. It turns out that whatever software people think of it, hardware designers insist on creating devices that require uncached byte addressing.

Using instruction cache effectively turns out to be very important. While text processing exists, programs will want to manipulate bytes. Requiring more instructions to do this means using more instruction cache to do the same job.

Incidentally, the instruction cache issue is also the reason why x86-64 still uses variable-length instruction encodings which go down to one byte in size. Decoding these instructions uses a crapload of power (which is why your smart phone doesn't have a x86 family CPU), but because more common instructions are shorter, you end up transferring less data between RAM and instruction cache for the same program.

By all means try, but that's what you're up against.

uncached reads should (hopefully) not be a problem for devices. Unless a read access is used as a signal but who would do that...

So some kind of uncached byte granular write that doesn't touch adjacent bytes is then important. That can be done using a byte-granular masked write. This is under the assumption that they don't put misalign multibyte mapped registers.

The instruction cache is why I went for 16 bit opcodes instead of 32 bit. I'd have to gather statistics about which instructions are most common in programs to see which ones have to be short (maybe a few 2-in-1 instructions...) and which ones can be splatted out across multiple sub-words.

Bill Strong

#7546

July 10, 2016

Intel and AMD chips present x86 and x86-64 as a front end to the world, but the actual chip turns it into a reduced subset. So there is nothing preventing anyone implementing that strategy, other than possible performance characteristics or ease of backwards compatibility. (This is that decode step previously mentioned.)

There is a project that has a more interesting design, The Mill family of CPUs. IT is a general purpose hybrid Harvard Architecture. It's designed to operate on 30+ instructions per cycle. It uses one cache line to encode these 30+ instructions, and operates on one cache line at a time.

It is byte addressable, so not quite what you are looking for, but its memory addressing is 64bit on all families.

It doesn't have registers, it instead uses something they call The Belt. The Belt is basically a conveyor belt of values that fall off after it reaches the end of The Belt. At the start of every cycle, new data is placed at the beginning of The Belt, so if you have 3 new values being used, the last three values get dropped off. If the program running needs those values, it will have issued the instruction to save them. All values saved are placed at the top of The Belt at the same time as the new values. The hardware keeps track of the belt addressing, so you don't have to mess with the concept of registers anymore.

The families I have mentioned are basically different configurations of The Mill. Some configurations may have a Belt size of 16 values, and operate on 8 instructions at a time, larger ones may have a Belt size of 64, and operate on 32 instructions. (My numbers may be wrong.) They all use the same Instructions Set, and each family has a "Specializer" that can patch up for any differences between the families.

It has native NaN types. It's memory system is brilliant. So, You create a new program, and create and chunk of memory that is 500MB. It is automatically 0. The memory address will be created, and has some metadata that tells it that nothing has been written to it yet, so the hardware mmio returns 0, rather than going out to memory to check the value. If you store a value, it goes to the CPU cache first, and is lazily pushed out to system memory.

There are a lot of things to like about the design. Maybe you might get some ideas from their approach?

ratchetfreak

#7574

July 13, 2016

The belt stuff is nice and all but their most interesting part (on the topic of machine language) is their encoding: http://millcomputing.com/wiki/Encoding

tl;dr they group instructions into "extended basic blocks" that are 2 separate streams of instructions. One stream is computation the other is execution flow and memory access. Branches only have a single entry point (not counting returns).

For a multi-issue machine (doing multiple separate operations per cycle side by side) this would work very well.

Bill Strong

#7580

July 13, 2016

I agree that their encoding system is interesting, but their using that particular strategy may lower the rate of packing of instructions. It is a very cool hardware hack to help reduce the price of the chip by using two smaller high speed on die chips, but I am fairly certain they could make a chip without the split instructions and it would just work.

The belt has a nice side effect, i.e. it makes it possible to eliminate a large amount of silicon, thanks to not needing register hardware, and helps eliminate a number of instructions related to registers.

They have a lot of interesting ideas, such as the coarse grained memory security control scheme, but I leave that for the reader to discover.

Edited by Bill Strong on July 13, 2016, 7:55pm Reason: Added thoughts on belt.