User:Schol-r-lea/CPU Implementation
N.B.: This is one of a group of forum posts extracted so that the information in them can be re-worked into a less confrontational, more informational set of wiki pages. They are here only temporarily.
Note also that part of this discussion came from me mentioning to Geri a flight-of-fancy idea I'd had in the late 1990s, which I called the 'Reduced State Vector Processor' (RSVP). It was based on the idea of a manycore stack machine with a very small set of registers - four to five - and what were, for the time, huge caches, with the ALUs attached to the cache sections rather than to the CPU cores. It was a daydream, possibly suitable for some future research project but certainly not a workable system as it stood, and I was using it to contrast how I saw that idea, on the one hand, with how Geri saw SubLEq, on the other.
Here's a rule of thumb: for hardware CPU implementations, the most efficient design is one where simple instructions operate exclusively on explicit registers, and only load, store, and instruction fetch operations touch memory. In a pseudo-machine (such as the JVM), the opposite is true - complex operations with no registers (e.g., a stack machine) work best.
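To make that contrast concrete, here is a minimal sketch of my own (not from the original discussion) of the two models evaluating the same expression, (2 + 3) * 4. The register sequence in the comment is a made-up RISC-like notation, and the stack-machine half is a toy interpreter in C, loosely in the spirit of JVM bytecode rather than any real ISA:

```c
/* Register/load-store style (hypothetical RISC-like notation):
 *   LOAD  r1, a       ; only loads and stores touch memory
 *   LOAD  r2, b
 *   ADD   r3, r1, r2  ; arithmetic names its operand registers explicitly
 *   LOAD  r4, c
 *   MUL   r5, r3, r4
 *   STORE result, r5
 *
 * Stack-machine style (JVM-like): operands live implicitly on a stack.
 */
#include <stdio.h>

enum op { PUSH, ADD, MUL, HALT };
struct insn { enum op op; int imm; };

static int run(const struct insn *code)
{
    int stack[16], sp = 0;                  /* operand stack, no registers */
    for (;; code++) {
        switch (code->op) {
        case PUSH: stack[sp++] = code->imm; break;
        case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case HALT: return stack[sp - 1];
        }
    }
}

int main(void)
{
    /* (2 + 3) * 4 as stack code: push 2, push 3, add, push 4, mul */
    const struct insn prog[] = {
        { PUSH, 2 }, { PUSH, 3 }, { ADD, 0 }, { PUSH, 4 }, { MUL, 0 }, { HALT, 0 }
    };
    printf("%d\n", run(prog));              /* prints 20 */
    return 0;
}
```

The register form names every operand explicitly, which is what makes it straightforward to pipeline in silicon; the stack form keeps its operands implicit, which makes for compact, easy-to-generate code that an interpreter can dispatch quickly.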
There are exceptions to both, but OISCs aren't among them. The main one right now is, of course, x86, where the complex instructions get broken down, re-merged, re-ordered, and who knows what else by a massively complex instruction pipelining and interpretation mechanism. I am still not convinced that a RISC cannot do better, but right now, none of them do, because no one has any reason to make them better.
Realistically, reconciling either of them into a single model for both native and interpreted code is probably impossible, even for the embodiment of the abstract principle of The Reconciliation of Opposites fnord (don't ask, I don't understand it and never wanted to be it - thanks for the psychological booby prize, Eris!). Reconciling either of them with an OISC is even more daunting.
In any case, none of this really matters, because, who the hell is going to spend $5 billion on the silicon wafer development and design work for it? It's the same problem that Geri has, but at least I recognize that it is a problem. He seems to think that a 'simpler' ISA means a simpler and hence less expensive development program, which is absolutely false (in both particulars - it would require a great deal of silicon to implement, and the baked-in costs of new chip development approach or even outweigh the particular design cost in any case - $2 billion is about the opening cost for any new CPU regardless of complexity, because of how ICs are made). Developing silicon for an OISC - any OISC - as a hardware CPU, regardless of performance, would cost as much as the development of a new x86 CPU microarchitecture, if not more - at least Intel and AMD are on familiar ground with that piece of trash.
https://www.youtube.com/watch?v=CUAXCeRjw3c
I also took a look at a video on Mill while looking for it, an architecture which has been mentioned here numerous times before but which I never took the time to really look into, and along the way ran across the central design idea of it, the belt machine model. I think if I ever revisit RSVP as a design idea, I would probably need to consider using the conveyor structure rather than the stack I had originally envisioned. If I ever get my life together, and get into grad school on a PhD track (as I should have done 25 years ago), I might want to see if I can get a project for a prototype RSVP put together in some university's prototyping fab. Where I would get the $1 million to $5 million grants needed for that would be another matter.
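For anyone else who hasn't sat through the Mill talks: as I understand the belt (and this is a rough sketch of the concept only, not of anything the Mill actually does in hardware), results are never written to named registers - each new result drops onto the front of a fixed-length queue, the oldest value falls off the far end, and operands are named by how long ago they were produced. Something like:

```c
/* Rough sketch of a belt: a fixed-length queue of recent results,
 * addressed by temporal position (slot 0 = newest result). */
#include <stdio.h>

#define BELT_LEN 8

struct belt { int slot[BELT_LEN]; };

/* Dropping a result shifts everything back one place; the oldest falls off. */
static void drop(struct belt *b, int value)
{
    for (int i = BELT_LEN - 1; i > 0; i--)
        b->slot[i] = b->slot[i - 1];
    b->slot[0] = value;
}

/* Operands are picked by how many results ago they appeared, not by name. */
static int pick(const struct belt *b, int pos) { return b->slot[pos]; }

int main(void)
{
    struct belt b = { { 0 } };

    drop(&b, 2);                            /* belt: 2              */
    drop(&b, 3);                            /* belt: 3, 2           */
    drop(&b, pick(&b, 0) + pick(&b, 1));    /* "add b0, b1" -> 5    */
    drop(&b, 4);                            /* belt: 4, 5, ...      */
    drop(&b, pick(&b, 0) * pick(&b, 1));    /* "mul b0, b1" -> 20   */

    printf("%d\n", pick(&b, 0));            /* prints 20 */
    return 0;
}
```

Unlike a plain stack, consuming a value doesn't pop it; it simply ages off the end once enough newer results have dropped.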
I am sure several of you will note the x100 difference in what I am saying this would cost, versus my earlier minimum estimate for developing a commercial product version of an OISC. The key words here are 'prototype' and 'commercial product'.
Building an OISC using a field-programmable device? Pretty much the cost of the FPLA or FPGA, the device programmer (the device that writes to the PLD, not the human software developer), and a breadboard, plus the time to program it - maybe $30 US if you already have the device burner. Building a TTL OISC, maybe $50-$100, and don't even consider how it performs (for the record, terribly, by current standards - lightspeed considerations ensure that no TTL system can compete with integrated silicon, even compared to a programmable logic array, because it will take longer for a signal to get from one chip to another than it takes the PLD to complete the operation).
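Some rough numbers behind that lightspeed remark - these are order-of-magnitude assumptions of mine, not measurements: signals in board traces travel at very roughly half the speed of light (about 15 cm per nanosecond), a 74LS-class TTL gate has on the order of 10 ns of propagation delay, and a single LUT plus routing inside a modern FPGA is on the order of a nanosecond:

```c
/* Back-of-the-envelope comparison of a TTL critical path vs. one FPGA
 * logic level. Every constant here is a rough assumption, not a spec. */
#include <stdio.h>

int main(void)
{
    const double trace_cm_per_ns = 15.0;  /* assumed: ~0.5c in a PCB trace        */
    const double board_path_cm   = 30.0;  /* assumed: wiring on one critical path */
    const double ttl_gate_ns     = 10.0;  /* assumed: one 74LS-class gate delay   */
    const double gates_in_path   = 4.0;   /* assumed: gate levels on that path    */
    const double fpga_level_ns   = 1.0;   /* assumed: one LUT + routing on-die    */

    double flight = board_path_cm / trace_cm_per_ns;          /* ~2 ns  */
    double ttl    = flight + gates_in_path * ttl_gate_ns;     /* ~42 ns */

    printf("time of flight alone:  %.1f ns\n", flight);
    printf("TTL critical path:     %.1f ns\n", ttl);
    printf("FPGA logic level:      %.1f ns\n", fpga_level_ns);
    return 0;
}
```

Even ignoring the gate delays entirely, the time the signal spends in flight between packages is already longer than the FPGA takes to get through a whole logic level on-die.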
Designing an ASIC and getting it fabricated? Several thousand dollars to hundreds of thousands, depending on how far you take it, and its performance probably won't be significantly better than the FPGA implementation (they are basically the same technology, but a custom ASIC can be better optimized - sometimes). This, by the way, is the technology most chipsets use, or at least used to (I am guessing it has moved on since then, but I would need to check).
Designing an unoptimized prototype or proof of concept in a small-scale silicon foundry of the sort owned by some universities? $500,000, minimum. Yeah, that's quite a jump. Note that while the cost for creating the RISC-I is usually given as around $100,000 IIRC, that was in 1981 - the adjusted cost would be around $850,000, I think, so the real cost has dropped considerably.
Developing a commercially viable, optimized IC layout and implementing it in a fab's development line? There's the $2 billion I quoted earlier.
There is a world of difference between a hobbyist making an OISC implementation with an FPGA or a hand-wrapped breadboard full of discrete ICs, and even the lowest end consumer-grade implementations. It is like comparing a Ford or a Toyota made in a factory to a life-size model made from Legos by some ambitious teens.
A student project on designing a new architecture using a dinky University-owned fab isn't at all the same as a professional product, either. Interestingly enough, for what it can actually accomplish, the University foundry will cost a lot more overall for the results than the professional foundry; it's just that what can be accomplished by a University fab and what a company like Intel or GlobalFoundries can do are not really comparable.
One of the rarely-mentioned aspects of RISC is that the Berkeley RISC I and Stanford MIPS designs were originally developed because a simple, no-frills CPU with no microcode was the most that the fabs owned by UC Berkeley and Stanford were up to producing, as well as being the most that could be covered in a two-semester graduate courseload. It was a real shock to find that these stripped-down architectures proved to be as fast as or faster than the commercial microprocessors of the time, and even some of the minicomputers (this was back when a TTL design actually could outperform a contemporary microprocessor, sometimes, due to the limits of IC designs, transistor densities, and silicon real estate of the day). The RISC revolution got underway precisely because of how inefficient the mainstream designs of the time were.
This is itself yet another nail in the coffin of Geri's claim that an OISC can be made by a startup on a shoestring and then sold for US$1 apiece in the sort of volumes that a startup can make them in (i.e., production runs in the hundreds rather than millions, even when going fabless and contracting the actual production - which means that the unit costs would have to be astronomical to defray the development cost).
Note that the current capabilities of these university fabs come after repeated upgrades; a page I found states the process used for the RISC I in 1981 was 5μm (5000 nm). I have not been able to find the process density used for the MIPS 1 at Stanford, but I expect it was similar.
To further put this in context, according to Wicked-Pedo, that's the same process density that Intel was using on the 8085, four years earlier. The process Intel used for the original 8086 in 1978 was 3μm, as were the processes used for the 8088 a year later, the WDC 65C02 (the CMOS successor to the Commodore/MOS Tech 6502) in 1982, and the ARM I in 1985 (!!!). Just to drive home the point about Intel's lead over most of the other companies, the 68000 was using a 3.5μm process in 1979, the same year as the 8088 - and Motorola was Intel's primary competitor. Intel, AMD, and Acorn were using 350nm processes in 1997, meaning that the gap in time between when the commercial fabs and the university fabs reach a given process density has widened tremendously, though that may just be because the universities don't need more (after all, they can only cover so much in a few semesters, and foundries are very expensive - even for research, the schools simply can't match the budgets of the corpse-rat developers).
The take-away:
- Right now, Intel, AMD, and nVidia are just about neck-and-neck, though AMD seems to have pulled ahead at the moment. No one else seems to come close to them, not even the memory, ASIC, and FPGA manufacturers. Even Oracle is using a 20nm process, roughly on par with the 22nm process used on Ivy Bridge, though that may be partly due to Oracle not really giving much of a damn about SPARC or anything else they got from Sun except MySQL (they bought the company primarily to kill off that single 'product', or at least get control over it; the rest is just excess baggage, and their main interest in the rest is in how much they can get people to cough up for Java certification classes).
- With the exception of the introduction of 32-bit p-mode in 1985, and 64-bit long mode in 2003, most of the improvements in performance for the x86 (and processors in general) since then have depended on improved processes. These come mainly from lower signal delays, but also from simply being able to do more per unit of chip real estate (and now per unit of volume, with the multilayer processes). Intel and AMD have thrown a mountain of work at using those improvements to speed up the x86, and are now really good at it, but I can't help but wonder what would have happened if that effort had been put into a design such as MIPS, SPARC (which descends from the Berkeley RISC), PowerPC, Alpha, or even the M680x0 or the Transputer (and yes, INMOS was a big influence on my RSVP idea). The point is, these speed-ups wouldn't even have been possible in 1978, even for the design principles which were already known at the time (which was most of them, actually - most have their roots in the CDC mainframes and the Cray supercomputers), simply because the chips couldn't have held all of it. The rise of RISC was compensating for something that is no longer an issue.
- A lot of the reason why RISC was so much better at the time was because most of the commercial microprocessor designers were focusing on features rather than performance. The real take-away from RISC was that trying to do too much in hardware rather than software was actually hurting performance, especially as increased use of compiled languages made the assembly language bells and whistles of something like the VAX or the 68000 redundant.