The Reduced State Vector Processor (RSVP)
J Osako, 22 September 1998
The current emphasis in CPU design on fast instruction performance is, in the opinion of this author, misplaced. The primary overhead in current software is in I/O and interrupt latency, not instruction execution; most systems today are bound to a slow turnaround in context switching, and are usually idle while waiting for user events. Further, most of the activity in modern systems is bound to processes that involve large vector and matrix calculation, particularly those needed for audio and video processing. While most such work is usually offloaded to DSPs and other coprocessors, some still must be handled by the CPU, and the CPU ought to be designed to handle such calculations efficiently.
To these ends, I propose a CPU design meant to minimize processor state, maximize parallel processing efficiency, and enhance vector calculations. I call this the Reduced State Vector Processor (RSVP) concept.
The basic concept is to eliminate the majority of the registers in the CPU, replacing them with large (>1M) write-through caches. The processor would have separate caches for code, data and stack, with at least two of each of these caches. The intention is that each process would be allocated a separate cache set, and that in the case of a cache miss, the CPU would automatically trigger a task switch to a process which is in another cache set while the missed cache is updated in the background.
All operations would be direct-to-memory, with the ALU operating directly on the caches; in effect, the caches would act as an extremely large register set. This allows the CPU to be reduced to a set of four state registers: Instruction Pointer, Stack Pointer, Frame Pointer and Cache Pointer. This minimal state, combined with the fact that all operations are cached, allows for a single-cycle, single-instruction context switch.
Since, as said earlier, the actual calculations are performed directly upon the caches, the ALU has to be able to operate on any section of the cache as needed. Taking this idea further, it can be seen that, logically, the ALU should be bound to the cache it works on, not the CPU, leaving the CPU to perform only the control functions. ALU operations that require more than one cycle should trigger a context switch, just as a cache miss would, allowing the ALU to operate independently. Further, since it can operate on arbitrary sections of the cache, it should be possible for it to operate on multiple operands, or operands of arbitrary size. This allows for efficient matrix and VLW operations as a natural extension of the ALU design. It should be able, in principle, to operate upon the entire addressable memory; the ALU would call cache refills and continue operating completely independently of the CPU itself.
The final logical step of this design is to add multiple CPUs. Since there are already redundant caches, and the CPU structure is so minimal, it should be possible to have several of these simplified CPUs on a single chip, switching between a large collection of semi-autonomous cache/ALU sets as needed. A reasonable plan, given current chip densities, is a 4-CPU, 8-cache-set design, which would have a total of 24M of cache memory on-chip. While this is a very large amount of memory, it is well within current densities, and the simplified structure of the processor overall eliminates most of the usual processor hardware.