for
CSCI 301: Great Ideas in Computer Science
(c) 1993, 1994, 1995 John E. Howland, All Rights Reserved
Department of Computer Science
Trinity University
Stadium Drive
San Antonio, Texas 78212-7200
Internet: jhowland@ariel.cs.trinity.edu
Even if we didn't have important computational problems for which present day computers are two slow, many of us would still want to go faster for the same reason we want to own cars that are capable of 150 mph speeds even though the speed limits on most roads do not exceed 65 mph. When we have faster computers, we find ourselves attempting solution of more complex problems which in turn require faster computers ...

John Von Neuman's contribution to this design was that both programs and data were to be stored in the same memory system. When this design is used for a fast computer we find that the fetching of instructions interferes with the fetching and storing of data because both of these operations must occur over the single data path which connects the processor and the memory system. The restriction of this data and instruction flow over the data path is sometimes called the Von Neuman Bottleneck.

Notice that a heavier line is used in the above diagram to indicate that a wider data path exists between the main memory and the cache memory than the data path between the processor and the cache memory.
The cache memory system works according to the following rules. When the processor requests either an instruction or data operand, the cache sees the request and checks its very fast memory to see if the requested item is already in the cache. If so, the cache returns that instruction or datum to the processor at a very high rate of speed. If the cache memory does not contain the requested item, then the cache fetches not only that item, but also several other items, which are located close by the requested item, from the main memory. The idea behind fetching more than is needed is to set up the cache contents so that the next time the processor fetches an instruction or datum it will already be in the cache. This process works well because instructions are stored sequentially in memory and data references often tend to be located close by one another. The cache memory acts as a speed buffer between a very fast processor and a relatively slow main memory system. Nearly all modern processor designs use cache designs to achieve higher performance.
Of course, caches can be used in Harvard style designs to improve their performance as indicated in the following diagram.
1. Fetch an instruction from memory using a memory address which is stored in the program counter register. Store this instruction in the instruction register and increment the program counter register.
2. Execute the instruction just fetched.
These two processes can be setup as a two stage instruction pipeline so that the processor can fetch the next instruction at the same time it is executing the current instruction. Parallelism is achieved since both of these steps are performed during the same clock cycles.
The execute stage may also be broken down into several sub-stages which can operate in parallel in situations where a sequece of several of the same instruction are to be performed on different data operands. For example, the execution of an add instruction might be broken down into four stages as:

The pipline stages are designed so that they can perform their processing simultaneously. Suppose that each add stage takes 1 clock cycle. Then a single add instruction will take 4 clock cycles to complete. If a program contains a sequence of 100 additions, then 4 clock cycles elapse before the first sum is available, 3 clock cycles elapse before the second sum is finished, 2 clock cycles elapse before the third sum is completed and one clock cycle each elapses before each of the remaing 97 sums are completed. The end result is that there is a nearly 4 times speedup for most of the addition results excepting for the longer times required for the first 3 sums as the pipeline stages fill up.

We call such a design a MIMD (Multiple Instruction, Multiple data) processor. A variant of this design uses separate (but interconnected) memories for each processor.

MIMD machines are good parallel solutions for computational problems which can be decomposed into several concurrent, but independent, processes. For example, a computer system which must support several concurrent interactive, users each running a different program at the same time, as is often done on Unix time-sharing systems, can make good use of a MIMD style machine. Most Unix system vendors offer high performance multiple processor MIMD machines for customers which require large numbers of simultaneous users.
Another design for multiple processors assumes that each processor is fed the same instruction from a single shared instruction memory. However each processor has its own data memory and usually there is come kind of interconnection between the data memories.

This design is referred to as a SIMD (Single Instruction, Multiple Data) machine. Since each processor will be executing the same instruction (for example, an add instruction), the data for a problem must be distributed amongst the memories. For example, suppose the problem is to add together the respective elements two lists, (a1 a2 ... an) and (b1 b2 ... bn). If we have 10 processors and n = 1,000,000, then we could put the first 100,000 elements of each of the lists in the first memory, the second 100,000 elements of each of the lists in the second memory and so on. The program which each processor runs performs 100,000 additions. Since 10 processors are performing this task simultaneously, all 1,000,000 additions are performed in (/ 1 10) the time required for one processor. Actually, we can't achieve a full ten times speedup since there will be some overhead in distributing the problem data to each processor and collecting the final results.
Some designs for SIMD machines use thousands of processors to implement what is called a massively parallel super computer. One vendor, Thinking Machines Inc., builds super computers, called Connection Machines, which employ as many as 65536 processors. Connection Machines are among the fastest machines in existance.
In general, the ideal parallel computer design would increase the speed of a computer by a factor of n whenever the design incorporates a parallel factor of n. However, this goal is never actually achieved, but the degree to which it is achieved can be used as a measure of how good the parallel design is.
The scientific community has invested much time and effort through the years developing programs, written mostly in Fortran, which solve a variety of different kinds of problems. Much effort has been expended on the problem of trying to make these Fortran programs usable on modern parallel processing machines. This work has not been entirely successful since each parallel processor design needs its own special parallel programming language inorder to allow programs achieve high processing rates on that machine. This means that when a new parallel design is developed, there is usually a substantial time delay before software becomse available which fully exploits the parallel features of the machine.