OLR6: The CPU

 

 

The CPU is sometimes called the central processor or just the processor.  However, many of today's high-powered computers contain multiple CPUs, so talking about "the CPU" may not always be completely accurate.  Nonetheless, most of the discussion in these notes applies equally well to either a single-CPU system or a multi-CPU system. The CPU components shown in the diagram above are discussed in the following subsections.  In addition, modern computers have at least one cache interposed between the CPU and PM.  This is not shown in the diagram above, but will be considered below in Section 6.6.  Most modern CPUs are organized internally as one or more pipelines.  This will be discussed below in Section 6.7.

 

An internal bus connects all of the components in the CPU.  Data is transferred between components using this bus and control signals are sent from the CU to other components using control lines, which are not shown in the diagram above. The program counter (PC) usually contains the address of the next instruction to be executed.  The user register file (URF) is a set of very fast access registers that are used as short-term storage for the partial results of a computation, the stack pointer, array indexes, and so forth.  In some machines the PC is also part of the URF. The instruction register (IR) holds the instruction that is currently being executed.  Logically it is part of the CU. The arithmetic-logic unit (ALU) is the component that performs all of the operations specified by the arithmetic and logical instructions.

 

Here we examine the major components of the Central Processing Unit (CPU) in more detail.  Much of this discussion will be in general terms so it will apply to most modern CPUs, but the model we use will be a simple implementation of an ARM-like CPU.  The figure below shows the components and structure of our simple model (without cache, pipelining, ...).

                                                                                              

 

 

Here the User Register File (URF) is expanded and the ARM register names are used.  In addition to the IR, MAR, and MDR we spoke of earlier (the MDR is the register called the memory buffer register, MBR, in Section 6.3), which are non-visible registers, three additional non-visible registers, Y, Z, and T, have been added.  The control lines are not shown.  Control lines carry the control signals from the CU to every component of the CPU.  They will be discussed below.

 

6.1 The Arithmetic-Logic Unit

 

The ALU's inputs are the value in the Y register and the value on the CPU bus, which passes through the barrel shifter (BS).  The ALU's output is latched into the Z register.

 

 

 

The single control line to the Add-Subtract unit selects either addition or subtraction.  The ALU is a combinational circuit that performs all operations simultaneously.  The control lines to the MUX select the result of the desired operation and pass that single value on to the Z register.  All ALU operations are register to register.  To perform an ALU operation, the left source operand is first loaded into Y.  In the next clock period, the right source operand is put on the bus and passes through the BS (where it may or may not be shifted), the appropriate ALU control lines are asserted, and the selected result is latched into Z.  In the third clock period, the contents of Z are transferred to the destination register.  Three clocks are required because of a fundamental property of buses: there can never be more than one value on a bus during a single clock period.  Since three different values have to be transferred on the bus to accomplish an ALU operation, the minimum time required is three clocks.
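The three-clock sequence described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the registers Y and Z come from the text, but the 32-bit width, the dictionary of registers, the set of operations, and the function names alu and alu_operation are invented for the example.

```python
# Sketch of the three-clock register-to-register ALU operation.
# Y and Z are from the text; the register dict, the 32-bit width, and
# the function names are assumptions made for illustration.

MASK = 0xFFFFFFFF  # assume 32-bit registers


def alu(y, bus, select):
    """Combinational ALU: compute every result, then let the MUX pick one."""
    results = {
        "add": (y + bus) & MASK,
        "sub": (y - bus) & MASK,
        "and": y & bus,
        "or": y | bus,
        "xor": y ^ bus,
    }
    return results[select]


def alu_operation(regs, op, dest, left, right, shift=0):
    """Perform dest = left op (right << shift); return the clocks used."""
    clocks = 0

    # Clock 1: the left operand travels over the bus into Y.
    bus = regs[left]
    regs["Y"] = bus
    clocks += 1

    # Clock 2: the right operand travels over the bus, passes through the
    # shifter, and the selected ALU result is latched into Z.
    bus = regs[right]
    shifted = (bus << shift) & MASK
    regs["Z"] = alu(regs["Y"], shifted, op)
    clocks += 1

    # Clock 3: Z travels over the bus into the destination register.
    bus = regs["Z"]
    regs[dest] = bus
    clocks += 1

    return clocks


regs = {"r0": 0, "r1": 7, "r2": 5, "Y": 0, "Z": 0}
used = alu_operation(regs, "sub", "r0", "r1", "r2")
print(regs["r0"], used)
```

Each clock in the sketch assigns the variable bus exactly once, which mirrors the rule that a bus carries only one value per clock period.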

 

6.2 The Control Unit

 

It is the control unit (CU) within the CPU that coordinates the actions of the CPU's components, and to some extent, certain actions of the PM and the IO devices.  At the heart of the CU is a clock that emits a square wave of a fixed frequency.  One cycle of the clock output is referred to as a clock period, usually just called a clock.  Using the clock's output, the CU sends control signals to various parts of the CPU.  These signals cause various actions to take place at specified times.  Every instruction consists of a sequence of microsteps, each of which takes place in one clock.  Most instructions require more than one clock to complete their execution.  To cause the execution of an instruction, the CU sends out the appropriate set of control signals during each clock.

 

6.3 The Bus Interface Unit

 

The bus interface unit (BIU) is used to communicate with PM and the IO devices.  The memory address register (MAR) holds the address that is put on the address bus, and the memory buffer register (MBR) holds the item that is put onto, or taken off, the data bus.  The control lines in the CPU are not shown, but there are control lines to the BIU that supply the control signals that the BIU puts on the control bus.  To perform a PM read access (or fetch), the CU puts the PM address into the MAR and sends a read signal to the BIU, which then puts the contents of the MAR on the address bus and the read signal on the control bus.  When the BIU receives the requested item, it is copied into the MBR.  To perform a PM write access (or store), the CU puts the PM address into the MAR and the item to be stored into the MBR, and sends a write signal to the BIU, which then puts the contents of the MAR on the address bus, the contents of the MBR on the data bus, and the write signal on the control bus.
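As a rough illustration of the read and write accesses just described, the following Python sketch models the BIU with its two registers.  The MAR and MBR are from the text; the class name, its method names, and the dictionary standing in for PM are assumptions made for the example.

```python
# Sketch of PM read and write accesses through the BIU.
# MAR and MBR are from the text; the class, its method names, and the
# dict-backed memory are assumptions made for illustration.

class BusInterfaceUnit:
    def __init__(self, memory):
        self.memory = memory  # stands in for PM on the system bus
        self.mar = 0          # memory address register
        self.mbr = 0          # memory buffer register

    def read(self):
        """Put MAR on the address bus with a read signal; latch the reply in MBR."""
        self.mbr = self.memory.get(self.mar, 0)

    def write(self):
        """Put MAR on the address bus and MBR on the data bus with a write signal."""
        self.memory[self.mar] = self.mbr


pm = {0x100: 42}
biu = BusInterfaceUnit(pm)

# Fetch: the CU loads the MAR, then signals a read.
biu.mar = 0x100
biu.read()
fetched = biu.mbr

# Store: the CU loads the MAR and the MBR, then signals a write.
biu.mar = 0x104
biu.mbr = fetched + 1
biu.write()
print(fetched, pm[0x104])
```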

 

6.4 Instruction Set Architecture

 

The instruction set architecture for a computer specifies all of its instructions, their formats, the complete effect of executing each instruction, all of the visible registers (a visible register is one that can be directly accessed by a program), and any other aspects that affect how the computer is programmed.  The instructions can be grouped into four classes:

- Data processing instructions, such as integer and floating point arithmetic, logical and shifting, and multiply-accumulate (used in DSP, i.e., digital signal processing).

- Data movement instructions, such as move between two registers, move between register and memory, and input-output.

- Control flow instructions, such as branching, procedure call and return, and looping.

- Special instructions, such as switching between user and system mode, cache management, and exception management.

 

An instruction consists of an opcode and zero or more operand specifications.  The opcode specifies the operation to be done by the instruction and the operand specifications indicate where to find the data to which the operation is to be applied.  An operand specification may be the operand itself (a constant in the instruction, which is called immediate data), the name of a register (including a port address) that contains the data, the address of a PM location containing the data, or information that allows the CPU to calculate the location of the data.  The maximum number of operand specifications that computational instructions may have depends on the machine.  If the maximum is N, it is called an N-address machine.  The components of an instruction and the action of the instruction for the various values of N are:

N        Instruction                   Action
N = 4    op  d, s1, s2, addNxtInst     d = s1 op s2, and take the next instruction from PM at addNxtInst
N = 3    op  d, s1, s2                 d = s1 op s2
N = 2    op  d, s                      d = d op s
N = 1    op  s                         acc = acc op s
N = 0    op                            stack_top = stack_top op one_below_stack_top

Here, d stands for the destination operand and s1, s2, and s all stand for the source operands.  Both 4-address and 0-address machines are rare.  In a 4-address machine, every instruction contains the address of the next instruction.  In all the others, the address of the next instruction is normally the contents of the PC, which is incremented by the length of the current instruction immediately after that instruction is fetched from PM.  However, if the current instruction is a branch instruction, the address of the next instruction is specified by that instruction.  A 0-address machine is called a stack machine.  It has instructions that push data onto the stack and pop data off the stack, but all the operands of a computational instruction must be on the stack.  A 1-address machine has a special register, called the accumulator, which always contains one of the source operands and which is always the destination operand.  The ARM, MIPS, PowerPC, and most RISC chips are 3-address machines.  The Intel x86 chips are 2-address machines.  Many of the earliest computers and today's simple pocket calculators (s is always the keypad) are 1-address machines.
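To make the difference between the machine types concrete, the sketch below evaluates x = (a + b) * c on 3-, 2-, 1-, and 0-address machines.  The mnemonics (add, mul, mov, load, store, push, pop) and the tuple encoding are invented for this illustration and are not taken from any real instruction set.

```python
# Sketch: evaluating x = (a + b) * c on 3-, 2-, 1-, and 0-address machines.
# The mnemonics and the tuple encoding are illustrative assumptions.
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def run_three_address(mem):
    # op d, s1, s2  ->  d = s1 op s2
    program = [("add", "t", "a", "b"), ("mul", "x", "t", "c")]
    for op, d, s1, s2 in program:
        mem[d] = OPS[op](mem[s1], mem[s2])
    return len(program)


def run_two_address(mem):
    # op d, s  ->  d = d op s   (mov simply copies s into d)
    program = [("mov", "x", "a"), ("add", "x", "b"), ("mul", "x", "c")]
    for op, d, s in program:
        mem[d] = mem[s] if op == "mov" else OPS[op](mem[d], mem[s])
    return len(program)


def run_one_address(mem):
    # op s  ->  acc = acc op s
    program = [("load", "a"), ("add", "b"), ("mul", "c"), ("store", "x")]
    acc = 0
    for op, s in program:
        if op == "load":
            acc = mem[s]
        elif op == "store":
            mem[s] = acc
        else:
            acc = OPS[op](acc, mem[s])
    return len(program)


def run_zero_address(mem):
    # op  ->  pop the top two stack entries, push the result
    program = [("push", "a"), ("push", "b"), ("add",),
               ("push", "c"), ("mul",), ("pop", "x")]
    stack = []
    for instr in program:
        if instr[0] == "push":
            stack.append(mem[instr[1]])
        elif instr[0] == "pop":
            mem[instr[1]] = stack.pop()
        else:
            top, below = stack.pop(), stack.pop()
            stack.append(OPS[instr[0]](top, below))
    return len(program)


counts = {}
results = {}
for name, run in [("3-address", run_three_address), ("2-address", run_two_address),
                  ("1-address", run_one_address), ("0-address", run_zero_address)]:
    mem = {"a": 2, "b": 3, "c": 4}
    counts[name] = run(mem)
    results[name] = mem["x"]
    print(name, counts[name], "instructions, x =", mem["x"])
```

In this small example, fewer operand specifications per instruction means more instructions are needed to compute the same result.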

 

6.5 The Fetch-Execute Cycle

 

The operation of the CPU is basically an infinite sequence of fetch-execute cycles.  A fetch-execute cycle consists of two major parts.  In the first part (fetch), the next instruction is fetched from PM.  In the second part (execute), the operation specified by the opcode is performed.  In general, all of a RISC's instructions are the same length (e.g., 32 bits).  However, a CISC (e.g., Intel x86) usually has variable-length instructions.  In these machines, the first word of the instruction contains enough information for the CU to be able to determine the number of bytes in the complete instruction.  Let us look at the two steps in the fetch-execute cycle in somewhat more detail.

 

- Fetch:  The next instruction is fetched from PM.  The actions in this step are the same for every instruction.    

1) Send the contents of the PC (i.e., address of the next instruction) to the MAR and a read signal to the BIU.    

2) Increment the PC.  For a single-word instruction the increment is the length of the instruction (e.g., 2 or 4 bytes); for a multiword instruction it is the length of the first word (e.g., 1 or 2 bytes).

3) Wait, if necessary, for PM to return the instruction or the first word of the instruction.    

4) Copy the contents of the MBR to the IR.    

5) Additional fetches are required in the case of a multiword instruction.

 

- Execute:  Perform the actions specified by the instruction's opcode.  The actions in this step vary greatly depending on the class of instruction.  All the instructions in a particular class require the same actions.  Here we look at only three classes just to get a flavor of the kind of detailed steps required.    

1) Computational instructions when all operands are in registers.       

a) Transfer the source operands from their registers to the ALU.        

b) Send control signals to the ALU directing it to perform the operation specified in the instruction.       

c) Transfer the result from the ALU to the destination register.

2) Load or store a register from or into PM.        

a) Obtain the PM address.  This address may be in the instruction, in a register, or may need to be computed  from data in the instruction and/or one or more registers.        

b) Send the PM address to the MAR.            

i) If a load instruction, send a read signal to the BIU, wait for the item to be returned, and copy the contents of the MBR to the destination register.            

ii) If a store instruction, send the contents of the source register to the MBR and a write signal to the BIU.    

3) A conditional branch-type instruction.        

a) Compute the target address.        

b) If the condition is true then replace the contents of the PC with the target address.
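The fetch and execute steps listed above can be sketched as a small interpreter loop.  This is a simplified model, not a description of any real CPU: the PC, MAR, MBR, and IR are from the text, while the tuple encoding of instructions, the opcodes, the halt instruction, and the fixed 4-byte instruction length are assumptions made for the example.

```python
# Sketch of the fetch-execute cycle for a tiny 3-address machine.
# PC, MAR, MBR, and IR are from the text. The instruction encoding
# (Python tuples), the opcodes, and "halt" are illustrative assumptions.

INSTR_LEN = 4  # assume fixed-length 4-byte instructions, as in a RISC


def run(memory, regs, max_cycles=1000):
    pc = 0
    for cycles in range(1, max_cycles + 1):
        # Fetch: PC -> MAR, read, increment PC, MBR -> IR.
        mar = pc
        pc += INSTR_LEN
        mbr = memory[mar]
        ir = mbr

        # Execute: the actions depend on the class of instruction.
        opcode = ir[0]
        if opcode == "add":              # computational, register operands
            _, d, s1, s2 = ir
            regs[d] = regs[s1] + regs[s2]
        elif opcode == "sub":
            _, d, s1, s2 = ir
            regs[d] = regs[s1] - regs[s2]
        elif opcode == "load":           # register <- PM[base + offset]
            _, d, base, offset = ir
            mar = regs[base] + offset
            mbr = memory[mar]
            regs[d] = mbr
        elif opcode == "store":          # PM[base + offset] <- register
            _, s, base, offset = ir
            mar = regs[base] + offset
            mbr = regs[s]
            memory[mar] = mbr
        elif opcode == "bne":            # branch if the two registers differ
            _, s1, s2, target = ir
            if regs[s1] != regs[s2]:
                pc = target
        elif opcode == "halt":
            return cycles
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    raise RuntimeError("cycle limit reached")


# Sum the integers 3 + 2 + 1 and store the total at address 104.
memory = {
    0: ("load", "r1", "r0", 100),    # r1 = PM[100]
    4: ("add", "r2", "r2", "r1"),    # r2 = r2 + r1
    8: ("sub", "r1", "r1", "r3"),    # r1 = r1 - 1
    12: ("bne", "r1", "r0", 4),      # loop while r1 != 0
    16: ("store", "r2", "r0", 104),  # PM[104] = r2
    20: ("halt",),
    100: 3,
}
regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 1}
cycles = run(memory, regs)
print(memory[104], cycles)
```

Note that the PC is incremented during the fetch, so a branch that is taken simply overwrites the already incremented value.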

 

6.6 The Cache

 

All modern computers have one or more cache memories.  A cache is a much smaller, faster, and more expensive (per byte) memory than PM.  When referring to memory, fast means short access time and slow means longer access time.  Access time is the interval of time between the initiation of a memory request and its completion.  Items can be fetched from, and written to, a cache in much less time than if they were transferred from or to PM.  The general idea is that during a time period when a small set of instructions and data is being frequently referenced, these instructions and data are kept in the cache.  If a sufficiently high percentage of a program's references are to instructions and data that are in a cache, the execution time of the program will be significantly reduced.
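The effect of the hit percentage can be estimated with a simple weighted average.  The sketch below assumes that a hit costs one cache access and that a miss costs a cache access plus a PM access; the timings (1 and 50 time units) are made-up numbers chosen only to show the trend, not measurements of any machine.

```python
# Sketch: average memory access time as a function of the cache hit ratio.
# The weighted-average model and the timings are illustrative assumptions.

def average_access_time(hit_ratio, cache_time, pm_time):
    """A hit costs cache_time; a miss costs cache_time plus pm_time."""
    return hit_ratio * cache_time + (1 - hit_ratio) * (cache_time + pm_time)


for h in (0.0, 0.5, 0.9, 0.99):
    t = average_access_time(h, cache_time=1, pm_time=50)
    print(f"hit ratio {h:.2f}: {t:.2f} time units per access")
```

Under these assumed numbers, the average access time only approaches the cache's own access time when nearly all references are hits.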

 

The most recent computers have multilevel caches.  The primary cache (L1), which is closest to the registers in the CPU, is actually two separate caches, one for data and one for instructions.  The next level cache (L2) is a single cache that stores both data and instructions.  The L2 cache is larger and slower than either of the L1 caches.    

 

Instructions and data are transferred between PM and cache in blocks.  With modern PM, a block transfer is faster than transferring the same number of words one word at a time.  The block size depends on the particular implementation.  The size of a block is usually a power of 2, such as 8, 16, or 32 bytes.  Clearly the BIU and the PMC in a computer with a cache will be much different from the ones described above.  Chapter 9 explores caches in detail.
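A minimal sketch of block-at-a-time filling follows.  The 16-byte block size is one of the sizes mentioned above; the capacity of four blocks, the least-recently-used replacement policy, and the class name BlockCache are assumptions made for the example, since the text does not specify them.

```python
# Sketch: a tiny cache that is filled one block at a time.
# The capacity and the least-recently-used replacement policy are
# assumptions; only the block size comes from the text's examples.
from collections import OrderedDict

BLOCK_SIZE = 16   # bytes per block (a power of 2)
CAPACITY = 4      # number of blocks the cache can hold


class BlockCache:
    def __init__(self):
        self.blocks = OrderedDict()  # resident block numbers, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // BLOCK_SIZE       # which block holds this byte
        if block in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block)  # mark as most recently used
        else:
            self.misses += 1                # the whole block comes from PM
            if len(self.blocks) == CAPACITY:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[block] = True


cache = BlockCache()
for address in range(64):   # read 64 consecutive bytes
    cache.access(address)
print(cache.hits, cache.misses)
```

Reading 64 consecutive bytes touches only four blocks, so after the first reference to each block the remaining references to it are hits.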

 

6.7 The Pipeline

 

In a pipelined CPU, each microstep of an instruction's execution is done by a separate component, which is called a stage.  These stages are connected in series with a buffer between each pair of stages.  This arrangement of stages and buffers is called a pipeline.  The results of each step are passed to the next stage in the series via the buffer between them.  For any one instruction, only one of these stages is working during each clock (recall that each microstep requires one clock); the remaining stages are idle.  Pipelining is achieved by having these idle stages work on other instructions.  Thus if there are five stages, the CPU can simultaneously be doing different microsteps from five instructions.  A pipelined CPU does not execute an individual instruction any faster than a non-pipelined CPU, but the completion rate (throughput) is much higher.  This is because, once the pipeline is fully engaged, one instruction completes its execution every clock.  With the "right mix" of instructions, the throughput of a CPU with a five-stage pipeline would be five times that of a non-pipelined CPU.  Unfortunately the right mix does not always occur, so the increase in throughput, while enough to be worth the cost, will average out to be quite a bit less than a factor of five.
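The throughput argument can be checked with a little arithmetic.  The sketch below assumes an idealized pipeline in which the first instruction needs one clock per stage and every later instruction completes one clock after its predecessor; the stall_clocks parameter is a crude, hypothetical stand-in for the clocks lost when the instruction mix is not ideal.

```python
# Sketch: clocks needed with and without a pipeline (idealized model).
# The stall_clocks parameter is a hypothetical way to account for an
# instruction mix that is not ideal.

def clocks_unpipelined(n_instructions, n_stages):
    """Every instruction takes one clock per microstep, one after another."""
    return n_instructions * n_stages


def clocks_pipelined(n_instructions, n_stages, stall_clocks=0):
    """Fill the pipeline once, then complete one instruction per clock."""
    return n_stages + (n_instructions - 1) + stall_clocks


def speedup(n_instructions, n_stages, stall_clocks=0):
    return (clocks_unpipelined(n_instructions, n_stages)
            / clocks_pipelined(n_instructions, n_stages, stall_clocks))


print(f"ideal mix:   {speedup(1000, 5):.2f}")
print(f"with stalls: {speedup(1000, 5, stall_clocks=250):.2f}")
```

With 1000 instructions and five stages the ideal speedup is close to, but never quite, five, because the pipeline must first fill; any stalls lower it further.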

 

A pipeline is most effective in a RISC with at least one cache.  An example should show this.  Assume a RISC with separate instruction and data caches.  Recall that in a RISC all instructions are the same length, so only one fetch from the instruction cache is needed to get the next instruction.  Only the load and store instructions access the data memory; one operand is a memory location and the other operand is a register.  All the operands for the other instructions are registers.  Looking at the fetch-execute steps of the three instruction classes in Section 6.5 above, we can see how a five stage pipeline might be organized.  The actions of each of the five stages are:

 

1) Fetch the next instruction from the instruction cache and increment the PC.

2) Decode the instruction and, unless it is a load, fetch the source operands from the register file.

3) For all instructions except load, store, and branch, do the instruction's operation; for a load, store, or branch, compute a memory address.

4) For a load, fetch a word from the data cache using the address computed in stage 3; for a branch, replace the contents of the PC with the address computed in stage 3; for all other instructions, do nothing.        

5) For all instructions except load, store, and branch, write the ALU result to the register file; for a load, write the word fetched in stage 4 to the register file; for a store, write the word fetched in stage 2 to the data cache.  

Each stage in a pipeline should take approximately the same amount of time to do its part in the execution of an instruction.  The maximum of these times will then be the duration of the CPU's clock period.    
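The sketch below prints a simple space-time diagram for the five stages and derives the clock period from the slowest stage.  The stage abbreviations (IF, ID, EX, MEM, WB) are conventional labels chosen here for stages 1 through 5, and the stage times in nanoseconds are made-up numbers, not figures from these notes.

```python
# Sketch: which stage each instruction occupies during each clock, and the
# clock period implied by the slowest stage. Stage labels and timings are
# illustrative assumptions.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
STAGE_TIMES_NS = {"IF": 0.8, "ID": 0.6, "EX": 1.0, "MEM": 0.9, "WB": 0.5}

# The clock must be long enough for the slowest stage to finish.
clock_period_ns = max(STAGE_TIMES_NS.values())


def schedule(n_instructions):
    """Return rows[i][c]: the stage holding instruction i during clock c."""
    total_clocks = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_clocks
        for s, stage in enumerate(STAGES):
            row[i + s] = stage   # instruction i enters the pipeline at clock i
        rows.append(row)
    return rows


rows = schedule(4)
for i, row in enumerate(rows):
    print(f"I{i + 1}: " + " ".join(f"{cell:>3}" for cell in row))
print("clock period:", clock_period_ns, "ns")
```

Reading the diagram column by column shows several instructions in different stages during the same clock, which is where the gain in throughput comes from.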

 

When a CPU has two or more pipelines in parallel it is called superscalar.  More and more CPU chips are being designed to be superscalar.  For example, the PowerPC has four parallel pipelines.  With the right mix of instructions, these four pipelines could result in a four-fold increase in throughput compared to the same CPU with only one pipeline.  Again, the right mix does not always occur.  OLR 10 explores pipelines in more detail.

 

6.8  Input and Output

 

The IOC is the controller for both the keyboard, which provides text input, and the display, which accepts text output.  The IOC contains four registers: an 8-bit input register (INR), an 8-bit output register (OUTR), a control register (CONT), and a status register (STAT).  A print character is one that produces a visible mark (e.g., *, a, H, and $) when it is sent to a display or a printer.  When the key for a print character on the keyboard is pressed, the keyboard controller puts the ASCII code for that character into the INR.  ASCII is a 7-bit code.  In the computer, an ASCII code is usually stored in a byte.  In addition to the print characters, there are ASCII codes for control characters, such as line feed (often written as lf), form feed (ff), and carriage return (cr).

After the ASCII code has been put into the INR by the IOC, a program can fetch it by executing an instruction that reads from the INR port.  To display a print character, a program executes an instruction that writes the ASCII code to the OUTR port.  The IOC then takes the contents of the OUTR and sends it to the display.  Some control characters, when written into the OUTR, cause the display to do something other than displaying a character (e.g., cr causes the cursor to move to the beginning of the current screen line and lf causes the cursor to move down to the next line on the screen).

The type of IO and other aspects of the IOC are set by executing an instruction that writes the appropriate bit pattern to the CONT port.  The contents of STAT can be read by executing an instruction that reads the STAT port.  Various bits in this value indicate the outcome of the most recent input and output (e.g., parity error, no input available, output completed).  Chapter 8 explores input-output in detail.
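A program that echoes keyboard input to the display might poll the STAT port as in the sketch below.  The register names INR, OUTR, CONT, and STAT are from the text, but the bit assignments in STAT, the class name, and the read_port and write_port methods are hypothetical, since the notes do not define them.

```python
# Sketch: polling the IOC to echo keyboard input to the display.
# INR, OUTR, CONT, and STAT are from the text; the STAT bit layout and
# the port-access methods are hypothetical.

STAT_INPUT_READY = 0x01   # hypothetical bit: a character is waiting in INR
STAT_OUTPUT_DONE = 0x02   # hypothetical bit: the last output has completed


class IOC:
    def __init__(self, typed=""):
        self.keys = list(typed.encode("ascii"))  # pending key presses
        self.inr = 0
        self.outr = 0
        self.cont = 0
        self.stat = STAT_OUTPUT_DONE
        self.screen = []
        self._next_key()

    def _next_key(self):
        """Move the next key press, if any, into INR and update STAT."""
        if self.keys:
            self.inr = self.keys.pop(0)
            self.stat |= STAT_INPUT_READY
        else:
            self.stat &= ~STAT_INPUT_READY

    def read_port(self, port):
        if port == "STAT":
            return self.stat
        if port == "INR":
            value = self.inr
            self._next_key()
            return value
        raise ValueError(f"cannot read port {port}")

    def write_port(self, port, value):
        if port == "OUTR":
            self.outr = value & 0xFF
            self.screen.append(chr(self.outr))  # the IOC sends it to the display
        elif port == "CONT":
            self.cont = value
        else:
            raise ValueError(f"cannot write port {port}")


def echo(ioc):
    """Copy every available input character to the display."""
    while ioc.read_port("STAT") & STAT_INPUT_READY:
        code = ioc.read_port("INR")
        ioc.write_port("OUTR", code)
    return "".join(ioc.screen)


ioc = IOC("Hi\r\n")
shown = echo(ioc)
print(repr(shown))
```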

 

6.9 Secondary Storage

 

A disk, shown in the diagram above, is a secondary storage device.  The disk is often called an IO device since, from the CPU's standpoint, its behavior is similar to that of an IO device such as a keyboard or a display; that is, the disk has a device controller that has data, control, and status registers that are all accessed via ports on the system bus.  A disk is also characterized as external memory, since it is a form of random access memory similar to PM, but is accessed like an IO device, which is classified as an external device.  Access to a true IO device is strictly sequential.  A disk is a non-volatile memory in contrast to PM, which is volatile memory.  Any data in a volatile memory is lost when power to the memory is shut off, while data in a non-volatile memory remains until it is explicitly erased even if the power is off.

The DMA (direct memory access) controller in the diagram above makes possible the transfer of a block of data between the disk and PM without continual involvement of the CPU.  The CPU sets up the transfer by first sending the disk address of the beginning of the block to the disk controller's ADD port and the direction of transfer to its CTRL port.  Next the CPU sends the PM address of the beginning of the block to the DMA's ADD port, the number of items in the block to its CNT port, and the direction of transfer to its CTRL port.  Once the DMA's CTRL port has been set, the DMA takes control of the transfer and the CPU then goes on to execute other instructions in parallel with the operation of the DMA.  The DMA communicates with the disk controller and emits the necessary signals to the disk controller and PM to effect the transfer of each item in the block.  To effect the block transfer, the CPU needs to execute only a few instructions to start the transfer.

To transfer a block of bytes between PM and a true IO device, such as the display, the CPU would need to execute at least two instructions for every byte in the block: one to fetch the byte from PM and one to send the byte to the display's OUTR port.  Chapter 9 explores secondary storage in detail.
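The DMA setup sequence can be sketched as follows.  The port names ADD, CNT, and CTRL are from the text; the class names, the direction codes, the list standing in for the disk's contents, and the run method (which in hardware would proceed in parallel with the CPU) are assumptions made for the example.

```python
# Sketch: the CPU sets up a DMA transfer with a few port writes, then the
# DMA moves the block. ADD, CNT, and CTRL are from the text; the classes
# and the direction codes are hypothetical.

DISK_TO_PM = 1   # hypothetical direction code
PM_TO_DISK = 2   # hypothetical direction code


class DiskController:
    def __init__(self, data):
        self.data = data   # stands in for the disk's contents
        self.add = 0       # ADD port: disk address of the block
        self.ctrl = 0      # CTRL port: direction of transfer


class DMA:
    def __init__(self, disk, pm):
        self.disk = disk
        self.pm = pm
        self.add = 0       # ADD port: PM address of the block
        self.cnt = 0       # CNT port: number of items in the block
        self.ctrl = 0      # CTRL port: direction of transfer

    def run(self):
        """Transfer the block item by item, with no CPU involvement."""
        for i in range(self.cnt):
            if self.ctrl == DISK_TO_PM:
                self.pm[self.add + i] = self.disk.data[self.disk.add + i]
            elif self.ctrl == PM_TO_DISK:
                self.disk.data[self.disk.add + i] = self.pm[self.add + i]
        self.cnt = 0


def start_transfer(disk, dma, disk_address, pm_address, count, direction):
    """The port writes the CPU executes; returns how many there were."""
    disk.add = disk_address
    disk.ctrl = direction
    dma.add = pm_address
    dma.cnt = count
    dma.ctrl = direction   # setting CTRL last starts the transfer
    return 5


disk = DiskController(list(range(100, 200)))
pm = [0] * 64
dma = DMA(disk, pm)
cpu_writes = start_transfer(disk, dma, disk_address=10, pm_address=8,
                            count=16, direction=DISK_TO_PM)
dma.run()   # in hardware this would overlap with other CPU instructions
print(cpu_writes, pm[8], pm[23])
```

In this sketch the CPU performs five port writes regardless of the block size, whereas moving the same 16 items without DMA would take at least two instructions per item.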

Adapted from © 2000 Robert M. Graham