OLR6:
The CPU
The CPU is sometimes called the central processor or just the processor. However, many of today's high-powered
computers contain multiple CPUs, so talking about "the CPU" may not always be
completely accurate. Nonetheless,
most of the discussion in these notes applies equally well to either a single-CPU
system or a multi-CPU system. The CPU components shown in the diagram above are
discussed in the following subsections.
In addition, modern computers have at least one cache interposed between
the CPU and PM. This is not shown
in the diagram above, but will be considered below in Section 6.6. Most modern CPUs are organized
internally as one or more pipelines.
This will be discussed below in Section 6.7.
An
internal bus connects all of the components in the CPU. Data is transferred between components
using this bus and control signals are sent from the CU to other components
using control lines, which are not shown in the diagram above. The program
counter (PC) usually contains the address of the next instruction to be
executed. The user register file
(URF) is a set of very fast access registers that are used as short-term storage
for the partial results of a computation, the stack pointer, array indexes, and
so forth. In some machines the PC
is also part of the URF. The instruction register (IR) holds the instruction
that is currently being executed.
Logically it is part of the CU. The arithmetic-logic unit (ALU) is the
component that performs all of the operations specified by the arithmetic and
logical instructions.
Here
we examine the major components of the Central Processing Unit (CPU) in more
detail. Much of this discussion
will be in general terms so it will apply to most modern CPUs, but the model we
use will be a simple implementation of an ARM-like CPU. The figure below shows the components
and structure of our simple model (without cache, pipelining, ...).

![]()


Here the User Register File (URF) is expanded and the ARM register names are
used. In addition to the IR, MAR, and MDR that we spoke of earlier, which are
non-visible registers, three additional non-visible registers, Y, Z, and T, have
been added. The control lines, which carry the control signals from the CU to
every component of the CPU, are not shown. They will be discussed below.
6.1 The Arithmetic-Logic Unit
The
ALU's inputs are the value in the Y register and the value on the CPU bus which
passes through the BS. The ALU's
output is latched into the Z register.

The single control line to the Add-Subtract unit selects either addition or
subtraction. The ALU is a combinational circuit that performs all operations
simultaneously. The control lines to the MUX select the result of the desired
operation and pass that single value on to the Z register. All ALU operations
are register to register. To perform an ALU operation, the left source operand
is loaded into Y. In the next clock period, the right source operand is put on
the bus and passes through the BS (where it may or may not be shifted), the
appropriate ALU control lines are asserted, and the selected result is latched
into Z. In the third clock period, the contents of Z are transferred to the
destination register. Three clocks are required because of a fundamental
property of buses: there can never be more than one value on a bus during a
single clock period. Since three different values have to be transferred on the
bus to accomplish an ALU operation, the minimum time required is three clocks.
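The three-clock sequence can be sketched in a few lines of Python. This is only an illustrative model, not the actual hardware: the class name, the register names r0 to r2, the 32-bit width, and the treatment of the BS as a plain left shift are all assumptions made for the example.

```python
class SingleBusDatapath:
    """Toy model: one shared bus, so one value transfer per clock."""

    def __init__(self):
        self.regs = {"r0": 0, "r1": 0, "r2": 0}  # register names are illustrative
        self.Y = 0        # latch for the left ALU operand
        self.Z = 0        # latch for the ALU result
        self.clocks = 0   # clock periods consumed so far

    def _bus(self, value):
        # Putting a value on the bus occupies the bus for one clock period.
        self.clocks += 1
        return value

    def alu_op(self, op, dest, left, right, shift=0):
        # Clock 1: left source operand -> bus -> Y.
        self.Y = self._bus(self.regs[left])
        # Clock 2: right source operand -> bus -> BS -> ALU, result -> Z.
        b = self._bus(self.regs[right]) << shift   # BS modeled as a left shift
        # The combinational ALU produces every result at once ...
        results = {"add": self.Y + b, "sub": self.Y - b,
                   "and": self.Y & b, "or": self.Y | b}
        # ... and the MUX control lines select one of them for Z.
        self.Z = results[op] & 0xFFFFFFFF          # assume 32-bit registers
        # Clock 3: Z -> bus -> destination register.
        self.regs[dest] = self._bus(self.Z)


dp = SingleBusDatapath()
dp.regs["r1"], dp.regs["r2"] = 7, 5
dp.alu_op("add", "r0", "r1", "r2")
print(dp.regs["r0"], dp.clocks)
```

Each call to alu_op uses the bus exactly three times, which is why the clock counter advances by three per ALU instruction in this model.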
6.2 The Control Unit
It
is the control unit (CU) within the CPU that coordinates the actions of the
CPU's components, and to some extent, certain actions of the PM and the IO
devices. At the heart of the CU is
a clock that emits a square wave of a fixed frequency. One cycle of the clock output is
referred to as a clock period, usually just called a clock. Using the clock's output, the CU sends
control signals to various parts of the CPU. These signals cause various actions to
take place at specified times.
Every instruction consists of a sequence of microsteps, each of which
takes place in one clock. Most
instructions require more than one clock to complete their execution. To cause the execution of an
instruction, the CU sends out the appropriate set of control signals during each
clock.
6.3 The Bus Interface Unit
The bus interface unit (BIU) is used to communicate with PM and the IO devices.
The contents of the memory address register (MAR) are the address that is put
on the address bus, and the contents of the memory buffer register (MBR) are
the item that is put onto, or taken off, the data bus. The control lines in the
CPU are not shown, but there are control lines to the BIU that supply the
control signals that the BIU puts on the control bus.
To perform a PM read access (or fetch), the CU puts the PM address into
the MAR and sends a read signal to the BIU, which then puts the contents of the
MAR on the address bus and the read signal on the control bus. When the BIU receives the requested
item, it is copied into the MBR. To
perform a PM write access (or store), the CU puts the PM address into the MAR,
the item to be stored into the MBR, and sends a write signal to the BIU, which
then puts the contents of the MAR on the address bus, the contents of the MBR on
the data bus, and the write signal on the control bus.
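The read and write sequences above can be mimicked with a small Python sketch. The class and method names are hypothetical, PM is modeled as a simple list of words, and the three buses are collapsed into the arguments of one method call, so this shows only the order of events, not real bus timing.

```python
class PrimaryMemory:
    """PM modeled as a list of words (an assumption for illustration)."""

    def __init__(self, size):
        self.cells = [0] * size

    def access(self, address, control, data=None):
        # 'address', 'control', and 'data' stand in for the three buses.
        if control == "read":
            return self.cells[address]
        if control == "write":
            self.cells[address] = data
            return None
        raise ValueError(f"unknown control signal {control!r}")


class BusInterfaceUnit:
    def __init__(self, pm):
        self.pm = pm
        self.MAR = 0   # memory address register
        self.MBR = 0   # memory buffer register

    def read(self):
        # MAR -> address bus, read -> control bus, returned item -> MBR.
        self.MBR = self.pm.access(self.MAR, "read")

    def write(self):
        # MAR -> address bus, MBR -> data bus, write -> control bus.
        self.pm.access(self.MAR, "write", self.MBR)


pm = PrimaryMemory(16)
biu = BusInterfaceUnit(pm)

# Store: the CU loads MAR and MBR, then signals a write.
biu.MAR, biu.MBR = 5, 99
biu.write()

# Fetch: the CU loads MAR, signals a read, and the item arrives in MBR.
biu.MBR = 0
biu.MAR = 5
biu.read()
print(biu.MBR)
```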
6.4 Instruction Set Architecture
The instruction set architecture for a computer specifies all of its instructions,
their formats, the complete effect of executing each instruction, all of the
visible registers (a visible register is one that can be directly accessed by a
program), and any other aspects that affect how the computer is programmed. The
instructions can be grouped into four classes:
1) Data processing instructions, such as integer and floating point arithmetic, logical and shifting, and multiply-accumulate (used in DSP, i.e., digital signal processing).
2) Data movement instructions, such as move between two registers, move between register and memory, and input-output.
3) Control flow instructions, such as branching, procedure call and return, and looping.
4) Special instructions, such as switching between user and system mode, cache management, and exception management.
An
instruction consists of an opcode and zero or more operand specifications. The opcode specifies the operation to be
done by the instruction and the operand specifications indicate where to find
the data to which the operation is to be applied. An operand specification may be the
operand itself (a constant in the instruction, which is called immediate data),
the name of a register (including a port address) that contains the data, the
address of a PM location containing the data, or information that allows the CPU
to calculate the location of the data.
The maximum number of operand specifications that computational
instructions may have depends on the machine. If the maximum is N, it is called an
N-address machine. The components of an instruction and the action of the
instruction for the various values of N are:
N        Instruction                 Action
N = 4    op d, s1, s2, addNxtInst    d = s1 op s2 and take the next instruction from PM at addNxtInst
N = 3    op d, s1, s2                d = s1 op s2
N = 2    op d, s                     d = d op s
N = 1    op s                        acc = acc op s
N = 0    op                          stack_top = stack_top op one_below_stack_top
Here, d stands for the destination operand and s1, s2, and s all stand for the
source operands. Both 4-address and
0-address machines are rare. In a
4-address machine, every instruction contains the address of the next
instruction. In all the others, the
address of the next instruction is normally the contents of the PC, which is
incremented by the length of the current instruction immediately after that
instruction is fetched from PM. However, if the current instruction is a branch
instruction, the address of the next instruction is specified by that
instruction. A 0-address machine is
called a stack machine. It has
instructions that push data onto the stack and pop data off the stack, but all
the operands of a computational instruction must be on the stack. A 1-address machine has a special
register, called the accumulator, which always contains one of the source
operands and which is always the destination operand. The ARM, MIPS, PowerPC, and most RISC
chips are 3-address machines. The Intel x86 chips are 2-address machines.
Many of the earliest computers and today's simple pocket calculators (where s
is always the keypad) are 1-address machines.
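To make the difference between N-address machines concrete, the sketch below evaluates x = (a + b) * c on toy 3-address, 1-address, and 0-address interpreters. The instruction names (load, store, push, pop, add, mul), the tuple encoding, and the use of a dictionary for storage are all assumptions made for illustration; they do not correspond to any real instruction set.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_three_address(program, store):
    # op d, s1, s2  ->  d = s1 op s2
    for op, d, s1, s2 in program:
        store[d] = OPS[op](store[s1], store[s2])
    return store


def run_one_address(program, store):
    # op s  ->  acc = acc op s ; load/store move data to and from acc
    acc = 0
    for op, s in program:
        if op == "load":
            acc = store[s]
        elif op == "store":
            store[s] = acc
        else:
            acc = OPS[op](acc, store[s])
    return store


def run_zero_address(program, store):
    # op  ->  stack_top = stack_top op one_below_stack_top
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(store[instr[1]])
        elif op == "pop":
            store[instr[1]] = stack.pop()
        else:
            top = stack.pop()
            below = stack.pop()
            stack.append(OPS[op](top, below))
    return store


def data():
    return {"a": 2, "b": 3, "c": 4, "t": 0, "x": 0}


three = run_three_address([("add", "t", "a", "b"), ("mul", "x", "t", "c")], data())
one = run_one_address([("load", "a"), ("add", "b"), ("mul", "c"), ("store", "x")], data())
zero = run_zero_address([("push", "a"), ("push", "b"), ("add",),
                         ("push", "c"), ("mul",), ("pop", "x")], data())
print(three["x"], one["x"], zero["x"])
```

All three programs compute the same value; what changes is how many instructions are needed and how much operand information each instruction carries.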
6.5 The Fetch-Execute Cycle
The operation of the CPU is basically an infinite sequence of fetch-execute
cycles. A fetch-execute cycle consists of two major parts. In the first part
(fetch), the next instruction is fetched from PM. In the second part (execute),
the operation specified by the opcode is performed. In general, all of a RISC's
instructions are the same length (e.g., 32 bits). However, a CISC (e.g., Intel
x86) usually has varying length instructions. In these machines, the first word
of the instruction contains enough information for the CU to be able to
determine the number of bytes in the complete instruction. Let us look at the
two parts of the fetch-execute cycle in somewhat more detail.
- Fetch: The next instruction is fetched from PM. The actions in this step are the same for every instruction.
1) Send the contents of the PC (i.e., the address of the next instruction) to the MAR and a read signal to the BIU.
2) Increment the PC: for a single-word instruction the increment is the length of the instruction (e.g., 2 or 4 bytes), and for a multiword instruction it is the length of the first word (e.g., 1 or 2 bytes).
3) Wait, if necessary, for PM to return the instruction or the first word of the instruction.
4) Copy the contents of the MBR to the IR.
5) Perform additional fetches in the case of a multiword instruction.
- Execute: Perform the actions specified by the instruction's opcode. The actions in this step vary greatly depending on the class of instruction. All the instructions in a particular class require the same actions. Here we look at only three classes just to get a flavor of the kind of detailed steps required.
1) Computational instructions when all operands are in registers.
a) Transfer the source operands from their registers to the ALU.
b) Send control signals to the ALU directing it to perform the operation specified in the instruction.
c) Transfer the result from the ALU to the destination register.
2) Load a register from PM or store a register into PM.
a) Obtain the PM address. This address may be in the instruction, in a register, or may need to be computed from data in the instruction and/or one or more registers.
b) Send the PM address to the MAR.
i) If a load instruction, send a read signal to the BIU, wait for the item to be returned, and copy the contents of the MBR to the destination register.
ii) If a store instruction, send the contents of the source register to the MBR and a write signal to the BIU.
3) A conditional branch-type instruction.
a) Compute the target address.
b) If the condition is true then replace the contents of the PC with the target address.
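The fetch and execute steps listed above can be put together in a minimal interpreter. This is a sketch under several assumptions that are not part of the notes: PM is word-addressed, every instruction occupies exactly one word (so the PC is incremented by 1), instructions are encoded as Python tuples, and the opcodes (load, store, add, sub, bne, halt) and register names are invented for the example.

```python
def run(pm, max_cycles=1000):
    regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 0}
    pc = 0
    for _ in range(max_cycles):
        # ---- Fetch ----
        mar = pc          # 1) PC -> MAR, read signal to the BIU
        pc += 1           # 2) increment the PC (one word per instruction)
        mbr = pm[mar]     # 3) PM returns the instruction
        ir = mbr          # 4) MBR -> IR
        opcode, *args = ir

        # ---- Execute ----
        if opcode == "halt":
            break
        elif opcode in ("add", "sub"):        # computational class
            d, s1, s2 = args
            y = regs[s1]
            z = y + regs[s2] if opcode == "add" else y - regs[s2]
            regs[d] = z
        elif opcode == "load":                # load class
            d, address = args
            regs[d] = pm[address]
        elif opcode == "store":               # store class
            s, address = args
            pm[address] = regs[s]
        elif opcode == "bne":                 # conditional branch class
            s1, s2, target = args
            if regs[s1] != regs[s2]:
                pc = target
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return regs


# Program: sum the integers n, n-1, ..., 1 where n is stored at address 8.
pm = [
    ("load", "r1", 8),          # 0: r1 = n
    ("load", "r2", 9),          # 1: r2 = 1
    ("add", "r3", "r3", "r1"),  # 2: r3 = r3 + r1
    ("sub", "r1", "r1", "r2"),  # 3: r1 = r1 - 1
    ("bne", "r1", "r0", 2),     # 4: loop while r1 != 0 (r0 stays 0)
    ("store", "r3", 10),        # 5: PM[10] = r3
    ("halt",),                  # 6
    None,                       # 7: unused
    5,                          # 8: n
    1,                          # 9: constant 1
    0,                          # 10: result
]
final_regs = run(pm)
print(pm[10])
```

Note that the PC is incremented during the fetch, before the instruction executes, so a taken branch simply overwrites the already incremented PC.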
6.6 The Cache
All modern computers have one or more cache memories.
A cache is a much smaller, faster, and more expensive memory than PM. When referring to memory, fast means
short access time and slow means longer access time. Access time is the interval
of time between the initiation of a memory request and its completion. Items can be fetched from, and written
to, a cache in much less time than if they were transferred from or to PM. The general idea is that during a time
period when a small set of instructions and data is being frequently referenced,
these instructions and data are kept in the cache. If a sufficiently high percentage of a
program's references are to instructions and data that are in a cache, the
execution time of the program will be significantly reduced.
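A rough calculation shows why the percentage of references found in the cache matters so much. The sketch below uses a deliberately simple model in which a hit costs the cache access time and a miss costs the PM access time; the function name and the timing figures (1 ns for the cache, 50 ns for PM) are hypothetical values chosen only for illustration.

```python
def average_access_time(hit_ratio, t_cache, t_pm):
    """Average time per reference under a simple hit-or-miss model."""
    return hit_ratio * t_cache + (1.0 - hit_ratio) * t_pm


T_CACHE_NS = 1.0   # assumed cache access time
T_PM_NS = 50.0     # assumed PM access time

for h in (0.0, 0.80, 0.95, 0.99):
    t = average_access_time(h, T_CACHE_NS, T_PM_NS)
    print(f"hit ratio {h:.2f}: average access time {t:.2f} ns")
```

With these assumed numbers, even a modest hit ratio cuts the average access time to a fraction of the PM access time, and the remaining misses dominate the average as the hit ratio approaches 1.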
The most
recent computers have multilevel caches.
The primary cache (L1), which is closest to the registers in the CPU, is
actually two separate caches, one for data and one for instructions. The next level cache (L2) is a single
cache that stores both data and instructions. The L2 cache is larger and slower than
either of the L1 caches.
Instructions and data are
transferred between PM and cache in blocks. With modern PM, a block transfer is
faster than transferring the same number of words one word at a time. The block size depends on the particular
implementation. The size of a block
is usually a power of 2, such as 8, 16, or 32 bytes. Clearly the BIU and the PMC in a
computer with a cache will be much different from the ones described above. Chapter 9 explores caches in
detail.
6.7 The Pipeline
In a pipelined CPU, each microstep of an instruction's execution is done by a
separate component, which is called a stage. These stages are connected in
series with a buffer between each pair of stages. This arrangement of stages and
buffers is called a pipeline. The results of each step are passed to the next
stage in the series via the buffer between them. While a single instruction is
being executed, only one of these stages is working during each clock (recall
that each microstep requires one clock); the remaining stages are idle.
Pipelining is achieved by having these idle stages work on other instructions.
Thus if there are five stages, the CPU can simultaneously be doing different
microsteps from five instructions. A pipelined CPU does not execute an
individual instruction any faster than a non-pipelined CPU, but the completion
rate (throughput) is much higher. This is because, once the pipeline is fully
engaged, one instruction completes its execution every clock. With the "right
mix" of instructions, the throughput for a CPU with a five stage pipeline would
be five times that of a non-pipelined CPU. Unfortunately the right mix does not
always occur, so the increase in throughput, while enough to be worth the cost,
will average out to be quite a bit less than a factor of five.
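The best-case throughput gain can be checked with a little arithmetic. The sketch below assumes an ideal pipeline (every instruction uses every stage for exactly one clock and nothing ever stalls); the function names are made up for the example.

```python
def clocks_non_pipelined(n_instructions, n_stages):
    # Each instruction runs all of its microsteps before the next one starts.
    return n_instructions * n_stages


def clocks_pipelined(n_instructions, n_stages):
    # The first instruction needs n_stages clocks to pass through the pipeline;
    # after that, one instruction completes on every clock.
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    return (clocks_non_pipelined(n_instructions, n_stages)
            / clocks_pipelined(n_instructions, n_stages))


for n in (1, 5, 100, 1000):
    print(f"{n:5d} instructions: speedup {speedup(n, 5):.2f}")
```

Under these assumptions the speedup approaches, but never reaches, the number of stages, because of the clocks spent filling the pipeline. Real instruction mixes fall further short.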
A pipeline
is most effective in a RISC with at least one cache. An example should show this. Assume a RISC with separate instruction
and data caches. Recall that in a
RISC all instructions are the same length, so only one fetch from the
instruction cache is needed to get the next instruction. Only the load and store instructions
access the data memory; one operand is a memory location and the other operand
is a register. All the operands for
the other instructions are registers.
Looking at the fetch-execute steps of the three classes of instructions in Section
6.5 above, we can see how a five stage pipeline might be organized. The actions of each of the five stages are:
1) Fetch the next instruction from
the instruction cache and increment the PC.
2) Decode the instruction and, unless it is a load, fetch the source
operands from the register file.
3) For all instructions except load, store, and branch, do the instruction's operation; for a load, store, or branch, compute a memory address.
4) For a load, fetch a word from
the data cache using the address computed in stage 3; for a branch, replace the
contents of the PC with the address computed in stage 3; for all other
instructions, do nothing.
5) For all instructions except
load, store, and branch, write the ALU result to the register file; for a load,
write the word fetched in stage 4 to the register file; for a store, write the
word fetched in stage 2 to the data cache.
Each stage
in a pipeline should take approximately the same amount of time to do its part
in the execution of an instruction.
The maximum of these times will then be the duration of the CPU's clock
period.
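The sketch below illustrates both points: how instructions overlap in a five stage pipeline, and how the slowest stage sets the clock period. The stage labels (IF, ID, EX, MEM, WB) are conventional names assumed here for the five stages listed above, and the stage delays are hypothetical numbers.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # assumed labels for stages 1 to 5


def pipeline_diagram(n_instructions, stages=STAGES):
    """Row i lists, clock by clock, the stage that instruction i occupies."""
    total_clocks = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["--"] * total_clocks
        for s, name in enumerate(stages):
            row[i + s] = name   # instruction i enters stage s at clock i + s
        rows.append(row)
    return rows


def clock_period(stage_delays):
    # Every stage must finish within one clock, so the slowest stage decides.
    return max(stage_delays)


for i, row in enumerate(pipeline_diagram(3)):
    print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))

delays_ns = [2, 1, 3, 2, 1]   # hypothetical per-stage delays
print("clock period:", clock_period(delays_ns), "ns")
print("time for one instruction:", len(delays_ns) * clock_period(delays_ns), "ns")
```

With unbalanced stages like these, an instruction spends 15 ns in the pipeline even though its stages only need 9 ns of work in total, which is why the stages should take approximately equal time.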
When a CPU
has two or more pipelines in parallel it is called superscalar. More and more CPU chips are being
designed to be superscalar. For
example, the PowerPC has four parallel pipelines. With the right mix of
instructions, these four pipelines could result in a four-fold increase in throughput
compared to the same CPU with only one pipeline. Again, the right mix does not always
occur. OLR 10 explores pipelines in
more detail.
6.8 Input and Output
The IOC is
the controller for both the keyboard, which provides text input, and the
display, which accepts text output.
The IOC contains four registers: an 8-bit input register (INR), an 8-bit
output register (OUTR), a control register (CONT), and a status register
(STAT). A print character is one
that produces a visible mark (e.g., * a H and $) when it is sent to a display or
a printer. When the key for a print
character on the keyboard is pressed, the IOC puts the ASCII
code for that character into the INR.
ASCII is a 7-bit code. In
the computer, an ASCII code is usually stored in a byte. In addition to the print characters,
there are ASCII codes for control characters, such as line feed (often written
as lf), form feed (ff), and carriage return (cr). After the ASCII code has been
put into the INR by the IOC, a program can fetch it by executing an instruction
that reads from the INR port. To
display a print character, a program executes an instruction that writes the
ASCII code to the OUTR port. The
IOC then takes the contents of the OUTR and sends it to the display. Some control characters, when written
into the OUTR, cause the display to do something other than displaying a
character (e.g., cr causes the cursor to move to the beginning of the current
screen line and lf causes the cursor to move down to the next line on the
screen). The type of IO and other
aspects of the IOC are set by executing an instruction that writes the
appropriate bit pattern to the CONT port.
The contents of STAT can be read by executing an instruction that reads
the STAT port. Various bits in this
value indicate the outcome of the most recent input and output (e.g., parity
error, no input available, output completed). Chapter 8 explores input-output in
detail.
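The way a program uses the four IOC registers can be sketched as follows. The notes do not give the bit layout of STAT, so the two status bits below (INPUT_READY and OUTPUT_DONE), the class and method names, and the list standing in for the display are all assumptions made for this example.

```python
INPUT_READY = 0x01   # hypothetical STAT bit: a character is waiting in INR
OUTPUT_DONE = 0x02   # hypothetical STAT bit: last output has completed


class IOC:
    def __init__(self):
        self.INR = 0
        self.OUTR = 0
        self.CONT = 0
        self.STAT = OUTPUT_DONE
        self.display = []            # stands in for the screen

    def key_pressed(self, ch):
        # Device side: the 7-bit ASCII code is placed in the 8-bit INR.
        self.INR = ord(ch) & 0x7F
        self.STAT |= INPUT_READY

    def read_port(self, port):
        if port == "STAT":
            return self.STAT
        if port == "INR":
            self.STAT &= ~INPUT_READY   # the character has been consumed
            return self.INR
        raise ValueError(f"cannot read port {port!r}")

    def write_port(self, port, value):
        if port == "OUTR":
            self.OUTR = value & 0xFF
            self.display.append(chr(self.OUTR))   # IOC forwards OUTR to display
            self.STAT |= OUTPUT_DONE
        elif port == "CONT":
            self.CONT = value
        else:
            raise ValueError(f"cannot write port {port!r}")


def echo_once(ioc):
    """Program side: poll STAT, then copy one character from INR to OUTR."""
    if ioc.read_port("STAT") & INPUT_READY:
        ioc.write_port("OUTR", ioc.read_port("INR"))
        return True
    return False


ioc = IOC()
ioc.key_pressed("H")
echo_once(ioc)
print(ioc.display)
```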
6.9 Secondary Storage
A disk, shown in the diagram above, is a secondary storage device. The disk is
often called an IO device since, from the CPU's standpoint, its behavior is
similar to that of an IO device such as a keyboard or a display: the disk has a
device controller with data, control, and status registers that are all
accessed via ports on the system bus. A disk is also characterized as external
memory, since it is a form of random access memory similar to PM, but is
accessed like an IO device, which is classified as an external device. Access
to a true IO device is strictly sequential. A disk is a non-volatile memory in contrast to
PM, which is volatile memory. Any
data in a volatile memory is lost when power to the memory is shut off, while
data in a non-volatile memory remains until it is explicitly erased even if the
power is off. The
DMA (direct memory access) unit in the diagram above makes possible the transfer of
a block of data between the disk and PM without continual involvement of the
CPU. The CPU sets up the transfer
by first sending the disk address of the beginning of the block to the disk
controller's ADD port and the direction of transfer to its CTRL port. Next the CPU sends the PM address of the
beginning of the block to the DMA's ADD port, the number of items in the block
to its CNT port, and the direction of transfer to its CTRL port. Once the DMA's CTRL port has been set, the
DMA takes control of the transfer and the CPU then goes on to execute other
instructions in parallel with the operation of the DMA. The DMA communicates with the disk
controller and emits the necessary signals to the disk controller and PM to
effect the transfer of each item in the block. The CPU thus needs to execute
only a few instructions to start the block transfer. By contrast, to transfer a block of bytes between PM
and a true IO device, such as the display, the CPU would need to execute at
least two instructions for every byte in the block: one to fetch the byte from
PM and one to send the byte to the display's OUTR port. Chapter 9 explores secondary storage in
detail.
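The setup sequence for a DMA transfer, and the saving in CPU instructions, can be sketched as follows. The port names ADD, CNT, and CTRL come from the text; everything else (the class names, the direction strings, counting one CPU instruction per port write, and running the transfer synchronously rather than in parallel with the CPU) is a simplifying assumption for this example.

```python
class DiskController:
    def __init__(self, contents):
        self.data = list(contents)   # stands in for the disk surface
        self.ADD = 0                 # disk address of the start of the block
        self.CTRL = None             # direction of transfer


class DMA:
    def __init__(self, disk, pm):
        self.disk = disk
        self.pm = pm
        self.ADD = 0                 # PM address of the start of the block
        self.CNT = 0                 # number of items in the block
        self.CTRL = None             # direction of transfer

    def start(self, direction):
        # Setting CTRL hands the transfer over to the DMA. In hardware this
        # loop would run in parallel with the CPU; here it runs immediately.
        self.CTRL = direction
        for i in range(self.CNT):
            if direction == "disk_to_pm":
                self.pm[self.ADD + i] = self.disk.data[self.disk.ADD + i]
            else:
                self.disk.data[self.disk.ADD + i] = self.pm[self.ADD + i]


def cpu_start_transfer(disk, dma, disk_addr, pm_addr, count, direction):
    """Return the number of port writes the CPU performs to start a transfer."""
    port_writes = 0
    disk.ADD = disk_addr
    port_writes += 1
    disk.CTRL = direction
    port_writes += 1
    dma.ADD = pm_addr
    port_writes += 1
    dma.CNT = count
    port_writes += 1
    dma.start(direction)             # the write to the DMA's CTRL port
    port_writes += 1
    return port_writes


def programmed_io_instructions(n_bytes):
    # Without DMA: one instruction to fetch each byte and one to send it.
    return 2 * n_bytes


pm = [0] * 16
disk = DiskController(range(100, 132))
dma = DMA(disk, pm)
writes = cpu_start_transfer(disk, dma, disk_addr=4, pm_addr=8, count=4,
                            direction="disk_to_pm")
print(writes, pm[8:12], programmed_io_instructions(512))
```

In this model the CPU's cost of starting a DMA transfer is the same five port writes regardless of block size, whereas the programmed IO count grows with every byte.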
Adapted
from © 2000 Robert M. Graham