Understanding some simple processor-performance limits

by P. G. Emma
To understand processor performance, it is essential to use metrics that are intuitive, and it is essential to be familiar with a few aspects of a simple scalar pipeline before attempting to understand more complex structures. This paper shows that cycles per instruction (CPI) is a simple dot product of event frequencies and event penalties, and that it is far more intuitive than its more popular cousin, instructions per cycle (IPC). CPI is separable into three components that account for the inherent work, the pipeline, and the memory hierarchy, respectively. Each of these components is a fixed upper limit, or "hard bound," for the superscalar equivalent components. In the last decade, the memory-hierarchy component has become the most dominant of the three components, and in the next decade, queueing at the memory data bus will become a very significant part of this. In a reaction to this trend, an evolution in bus protocols will ensue. This paper provides a general sketch of those protocols. An underlying theme in this paper is that power constraints have been a driving force in computer architecture since the first computers were built fifty years ago. In CMOS technology, power constraints will shape future microarchitecture in a positive and surprising way. Specifically, a resurgence of the RISC approach is expected in high-performance design, which will cause the client and server microarchitectures to converge.
1. Introduction
As
late as the 17th century in many European universities,
only the very best students were told that they could
someday hope to conquer long division if they applied
themselves. At that time, Roman numerals were the
preferred system of numerical representation. This
demonstrates that if a bad system is chosen to represent
aspects of a problem, relatively simple problems can be
made quite difficult.
The popular instructions per cycle (IPC) is a poor
metric to use in discussing processor performance because
it does not lend itself to intuition about what the
components of that performance are. That is, if a pipeline
that runs at x IPC has improvements made to it so that it runs at x + Δ IPC, there is no way to intuit Δ (in units of IPC) on the basis of those improvements.
The first point made in this paper is that while IPC
makes for good marketing, its inverse is what makes for
good engineering intuition. This paper strongly advocates
the use of cycles per instruction (CPI) as the metric of
choice in processor microarchitecture discussions.
The second point made in this paper is a corollary to
this first point. That is, scalar pipeline performance (in
units of CPI) can be split into three relatively independent
pieces: one involving the inherent execution component
of
the workload (cited incorrectly as “performance” in many
studies), one involving pipeline effects, and one involving
the memory hierarchy. Changes to the processor
microarchitecture can be translated into changes in CPI
almost intuitively by considering these three components
independently.
Since this is the fiftieth anniversary of electronic digital
computing, von Neumann’s machine built at the Institute
for Advanced Studies (henceforth called the IAS machine)
is used
to
illustrate the point. This first-generation
machine shares many similarities with modern processors,
and it is used to demonstrate that it is crucial to
understand instruction flow in a simple processor before
hoping to understand flow in a complex (e.g., superscalar)
processor. This notion is self-evident, but surprisingly
many resist it.
The third point made in the paper builds
on
the first
two points: The CPI components of scalar (single decode
per cycle) flow imply bounds for the CPI components of
superscalar (multiple decodes per cycle) flow. Therefore,
if understanding (or undertaking the development of a
detailed simulator for) the flow in a superscalar processor
is an overwhelming task, a great deal can be gained by
understanding the flow for a much simpler (scalar) case.
Given a simple understanding of that simple (scalar) case,
extrapolation to the more general case can be (at least)
bounded.
The fourth point is a divergence from the first three
points; it focuses on an aspect of performance that will
become more dominant in the coming decade, hence,
on an aspect of design that must evolve. This is the
performance and design of the memory bus. The simpler
designs and slower cycle times of past designs have not
stressed bus utilization the way future designs will.
Queueing effects grow nonlinearly once utilization exceeds
certain thresholds. This must be mitigated with a new
approach to bus arbitration.
The final point of the paper is that as an industry,
we are devoting disproportionate resources to overly
complex superscalar designs because we have not found
a metric that helps
us
to avoid those features in the
microarchitecture that hurt cycle time. That metric is
power consumption.
A
clear focus
on
power will ultimately
lead the industry through the GHz barrier, and will
converge the microarchitectures of the client and server
spaces in the process.
2. Separable components of CPI
A
digital computer executes an instruction stream, which
is a translated statement
of
a high-level problem.
Computer performance is “benchmarked” using
instruction streams that are thought to be representative
of the typical work done by the computer.
An instruction stream is a sequence of instructions
that effects a sequence of events when executed on a
computer. The particular events effected depend on the
microarchitecture
of
the computer, but the frequency of
those events is inherent in the instruction stream.
The basic unit of time within a computer is the cycle,
and each event caused by an instruction in the program
takes an integer number
of
cycles. Some events are
concurrent, and some events take zero cycles when they
occur in certain microarchitectures.
Cycles per instruction (CPI) is the most natural metric
for expressing processor performance because it is the
product of two measurable things:
CPI = (cycles per event) × (events per instruction).
The number of cycles per event is determined by the event
type for a particular microarchitecture (it can be zero for
some event types), and the number of events per
instruction is known for each workload (independent of
the microarchitecture).
For example, consider branches in the program. If one out of five instructions in a program is a branch instruction, the event frequency is 0.2 branches per instruction for that program. If a branch causes a three-cycle delay in a processor, the CPI contribution of branches in that processor is 0.6 CPI for that program. That is, on the average, an instruction incurs 0.6 cycles of delay because of branches in the program. A performance enhancer such as a branch prediction mechanism can be thought of as a mechanism that either reduces the frequency of branches or reduces the average penalty for a branch.
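As a concrete check on this dot-product view, the CPI contributions of several event types can be tabulated in a few lines. In the sketch below the branch numbers come from the example above; the other event types and their values are illustrative assumptions only.

```python
# Sketch: CPI as a dot product of event frequencies and event penalties.
# Branch numbers (0.2 branches/instruction, 3-cycle penalty) are from the text;
# the other event types and values are illustrative assumptions.

event_frequency = {        # events per instruction (a property of the workload)
    "branch": 0.20,
    "data_dependency": 0.10,
    "serialization": 0.01,
}
event_penalty = {          # cycles per event (a property of the microarchitecture)
    "branch": 3,
    "data_dependency": 1,
    "serialization": 10,
}

# Each event type contributes (cycles per event) x (events per instruction).
contribution = {e: event_frequency[e] * event_penalty[e] for e in event_frequency}
print(contribution["branch"])       # 0.6 CPI of branch delay, as in the text
print(sum(contribution.values()))   # total delay CPI from these event types
```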
Other events that can cause delay include data
dependencies (instructions that are delayed because they
need the results of previous instructions that have not yet
executed), cache misses (referenced data that are not in
the local cache and must be fetched from the memory
hierarchy), and serialization events (explicit draining of
the processor pipeline to ensure correct function of
certain complex operations). For the most part, the
frequencies of these events can be measured directly
from a program (benchmark) without knowledge of the
microarchitecture of the processor that will execute the
program.
Therefore, a benchmark is really a specification of
event frequencies, and those frequencies apply to any
microarchitecture.
A
specific microarchitecture determines
the delay (penalty) associated with each event type (which
can be zero). CPI is the most natural expression of
performance because it is the dot product of a benchmark
(event frequencies) and a microarchitecture (event
penalties) for all event types. Further, a very small
number
of
events account for most of the performance of
a processor.
3. Some observations on von Neumann's IAS machine
Figure
1
shows a sketch of the organization
of
the
IAS
machine, which is considered to be the first stored-
program computer. This sketch was taken directly from
Prof. John Hayes’s book
[1]
as he adapted it from von
Neumann’s verbal description of the machine
[2],
and it is
reproduced here courtesy of Prof. Hayes and of McGraw-
Hill. The machine comprises a central processing unit
(CPU) that is coupled to a main memory and some simple
input-output equipment
(I/O).
For the purposes of this
discussion, we ignore the
I/O
portion of the figure because
we are primarily concerned with the active processing part
of the computer. In the case of the
IAS
machine, the
I/O
equipment is used to preload the main memory with a
program and its data, and then to start the CPU. The
program and its data live completely in main memory until
completion of the program; i.e., the I/O does not interact
with the program.
The CPU comprises a data processing unit and a program control unit. The data processing unit contains an arithmetic logic unit and some internal registers, and the program control unit contains buffer registers that hold instructions and circuitry for decoding those instructions and generating the requisite control signals that govern the operation of the aggregation in Figure 1. In modern parlance, the data processing unit is called an execution unit (EU), and the program control unit is called an instruction unit (IU).

In the IAS machine, main memory has the appearance of a large register file from the instruction-level semantic point of view, but some of the register space contains the program itself. (The instructions of the program are within the data space, and can be dynamically modified by the program.) By program construction, the program and its data fit completely in main memory. Roughly speaking, main memory has the appearance of a modern-day cache that never misses (both because the program is constrained to live in main memory and because it is semantically impossible to miss within a register file).

The operation of the IAS machine is as follows. Each instruction is fetched and executed in its entirety only after the previous instruction has completed in its entirety, and before the following instruction is fetched. Furthermore, each instruction is processed in two nonoverlapped phases. First, the instruction is fetched and decoded (I phase), and then the instruction is executed by the data processing unit (E phase). These phases are not pipelined; they are done sequentially for each instruction.
Figure 1: Structure of von Neumann's IAS processor. Reprinted with the permission of the McGraw-Hill Book Company from page 23 of the second edition of Computer Architecture and Organization by John P. Hayes, published by the McGraw-Hill Book Company, copyright 1988.
Note that there is a single bus between main memory
and the CPU. This bus is used to transfer an instruction
during the
I
phase, and an operand during the E phase.
This is what is commonly referred to as the “von
Neumann bottleneck,” which is widely (and incorrectly)
perceived to be the “flaw” of the von Neumann
programming model. (Very interestingly, the von
Neumann bottleneck has its genesis in the solution to a
power-budget constraint, as is explained later.)
A
programming model is the high-level part of a
computer architecture. Computer architecture is the
contractual interface between the hardware and the
program; that is, a computer architecture
is
a set
of
rules
that describe the logical function of a machine as
observable by a program running on that machine.
Computer architecture does not specify what the
hardware actually does; it specifies what the hardware
appears to
do.
The entire point of defining an architecture
is to isolate the programmer from those details.
Architecture is
a
level of abstraction that allows a program
to have consistent function when run on machines having different implementations (microarchitectures).

Figure 2: A modern processor: Conceptualization of performance.
The von Neumann programming model is simply this:
A
program (which is a sequence of instructions) generates
outputs that are consistent with the outputs generated by
that program if each of the instructions in the program
sequence is fetched and executed in its entirety only after
all instructions preceding it have completed, and before
any instructions following it are fetched.
A
simpler (albeit
ambiguous) statement of the same thing is that “executing
a sequence
of
instructions should have the effect of
executing the instructions in the order in which they
appear in the sequence,” or more simply, “a sequence of
instructions is executed in sequence.”
The von Neumann programming model is merely a
formal statement that the execution of a program should
not be nonsensical. It does not imply that the hardware
cannot execute instructions out of order, and it does not
imply that there is only one bus or an inherent bottleneck.
All
it implies is that the hardware must not generate
program outputs that are inconsistent with the outputs
that would be generated by hardware that is constrained
in the fashion implied by the programming model.
In fact, the
IAS
machine has a single bus to memory,
and the I and
E
phases are not overlapped (not
pipelined). Each vacuum tube in the IAS machine runs
with its cathode at
200
V,
and turns
on
when its grid
exceeds
20
V.
Putting 20
V
on the grid is most easily
accomplished by using a resistive divider, but this results
in an undesirable steady-state power dissipation. To
contain their power budget, the designers of the
IAS
machine used a more complex scheme to propagate logic
values.
Instead of having resistive dividers between the stages
of the arithmetic-logic unit (i.e., the carry chain), the
IAS
machine uses bypass capacitors to couple the stages.
Therefore, instead of propagating static states (high or low
voltages), the
IAS
machine propagates pulses.¹ (This was
the first dynamic logic.)

¹ Related in a conversation with J. Pomerene, engineer on the IAS project.
If logic pulses are passed repeatedly through a capacitor
with no quiescent period, a dc bias builds up, creating
unacceptable noise susceptibility. The nonpipelined nature
of the
IAS
microarchitecture provides a natural solution
to this problem: The execution unit quiesces during the
I
phase, and the instruction unit quiesces during the
E
phase.
The von Neumann bottleneck has little to do with
architecture and a lot to do with power.
As
described
later, power budgets will steer microarchitecture in the
next decade in a fairly obvious way.
4. The modern processor and its performance
Figure
2
is a high-level diagram of a modern processor.
The processor core comprises an instruction unit (IU), an
execution unit (EU), and a cache (C). The processor core
is coupled to a main memory which can be a hierarchical
system.
At this level of abstraction, the processor core is very
similar to the
IAS
machine in many ways. (The greatest
similarity is that the IAS machine never experiences a
cache miss, just like any modern processor when running
SPEC.)
The major difference in the microarchitecture is that
the
IU
and the
EU
operate concurrently in modern
processors. This is basic “pipelining,” and in a well-
designed processor, the cache can accommodate the
bandwidth requirements
of
both units running
concurrently. In some modern architectures, instruction
data and operand data are distinct, and reside in separate
caches that are (necessarily) visible to the instruction set
architecture
(ISA).
In
the
IAS
machine, the
two
forms of
data are indistinguishable, and the interpretation of a
datum is contextual.
The performance components of this processor
correspond to its major hardware components, which are
enclosed by the dotted lines in Figure
2.
Therefore,
partitioning performance into these three major
components (as described below) is the most natural
means for understanding processor performance.
The total CPI for a processor is the sum of an infinite-cache performance and a finite-cache effect (FCE). The
infinite-cache performance is the CPI for the core
processor under the assumption that there are no cache
misses. Roughly speaking, this is raw processor speed
when the effects of the memory hierarchy are ignored.
The FCE accounts for the effects of the memory
hierarchy.
Finite-cache effect (FCE)
The FCE is the CPI associated with cache misses:

FCE = (cycles per miss) × (misses per instruction).
The misses-per-instruction component is commonly
called the miss rate, and the cycles-per-miss component is
called the miss penalty.
(In
the literature, the terms “miss
rate” and “miss ratio” are often used interchangeably.
This is incorrect. The miss ratio is the ratio of misses to
references, and this can only be related to the instruction
frequency by knowing the average number of references
per instruction.)
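The conversion between the two quantities is a one-line product; the numbers in this small sketch are illustrative assumptions rather than measurements.

```python
# Sketch: relating the miss ratio (misses per reference) to the miss rate
# (misses per instruction).  All numbers are illustrative assumptions.
miss_ratio = 0.05                  # misses per cache reference
references_per_instruction = 1.4   # e.g., one instruction fetch plus some operand fetches
miss_rate = miss_ratio * references_per_instruction
print(miss_rate)                   # 0.07 misses per instruction
```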
The beauty of separating the FCE in this way
is
that
for a given cache geometry (size, line size, and set
associativity), it allows a benchmark to be characterized
by its miss rate quite independently of the details of the
cache-to-memory interface. That miss rate applies
to
any other processor whose cache shares the same
geometry.
Measuring the miss rate does not require complex
simulation, but it requires simulating many misses (e.g.,
millions) for an accurate measurement.
A
miss rate
simulation merely requires
a
data structure that reflects
the cache geometry (the “simulated cache”), and a long
reference stream to run through that data structure to
effect replacements (misses). The simulation does not
require any level of detail, or any emulation of actual
hardware mechanisms.
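A minimal sketch of such a simulation follows: the "simulated cache" is nothing more than a set-associative data structure with LRU replacement, and a reference stream is run through it to count misses. The cache geometry and the synthetic address stream are illustrative assumptions.

```python
from collections import OrderedDict
import random

def simulate_miss_rate(addresses, size=64 * 1024, line_size=128, assoc=4,
                       refs_per_instruction=1.4):
    """Run a reference stream through a simulated cache; return misses per instruction.

    The "simulated cache" is only a data structure reflecting the geometry
    (size, line size, set associativity); no hardware mechanisms are modeled.
    """
    num_sets = size // (line_size * assoc)
    sets = [OrderedDict() for _ in range(num_sets)]   # each set is LRU-ordered
    misses = 0
    for addr in addresses:
        line = addr // line_size
        s = sets[line % num_sets]
        if line in s:
            s.move_to_end(line)          # hit: refresh LRU position
        else:
            misses += 1                  # miss: install the line, evict LRU if the set is full
            if len(s) >= assoc:
                s.popitem(last=False)
            s[line] = True
    instructions = len(addresses) / refs_per_instruction
    return misses / instructions

# Illustrative synthetic stream; a real study would run a benchmark trace.
random.seed(0)
stream = [random.randrange(4 * 1024 * 1024) for _ in range(200_000)]
print(simulate_miss_rate(stream))
```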
Characterization of the miss penalty requires a detailed
simulation of the processor and the memory system to
which it is coupled, but an accurate characterization can
be done with a relatively small number
of
misses (e.g.,
thousands). That is, to characterize miss penalty, the
actual hardware mechanisms that process a miss must be
simulated in detail, and they must be simulated within the
context of the processor. This ensures certainty regarding
the average delay associated with a miss, and this average
converges rapidly.
Therefore, the complex simulation (to characterize miss
penalty) can be fairly short, and the lengthy simulation (to
characterize miss rate) can be very simple. It is also true
that a (good) performance practitioner can make a fairly
accurate guess as to miss penalty for a given processor-
memory structure if he has experience with similar
structures (i.e., detailed simulation of derivative designs
might be unnecessary for someone who is knowledgeable).
Note that simulation is unnecessary in the trivial case of a blocking cache, because nothing overlaps the processing of the miss.

Figure 3: CPI as a function of miss rate.
Figure
3
shows an idealized plot of a processor’s CPI as
a function of miss rate. Note that the slope of the finite-
cache CPI is the miss penalty (cycles per miss), and the
y-intercept for this line is the infinite-cache CPI. Since
the infinite-cache CPI is independent of miss rate, it is
horizontal.
Therefore, if the infinite-cache CPI can be determined
for a processor, this fixes the y-intercept.
If
the miss
penalty can be determined, this fixes the slope of the
finite-cache CPI, and that line can be drawn. For a given
miss rate
(x
coordinate), the CPI is determined.
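Because the finite-cache CPI is just a line, the construction in Figure 3 reduces to two constants; the values in the sketch below are illustrative assumptions.

```python
# Sketch of the Figure 3 construction: finite-cache CPI is a line whose
# y-intercept is the infinite-cache CPI and whose slope is the miss penalty.
# The specific values are illustrative assumptions.
infinite_cache_cpi = 1.3    # y-intercept (CPI with no cache misses)
miss_penalty = 40           # slope (cycles per miss)

def finite_cache_cpi(miss_rate):
    return infinite_cache_cpi + miss_penalty * miss_rate

print(finite_cache_cpi(0.02))   # CPI at 0.02 misses per instruction -> 2.1
```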
Infinite-cache performance
Measuring infinite-cache performance requires a detailed
simulation of the processor core. This simulation (using no
cache misses) can be relatively short if there is reasonable
confidence that both the instruction mix and the software-
module mix are representative of steady-state performance.
The infinite-cache CPI is the sum
of
two parts (each of
which is independent of the miss rate), as is shown in
Figures
2
and
3.
These are the execute busy (EBusy) and
the execute idle (EIdle) components.
EBusy represents the average intrinsic work (execution
cycles) done per instruction. In a scalar processor, EBusy
corresponds to the CPI that would be obtained if there
were no pipeline delays. Another interpretation of EBusy
is infinite-cache CPI with the EU utilization at
100%.
For a scalar RISC machine, EBusy = 1; for a CISC machine, EBusy > 1.
Note that EBusy is independent of pipeline
delays. EBusy is the dot product of the instruction
frequencies and their execution times; it is a pure measure
of intrinsic work.
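A short sketch of this dot product follows; the instruction mix and the execution times are illustrative (CISC-like) assumptions. For a scalar RISC machine every execution time would be 1, giving EBusy = 1.

```python
# Sketch: EBusy as the dot product of instruction frequencies and execution times.
# The mix and the cycle counts are illustrative (CISC-like) assumptions.
instruction_frequency = {"load": 0.25, "store": 0.10, "alu": 0.45,
                         "branch": 0.15, "decimal": 0.05}
execution_cycles = {"load": 1, "store": 1, "alu": 1, "branch": 1, "decimal": 5}

ebusy = sum(instruction_frequency[i] * execution_cycles[i]
            for i in instruction_frequency)
print(ebusy)   # 1.2 cycles of intrinsic work per instruction (> 1, CISC-like)
```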
EIdle represents the average delay per instruction
caused by pipeline effects. Very simply, EIdle is the
difference between the infinite-cache CPI and EBusy. In
general, it accounts for all of the numerous pipeline
effects, but only a few of these are really significant. EIdle
can also be separated into a number of relatively
independent components
[3],
although this is not done in detail here.
As can be surmised from some of the previous
discussion, the first crude separation of EIdle is as follows:
1.
Events in which instructions are not available in time to
keep the
EU
busy. These are branches, dynamic effects
associated with cache utilization, and effects associated
with the fetching
of
variable-length instructions (if
applicable).
2.
Events in which operand data are not ready in time
to keep the EU busy. The most predominant is the
address generate interlock (AGI) associated with
indirect data fetching (e.g., pointer chasing), but
execution dependencies in general fall into this
category.
3.
Serialization events in which the pipeline is deliberately
drained between instructions to ensure correct function.
These events are predominant in MP environments.
Note that each of these events has a frequency that can
be measured for any benchmark, and that frequency is
independent of the microarchitecture. Each event has an
associated penalty that can be determined from the
microarchitecture. The CPI contribution of each event
type has the form
CPI = (cycles per event) × (events per instruction).
A further interesting observation is that most
of
the EIdle
events, and all of the predominant ones (branches, AGIs,
and serialization), have penalties that are directly
proportional to the number of stages in the pipeline (in
general), and specifically to the number of cycles required
for a cache access. That is, EIdle CPI varies linearly with
the number of stages in the pipeline (i.e., with the degree
of
pipelining).
Superpipelining and wave pipelining are techniques
that achieve a fast cycle time by making the pipeline
granularity fine. It has been shown that processor cycle
time decays to an asymptote when the pipeline granularity
is increased, while CPI increases linearly, as shown
respectively in
Figures
4(a)
and
4(b)
[4].
The performance
(MIPS) is the inverse of the product of the two, and that
product has a quadratic form (thus, a unique maximum),
as shown in
Figure
4(c).
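The shape of Figure 4 can be reproduced with a toy model in a few lines: cycle time decays toward a latch-overhead asymptote as the number of stages grows, CPI grows linearly with the number of stages, and MIPS is the inverse of their product. The constants below are illustrative assumptions, not data from [4].

```python
# Toy model of Figure 4: cycle time decays to an asymptote with pipeline depth,
# CPI grows linearly with depth, and MIPS peaks where their product is smallest.
# All constants are illustrative assumptions.
LOGIC_DELAY = 20.0       # total logic delay to be divided among stages (ns)
LATCH_OVERHEAD = 1.0     # per-stage overhead; the cycle-time asymptote (ns)
EBUSY = 1.0              # intrinsic work (CPI)
EIDLE_PER_STAGE = 0.08   # pipeline-delay CPI added per additional stage

def cycle_time(stages):      # Figure 4(a): decays toward LATCH_OVERHEAD
    return LATCH_OVERHEAD + LOGIC_DELAY / stages

def cpi(stages):             # Figure 4(b): linear in the degree of pipelining
    return EBUSY + EIDLE_PER_STAGE * stages

def mips(stages):            # Figure 4(c): inverse of the (quadratic) product
    return 1000.0 / (cycle_time(stages) * cpi(stages))

best = max(range(1, 41), key=mips)
print(best, round(mips(best), 1))   # the pipeline depth with the unique maximum
```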
Because of unpredictable branching, and because of a
high AGI frequency due to heavy pointer use, EIdle
events in commercial code render superpipelining
techniques fruitless. These techniques are better suited
to the realm of scientific workloads.
Real workloads
Figure
3
is an idealized plot, but in fact, real workloads
exhibit a small correlation between EBusy and miss rate.
This does not mean that miss rate is causal to EBusy; it
merely means that there is a statistical dependence
(correlation). This is why rigorous benchmark selection
requires an accurate module mix. An accurate instruction
mix alone is insufficient.
Furthermore, the actual finite-cache CPI becomes
superlinear when the miss rate is very low or very high,
for different reasons:
1. When the miss rate is very low, the misses do not cluster (in time) quite as much as they would in steady state; therefore, no portions of misses overlap each other. The net effect is that the average miss penalty is slightly larger.

2. When the miss rate is very high, the memory system and its buses become saturated servers. The misses serviced by the saturated system then incur inordinate queueing delays, and the average miss penalty increases. This effect is nonlinear, and is discussed later in this paper.
Out-of-order execution in modern pipelines
Tomasulo described a “common data-bus algorithm” that
permits out-of-order execution in the floating-point
hardware of the System/360* Model
91
computer
[5].
In
the 1960s, floating-point operations required multiple
execution cycles,
so
allowing a logically subsequent
instruction to bypass an operation that was in progress
helped performance.
The Tomasulo algorithm was extended for use in the
S/390* ES/9000*
family
of
processors to include register
renaming and a means for taking precise interrupts while
executing out of order [6]. The general scheme is
described by Smith and Pleszkun
[7],
and is now used in
many modern processors.
Two aspects of the general scheme can be compelling
under certain circumstances:
1.
If the number of architected registers is insufficient for
supporting the maximum number of instructions that
can be in progress at any time (proportional to the
product of the number of pipeline stages and the
number of instructions decoded per cycle), register
renaming provides a means to increase the number of
physical registers
so
as to support this peak rate.
2.
Register renaming can simplify the control required to
take precise interrupts and/or to initiate retry
operations.
However, the out-of-order aspect of the broad scheme is
generally not valuable in modern processors, where the
number of execution cycles needed per instruction is
almost always one (exceptions are divide, square root, and
some decimal operations).
Out-of-order execution is not usually valuable because
in-order flow is generally maintained in
all
of the pipeline
except for the execution phase, which is usually a single
cycle. Specifically, the in-order decoding of instructions is
required
so
that the hardware can make sense of the
logical instruction sequence, and in-order instruction
completion is required
so
that exceptions are handled
properly and interrupts are taken precisely. Between the
decoding and the completion of an instruction, CISC
pipelines access an operand and then execute the
instruction, and RISC pipelines merely execute the
instruction. In general, the ordering of operand accesses is
preserved to satisfy constraints imposed by MP-coherency
issues (specifically, fetches are kept in order, stores are
kept in order, and fetches and stores to the same location
are kept in order).
Therefore, the situation in modern processors is that
order within the pipeline is maintained up until execution,
it is restored immediately after execution, and execution
takes a single cycle. Enabling out-of-order execution in
this environment is not worthwhile, because it will seldom
happen. If enabling out-of-order execution complicates a
design, it should not be done. Further, if register renaming is used specifically
to enable out-of-order execution (i.e., if the two
circumstances listed above are not motivating factors), it
should certainly not be done. Not only will this mechanism
fail to improve performance, but the renaming steps will
lengthen the pipeline (or will have a cycle-time impact)
which will increase CPI as previously described, thereby
actually hurting performance.
In
addition to arithmetic multicycle operations, there
are other classes of multicycle operations in some
instruction-set architectures. Specifically, there can be
long moves, and there can be operating-system assist
instructions. In both cases, the opportunity for them to
overlap other operations is very slight. Long moves fully
utilize critical resources (e.g., buses) and prevent much
else from happening while they are in progress. Operating-
system assists are usually serialization events which
explicitly preclude overlap.
High-MIPS effects and workload characterization
The initial velocity of a falling object is determined
predominantly by gravity. As this predominant effect
increases the velocity, second-order effects (which increase
with velocity) gain significance, and work against the effect
of gravity.
In
physics, this situation is clearly described by
a simple differential equation in which the object reaches
a terminal (constant) velocity. The falling object can go no
faster than this.
Similar phenomena affect processor speed, although
the effects cannot be crisply described by differential
equations. Collectively, "high-MIPS" effects (HMEs) are
effects that tend to slow a processor as it runs faster.
Most of the HMEs arise because the latency (not the
bandwidth) associated with the
I/O
subsystem does not
scale with processor speed.
In a multiprogramming environment, as the processor
speeds up (via faster circuits, higher ILP, or higher degree
of MP), the latency of the
I/O
subsystem appears larger
from the perspective of an executing program.
If
the
multiprogramming level is maintained, the added speed
causes the processor utilization to drop (i.e., the added
speed is not utilized). To maintain a high utilization, the
multiprogramming level is increased.
When the multiprogramming level increases, the
software queues become longer, which causes dispatchers
and other operating-system modules to utilize a higher
percentage of the system time. Operating systems
compensate for this by using different algorithms, which
changes the basic instruction mix somewhat. Further, the
larger number of coexistent processes generate a higher
rate of synchronization events, and those events must
synchronize with more processes. Since the aggregate
“footprint” of the coexistent processes is larger, the cache
miss rate and the TLB miss rate will be larger for each
process. These effects all slow the system down.
Note that as processes (mainstream and
I/O)
and
processors (mainstream and
I/O)
interact, they necessarily
spend some portion of their time in wait-loops waiting for
asynchronous events to complete (e.g., a page-in). As the
number of processors in a multiprocessor increases, as
those processors run faster, and as the degree of
multiprogramming increases, the portion of time spent
waiting increases.
When measuring the real performance of a running
system by measuring instruction throughput (MIPS), it is
important to subtract out that portion of the performance
that is spent in wait loops (or other analogous code).
Not
only are wait loops nonproductive (and should not be
included in any measure of "work"), but they can actually
run very fast, and can bias a measurement of MIPS in an
artificially favorable way.
5. IPC and other metrics that defy intuition
Thus far, the case has been made for CPI as the
performance metric that most naturally lends itself to
intuitive interpretation and simple estimation techniques.
CPI is a dot product involving a set of event types, where
the contribution of each event type is the product of its
frequency and its associated penalty. The frequencies are
intrinsic to workloads, and the penalties can be readily
derived from the microarchitecture.
An experienced designer should know approximate
numbers for key events, key workloads, and basic
microarchitectures. Changes to a microarchitecture rarely
affect more than one or two event types, and that effect
should be readily understood by an experienced designer.
Quick estimation using this dot product is simple and
remarkably accurate.
IPC is the inverse of CPI. It defies intuition while
hinting at high performance. Merely intoning “instructions
per cycle” creates a subliminal feeling that massive
instruction-level parallelism (ILP) is being achieved.
It is not.
When performance is expressed as
IPC,
EBusy and
EIdle become intermingled and inseparable. This does not
clarify performance; it shrouds it. The use of IPC has
been the largest impediment
to
the understanding of
pipeline effects, and the largest contributor to the notion
that EIdle defies intuitive quantification. It has been the
largest impediment
to
simple analyses of superscalar and
VLIW processor designs.
In fact, so many papers quote some theoretical maximum speed in the absence of any possible accounting for pipeline effects as "performance," that for several years we assumed that most of the industry was measuring CPI in the complex-number domain, as shown in Figure 5.

Figure 5: CPI in the complex-number domain.
Complex CPI
As was explained in the previous section on infinite-cache
performance,
CPI = EBusy + EIdle.
On
the basis of the numbers quoted in many papers, the
only reasonable conclusion that is possible is that
published performance is just EBusy, which can be
interpreted as a projection of complex CPI onto the real
axis,

Re(CPI) = EBusy,

where the angle associated with the projection is

θ = arccos[EBusy/(EBusy + EIdle)].
The corresponding projection onto the imaginary axis is

Im(CPI) = (EBusy + EIdle) sin(θ).
Think of the imaginary component
of
CPI as that
component of performance many authors would like to
imagine they need not deal with. Complex CPI is
CPI = EBusy + i Im(CPI).
A good indicator of the usefulness of a measurement is
the sensitivity of that measurement to
a
parameter of
interest. Since changes to a microarchitecture usually
target EIdle (i.e., improving EU utilization), the sensitivity
of CPI to EIdle is this indicator. This is
d(CPI)/d(EIdle) = i [ √(1 − [EBusy/(EBusy + EIdle)]²) + EBusy²/((EBusy + EIdle) √(EIdle(2EBusy + EIdle))) ],
which is a purely imaginary complex trigonometric
function that is messy, hence unintuitive. (It also happens
that if EBusy and EIdle are comparable, this sensitivity is
small,
so
complex CPI is not a useful measurement
anyway.)
Thus, just as for the inverse of CPI (IPC), it is apparent
that the complex form of
CPI
does not lead
to
clear
intuition. Nonetheless, computing complex CPI from
published performance numbers does yield another useful
metric.
In
particular,
θ
is a direct measure of the degree
to which the published number is out of phase with
reality.
6. Limits of superscalar performance as understood from scalar components
Thus far, discussion has focused on performance for
pipelined machines that issue a single instruction per
cycle. It was argued that CPI can be partitioned into
independent pieces for such a machine, and those pieces
can be easily understood.
Superscalar processors decode and execute multiple
instructions per cycle, and the execution can be done out
of order with respect to the decoding. Understanding the
pipelined flow of multiple instructions per cycle (possibly
out of order) is more difficult than understanding the flow
of single instructions per cycle. Similarly, implementing a
superscalar simulator is more difficult than implementing
a scalar simulator.
For some proposed microarchitectures, the tasks of
understanding performance and of simulating performance
seem overwhelming. While the dissection of superscalar
performance into its components is not done in this paper,
its key is
a
solid grasp of the analogous scalar
flow.
Frequently, this grasp is sufficient for making very good
estimates. Understanding superscalar performance without
understanding the analogous scalar flow is hopeless.
Every designer and performance analyst who is working
on
a superscalar processor without a detailed model
of
such and without the resources or time to construct one
should construct (at least) the analogous scalar model.
That is, construct a model of the same pipeline, and study
that pipeline running in scalar mode.
The scalar performance components are hard bounds
(fixed CPI limits) for the superscalar components; they
give definite indications of what levels of performance are
and are not achievable in superscalar mode. In particular,
the FCE is identical; scalar EBusy is an upper bound; and
scalar EIdle is a lower bound.
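These bounds lend themselves to a quick bounding calculation; the scalar component values and the machine width in the sketch below are illustrative assumptions.

```python
# Sketch: hard CPI bounds for a superscalar design, derived from the measured
# components of the analogous scalar design.  All numbers are illustrative.
scalar_ebusy = 1.0    # upper bound on superscalar EBusy (intrinsic work)
scalar_eidle = 0.6    # lower bound on superscalar EIdle (pipeline delays)
scalar_fce   = 0.8    # identical in scalar and superscalar mode
n_units      = 4      # functional units (decode width)

# Optimistic floor: all intrinsic work spread perfectly over n units; EIdle and
# FCE cannot improve.  The text cautions that estimates approaching EBusy/n are
# probably wrong, so this is a bound, not a prediction.
cpi_floor  = scalar_fce + scalar_ebusy / n_units + scalar_eidle
scalar_cpi = scalar_fce + scalar_ebusy + scalar_eidle   # scalar reference point
print(cpi_floor, scalar_cpi)   # 1.65 versus 2.4 in this example
```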
Recall that FCE is the product of the miss rate and the
miss penalty. The miss penalty depends on the memory
hierarchy, not on the microarchitecture of the processor.
The miss rate (misses per instruction) is a characteristic of
the workload, and is independent
of
the rate at which
instructions are executed. (Note that prefetching
mechanisms, branch prediction mechanisms, and explicit
speculation can increase the miss rate, but that increase
can be accurately estimated without a detailed model of
the processor.) Therefore, it is an excellent approximation
that the
FCE
in superscalar mode is identical to the
FCE in scalar mode. (This paper does not address
multithreading; suffice it to say that it makes the FCE
component worse, and it should not be used if a primary
goal is high performance.)
EBusy is the inherent work done by the program. This
work is identical in scalar and superscalar modes. The
difference is that in superscalar mode, some of the work is
done in parallel.
As
such, the scalar EBusy
is
an upper
bound, and the superscalar EBusy should be smaller. If
there are
n
functional units, it cannot be more than
n
times smaller. Estimates that approach EBusy/n are
probably wrong.
EIdle is (primarily) a measure of the functional
interlocks that are intrinsic to the program, e.g., branches,
AGIs, and serialization events. The penalties associated
with these intrinsic interlocks depend on the pipeline, and
on
the temporal proximities of the interlocks to the events
that resolve them (e.g., AGI).
The interlocks do not disappear in superscalar mode; on
the contrary, their effects are amplified. If superscalar
P.
G.
EMMA
operation is having the desired effect (i.e., the execution
rate is increased), the temporal proximities of interlocks
are shortened with respect to the events that resolve them.
That is, sequences of instructions are compressed into a
smaller temporal “window,” and the resolving events have
less time to resolve the interlocks before the interlocks
stop pipeline flow.
Therefore, the intrinsic part of scalar EIdle is a lower bound.
As
the execution rate increases, EIdle can only
become larger.
There are also EIdle effects that arise because of
resource conflicts (e.g., insufficient buffering, buses, or
ports). These are called “bottlenecks.” They are
(generally) extrinsic to the program, and should be
relatively small.
If
these effects are manifest in a scalar
design, the design is poor. Design resource should be
directed at these bottlenecks instead
of
at a superscalar
control structure.
If these effects are ignorable in a scalar design, they
nonetheless might be manifest in a superscalar design if
buffers and bandwidth-related resources are not scaled
accordingly. Therefore, as was true of the intrinsic EIdle,
extrinsic EIdle in a scalar design (if any) can only become
worse in a superscalar design. Scalar EIdle is a lower
bound.
Of the three scalar bounds, EIdle is the most difficult to
extrapolate to the superscalar realm.
As
was discussed in
the subsection
on
infinite-cache performance, the principal
components of EIdle have penalties that are directly
proportional to the number
of
cycles required for a cache
access. For this reason, a designer should know the
coefficient of infinite-cache CPI with respect to this
number. Recall that this is strictly linear,
so
the increase
in CPI per cycle increase in the cache-access time is
constant. This coefficient is useful in quickly evaluating
whether a potential reduction in miss rate from a larger
cache justifies the potential exposure of a longer cache-
access time.
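That evaluation amounts to comparing two CPI terms; in the sketch below, the coefficient, the added access cycle, and the miss numbers are illustrative assumptions.

```python
# Sketch: does a larger but slower cache pay off?  Compare the EIdle CPI added
# by a longer cache access against the FCE CPI saved by the lower miss rate.
# All numbers are illustrative assumptions.
eidle_cpi_per_access_cycle = 0.25   # the (linear) coefficient discussed above
added_access_cycles = 1             # the larger cache costs one extra access cycle

miss_penalty = 40                   # cycles per miss
miss_rate_small_cache = 0.020       # misses per instruction, smaller cache
miss_rate_large_cache = 0.014       # misses per instruction, larger cache

cpi_cost   = eidle_cpi_per_access_cycle * added_access_cycles
cpi_saving = miss_penalty * (miss_rate_small_cache - miss_rate_large_cache)
print(cpi_cost, cpi_saving, "worth it" if cpi_saving > cpi_cost else "not worth it")
```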
Small fast L0 caches
A caveat in regard to the aforementioned pipeline coefficient is that it should not be used to extrapolate CPI downward into the realm of L0 caches. L0 caches cannot be conceptualized in the same manner as L1 caches because their (rare) benefits cannot be quantified in terms of hit rate. This is because L0 caches are very small and have hit rates that are below the "critical mass" required to conceptualize their steady-state operation.
Specifically, the quanta of pipeline flow are basic blocks, which are sequences of instructions between successive taken branches (or between successive mispredicted branches). If all data and instruction references in a basic block are satisfied by the L1 cache, that basic block "sees" the pipeline flow involving the L1 access path. If L1 misses are rare (relative to the number of instructions in the basic block), "the pipeline" is the pipeline associated with the L1, and L1 misses can be treated as isolated events that are accounted for in the FCE.
The L0 cache allows the basic block to start early (assuming that the initial access of the block hits in the L0), and therefore to finish early (assuming that all references in the basic block hit in the L0). If any access in the basic block misses in the L0 and requires an L1 access, the flow for that block reverts to the L1 pipeline flow, and the "head start" afforded by the L0 is lost.

Therefore, only those basic blocks that live completely in the L0 benefit from the L0. For typical L0 sizes and miss rates, most workloads do not have a predominance of basic blocks with this characteristic. In any case, the probability of losing performance due to L0 misses is not proportional to the L0 miss rate; instead, it is closer to a geometric distribution of the miss rate whose degree is proportional to the average number of accesses per basic block.
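A small sketch of that argument follows; the per-access L0 hit probability and the independence of accesses are illustrative assumptions, not measurements.

```python
# Sketch: probability that a basic block keeps its L0 "head start", i.e., that
# every access in the block hits in the L0.  With an assumed independent
# per-access hit probability, this falls off geometrically with block length.
l0_hit_probability = 0.90            # per-access hit probability (assumption)
for accesses_per_block in (2, 5, 10, 20):
    p_keep = l0_hit_probability ** accesses_per_block
    print(accesses_per_block, round(p_keep, 3))
# Even a 90% per-access hit rate leaves only about 35% of 10-access blocks
# entirely inside the L0.
```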
The disadvantage of decoupling the I and E phases
A
model of high-performance computing that has become
pervasive in the last several years has the
I
phase
decoupled from the
E
phase. The
two
phases are
implemented with engines that are autonomous with
respect to each other, and they operate
on
a common
instruction queue.
In this model, the
I
engine is directed by an extremely
accurate branch-prediction mechanism, and it is able
to
run far ahead of the
E
engine, fetching instructions at a
high rate and dispatching them to the common instruction
queue. The
E
engine is a superscalar dataflow machine.
The dataflow analyzer chooses instructions that are not
interlocked, and dispatches them out of order (if
necessary) to a set of parallel execution elements at a
hypothetically high rate. The cache and the register file
are assumed to be highly multiported
so
that they do not
bottleneck execution.
Many studies that involve this model ignore the details
of the I phase and assume that it can keep ahead
of
the
E
phase. The focus then shifts
to
the dataflow analyzer,
which is a more tractable problem. If the constraints
imposed by considerations for an MP environment are
loosened (specifically, the ordering of fetches and the
completion of stores), and if the cache miss rate is
extremely low, the execution time is a very impressive
(and very unsurprising) number that is proportional to the
depth of the dataflow graph.
This model yields very impressive results that are flawed
on
several levels. The first level is a pure "catch-22" that
does not require any understanding of microarchitecture:
If the
E
phase accelerates
to
its theoretical speed limit
(determined by the dataflow graph), it must catch up with
the
I
phase. Therefore, there cannot be instructions in the
common queue to dispatch (i.e., if the E phase goes as
fast as it is supposed to, the processor runs out of
instructions). The point is that the
I
phase is not
ignorable.
In real commercial workloads, the frequency of
unpredictable branches prevents the
I
phase from getting
far ahead of the E phase. Given the best techniques
available, a branch will be guessed wrong every
20-30
instructions in real commercial code.
(In
SPEC**, the
numbers are much better.) Therefore, the theoretical
maximum execution rate can only be maintained for
intervals of
20-30
instructions, and each interval is
followed by a wrong-guess recovery action that requires
restarting the
I
phase. For real hardware to approach the
bounds implied by this model, branch prediction must
become much better than it is today (except in SPEC).
In real commercial code, an L1 cache miss occurs every 10-20 instructions for L1 sizes in the 64KB to 256KB range (in SPEC, the numbers are much better); thus, maintaining a high execution rate requires moving cache lines at a rate corresponding to the resulting miss frequency. This is not realistic at very high execution rates. For example, if a processor can execute five instructions per cycle, and there is an L1 miss every ten instructions, the memory hierarchy must be able to move an entire cache line in two processor cycles for the processor to sustain this rate. (This is still an unrealistically optimistic statement, because it assumes that there is no miss latency, which would require perfect prefetching.)

Another characteristic of real commercial code is that a majority of conditional branches are resolved by instructions that have an AGI interlock with an immediately preceding instruction. (An example is a loop that searches for a record in a linked list.) When this is the case, it is disadvantageous to separate (temporally) the I and E phases by adding pipeline stages (the cycles required to put instructions into a queue, analyze them, and redispatch them). Instead, it is desirable to couple the phases tightly so that there is minimal latency between the initial dispatch of a load instruction and its execution outcome. (This is much less of a constraint in SPEC.)
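The bandwidth requirement in the five-instructions-per-cycle example above can be checked with a few lines of arithmetic; the 128-byte line size is an illustrative assumption.

```python
# Sketch of the bandwidth arithmetic above.  The line size is an assumption.
instructions_per_cycle = 5
instructions_per_miss = 10
line_size_bytes = 128

cycles_per_miss = instructions_per_miss / instructions_per_cycle    # 2 cycles
required_bytes_per_cycle = line_size_bytes / cycles_per_miss        # 64 bytes/cycle
print(cycles_per_miss, required_bytes_per_cycle)
# Sustaining 5 IPC requires an entire line every 2 processor cycles, i.e.,
# 64 bytes per cycle for a 128-byte line, before any miss latency is counted.
```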
7. Future trends in cache miss penalty
In
high-performance systems of the
1980s,
cost was a less
crucial factor than it has become in the
1990s.
Symmetric
multiprocessor (SMP) systems of the
1980s
typically had
private unidirectional point-to-point buses between
processors and the memory system. Those buses were
driven by water-cooled circuits onto low-dielectric
packaging, and they ran at the same speed as the
processor. Cache line sizes were moderate by today’s
standards.
Figure 6: Temporal components of a cache miss.
In the
1990s,
processors have become faster (shorter
cycle time) and buses have not kept pace. There are many
systems in which the processor runs at a higher frequency
than the
bus,
and that difference is becoming larger. The
number
of
processors in
SMPs
has increased, and to
contain costs, many systems are shared-bus systems. With
the growth of object-oriented programming, miss rates
have increased, and with higher instruction-level
parallelism, miss frequencies (in time) are larger for the
same miss rates. Further, it is becoming the fashion to
increase line size.
Each of these trends increases
bus
utilization, and thus
exacerbates the effects
of
bus
utilization.
Bus
queueing
(which was an ignorable effect in high-performance
systems
of
the
1980s)
will dominate system-level
performance in the next decade if new protocols are not
adopted to mitigate it.
Figure 6 is a conceptual drawing of the temporal components of a cache miss. The leading-edge component is the time it takes the memory system to deliver the first datum of a miss. Since a miss causes the transfer of both the datum that caused the miss and all other data in the same line, the trailing-edge component accounts for those cycles that are lost because an entire line was transferred, i.e., the cycles following the leading edge.

Note that the trailing-edge delay that affects the processor is less than the total number of cycles that it takes to transfer a line, but it is directly proportional to that number, and thus directly proportional to the line size. Roughly speaking, trailing-edge effects fall into three categories:

1. Since spatial locality exists, there is frequently an upstream reference (immediately following the miss event) to a datum that is in the same line as the datum that caused the miss. This should not be counted as a second miss, but the second reference will experience a delay if the associated datum is still in transit at the time of the reference event.

2. The incoming line consumes bandwidth at the L1 cache and interferes with the running processor.

3. There are finitely many systems at the L1 interface (e.g., only one system in many processors) for containing the necessary state for controlling the processing of each miss in progress. That is, there is a maximum number of misses that can be in progress at any time, and that number is typically small. If all of these systems are occupied, a new miss cannot be started until a miss in progress has completed in its entirety (i.e., until an entire trailing edge is over).
Assessment of the leading-edge penalty is more straightforward. When a miss is recognized by the L1 cache, there can be queueing at the address bus in some systems. Once the processor gets control of the address bus, it transmits an address to the L2 cache (or whatever is next in the hierarchy). At the receiving end, there could be an ECC verification cycle, and queueing at the L2. Once the request is accepted, there is a (fixed) L2 access time, perhaps followed by another ECC verification cycle. Again, there can be queueing at the data bus that returns the first datum to the processor, followed by the transfer of that datum (and perhaps another ECC verification cycle at the receiving end). The trailing edge follows this.

Therefore, leading edge comprises several fixed delays that can be estimated directly from the microarchitecture of the cache hierarchy, and a few chances for queueing. Whenever there is a chance for queueing, the average delay incurred is a nonlinear function of the utilization of the subject resource (in this case the address bus, the L2, and the data bus).

The address bus is used for only one bus cycle every miss, so its utilization is small. The L2 utilization can be large, but there are many straightforward techniques for mitigating these effects, e.g., interleaving. The utilization of the data bus is strongly related to the trailing edge. All current system-level design trends tend toward increasing this utilization, hence its associated queueing delay. As is described below, queueing delay explodes if utilization is driven past some general thresholds.

For the sake of illustration, the general trend in queueing effects is analyzed below, using an open queueing model. While this is not accurate because the real system is closed, the solution for the open system is simple, and yields valuable intuition as to the general trend. (The solution for a closed system is very complex, and does not yield to intuition.) The notation used is as follows:

Description of bus rate
fP = frequency of the processor in MHz.
fB = frequency of the bus in MHz.
BR = fP/fB = bus ratio (number of processor cycles per bus cycle).

Description of miss rates
mr = miss rate for a single processor (misses per instruction).
CPI = cycles per instruction for a single processor.
P = number of processors that share the bus.
BMF = (mr × P)/CPI = bus miss frequency (misses per cycle on the bus).

Description of trailing edge
L = line size (bytes per line).
W = width of the bus (bytes).
PL = L/W = number of "packets" per line.

Service time for a miss
TE = PL × BR = trailing edge (number of processor cycles to transfer a line).

Bus utilization
U = BMF × TE = bus utilization (a probability).

Note that the trailing edge is the service time for a miss when the bus is the "server" in the queueing model. (Keep in mind that the trailing edge is distinct from the trailing-edge effect, as was previously described.)

The open aspect of the model is easily seen in the definition of utilization above. In particular, since utilization is a probability, it is physically bounded between zero and one. In the open definition, it is the product of a rate and a service time which are assumed to be constant; hence, there is no feedback and no implicit bounding. In the real closed system, there is feedback to bound the utilization. Specifically, increasing bus utilization increases the queueing effect, which increases CPI, which decreases BMF, which keeps the utilization less than one.

Nevertheless, the queueing delay (Q) is calculated for the open system below. The calculation is expressed in two forms to illustrate a point (as explained later). Q is the average number of cycles that each miss spends waiting to get control of the data bus. This is part of the leading-edge penalty.

Q = -(U × TE)/2 + Σ_{i=1}^{n-1} i × U^i × TE
  = -(BMF × TE^2)/2 + Σ_{i=1}^{n-1} i × BMF^i × TE^(i+1).
The first form of the equation is written directly from
the following intuition. If a miss requires service, there are
i
misses in progress with probability U^i, and the new miss
must wait for the trailing edges
of
all
i
misses to complete.
On the average, the first trailing edge is half over when
the new miss finds that the bus is busy (hence, the
negative term before the summation to correct for this).
There can be as many as
n
misses outstanding in the
system,
so
a new miss can have no more than
n
-
1
misses ahead of it (which is the limit of the summation).
Most processors today can only have a single miss
outstanding,
so
in most shared-bus MP systems,
n
=
P.
The first expression for
Q
shows that the average
queueing delay is a polynomial in U, and the degree
of
the polynomial is one less than the number of misses that
can be simultaneously outstanding in the system. In most
systems, the degree is
P
-
1.
The second expression for
Q
is the same as the first, but it has been rewritten to show
that the first term is proportional to the square of the
trailing edge, hence, the square of the line size.
If U is kept relatively small, U^i dies out, and the queueing delay is proportional to the square of the line size. If U is allowed to grow large (say, >0.5), U^i does not die out, and the queueing delay explodes.
For example, consider a four-processor system where
each processor runs at
2
CPI, and each processor can have
(at most) one miss outstanding. Assume that the bus is
32 bytes wide, and that it runs at half the speed of the
processor. Let the line size be
128
bytes. If the miss rate is
0.02
(i.e., if the intermiss distance is one miss every
50
instructions), the equations above yield a queueing delay
of 3.71 cycles per miss. These system-level assumptions are
fairly conservative for most systems that are being
projected, and this queueing delay is quite significant.
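The example can be reproduced directly from the open-model expression for Q given above (as reconstructed from the definitions in the previous subsection); small rounding differences from the 3.71 quoted in the text are expected.

```python
# Sketch: open-model bus queueing delay for the four-processor example above.
P   = 4        # processors sharing the bus
CPI = 2.0      # cycles per instruction per processor
mr  = 0.02     # misses per instruction per processor
BR  = 2        # bus ratio: the bus runs at half the processor speed
W   = 32       # bus width in bytes
L   = 128      # line size in bytes
n   = P        # one miss outstanding per processor

PL  = L // W                 # packets per line                       -> 4
TE  = PL * BR                # trailing edge, in processor cycles     -> 8
BMF = (mr * P) / CPI         # misses per processor cycle on the bus  -> 0.04
U   = BMF * TE               # bus utilization                        -> 0.32

# Q = -(U * TE)/2 + sum over i = 1..n-1 of i * U**i * TE  (as reconstructed above)
Q = -(U * TE) / 2 + sum(i * U**i * TE for i in range(1, n))
delta_cpi = Q * BMF * CPI / P      # equivalently Q * mr

print(round(Q, 2), round(delta_cpi, 3))   # about 3.7 cycles per miss, about 0.074 CPI
```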
Figure 7 is a plot of queueing delay as a function of trailing edge for various miss frequencies. The dotted lines show constant utilizations. This shows that if U is kept reasonably small (e.g., <0.3), the queueing delay is fairly flat, but if U is large (e.g., >0.5), the delay becomes exceedingly large.
The example above is shown as point A on the plot, and it demonstrates that modern systems are on the edge of a nonlinear range in bus queueing. A new approach to transferring data is needed in the coming decade.

The effect of queueing delay on the base CPI is

Δ(CPI) = (Q × BMF × CPI)/P.

In the example above, Δ(CPI) = 0.074, which is nearly 4% of the total system performance. Figure 7 shows that this is headed in a very dramatic direction as utilization is increased further.

Figure 7: Bus queueing delay as a function of trailing edge for various miss frequencies.
8. Future trends in bus protocols
In the previous section, the various trends in system
design that drive bus utilization were discussed, and the
effects of utilization
on
queueing delay were shown.
To
summarize, the trends that drive bus utilization are the
following:
More processors in a system.
More processors sharing a bus.
Faster processors (cycle time) and higher bus ratios.
Larger line sizes (= larger trailing edge).
Lower CPI (= higher miss frequency for the same miss rate).
Higher miss rates (new code with larger working sets).
Multiple misses outstanding per processor (larger n).
Speculative execution (= higher miss rate).
Prefetching (= higher miss rate).
These are all good trends, but they make a previously
ignorable effect significant. It is very simple to propose
solutions such as making the buses faster and/or wider,
adding more buses to the system, or reducing the miss
rate. However, these are not real solutions. Much effort is
already spent making buses as fast as possible and as wide
as possible. They cannot be made faster and wider than
what is possible. Miss rates are inherent, and they are
getting bigger.
The problem with a highly utilized bus is that when a demand miss occurs, the exigent datum (the word that
the processor needs immediately) is queued behind
nonessential and nonurgent traffic that is flowing in the
system. Those data fall into three categories:
1. Trailing edges that follow other exigent data.
2. Speculatively fetched data that are not actually needed.
3. Prefetched data that are not needed anytime soon.
Current bus protocols treat the transfers of cache lines as
atomic events, and allow exigent data to queue behind
nonexigent data.
Since data bus utilization will become large in the next
decade, a family of protocols is needed that allows the
transfers of lines to be broken up
so
that exigent data can
be transferred on demand. This requires that a small
amount
of
control information be added to a miss request
that identifies the urgency and/or priority of the miss. In
current protocols, limited control information is already
sent that tells the hierarchy the nature of the request (e.g.,
read, write, fetch exclusive). The new information requires
a few more bits depending on how exotic the protocol is.
When future memory systems send data back to a set of
processors, they will have to arbitrate each bus cycle. This
will allow the trailing edges of multiple misses to be
interleaved. Specifically, future protocols must allow the
first datum of a demand miss to interrupt the returning
traffic (other trailing edges) to get that urgently needed
datum to the requesting processor, to then resume
transferring the interrupted stream, and to merge the
trailing edge of the new stream into the existing stream.
In future systems, misses will not be atomic transactions. Instead, they will be divisible strings of contiguous data. The returning control bus must be active every cycle to handle this; i.e., since a miss will not be
returned as an atomic transaction, control data are
required to identify the data on the bus for every cycle in
which there are data on the bus. (Minimally, it must
indicate what processor the data are for, and the miss ID
number, as it currently must for each atomic transaction.)
This will be required in shared-bus systems, and will
obviate mainstream point-to-point systems because it
enables point-to-point bus performance (electrical
considerations aside) in a shared-bus environment by
minimizing queueing delay even at very high bus
utilizations.
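To make the interleaving concrete, here is a toy sketch in Python; it is an illustration under assumed behavior (a simple round-robin merge), not a specification of any real protocol. The bus is arbitrated every beat, the exigent first datum of a new demand miss preempts the trailing edges already in flight, and every beat carries a (processor, miss ID) tag so the receivers can sort out the interleaved streams.

from collections import deque

class Stream:
    """A cache-line transfer in flight: the beats of its trailing edge."""
    def __init__(self, proc, miss_id, beats):
        self.proc, self.miss_id = proc, miss_id
        self.beats = deque(range(beats))

def run_bus(streams, new_miss_cycle, new_miss):
    active = deque(streams)
    cycle = 0
    while active or new_miss:
        if new_miss and cycle == new_miss_cycle:
            # The demand miss interrupts: its first (exigent) beat goes out now;
            # its trailing edge is then merged with the other active streams.
            print(cycle, (new_miss.proc, new_miss.miss_id), new_miss.beats.popleft(), "exigent")
            if new_miss.beats:
                active.append(new_miss)
            new_miss = None
        elif active:
            s = active.popleft()
            # The (processor, miss ID) tag is sent with every beat on the bus.
            print(cycle, (s.proc, s.miss_id), s.beats.popleft())
            if s.beats:
                active.append(s)   # round-robin interleave of trailing edges
        cycle += 1

# Two trailing edges already in flight; processor 2 takes a demand miss at cycle 1.
run_bus([Stream(0, 0, 4), Stream(1, 0, 4)], new_miss_cycle=1, new_miss=Stream(2, 0, 4))

Running the sketch shows the exigent beat going out in the very cycle the demand miss arrives, after which the three trailing edges share the bus beat by beat.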
Line buffers at both the memory and the processor ends
of a bus will be required to time-multiplex the bus among
multiple trailing edges. At the memory end, the buffer
must hold as many lines as there can be misses
outstanding in the system. At the processor end, the
buffer must hold as many incoming lines as the processor
can have misses outstanding.
In current bus protocols, when a processor issues a miss
request, it sends the following three pieces of information
to the memory hierarchy:
1. Address (real) of the datum that is to be returned first (and thereby, explicitly, the address of the line that contains the datum).
2. A miss ID, which (on a shared-bus system) comprises two parts:
   • The ID of the processor that is issuing the miss.
   • The ID of the miss (in case the processor can have more than one miss outstanding).
3. The miss type, which tells the memory what status is needed for the line (e.g., shared, exclusive).
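For concreteness, a minimal sketch of these three fields follows; the type and field names are illustrative choices, not taken from any actual bus specification.

from dataclasses import dataclass
from enum import Enum

class MissType(Enum):
    SHARED = "shared"        # read-only copy of the line requested
    EXCLUSIVE = "exclusive"  # writable copy of the line requested

@dataclass
class MissRequest:
    address: int         # real address of the datum to return first (and thus the line)
    processor_id: int    # miss ID, part 1: which processor issued the miss
    miss_number: int     # miss ID, part 2: which of that processor's outstanding misses
    miss_type: MissType  # status needed for the line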
When existing memory systems return data, they send
back the second field above with the data. In this way, the
processors
on
the bus can tell what the data are (i.e., who
they are for, and the miss to which they correspond).
Existing protocols treat each miss as an atomic
transaction, and they lock up the bus for the duration of
the associated line transfer. For these protocols, the
second field (above) need not be sent back
on
every bus
cycle; it is just sent at the beginning of each transaction.
In future bus protocols, each bus cycle will be used for a
unique transaction, and the second field (above) will be
sent
on
every cycle to identify the data for that cycle.
In future protocols, there could be as many as three
additional control fields sent with each miss request.
All three together would provide a much richer set of
protocols than is actually needed. Nonetheless, all fields
are listed below to provide a complete specification for
future protocols. Each field is optional, and each can be
arbitrarily simple or complex. The new fields are the
following:
1.
A priority level is used to distinguish as many
of
the
following types of misses as desired:
• Demand data fetch (exigent).
• Demand instruction fetch (exigent).
• Demand data fetch down conditional path (conditionally exigent).
• Demand instruction fetch down conditional path (conditionally exigent).
• Data fetch down speculative path (speculatively exigent).
• Instruction fetch down speculative path (speculatively exigent).
• Data prefetch initiated by prefetch mechanism (prefetch).
• Instruction prefetch initiated by prefetch mechanism (prefetch).
2.
Multiplex control specifies how this data stream is to be
multiplexed with respect to the other active streams.
Assuming that Field
1
above is not used, Field
2
might
specify one of the following actions:
• Put the entire stream behind the currently active traffic. (This is the current default; i.e., this is what current protocols do.)
• Suspend all other active traffic for one bus cycle to return the first datum from this miss immediately, then resume the suspended traffic, and put the remainder of this stream at the end of the queue.
• Suspend all other active traffic to return this entire stream, then resume the suspended traffic.
• Suspend all other active traffic for one bus cycle to return the first datum from this miss immediately, then interleave the packets of this stream with the ongoing traffic.
• Return the packets from this stream only if there is no other traffic present.
• Other things deemed useful.
If Field
1
above is used, Field
2
would specify the same
control functions listed above, except that it would
enforce them with respect to priority levels, e.g.:
• Suspend traffic of equal or lower priority for one cycle to return the first packet of this stream, then queue the remainder of this stream behind traffic of higher or equal priority.
• Suspend all traffic for one cycle to return the first packet of this stream, then queue the remainder of this stream behind traffic of higher or equal priority.
3.
A
command field rescinds or changes instructions as to
how previously issued misses are to be handled. This
allows for the changing of characteristics of misses that
have not yet begun, or of misses that are in progress.
For example, Field
3
could specify the following:
• Change the type of request of an outstanding miss. This allows the upgrade of a read-only request to an exclusive request, or the downgrade of an exclusive request.
• Change Field 1 on an outstanding miss. This allows the upgrade of the priority of a miss, which would be appropriate if a speculative prefetch became a known demand miss.
• Rescind miss request (i.e., cancel a previous request). This would be useful if it were found that a prefetch that had previously been issued was down a path that is now known not taken.
• Reset starting address. This is useful if a previously initiated prefetch for a line was done without specific knowledge of the address of the datum that would be needed first, and the address of that datum is now known.
• Add a new starting address. This is useful if an upstream reference to an incoming line is discovered. It is different from the previous command because it does not change the first doubleword address; it just says that instead of returning the line sequentially, the line should be returned starting with the first two explicitly named (nonsequential) words.
• Other things deemed useful.
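A minimal sketch of how the three optional fields above might ride along with the existing request fields; the names and encodings are illustrative assumptions, not a specification, and MissRequest refers to the earlier sketch.

from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Priority(Enum):                  # Field 1: urgency of the miss
    DEMAND_DATA = auto()               # exigent
    DEMAND_INSTRUCTION = auto()        # exigent
    CONDITIONAL_DATA = auto()          # conditionally exigent (conditional path)
    CONDITIONAL_INSTRUCTION = auto()
    SPECULATIVE_DATA = auto()          # speculatively exigent (speculative path)
    SPECULATIVE_INSTRUCTION = auto()
    PREFETCH_DATA = auto()             # prefetch
    PREFETCH_INSTRUCTION = auto()

class MuxControl(Enum):                # Field 2: how to multiplex this stream
    QUEUE_BEHIND = auto()              # current default: whole stream waits its turn
    FIRST_DATUM_THEN_QUEUE = auto()
    WHOLE_STREAM_NOW = auto()
    FIRST_DATUM_THEN_INTERLEAVE = auto()
    ONLY_IF_BUS_IDLE = auto()

class Command(Enum):                   # Field 3: modify a previously issued miss
    CHANGE_TYPE = auto()               # e.g., upgrade read-only to exclusive
    CHANGE_PRIORITY = auto()           # e.g., a speculative prefetch becomes a demand miss
    RESCIND = auto()                   # cancel the previous request
    RESET_STARTING_ADDRESS = auto()    # return the line starting at a newly known datum
    ADD_STARTING_ADDRESS = auto()      # also return explicitly named words first

@dataclass
class ExtendedMissRequest:
    base: "MissRequest"                     # the three existing fields (earlier sketch)
    priority: Optional[Priority] = None     # each new field is optional
    mux: Optional[MuxControl] = None
    command: Optional[Command] = None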
This new family of protocols will emerge in the coming
decade. It enables high performance
on
a shared bus, and
it allows the bus to be run at very high utilization with
minimal queueing delays.
There is currently a resurgence of interest in sectored
caches for some of the reasons outlined above. In a future
generation of processors, sectoring will re-emerge as an
intermediate step in the evolution toward these protocols.
9.
Power and microarchitecture for high-
frequency design
In the description of the IAS machine, the power budget
was shown to be partly responsible for a basic aspect of
many modern ISAs known
as
the von Neumann
bottleneck. In fact, power will play an important role in
shaping microarchitecture in the coming decade.
A popular viewpoint is that power is an unfortunate
constraint which causes a divergence in the
microarchitectures that target the server and client spaces.
The reasoning is that the client requires low power, and
thus a simple core, and the server requires a high degree
of ILP, hence high power.
Instead, consider power to be a metric that is useful for guiding microarchitectural development into the highest
performance realm for the server. The client and server
spaces will converge to the same microarchitectural core
with the same physical floorplans, albeit with different
circuit designs.
Specifically, there has been confusion in the last decade
because there is no general methodology for assessing the
impact of a microarchitectural feature
on
cycle time. Most
proposed changes to a microarchitecture target ILP. It is
known that adding microarchitectural features to a
processor makes it bigger, and that making a processor
bigger probably does not improve its cycle time, but the
extent to which an added microarchitectural feature hurts
cycle time is generally not known.
It is human nature to give credit to a design where
there is a tangible benefit (quantifiable CPI reduction),
and to discount the unknown. That is, if the cycle-time
impact of an added feature might be negligible, we tend to
assume that it is negligible; if the CPI benefit is known,
adding a feature to a machine to achieve higher ILP is
perceived to be desirable on balance.
This is a dangerous trend, particularly in CMOS,
because CMOS is a wiring-driven technology. In CMOS,
the area of a machine is (almost) unrelated to the number
of gates in the machine, and is more strongly dependent
on
the number
of
wires that must be run to connect them.
Things that drive ILP tend to drive wiring-track usage in
an exponential manner. When parallel execution units,
parallel buses, and multiple ports are added to parts of a
machine, wiring (hence area) increases dramatically.
While it is true that planarization technology allows
for
more levels of wiring, more than a few levels are useless
for increasing interconnectivity. Very simply, once the
lower levels of metal are heavily congested, vias cannot be
dropped through them, and the upper levels of metal
cannot be connected to the silicon surface. Sai-Halasz
advocates using upper levels of metal for “fat wires” which
have lower resistance, hence higher speed for signals that
must travel more than a few millimeters
[8].
Figures
8(a)
and
8(b)
respectively show the cycle time
and the CPI as a function of the degree
of
ILP in a
superscalar processor.
In
this case, CPI decreases to an
asymptote, and cycle time increases linearly as more
parallelism is added to the processor. This is the dual of
Figure 4 (the superpipelining case),
so
MIPS
as shown in
Figure
8(c)
has the same form as in Figure 4(c). The
lesson is that there is an optimal level of
ILP,
and the
level appears to be small (say 2 to 4, or, when pressed for an exact number, π),
mostly because of the sensitivity of
cycle time to adding wires to the machine.
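The shape of this argument can be made concrete with assumed functional forms; the constants below are illustrative, not measurements. If CPI falls toward an asymptote as issue width grows while cycle time rises roughly linearly because of the added wiring, then MIPS has an interior optimum at a small width.

# Illustrative only: assumed forms for the curves in Figure 8, not measured data.
def cpi(w, cpi_inf=0.7, c=1.3):
    return cpi_inf + c / w        # CPI decreases toward an asymptote with width w

def cycle_time_ns(w, t0=2.0, k=0.25):
    return t0 + k * w             # cycle time grows roughly linearly with added wiring

def mips(w):
    return 1000.0 / (cpi(w) * cycle_time_ns(w))

best = max(range(1, 17), key=mips)
print(best, round(mips(best), 1))  # the optimum lands at a small issue width (here, 4)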
While the coefficient of the line in Figure 8(a) is very
difficult to assess, and is highly debatable, power can be
the metric that provides real guidance in achieving high
performance. When hardware is added to a machine, the
power impact is readily tangible. If the goal of a
microarchitecture is low power and, more specifically, low
work (the product of power and time as it affects battery
life), only those features that pervasively provide low
CPI
are included. Features that only help CPI sometimes (and
that hurt cycle time all of the time) are eliminated if low
power is a goal. Those elements should also be eliminated
if high performance is a goal.
As
the industry pushes processor design into the GHz
range and beyond, there will be a resurgence
of
the RISC
approach. While superscalar design is very fashionable, it
remains
so
largely because its impact on cycle time is not
well understood. Complex superscalar design stands in the
path of the highest performance; he who achieves the
highest MHz runs the fastest.
A
clear focus on power yields a clear focus on high
performance. This trend will make the microarchitectures
of the client and the server converge. In the client
processor, the circuit design will be optimized for low
power. Since CMOS is wiring-driven, adding active silicon
area via resizing devices can usually be done with little
impact to the physical floorplan. The high-performance
server can then be derived directly from the client core by
resizing devices to optimize for speed.
Complex CPI, relativity, and adiabatics
In
a previous subsection on complex
CPI,
the discussion
was limited to values of
CPI
in the first quadrant only.
Now that the discussion has turned to power, the other
quadrants in Figure
5
take
on
significant physical interest.
In
particular, points in quadrants
II and III
have negative
real components. The only reasonable interpretation of
such points is that they represent the performance of a
processor that is running the program backward.
One possible interpretation of quadrants
III
and IV,
which have negative complex components, is that they
represent a new paradigm in circuit performance; in
particular, they represent processors that run faster than
the speed of light. According to simple relativistic theory,
when the machine runs faster than light, time moves
backward relative to our inertial frame of reference.
According to this theory, quadrants I and III are indistinguishable, since quadrant III has the computation being run in reverse while time moves backward. As such, quadrant III is uninteresting.
Quadrants II and IV are of real interest, particularly with the recent advent of adiabatic computing. A processor that can run adiabatically in quadrant II acts as a power source, hence a perpetual motion machine.
In
quadrant IV, if a machine enters an adiabatic realm, it
becomes a black hole. If this happens, it will change the
world as we know it.
10.
Conclusion
In this paper, several points were made that are
antithetical to some of the modern philosophy in
processor microarchitecture. These points are based
on
simple observations relating to the machinations of
electronic von Neumann computers, which have been in
existence since the onset of this industry.
First, the
most
popular performance metric,
IPC
(instructions per cycle), is the reciprocal of the metric that
should be used,
CPI
(cycles per instruction). This is
primarily because
CPI
is a simple dot product of a few
numbers that any experienced designer should have at his
fingertips. It is intuitive, and it makes for remarkably
quick and remarkably accurate estimates.
On
the other hand,
IPC
does not yield to intuition.
Instead, it shrouds fundamental issues in mystery, and it
has much of the industry (and academia) running down
blind corridors
in
a state of general confusion.
Second, the separability of
CPI
into three independent
components was demonstrated. The three components
account for the intrinsic work done by the computer, the
pipeline structure of the computer, and the memory
hierarchy. It was argued that a solid grasp of each of these
three components is necessary in understanding the
performance of a superscalar processor, because the scalar
components are hard bounds for the analogous superscalar
components. Essentially, the argument is that one must
have a grasp of the simple case before one can hope to
understand the general case.
Third, attention was focused on a trend in future
systems in which data bus utilizations cross a threshold
that will make queueing at the memory bus a limitation of
system performance. A new family of bus protocols that
can mitigate this effect was proposed. These protocols will
emerge in the coming decade because of the impending
delays due to queueing.
Finally, an argument was made that power consumption will drive the development of microarchitecture in
the coming decade, and that the aspects of a
microarchitecture that result in low power also result in
high performance. This is particularly true in
CMOS,
which is a wiring-driven technology. This trend will cause
the client microarchitecture and the server
microarchitecture to converge.
*Trademark or registered trademark of International Business
Machines Corporation.
**Trademark
or
registered trademark of Standard
Performance Evaluation Corporation.
References
1. J. P. Hayes, Computer Architecture and Organization, McGraw-Hill Book Co., Inc., New York, 1988.
2. J. von Neumann, Collected Works, Vol. 5, Design of Computers, Theory of Automata and Numerical Analysis, The Macmillan Company, New York, 1963.
3. P. Emma, J. Knight, J. Pomerene, R. Rechtschaffen, and F. Sparacio, "Components of Uniprocessor Performance," Research Report RC-12203, IBM Thomas J. Watson Research Center, Yorktown Heights, New York, October 1986.
4. P. Emma and E. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Trans. Computers C-36, No. 7, 859-875 (July 1987).
5. R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM J. Res. Develop. 11, No. 1, 25-33 (January 1967).
6. J. Liptay, "Computer System with Logic for Writing Instruction Identifying Data into Array Control Lists for Precise Post Branch Recoveries," U.S. Patent 5,134,561, July 1992.
7. J. Smith and A. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," presented at the 12th Annual International Symposium on Computer Architecture, June 1986.
8. G. Sai-Halasz, "Performance Trends in High-End Processors," Proc. IEEE 83, 20-36 (1995).
Received August
8,
1996; accepted for publication
February 19, 1997
Philip
G.
Emma
IBM
Research Division, Thomas
J.
Watson
Research Center,
P.O.
Box 218,
Yorktown Heights, New York
10598
(pemma@watson.ibm.com).
Dr. Emma received his B.S.,
M.S., and Ph.D. degrees in electrical engineering from the
University
of
Illinois at Urbana-Champaign.
In
1983, he
joined the Advanced Computer Architecture group at the
IBM Thomas
J.
Watson Research Center in Yorktown
Heights, New York, where he did research
in
high-
performance computer architecture.
In
1992, he became
Manager of VLSI Systems and Design, and was a design-team
leader
on
a future IBM CMOS mainframe processor. He is
currently the Technical Assistant to the Vice President
of
Systems Technology and Science. Dr. Emma has achieved
18
Invention Plateaus and has earned the IBM Research Division
honorary title
of
“Master Inventor.” He is a Fellow
of
the
IEEE.