The Motorola 68060 Microprocessor


                        Joe Circello and Floyd Goodrich

                   Microprocessor and Memory Technology Group
                                  Motorola Inc.


Abstract

	The Motorola 68060 is the fourth-generation microprocessor of the 
M68000 Family.  User object code-compatible with previous family members, 
it delivers 3 to 3.5 times the performance of the previous generation 
processor in this family, the 68040.  Performance features include a 
superscalar integer unit, a high-performance floating point unit, dual 
8-Kbyte on-chip caches, a branch cache, and on-chip memory-management units.  
A streamlined design enables high-performance techniques to achieve a high 
level of parallel instruction execution.  Improved performance at a low 
cost makes the 68060 an ideal processor for the mid to high range of 
desktop computing applications, and compatibility features enable it to 
upgrade the performance of existing 68040-based systems easily.  This paper 
describes the operation of the 68060.

Figure 1 - Simplified 68060 Block Diagram (***NOT SHOWN IN ASCII-ONLY COPY***)


Introduction and Overview

	The 68060 is the fourth-generation microprocessor of Motorola's 
M68000 Family of CISC micro-processors.  It is a single-chip implementation 
that employs a deep pipeline, dual-issue superscalar execution, a branch 
cache, a high performance floating point unit (FPU), 8 Kbytes each of 
on-chip instruction and data caches, and on-chip demand paged memory 
management units (MMUs).  These features allow it to sustain execution 
rates of less than one clock per instruction. 
	In order to meet the performance goals of the 68060, instruction 
execution times needed to decrease, and parallel operations needed to 
increase over previous generations of M68000 microprocessors.  A superscalar 
instruction dispatch micro-architecture is the most obvious feature of this 
increased parallelism on the 68060.  Superscalar architectures are 
distinguished by their ability to dispatch two or more instructions per clock 
cycle from an otherwise conventional instruction stream.
	Figure 1 shows a block diagram of the 68060.  In addition to the 
superscalar features, this single chip has many other performance, upgrade 
and system integration features including:
	* 100% user-mode object code compatibility with 68040
	* Dual-issue superscalar instruction dispatch implementation of 
	  M68000 architecture
	* IEEE Compatible on-chip FPU
	* Branch Cache to minimize pipeline refill latency
	* Separate 8 Kbyte on-chip instruction and data caches with 
	  simultaneous access
	* Bus Snooping
	* 68040 compatible bus protocol or new high-speed bus protocol
	* 32-bit nonmultiplexed address and data bus
	* Four-entry write buffer
	* Concurrent operation of Integer Unit, FPU, MMUs, caches, 
	  Bus Controller, and Pipeline
	* Sophisticated power management subsystem
	* Low-power 3.3V operation
	* JTAG Boundary Scan

Design Targets

	The design goals of the 68060 included providing a simple upgrade 
path for existing M68000 Family designs while also supplying a basis for 
Motorola's successful 68EC0x0 Family of embedded controllers and for the 
68300 Family of modular integrated controllers.
	Initial requirements for the targeted 68060 were to provide a factor 
of three performance enhancement over a 25 MHz 68040 with existing compiler 
technology.  Architectural enhancements were to provide at least a 50% 
improvement, with a doubling of clock frequency supplying the remaining 
factor of two.  The 
performance estimates reflect analysis of existing object code; additional 
performance advantages are, of course, available when using compilers 
designed specifically for the 68060.
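
	The arithmetic behind the factor-of-three target can be sketched as 
follows.  This is only an illustration: the function name is hypothetical, 
and the 50 MHz figure assumes the doubled clock is measured against the 
25 MHz 68040 baseline.

```python
# Assumed decomposition of the 3x target: architectural gain times clock gain.
ARCH_GAIN = 1.5      # "at least a 50% improvement" from the architecture
BASE_MHZ = 25.0      # 68040 baseline clock
TARGET_MHZ = 50.0    # assumed: a doubling of the baseline clock

def overall_speedup(arch_gain, base_mhz, target_mhz):
    """Combine the architectural gain with the clock-frequency ratio."""
    return arch_gain * (target_mhz / base_mhz)

print(overall_speedup(ARCH_GAIN, BASE_MHZ, TARGET_MHZ))  # 3.0
```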
	In addition to software compatibility, the 68060 preserves the 
investment in board-level ASICs by providing bus compatibility with the 
68040.  This supersocket approach facilitates upgrade of all existing and 
future 68040-based systems.
	The 68060 uses approximately 2.4 million transistors.  The part is 
a static CMOS design based on a 0.5 um triple level metal wafer process.  
This process enables the 68060 to operate from a 3.3 volt power supply, 
reducing power by more than 50% relative to a 5.0 volt supply.  Since the 
68060 minimizes power dissipation through a variety of architectural and 
circuit techniques, it is able to offer high performance processing to the 
laptop and portable markets in addition to the traditional computer-system 
markets.

Architectural Features

	The architecture of the 68060 revolves around its novel integer unit 
pipeline.  Taking advantage of many of the same performance enhancements 
used by RISC designs as well as developing new architectural techniques, 
the 68060 harnesses new levels of performance for the M68000 Family.
	The superscalar micro-architecture actually consists of two distinct 
parts:  a four-stage instruction fetch pipeline (IFP) responsible for 
accessing the instruction stream and dual four-stage operand execution 
pipelines (OEPs) which perform the actual instruction execution.  These 
pipeline structures operate in an independent manner with a FIFO instruction 
buffer providing the decoupling mechanism.  A branch cache minimizes the 
latency effects of change of flow instructions by allowing the IFP to 
detect changes in the instruction prefetch stream well in advance of 
their actual execution by the OEPs.
	The 68060 is a full internal Harvard architecture.  The instruction 
and data caches are designed to support concurrent instruction fetch 
and operand read and operand write references on every clock cycle.  This 
organization, coupled with a multi-ported register file, provides the 
necessary bandwidth to maximize the throughput of the pipelines.  The 
operand execution pipelines operate in a lock-stepped manner that provides 
simultaneous, but not out-of-order, program execution.  The net result is 
a machine architecture that is invisible to existing applications and fully 
supports the M68000 programming model, including precise exceptions.
	The 68060 external bus interface provides a superset of 68040 
functionality.  Maintaining 32-bit widths on both the address and data 
bus as well as a bursting protocol for cacheable memory, the 68060 supports 
transfers of one, two, four, or 16 bytes in a given bus cycle.  The system 
designer can, however, choose to operate in one of two modes:  a mode 
compatible with the 68040 protocol or a new mode consistent with higher 
frequency bus designs.  By allowing this choice, the 68060 can easily fit 
into upgrades of existing designs as well as new high frequency 
implementations.

Pipeline Organization

	The IFP is responsible for prefetching instructions and loading them 
into the FIFO instruction buffer.  One key aspect of the design is the branch 
cache, which allows the IFP to detect changes in the instruction stream 
based on past execution history.  This allows the IFP to provide a 
constant stream of instructions to the instruction buffer to maximize 
the execution rates of the OEPs.  The IFP is implemented as a four-stage 
design shown in Figure 2.

Figure 2 - The IFP of the 68060 (***NOT SHOWN IN ASCII-ONLY COPY***)

	The four stages of the IFP are:  Instruction Address Generation (IAG), 
Instruction Cache (IC), Instruction Early Decode (IED) and Instruction 
Buffer (IB).  The instruction and branch caches are integral components of 
the IFP.
	Four operations can be occurring concurrently in the IFP.  The 
IAG stage calculates the next prefetch address from a number of possible 
sources.  The variable length of the M68000 Family instruction set as well 
as change-of-flow detection make this stage critical to the performance of 
the 68060.  After the IAG sends the appropriate address to the instruction 
cache, the IC stage of the IFP is responsible for performing the cache lookup 
and fetching the bit pattern of the instruction.  The IED stage of the 
pipeline analyzes the bytes fetched from the instruction stream and builds 
an extended operation word.  This lookup stage effectively converts the 
variable-length instruction with multiple formats into a fixed-length 
extended operation word that is used by the OEPs in all subsequent processing.  
At the conclusion of the IED stage, the prefetched bytes along with the 
extended operation word issue into the instruction buffer.  The IB stage 
reads instructions from the 96-byte FIFO buffer and loads them into 
the dual OEPs.  The FIFO effectively decouples the operation of the IFP from 
the operations of the dual OEPs.

Figure 3 - The OEP Units of the 68060 (***NOT SHOWN IN ASCII-ONLY COPY***)


	Consecutive instructions issue from the FIFO instruction buffer into 
the instruction registers of the dual OEPs.  The operand execution pipelines, 
known as the primary OEP (pOEP) and the secondary OEP (sOEP), are partitioned 
into a 4-stage implementation depicted in Figure 3.  The four stages of the 
OEPs are: Decode and Select (DS), operand Address Generation (AG), Operand 
Cycle (OC) and the EXecute cycle (EX).  For instructions writing data to 
memory, there are two additional pipeline stages:  the Data Available (DA) 
and Store (ST) cycles.
	The Decode and Select stage of the OEPs provides two primary 
functions:  this stage determines the next state for the entire operand 
pipeline and selects the components required for operand address calculation.  
To determine the next state of the OEPs, the DS cycle logic tests the 
extended operation words to ascertain the number of instructions that can 
issue into the AG stage.  If multiple instructions can issue into the AG 
stages in parallel, the first and second instructions move into the 
respective AG stages. If only a single instruction can issue because of 
architectural constraints, the first instruction issues into the pOEP, and 
the DS stage evaluates the second and third instructions as a pair during 
the next clock cycle.  The net effect is a sliding 2-instruction window to 
examine possible pairs of instructions for parallel execution.  A dedicated 
adder located in the AG stage sums the three components of the effective 
address:  the base, the index and the displacement.
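
	The sliding 2-instruction window described above can be sketched in a 
few lines of Python.  This is a toy model, not the 68060's dispatch logic:  
the function names are hypothetical, and the pairing rule shown checks only 
one constraint (floating point instructions cannot issue into the sOEP), 
whereas the real DS stage tests many more.

```python
def toy_can_pair(first, second):
    """Assumed rule: the second instruction may not be floating point,
    identified here simply by a mnemonic starting with 'f'."""
    return not second.startswith("f")

def dispatch_groups(instrs, can_pair):
    """Group an in-order instruction list into per-clock issue groups."""
    groups = []
    i = 0
    while i < len(instrs):
        if i + 1 < len(instrs) and can_pair(instrs[i], instrs[i + 1]):
            groups.append((instrs[i], instrs[i + 1]))  # (pOEP, sOEP)
            i += 2
        else:
            groups.append((instrs[i],))                # pOEP only
            i += 1                                     # window slides by one
    return groups

print(dispatch_groups(["lsl.l", "fadd.d", "add.l", "move.l"], toy_can_pair))
# [('lsl.l',), ('fadd.d', 'add.l'), ('move.l',)]
```

	When a pair is rejected, only the first instruction issues and the 
rejected second instruction becomes the head of the next window, which is 
why the model advances by one rather than two.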
	The Operand Cycle (OC) of the OEPs performs the actual fetch of 
operands required by the instruction.  For memory operands, the OEP accesses 
the data cache in this cycle to retrieve the data.  For register operands, 
the OEP accesses the register file containing all the general-purpose 
registers during the OC stage. At the conclusion of the OC cycle, the execute 
engines receive the required operands.  The EXecute cycle (EX) performs the 
operations required to complete the instruction execution including updating 
the condition codes.  If the destination of the instruction is a data or an 
address register, the result is available at the end of the EX stage; if 
the destination is a memory location, the operation requires two additional 
cycles.  First, there is a Data Available (DA) stage where the destination 
operand issues to the data cache, which aligns the operand.  Second, updates 
to the data cache occur during the STore (ST) cycle.  Additionally, there is 
a four longword FIFO write buffer that is selectable on a page basis and 
serves to decouple the operation of the OEPs from external bus cycles.
	Since this is an order-two superscalar machine (dual instruction 
issue), the sOEP is conceptually a copy of the pOEP.  A notable exception to 
this concept is the fact that the sOEP executes only a subset of the complete 
instruction set.  As an example, the floating point execute engine resides 
only in the pOEP.  Consequently, all floating point instructions must execute 
only in the pOEP.  As instructions travel down the OEPs, they remain 
lock-stepped.  This ensures that there is no out-of-order execution and thus 
greatly simplifies support for the precise exception model of the M68000 
Family.  The micro-architecture of the 68060 supports a number of optimizations 
to increase the number of superscalar instruction dispatches.  In internal 
evaluations of traces from existing object code totaling several billion 
instructions, 50% to 60% of instructions execute as pairs.
	As the preceding discussion of the operand pipeline stages shows, 
all data cache read references occur in the OC stage while data cache write 
references occur in the ST stage.  The data cache uses a 4-way interleaving 
scheme to allow simultaneous operand read and write operations from both 
OEPs.  The data cache directories are a single-ported design.  As a result, 
within a superscalar pair of instructions, the 68060 only allows a single 
operand memory reference.  The data cache also supports single-cycle 
references of 64-bit double-precision floating-point operands.
	A common drawback to long pipelines is the penalty associated with 
refilling the pipeline when a change of program flow occurs.  Condition code 
evaluation occurs in the EX stage, but waiting for a branch instruction to 
reach this point needlessly restricts performance.  Instead, the 68060 
contains a 256-entry Branch Cache (BC) which predicts the direction of a 
branch based on past execution history well in advance of the actual 
evaluation of condition codes.
	The BC stores the Program Counter value of change-of-flow instructions 
as well as the target address of those branches.  The BC also uses some 
history bits to track how each given branch instruction has executed in the 
past.  The 68060 checks the BC during the IC stage of the IFP, the same stage 
that performs the lookup into the Instruction Cache.  If the BC indicates that 
the instruction is a branch and that this branch should be predicted as taken, 
the IAG pipeline stage is updated with the target address of the branch 
instead of the next sequential address.  This approach, along with the 
instruction folding techniques that the BC uses, allows the 68060 to achieve a 
zero-clock latency penalty for correctly predicted taken branches.
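
	A minimal sketch of a branch cache lookup is given below.  Only the 
256-entry size and the stored fields (branch PC, target address, history) 
come from the text.  The direct-mapped organization, the index function and 
the 2-bit saturating counter are assumptions chosen for illustration, since 
the actual organization and prediction algorithm are not described here.

```python
class BranchCache:
    ENTRIES = 256  # size from the text; everything else below is assumed

    def __init__(self):
        # Each valid entry is [branch_pc, target, counter]; counter is 0..3.
        self.table = [None] * self.ENTRIES

    def _index(self, pc):
        # M68000 instructions are 16-bit aligned, so drop the low bit.
        return (pc >> 1) % self.ENTRIES

    def predict(self, pc):
        """Return the predicted target, or None for sequential fetch."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, pc, taken, target):
        """Record the resolved outcome of the branch at pc."""
        i = self._index(pc)
        entry = self.table[i]
        if entry is None or entry[0] != pc:
            if taken:
                self.table[i] = [pc, target, 2]  # allocate as weakly taken
            return
        if taken:
            entry[1] = target
            entry[2] = min(3, entry[2] + 1)
        else:
            entry[2] = max(0, entry[2] - 1)

bc = BranchCache()
bc.update(0x1000, taken=True, target=0x0F00)
print(hex(bc.predict(0x1000)))  # 0xf00
```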
	If the BC predicts a branch as not-taken, there is no discontinuity 
in the instruction prefetch stream.  The IFP continues to fetch instructions 
sequentially.  Eventually, the not-taken branch instruction executes as a 
single-clock instruction in the OEP, so correctly predicted not-taken 
branches require a single clock to execute.  These predicted as not-taken 
branches allow a superscalar instruction dispatch, so in many cases, the next 
instruction executes simultaneously in the sOEP.
	The 68060 performs the actual condition code checking to evaluate the 
branch conditions in the EX stage of the OEP.  If a branch has been 
mispredicted, the 68060 discards the contents of the IFP and the OEPs, and 
the 68060 resumes fetching of the instruction stream at the correct location.  
To refill the pipeline in this manner, there is a seven-clock penalty for a 
mispredicted branch.  If the BC correctly predicted the branch, the OEPs 
execute seamlessly with no pipeline stalls.  Internal studies of the 
prediction algorithm used on the 68060 show greater than 90% accuracy from 
statistics gathered from several billions of instructions from applications 
across many runtime environments.
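
	The branch costs quoted above can be combined into a rough expected 
cost per branch.  The sketch below is a back-of-the-envelope estimate 
only:  the function name is hypothetical, the 60% taken fraction is an 
assumed value rather than a measured one, the mispredicted case is charged 
the seven-clock penalty alone, and the taken fraction is assumed to be the 
same among correctly predicted branches as overall.

```python
def expected_branch_clocks(accuracy, taken_fraction,
                           taken_cost=0, not_taken_cost=1, mispredict_cost=7):
    """Average clocks per branch under the stated assumptions."""
    correct = taken_fraction * taken_cost + (1 - taken_fraction) * not_taken_cost
    return accuracy * correct + (1 - accuracy) * mispredict_cost

# 90% accuracy is from the text; the 60% taken fraction is assumed.
print(round(expected_branch_clocks(0.90, 0.60), 2))  # 1.06
```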

Floating Point Unit

	The floating point unit (FPU) of the 68060 provides complete binary 
compatibility with previous M68000 Family floating point solutions.  The 
68060 performs all internal operations in 80-bit extended precision and 
completely supports the IEEE 754 floating point standard.
	Conceptually, the FPU appears as another execute engine in the EX 
stage of the pOEP.  A 64-bit data path between the data cache and the FPU 
optimizes the FPU for single-cycle references of 32- or 64-bit memory 
operands.  As previously noted, all floating point instructions must execute 
through the pOEP.  However, integer instructions can be simultaneously 
dispatched into the sOEP with most FPU instructions, and the 68060 supports 
overlap between the integer execute engines and the FPU.  Once a multi-cycle 
FPU instruction is dispatched, the pOEP and sOEP continue to dispatch and 
complete integer instructions (including change-of-flow instructions) until 
another FPU instruction is encountered.  At this point, the OEPs stall until 
the FPU execute engine is available for the next instruction.
	The FPU's internal organization consists of three units:  the adder, 
the multiplier and the divider.  The 68060's design does not support 
concurrent floating point execution; only one of these functional units is 
active at a time.  Table 1 shows execution times for the 68060 FPU. 

		Instruction    CPU Clocks
		  FMOVE             1
		  FADD              3
		  FMUL              4
		  FDIV             24
		  FSQRT            66

Table 1 - 68060 Floating Point Execution Times
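
	The interaction between Table 1 and the stall rule described above 
can be sketched as follows.  The latencies are those of Table 1.  The stall 
model is a simplifying assumption:  it treats the FPU execute engine as busy 
for the full listed clock count from the moment of dispatch, and the 
function name is hypothetical.

```python
# Execution times from Table 1, in CPU clocks.
FPU_CLOCKS = {"fmove": 1, "fadd": 3, "fmul": 4, "fdiv": 24, "fsqrt": 66}

def fpu_stall(prev_op, clocks_since_dispatch):
    """Clocks a following FPU instruction waits for the execute engine,
    assuming the engine is busy for the whole listed latency."""
    return max(0, FPU_CLOCKS[prev_op] - clocks_since_dispatch)

# An FPU instruction arriving two clocks after an FMUL was dispatched:
print(fpu_stall("fmul", 2))  # 2
# Integer work has covered the full FDIV latency, so no stall:
print(fpu_stall("fdiv", 30))  # 0
```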

Pipeline Example

	Figure 4 shows an example of the 68060 pipeline operation.  The code 
shown comes from a commercially available compiler and represents the 
inner SAXPY loop from the matrix300 program from the SPEC89 benchmark suite.  
Since the OEPs are decoupled from the IFP, this example only focuses on 
the OEPs.
	This loop executes 13 instructions in only ten clock cycles, 
producing a steady-state performance of 0.77 clocks per instruction (CPI).  
This code includes two multi-cycle FPU instructions (4-cycle FMUL and 3-cycle 
FADD), but the superscalar micro-architecture is able to effectively exploit 
the parallelism within the loop to achieve a less than one CPI measure.
	This example code loop demonstrates several major architectural 
features of the 68060.  Of the 13 instructions, the 68060 dispatches four 
2-instruction pairs (at cycles 1, 2, 4 and 5), one group of three 
instructions (at cycle 9) and two individual instructions (at cycles 3 and 8).  
At cycle 3, the pair of instructions being examined is {pOEP = lsl.l, 
sOEP = fadd.d}.  Since all floating-point instructions must issue into the 
pOEP, the fadd.d does not issue into the sOEP.  On the next cycle, a new 
2-instruction pair is examined {pOEP = fadd.d, sOEP = add.l}, and at this 
time, both instructions issue down the OEPs.  At cycles 6 and 7, the pipeline 
stalls on the fadd.d instruction as the 4-cycle fmul completes execution. The 
floating-point store operation at cycle 8 inhibits any sOEP dispatch because 
of certain post-exception fault possibilities.  At cycle 9, an instruction 
triplet is dispatched {add.l, subq.l, bcc.b}.  Recall the branch cache 
utilizes various instruction folding techniques that effectively allow this 
predicted as taken branch to execute in 0 cycles.  Finally, at cycle 10, the 
pipeline stalls for one clock on the floating-point store instruction as it 
waits for the completion of the three-cycle fadd.
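
	The dispatch counts in this walkthrough can be tallied to confirm 
the quoted CPI figure.  The per-cycle counts below are transcribed from the 
description above, with cycles 6, 7 and 10 treated as stall cycles in which 
nothing new is dispatched.  The names are illustrative only.

```python
# Instructions dispatched in each clock cycle of the loop.
DISPATCHED = {1: 2, 2: 2, 3: 1, 4: 2, 5: 2, 6: 0, 7: 0, 8: 1, 9: 3, 10: 0}

def clocks_per_instruction(dispatched):
    """Return (instruction count, clock count, CPI) for one loop iteration."""
    instructions = sum(dispatched.values())
    clocks = len(dispatched)
    return instructions, clocks, clocks / instructions

instructions, clocks, cpi = clocks_per_instruction(DISPATCHED)
print(instructions, clocks, round(cpi, 2))  # 13 10 0.77
```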

Power Management On Chip

	With 2.4 million transistors operating at frequencies of 50 MHz and 
higher, power management becomes a crucial issue on the 68060.  From its 
inception, the 68060 design focused on minimizing chip-level power 
dissipation.  Three areas are of primary interest for power dissipation.
	The 68060 operates from a 3.3 volt power supply.  Since power 
dissipation is a function of the square of the power supply voltage, simply 
changing the power supply voltage to 3.3 volts results in a 56% reduction 
in power compared to a 5 volt power supply.  In addition to a lower supply 
voltage, the 68060 is a completely static design.  The 68060's operating 
frequency, which linearly affects chip-level power dissipation, can vary 
dynamically down toward the DC range.  Although the 68060 is a 3.3 volt part, 
its I/O buffers interface to either 3 volt or 5 volt peripherals and memory, 
facilitating upgrades of existing designs.
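
	The 56% figure follows directly from the square-law dependence on 
supply voltage.  The sketch below assumes the usual first-order model in 
which dynamic power is proportional to frequency times voltage squared.  
The function name and the normalized frequency units are illustrative.

```python
def relative_power(v, f, v_ref=5.0, f_ref=1.0):
    """Dynamic power relative to a reference point, assuming power is
    proportional to f * V**2 (a first-order model)."""
    return (f / f_ref) * (v / v_ref) ** 2

# Same frequency, supply lowered from 5.0 V to 3.3 V.
reduction = 1.0 - relative_power(3.3, 1.0)
print(f"{reduction:.0%}")  # 56%
```

	Because frequency enters linearly in this model, halving the clock at 
3.3 volts would halve the remaining power again.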
	Sophisticated power management circuitry on chip dynamically controls 
and minimizes power consumption.  This circuitry selectively updates modules 
on the 68060 on a clock-by-clock basis, dynamically shutting off the circuits 
not required to support the activities in the current clock cycle.  Entire 
areas of the 68060 can shut off for long periods of time when they are not 
required.
	The 68060 also incorporates the LPSTOP instruction.  This instruction 
effectively puts the 68060 into a low-power sleep mode in which it stays until 
awakened by an externally generated interrupt.  Data on previous members of the 
M68000 Family shows that use of the LPSTOP instruction can extend battery 
life in portable applications by over 250%.

Summary

	The 68060 relies on new as well as standard architectural techniques 
to extend the performance of the M68000 Family product line.  Performance 
simulations predict that performance levels between 3 and 3.5 times that of 
a 25 MHz 68040 are possible using existing object code.
	The 68060 relies on a deep internal pipeline and a superscalar 
internal architecture coupled with 8 Kbyte instruction and data caches, a 
256-entry branch cache, on-chip MMUs and an on-chip FPU to bring new levels 
of performance to the M68000 Family architecture.
	Power management is very important on the 68060, and this design uses 
dynamic power management techniques to minimize power consumption.  The 68060 
operates from a 3.3 volt power supply, which greatly reduces its power 
dissipation.  Although the 68060 operates at a lower operating voltage, it 
interfaces to both 3 volt and 5 volt peripherals and logic.
	In addition to providing full application object code compatibility 
with previous CPUs in this family, the 68060 provides a superset of 68040 
hardware functionality.  Designs compatible with existing and future 68040 
systems are simple, and higher frequency designs are possible using a new 
bus interface protocol.

Acknowledgements

	The authors would like to thank all members of the 68060 design team 
and management.  Without their concerted team effort, this project and this 
paper would not have been possible.

