The Motorola 68060 Microprocessor

Joe Circello and Floyd Goodrich
Microprocessor and Memory Technology Group
Motorola Inc.

Abstract

The Motorola 68060 is the fourth-generation microprocessor of the M68000 Family. User object code-compatible with previous family members, it delivers 3 to 3.5 times the performance of the previous generation processor in this family, the 68040. Performance features include a superscalar integer unit, a high-performance floating point unit, dual 8-Kbyte on-chip caches, a branch cache, and on-chip memory-management units. A streamlined design enables high-performance techniques to achieve a high level of parallel instruction execution. Improved performance at a low cost makes the 68060 an ideal processor for the mid to high range of desktop computing applications, and compatibility features enable it to easily upgrade the performance of existing 68040-based systems. This paper describes the operation of the 68060.

Figure 1 - Simplified 68060 Block Diagram (***NOT SHOWN IN ASCII-ONLY COPY***)

Introduction and Overview

The 68060 is the fourth-generation microprocessor of Motorola's M68000 Family of CISC microprocessors. It is a single-chip implementation that employs a deep pipeline, dual-issue superscalar execution, a branch cache, a high performance floating point unit (FPU), 8 Kbytes each of on-chip instruction and data caches, and on-chip demand paged memory management units (MMUs). These features allow it to sustain execution rates of less than one clock per instruction.

In order to meet the performance goals of the 68060, instruction execution times needed to decrease, and parallel operations needed to increase over previous generations of M68000 microprocessors. A superscalar instruction dispatch micro-architecture is the most obvious feature of this increased parallelism on the 68060.
Superscalar architectures are distinguished by their ability to dispatch two or more instructions per clock cycle from an otherwise conventional instruction stream.

Figure 1 shows a block diagram of the 68060. In addition to the superscalar features, this single chip has many other performance, upgrade and system integration features including:

* 100% user-mode object code compatibility with 68040
* Dual-issue superscalar instruction dispatch implementation of M68000 architecture
* IEEE Compatible on-chip FPU
* Branch Cache to minimize pipeline refill latency
* Separate 8 Kbyte on-chip instruction and data caches with simultaneous access
* Bus Snooping
* 68040 compatible bus protocol or new high-speed bus protocol
* 32-bit nonmultiplexed address and data bus
* Four-entry write buffer
* Concurrent operation of Integer Unit, FPU, MMUs, caches, Bus Controller, and Pipeline
* Sophisticated power management subsystem
* Low-power 3.3V operation
* JTAG Boundary Scan

Design Targets

The design goals of the 68060 included providing a simple upgrade path for existing M68000 Family designs while also supplying a basis for Motorola's successful 68EC0x0 Family of embedded controllers and for the 68300 Family of modular integrated controllers. Initial requirements for the targeted 68060 were to provide a factor of three performance enhancement over a 25 MHz 68040 with existing compiler technology. Architectural enhancements were to provide at least a 50% improvement, while doubling the clock frequency was to double performance again. The performance estimates reflect analysis of existing object code; additional performance advantages are, of course, available when using compilers designed specifically for the 68060.

In addition to software compatibility, the 68060 preserves the investment in board-level ASICs by providing bus compatibility with the 68040. This supersocket approach facilitates upgrade of all existing and future 68040-based systems.
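The factor-of-three target in the Design Targets section follows from multiplying the two stated contributions. The sketch below is only an arithmetic check; the function name is hypothetical, and it assumes the architectural gain and the clock-frequency gain compose multiplicatively.

```python
def combined_speedup(architectural_gain: float, clock_ratio: float) -> float:
    """Combine two independent speedup factors.

    Assumption: the factors multiply, i.e. the architectural gain is
    measured at a fixed clock and scales linearly with frequency.
    """
    return architectural_gain * clock_ratio


# A 50% architectural improvement (1.5x) and a doubled clock (2.0x),
# as stated in the design targets.
print(combined_speedup(1.5, 2.0))
```

With the stated inputs the product is 3.0, matching the factor-of-three requirement over a 25 MHz 68040.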
The 68060 uses approximately 2.4 million transistors. The part is a static CMOS design based on a 0.5 um triple level metal wafer process. This process enables the 68060 to operate from a 3.3 volt power supply, a greater than 50% power reduction over a 5.0 volt power supply. Since the 68060 minimizes power dissipation through a variety of architectural and circuit techniques, it is able to offer high performance processing to the laptop and portable markets in addition to the traditional computer-system markets.

Architectural Features

The architecture of the 68060 revolves around its novel integer unit pipeline. Taking advantage of many of the same performance enhancements used by RISC designs as well as developing new architectural techniques, the 68060 harnesses new levels of performance for the M68000 Family.

The superscalar micro-architecture actually consists of two distinct parts: a four-stage instruction fetch pipeline (IFP) responsible for accessing the instruction stream and dual four-stage operand execution pipelines (OEPs) which perform the actual instruction execution. These pipeline structures operate in an independent manner with a FIFO instruction buffer providing the decoupling mechanism. A branch cache minimizes the latency effects of change of flow instructions by allowing the IFP to detect changes in the instruction prefetch stream well in advance of their actual execution by the OEPs.

The 68060 is a full internal Harvard architecture. The instruction and data caches are designed to support concurrent instruction fetch and operand read and operand write references on every clock cycle. This organization, coupled with a multi-ported register file, provides the necessary bandwidth to maximize the throughput of the pipelines. The operand execution pipelines operate in a lock-stepped manner that provides simultaneous, but not out-of-order, program execution.
The net result is a machine architecture invisible to existing applications providing full support of the M68000 programming model including precise exceptions.

The 68060 external bus interface provides a superset of 68040 functionality. Maintaining 32-bit widths on both the address and data bus as well as a bursting protocol for cacheable memory, the 68060 supports transfers of one, two, four, or 16 bytes in a given bus cycle. The system designer can, however, choose to operate in one of two modes: a mode compatible with the 68040 protocol or a new mode consistent with higher frequency bus designs. By allowing this choice, the 68060 can easily fit into upgrades of existing designs as well as new high frequency implementations.

Pipeline Organization

The IFP is responsible for prefetching instructions and loading them into the FIFO instruction buffer. One key aspect of the design is the branch cache, which allows the IFP to detect changes in the instruction stream based on past execution history. This allows the IFP to provide a constant stream of instructions to the instruction buffer to maximize the execution rates of the OEPs. The IFP is implemented as a four-stage design shown in Figure 2.

Figure 2 - The IFP of the 68060 (***NOT SHOWN IN ASCII-ONLY COPY***)

The four stages of the IFP are: Instruction Address Generation (IAG), Instruction Cache (IC), Instruction Early Decode (IED) and Instruction Buffer (IB). The instruction and branch caches are integral components of the IFP. Four operations can be occurring concurrently in the IFP.

The IAG stage calculates the next prefetch address from a number of possible sources. The variable length of the M68000 Family instruction set as well as change-of-flow detection make this stage critical to the performance of the 68060. After the IAG sends the appropriate address to the instruction cache, the IC stage of the IFP is responsible for performing the cache lookup and fetching the bit pattern of the instruction.
The IED stage of the pipeline analyzes the bytes fetched from the instruction stream and builds an extended operation word. This lookup stage effectively converts the variable-length instruction with multiple formats into a fixed-length extended operation word that is used by the OEPs in all subsequent processing. At the conclusion of the IED stage, the prefetched bytes along with the extended operation word issue into the instruction buffer.

The IB stage reads instructions from the 96-byte FIFO buffer and loads them into the dual OEPs. The FIFO effectively decouples the operation of the IFP from the operations of the dual OEPs.

Figure 3 - The OEP Units of the 68060 (***NOT SHOWN IN ASCII-ONLY COPY***)

Consecutive instructions issue from the FIFO instruction buffer into the instruction registers of the dual OEPs. The operand execution pipelines, known as the primary OEP (pOEP) and the secondary OEP (sOEP), are partitioned into a 4-stage implementation depicted in Figure 3. The four stages of the OEPs are: Decode and Select (DS), operand Address Generation (AG), Operand Cycle (OC) and the EXecute cycle (EX). For instructions writing data to memory, there are two additional pipeline stages: the Data Available (DA) and Store (ST) cycles.

The Decode and Select stage of the OEPs provides two primary functions: this stage determines the next state for the entire operand pipeline and selects the components required for operand address calculation. To determine the next state of the OEPs, the DS cycle logic tests the extended operation words to ascertain the number of instructions that can issue into the AG stage. If multiple instructions can issue into the AG stages in parallel, the first and second instructions move into the respective AG stages. If only a single instruction can issue because of architectural constraints, the first instruction issues into the pOEP, and the DS stage evaluates the second and third instructions as a pair during the next clock cycle.
The net effect is a sliding 2-instruction window to examine possible pairs of instructions for parallel execution. A dedicated adder located in the AG stage sums the three components of the effective address: the base, the index and the displacement.

The Operand Cycle (OC) of the OEPs performs the actual fetch of operands required by the instruction. For memory operands, the OEP accesses the data cache in this cycle to retrieve the data. For register operands, the OEP accesses the register file containing all the general-purpose registers during the OC stage. At the conclusion of the OC cycle, the execute engines receive the required operands.

The EXecute cycle (EX) performs the operations required to complete the instruction execution including updating the condition codes. If the destination of the instruction is a data or an address register, the result is available at the end of the EX stage; if the destination is a memory location, the operation requires two additional cycles. First, there is a Data Available (DA) stage where the destination operand issues to the data cache, which aligns the operand. Second, updates to the data cache occur during the STore (ST) cycle. Additionally, there is a four longword FIFO write buffer that is selectable on a page basis and serves to decouple the operation of the OEPs from external bus cycles.

Since this is an order-two superscalar machine (dual instruction issue), the sOEP is conceptually a copy of the pOEP. A notable exception to this concept is the fact that the sOEP executes only a subset of the complete instruction set. As an example, the floating point execute engine resides only in the pOEP. Consequently, all floating point instructions must execute only in the pOEP.

As instructions travel down the OEPs, they remain lock-stepped. This ensures that there is no out-of-order execution and thus greatly simplifies support for the precise exception model of the M68000 Family.
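The sliding 2-instruction window can be illustrated with a minimal sketch. This is not the 68060's actual dispatch logic: the names `Instr`, `can_pair` and `dispatch` are hypothetical, and the only pairing constraint modeled is the one stated above, that floating point instructions cannot issue into the sOEP. The real hardware applies further architectural constraints that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    is_fp: bool = False  # floating point instructions may only use the pOEP


def can_pair(first: Instr, second: Instr) -> bool:
    # Simplified, assumed rule: the second instruction of a pair goes to
    # the sOEP, which has no floating point execute engine.
    return not second.is_fp


def dispatch(stream: list[Instr]) -> list[tuple[str, ...]]:
    """Group an in-order instruction stream into per-cycle dispatch groups."""
    groups = []
    i = 0
    while i < len(stream):
        if i + 1 < len(stream) and can_pair(stream[i], stream[i + 1]):
            # Both instructions issue: first to the pOEP, second to the sOEP.
            groups.append((stream[i].name, stream[i + 1].name))
            i += 2
        else:
            # Only the first instruction issues; the window slides by one.
            groups.append((stream[i].name,))
            i += 1
    return groups


stream = [Instr("lsl.l"), Instr("fadd.d", is_fp=True), Instr("add.l")]
print(dispatch(stream))
```

In this example the pair {lsl.l, fadd.d} is rejected because fadd.d would land in the sOEP, so lsl.l issues alone; on the next cycle the window slides and {fadd.d, add.l} issues as a pair.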
The micro-architecture of the 68060 supports a number of optimizations to increase the number of superscalar instruction dispatches. In internal evaluations of traces from existing object code totaling several billion instructions, 50% to 60% of instructions execute as pairs.

From the preceding discussion concerning the operand pipeline stages, all data cache read references occur in the OC stage while data cache write references occur in the ST stage. The data cache uses a 4-way interleaving scheme to allow simultaneous operand read and write operations from both OEPs. The data cache directories are a single-ported design. As a result, within a superscalar pair of instructions, the 68060 only allows a single operand memory reference. The data cache also supports single-cycle references of 64-bit double-precision floating-point operands.

A common drawback to long pipelines is the penalty associated with refilling the pipeline when a change of program flow occurs. Condition code evaluation occurs in the EX stage, but waiting for a branch instruction to reach this point needlessly restricts performance. Instead, the 68060 contains a 256-entry Branch Cache (BC) which predicts the direction of a branch based on past execution history well in advance of the actual evaluation of condition codes.

The BC stores the Program Counter value of change-of-flow instructions as well as the target address of those branches. The BC also uses some history bits to track how each given branch instruction has executed in the past. The 68060 checks the BC during the IC stage of the IFP, the same stage that performs the lookup into the Instruction Cache. If the BC indicates that the instruction is a branch and that this branch should be predicted as taken, the IAG pipeline stage is updated with the target address of the branch instead of the next sequential address.
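The lookup just described can be sketched as a small table keyed by the branch's Program Counter value. The text gives only the entry count, the stored fields and the existence of history bits, so everything else here is an assumption made for illustration: the class and method names are hypothetical, the indexing function is invented, and a 2-bit saturating counter stands in for the unspecified history scheme.

```python
class BranchCache:
    ENTRIES = 256  # entry count stated in the text

    def __init__(self) -> None:
        # index -> (branch_pc, target_address, history_counter)
        self.table: dict[int, tuple[int, int, int]] = {}

    def _index(self, pc: int) -> int:
        # Assumed indexing: instructions are 16-bit aligned, so drop bit 0.
        return (pc >> 1) % self.ENTRIES

    def next_fetch_address(self, pc: int, sequential: int) -> int:
        """Return the address to load into the address-generation stage."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc and entry[2] >= 2:
            return entry[1]  # hit, predicted taken: redirect to the target
        return sequential    # miss or predicted not-taken: keep fetching in order

    def update(self, pc: int, target: int, taken: bool) -> None:
        """Record the resolved outcome of a branch (assumed 2-bit counter)."""
        idx = self._index(pc)
        entry = self.table.get(idx)
        counter = entry[2] if entry is not None and entry[0] == pc else 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.table[idx] = (pc, target, counter)


bc = BranchCache()
bc.update(0x1000, 0x2000, taken=True)
print(hex(bc.next_fetch_address(0x1000, sequential=0x1004)))
```

After one taken outcome the assumed counter crosses its threshold, so the next lookup for the same branch returns the stored target rather than the sequential address.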
This approach, along with the instruction folding techniques that the BC uses, allows the 68060 to achieve a zero-clock latency penalty for correctly predicted taken branches.

If the BC predicts a branch as not-taken, there is no discontinuity in the instruction prefetch stream. The IFP continues to fetch instructions sequentially. Eventually, the not-taken branch instruction executes as a single-clock instruction in the OEP, so correctly predicted not-taken branches require a single clock to execute. Branches predicted as not-taken allow a superscalar instruction dispatch, so in many cases, the next instruction executes simultaneously in the sOEP.

The 68060 performs the actual condition code checking to evaluate the branch conditions in the EX stage of the OEP. If a branch has been mispredicted, the 68060 discards the contents of the IFP and the OEPs, and the 68060 resumes fetching of the instruction stream at the correct location. To refill the pipeline in this manner, there is a seven-clock penalty for a mispredicted branch. If the BC correctly predicted the branch, the OEPs execute seamlessly with no pipeline stalls. Internal studies of the prediction algorithm used on the 68060 show greater than 90% accuracy from statistics gathered from several billions of instructions from applications across many runtime environments.

Floating Point Unit

The floating point unit (FPU) of the 68060 provides complete binary compatibility with previous M68000 Family floating point solutions. The 68060 performs all internal operations in 80-bit extended precision and completely supports the IEEE 754 floating point standard. Conceptually, the FPU appears as another execute engine in the EX stage of the pOEP. A 64-bit data path between the data cache and the FPU optimizes the FPU for single-cycle references of 32- or 64-bit memory operands. As previously noted, all floating point instructions must execute through the pOEP.
However, integer instructions can be simultaneously dispatched into the sOEP with most FPU instructions, and the 68060 supports overlap between the integer execute engines and the FPU. Once a multi-cycle FPU instruction is dispatched, the pOEP and sOEP continue to dispatch and complete integer instructions (including change-of-flow instructions) until another FPU instruction is encountered. At this point, the OEPs stall until the FPU execute engine is available for the next instruction.

The FPU's internal organization consists of three units: the adder, the multiplier and the divider. The 68060's design does not support concurrent floating point execution; only one of these functional units is active at a time. Table 1 shows execution times for the 68060 FPU.

Instruction    CPU Clocks
FMOVE           1
FADD            3
FMUL            4
FDIV           24
FSQRT          66

Table 1 - 68060 Floating Point Execution Times

Pipeline Example

Figure 4 shows an example of the 68060 pipeline operation. The code shown comes from a commercially available compiler and represents the inner SAXPY loop from the matrix300 program from the SPEC89 benchmark suite. Since the OEPs are decoupled from the IFP, this example only focuses on the OEPs. This loop executes 13 instructions in only ten clock cycles, producing a steady-state performance of 0.77 clocks per instruction (CPI). This code includes two multi-cycle FPU instructions (4-cycle FMUL and 3-cycle FADD), but the superscalar micro-architecture is able to effectively exploit the parallelism within the loop to achieve a CPI of less than one.

This example code loop demonstrates several major architectural features of the 68060. Of the 13 instructions, the 68060 dispatches four 2-instruction pairs (at cycles 1, 2, 4, 5), one group of three instructions (at cycle 9) and two individual instructions (at cycles 3 and 8). At cycle 3, the pair of instructions being examined is {pOEP = lsl.l, sOEP = fadd.d}.
Since all floating-point instructions must issue into the pOEP, the fadd.d does not issue into the sOEP. On the next cycle, a new 2-instruction pair is examined {pOEP = fadd.d, sOEP = add.l}, and at this time, both instructions issue down the OEPs. At cycles 6 and 7, the pipeline stalls on the fadd.d instruction as the 4-cycle fmul completes execution. The floating-point store operation at cycle 8 inhibits any sOEP dispatch because of certain post-exception fault possibilities. At cycle 9, an instruction triplet is dispatched {add.l, subq.l, bcc.b}. Recall that the branch cache utilizes various instruction folding techniques that effectively allow this predicted-as-taken branch to execute in 0 cycles. Finally, at cycle 10, the pipeline stalls for one clock on the floating-point store instruction as it waits for the completion of the three-cycle fadd.

Power Management On Chip

With 2.4 million transistors operating at frequencies of 50 MHz and higher, power management becomes a crucial issue on the 68060. From its inception, the 68060 design focused on minimizing chip-level power dissipation. There are primarily three different areas of interest for power dissipation.

The 68060 operates from a 3.3 volt power supply. Since power dissipation is a function of the square of the power supply voltage, simply changing the power supply voltage to 3.3 volts results in a 56% reduction in power compared to a 5 volt power supply.

In addition to a lower supply voltage, the 68060 is a completely static design. The 68060's operating frequency, which linearly affects chip-level power dissipation, can vary dynamically down toward the DC range. Although the 68060 is a 3.3 volt part, its I/O buffers interface to either 3 volt or 5 volt peripherals and memory, facilitating upgrades of existing designs.

Sophisticated power management circuitry on chip dynamically controls and minimizes power consumption.
This circuitry selectively updates modules on the 68060 on a clock-by-clock basis, dynamically shutting off the circuits not required to support the activities in the current clock cycle. Entire areas of the 68060 can shut off for long periods of time when they are not required. The 68060 also incorporates the LPSTOP instruction. This instruction effectively puts the 68060 into a low-power sleep mode in which it stays until awakened by an externally generated interrupt. Data on previous members of the M68000 Family shows that use of the LPSTOP instruction can extend battery life in portable applications by over 250%.

Summary

The 68060 relies on new as well as standard architectural techniques to extend the performance of the M68000 Family product line. Performance simulations predict that performance levels between 3 and 3.5 times that of a 25 MHz 68040 are possible using existing object code. The 68060 relies on a deep internal pipeline and a superscalar internal architecture coupled with 8 Kbyte instruction and data caches, a 256-entry branch cache, on-chip MMUs and an on-chip FPU to bring new levels of performance to the M68000 Family architecture.

Power management is very important on the 68060, and this design uses dynamic power management techniques to minimize power consumption. The 68060 operates from a 3.3 volt power supply, which greatly reduces its power dissipation. Although the 68060 operates at a lower operating voltage, it interfaces to both 3 volt and 5 volt peripherals and logic.

In addition to providing full application object code compatibility with previous CPUs in this family, the 68060 provides a superset of 68040 hardware functionality. Designs compatible with existing and future 68040 systems are simple, and higher frequency designs are possible using a new bus interface protocol.
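The supply-voltage saving quoted in the Power Management section can be checked with a short calculation. The sketch assumes, as that section states, that power dissipation scales with the square of the supply voltage and that all other factors (frequency, switched capacitance) are held constant; the function name is hypothetical.

```python
def power_ratio(v_new: float, v_old: float) -> float:
    """Ratio of power at v_new to power at v_old, assuming P is proportional to V squared."""
    return (v_new / v_old) ** 2


# Moving from a 5.0 volt supply to a 3.3 volt supply.
reduction = 1.0 - power_ratio(3.3, 5.0)
print(f"{reduction:.0%}")
```

The ratio (3.3 / 5.0) squared is about 0.436, so the reduction is about 56%, in line with the figure given in the text.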
Acknowledgements

The authors would like to thank all members of the 68060 design team and management. Without their concerted team effort, this project and this paper would not have been possible.