Monday, December 15, 2008

Intel Pentium M Tutorial

Introduction
In this tutorial we explain how the Pentium M CPU works in easy-to-follow language. Since all new CPUs from Intel will use Pentium M’s architecture, studying it is very important for understanding the architecture of the Core Solo and Core Duo (Yonah) CPUs, and also the foundation of the forthcoming Core microarchitecture, to be used by the Merom, Conroe and Woodcrest CPUs. You will learn exactly how this architecture works, so you will be able to compare it more precisely to other processors from Intel and to competitors from AMD.
Pentium M is based on Intel’s 6th generation architecture, a.k.a. P6, the same one used by the Pentium Pro, Pentium II and Pentium III CPUs, and not on Pentium 4’s as you might think. It was originally targeted at mobile computers. You may think of Pentium M as an enhanced Pentium III. Be careful not to confuse Pentium M with Pentium 4 M or with Pentium III M, which are different CPUs. Read our tutorial All Pentium M Models to learn about all Pentium M versions released so far.
Pentium M is often called Centrino. Actually, Centrino is the name of a platform: a laptop with a Pentium M CPU, an Intel 855 or 915 chipset and an Intel PRO/Wireless LAN adapter. So, if you have a laptop based on Pentium M but without an Intel PRO/Wireless LAN adapter, for example, it cannot be called Centrino.
In this tutorial we basically explain how the P6 architecture works and what is new in Pentium M compared to Pentium III. So you will also learn how the Pentium Pro, Pentium II, Pentium III and Celeron (models based on the P6 architecture, i.e. the slot 1 and socket 370 ones) processors work.
Before continuing, however, you should have read our tutorial “How a CPU Works”, where we explain the basics of how a CPU works. Here we assume that you have already read it, so if you haven’t, please take a moment to read it before continuing, otherwise you may find yourself a little lost. You can consider the present tutorial a sequel to it. It is also a good idea to read our Inside Pentium 4 Architecture tutorial, to understand how Pentium M differs from Pentium 4.
Before going further, let’s see the main differences between Pentium M and Pentium III CPUs:
• Externally, Pentium M works like Pentium 4, transferring four data items per clock cycle. This technique is called QDR (Quad Data Rate) and gives the local bus an effective performance of four times its actual clock rate; see the table below.
Real Clock    Performance    Transfer Rate
100 MHz       400 MHz        3.2 GB/s
133 MHz       533 MHz        4.2 GB/s
• L1 memory cache: two 32 KB L1 memory caches, one for data and another for instructions (Pentium III had two 16 KB L1 memory caches).
• L2 memory cache: 1 MB on 130 nm models (“Banias” core) or 2 MB on 90 nm models (“Dothan” core). Pentium III had up to 512 KB. Celeron M, which is a low-cost version of Pentium M, has a 512 KB L2 memory cache.
• Support for SSE2 instructions.
• Advanced branch prediction: the branch prediction circuit was redesigned (based on Pentium 4’s branch prediction circuit) to improve performance.
• Micro-op fusion: The instruction decoder fuses two micro-ops into one micro-op in order to save energy and improve performance. We’ll talk more about this later.
• Enhanced SpeedStep Technology, which allows the CPU to reduce its clock while idle in order to increase battery life.
• Several other battery-saving features were added to Pentium M’s microarchitecture, since this CPU was originally designed for mobile computers.
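The figures in the QDR table above can be reproduced with a little arithmetic. The sketch below is only an illustration: it assumes a 64-bit (8-byte) wide bus, which the text does not state, and the function name qdr_bus is made up for this example.

```python
# Assumption (not stated in the text): the bus data path is 64 bits = 8 bytes wide.
BUS_WIDTH_BYTES = 8
TRANSFERS_PER_CLOCK = 4  # QDR: four transfers per clock cycle


def qdr_bus(real_clock_mhz):
    """Return (effective clock in MHz, peak transfer rate in GB/s)."""
    effective_mhz = real_clock_mhz * TRANSFERS_PER_CLOCK
    transfer_gb_s = effective_mhz * BUS_WIDTH_BYTES / 1000  # MB/s -> GB/s
    return effective_mhz, transfer_gb_s


for clock in (100, 133):
    effective, rate = qdr_bus(clock)
    print(f"{clock} MHz real -> {effective} MHz effective, {rate:.2f} GB/s")
```

With the nominal 133 MHz this gives 532 MHz and about 4.26 GB/s; the table shows these values rounded to 533 MHz and 4.2 GB/s.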
Let’s now talk more in depth about Pentium M’s architecture.
Pentium M Pipeline
A pipeline is the list of all stages a given instruction must go through in order to be fully executed. Intel didn’t disclose Pentium M’s pipeline, so we will talk about Pentium III’s. Pentium M’s pipeline probably has more stages than Pentium III’s, but analyzing Pentium III’s will give you a good idea of how Pentium M’s architecture works.
For comparison, the Pentium 4 pipeline has 20 stages, and the pipeline of newer Pentium 4 CPUs based on the “Prescott” core has 31 stages!
In Figure 1 you can see Pentium III’s 11-stage pipeline.

Figure 1: Pentium III pipeline.
Here is a basic explanation of each stage, which shows how a given instruction is processed by P6-class processors. If this seems too complex, don’t worry: it is just a summary of what we explain in the following sections.
• IFU1: Loads one line (32 bytes, i.e. 256 bits) from L1 instruction cache and stores it in the Instruction Streaming Buffer.
• IFU2: Identifies the instruction boundaries within 16 bytes (128 bits). Since x86 instructions don’t have a fixed length, this stage marks where each instruction starts and ends within the loaded 16 bytes. If there is any branch instruction within these 16 bytes, its address is stored in the Branch Target Buffer (BTB), so the CPU can later use this information in its branch prediction circuit.
• IFU3: Marks to which instruction decoder unit each instruction must be sent. There are three different instruction decoder units, as we will explain later.
• DEC1: Decodes the x86 instruction into RISC microinstructions (a.k.a. micro-ops). Since the CPU has three instruction decode units, it is possible to decode up to three instructions at the same time.
• DEC2: Sends the micro-ops to the Decoded Instruction Queue, which is capable of storing up to six micro-ops. If the instruction was converted into more than six micro-ops, this stage must be repeated in order to catch the missing micro-ops.
• RAT: Since P6 microarchitecture implements out-of-order execution (OOO), the value of a given register could be altered by an instruction executed before its “correct” (i.e., original) place in the program flow, corrupting the data needed by another instruction. So, to solve this kind of conflict, at this stage the original register used by the instruction is changed to one of the 40 internal registers that P6 microarchitecture has.
• ROB: At this stage three micro-ops are loaded into the Reorder Buffer (ROB). If all data necessary for the execution of a micro-op are available and if there is an open slot at the Reservation Station micro-op queue, then the micro-op is moved to this queue.
• DIS: If the micro-op wasn’t sent to the Reservation Station micro-op queue, this is done at this stage. The micro-op is sent to the proper execution unit.
• EX: The micro-op is executed at the proper execution unit. Usually each micro-op needs only one clock cycle to be executed.
• RET1: Checks at the Reorder Buffer if there is any micro-op that can be flagged as “executed”.
• RET2: When all micro-ops related to the previous x86 instruction have already been removed from the Reorder Buffer and all micro-ops related to the current x86 instruction have been executed, these micro-ops are removed from the Reorder Buffer and the x86 registers are updated (the inverse of the process done at the RAT stage). The retirement process must be done in order. Up to three micro-ops can be removed from the Reorder Buffer per clock cycle.
Don’t worry if all this sounded confusing to you. We will explain it better in the following sections.
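To get a feeling for why an 11-stage pipeline pays off, here is a rough sketch of an ideal pipeline built from the stages listed above. It is a simplification of ours, not a model of the real CPU: it assumes that there are no stalls, that every stage takes exactly one clock cycle, and that three instructions enter the pipeline per clock. The function name ideal_cycles is made up for this example.

```python
STAGES = ["IFU1", "IFU2", "IFU3", "DEC1", "DEC2", "RAT",
          "ROB", "DIS", "EX", "RET1", "RET2"]


def ideal_cycles(num_instructions, width=3):
    """Clock cycles needed to finish num_instructions in a stall-free pipeline.

    width is the number of instructions entering the pipeline per clock
    (three here, matching the three instruction decoders).
    """
    if num_instructions == 0:
        return 0
    groups = -(-num_instructions // width)  # ceiling division
    # The first group needs one cycle per stage; every following group
    # finishes one cycle after the group ahead of it.
    return len(STAGES) + groups - 1


print(ideal_cycles(1))    # a single instruction walks through all 11 stages
print(ideal_cycles(300))  # 300 instructions overlap inside the pipeline
```

In this idealized model, 300 instructions need 110 clock cycles instead of the 3,300 that would be needed if each instruction had to leave the pipeline before the next one could enter.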
Memory Cache and Fetch Unit
As we mentioned, Pentium M’s L2 memory cache is either 1 MB (130 nm models, a.k.a. “Banias” core) or 2 MB (90 nm models, a.k.a. “Dothan” core), and the CPU has two L1 memory caches, one of 32 KB for instructions and another of 32 KB for data.
The fetch unit is divided into three stages, as we explained in the previous section. In Figure 2 you can see how Pentium M’s fetch unit works.

Figure 2: Fetch unit.
As we mentioned before, the fetch unit loads one line (32 bytes = 256 bits) into its Instruction Streaming Buffer. Then the Instruction Length Decoder identifies the instruction boundaries within 16 bytes (128 bits). Since x86 instructions don’t have a fixed length, this stage marks where each instruction starts and ends within the loaded 128 bits. If there is any branch instruction within these 128 bits, its address is stored in the Branch Target Buffer (BTB), so the CPU can later use this information in its branch prediction circuit. The BTB has 512 entries.
Then the Decoder Alignment Stage marks to which instruction decoder unit each instruction must be sent. There are three different instruction decoder units, as we will explain in the next section.
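The boundary-marking job of the Instruction Length Decoder can be pictured with the small sketch below. It is only an illustration: a real length decoder works out each length from the instruction bytes themselves, while here the lengths are simply given, and the handling of an instruction that does not fit in the window (it is left for the next one) is our assumption.

```python
WINDOW_BYTES = 16  # the length decoder looks at 16 bytes (128 bits) at a time


def mark_boundaries(lengths, window=WINDOW_BYTES):
    """Return (start, end) byte offsets of the instructions that fit in one window.

    lengths is a list of instruction lengths in bytes (hypothetical values).
    """
    marks = []
    offset = 0
    for length in lengths:
        if offset + length > window:
            break  # this instruction spills into the next 16-byte window
        marks.append((offset, offset + length - 1))
        offset += length
    return marks


# Five variable-length instructions; the last one does not fit in the window.
marks = mark_boundaries([1, 3, 5, 2, 6])
print(marks)
```

Here the first four instructions occupy bytes 0 to 10, and the 6-byte instruction that would end at byte 16 is left out of this window.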
Instruction Decoder and Register Renaming
Since the introduction of the P6 architecture with Pentium Pro, Intel processors have used a hybrid CISC/RISC architecture. The processor must accept CISC instructions, also known as x86 instructions, since all software available today is written using this kind of instruction. A RISC-only CPU couldn’t be created for the PC because it wouldn’t run the software we have available today, like Windows and Office.
So, the solution used by all processors available on the market today from both Intel and AMD is to use a CISC/RISC decoder. Internally the CPU processes RISC-like instructions, but its front-end accepts only CISC x86 instructions.
CISC x86 instructions are referred to as “instructions”, while the internal RISC instructions are referred to as “microinstructions”, “micro-ops” or “µops”.
These RISC microinstructions, however, cannot be accessed directly, so we couldn’t create software based on them to bypass the decoder. Also, each CPU uses its own RISC instructions, which are not publicly documented and are incompatible with the microinstructions of other CPUs. I.e., Pentium M microinstructions are different from Pentium 4 microinstructions, which are different from Athlon 64 microinstructions.
Depending on its complexity, an x86 instruction may have to be converted into several RISC microinstructions.
Pentium M’s instruction decoder works as shown in Figure 3. As you can see, there are three decoders and a Micro Instruction Sequencer (MIS). Two decoders are optimized for simple instructions, which are the most used ones. This kind of instruction is converted into just one micro-op. One decoder is optimized for complex x86 instructions, which can be converted into up to four micro-ops. If the x86 instruction is too complex, i.e. it converts into more than four micro-ops, it is sent to the Micro Instruction Sequencer, which is a ROM containing a list of micro-ops that should replace the given x86 instruction.

Figure 3: Instruction Decoder and Register Renaming.
The instruction decoder can convert up to three x86 instructions per clock cycle, one complex at Decoder 0 and two simple at Decoders 1 and 2, feeding the Decoded Instruction Queue with up to six micro-ops per clock cycle. This maximum is reached when Decoder 0 sends four micro-ops and the other two decoders send one micro-op each, or when the MIS is used. Very complex x86 instructions that use the Micro Instruction Sequencer can take several clock cycles to be decoded, depending on how many micro-ops are generated by the conversion. Keep in mind that the Decoded Instruction Queue can hold only up to six micro-ops, so if more than six micro-ops are generated by the decoders plus the MIS, another clock cycle is needed to send the micro-ops currently in the queue to the Register Alias Table (RAT), empty the queue and accept the micro-ops that didn’t “fit” before.
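The decoding rules just described can be sketched as follows. This is a simplified model of ours, not Intel’s actual logic: each x86 instruction is represented only by the number of micro-ops it produces, instructions are decoded strictly in program order, and we assume that the MIS delivers at most six micro-ops (one full queue) per clock cycle. The names decode_cycles, QUEUE_SIZE and MAX_D0_UOPS are made up for this example.

```python
QUEUE_SIZE = 6    # the Decoded Instruction Queue holds up to six micro-ops
MAX_D0_UOPS = 4   # Decoder 0 handles instructions of up to four micro-ops


def decode_cycles(uop_counts):
    """Group instructions (given as micro-op counts) into decode cycles.

    Returns one list per clock cycle with (decoder, micro-ops) pairs.
    """
    cycles = []
    i = 0
    while i < len(uop_counts):
        n = uop_counts[i]
        i += 1
        if n > MAX_D0_UOPS:
            # Too complex for the decoders: the MIS supplies the micro-ops,
            # one queue-full per cycle (assumed pacing).
            while n > 0:
                chunk = min(n, QUEUE_SIZE)
                cycles.append([("MIS", chunk)])
                n -= chunk
            continue
        group = [("D0", n)]
        # Decoders 1 and 2 only accept simple, one-micro-op instructions.
        for decoder in ("D1", "D2"):
            if i < len(uop_counts) and uop_counts[i] == 1:
                group.append((decoder, 1))
                i += 1
            else:
                break
        cycles.append(group)
    return cycles


cycles = decode_cycles([4, 1, 1, 1, 2, 1, 8])
for number, group in enumerate(cycles):
    print(number, group)
```

In this example the first cycle reaches the maximum of six micro-ops (4 + 1 + 1), while the 8-micro-op instruction at the end needs two cycles of the MIS.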
Pentium M brings a concept that is new to the P6 architecture, called micro-op fusion. On Pentium M the decoder unit fuses two micro-ops into one. They are de-fused only when they are about to be executed, at the execution stage.
On the P6 architecture, each microinstruction is 118 bits long. Instead of working with 118-bit micro-ops, Pentium M works with 236-bit micro-ops that are in fact two 118-bit micro-ops.
Keep in mind that the micro-ops continue to be 118 bits long; what changed is that they are transported in groups of two.
The idea behind this approach is to save energy and increase performance. It is faster to send one 236-bit micro-op than two 118-bit micro-ops. The CPU also consumes less power, since fewer micro-ops circulate inside it.
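Here is a minimal sketch of the idea of carrying micro-ops in pairs. The text does not say which micro-ops get paired, so the rule used here (two adjacent micro-ops that come from the same x86 instruction) is just an assumption for illustration, and the micro-op names are hypothetical.

```python
UOP_BITS = 118  # width of a single micro-op, as given in the text


def fuse(uops):
    """Pair adjacent micro-ops that belong to the same x86 instruction.

    Each micro-op is an (instruction_id, name) tuple.
    """
    fused = []
    i = 0
    while i < len(uops):
        if i + 1 < len(uops) and uops[i][0] == uops[i + 1][0]:
            fused.append((uops[i], uops[i + 1]))  # travels as one 236-bit unit
            i += 2
        else:
            fused.append((uops[i],))  # travels alone
            i += 1
    return fused


def defuse(fused):
    """Split the groups again, as done before execution."""
    return [uop for group in fused for uop in group]


uops = [(0, "store_address"), (0, "store_data"), (1, "add"),
        (2, "load"), (2, "add")]
fused = fuse(uops)
print(len(uops), "micro-ops travel as", len(fused), "units")
print("widest unit:", max(len(group) for group in fused) * UOP_BITS, "bits")
```

The micro-ops themselves are unchanged by this process: de-fusing the groups gives back exactly the original list.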
Fused micro-ops are then sent to the Register Alias Table (RAT). The CISC x86 architecture has only eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out of order, which could “kill” the contents of a given register, crashing the program.
So, at this stage, the processor maps each register used by the program onto one of the 40 internal registers available (each of them is 80 bits wide, thus accepting both integer and floating-point data). This allows an instruction to run at the same time as another instruction that uses the exact same standard register, or even out of order, i.e. the second instruction can run before the first one even if they both use the same register.
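Register renaming can be illustrated with the small sketch below. It is a minimal model under our own assumptions: internal registers are handed out from a simple free list and are never recycled (on a real CPU they are released again at retirement), and the class name RenameTable and its methods are hypothetical.

```python
class RenameTable:
    """Maps x86 register names onto a pool of internal registers."""

    def __init__(self, num_internal=40):
        self.free = list(range(num_internal))  # unused internal registers
        self.mapping = {}                      # x86 name -> internal register

    def rename(self, dest, sources):
        """Rename one instruction; returns (new dest, current sources)."""
        # Sources read whichever internal register holds the latest value.
        src_internal = [self.mapping.get(name) for name in sources]
        if not self.free:
            raise RuntimeError("no free internal register: the pipeline stalls")
        # The destination always gets a fresh internal register.
        internal = self.free.pop(0)
        self.mapping[dest] = internal
        return internal, src_internal


table = RenameTable()
first, _ = table.rename("EAX", [])             # mov eax, 1
second, reads = table.rename("EBX", ["EAX"])   # mov ebx, eax
third, _ = table.rename("EAX", [])             # mov eax, 2
print(first, second, reads, third)
```

Because the third instruction writes to a different internal register than the first one, it can run before the second instruction without destroying the value that the second instruction still has to read.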
Reorder Buffer
So far, the x86 instructions and the micro-ops resulting from them have been transferred between the CPU stages in the same order in which they appear in the program being run.
Once they arrive at the ROB, micro-ops can be loaded and executed out of order by the execution units. After being executed, the micro-ops are sent back to the Reorder Buffer. Then, at the Retirement stage, executed micro-ops are pulled out of the Reorder Buffer in the same order they entered it, i.e. they are removed in order. Figure 4 gives a better idea of how this works.

Figure 4: How the Reorder Buffer works.
In Figure 4 we simplified the Reservation Station and the execution units to make the Reorder Buffer easier to understand. We will talk about these two stages in depth in the next section.
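The behavior of the Reorder Buffer (out-of-order execution, in-order retirement, up to three micro-ops retired per clock) can be sketched like this. It is a toy model, not the real hardware structure, and the class and method names are made up for this example.

```python
from collections import deque


class ReorderBuffer:
    RETIRE_WIDTH = 3  # up to three micro-ops leave the buffer per clock

    def __init__(self):
        self.entries = deque()  # kept in program order: [name, executed]

    def allocate(self, name):
        """A micro-op enters the buffer, in program order."""
        self.entries.append([name, False])

    def mark_executed(self, name):
        """An execution unit reports a finished micro-op, in any order."""
        for entry in self.entries:
            if entry[0] == name:
                entry[1] = True
                return
        raise KeyError(name)

    def retire(self):
        """One clock of retirement: only the oldest executed micro-ops leave."""
        retired = []
        while (self.entries and self.entries[0][1]
               and len(retired) < self.RETIRE_WIDTH):
            retired.append(self.entries.popleft()[0])
        return retired


rob = ReorderBuffer()
for name in ("A", "B", "C", "D"):
    rob.allocate(name)

rob.mark_executed("C")   # C and B finish before A
rob.mark_executed("B")
step1 = rob.retire()     # nothing can retire: A, the oldest, is not done
rob.mark_executed("A")
step2 = rob.retire()     # A, B and C retire together, in program order
rob.mark_executed("D")
step3 = rob.retire()
print(step1, step2, step3)
```

Note that B and C were executed before A, but they had to wait in the buffer until A was done.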
Reservation Station and Execution Units
As we mentioned before, Pentium M uses fused micro-ops (i.e. carries two micro-ops together) from the Decode Unit up to the dispatch ports located on the Reservation Station. The Reservation Station dispatches each micro-op individually (de-fused).
Pentium M has five dispatch ports, numbered 0 through 4, located on its Reservation Station. Each port is connected to one or more execution units, as you can see in Figure 5.

Figure 5: Reservation Station and execution units.
Here is a small explanation of each execution unit found on this CPU:
• IEU: Integer Execution Unit is where regular instructions are executed. Also known as ALU (Arithmetic and Logic Unit). “Regular” instructions are also known as “integer” instructions.
• FPU: Floating Point Unit is where complex math instructions are executed. In the past this unit was also known as “math co-processor”.
• SIMD: Is where SIMD instructions are executed, i.e. MMX, SSE and SSE2.
• WIRE: Miscellaneous functions.
• JEU: Jump Execution Unit processes branches and is also known as Branch Unit.
• Shuffle: This unit executes a kind of SSE instruction called “shuffle”.
• PFADD: Executes an SSE instruction called PFADD (Packed FP Add) and also the COMPARE, SUBTRACT, MIN/MAX and CONVERT instructions. This unit is pipelined, so it can start executing a new micro-op at each clock cycle even if it hasn’t completed the execution of the previous micro-op. This unit has a latency of three clock cycles, i.e. it takes three clock cycles to deliver each processed instruction.
• Reciprocal Estimates: Executes two SSE instructions, one called RCP (Reciprocal Estimate) and another called RSQRT (Reciprocal Square Root Estimate).
• Load: Unit to process instructions that ask for data to be read from RAM.
• Store Address: Unit to process instructions that ask for data to be written to RAM. This unit is also known as AGU, Address Generation Unit. This kind of instruction uses both the Store Address and Store Data units at the same time.
• Store Data: Unit to process instructions that ask for data to be written to RAM. This kind of instruction uses both the Store Address and Store Data units at the same time.
Keep in mind that complex instructions may take several clock cycles to be processed. Let’s take port 0 as an example, where the floating point unit (FPU) is located. While this unit is processing a very complex instruction that takes several clock ticks to be executed, port 0 won’t stall: it will keep sending simple instructions to the IEU while the FPU is busy.
So, even though the maximum dispatch rate is five microinstructions per clock cycle, the CPU can actually have up to twelve microinstructions being processed at the same time.
As we mentioned, on instructions that ask the CPU to write data to a given RAM address, the Store Address Unit and the Store Data Unit are used at the same time, one for calculating the address and the other for writing the data.

Actually, that’s why ports 0 and 1 have more than one execution unit attached. If you pay attention, Intel put on the same port one fast unit together with at least one complex (and slow) unit. So, while the complex unit is busy processing data, the other unit can keep receiving microinstructions from its corresponding dispatch port. As we mentioned before, the idea is to keep all execution units busy all the time.
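The effect of putting a fast and a slow unit on the same port can be seen in the toy simulation below. It is a sketch under our own assumptions: the port sends at most one micro-op per clock, a unit cannot accept a new micro-op until it has finished the previous one, and the latencies (five clocks for the FPU micro-op, one clock for the IEU ones) are hypothetical values chosen for the example.

```python
def simulate_port(uops):
    """Return the clock cycle at which all micro-ops have finished.

    uops is a list of (unit, latency) pairs waiting at one dispatch port.
    """
    waiting = list(uops)
    busy_until = {}  # unit -> first cycle at which it is free again
    cycle = 0
    finish = 0
    while waiting:
        # Send the oldest waiting micro-op whose execution unit is free.
        for index, (unit, latency) in enumerate(waiting):
            if busy_until.get(unit, 0) <= cycle:
                busy_until[unit] = cycle + latency
                finish = max(finish, cycle + latency)
                del waiting[index]
                break  # the port sends at most one micro-op per cycle
        cycle += 1
    return finish


queue = [("FPU", 5), ("IEU", 1), ("IEU", 1), ("IEU", 1)]
shared = simulate_port(queue)
stalled = sum(latency for _, latency in queue)  # if the port waited for each one
print(shared, "cycles with the port kept busy,", stalled, "if it stalled")
```

In this example the three simple micro-ops are executed by the IEU while the FPU is still working, so everything is done in five clock cycles instead of eight.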
As we explained, after each micro-op is executed, it returns to the Reorder Buffer, where its flag is set to “executed”. Then, at the Retirement Stage, the micro-ops that have their “executed” flag on are removed from the Reorder Buffer in their original order (i.e. the order in which they were decoded) and then the x86 registers are updated (the inverse of the register renaming stage). Up to three micro-ops can be removed from the Reorder Buffer per clock cycle. At this point the instruction has been fully executed.
Enhanced SpeedStep Technology
SpeedStep Technology was created to increase battery life and was first introduced with the Pentium III M processor. This first version of SpeedStep Technology allowed the CPU to switch between two clock frequencies on the fly: Low Frequency Mode (LFM), which maximized battery life, and High Frequency Mode (HFM), which allowed you to run your CPU at its maximum speed. The CPU had two clock multiplier ratios, and it simply switched between them. The LFM ratio was factory-locked and you couldn’t change it.
Pentium M introduced Enhanced SpeedStep Technology, which goes beyond that by offering several other clock and voltage configurations between LFM (which is fixed at 600 MHz) and HFM (which is the CPU’s full clock).
Just to give you a real example, the clock/voltage configuration table for a 1.6 GHz Pentium M based on 130 nm technology is the following:
Voltage Clock
1.484 V 1.6 GHz
1.42 V 1.4 GHz
1.276 V 1.2 GHz
1.164 V 1 GHz
1.036 V 800 MHz
0.956 V 600 MHz
Each Pentium M model has its own voltage/clock table. It is interesting to notice that this is not only about lowering the clock rate when you don’t need so much processing power from your laptop, but also about lowering the voltage, which helps a lot to reduce power consumption.
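To see why lowering the voltage matters so much, we can apply the usual rule of thumb for CMOS chips, in which dynamic power is roughly proportional to the square of the voltage times the clock. This rule is not part of the table above; it is a common approximation that ignores leakage and other effects, so take the result only as a rough estimate. The function name relative_power is made up for this example.

```python
# (voltage in V, clock in MHz), taken from the table above
OPERATING_POINTS = [
    (1.484, 1600), (1.420, 1400), (1.276, 1200),
    (1.164, 1000), (1.036, 800), (0.956, 600),
]


def relative_power(voltage, clock_mhz, reference=OPERATING_POINTS[0]):
    """Estimated dynamic power relative to the reference operating point.

    Assumes power ~ voltage^2 x clock (rule of thumb, not an Intel figure).
    """
    ref_voltage, ref_clock = reference
    return (voltage / ref_voltage) ** 2 * (clock_mhz / ref_clock)


for voltage, clock in OPERATING_POINTS:
    print(f"{clock:5d} MHz at {voltage:.3f} V -> "
          f"{relative_power(voltage, clock):.0%} of full-speed power")
```

Under this approximation the 600 MHz setting draws around 16% of the full-speed power, while lowering the clock alone (600 MHz out of 1,600 MHz) would only bring it down to about 38%.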
Enhanced SpeedStep Technology works by monitoring specific MSRs (Model Specific Registers) of the CPU called Performance Counters. With this information, the CPU can lower or raise its clock/voltage depending on CPU usage. Simply put, if CPU usage increases, the CPU raises its voltage/clock; if CPU usage drops, it lowers its voltage/clock.
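The “raise when busy, lower when idle” behavior can be pictured with the sketch below. The real selection logic is not described here, so this policy is purely hypothetical: it picks the slowest clock from the 1.6 GHz table that would keep CPU usage below a chosen limit (80% in this example). The names next_clock and headroom are made up.

```python
CLOCKS_MHZ = [600, 800, 1000, 1200, 1400, 1600]  # from the 1.6 GHz table


def next_clock(current_mhz, usage, headroom=0.8):
    """Pick the slowest clock that keeps CPU usage under headroom.

    usage is the fraction of time the CPU was busy (0.0 to 1.0) while
    running at current_mhz. Hypothetical policy, for illustration only.
    """
    demand_mhz = current_mhz * usage  # work actually being done
    for clock in CLOCKS_MHZ:
        if demand_mhz <= clock * headroom:
            return clock
    return CLOCKS_MHZ[-1]  # demand exceeds everything: run at full clock


print(next_clock(1600, 0.10))  # almost idle at full speed
print(next_clock(600, 1.00))   # fully loaded at the lowest clock
print(next_clock(1600, 1.00))  # fully loaded at full speed
```

An almost idle CPU drops to 600 MHz, a fully loaded CPU at 600 MHz moves up one step to 800 MHz, and a fully loaded CPU at 1.6 GHz simply stays there.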
Enhanced SpeedStep was just one of the several enhancements made to the Pentium M microarchitecture in order to increase battery life.
A good example is found in the execution units. On other processors, the same power line feeds all execution units, so it is not possible to turn off an idle execution unit on Pentium 4, for example. On Pentium M, the execution units have separate power lines, making the CPU capable of turning off idle execution units. For example, Pentium M detects in advance if a given instruction is an integer one (“regular instruction”) and disables the units and datapaths not needed to process that instruction, provided they are idle, of course.