II. Extensive analysis and discussion
A. Instruction Set Architectures
When it comes to instruction set architecture, there are two broad categories: Reduced Instruction Set Computer (RISC) and Complex Instruction Set Computer (CISC). CISC instructions are highly specialized, so a CISC ISA can support a wide variety of instructions and addressing modes, but its instructions vary in CPI and in size. In contrast, RISC instructions are short and simple and support fewer addressing modes; they share the same low CPI, but more complex operations must be composed from several instructions in software. The debate between CISC and RISC has been longstanding [7].
Although x86 is recognized as a CISC architecture, it combines advantages of both RISC and CISC [5, 7]. Peter Glaskowsky said, "The x86 architecture dominates the PC and server markets, but the guts of modern x86 chips are very RISC-like" [6]. The way this works is that complex x86 instructions are translated into short, simple, RISC-like micro-ops.
In this way, the architecture benefits from both CISC and RISC design. x86 is used in both Intel and AMD processors, such as Intel i3/i5/i7, AMD Phenom, and AMD Athlon, so both vendors gain the advantage of this combined design.
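As a rough illustration of this translation (not the actual Intel or AMD decoder logic; the type and function names below are invented for illustration), the following C sketch models how a single memory-operand add might be expanded into three RISC-like micro-ops:

```c
#include <stdio.h>

/* Hypothetical micro-op kinds; real decoders are far more elaborate. */
typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind;

typedef struct {
    uop_kind kind;
    int dst, src1, src2;   /* register or temporary identifiers */
} uop;

/* Expand a CISC-style "add [mem], reg" into three RISC-like micro-ops:
 * load the memory operand, add, then store the result back. */
static int decode_add_mem_reg(int mem_reg, int src_reg, uop out[3]) {
    out[0] = (uop){ UOP_LOAD,  /*dst=*/100, mem_reg, 0 };      /* tmp <- [mem]     */
    out[1] = (uop){ UOP_ADD,   /*dst=*/100, 100, src_reg };    /* tmp <- tmp + src */
    out[2] = (uop){ UOP_STORE, /*dst=*/0,   mem_reg, 100 };    /* [mem] <- tmp     */
    return 3;   /* number of micro-ops emitted */
}

int main(void) {
    uop uops[3];
    int n = decode_add_mem_reg(/*mem base reg*/1, /*src reg*/2, uops);
    printf("expanded into %d micro-ops\n", n);
    return 0;
}
```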
B. Memory Hierarchy
In this section, I compare two example CPUs from AMD and Intel. The models selected are the FX-8320/8150 and the i7 2600, since they have similar performance according to [4] and their prices are close. Selecting processors to compare is a somewhat subjective choice; they could be picked based on benchmark performance or simply on price. However, I am only interested in a general sense of their memory designs, so these selections are good enough for this purpose. The following table compares the memory systems of these processors; the information is taken from the Intel and AMD product manuals [2, 3] and from [8, 9].
Table I. AMD vs. Intel: Memory Hierarchy
Feature | AMD FX-8150 | Intel i7 2600
Microarchitecture | Bulldozer | Sandy Bridge
Cores | 8 cores in 4 modules | 4 cores
L1 cache | 256 KB (I) + 128 KB (D) total: 64 KB (I) per module, 16 KB (D) per core; 2-way set associative | 4 x 32 KB (I) + 4 x 32 KB (D); 4-way associative for instructions, 8-way associative for data
L2 cache | 4 x 2 MB; each 2 MB is shared between the 2 cores in one module; 16-way associative | 4 x 256 KB; each 256 KB is private to one core; 8-way associative
L3 cache | 8 MB shared by all cores; 64-way associative | 8 MB shared by all cores; 16-way associative
Replacement scheme | pseudo-LRU | L1 pseudo-LRU; L2 pseudo-LRU; L3 pseudo-LRU with an ordered selection algorithm
Cache latency (cycles) | 4 (L1 cache load) | 4 (L1 cache); 11 (L2 cache); 25 (L3 cache)
Max memory bandwidth | 21 GB/s | 21 GB/s
Memory channel support | not known | 3 memory channels; each channel consists of a separate set of DIMMs and can transfer in parallel with the others
Virtual & physical addresses | 48-bit virtual and 48-bit physical addresses | 48-bit virtual and 36-bit physical addresses
TLB | 2 levels of TLB | 2 levels of TLB
How caches are indexed | not known | L1: virtually indexed, physically tagged; L2: physically indexed; L3: physically indexed
From Table I, we can conclude that AMD and Intel have similar memory designs. Both use a pseudo-LRU replacement policy, which exploits temporal locality. Both have multiple levels of cache, which helps the processor obtain data quickly and reduces the cost of cache misses. The virtually indexed, physically tagged approach used in the L1 caches allows the cache read to begin immediately while the tag comparison uses physical addresses [1](P-B38).
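The reason this works is that, with a sufficiently small or associative L1, the set index falls entirely within the page offset, which is identical in the virtual and physical address. A minimal sketch of that bit arithmetic, assuming a 32 KB, 8-way, 64-byte-line L1 like the i7's data cache (the constants and function names are mine, not from either vendor's documentation):

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 32 KB, 8-way, 64-byte lines -> 64 sets.
 * Index + offset use bits [11:0], which lie inside the 4 KB page offset,
 * so the set can be selected from the virtual address before translation. */
#define LINE_BITS  6              /* 64-byte line */
#define SET_BITS   6              /* 64 sets      */

static uint32_t set_index(uint64_t vaddr) {
    return (uint32_t)((vaddr >> LINE_BITS) & ((1u << SET_BITS) - 1));
}

static uint64_t phys_tag(uint64_t paddr) {
    return paddr >> (LINE_BITS + SET_BITS);   /* tag compared after the TLB lookup */
}

int main(void) {
    uint64_t vaddr = 0x7f1234567890ULL;
    /* The index is available immediately; the tag waits for the physical address. */
    printf("set %u\n", set_index(vaddr));
    printf("tag 0x%llx (needs the translated address)\n",
           (unsigned long long)phys_tag(0x1234567890ULL));
    return 0;
}
```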
C. Optimization of Cache Performance
To optimize cache performance, ten common techniques are applied to cache design [1]. The goals are to reduce hit time, increase cache bandwidth, reduce miss penalty, reduce miss rate, and reduce miss penalty or miss rate through parallelism [1](P79). In this section, I list how AMD and Intel use several of these techniques. The names of the techniques are slightly changed from [1](P79-92).
1. Use small and simple first-level caches to reduce hit time and power.
From Table I, this holds for both the AMD and the Intel processor. For the i7, the L3 cache is 8 MB and 16-way associative, while each L1 cache is only 32 KB and 4-way or 8-way associative.
2. Pipeline cache access to increase cache bandwidth.
This technique pipelines cache access so that a first-level cache hit can take multiple cycles. Clock cycle time is shorter and bandwidth is higher, but individual hits are slower. In the current Intel Core i7, the pipeline takes 4 clocks for a cache access. AMD's design also pipelines its load-store path.
3. Nonblocking caches to increase cache bandwidth.
A nonblocking cache allows the data cache to continue to supply data while a miss is outstanding, so the processor does not need to stall on a miss. The Intel Core i7, as a high-performance processor, supports both "hit under multiple miss" and "miss under miss" [1](P83).
4. Multibanked caches to increase cache bandwidth.
This means dividing the cache into independent banks that can support multiple accesses at the same time, which makes accesses faster. The Intel Core i7 has four banks in L1 and eight banks in L2 [1](P85).
5. Merging write buffers to reduce miss penalty.
Write buffers are used with write-through caches. A merging write buffer combines a new write with an existing buffer entry whose address matches, rather than allocating a new entry. The Intel Core i7 uses this merging technique in its caches.
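A minimal sketch of the merging idea (the structure and names are invented for illustration, not Intel's actual buffer): before allocating a new entry, the buffer checks whether an existing entry already covers the same cache block and, if so, folds the new bytes into it.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64
#define WB_ENTRIES   8

typedef struct {
    int      valid;
    uint64_t block_addr;            /* address of the 64-byte block          */
    uint8_t  data[BLOCK_BYTES];
    uint64_t byte_mask;             /* which bytes in the block are dirty    */
} wb_entry;

static wb_entry wb[WB_ENTRIES];

/* Insert a store (len <= 8, not crossing a block boundary is assumed).
 * Merge with an existing entry for the same block if possible.
 * Returns 0 on success, -1 if the buffer is full and the store must stall. */
int write_buffer_put(uint64_t addr, const uint8_t *bytes, int len) {
    uint64_t block = addr & ~(uint64_t)(BLOCK_BYTES - 1);
    int offset = (int)(addr & (BLOCK_BYTES - 1));
    int free_slot = -1;

    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block) {
            /* Merge: overlay the new bytes onto the existing entry. */
            memcpy(wb[i].data + offset, bytes, len);
            wb[i].byte_mask |= ((1ULL << len) - 1) << offset;
            return 0;
        }
        if (!wb[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                  /* no room: the store would stall */

    wb[free_slot].valid = 1;
    wb[free_slot].block_addr = block;
    memcpy(wb[free_slot].data + offset, bytes, len);
    wb[free_slot].byte_mask = ((1ULL << len) - 1) << offset;
    return 0;
}

int main(void) {
    uint8_t v = 0xab;
    write_buffer_put(0x1000, &v, 1);
    write_buffer_put(0x1008, &v, 1);   /* same 64-byte block: merges into the first entry */
    return 0;
}
```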
6. Prefetch instructions and data to reduce miss penalty or miss rate via hardware.
Hardware prefetching means hardware outside the cache prefetches instructions or data before the processor requests them. The Intel Core i7 supports this technique in its L1 and L2 caches.
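As a sketch of the idea (a simple next-line prefetcher, not the i7's actual prefetch hardware, and the function names are assumptions), the logic below issues a prefetch for the following cache block whenever a demand miss is observed:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64

/* Stand-in for the hardware action of fetching a block into the cache early. */
static void issue_prefetch(uint64_t block_addr) {
    printf("prefetch block 0x%llx\n", (unsigned long long)block_addr);
}

/* Called on every demand miss: fetch the next sequential block as well,
 * betting on spatial locality so the later access hits instead of missing. */
void on_demand_miss(uint64_t miss_addr) {
    uint64_t block = miss_addr & ~(uint64_t)(BLOCK_BYTES - 1);
    issue_prefetch(block + BLOCK_BYTES);
}

int main(void) {
    on_demand_miss(0x1000);   /* miss on block 0x1000 -> prefetch 0x1040 */
    return 0;
}
```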
D. Instruction-level Parallelism
When we talk about instruction-level parallelism, the major approaches include static scheduling, dynamic scheduling, branch prediction, multiple-issue, speculation and their combinations.
1. Static scheduling
Static scheduling means using the compiler to handle dependences and minimize stalls. Compiler scheduling can also benefit a dynamically scheduled pipeline.
2. Dynamic scheduling
Dynamic scheduling means executing an instruction as soon as its operands are available (out-of-order execution). This introduces WAR and WAW hazards, which do not exist in the classic 5-stage pipeline. Waiting for operands to become available resolves RAW hazards, and dynamic register renaming resolves WAR and WAW hazards. The simple version of Tomasulo's approach results in in-order issue and out-of-order completion, but combined with speculation it supports out-of-order execution with in-order commit.
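A minimal sketch of the renaming step (not Tomasulo's full reservation-station machinery; the sizes and the naive free list are assumptions): each architectural destination register is remapped to a fresh physical register, which removes WAR and WAW hazards, while RAW dependences are preserved because sources read the current mapping.

```c
#include <stdio.h>

#define ARCH_REGS 16
#define PHYS_REGS 64

static int rename_map[ARCH_REGS];   /* architectural -> physical            */
static int next_phys = ARCH_REGS;   /* naive free "list": hand out in order */

/* Rename one instruction "dst <- src1 op src2".
 * Sources read the current mapping (keeps RAW ordering);
 * the destination gets a brand-new physical register
 * (removes WAW and WAR hazards on dst). */
void rename(int dst, int src1, int src2) {
    int p1 = rename_map[src1];
    int p2 = rename_map[src2];
    int pd = next_phys++;           /* a real core recycles freed registers */
    rename_map[dst] = pd;
    printf("p%d <- p%d op p%d\n", pd, p1, p2);
}

int main(void) {
    for (int r = 0; r < ARCH_REGS; r++)
        rename_map[r] = r;          /* initial identity mapping */

    rename(1, 2, 3);   /* r1 <- r2 op r3                                          */
    rename(2, 1, 4);   /* r2 <- r1 op r4 : WAR on r2 disappears after renaming    */
    rename(1, 5, 6);   /* r1 <- r5 op r6 : WAW on r1 disappears after renaming    */
    return 0;
}
```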
3. Branch prediction
The Intel Core i7 uses a two-level predictor. The first level is small enough to predict a branch every clock cycle, and the larger second level serves as a backup. Each level combines three predictors: 1) a two-bit predictor, 2) a global history predictor, and 3) a loop exit predictor. The prediction is selected based on the accuracy each predictor has shown.
AMD's Bulldozer has a new branch prediction design [13]. The scheme is a hybrid with a local predictor and a global predictor. The branch target buffer (BTB) has two levels: level 1 is organized as a set-associative cache with 128 4-way sets, and level 2 has 1024 5-way sets [13].
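As a sketch of the two-bit predictor component mentioned above (the table size and indexing are assumptions for illustration), each branch indexes a table of saturating counters; the prediction is taken from the counter's upper half, and the counter moves toward taken or not-taken on each outcome:

```c
#include <stdint.h>

#define PRED_ENTRIES 4096                /* assumed table size */

static uint8_t counters[PRED_ENTRIES];   /* 0..3: 0,1 = not taken, 2,3 = taken */

static unsigned index_of(uint64_t pc) {
    return (unsigned)((pc >> 2) & (PRED_ENTRIES - 1));
}

/* Predict taken when the counter is in one of the two "taken" states. */
int predict(uint64_t pc) {
    return counters[index_of(pc)] >= 2;
}

/* Update: saturate the counter toward the actual outcome, so a single
 * anomalous outcome does not immediately flip a strongly biased branch. */
void update(uint64_t pc, int taken) {
    uint8_t *c = &counters[index_of(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    update(0x400, 1);
    update(0x400, 1);
    return predict(0x400) ? 0 : 1;       /* now predicted taken */
}
```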
4. Multiple-issue
Multiple issue can be achieved with a statically scheduled superscalar processor, a dynamically scheduled superscalar processor, or a VLIW processor. VLIW means packing multiple operations into one large instruction. This technique is mainly used in signal processing, not in AMD or Intel processors.
5. Speculation
By adding a reorder buffer, temporary results can be held in buffer entries. Only when an instruction reaches the head of the reorder buffer can it be committed and then removed from the buffer.
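A minimal sketch of that commit discipline (the fields are illustrative, not the i7's or FX-8150's actual structures; the 128-entry size matches Table II below): results are written into reorder-buffer entries out of order, but an entry only updates architectural state when it reaches the head and has completed.

```c
#include <stdio.h>

#define ROB_SIZE 128                 /* both chips use a 128-entry structure */

typedef struct {
    int  busy;       /* entry allocated                      */
    int  done;       /* result has been produced             */
    int  dest_reg;   /* architectural register to update     */
    long value;      /* speculative result held in the entry */
} rob_entry;

static rob_entry rob[ROB_SIZE];
static int head = 0;                 /* oldest in-flight instruction */

static long arch_regs[16];

/* Retire instructions strictly in order: only the head entry may commit,
 * and only once its result is ready.  Anything younger waits, which is
 * what lets mispredicted or faulting speculation be thrown away safely. */
void commit_step(void) {
    while (rob[head].busy && rob[head].done) {
        arch_regs[rob[head].dest_reg] = rob[head].value;   /* update architectural state */
        rob[head].busy = 0;
        head = (head + 1) % ROB_SIZE;
    }
}

int main(void) {
    rob[0] = (rob_entry){ .busy = 1, .done = 1, .dest_reg = 3, .value = 42 };
    rob[1] = (rob_entry){ .busy = 1, .done = 0, .dest_reg = 4, .value = 7 };  /* not ready: blocks commit */
    commit_step();
    printf("r3 = %ld, head = %d\n", arch_regs[3], head);   /* entry 0 committed, entry 1 waits */
    return 0;
}
```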
The primary approach used in the Intel Core i3, i5, and i7 and the AMD Phenom is speculative superscalar [1](P194). Their issue structure is dynamic, hazard detection is done in hardware, scheduling is dynamic with speculation, and execution is out of order with speculation [1](P194).
Table II. AMD vs. Intel: Hardware for Dynamic scheduling
Feature | AMD FX-8150 | Intel i7 2600
Instruction decode width [11] | 4-wide | 4-wide
Single-core peak decode [11] | 4 instructions | 4 instructions
Instruction decode queue [9] | 16 entries | 18+ entries
Load/store buffers | 40-entry load queue; 24-entry store queue [11] | 48 load buffers; 32 store buffers
Pipeline depth | 18+ stages | 14 stages
Branch misprediction penalty | 20 clock cycles for conditional and indirect branches; 15 clock cycles for unconditional jumps and returns | 17 clock cycles
Reservation stations | 40-entry unified integer/memory scheduler [8]; 60-entry unified floating-point scheduler [8] | 36-entry centralized reservation station shared by six functional units
Reorder buffer | 128-entry retirement queue | 128-entry reorder buffer
In addition to the information shown in Table II, the latencies of Bulldozer's floating-point and integer vector instructions are generally longer than those of Intel's Sandy Bridge [13].
E. Thread-level Parallelism
There are two flavors of shared-memory multiprocessor: the centralized shared-memory multiprocessor and the distributed shared-memory multiprocessor.
Memory coherence is the major issue that needs to be handled. The protocols for maintaining coherence are directory-based and snooping. The key difference is that in a snooping protocol every node snoops on a broadcast medium, while a directory-based protocol keeps a directory entry for each cache block and communicates only between the involved nodes. In a directory-based protocol, a request from a node is always sent first to the directory and then to the involved nodes.
To implement an invalidate protocol in a multicore, a bus is used to perform the invalidates [1](P356). In newer multicore processors, instead of the shared-memory access bus, the bus used for coherence can be the connection between the private caches and the shared cache [1](P356). The Intel Core i7 uses this approach, which is faster [1](P356).
The limitation of centralized shared-memory multiprocessors and snooping protocols is that the centralized resource becomes a bottleneck [1](P363). The Intel Core i7 places a directory at the outermost L3 cache. The directory records which processors' caches hold a copy of each block in the outermost L3 cache [1](P363). It is implemented as a bit vector, with one bit per core, for each L3 block [1](P363, 379).
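A sketch of such a bit-vector directory entry (field and function names are assumed; block lookup and the rest of the protocol are omitted): one bit per core records which private caches may hold the block, so an invalidate only needs to reach the cores whose bits are set.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 4                       /* i7 2600: one bit per core */

typedef struct {
    uint8_t sharers;                      /* bit i set => core i may cache the block */
} dir_entry;

/* A core reads the block: remember it as a (potential) sharer. */
void dir_track_read(dir_entry *e, int core) {
    e->sharers |= (uint8_t)(1u << core);
}

/* A core writes the block: invalidate every other recorded sharer,
 * then leave the writer as the only core marked in the entry. */
void dir_on_write(dir_entry *e, int writer) {
    for (int c = 0; c < NUM_CORES; c++) {
        if (c != writer && (e->sharers & (1u << c)))
            printf("send invalidate to core %d\n", c);   /* stand-in for the coherence message */
    }
    e->sharers = (uint8_t)(1u << writer);
}

int main(void) {
    dir_entry e = { 0 };
    dir_track_read(&e, 0);
    dir_track_read(&e, 2);
    dir_on_write(&e, 3);   /* invalidates cores 0 and 2, then core 3 is the sole sharer */
    return 0;
}
```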
The AMD Opteron uses an approach that is midway between a snooping and a directory-based protocol [1](P363). First, memory is directly connected to each multicore chip, and at most four such chips can be connected [1](P363). Because local memory is faster to access, this is a form of non-uniform memory access [1](P363). Second, the Opteron uses point-to-point links to broadcast to the other chips and uses explicit acknowledgements to verify that an invalidate operation has completed [1](P363). It is also worth mentioning that the outermost cache on each multicore chip is shared among the cores, the same approach the Intel Core i7 uses [1](P384).