Performance and ISA General Concepts
These concepts come back later when we analyse pipelining
Average CPI
CPI = sum over instruction classes of (CPI_i * F_i), where F_i = class instruction count / total Instruction Count
CPI = (CPUtime * Clock Rate) / Instruction Count
= Clock Cycles / Instruction Count
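The frequency-weighted sum can be checked with a short sketch (the instruction mix and per-class CPIs below are made up for illustration):

```python
# Average CPI as a frequency-weighted sum: CPI = sum(CPI_i * F_i).
# The instruction mix below is hypothetical, for illustration only.
mix = {
    "ALU":    {"cpi": 1, "count": 500},
    "load":   {"cpi": 5, "count": 200},
    "store":  {"cpi": 3, "count": 100},
    "branch": {"cpi": 2, "count": 200},
}

total = sum(c["count"] for c in mix.values())  # total instruction count
avg_cpi = sum(c["cpi"] * c["count"] / total for c in mix.values())

print(avg_cpi)  # (1*500 + 5*200 + 3*100 + 2*200) / 1000 = 2.2
```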
Influencing Factors on Performance
1. Compile the program to Binary
Depending on the compiler and the kind of instructions we have,
we can change the number of instructions
[AVG CPI]
Compiler:
Different compilers use different techniques to compile
- gcc
- Clang
- icc
They will generate different binaries and will optimise code with the -O<level> flag on the compile line
e.g -O3 will apply level 3 optimisations
Instruction Set Architecture:
The same high-level statement is translated differently depending on the ISA
e.g A*+B
2. Binary Executes on Machine
[CYCLE TIME CPI]
Machine:
- More accurately, the hardware implementation (microarchitecture)
- Determines cycle time and cycles per instruction
Cycle time:
Different clock frequency
Cycles per instruction:
Determined by the design of the internal mechanism
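The factors combine as CPU time = Instruction Count * CPI * cycle time (cycle time = 1 / clock rate). A quick sketch with two made-up machines shows that a slower clock can still win if the CPI is lower:

```python
def cpu_time(instr_count, cpi, clock_hz):
    """CPU time = IC * CPI * cycle time, with cycle time = 1 / clock rate."""
    return instr_count * cpi / clock_hz

# Hypothetical machines running the same program (same instruction count):
t_a = cpu_time(1_000_000, 2.0, 1e9)   # 1 GHz clock, CPI 2.0 -> 2.0 ms
t_b = cpu_time(1_000_000, 1.2, 8e8)   # 800 MHz clock, CPI 1.2 -> 1.5 ms

print(t_a, t_b)  # machine B is faster despite the slower clock
```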
Summary:
Performance is specific to a program
- A given machine can have a different CPI for each program
- Common misunderstanding: expecting one change to speed up everything; machines are built for different purposes
Amdahl's Law
Speedup is limited by the portion of the program that is not sped up. Optimising helps, but there is a limit to how far it goes.
FP running 5x faster doesn't mean we divide the total time by 5.
We divide only the FP time by 5 and add back the remaining time.
e.g
FP ins = 6 sec
Rest of benchmark = 6 sec
Total: 12 sec
New time = 6/5 (FP ins) + 6 = 7.2 sec, so SpeedUp = 12/7.2 ≈ 1.67x
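The worked example above can be checked with a short sketch (the function name is mine):

```python
def sped_up_time(total, affected, factor):
    """Amdahl's Law: only the affected portion is divided by the speedup
    factor; the rest of the run time is unchanged."""
    return affected / factor + (total - affected)

# Example from the notes: FP takes 6 s of a 12 s run, FP made 5x faster.
new_time = sped_up_time(12, 6, 5)  # 6/5 + 6 = 7.2 s
speedup = 12 / new_time            # ~1.67x overall, not 5x

print(new_time, speedup)
```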
Boolean Algebra
Use variables like X and Y to represent logic expressions
e.g X = A + B
RISC VS CISC
CISC:
Complex instructions: e.g. a single instruction can perform a whole matrix multiplication
Give whatever the user wants.
EXE: small (fewer instructions per program)
Hardware: Complex
e.g Intel x86
RISC:
Give only the simplest instructions; the rest is built from them in software
e.g add, mul, branch
EXE: Big (more instructions per program)
Hardware: Simple, so it is easier to optimise
e.g MIPS, ARM
#1 Data Storage
Storage architecture.
Von Neumann architecture: all data lives in memory, and when the processor needs a value, we bring it in
Standard register (GPR)
Instructions such as load and store move information between memory and the registers
This is the most popular style
Memory-Memory
Specify the memory addresses in the instruction and operate on them straight away
This is bad because memory takes very long to access
Stack
Last-in-first-out data structure. Push and pop are used to move information between memory and the stack.
When we perform an add, the operands are popped from the stack and the result is pushed back onto it.
Popping will store the value back to memory.
All the instructions here are very tiny (no explicit operands)
e.g Java JVM
Accumulator
There is a single accumulator register; whatever result is calculated is placed in the accumulator, and one ALU operand is implicitly taken from it.
Storing takes the value from the accumulator back to memory
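For A = B + C, a GPR machine would do load, load, add, store; an accumulator machine LOAD B, ADD C, STORE A; a memory-memory machine a single ADD A, B, C. The stack version can be sketched as a toy interpreter (the memory layout and instruction names are mine, loosely JVM-style):

```python
# Toy stack machine evaluating A = B + C; memory is a dict of named
# locations standing in for addresses.
memory = {"A": 0, "B": 7, "C": 5}
stack = []

def push(addr):  # load a value from memory onto the stack
    stack.append(memory[addr])

def add():       # pop two operands, push the result
    stack.append(stack.pop() + stack.pop())

def pop(addr):   # store the top of the stack back to memory
    memory[addr] = stack.pop()

# PUSH B; PUSH C; ADD; POP A
push("B"); push("C"); add(); pop("A")
print(memory["A"])  # 12
```

Note that none of the instructions name more than one operand, which is why stack-machine encodings are so small.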
#2 Memory and Addressing mode
- Address size is different from the data size.
A k-bit address names 2^k different locations
When reading, we use an n-bit data bus, but n may not be the same as k
Load:
Address placed in MAR; the data read comes back in MDR
Store:
Address placed in MAR; the value to write placed in MDR
Endianness
The ordering of the bytes of a multi-byte word stored in memory
Big endian:
Store the most significant byte in the lower address
Little-endian
Store the least significant byte in the lower address
The problem arises on loads: different machines will return different results for the same bytes
Intel: Little Endian
Mips: Depends (Sim: Little Endian)
Network order: Big endian
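Python's struct module makes the two byte orderings visible directly:

```python
import struct

value = 0x01020304  # a 4-byte word

big = struct.pack(">I", value)     # big-endian: MSB at the lowest address
little = struct.pack("<I", value)  # little-endian: LSB at the lowest address

print(big.hex())     # 01020304
print(little.hex())  # 04030201

# Loading little-endian bytes as if they were big-endian gives a
# different number: the cross-machine problem described above.
print(struct.unpack(">I", little)[0] == value)  # False
```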
Addressing modes
We cover 3 kinds of addressing modes, but other ISAs have more than 3
e.g Register indirect, auto increment
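A toy sketch of three common modes, assuming they are immediate, register, and base+displacement as in MIPS (the register names and memory contents are illustrative):

```python
# Illustrative register file and memory; addresses are plain integers.
memory = {100: 42, 104: 7}
regs = {"r1": 100, "r2": 5}

def immediate(value):             # operand is encoded in the instruction
    return value

def register(reg):                # operand sits in a register
    return regs[reg]

def base_displacement(reg, off):  # effective address = register + offset
    return memory[regs[reg] + off]

print(immediate(10))               # 10
print(register("r2"))              # 5
print(base_displacement("r1", 4))  # memory[100 + 4] = 7
```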
#3 Operation
Standard operations in an instruction set
- data movement
- Arithmetic
- Shift
- Branch
- Call
Note: Load is the most frequently used, so optimise it first (make the common case fast)
#4 Instruction Format
Instruction Length
Variable-length instructions:
Used in most CISC
Require multi-step fetch and decode
Fixed-Length:
Used in RISC
Easy to encode/decode
Instruction bits are scarce (everything must fit in a fixed width)
Hybrid:
A mix of both