Performance and ISA General Concepts


Response time: number of seconds per program (execution time)
Performance: 1 / execution time, i.e. work done per second



These metrics also apply when we analyse pipelined processors

Average CPI

CPI = sum over each instruction class i of (CPI_i * F_i)
F_i = instruction count of class i / total instruction count

CPI = (CPU time * clock rate) / instruction count
    = clock cycles / instruction count
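As a sketch, the weighted-average CPI formula can be computed over a made-up instruction mix (the classes, per-class CPIs, and frequencies below are illustrative, not from the notes):

```python
# Weighted-average CPI: CPI = sum(CPI_i * F_i) over instruction classes.
# All numbers here are hypothetical, for illustration only.
mix = {
    "ALU":    {"cpi": 1, "freq": 0.50},
    "load":   {"cpi": 2, "freq": 0.20},
    "store":  {"cpi": 2, "freq": 0.10},
    "branch": {"cpi": 3, "freq": 0.20},
}

# Each F_i is already IC_i / total instruction count, so they sum to 1.
avg_cpi = sum(c["cpi"] * c["freq"] for c in mix.values())
print(round(avg_cpi, 2))  # expected about 1.7 for this mix
```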

Influencing Factors on Performance

1. Compiling the program to a binary
Depending on the compiler and the kinds of instructions available,
the number of instructions generated can change
[affects instruction count and AVG CPI]

Compiler:
Different compilers use different techniques to compile
- gcc
- Clang
- icc
They generate different binaries and optimise code via the -O(level) flag on the command line
e.g. -O2 enables optimisation level 2


Instruction Set Architecture:
The same high-level statement is translated differently depending on the ISA
e.g. A + B compiles to different instruction sequences on different ISAs

2. Binary Executes on Machine
[CYCLE TIME CPI]

Machine:
- More accurately, the hardware implementation
- Determines cycle time and cycles per instruction


Cycle time:
Set by the clock frequency, which differs between machines

Cycles per instruction:
Determined by the design of the internal mechanism (the microarchitecture)

Summary:
Performance is specific to a particular program
- A given machine can have a different CPI for different programs
- Common misunderstanding: expecting a change to one aspect of a machine to improve overall performance proportionally

Amdahl's Law

Overall speedup is limited by the portion of the program that cannot be sped up. Optimising helps, but there is a limit to how much the optimisation can gain.

"FP runs 5x faster" does not mean we simply divide the total time by 5.
We divide only the FP time by 5, then add back the remaining time.
e.g.
FP instructions = 6 sec
Rest of benchmark = 6 sec
Total: 12 sec
New time = 6/5 (FP) + 6 = 7.2 sec, so speedup = 12/7.2 ≈ 1.67x
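The example above can be checked numerically (a minimal sketch; the 6 s / 6 s split comes straight from the example):

```python
# Amdahl's Law: speed up only the FP portion, then recompute total time.
fp_time = 6.0       # time spent in FP instructions (from the example)
other_time = 6.0    # rest of the benchmark
old_total = fp_time + other_time            # 12 s

fp_speedup = 5.0
new_total = fp_time / fp_speedup + other_time   # 6/5 + 6 = 7.2 s

overall_speedup = old_total / new_total         # far less than 5x
print(new_total, round(overall_speedup, 2))
```

Note the overall speedup is only about 1.67x even though the FP part runs 5x faster: the untouched 6 s dominates.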

Boolean Algebra

Use variables such as X and Y to represent logic values
e.g. X = A + B (where + means OR)
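As a quick illustration, the OR reading of X = A + B can be tabulated, with 0/1 standing for false/true:

```python
# Truth table for X = A + B, where '+' in Boolean algebra means OR.
table = [(A, B, A | B) for A in (0, 1) for B in (0, 1)]
for A, B, X in table:
    print(f"A={A} B={B} -> X={X}")
# X is 1 whenever either input is 1
```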

RISC VS CISC

CISC:
Provides complex instructions; e.g. a single instruction might perform something as elaborate as a matrix multiplication.
Give the programmer whatever they want.
EXE: small
Hardware: complex

e.g. Intel x86

RISC:
Give them the simplest instructions, and the rest is built from those
e.g. add, mul, branch
EXE: big
Hardware: simple, easy to optimise

e.g. MIPS, ARM


#1 Data Storage

Storage architecture.
In the von Neumann architecture, all data lives in memory; when the processor needs it, we bring it in

General-purpose register (GPR)

There are instructions such as load and store to move information between memory and registers
This is the most popular style

Memory-Memory

Specify the memory addresses directly in the instruction and operate on memory directly
This is bad because memory accesses take very long

Stack

Last-in first-out data structure. Push and pop are used to move information to and from memory.
When we perform an add, we take the operands from the stack and the result is pushed onto the stack.
Popping stores the result back to memory.
All the instructions here are very small

e.g Java JVM
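A minimal sketch of the stack architecture above, assuming a toy dictionary-backed "memory" and hypothetical push/pop/add helpers (none of these names come from a real ISA):

```python
# Toy stack machine: compute C = A + B with push/pop,
# mirroring how a JVM-style stack architecture evaluates expressions.
memory = {"A": 3, "B": 4, "C": 0}   # made-up memory with named addresses
stack = []

def push(addr):          # bring a value from memory onto the stack
    stack.append(memory[addr])

def pop(addr):           # store the top of the stack back to memory
    memory[addr] = stack.pop()

def add():               # pop two operands, push the result
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

# C = A + B
push("A")
push("B")
add()
pop("C")
print(memory["C"])  # 7
```

Note how every instruction is tiny: operands are implicit (the stack top), so no instruction needs to name registers.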

Accumulator

There is a single accumulator register: whatever result is calculated is placed in the accumulator, which is preloaded as an operand for each ALU execution.
Storing takes the result from the accumulator back to memory


#2 Memory and Addressing mode

- Address size is different from data size.
A k-bit address means 2^k different locations
When reading, we use an n-bit data bus, but n may not be the same as k
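A quick sanity check of the 2^k rule, using k = 32 as an illustrative address width:

```python
# A k-bit address identifies 2**k distinct locations; the data bus width n
# is independent of k. Example: 32-bit addresses, byte-addressable memory.
k = 32
locations = 2 ** k
print(locations)            # number of addressable locations
print(locations // 2**30)   # = 4, i.e. 4 GiB of byte-addressable memory
```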

Load:
Address placed in MAR; data comes back via MDR
Store:
Address in MAR; value to write in MDR

Endianness

The ordering of the bytes of a multi-byte word stored in memory

Big-endian:
Stores the most significant byte at the lower address

Little-endian:
Stores the least significant byte at the lower address

The problem arises when you do a load: different machines will return different results.
Intel: little-endian
MIPS: depends (simulator: little-endian)
Network order: big-endian
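Python's struct module can demonstrate the two byte orders (the value 0x12345678 is just an illustrative constant):

```python
import struct

# The same 32-bit value stored in both byte orders.
value = 0x12345678
big    = struct.pack(">I", value)   # big-endian: MSB at lowest address
little = struct.pack("<I", value)   # little-endian: LSB at lowest address

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Reading little-endian bytes as if big-endian returns the wrong value:
print(hex(struct.unpack(">I", little)[0]))  # 0x78563412
```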

Addressing modes

There are 3 basic addressing modes, but some ISAs provide more than 3
e.g. register indirect, auto-increment

#3 Operations

Standard operations in an instruction set
- data movement
- Arithmetic
- Shift
- Branch
- Call

Note: load is the most frequently used, so optimise it first

#4 Instruction Format

Instruction Length

Variable-length instructions:
Used in most CISC ISAs
Require multi-step fetch and decode

Fixed-length:
Used in RISC
Easy to decode/encode
Instruction bits are scarce

Hybrid:
A mix of both