CS257 - Advanced Computer Architecture

Main memory organisation

There are four main structural components of the computer

  • CPU - controls operation of computer and data processing
  • Main memory - stores data
  • IO - moves data between computer and external environment
  • System interconnection - communication between CPU, main memory and IO

The designer's dilemma is that faster memory is more expensive per capacity, so the performance, capacity and cost of memory systems have to be balanced. This is solved with the memory hierarchy.

The memory hierarchy consists of small amounts of expensive fast memory for the most commonly needed data and large, inexpensive memory for less frequently used data. This takes advantage of temporal and spatial locality.

Semiconductor memory

Main memory, cache memory and solid-state storage all use semiconductor memory. There are several types of semiconductor memory, including RAM, ROM and flash memory.

RAM is able to rapidly read and write data and is volatile. There are two categories of RAM - dynamic (DRAM) and static (SRAM).

Dynamic memory cells are simpler and more compact. They consist of a transistor/capacitor pair, which allows greater memory cell density. DRAM is cheaper to produce than SRAM but has a refresh overhead, which makes it inefficient for small memories.

Static memory cells provide better read and write times than DRAM and use flip-flop logic gates. Cache memory is usually implemented as SRAM whereas main memory is usually DRAM.

Interleaved memory is composed of a collection of DRAM chips grouped together to form a memory bank. Each bank is able to independently serve read and write requests. If consecutive words in memory are stored in different banks they can be transferred faster. Distributing addresses among memory banks is called interleaving. An effective use of interleaved main memory is when the number of banks is equal to or a multiple of the number of words in a cache line.
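
As a minimal sketch (assuming four banks, word addressing and low-order interleaving, none of which are fixed by the notes), the bank index comes from the low-order address bits, so consecutive word addresses land in different banks:

```c
#include <stdint.h>

/* Low-order interleaving sketch: with 4 banks (an assumption for this
   example), consecutive word addresses 0,1,2,3,4,... map to banks
   0,1,2,3,0,..., so a sequential block spreads across all banks. */
#define NUM_BANKS 4u

static inline uint32_t bank_of(uint32_t word_addr)    { return word_addr % NUM_BANKS; }
static inline uint32_t row_within(uint32_t word_addr) { return word_addr / NUM_BANKS; }
```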

Suppose a processor issues a read/write request: the DRAM performs the operation, but the processor must wait for it to complete. This creates a performance bottleneck, which can be addressed with synchronous DRAM (SDRAM). SDRAM exchanges data with the processor after a fixed number of clock cycles, so the CPU doesn't have to wait and can do other tasks during the known busy period. SDRAM works best when transferring large blocks of data serially.

SDRAM sends data to the processor once per clock cycle. Double Data Rate (DDR) SDRAM sends data twice per clock cycle. This, combined with prefetching, achieves higher data rates.

Cache DRAM (CDRAM) uses both SRAM and DRAM. It consists of a small SRAM chip and regular DRAM chip and is ideal for cache memory.

ROM is non-volatile read-only memory. PROM is writeable but only once. EPROM is rewritable using UV radiation to erase data. EEPROM allows individual bytes to be erased and rewritten but is more expensive than EPROM.

Flash memory is semiconductor memory used for both internal and external applications. It has high read/write speeds but can only erase whole blocks, not individual bytes. It is denser than EEPROM and, unlike EPROM, can be erased and rewritten electrically in place.

Memory hierarchy design

Memory hierarchy design needs to consider

  • locality - temporal and spatial
  • inclusion - the data held at each level is a subset of the data held at the level below, copied up the hierarchy
  • coherence - copies of the same data must be consistent

The performance of a memory hierarchy depends on

  • address reference statistics
  • access time of each level
  • storage capacity of each level
  • size of blocks to be transferred between levels
  • allocation algorithm

Performance is often measured as a hit ratio, which is the probability an address is in a memory level.

Virtual memory

Virtual memory is a hierarchical memory system managed by the operating system. To the programmer it appears as a single large memory. Virtual memory is used

  • to make programs independent of memory system configuration and capacity
  • to free the programmer from the need to allocate storage and permit sharing of memory between processes
  • to achieve fast access rates and low cost per bit
  • to simplify the loading of programs for execution

When using virtual memory, the address space of the CPU (virtual or logical address space) is much larger than the physical memory space. When required information is not in main memory it is swapped in from secondary memory.

The pages of a process in memory do not all need to be loaded for it to execute. Demand paging only loads a page when it is needed. When a program makes a reference to a page not in memory, it triggers a page fault which tells the OS to load the desired page. Because a process can only execute when it is in main memory, that memory is referred to as real memory and the larger memory available to the user is called virtual memory.

The mapping between logical and physical addresses is stored in the page table.
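
A minimal sketch of that translation, assuming a single-level page table, a 4 KiB page size and an illustrative pte_t layout (none of which are specified in the notes):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE 4096u

typedef struct {
    bool     present;  /* is the page currently in main memory?   */
    uint32_t frame;    /* physical frame number, valid if present */
} pte_t;

/* Translate a virtual address to a physical address via the page table. */
uint32_t translate(const pte_t *page_table, uint32_t vaddr) {
    uint32_t page   = vaddr / PAGE_SIZE;   /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE;   /* offset within page  */

    if (!page_table[page].present) {
        /* Page fault: the OS would load the page from secondary
           memory and update this entry (demand paging). */
    }
    return page_table[page].frame * PAGE_SIZE + offset;
}
```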

Whereas paging is invisible to the user and pages are fixed sizes, segmentation allows the user or OS to allocate blocks of memory of variable length with a given name and access rights.

A Translation Lookaside Buffer (TLB) holds the most recently referenced table entries, like a cache for addresses. When an address is not found in the TLB the page tables have to be searched, which has high overhead.

The page size affects space utilisation. If the page is too large there is excessive internal fragmentation and if it is too small then page tables become too large.

When a page fault occurs, the memory management software fetches the required page from secondary storage. If main memory is full, it is necessary to swap a page out to make space for the new one. This is managed using page replacement algorithms

  • Random - Swaps random pages out, poor performance
  • FIFO - Oldest page swapped out, does not consider frequency of access
  • Clock replacement algorithm - Similar to FIFO, uses a 'use' bit to track pages which have been used, removes unused pages
  • Least recently used algorithm - Page least recently used is swapped out. This can be expensive to track so an approximation with 'use' bits is used instead
  • Working set replacement algorithm - Keep track of a working set of pages for a process. A working set is the set of pages accessed in a set time interval

Thrashing occurs when there are too many processes in memory and the OS spends most of its time swapping pages. This can be avoided using good page replacement algorithms, reducing the number of processes or installing more memory.

Cache memory

  • Block - The minimum unit of transfer between cache and main memory
  • Line - A portion of cache memory capable of holding one block
  • Tag - A portion of a cache line that is used for addressing purposes

When virtual addresses are used, the system designer may choose to place the cache between the processor and the MMU or between the MMU and main memory. A logical cache, also known as a virtual cache, stores data using virtual addresses. The processor accesses the cache directly without going through the MMU. A physical cache stores data using main memory physical addresses.

If the CPU requests the contents of a memory location that is in the cache, that is called a cache hit. If not, it is a cache miss. The cache hit ratio is given by the number of hits divided by the total number of memory accesses.

The average access time, $t_a$, is given by $t_a = t_c + (1-h)t_m$ where $t_c$ is the cache access time, $t_m$ is the main memory access time and $h$ is the hit rate. The objective is to make $\frac{t_a}{t_c}$ as close to 1 as possible. For a cache system, $\frac{t_m}{t_c}$ is relatively low, so an $h$ between 0.7 and 0.9 will give good performance.
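
For example, with illustrative values (not from the notes) $t_c = 10$ ns, $t_m = 100$ ns and $h = 0.9$, we get $t_a = 10 + (1 - 0.9) \times 100 = 20$ ns, so $\frac{t_a}{t_c} = 2$; raising $h$ to 0.99 would bring $t_a$ down to 11 ns.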

As there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. Three techniques are used

  • Direct - Map each block of main memory into one possible cache line by taking the block number modulo the number of cache lines (see the sketch after this list)
  • Associative - Each main memory block can be loaded into any line of the cache. The cache control logic interprets a memory address as a tag and word field and can then search the cache for the matching tag
  • Set associative - A combination of both. The cache consists of a number of sets and each set consists of a number of lines. A given block maps to a set.
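
A minimal sketch of direct-mapped address decomposition, assuming 64-byte blocks and 1024 cache lines (illustrative sizes):

```c
#include <stdint.h>

#define OFFSET_BITS 6u                    /* 64-byte block */
#define INDEX_BITS  10u                   /* 1024 lines    */
#define NUM_LINES   (1u << INDEX_BITS)

typedef struct {
    uint32_t offset;  /* byte within the block                         */
    uint32_t line;    /* cache line = block number mod number of lines */
    uint32_t tag;     /* remaining high-order bits, stored in the line */
} cache_addr_t;

static cache_addr_t split_address(uint32_t addr) {
    cache_addr_t a;
    a.offset = addr & ((1u << OFFSET_BITS) - 1u);
    a.line   = (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
    a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}
```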

A victim cache can be used to reduce conflict misses of direct mapped caches by storing blocks which map to the same cache line.

Content addressable memory is constructed of SRAM cells but is more expensive and holds less data. When a tag is provided CAM searches for it in parallel, returning the address where the match is found in a single clock cycle.

For associative and set associative mapping a block replacement policy is needed for cache misses. There are 3 approaches

  • Random
  • LRU
  • FIFO

The most effective is LRU. The replacement mechanism must be implemented in hardware (unlike page replacement). A counter associated with each line is incremented at regular intervals and reset when the line is referenced. On a miss when the cache is full, the line with the highest counter is replaced.

Writing to the cache can cause inconsistencies with main memory.

Write through performs write operations on the cache and main memory at the same time. This is simple but generates more memory traffic which can cause a bottleneck.

Write back minimises memory writes. Updates are only made in the cache (which sets the dirty bit) and a block is written back to main memory when it is replaced, if its dirty bit is set. This leaves areas of main memory stale, so accesses (for example by I/O modules) must go through the cache. It is more complex to implement and can still cause a bottleneck.

If there is a write miss at cache level then the block can be fetched from main memory into the cache (write allocate) or the block to be written is modified in the main memory and not loaded into the cache (no write allocate).

Having multiple caches can improve performance but can complicate design.

If multiple CPUs with their own cache are sharing main memory then writing can cause inconsistencies. Cache consistency can be ensured using

  • Bus watching with write through - Each cache controller monitors the address lines to detect write operations to memory. If another master writes to a location that the cache also holds, the controller invalidates its own entry.
  • Hardware transparency - Additional hardware is used to ensure all updates to main memory via cache are reflected in all caches.
  • Noncacheable memory - Only a portion of main memory is shared by more than one processor and is designated as uncacheable. All accesses to shared memory are cache misses.

A split cache has separate caches for instructions and data. The instruction cache is read-only and does not need to handle writes. A split cache is typically used for the L1 cache and improves performance.

Measuring cache performance using hit rate does not consider the cost of a cache miss. Instead, average memory access time can be defined as the hit time + (miss rate $\times$ miss penalty). Cache optimisation can therefore be classified into three categories

  • Reducing miss rate
  • Reducing miss penalty
  • Reducing hit time
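
Using the definition above with assumed numbers (not from the notes): a hit time of 1 cycle, a miss rate of 5% and a miss penalty of 100 cycles give an average memory access time of $1 + 0.05 \times 100 = 6$ cycles.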

Cache misses can be categorised as follows

  • Compulsory - Misses that would occur regardless of cache size, such as the first access to a block
  • Capacity - Misses that occur because the cache is not large enough
  • Conflict - Misses that occur because of cache replacement conflicts
  • Coherency - Misses occurring due to cache flushes in multiprocessor systems

Cache optimisation approaches include

  • Larger block sizes - exploits spatial locality by increasing the block size. Likely to reduce the number of compulsory misses, lower miss rate. Increases cache miss penalty, can increase capacity and conflict misses.
  • Larger cache - Reduce capacity misses, longer hit times, increased power consumption
  • Higher levels of associativity - Increased cache associativity reduces conflict misses but causes longer hit times
  • Multilevel caches - Reduces miss penalty
  • Prioritising read misses over writes - Reduces miss penalty
  • Avoiding address translation during cache indexing - Reduces hit time by using the page offset to index cache

Some more advanced cache optimisations are

  • Controlling L1 cache size and complexity
  • Way prediction - predict which way of a set will be accessed next, retaining the hit speed of a direct-mapped cache while keeping the lower conflict-miss rate of a set-associative one
  • Pipelined cache access
  • Multi-bank caches - organise cache as independent banks to support simultaneous access, increases bandwidth
  • Non-blocking caches - allows the processor to issue more than one cache request at a time, reducing the miss penalty
  • Critical word first - Send the missing word in a block to the processor immediately and then load the rest of the block, reduces miss penalty
  • Early restart - Load words in normal order and send the requested word to the processor as soon as it arrives, while loading the rest of the block, reduces miss penalty
  • Merging write buffer - Merge writes to the same block in the write buffer, reducing miss penalty
  • Hardware prefetching - load data/instructions into the cache before being requested by the processor, reducing the miss rate and miss penalty
  • Compiler controlled prefetching - compiler inserts prefetching instructions based on the source code, reduces miss penalty and miss rate
  • Compiler-based optimisation - improves instruction and data miss rates

Code optimisation

Memory bound code loads or stores lots of data. Compute bound code performs many integer or floating point operations.

The most common metric for code performance is floating point operations per second (FLOPS). Peak FLOPS is the maximum achievable rate for a given machine. Program efficiency is given by actual FLOPS divided by peak FLOPS.

Optimisation techniques consist of 3 categories

  • Algorithmic - less complex algorithm
  • Code refactoring - address inefficiencies in resource utilisation
  • Parallelisation - Executing multiple tasks in parallel, for example with vectorisation or multithreading.

Loop optimisations

A lot of runtime is often taken up by loops so loop optimisations can improve runtime.

Loop dependencies occur when one or more iterations are dependent on previous iterations.

Aliased pointers are multiple pointers that point to the same memory location. The restrict keyword tells the compiler a pointer is not aliased.
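
A small illustrative example of the restrict qualifier (C99); the function name and body are made up for the sketch:

```c
/* Without restrict the compiler must assume c could alias a or b,
   forcing it to reload values and blocking vectorisation. The
   restrict qualifiers promise the arrays do not overlap. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```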

Loop peeling is an optimisation which moves one or more iterations from a loop to outside the body of the loop. This can be used to remove loop dependency.
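
An illustrative sketch of loop peeling (the array and operation are made up):

```c
/* Original loop: for (i = 0; i < n; i++) a[i] += a[0];
   Iteration 0 updates a[0], which every later iteration reads, so the
   loop carries a dependency. Peeling iteration 0 removes it. */
void add_first(float *a, int n) {
    if (n > 0) {
        a[0] += a[0];                 /* peeled first iteration       */
        float first = a[0];
        for (int i = 1; i < n; i++)   /* remaining iterations are now
                                         independent of one another   */
            a[i] += first;
    }
}
```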

Loop interchange is an optimisation which modifies the order of memory accesses by switching the order of loops in the code. Consider a 2D array where each row fits in a cache line. Iterating columns before rows will copy each row into cache but if there are a lot of rows then the first cache lines will be evicted before being reused. Interchanging the order of loops can avoid this problem.
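
A sketch of loop interchange on a row-major 2D array (sizes are illustrative):

```c
#define N 2048

/* Before: for (j) for (i) a[i][j] = 0.0;  walks down columns, touching
   a different cache line on every access. Interchanging the loops
   walks along rows, so each cache line is fully used before eviction. */
void zero_rows(double a[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.0;
}
```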

Loop blocking rearranges loops to improve cache reuse of a block of memory.
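
A sketch of loop blocking (tiling) applied to a matrix transpose; the block size B is an assumption chosen so that tiles fit in cache:

```c
#define N 2048
#define B 64   /* tile size: assumed small enough for two B x B tiles in cache */

/* Working on B x B tiles keeps the tile of src being read and the tile
   of dst being written cache-resident until they are finished. */
void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```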

Loop fusion merges multiple loops over the same range into a single loop.
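
A sketch of loop fusion over the same range (the operations are illustrative):

```c
/* Before fusion:
     for (i = 0; i < n; i++) a[i] = b[i] + 1.0f;
     for (i = 0; i < n; i++) c[i] = a[i] * 2.0f;
   After fusion a[i] is consumed immediately after it is produced,
   while it is still in a register or in cache. */
void fused(float *a, float *c, const float *b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 1.0f;
        c[i] = a[i] * 2.0f;
    }
}
```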

Loop fission improves locality by splitting loops (opposite of loop fusion). If there is one loop with many unrelated operations, they can be split to improve temporal locality.

Loop unrolling replicates the body of a loop several times within a single iteration, reducing the number of condition checks and branches.
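
A sketch of unrolling by a factor of 4, assuming n is a multiple of 4 to keep the example short:

```c
/* One loop-condition check and branch now covers four element updates. */
void scale_unrolled(float *a, int n) {
    for (int i = 0; i < n; i += 4) {
        a[i]     *= 2.0f;
        a[i + 1] *= 2.0f;
        a[i + 2] *= 2.0f;
        a[i + 3] *= 2.0f;
    }
}
```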

Loop pipelining reorders operations across iterations of a loop to enable instruction pipelining.

Flynn's taxonomy

Flynn's taxonomy classifies programs into four different categories

  • SISD - Single Instruction Single Data
  • SIMD - Single Instruction Multiple Data
  • MISD - Multiple Instruction Single Data
  • MIMD - Multiple Instruction Multiple Data

Vector instructions utilise specific hardware dedicated to executing SIMD instructions. Vectorisation can be applied automatically by the compiler, by using vector intrinsics or by using inline assembly.

Vectorisation usually involves loading data into a vector register, executing vector operations and then loading the data back. SSE and AVX can be used to implement vector intrinsics.
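
A minimal SSE intrinsics sketch of that load/operate/store pattern; it assumes n is a multiple of 4 and uses unaligned loads and stores for simplicity:

```c
#include <immintrin.h>

void sse_add(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats        */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 additions at once  */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results      */
    }
}
```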

Threading

Threads allow code to run in parallel asynchronously. Threads share a global memory and each thread has its own private memory. Threading is an example of MIMD. Race conditions can occur when a result depends on the order in which threads execute. OpenMP can be used to implement threading.
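
A minimal OpenMP sketch (built with OpenMP enabled, e.g. -fopenmp); the reduction clause avoids a race condition on the shared sum:

```c
#include <omp.h>

double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    /* Iterations are independent, so they are split across threads;
       each thread keeps a private partial sum combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```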

Processor architecture

The CPU

The CPU continuously performs the instruction cycle. Instructions are retrieved from memory, decoded and executed.

The instruction cycle takes place over several CPU cycles.

At the beginning of each instruction cycle, the processor fetches an instruction from memory. The program counter (PC) holds the address of the instruction to be fetched next. The processor increments the PC after each instruction fetch. The fetched instruction is loaded into the instruction register. The processor interprets the instruction and executes it.

The CPU consists of several components

  • The ALU performs mathematical and logic operations
  • The control unit decodes and executes instructions
  • Registers store values
  • The processor bus transfers data between processor components

There are two roles for processor registers

  • User-visible registers are accessible to the programmer and include data, address, index, stack pointer and condition code registers
  • Control and status registers are used by the control unit to maintain the operation of the processor and include the program counter, instruction register and memory address register

The program status word is a register or set of registers containing status information.

The collection of instructions a processor can execute is called its instruction set. Machine instructions consist of an opcode which specifies the operation and operands, which are like arguments.

Opcodes are represented by mnemonics, such as ADD or SUB. Each opcode has a fixed binary representation.

The execution of an instruction may involve one or more operands, each of which may require a memory access. Indirect addressing may require additional memory accesses. This can be accounted for with an indirect cycle.

Interrupts interrupt the execution of the processor. There are four types: program, timer, IO and hardware failure. Interrupts specify an interrupt handler to run.

Control unit

The control unit generates control signals which open and close logic gates, transferring data to and from registers and the ALU. It executes a series of micro-operations and generates the control signals to execute them.

The control unit takes the clock, instruction register, flags and control signals from the control bus as input. It outputs control signals within the processor and to the control bus.

There are 2 approaches to CU design.

  • Hardwired or random logic CUs use a combinatorial logic circuit, transforming the input signals to a set of output signals
  • Microprogrammed CUs translate each instruction into a sequence of primitive microinstructions which are then executed

Hardwired CUs are faster but more complex to design and test and difficult to change. They are used for implementing RISC architectures.

Microprogrammed CUs are slower but easier to design and implement. They are used for implementing CISC architectures.

Pipelining

Different stages of the fetch-decode-execute cycle can be carried out in parallel. This is called instruction pipelining.

Each step of the pipeline, called a pipe stage or segment, completes a part of an instruction. Different stages complete different parts of different instructions in parallel. The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. The time required to move an instruction one step down the pipeline is called a processor cycle. All stages must be ready to proceed at the same time, so the length of a processor cycle is determined by the slowest pipe stage. This means one processor cycle is almost always one clock cycle.

The pipeline designer's goal is to balance the length of each pipeline stage. Assuming ideal conditions, the time per instruction on the pipelined processor is the time per instruction on the unpipelined processor divided by the number of pipe stages.

An example five stage pipeline consists of the following

  • Instruction fetch (IF)
  • Instruction decode (ID)
  • Execute (EX)
  • Memory access (MEM)
  • Write-back (WB)

This is just an example and different architectures will have different stages.

Pipelining increases the CPU instruction throughput meaning more instructions are completed per unit time. It does not reduce the execution time of a single instruction. It is important to keep the pipeline correct, moving and full. Failure to do so can lead to performance issues

  • Pipeline latency - If the duration of instructions does not change then the depth of the pipeline (number of stages) is limited
  • Imbalance among pipe stages - The clock can't run faster than the time needed for the slowest stage
  • Pipelining overhead - Caused by pipeline register delay and clock skew (maximum delay between clock signal reaching two registers)

The cycle time, $\tau$, is the time needed to advance a set of instructions one stage in the pipeline. It is given by $$ \tau = \underset{i}{\max}[\tau_i] + d = \tau_m + d $$ where

  • $\tau_i$ is the time delay on the circuitry in the $i$ th stage
  • $\tau_m$ is the maximum stage delay
  • $k$ is the number of stages
  • $d$ is the time delay of a latch

The time required for a pipeline with $k$ stages to execute $n$ instructions is $T_{k, n} = [k + (n-1)]\tau$.
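
For example, with illustrative values $k = 5$, $n = 100$ and $\tau = 2$ ns, $T_{5,100} = [5 + (100-1)] \times 2 = 208$ ns, compared with $nk\tau = 1000$ ns without pipelining, a speed-up of roughly 4.8.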

Pipeline hazards occur when the pipeline must stall because conditions do not permit continued execution. There are three types

  • Resource
  • Control
  • Data

Resource hazards occur when two or more instructions that are already in the pipeline need the same resource. This can be resolved by detecting the problem and stalling one of the contending stages.

Branch instructions cause the order of execution to change, which causes control hazards as the processor may choose the wrong branch as the next instruction. This can be resolved by prefetching the branch target, using a loop buffer or using branch prediction.

Data hazards occur when there is a conflict in the access of an operand location. This can happen when the data that an instruction uses depends on data created by another previous instruction.

There are 3 types of data hazard that can occur

  • Read after write - an instruction reads a register before a previous instruction has written its result to it, so a stale value is read
  • Write after read - a previous instruction reads a register after the current (later) instruction has written to it
  • Write after write - a previous instruction writes to a register after the current (later) instruction has written to it, leaving a stale value

Speed-up is given by the normal execution time for a task divided by the optimised execution time for the task. Amdahl's law bounds the overall speed-up by the fraction of the task that can actually be improved.
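
In its usual form: if a fraction $f$ of the execution time benefits from an enhancement that speeds that part up by a factor $S$, the overall speed-up is $$\text{speed-up} = \frac{1}{(1-f) + \frac{f}{S}}$$ so the unenhanced fraction limits the gain no matter how large $S$ becomes.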

A reservation table is a technique for pipeline design. In a reservation table the rows correspond to resources and the columns to time units. An X denotes that a given resource is busy at a given time.

The number of clock cycles between two initiations of a pipeline is the latency between them. Any attempt by two or more initiations to use the same pipeline resource at the same time will cause a collision.

Potential collisions are identified using a collision vector, which denotes the difference in time slots between Xs on the same row. The initial collision vector represents the state of the pipeline after the first initiation.

A scheduling mechanism is needed to determine when new initiations can be accepted without a collision occurring.

One method is a scheduling strategy which represents pipeline activity with a state diagram. This uses collision vectors to check for collisions.

  1. Shift the collision vector left one position, inserting a 0 in the rightmost position. Each 1-bit shift corresponds to a latency increase of 1.

  2. Continue shifting $p$ times until a 0 bit emerges from the left end - this means $p$ is a permissible latency

  3. The resulting collision vector represents the collisions that can be caused by the instruction currently in the pipeline. The original collision vector represents all collisions caused by the instruction to be inserted into the pipeline. To represent all collision possibilities, bitwise OR the initial collision vector with the $p$-shifted vector

  4. Repeat from the initial state for all permissible shifts

  5. Repeat for all newly created states from all permissible shifts

Pipeline enhancements

  • Data forwarding - forward the result value to the dependent instruction as soon as the value is available
  • Separate the L1 cache into an instruction and data cache - removes conflict between instruction fetching and execute stages
  • Dedicated execution units
  • Reservation station - A buffer to hold operations and operands for an execute unit until the operands are available, relieves bottleneck at operand fetch stage

Superscalar processors

A single linear instruction pipeline achieves at best a CPI (clock cycles per instruction) of 1. Fetching and decoding more than one instruction at a time can reduce the CPI to less than 1, which is known as superscalar execution.

The speed-up of a superscalar processor is limited by the instruction-level parallelism available in the program. Instruction level parallelism (ILP) refers to the degree to which instructions can be executed in parallel. ILP can be maximised by compiler-based optimisation and hardware techniques but is limited by

  • True data dependency
  • Procedural dependency
  • Resource conflicts
  • Output dependency
  • Antidependency

Superpipelining divides the pipeline into a greater number of smaller stages in order to clock it at a higher frequency. In contrast, superscalar allows parallel fetch and decode operations.

Instruction issue refers to the process of initiating instruction execution in the processor's functional units. An instruction has been issued when it has been moved from the decode stage to the first execute stage of a pipeline. Instruction issue policy refers to the rules applied to issue instructions. There are three general categories

  • In-order issue with in-order completion
  • In-order issue with out-of-order completion
  • Out-of-order issue with out-of-order completion

Out-of-order issue relies on register renaming: architectural registers are mapped onto a larger pool of physical (rename) registers, which removes output dependencies and antidependencies caused by register reuse while preserving true data dependencies.

Parallel computer organisation

SIMD exploits data-level parallelism. Concurrency arises from performing the same operation on different pieces of data.

An array processor operates on multiple data elements at the same time using different spaces (separate processing elements). A vector processor operates on multiple data elements in consecutive time steps using the same space (a pipelined functional unit).

The advantages of a vector processor include

  • No dependencies within a vector, pipelining and parallelisation work well
  • Each instruction generates a lot of work, reduces instruction fetch bandwidth requirements
  • Highly regular memory access pattern, easy prefetching
  • No need to explicitly code loops, fewer branches in instruction sequence

The disadvantage of a vector processor is that it only works well if the parallelism is regular and is very inefficient if it is irregular. Memory bandwidth can become a bottleneck if the balance between compute and memory operations is not maintained or data is not mapped appropriately to memory banks.

Each vector data register holds $n$ $m$-bit values. A vector processor also has three vector control registers

  • VLEN - maximum number of elements to be stored in a vector register
  • VSTR - vector stride, the distance in memory between consecutive elements of a vector
  • VMASK - indicates which elements of the vector to operate on

Memory banking can be used to efficiently load vectors from memory.

Vector/SIMD machines are good at exploiting regular data-level parallelism. Performance improvement is limited by the vectorisability of the code. Many existing instruction set architectures include SIMD operations.

GPUs are hardware designed for graphics. GPU applications are developed in Compute Unified Device Architecture (CUDA), a C-like language or OpenCL, a vendor-independent alternative. GPUs are capable of most forms of parallelism. GPUs are programmed using threads but executed using SIMD instructions.

A warp is a set of threads that execute the same instruction.

CUDA unifies multiple types of parallelism using the CUDA thread. The CUDA thread is the lowest level of parallelism. Each thread is associated with a data element. The compiler and hardware can group thousands of CUDA threads to yield other forms of parallelism. Threads are organised into blocks. A grid is the code that runs on a GPU and consists of a set of thread blocks. Blocks are executed independently and in any order. Different blocks can't communicate directly.

Symmetric multiprocessors

A symmetric multiprocessor (SMP) is a standalone computer with the following characteristics

  • Two or more similar processors of comparable capacity
  • Processors share the same memory and I/O facilities
  • Processors are connected by a bus or other internal connection
  • Memory access time is approximately the same for each processor
  • All processors share access to I/O devices
  • All processors can perform the same functions
  • System controlled by integrated OS

An SMP has a number of potential advantages over a uniprocessor organisation, including

  • Performance
  • Availability
  • Incremental growth
  • Scaling

SMPs can use a bus with control, address and data lines to communicate with IO devices. To facilitate DMA transfers from I/O subsystems to processors, the bus provides addressing, arbitration and time-sharing. A bus organisation is attractive because it is simple, flexible and reliable, but its performance is limited. Each processor should have its own cache memory, which leads to problems with cache coherence.

Cache coherence

Cache coherence refers to ensuring that all cached copies of the same data remain consistent.

Software can be used to ensure cache coherence. This avoids the need for additional hardware, which is attractive because it transfers the overhead of detecting potential problems from runtime to compile time, though it can lead to inefficient cache utilisation.

Hardware based solutions, referred to as cache coherence protocols, provide dynamic recognition at runtime of potential inconsistency conditions. Because the problem is only dealt with when it arises there is better cache utilisation. Approaches are transparent to the programmer and compiler, reducing software development burden. Hardware based solutions can be divided into two categories

  • Directory protocols
    • Collect and maintain information about copies of data in cache
    • Directory stored in main memory
    • Requests are checked against directory
    • Appropriate transfers are performed
    • Creates central bottleneck
    • Effective in large scale systems with complex interconnection
  • Snoopy protocols
    • Distribute the responsibility for maintaining cache coherence among all of the cache controllers in a multiprocessor
    • Suited to bus-based architecture
    • Two basic approaches have been explored - write update and write invalidate

Write update snoopy protocols allow for multiple readers and writers. When a processor wishes to update a shared line, the word to be updated is distributed to all other processors, and caches containing that line can update it.

Write invalidate protocols allow for multiple readers but only one writer. When a write is required, all other cached copies of the line are invalidated. The writing processor then has exclusive access until the line is required by another processor. Lines are marked as modified, exclusive, shared or invalid, so the protocol is known as MESI.

Multithreading and multicore systems

A process is an instance of a program running on a computer, which embodies resource ownership and scheduling. A process switch is an operation that switches the processor from one process to another.

A thread is a dispatchable unit of work within a process. A thread switch is an operation that switches processor control from one thread to another within the same process.

Hardware multithreading involves multiple thread contexts in a single processor. This has better latency tolerance, hardware utilisation and reduced context switch penalty but requires multiple thread contexts to be implemented in hardware and has reduced single-thread performance.

There are three types of multithreading

  • Fine-grained - switch to another thread every cycle such that no two instructions from the same thread are in the pipeline concurrently
  • Coarse-grained - when a thread is stalled due to some event, switch to a different hardware context
  • Simultaneous - fine-grained multithreading implemented on top of a multiple issue, dynamically scheduled processor

Multicore refers to the combination of two or more processors on a single piece of silicon. Each core typically has all of the components of an independent processor.

Pollack's rule states that performance increase is roughly proportional to the square root of increase in complexity. As a consequence, doubling the logic in a processor core increases performance by 40%. Multicore has the potential for near linear improvement.
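
In other words, speed-up $\approx \sqrt{\text{complexity ratio}}$: doubling the logic in one core gives $\sqrt{2} \approx 1.4$, i.e. roughly 40%, whereas spending the same logic on a second core can, for sufficiently parallel workloads, approach a 2x improvement.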

The main variables in a multicore organisation are

  • number of core processors on a chip
  • number of levels of cache memory
  • amount of cache memory shared

There are 4 levels of multicore organisation

  • Dedicated L1 cache
  • Dedicated L2 cache
  • Shared L2 cache
  • Shared L3 cache

Effective applications for multicore processors include

  • Multi-threaded native applications
  • Multi-process applications
  • Java applications - the JVM uses threading
  • Multi-instance applications

Heterogeneous multicore organisation refers to a processor chip that includes more than one kind of core. Heterogeneous system architecture (HSA) includes the following features

  • The entire virtual memory space is visible to both CPU and GPU
  • The virtual memory system brings in pages to physical main memory as needed
  • A coherent memory policy ensures CPU and GPU caches are up to date
  • Unified programming interface enables users to exploit parallel capabilities of GPUs within programs that rely on CPU execution as well

With uniform memory access (UMA), all processors have access to all parts of main memory using loads and stores. The access time to any region of memory is the same for all processors. This is the standard SMP implementation.

SMP approaches do not scale well, so non-uniform memory access (NUMA) is used instead. NUMA presents a single address space visible to all CPUs. Remote memory is accessed with ordinary LOAD and STORE instructions, but access to remote memory is slower than access to local memory.

Cache coherent NUMA (CC-NUMA) is a NUMA system in which cache coherence is maintained among all the caches of the processors.

Clusters are an alternative to SMP providing high performance and availability. They are a group of interconnected whole computers working together as a unified computing resource creating the illusion of one machine. Each computer in a cluster is called a node. Clusters are easy to scale and have high availability.

I/O mechanisms

Programmed I/O is a mapping between I/O-related instructions that the processor fetches from memory and I/O commands the processor issues to I/O modules. The instruction format depends on the addressing policy for external devices. When the processor, main memory and I/O share a bus, two addressing modes are possible

  • Memory-mapped
  • Isolated

Memory-mapped I/O uses the same address bus to address both memory and I/O devices. Memory on I/O devices is mapped to address values. When the CPU accesses a memory address that address may be in physical memory or an I/O device.

Memory-mapped I/O is simpler than many alternatives and allows use of general purpose memory instructions. Portions of memory address space must be reserved which is not ideal on processors with smaller address spaces.

Isolated I/O has a bus equipped with input and output command lines as well as read and write lines. The command lines specify whether an address refers to a location in memory or an I/O device. This means the full range of addresses is available for memory and I/O, but additional hardware is needed for the command lines.

Most I/O devices are much slower than the CPU so good synchronisation is necessary. Polled I/O is simpler to implement, but busy-wait polling wastes power and CPU time, and interleaved polling can lead to delayed responses.

Interrupt-driven I/O uses interrupt requests (IRQs) and non-maskable interrupts (NMIs). Code can ignore IRQs but NMIs can't be ignored. An interrupt can force the CPU to jump to a service routine. There are two design issues when implementing interrupt-driven I/O: the processor needs to be able to determine which device issued the interrupt, and if multiple interrupts occur it needs to decide which to process first.

Device identification can be achieved through several methods

  • Multiple interrupt lines - Most straightforward approach, one interrupt line for each I/O device
  • Software poll - When the processor detects an interrupt it branches to an interrupt service routine (ISR) which polls each I/O module to determine which caused the interrupt, time consuming
  • Daisy chain - Uses an interrupt acknowledge line and a unique identifier for each I/O module
  • Bus arbitration - An I/O module must gain control of the bus before it can raise an interrupt

Interrupt-driven I/O has fast response and doesn't waste CPU time, but all data transfer is still handled by the CPU and it requires more complex hardware and software.

Direct memory access (DMA) uses a DMA controller (DMAC) which takes control of system buses and performs data transfer. DMA-based I/O can be more than 10 times faster than CPU-driven I/O.

The DMAC must only use a bus when the processor does not need it or it must force the CPU to suspend execution, known as cycle stealing. There are several methods for DMA organisation

  • Single bus detached - All modules share the same system bus, straightforward but inefficient
  • Single bus integrated - Connect I/O devices to the DMA instead of the system bus
  • Dedicated I/O bus - Separate bus for I/O modules, DMA acts as bridge between system and I/O bus

DMA has 3 operation modes

  • Cycle stealing
  • Burst mode
  • Transparent mode

RAID

Redundant Array of Inexpensive Disks (RAID) is a method of storing data across several disks, usually with redundant information so that the array can survive a disk failure.

Data striping distributes data over multiple disks to make them appear as a single large disk. It improves aggregate I/O performance by allowing multiple I/O requests to be serviced in parallel.

There are 7 RAID levels

  • Level 0 uses non-redundant striping. Data is striped across all disks but there is no redundancy. It has the best write performance and lowest cost but any disk failure will result in data loss.

  • Level 1 (mirrored) stores two copies of all information on separate disks. Whenever data is written to disk it is also written to a redundant disk. Better read performance as data can be retrieved from either disk. If one disk fails, the data is still in the other. Data could also be striped across disks. This is referred to as RAID Level 1+0 or 0+1

  • Level 2 (redundancy through Hamming codes) uses very small stripes and uses Hamming codes to detect and correct errors. It is more storage efficient than mirroring. As disks don't fail that often RAID 2 is usually excessive

  • Level 3 (Bit-interleaved parity) employs parallel access with data distributed in small strips. Instead of error-correcting codes, a simple parity bit is used. In the event of failure data can be reconstructed from the parity bits. Only one redundant disk is required, it is simple to implement and can achieve high data rates. However, only one I/O request can be executed at a time

  • Level 4 (Block-interleaved parity) uses independent access. This is good for high I/O request rate but not high data rate. Data striping is used with larger stripes and parity bits are used. Level 4 is less write-efficient because the parity disk needs to be updated on every write

  • Level 5 (Block interleaved distributed parity) eliminates the parity disk bottleneck by distributing parity over all disks. It has one of the best small read, large read and large write performances but small write requests are relatively inefficient.

  • Level 6 (Dual redundancy) uses two different parity calculations and stores them on different disks. This provides extremely high availability but writes are more expensive. Read performance is similar to level 5.

SSDs use NAND flash memory. As the cost has dropped they have become increasingly competitive with HDDs. SSDs have

  • Higher performance I/O operations per second
  • Better durability
  • Longer lifespan
  • Lower power consumption
  • Quieter and cooler running
  • Lower access times and latency rates

Storage area networks (SANs) are dedicated networks that give servers block-level access to shared storage devices.