CS257 - Advanced Computer Architecture

Main memory organisation

There are four main structural components of the computer

  • CPU - controls operation of computer and data processing
  • Main memory - stores data
  • IO - moves data between computer and external environment
  • System interconnection - communication between CPU, main memory and IO

The designer's dilemma is that faster memory is more expensive per capacity, so the performance, capacity and cost of memory systems have to be balanced. This is solved with the memory hierarchy.

The memory hierarchy consists of small amounts of expensive fast memory for the most commonly needed data and large, inexpensive memory for less frequently used data. This takes advantage of temporal and spatial locality.

Semiconductor memory

Main memory, cache memory and solid-state storage all use semiconductor memory. There are several types of semiconductor memory, including RAM, ROM and flash memory.

RAM is able to rapidly read and write data and is volatile. There are two categories of RAM - dynamic (DRAM) and static (SRAM).

Dynamic memory cells are simpler and more compact. They consist of a transistor/capacitor pair, which allows greater memory cell density. DRAM is cheaper to produce than SRAM but has a refresh overhead, which makes it inefficient for small memories.

Static memory cells provide better read and write times than DRAM and use flip-flop logic gates. Cache memory is usually implemented as SRAM whereas main memory is usually DRAM.

Interleaved memory is composed of a collection of DRAM chips grouped together to form a memory bank. Each bank is able to independently serve read and write requests. If consecutive words in memory are stored in different banks they can be transferred faster. Distributing addresses among memory banks is called interleaving. An effective use of interleaved main memory is when the number of banks is equal to or a multiple of the number of words in a cache line.
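
As a minimal sketch (assuming four banks, word addressing and low-order interleaving, none of which are fixed by the notes), the bank index comes from the low-order address bits, so consecutive word addresses land in different banks:

```c
#include <stdint.h>

/* Low-order interleaving sketch: with 4 banks (an assumption for this
   example), consecutive word addresses 0,1,2,3,4,... map to banks
   0,1,2,3,0,..., so a sequential block spreads across all banks. */
#define NUM_BANKS 4u

static inline uint32_t bank_of(uint32_t word_addr)    { return word_addr % NUM_BANKS; }
static inline uint32_t row_within(uint32_t word_addr) { return word_addr / NUM_BANKS; }
```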

Suppose a processor issues a read/write request: the DRAM performs the operation, but the processor must wait for it to complete. This creates a performance bottleneck, which can be addressed with synchronous DRAM (SDRAM). SDRAM exchanges data with the processor after a fixed number of clock cycles, so the CPU doesn't have to wait and can do other tasks during the known busy period. SDRAM works best when transferring large blocks of data serially.

SDRAM sends data to the processor once per clock cycle. Double Data Rate (DDR) SDRAM sends data twice per clock cycle. This, combined with prefetching, achieves higher data rates.

Cache DRAM (CDRAM) uses both SRAM and DRAM. It consists of a small SRAM chip and regular DRAM chip and is ideal for cache memory.

ROM is non-volatile read-only memory. PROM is writeable but only once. EPROM is rewritable using UV radiation to erase data. EEPROM allows individual bytes to be erased and rewritten but is more expensive than EPROM.

Flash memory is semiconductor memory used for both internal and external applications. It has high read/write speeds but can only erase whole blocks, not individual bytes. It is denser than EEPROM and, unlike EPROM, can be erased and rewritten electrically in place.

Memory hierarchy design

Memory hierarchy design needs to consider

  • locality - temporal and spatial
  • inclusion - the data held at each level is a subset of the data held at the level below, copied up the hierarchy
  • coherence - copies of the same data must be consistent

The performance of a memory hierarchy depends on

  • address reference statistics
  • access time of each level
  • storage capacity of each level
  • size of blocks to be transferred between levels
  • allocation algorithm

Performance is often measured as a hit ratio, which is the probability an address is in a memory level.

Virtual memory

Virtual memory is a hierarchical memory system managed by the operating system. To the programmer it appears as a single large memory. Virtual memory is used

  • to make programs independent of memory system configuration and capacity
  • to free the programmer from the need to allocate storage and permit sharing of memory between processes
  • to achieve fast access rates and low cost per bit
  • to simplify the loading of programs for execution

When using virtual memory, the address space of the CPU (virtual or logical address space) is much larger than the physical memory space. When required information is not in main memory it is swapped in from secondary memory.

The pages of a process in memory do not all need to be loaded for it to execute. Demand paging only loads a page when it is needed. When a program makes a reference to a page not in memory, it triggers a page fault which tells the OS to load the desired page. Because a process can only execute when it is in main memory, that memory is referred to as real memory and the larger memory available to the user is called virtual memory.

The mapping between logical and physical addresses is stored in the page table.
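
A minimal sketch of that translation, assuming a single-level page table, a 4 KiB page size and an illustrative pte_t layout (none of which are specified in the notes):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE 4096u

typedef struct {
    bool     present;  /* is the page currently in main memory?   */
    uint32_t frame;    /* physical frame number, valid if present */
} pte_t;

/* Translate a virtual address to a physical address via the page table. */
uint32_t translate(const pte_t *page_table, uint32_t vaddr) {
    uint32_t page   = vaddr / PAGE_SIZE;   /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE;   /* offset within page  */

    if (!page_table[page].present) {
        /* Page fault: the OS would load the page from secondary
           memory and update this entry (demand paging). */
    }
    return page_table[page].frame * PAGE_SIZE + offset;
}
```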

Whereas paging is invisible to the user and pages are fixed sizes, segmentation allows the user or OS to allocate blocks of memory of variable length with a given name and access rights.

A Translation Lookaside Buffer (TLB) holds the most recently referenced table entries, like a cache for addresses. When an address is not found in the TLB the page tables have to be searched, which has high overhead.

The page size affects space utilisation. If the page is too large there is excessive internal fragmentation and if it is too small then page tables become too large.

When a page fault occurs, the memory management software fetches the required page from secondary storage. If main memory is full, it is necessary to swap a page out to make space for the new one. This is managed using page replacement algorithms

  • Random - Swaps random pages out, poor performance
  • FIFO - Oldest page swapped out, does not consider frequency of access
  • Clock replacement algorithm - Similar to FIFO, uses a 'use' bit to track pages which have been used, removes unused pages
  • Least recently used algorithm - Page least recently used is swapped out. This can be expensive to track so an approximation with 'use' bits is used instead
  • Working set replacement algorithm - Keep track of a working set of pages for a process. A working set is the set of pages accessed in a set time interval

Thrashing occurs when there are too many processes in memory and the OS spends most of its time swapping pages. This can be avoided using good page replacement algorithms, reducing the number of processes or installing more memory.

Cache memory

  • Block - The minimum unit of transfer between cache and main memory
  • Line - A portion of cache memory capable of holding one block
  • Tag - A portion of a cache line that is used for addressing purposes

When virtual addresses are used, the system designer may choose to place the cache between the processor and the MMU or between the MMU and main memory. A logical cache, also known as a virtual cache, stores data using virtual addresses. The processor accesses the cache directly without going through the MMU. A physical cache stores data using main memory physical addresses.

If the CPU requests the contents of a memory location that is in the cache, that is called a cache hit. If not, it is a cache miss. The cache hit ratio is given by the number of hits divided by the total number of memory accesses.

The average access time, $t_a$, is given by $t_a = t_c + (1-h)t_m$ where $t_c$ is the cache access time, $t_m$ is the main memory access time and $h$ is the hit rate. The objective is to make $\frac{t_a}{t_c}$ as close to 1 as possible. For a cache system, $\frac{t_m}{t_c}$ is relatively low, so an $h$ between 0.7 and 0.9 will give good performance.
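
For example, with illustrative values (not from the notes) $t_c = 10$ ns, $t_m = 100$ ns and $h = 0.9$, we get $t_a = 10 + (1 - 0.9) \times 100 = 20$ ns, so $\frac{t_a}{t_c} = 2$; raising $h$ to 0.99 would bring $t_a$ down to 11 ns.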

As there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. Three techniques are used

  • Direct - Map each block of main memory into one possible cache line by taking the block number modulo the number of cache lines (see the sketch after this list)
  • Associative - Each main memory block can be loaded into any line of the cache. The cache control logic interprets a memory address as a tag and word field and can then search the cache for the matching tag
  • Set associative - A combination of both. The cache consists of a number of sets and each set consists of a number of lines. A given block maps to a set.
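
A minimal sketch of direct-mapped address decomposition, assuming 64-byte blocks and 1024 cache lines (illustrative sizes):

```c
#include <stdint.h>

#define OFFSET_BITS 6u                    /* 64-byte block */
#define INDEX_BITS  10u                   /* 1024 lines    */
#define NUM_LINES   (1u << INDEX_BITS)

typedef struct {
    uint32_t offset;  /* byte within the block                         */
    uint32_t line;    /* cache line = block number mod number of lines */
    uint32_t tag;     /* remaining high-order bits, stored in the line */
} cache_addr_t;

static cache_addr_t split_address(uint32_t addr) {
    cache_addr_t a;
    a.offset = addr & ((1u << OFFSET_BITS) - 1u);
    a.line   = (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
    a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}
```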

A victim cache can be used to reduce conflict misses of direct mapped caches by storing blocks which map to the same cache line.

Content addressable memory is constructed of SRAM cells but is more expensive and holds less data. When a tag is provided CAM searches for it in parallel, returning the address where the match is found in a single clock cycle.

For associative and set associative mapping a block replacement policy is needed for cache misses. There are 3 approaches

  • Random
  • LRU
  • FIFO

The most effective is LRU. The replacement mechanism must be implemented in hardware (unlike page replacement). A counter associated with each line is incremented at regular intervals and reset when the line is referenced. On a miss when the cache is full, the line with the highest counter is replaced.

Writing to the cache can cause inconsistencies with main memory.

Write through performs write operations on the cache and main memory at the same time. This is simple but generates more memory traffic which can cause a bottleneck.

Write back minimises memory writes. Updates are only made in the cache (which sets the dirty bit) and a block is written back to main memory when it is replaced, if its dirty bit is set. This leaves areas of main memory stale, so accesses (for example by I/O modules) must go through the cache. It is more complex to implement and can still cause a bottleneck.

If there is a write miss at cache level then the block can be fetched from main memory into the cache (write allocate) or the block to be written is modified in the main memory and not loaded into the cache (no write allocate).

Having multiple caches can improve performance but can complicate design.

If multiple CPUs with their own cache are sharing main memory then writing can cause inconsistencies. Cache consistency can be ensured using

  • Bus watching with write through - Each cache controller monitors the address lines to detect write operations to memory. If another master writes to a location that the cache also holds, the controller invalidates its own entry.
  • Hardware transparency - Additional hardware is used to ensure all updates to main memory via cache are reflected in all caches.
  • Noncacheable memory - Only a portion of main memory is shared by more than one processor and is designated as uncacheable. All accesses to shared memory are cache misses.

A split cache has separate caches for instructions and data. The instruction cache is read-only and does not need to handle writes. A split cache is typically used for the L1 cache and improves performance.

Measuring cache performance using hit rate does not consider the cost of a cache miss. Instead, average memory access time can be defined as the hit time + (miss rate $\times$ miss penalty). Cache optimisation can therefore be classified into three categories

  • Reducing miss rate
  • Reducing miss penalty
  • Reducing hit time
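
Using the definition above with assumed numbers (not from the notes): a hit time of 1 cycle, a miss rate of 5% and a miss penalty of 100 cycles give an average memory access time of $1 + 0.05 \times 100 = 6$ cycles.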

Cache misses can be categorised as follows

  • Compulsory - Misses that would occur regardless of cache size, such as the first access to a block
  • Capacity - Misses that occur because the cache is not large enough
  • Conflict - Misses that occur because of cache replacement conflicts
  • Coherency - Misses occurring due to cache flushes in multiprocessor systems

Cache optimisation approaches include

  • Larger block sizes - exploits spatial locality by increasing the block size. Likely to reduce the number of compulsory misses, lower miss rate. Increases cache miss penalty, can increase capacity and conflict misses.
  • Larger cache - Reduce capacity misses, longer hit times, increased power consumption
  • Higher levels of associativity - Increased cache associativity reduces conflict misses but causes longer hit times
  • Multilevel caches - Reduces miss penalty
  • Prioritising read misses over writes - Reduces miss penalty
  • Avoiding address translation during cache indexing - Reduces hit time by using the page offset to index cache

Some more advanced cache optimisations are

  • Controlling L1 cache size and complexity
  • Way prediction - predict which way of a set will be accessed next, retaining the hit speed of a direct-mapped cache while keeping the lower conflict-miss rate of a set-associative one
  • Pipelined cache access
  • Multi-bank caches - organise cache as independent banks to support simultaneous access, increases bandwidth
  • Non-blocking caches - allows the processor to issue more than one cache request at a time, reducing the miss penalty
  • Critical word first - Send the missing word in a block to the processor immediately and then load the rest of the block, reduces miss penalty
  • Early restart - Load words in normal order and send the requested word to the processor as soon as it arrives, while loading the rest of the block, reduces miss penalty
  • Merging write buffer - Merge writes to the same block in the write buffer, reducing miss penalty
  • Hardware prefetching - load data/instructions into the cache before being requested by the processor, reducing the miss rate and miss penalty
  • Compiler controlled prefetching - compiler inserts prefetching instructions based on the source code, reduces miss penalty and miss rate
  • Compiler-based optimisation - improves instruction and data miss rates

Code optimisation

Memory bound code loads or stores lots of data. Compute bound code performs many integer or floating point operations.

The most common metric for code performance is floating point operations per second (FLOPS). Peak FLOPS is the maximum achievable rate for a given machine. Program efficiency is given by actual FLOPS divided by peak FLOPS.

Optimisation techniques consist of 3 categories

  • Algorithmic - less complex algorithm
  • Code refactoring - address inefficiencies in resource utilisation
  • Parallelisation - Executing multiple tasks in parallel, for example with vectorisation or multithreading.

Loop optimisations

A lot of runtime is often taken up by loops so loop optimisations can improve runtime.

Loop dependencies occur when one or more iterations are dependent on previous iterations.

Aliased pointers are multiple pointers that point to the same memory location. The restrict keyword tells the compiler a pointer is not aliased.
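
A small illustrative example of the restrict qualifier (C99); the function name and body are made up for the sketch:

```c
/* Without restrict the compiler must assume c could alias a or b,
   forcing it to reload values and blocking vectorisation. The
   restrict qualifiers promise the arrays do not overlap. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```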

Loop peeling is an optimisation which moves one or more iterations from a loop to outside the body of the loop. This can be used to remove loop dependency.
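
An illustrative sketch of loop peeling (the array and operation are made up):

```c
/* Original loop: for (i = 0; i < n; i++) a[i] += a[0];
   Iteration 0 updates a[0], which every later iteration reads, so the
   loop carries a dependency. Peeling iteration 0 removes it. */
void add_first(float *a, int n) {
    if (n > 0) {
        a[0] += a[0];                 /* peeled first iteration       */
        float first = a[0];
        for (int i = 1; i < n; i++)   /* remaining iterations are now
                                         independent of one another   */
            a[i] += first;
    }
}
```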

Loop interchange is an optimisation which modifies the order of memory accesses by switching the order of loops in the code. Consider a 2D array where each row fits in a cache line. Iterating columns before rows will copy each row into cache but if there are a lot of rows then the first cache lines will be evicted before being reused. Interchanging the order of loops can avoid this problem.
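
A sketch of loop interchange on a row-major 2D array (sizes are illustrative):

```c
#define N 2048

/* Before: for (j) for (i) a[i][j] = 0.0;  walks down columns, touching
   a different cache line on every access. Interchanging the loops
   walks along rows, so each cache line is fully used before eviction. */
void zero_rows(double a[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.0;
}
```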

Loop blocking rearranges loops to improve cache reuse of a block of memory.
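
A sketch of loop blocking (tiling) applied to a matrix transpose; the block size B is an assumption chosen so that tiles fit in cache:

```c
#define N 2048
#define B 64   /* tile size: assumed small enough for two B x B tiles in cache */

/* Working on B x B tiles keeps the tile of src being read and the tile
   of dst being written cache-resident until they are finished. */
void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```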

Loop fusion merges multiple loops over the same range into a single loop.
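
A sketch of loop fusion over the same range (the operations are illustrative):

```c
/* Before fusion:
     for (i = 0; i < n; i++) a[i] = b[i] + 1.0f;
     for (i = 0; i < n; i++) c[i] = a[i] * 2.0f;
   After fusion a[i] is consumed immediately after it is produced,
   while it is still in a register or in cache. */
void fused(float *a, float *c, const float *b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 1.0f;
        c[i] = a[i] * 2.0f;
    }
}
```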

Loop fission improves locality by splitting loops (opposite of loop fusion). If there is one loop with many unrelated operations, they can be split to improve temporal locality.

Loop unrolling replicates the body of a loop several times within a single iteration, reducing the number of condition checks and branches.
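
A sketch of unrolling by a factor of 4, assuming n is a multiple of 4 to keep the example short:

```c
/* One loop-condition check and branch now covers four element updates. */
void scale_unrolled(float *a, int n) {
    for (int i = 0; i < n; i += 4) {
        a[i]     *= 2.0f;
        a[i + 1] *= 2.0f;
        a[i + 2] *= 2.0f;
        a[i + 3] *= 2.0f;
    }
}
```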

Loop pipelining reorders operations across iterations of a loop to enable instruction pipelining.

Flynn's taxonomy

Flynn's taxonomy classifies programs into four different categories

  • SISD - Single Instruction Single Data
  • SIMD - Single Instruction Multiple Data
  • MISD - Multiple Instruction Single Data
  • MIMD - Multiple Instruction Multiple Data

Vector instructions utilise specific hardware dedicated to executing SIMD instructions. Vectorisation can be applied automatically by the compiler, by using vector intrinsics or by using inline assembly.

Vectorisation usually involves loading data into a vector register, executing vector operations and then loading the data back. SSE and AVX can be used to implement vector intrinsics.
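
A minimal SSE intrinsics sketch of that load/operate/store pattern; it assumes n is a multiple of 4 and uses unaligned loads and stores for simplicity:

```c
#include <immintrin.h>

void sse_add(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats        */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 additions at once  */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results      */
    }
}
```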

Threading

Threads allow code to run in parallel asynchronously. Threads share a global memory and each thread has its own private memory. Threading is an example of MIMD. Race conditions can occur when a result depends on the order in which threads execute. OpenMP can be used to implement threading.
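
A minimal OpenMP sketch (built with OpenMP enabled, e.g. -fopenmp); the reduction clause avoids a race condition on the shared sum:

```c
#include <omp.h>

double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    /* Iterations are independent, so they are split across threads;
       each thread keeps a private partial sum combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```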

Processor architecture

The CPU

The CPU continuously performs the instruction cycle. Instructions are retrieved from memory, decoded and executed.

The instruction cycle takes place over several CPU cycles.

At the beginning of each instruction cycle, the processor fetches an instruction from memory. The program counter (PC) holds the address of the instruction to be fetched next. The processor increments the PC after each instruction fetch. The fetched instruction is loaded into the instruction register. The processor interprets the instruction and executes it.

The CPU consists of several components

  • The ALU performs mathematical and logic operations
  • The control unit decodes and executes instructions
  • Registers store values
  • The processor bus transfers data between processor components

There are two roles for processor registers

  • User-visible registers are accessible to the programmer and include data, address, index, stack pointer and condition code registers
  • Control and status registers are used by the control unit to maintain the operation of the processor and include the program counter, instruction register and memory address register

The program status word is a register or set of registers containing status information.

The collection of instructions a processor can execute is called its instruction set. Machine instructions consist of an opcode which specifies the operation and operands, which are like arguments.

Opcodes are represented by mnemonics, such as ADD or SUB. Each opcode has a fixed binary representation.

The execution of an instruction may involve one or more operands, each of which may require a memory access. Indirect addressing may require additional memory accesses. This can be accounted for with an indirect cycle.

Interrupts interrupt the execution of the processor. There are four types: program, timer, IO and hardware failure. Interrupts specify an interrupt handler to run.

Control unit

The control unit generates control signals which open and close logic gates, transferring data to and from registers and the ALU. It executes a series of micro-operations and generates the control signals to execute them.

The control unit takes the clock, instruction register, flags and control signals from the control bus as input. It outputs control signals within the processor and to the control bus.

There are 2 approaches to CU design.

  • Hardwired or random logic CUs use a combinatorial logic circuit, transforming the input signals to a set of output signals
  • Microprogrammed CUs translate each instruction into a sequence of primitive microinstructions which are then executed

Hardwired CUs are faster but more complex to design and test and difficult to change. They are used for implementing RISC architectures.

Microprogrammed CUs are slower but easier to design and implement. They are used for implementing CISC architectures.

Pipelining

Different stages of the fetch-decode-execute cycle can be carried out in parallel. This is called instruction pipelining.

Each step of the pipeline, called a pipe stage or segment, completes a part of an instruction. Different stages complete different parts of different instructions in parallel. The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. The time required to move an instruction one step down the pipeline is called a processor cycle. All stages must be ready to proceed at the same time, so the length of a processor cycle is determined by the slowest pipe stage. This means one processor cycle is almost always one clock cycle.

The pipeline designer's goal is to balance the length of each pipeline stage. Assuming ideal conditions, the time per instruction on the pipelined processor is the time per instruction on the unpipelined processor divided by the number of pipe stages.

An example five stage pipeline consists of the following

  • Instruction fetch (IF)
  • Instruction decode (ID)
  • Execute (EX)
  • Memory access (MEM)
  • Write-back (WB)

This is just an example and different architectures will have different stages.

Pipelining increases the CPU instruction throughput meaning more instructions are completed per unit time. It does not reduce the execution time of a single instruction. It is important to keep the pipeline correct, moving and full. Failure to do so can lead to performance issues

  • Pipeline latency - If the duration of instructions does not change then the depth of the pipeline (number of stages) is limited
  • Imbalance among pipe stages - The clock can't run faster than the time needed for the slowest stage
  • Pipelining overhead - Caused by pipeline register delay and clock skew (maximum delay between clock signal reaching two registers)

The cycle time, $\tau$, is the time needed to advance a set of instructions one stage in the pipeline. It is given by $$ \tau = \underset{i}{\max}[\tau_i] + d = \tau_m + d $$ where

  • $\tau_i$ is the time delay on the circuitry in the $i$ th stage
  • $\tau_m$ is the maximum stage delay
  • $k$ is the number of stages
  • $d$ is the time delay of a latch

The time required for a pipeline with $k$ stages to execute $n$ instructions is $T_{k, n} = [k + (n-1)]\tau$.
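
For example, with illustrative values $k = 5$, $n = 100$ and $\tau = 2$ ns, $T_{5,100} = [5 + (100-1)] \times 2 = 208$ ns, compared with $nk\tau = 1000$ ns without pipelining, a speed-up of roughly 4.8.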

Pipeline hazards occur when the pipeline must stall because conditions do not permit continued execution. There are three types

  • Resource
  • Control
  • Data

Resource hazards occur when two or more instructions that are already in the pipeline need the same resource. This can be resolved by detecting the problem and stalling one of the contending stages.

Branch instructions cause the order of execution to change, which causes control hazards as the processor may choose the wrong branch as the next instruction. This can be resolved by prefetching the branch target, using a loop buffer or using branch prediction.

Data hazards occur when there is a conflict in the access of an operand location. This can happen when the data that an instruction uses depends on data created by another previous instruction.

There are 3 types of data hazard that can occur

  • Read after write - an instruction reads a register before a previous instruction has written its result to it, so a stale value is read
  • Write after read - a previous instruction reads a register after the current (later) instruction has written to it
  • Write after write - a previous instruction writes to a register after the current (later) instruction has written to it, leaving a stale value

Speed-up is given by the normal execution time for a task divided by the optimised execution time for the task. Amdahl's law bounds the overall speed-up by the fraction of the task that can actually be improved.
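
In its usual form: if a fraction $f$ of the execution time benefits from an enhancement that speeds that part up by a factor $S$, the overall speed-up is $$\text{speed-up} = \frac{1}{(1-f) + \frac{f}{S}}$$ so the unenhanced fraction limits the gain no matter how large $S$ becomes.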

A reservation table is a technique for pipeline design. In a reservation table the rows correspond to resources and the columns to time units. An X denotes that a given resource is busy at a given time.

The number of clock cycles between two initiations of a pipeline is the latency between them. Any attempt by two or more initiations to use the same pipeline resource at the same time will cause a collision.

Potential collisions are identified using a collision vector, which denotes the difference in time slots between Xs on the same row. The initial collision vector represents the state of the pipeline after the first initiation.

A scheduling mechanism is needed to determine when new initiations can be accepted without a collision occurring.

One method is a scheduling strategy which represents pipeline activity with a state diagram. This uses collision vectors to check for collisions.

  1. Shift the collision vector left one position, inserting a 0 in the rightmost position. Each 1-bit shift corresponds to a latency increase of 1.

  2. Continue shifting $p$ times until a 0 bit emerges from the left end - this means $p$ is a permissible latency

  3. The resulting collision vector represents the collisions that can be caused by the instruction currently in the pipeline. The original collision vector represents all collisions caused by the instruction to be inserted into the pipeline. To represent all collision possibilities, bitwise OR the initial collision vector with the $p$-shifted vector

  4. Repeat from the initial state for all permissible shifts

  5. Repeat for all newly created states from all permissible shifts

Pipeline enhancements

  • Data forwarding - forward the result value to the dependent instruction as soon as the value is available
  • Separate the L1 cache into an instruction and data cache - removes conflict between instruction fetching and execute stages
  • Dedicated execution units
  • Reservation station - A buffer to hold operations and operands for an execute unit until the operands are available, relieves bottleneck at operand fetch stage

Superscalar processors

A single linear instruction pipeline achieves at best a CPI (clock cycles per instruction) of 1. Fetching and decoding more than one instruction at a time can reduce the CPI to less than 1, which is known as superscalar execution.

The speed-up of a superscalar processor is limited by the instruction-level parallelism available in the program. Instruction level parallelism (ILP) refers to the degree to which instructions can be executed in parallel. ILP can be maximised by compiler-based optimisation and hardware techniques but is limited by

  • True data dependency
  • Procedural dependency
  • Resource conflicts
  • Output dependency
  • Antidependency

Superpipelining divides the pipeline into a greater number of smaller stages in order to clock it at a higher frequency. In contrast, superscalar allows parallel fetch and decode operations.

Instruction issue refers to the process of initiating instruction execution in the processor's functional units. An instruction has been issued when it has been moved from the decode stage to the first execute stage of a pipeline. Instruction issue policy refers to the rules applied to issue instructions. There are three general categories

  • In-order issue with in-order completion
  • In-order issue with out-of-order completion
  • Out-of-order issue with out-of-order completion

Out-of-order issue relies on register renaming: architectural registers are mapped onto a larger pool of physical (rename) registers, which removes output dependencies and antidependencies caused by register reuse while preserving true data dependencies.

Parallel computer organisation

SIMD exploits data-level parallelism. Concurrency arises from performing the same operation on different pieces of data.

An array processor operates on multiple data elements at the same time using different spaces (separate processing elements). A vector processor operates on multiple data elements in consecutive time steps using the same space (a pipelined functional unit).

The advantages of a vector processor include

  • No dependencies within a vector, pipelining and parallelisation work well
  • Each instruction generates a lot of work, reduces instruction fetch bandwidth requirements
  • Highly regular memory access pattern, easy prefetching
  • No need to explicitly code loops, fewer branches in instruction sequence

The disadvantage of a vector processor is that it only works well if the parallelism is regular and is very inefficient if it is irregular. Memory bandwidth can become a bottleneck if the balance between compute and memory operations is not maintained or data is not mapped appropriately to memory banks.

Each vector data register holds $n$ $m$-bit values. A vector processor also has three vector control registers

  • VLEN - maximum number of elements to be stored in a vector register
  • VSTR - vector stride, the distance in memory between consecutive elements of a vector
  • VMASK - indicates which elements of the vector to operate on

Memory banking can be used to efficiently load vectors from memory.

Vector/SIMD machines are good at exploiting regular data-level parallelism. Performance improvement is limited by the vectorisability of the code. Many existing instruction set architectures include SIMD operations.

GPUs are hardware designed for graphics. GPU applications are developed in Compute Unified Device Architecture (CUDA), a C-like language or OpenCL, a vendor-independent alternative. GPUs are capable of most forms of parallelism. GPUs are programmed using threads but executed using SIMD instructions.

A warp is a set of threads that execute the same instruction.

CUDA unifies multiple types of parallelism using the CUDA thread. The CUDA thread is the lowest level of parallelism. Each thread is associated with a data element. The compiler and hardware can group thousands of CUDA threads to yield other forms of parallelism. Threads are organised into blocks. A grid is the code that runs on a GPU and consists of a set of thread blocks. Blocks are executed independently and in any order. Different blocks can't communicate directly.

Symmetric multiprocessors

A symmetric multiprocessor (SMP) is a standalone computer with the following characteristics

  • Two or more similar processors of comparable capacity
  • Processors share the same memory and I/O facilities
  • Processors are connected by a bus or other internal connection
  • Memory access time is approximately the same for each processor
  • All processors share access to I/O devices
  • All processors can perform the same functions
  • System controlled by integrated OS

An SMP has a number of potential advantages over a uniprocessor organisation, including

  • Performance
  • Availability
  • Incremental growth
  • Scaling

SMPs can use a bus with control, address and data lines to communicate with IO devices. To facilitate DMA transfers from I/O subsystems to processors, the bus provides addressing, arbitration and time-sharing. A bus organisation is attractive because it is simple, flexible and reliable, but its performance is limited. Each processor should have its own cache memory, which leads to problems with cache coherence.

Cache coherence

Cache coherence refers to ensuring that all cached copies of the same data remain consistent.

Software can be used to ensure cache coherence. This avoids the need for additional hardware, which is attractive because it transfers the overhead of detecting potential problems from runtime to compile time, though it can lead to inefficient cache utilisation.

Hardware based solutions, referred to as cache coherence protocols, provide dynamic recognition at runtime of potential inconsistency conditions. Because the problem is only dealt with when it arises there is better cache utilisation. Approaches are transparent to the programmer and compiler, reducing software development burden. Hardware based solutions can be divided into two categories

  • Directory protocols
    • Collect and maintain information about copies of data in cache
    • Directory stored in main memory
    • Requests are checked against directory
    • Appropriate transfers are performed
    • Creates central bottleneck
    • Effective in large scale systems with complex interconnection
  • Snoopy protocols
    • Distribute the responsibility for maintaining cache coherence among all of the cache controllers in a multiprocessor
    • Suited to bus-based architecture
    • Two basic approaches have been explored - write update and write invalidate

Write update snoopy protocols allow for multiple readers and writers. When a processor wishes to update a shared line, the word to be updated is distributed to all other processors, and caches containing that line can update it.

Write invalidate protocols allow for multiple readers but only one writer. When a write is required, all other cached copies of the line are invalidated. The writing processor then has exclusive access until the line is required by another processor. Lines are marked as modified, exclusive, shared or invalid, so the protocol is known as MESI.

Multithreading and multicore systems

A process is an instance of a program running on a computer, which embodies resource ownership and scheduling. A process switch is an operation that switches the processor from one process to another.

A thread is a dispatchable unit of work within a process. A thread switch is an operation that switches processor control from one thread to another within the same process.

Hardware multithreading involves multiple thread contexts in a single processor. This has better latency tolerance, hardware utilisation and reduced context switch penalty but requires multiple thread contexts to be implemented in hardware and has reduced single-thread performance.

There are three types of multithreading

  • Fine-grained - switch to another thread every cycle such that no two instructions from the same thread are in the pipeline concurrently
  • Coarse-grained - when a thread is stalled due to some event, switch to a different hardware context
  • Simultaneous - fine-grained multithreading implemented on top of a multiple issue, dynamically scheduled processor

Multicore refers to the combination of two or more processors on a single piece of silicon. Each core typically has all of the components of an independent processor.

Pollack's rule states that performance increase is roughly proportional to the square root of increase in complexity. As a consequence, doubling the logic in a processor core increases performance by 40%. Multicore has the potential for near linear improvement.
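
In other words, speed-up $\approx \sqrt{\text{complexity ratio}}$: doubling the logic in one core gives $\sqrt{2} \approx 1.4$, i.e. roughly 40%, whereas spending the same logic on a second core can, for sufficiently parallel workloads, approach a 2x improvement.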

The main variables in a multicore organisation are

  • number of core processors on a chip
  • number of levels of cache memory
  • amount of cache memory shared

There are 4 levels of multicore organisation

  • Dedicated L1 cache
  • Dedicated L2 cache
  • Shared L2 cache
  • Shared L3 cache

Effective applications for multicore processors include

  • Multi-threaded native applications
  • Multi-process applications
  • Java applications - the JVM uses threading
  • Multi-instance applications

Heterogeneous multicore organisation refers to a processor chip that includes more than one kind of core. Heterogeneous system architecture (HSA) includes the following features

  • The entire virtual memory space is visible to both CPU and GPU
  • The virtual memory system brings in pages to physical main memory as needed
  • A coherent memory policy ensures CPU and GPU caches are up to date
  • Unified programming interface enables users to exploit parallel capabilities of GPUs within programs that rely on CPU execution as well

With uniform memory access (UMA), all processors have access to all parts of main memory using loads and stores. The access time to any region of memory is the same for all processors. This is the standard SMP implementation.

SMP approaches do not scale well, so non-uniform memory access (NUMA) is used instead. NUMA presents a single address space visible to all CPUs. Remote memory is accessed with ordinary LOAD and STORE instructions, but access to remote memory is slower than access to local memory.

Cache coherent NUMA (CC-NUMA) is a NUMA system in which cache coherence is maintained among all the caches of the processors.

Clusters are an alternative to SMP providing high performance and availability. They are a group of interconnected whole computers working together as a unified computing resource creating the illusion of one machine. Each computer in a cluster is called a node. Clusters are easy to scale and have high availability.

I/O mechanisms

Programmed I/O is a mapping between I/O-related instructions that the processor fetches from memory and I/O commands the processor issues to I/O modules. The instruction format depends on the addressing policy for external devices. When the processor, main memory and I/O share a bus, two addressing modes are possible

  • Memory-mapped
  • Isolated

Memory-mapped I/O uses the same address bus to address both memory and I/O devices. Memory on I/O devices is mapped to address values. When the CPU accesses a memory address that address may be in physical memory or an I/O device.

Memory-mapped I/O is simpler than many alternatives and allows use of general purpose memory instructions. Portions of memory address space must be reserved which is not ideal on processors with smaller address spaces.

Isolated I/O has a bus equipped with input and output command lines as well as read and write lines. The command lines specify whether an address refers to a location in memory or an I/O device. This means the full range of addresses is available for memory and I/O, but additional hardware is needed for the command lines.

Most I/O devices are much slower than the CPU so good synchronisation is necessary. Polled I/O is simpler to implement, but busy-wait polling wastes power and CPU time, and interleaved polling can lead to delayed responses.

Interrupt-driven I/O uses interrupt requests (IRQs) and non-maskable interrupts (NMIs). Code can ignore IRQs but NMIs can't be ignored. An interrupt can force the CPU to jump to a service routine. There are two design issues when implementing interrupt-driven I/O: the processor needs to be able to determine which device issued the interrupt, and if multiple interrupts occur it needs to decide which to process first.

Device identification can be achieved through several methods

  • Multiple interrupt lines - Most straightforward approach, one interrupt line for each I/O device
  • Software poll - When the processor detects an interrupt it branches to an interrupt service routine (ISR) which polls each I/O module to determine which caused the interrupt, time consuming
  • Daisy chain - Uses an interrupt acknowledge line and a unique identifier for each I/O module
  • Bus arbitration - An I/O module must gain control of the bus before it can raise an interrupt

Interrupt-driven I/O has fast response and doesn't waste CPU time, but all data transfer is still handled by the CPU and it requires more complex hardware and software.

Direct memory access (DMA) uses a DMA controller (DMAC) which takes control of system buses and performs data transfer. DMA-based I/O can be more than 10 times faster than CPU-driven I/O.

The DMAC must only use a bus when the processor does not need it or it must force the CPU to suspend execution, known as cycle stealing. There are several methods for DMA organisation

  • Single bus detached - All modules share the same system bus, straightforward but inefficient
  • Single bus integrated - Connect I/O devices to the DMA instead of the system bus
  • Dedicated I/O bus - Separate bus for I/O modules, DMA acts as bridge between system and I/O bus

DMA has 3 operation modes

  • Cycle stealing
  • Burst mode
  • Transparent mode

RAID

Redundant Array of Inexpensive Disks (RAID) is a method of storing data across several disks, usually with redundant information so that the array can survive a disk failure.

Data striping distributes data over multiple disks to make them appear as a single large disk. It improves aggregate I/O performance by allowing multiple I/O requests to be serviced in parallel.

There are 7 RAID levels

  • Level 0 uses non-redundant striping. Data is striped across all disks but there is no redundancy. It has the best write performance and lowest cost but any disk failure will result in data loss.

  • Level 1 (mirrored) stores two copies of all information on separate disks. Whenever data is written to disk it is also written to a redundant disk. Better read performance as data can be retrieved from either disk. If one disk fails, the data is still in the other. Data could also be striped across disks. This is referred to as RAID Level 1+0 or 0+1

  • Level 2 (redundancy through Hamming codes) uses very small stripes and uses Hamming codes to detect and correct errors. It is more storage efficient than mirroring. As disks don't fail that often RAID 2 is usually excessive

  • Level 3 (Bit-interleaved parity) employs parallel access with data distributed in small strips. Instead of error-correcting codes, a simple parity bit is used. In the event of failure data can be reconstructed from the parity bits. Only one redundant disk is required, it is simple to implement and can achieve high data rates. However, only one I/O request can be executed at a time

  • Level 4 (Block-interleaved parity) uses independent access. This is good for high I/O request rate but not high data rate. Data striping is used with larger stripes and parity bits are used. Level 4 is less write-efficient because the parity disk needs to be updated on every write

  • Level 5 (Block interleaved distributed parity) eliminates the parity disk bottleneck by distributing parity over all disks. It has one of the best small read, large read and large write performances but small write requests are relatively inefficient.

  • Level 6 (Dual redundancy) uses two different parity calculations and stores them on different disks. This provides extremely high availability but writes are more expensive. Read performance is similar to level 5.

SSDs use NAND flash memory. As the cost has dropped they have become increasingly competitive with HDDs. SSDs have

  • Higher performance I/O operations per second
  • Better durability
  • Longer lifespan
  • Lower power consumption
  • Quieter and cooler running
  • Lower access times and latency rates

Storage area networks (SANs) are dedicated networks that give servers block-level access to shared storage devices.