# General introduction: GPUs and the realm of parallel architectures

GPU Computing Training, August 17-19<sup>th</sup> 2015

#### Jan Lemeire (jan.lemeire@vub.ac.be)

- Graduated as Engineer in 1994 at VUB
- Worked for 4 years for 2 IT-consultancy companies
- 2000-2007: PhD at the VUB while teaching as assistant
  - <u>Subject</u>: probabilistic models for the performance analysis of parallel programs
- **Since 2008**: postdoc and part-time professor at VUB, department of electronics and informatics (ETRO)
  - Teaching 'Informatics' for first-year bachelors; 'parallel systems' and 'advanced computer architecture' to masters
- **Since 2012**: also teaching engineering students in industrial sciences ('industrial engineers')
- Projects, papers, PhD students in parallel processing (performance analysis, GPU computing) & data mining/machine learning (probabilistic models, causality, learning algorithms)
- http://parallel.vub.ac.be

### A bit of History

#### The first computer, mechanical

## Babbage Difference Engine made with LEGO

http://acarol.woz.org/

This machine can evaluate polynomials of the form $Ax^2 + Bx + C$ for $x = 0, 1, 2, \ldots, n$ with 3-digit results.
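
A difference engine tabulates such a polynomial with additions only, using the method of finite differences: for $Ax^2 + Bx + C$ the second difference is the constant $2A$. The sketch below illustrates that idea; the function name `difference_engine` is hypothetical, and the 3-digit limit of the LEGO machine is not modelled.

```python
def difference_engine(a, b, c, n):
    """Tabulate a*x^2 + b*x + c for x = 0..n using additions only."""
    value = c        # p(0)
    first = a + b    # first difference: p(1) - p(0)
    second = 2 * a   # second difference, constant for a quadratic
    table = []
    for _ in range(n + 1):
        table.append(value)
        value += first    # next polynomial value
        first += second   # next first difference
    return table


# Tabulate 2x^2 + 3x + 5 for x = 0..4.
print(difference_engine(2, 3, 5, 4))
```

No multiplication happens inside the loop, which is what makes the scheme suitable for a purely mechanical adder.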
#### The first informatician

Ada Lovelace, 1815 – 1852

- Describes what software is
- She brings the insight that a computer goes beyond plain calculations
- She writes the first algorithm/program

#### **ENIAC**

First computer: WWII

John Mauchly and John Eckert, 1945

#### Von Neumann rethinks the computer

John Von Neumann

#### The Von Neumann-architecture

#### Execute program step-by-step

#### For acceleration: pipeline

[Figure: long operations versus a combination of short operations, plotted over time]

#### Pipeline Design

- Typically five tasks in instruction execution
  - IF: instruction fetch
  - ID: instruction decode
  - OF: operand fetch
  - EX: instruction execution
  - OS: operand store, often called write-back (WB)

#### Superscalar out-of-order pipeline

### 'Sequential' processor: superscalar out-of-order pipeline

[Figure: pipeline with in-order and out-of-order stages and different processing units (ALU, MEM1, MEM2, BR, FP2, FP3). Labels: pipeline depth, **pipeline width**, **out-of-order execution**, **branch prediction**, register renaming]

# Now we are computing sequentially!

# Parallel computing?

#### Super computer: BlueGene/L

- IBM 2007
- 65,536 dual-core nodes
  - E.g. one processor dedicated to communication, the other to computation
  - Each 512 MB RAM
- No. 8 in the Top 500 Supercomputer list (2010)
  - www.top500.org

#### Clusters

- Made from commodity parts
  - or blade servers
- Open-source software available

#### Distributed-Memory Architectures

- Each process has its own local memory
- Communication through messages
- Process is in control

#### BlueGene/L communication networks

- (a) 3D torus (64x32x32) for standard interprocessor data transfer
  - Cut-through routing (see later)
- (b) collective network for fast evaluation of *reductions*.
- (c) Barrier network by a common wire

#### Message-passing

- The ability to send and receive messages is all we need
  - `void send(message, destination)`
  - `message receive(source)`
  - `boolean probe(source)`

# Multicore computing

#### Shared Address-space Architectures

Example: multiprocessors

#### AMD Barcelona: 4 processor cores

#### Thread / core

- A different thread per core
- Each thread can run independently
- Multi-threaded programming
- Thread synchronization necessary
- Multiple threads per core also possible
- Context switches necessary
- Hardware threads/hyperthreading

#### Latency Hiding

### Next level

#### Parallel processors

Courtesy of Allera

#### **GPU** Architecture

#### 1 Streaming Multiprocessor

The Same Instruction is executed on Multiple Data (SIMD)

Width of pipeline: 8 - 32

#### Why are GPUs faster?

Devote transistors to... computation

#### GPU processor pipeline

- $\pm 24$ stages (old), now $\pm 8$
- in-order execution!!
- no branch prediction!!
- no forwarding!!
- no register renaming!!
- Memory system:
  - relatively small
  - Until recently no caching
- On the other hand: many more registers (see later)

#### Multi-Threading (MT) possibilities

#### Processing power not for free

# **Obstacle 1** Hard(er) to implement

## **Obstacle 2** Hard(er) to get efficiency

#### **CPU** computing

[Figure: path from **Algorithm** to **Implementation** to compiler, with the steps labelled manual or automatic]

- Write once
- Run everywhere efficiently!
- Automatic optimization
- Low latency of each instruction!
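
As an aside on the Message-passing slide: its three primitives (`send`, `receive`, `probe`) can be sketched in a few lines. This is only an illustration under stated assumptions: threads stand in for processes with private memory, and the names `Endpoint` and `parallel_sum` are hypothetical, not part of any real message-passing library.

```python
import queue
import threading


class Endpoint:
    """Message-passing view of one process, identified by its rank."""

    def __init__(self, rank, boxes):
        self.rank = rank
        self._boxes = boxes  # one FIFO queue per (source, destination) pair

    def send(self, message, destination):
        self._boxes[(self.rank, destination)].put(message)

    def receive(self, source):
        # Blocks until a message from `source` is available.
        return self._boxes[(source, self.rank)].get()

    def probe(self, source):
        # True if a message from `source` is waiting; never blocks.
        return not self._boxes[(source, self.rank)].empty()


def parallel_sum(data, n_workers=2):
    """Rank 0 scatters chunks, the workers sum them and send results back."""
    n = n_workers + 1
    boxes = {(s, d): queue.Queue() for s in range(n) for d in range(n)}
    ends = [Endpoint(rank, boxes) for rank in range(n)]

    def worker(ep):
        chunk = ep.receive(source=0)
        ep.send(sum(chunk), destination=0)

    threads = [threading.Thread(target=worker, args=(ends[r],))
               for r in range(1, n)]
    for t in threads:
        t.start()

    master = ends[0]
    for r in range(1, n):
        # Strided split: worker r gets elements r-1, r-1+n_workers, ...
        master.send(data[r - 1::n_workers], destination=r)
    total = sum(master.receive(source=r) for r in range(1, n))

    for t in threads:
        t.join()
    return total


print(parallel_sum(list(range(101))))
```

All data exchange goes through `send` and `receive`; the workers never touch the master's list directly, which mirrors the distributed-memory model.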
# Computer science is not about computers

Programmability solutions

### **Auto-parallelization**

- Key requirements
  - A compiler must not alter the program semantics
  - If the compiler cannot determine all dependencies, it has to forego parallelization
- Compilers sometimes need to act very conservatively
  - Pointers make it hard for the compiler to deduce memory layout
  - Code may produce overlapping arrays through pointer arithmetic
  - If the compiler can't tell, it does not parallelize
- The past 30 years have shown that auto-parallelization
  - is a tough problem in general
  - is only applicable to very regular loops
  - cannot take care of manual parallelization tasks.

# Computer science is not about computers

#### **Challenges of GPU computing**

- **Implementation**
- **Optimization**
- performance portability
- programmability

# Intel Xeon Phi

### Single Instruction Multiple Data (SIMD)

Instructions can be performed at once on all elements of vector registers

- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
  - Multiple data elements in 128-bit wide registers
  - Data has to be moved explicitly to/from vector registers
- All processors execute the same instruction at the same time
  - Instruction has to be fetched only once

### Vector processors (SIMD)

- Highly pipelined function units
- Stream data from/to vector registers to units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Has long been viewed as the solution for high-performance computing
  - Why always repeat the same instructions (on different data)? => just apply the instruction immediately to all data
- However: difficult to program
- Is SIMT (OpenCL) a better alternative??

### Intel's Xeon Phi coprocessor

#### Intel's Xeon Phi's core

# Vectorization needed for peak performance!!

# Conclusions

The third pillar of the scientific world

### Parallel Programming Paradigms

#### **Distributed memory**

#### **Shared memory**

#### Hardware? Software?
- You need to have insight into the hardware!
- No universal hardware/programming model (yet)
- Intel-approach (SIMD)
  - Intel sticks to x86 architecture
  - That's what programmers know & they won't change
  - Vectorization necessary
- OpenCL-approach (SIMT)
  - Will the semi-abstract model remain valid?
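
The difference between the two approaches shows up in how the same elementwise addition is written. The sketch below is a plain-Python illustration, not real vector or OpenCL code: `simd_add`, `add_kernel` and `launch` are hypothetical names, the vector width of 4 is an assumption, and the launcher runs the work-items one after another where hardware would run them in parallel.

```python
def simd_add(a, b, width=4):
    """SIMD view: the programmer steps through the data one vector at a time."""
    assert len(a) == len(b)
    out = []
    for start in range(0, len(a), width):
        va = a[start:start + width]  # "load" up to `width` lanes
        vb = b[start:start + width]
        # Stands in for a single vector add over all lanes.
        out.extend(x + y for x, y in zip(va, vb))
    return out


def add_kernel(gid, a, b, out):
    """SIMT view: scalar code for one work-item, identified by its id."""
    out[gid] = a[gid] + b[gid]


def launch(kernel, global_size, *args):
    """Hypothetical launcher: one kernel invocation per work-item."""
    for gid in range(global_size):
        kernel(gid, *args)


a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
out = [0] * len(a)
launch(add_kernel, len(a), a, b, out)
print(simd_add(a, b), out)
```

In the SIMD version the vector width is visible in the program; in the SIMT version the kernel is width-agnostic and the mapping onto lanes is left to the runtime and hardware.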