Practicum and Project of Multiprocessors course

Preferably use Visual C++ (the Community edition is free).
Consider the following basic algorithms:

convolution/filter on a 1-dimensional signal or a 2-dimensional matrix, reduction of an array (e.g. sum of all elements), dot product of two vectors (arrays), matrix multiplication

random number generator: for each element of the input array of integers, use it as the start value (seed) and calculate the 100th random number with a pseudo-random number generator of the form x = (A*x + B) % C (a linear congruential generator)

or propose a different algorithm which you like
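As a starting point for the random number generator variant, a sequential reference version could look like the minimal sketch below. The constants A, B and C and the names lcg_nth and lcg_100th are illustrative assumptions; the assignment does not fix them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for illustration only; the assignment leaves A, B and C open.
constexpr std::uint64_t A = 1103515245ULL;
constexpr std::uint64_t B = 12345ULL;
constexpr std::uint64_t C = 2147483648ULL;  // 2^31

// Apply x = (A*x + B) % C `steps` times, starting from `seed`.
std::uint64_t lcg_nth(std::uint64_t seed, int steps) {
    assert(steps >= 0);
    std::uint64_t x = seed;
    for (int i = 0; i < steps; ++i) {
        // Reducing x first keeps A*x below 2^62, so the 64-bit product cannot overflow.
        x = (A * (x % C) + B) % C;
    }
    return x;
}

// Sequential reference: out[i] is the 100th number generated from seed in[i].
std::vector<std::uint64_t> lcg_100th(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint64_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = lcg_nth(in[i], 100);
    }
    return out;
}
```

Every output element depends only on its own input element, so the outer loop is a natural candidate for the parallel versions, while the inner 100-step loop is inherently sequential.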

Implement the chosen algorithm in the following versions: sequential (with and without optimization flags, and with manual optimization such as loop unrolling and inline functions), auto-parallelized/auto-vectorized, vectorized (SIMD), multithreaded, with OpenCL (GPU), and with OpenMP.

Compare the outcomes to make sure the algorithms are doing the same thing (use random data as input).
Measure time, calculate computational performance (operations/second), bandwidth (bytes per second) and cycles per basic instruction (query the clock frequency).
Compare this for all the versions.
Calculate the speedup relative to the sequential version with full optimization (the /O2 flag in Visual C++, -O2 in GCC and Clang).
Also compare the fully optimized sequential version with versions built at lower optimization levels and with manual optimization. From the differences, try to infer which optimizations the compiler is applying.
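The quantities asked for above follow directly from the measured run time. A minimal sketch, assuming you count the operations and bytes of your algorithm yourself (the struct and function names are hypothetical, not taken from the course template):

```cpp
#include <cassert>

// Derived metrics for one timed run. All inputs are supplied by the caller.
struct Metrics {
    double ops_per_second;    // computational performance
    double bytes_per_second;  // effective memory bandwidth
    double cycles_per_op;     // clock cycles spent per basic operation
};

Metrics compute_metrics(double seconds, double operations,
                        double bytes_moved, double clock_hz) {
    assert(seconds > 0.0 && operations > 0.0);
    Metrics m;
    m.ops_per_second   = operations / seconds;
    m.bytes_per_second = bytes_moved / seconds;
    m.cycles_per_op    = (seconds * clock_hz) / operations;
    return m;
}

// Speedup of a version relative to the fully optimized sequential baseline.
double speedup(double baseline_seconds, double version_seconds) {
    assert(version_seconds > 0.0);
    return baseline_seconds / version_seconds;
}
```

For a dot product of n floats, for example, one would typically count 2n operations (one multiply and one add per element) and 8n bytes read.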

Run your code on at least 2 systems.
Describe in detail the processor and GPU characteristics.
Estimate the peak performance of the processor (the theoretical maximal performance which the processor can deliver in ideal situations, for both memory access and computational performance) and measure the GPUs with our microbenchmarks: www.gpuperformance.org. Test in advanced mode and upload the results to the database.
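One common back-of-the-envelope model for the CPU peak is sketched below. It assumes a processor with fused multiply-add (FMA) units and a simple channel-based memory model; the function names and the example numbers are illustrative assumptions, so take the real values from the data sheet of your own processor.

```cpp
#include <cassert>

// Theoretical peak in floating-point operations per second.
// Each FMA counts as 2 operations (multiply + add) per SIMD lane per cycle.
double peak_flops(int cores, double clock_hz, int simd_lanes,
                  int fma_units_per_core) {
    assert(cores > 0 && simd_lanes > 0 && fma_units_per_core > 0);
    return cores * clock_hz * simd_lanes * fma_units_per_core * 2.0;
}

// Theoretical peak memory bandwidth in bytes per second.
double peak_bandwidth(int channels, double transfers_per_second,
                      int bytes_per_transfer) {
    assert(channels > 0 && bytes_per_transfer > 0);
    return channels * transfers_per_second * bytes_per_transfer;
}
```

With hypothetical numbers: 4 cores at 3 GHz, 8 single-precision lanes (256-bit vectors) and 2 FMA units per core give 4 x 3e9 x 8 x 2 x 2 = 384 GFLOP/s; two memory channels at 3.2e9 transfers per second and 8 bytes per transfer give 51.2 GB/s.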

You can work alone or in pairs.
You can visit me to discuss the results and get feedback.
Try to explain the different results. Is the peak performance reached? If not, why not? Why do some optimizations help while others do not?
Did auto-parallelization and auto-vectorization work? Check the compiler reports!
Check the assembler code for automatic loop unrolling and vectorization.

As results we expect:

The code, preferably following my template.

The experimental results on at least 2 systems (runtime, speed, bandwidth, ... see my template)

You can write your results to a file with the following command in cmd.exe: VectorElementWiseProduct.exe > results.txt (do not forget to press Enter if the code waits for it)

Description of the CPU-GPUs used, theoretical peak performance estimation of CPU, results of GPU microbenchmarks in our database

Analysis of the results: explain why the optimization tricks and parallelization work or do not work

Documentation

Visual Studio settings:

Project Properties -> Configuration Properties (to be set separately for Debug & Release configurations)

C/C++ -> Optimization -> choose Optimization level

you can override this configuration for each file separately: right-click on file and choose Properties

C/C++ -> Optimization -> enable Intrinsic Functions: Yes or No
C/C++ -> Output Files -> Assembler output: Assembly With Source Code (/FAs)

an .asm file is generated in the x64\Release folder

C/C++ -> Command Line -> Additional Options: /Qpar /Qpar-report:2 /Qvec-report:2

For the meaning of the compiler messages, see the Visual C++ documentation on vectorizer and parallelizer messages.

Measure time accurately (std::chrono or the high-precision clock)
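A minimal timing sketch with std::chrono is shown below. The helper name time_best_of and the choice to report the best of several runs are assumptions, not part of the course template; steady_clock is used because it is monotonic.

```cpp
#include <cassert>
#include <chrono>

// Run `work` `repeats` times and return the best wall-clock time in seconds.
template <typename F>
double time_best_of(F&& work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

Make sure the result of the timed work is actually used afterwards (for instance printed or compared), otherwise an optimizing compiler may remove the computation altogether.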

Use the free tool CPU-Z to check processor capabilities, and GPU-CAPS for your GPUs

auto-parallelization and auto-vectorization in Visual C++

with #pragma loop(ivdep) you indicate that loop iterations are independent. Sometimes this pragma is necessary.
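A minimal sketch of where the pragma goes, assuming an element-wise product through raw pointers (a case where the compiler often cannot prove that the arrays do not overlap). The function name multiply is hypothetical, and the pragma is guarded because it is specific to Visual C++.

```cpp
#include <cassert>
#include <cstddef>

// Element-wise product out[i] = a[i] * b[i].
// The caller guarantees that `out` does not overlap `a` or `b`.
void multiply(float* out, const float* a, const float* b, std::size_t n) {
    assert(out != nullptr && a != nullptr && b != nullptr);
#ifdef _MSC_VER
#pragma loop(ivdep)  // hint: ignore assumed dependences between iterations
#endif
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

The pragma is a promise, not a check: if the arrays do overlap, the vectorized loop may silently produce different results from the sequential one.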

x86 vector instructions

use hadd (horizontal add) to make sums of array elements (reductions)
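A sketch of such a reduction with SSE3 intrinsics follows, assuming an x86 processor with SSE3. The function name sum_sse3 and the USE_SSE3 macro are hypothetical; the macro only enables SSE3 for this one function on GCC and Clang, while Visual C++ needs no extra flag for these intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

#if defined(__GNUC__)
#define USE_SSE3 __attribute__((target("sse3")))
#else
#define USE_SSE3
#endif

// Sum of n floats: vertical adds in the loop, horizontal adds only at the end.
USE_SSE3 float sum_sse3(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));  // 4 running partial sums
    }
    acc = _mm_hadd_ps(acc, acc);  // (a0+a1, a2+a3, a0+a1, a2+a3)
    acc = _mm_hadd_ps(acc, acc);  // every lane now holds a0+a1+a2+a3
    float total = _mm_cvtss_f32(acc);
    for (; i < n; ++i) total += data[i];  // scalar tail for the last n % 4 elements
    return total;
}
```

The horizontal add is kept outside the loop because it is relatively slow. Note also that the summation order differs from the sequential loop, so floating-point results can differ slightly; compare with a tolerance.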

Multithreading in C++ 11, another short tutorial
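As an illustration of the multithreaded version, the sketch below sums an array with std::thread, giving each thread one contiguous chunk. The function name parallel_sum and the chunking scheme are assumptions, one possible design among several.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` with `num_threads` threads; each thread writes its own partial sum.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end   = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;  // accumulate locally, store once at the end
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

No mutex is needed here because every thread writes to a different element of partial, and the final accumulation happens only after all threads have been joined.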

Condition variables: wait() does not accept a bare mutex; it requires a std::unique_lock<std::mutex>. Use for instance the constructor (number 3) that wraps a mutex in a unique_lock. Note that this constructor locks the mutex! See also here.
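The point about unique_lock can be seen in the minimal one-shot handoff below. The function name wait_for_value and the value 42 are purely illustrative.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// A producer thread publishes a value; the calling thread waits for it.
int wait_for_value() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int value = 0;

    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> guard(m);  // lock_guard suffices when not waiting
            value = 42;
            ready = true;
        }
        cv.notify_one();
    });

    int result = 0;
    {
        std::unique_lock<std::mutex> lock(m);  // this constructor locks m
        assert(lock.owns_lock());
        // wait() unlocks m while sleeping and relocks it before returning;
        // the predicate guards against spurious wake-ups.
        cv.wait(lock, [&] { return ready; });
        result = value;
    }
    producer.join();
    return result;
}
```

Because the predicate is checked under the lock, the code is correct regardless of whether the producer runs before or after the consumer starts waiting.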

OpenMP in Visual C++
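For comparison with the std::thread version, an OpenMP dot product can be as short as the sketch below (the function name dot_omp is hypothetical). It needs OpenMP enabled in the compiler, for example /openmp in Visual C++ or -fopenmp in GCC; without it the pragma is ignored and the loop simply runs sequentially.

```cpp
#include <cassert>
#include <vector>

// Dot product with an OpenMP reduction over `sum`.
double dot_omp(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    // A signed loop index is used because the OpenMP 2.0 support in Visual C++ requires one.
    const int n = static_cast<int>(a.size());
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The reduction clause gives each thread a private copy of sum and combines the copies at the end, which avoids a data race on the shared variable.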
GPU, OpenCL 1.2 documentation

You might have to copy Opencl.dll to both C:/windows/System32 and C:/windows/SysWow64.
The DLL can be found at https://www.dllme.com/dll/files/opencl_dll.html (pick the Khronos one).

Multiprocessors and Reconfigurable Hardware: practica on multiprocessors
