Multiprocessors and Reconfigurable Hardware: practica on multiprocessors

  Practicum assignment

Preferably use Visual C++ (the Community edition is free).
Consider the following basic algorithms:

        convolution/filter on a 1-dimensional signal or a 2-dimensional matrix, reduction of an array (e.g. sum of all elements), dot product of two vectors (arrays), matrix multiplication

        random number generator: for each element of the input array of integers, use it as the start value (seed) and calculate the 100th random number with a pseudo-random number generator of the form: x = (A*x + B) % C

        or propose a different algorithm which you like
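A sequential baseline for the random number generator algorithm could look like the sketch below. The constants A, B and C are hypothetical (the assignment does not prescribe them), and it is an assumption that "the 100th random number" means the value after 100 updates of the seed. The function names lcg_nth and lcg_all are only illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants: the assignment does not prescribe A, B or C.
constexpr std::uint32_t A = 1103515245u;
constexpr std::uint32_t B = 12345u;
constexpr std::uint32_t C = 2147483648u;  // 2^31

// Value after `steps` updates of x = (A*x + B) % C, starting from `seed`.
std::uint32_t lcg_nth(std::uint32_t seed, int steps = 100) {
    assert(steps >= 0);
    std::uint64_t x = seed;  // 64-bit intermediate so A*x cannot overflow
    for (int i = 0; i < steps; ++i) {
        x = (A * x + B) % C;
    }
    return static_cast<std::uint32_t>(x);
}

// Applies the generator to every element of the input array independently.
std::vector<std::uint32_t> lcg_all(const std::vector<std::uint32_t>& seeds) {
    std::vector<std::uint32_t> out(seeds.size());
    for (std::size_t i = 0; i < seeds.size(); ++i) {
        out[i] = lcg_nth(seeds[i]);
    }
    return out;
}
```

Each output element depends only on its own seed, so the outer loop has no dependencies between iterations, while the inner loop of 100 updates is strictly sequential. This makes the algorithm compute-bound: per element it reads and writes only a few bytes but performs 100 multiply-add-modulo steps.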

Implement and compare the performance based on the following approaches:

        sequential programming (with and without optimization flags, and manual optimization such as loop unrolling and inline functions),  auto-parallelization/auto-vectorization, vectorization (SIMD), multithreaded, with OpenCL (GPU), and with OpenMP.
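As an illustration of two of these approaches, the sketch below shows an array reduction in a sequential and an OpenMP variant, together with a tolerance-based check that both give the same result. The function names are hypothetical and this is not the course template. The OpenMP pragma only takes effect when OpenMP is enabled in the compiler (/openmp in Visual C++, -fopenmp in GCC and Clang); without it the loop simply runs sequentially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sequential reference: sum of all elements.
double sum_sequential(const std::vector<double>& data) {
    double sum = 0.0;
    for (double v : data) sum += v;
    return sum;
}

// OpenMP variant: each thread accumulates a private partial sum,
// and the reduction clause combines the partial sums at the end.
double sum_openmp(const std::vector<double>& data) {
    double sum = 0.0;
    // Signed loop index, as required by older OpenMP versions.
    // Assumes the array has fewer than 2^31 elements.
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

// Relative comparison: a parallel sum adds the elements in a different
// order, so floating-point results may differ in the last bits.
bool nearly_equal(double a, double b, double rel_tol = 1e-9) {
    assert(rel_tol >= 0.0);
    return std::fabs(a - b) <= rel_tol * std::fmax(std::fabs(a), std::fabs(b));
}
```

Comparing floating-point results with == can fail even when both versions are correct, which is why a relative tolerance is used here. For integer data an exact comparison is appropriate.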

Compare the outcomes to make sure the algorithms are doing the same thing (use random data as input).
Measure time, calculate computational performance (operations/second), bandwidth (bytes per second) and cycles per basic instruction (query the clock frequency).
Compare this for all the versions.
Calculate the speedup of each version relative to the sequential version with full optimization (-O2 flag; /O2 in Visual C++).
Also compare the fully optimized sequential version with versions built at lower optimization levels and with manual optimization. From this, try to infer which optimizations the compiler applies.
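The measurements described above can be derived from a single timing per version, as in the following sketch. It assumes that you count the operations and the bytes moved yourself for each algorithm, and that the clock frequency you queried is the nominal one (the actual frequency may differ under turbo boost or throttling). The names Metrics, time_seconds, compute_metrics and speedup are hypothetical and not taken from the course template.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

struct Metrics {
    double seconds;           // measured runtime
    double ops_per_second;    // computational performance
    double bytes_per_second;  // achieved bandwidth
    double cycles_per_op;     // elapsed clock cycles per basic operation
};

// Runs the kernel several times and returns the shortest runtime in seconds,
// which reduces the influence of warm-up effects and other processes.
template <typename F>
double time_seconds(F&& kernel, int repetitions = 10) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repetitions; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        kernel();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// operations and bytes are counted by hand for the algorithm at hand,
// clock_hz is the queried clock frequency in Hz.
Metrics compute_metrics(double seconds, double operations, double bytes,
                        double clock_hz) {
    assert(seconds > 0.0 && operations > 0.0);
    Metrics m;
    m.seconds = seconds;
    m.ops_per_second = operations / seconds;
    m.bytes_per_second = bytes / seconds;
    m.cycles_per_op = seconds * clock_hz / operations;
    return m;
}

// Speedup of a variant relative to the reference (fully optimized sequential).
double speedup(double t_reference, double t_variant) {
    assert(t_variant > 0.0);
    return t_reference / t_variant;
}
```

For example, a dot product of two float arrays of length n performs about 2n operations (one multiplication and one addition per element) and reads 8n bytes. Make sure the compiler cannot optimize the kernel away, for instance by printing or otherwise using its result.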

Run your code on at least 2 systems.
Describe in detail the processor and GPU characteristics.
Estimate the peak performance of the processor, i.e. the theoretical maximum that the processor can deliver in ideal situations, for both memory access and computation. Measure the GPUs with our microbenchmarks: run the test in advanced mode and upload the results to the database.
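One common way to make such an estimate is sketched below. It assumes that every core can start a fixed number of fused multiply-add (FMA) vector instructions per cycle, and that the memory bandwidth is limited by the number of memory channels and their transfer rate. All parameters must be looked up for your specific processor and memory; the function names are hypothetical.

```cpp
#include <cassert>

// Peak floating-point rate in operations per second.
// simd_lanes: elements per vector register (e.g. 8 floats in a 256-bit register)
// fma_units:  FMA vector instructions each core can start per cycle
// The factor 2 counts a fused multiply-add as two operations.
double peak_flops(int cores, double clock_hz, int simd_lanes, int fma_units) {
    assert(cores > 0 && simd_lanes > 0 && fma_units > 0);
    return cores * clock_hz * simd_lanes * fma_units * 2.0;
}

// Peak memory bandwidth in bytes per second.
// transfers_per_second: memory transfer rate per channel
// bytes_per_transfer:   channel width in bytes (8 for a 64-bit channel)
double peak_bandwidth(int channels, double transfers_per_second,
                      int bytes_per_transfer = 8) {
    assert(channels > 0 && bytes_per_transfer > 0);
    return channels * transfers_per_second * bytes_per_transfer;
}
```

For a hypothetical 4-core processor at 3 GHz with 8 lanes and 2 FMA units per core, this gives 4 x 3e9 x 8 x 2 x 2 = 384e9 operations per second. Two 64-bit memory channels at 3200 million transfers per second give 51.2e9 bytes per second. Measured values are normally well below such estimates.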

You can work alone or in pairs.
You can visit me to discuss the results and get feedback.
Try to explain the different results. Is the peak performance reached? If not, why not? Why do certain optimizations help, or fail to help?
Did auto-parallelization and auto-vectorization work? Check the compiler reports!
Check the generated assembly code for automatic loop unrolling and vectorization.

Additional questions:

          How many vector registers does your CPU have?
          Try to measure the time (in cycles) for a scalar and a vector operation.

As results we expect:
          The code, preferably following my template.
          The experimental results on at least 2 systems (runtime, speed, bandwidth, ... see my template)

                you can write your results to file with the following command in cmd.exe: VectorElementWiseProduct.exe > results.txt   (do not forget to press enter if required by the code)
          Description of the CPUs and GPUs used, a theoretical peak performance estimate for the CPU, and the results of the GPU microbenchmarks in our database

          Analysis of the results: explain why the optimization tricks and parallelization work or do not work

