Practica of
PPP
Demo code based on Visual Studio 2019
Visual Studio
Solution
- VectorElementWiseProduct: compare variants of the same
algorithm <= use this as a template!!
- ComplexFunctions: compute-intensive functions, ideal
for making parallel tests with high granular programs
- ArrayOfFunctionsExample: how to use functions as
variables
- MultithreadingTesting
- Most other code is part of the exercises on Multithreading
explained below
Documentation
- Use the free tool CPU-Z to check processor capabilities
- Check Compilers with godbolt.org/
- MSVC is the Microsoft Visual Studio compiler
- add the option /O2 (MSVC) or -O2 (gcc/icc) to see the optimizations!
- gcc (GNU) and icc (Intel) are also very prominent
compilers for x86/x64 CPUs
- Visual Studio settings:
- Project Properties -> Configuration Properties (to
be set separately for Debug & Release configurations)
- C/C++ -> Optimization -> choose Optimization
level
- you can override this configuration for each file
separately: right-click on file and choose Properties
- C/C++ -> Optimization -> enable Intrinsic
Functions: Yes or No
- C/C++ -> Output Files -> Assembler output: Assembly
With Source Code (/FAs)
- an .asm file is generated in the x64\Release
folder
- C/C++ -> Command Line -> Additional Options:
/Qpar /Qpar-report:2 /Qvec-report:2
- Assembler code (to be checked with godbolt
or generated .asm-files):
- watch out: movss xmm0
is a scalar move on a vector register, mulss is
a scalar multiplication. Scalar means working on 1
element.
- movups and mulps are
the true vector instructions!
- List
of x86 instructions
- Measure time accurately (std::chrono is a
high-precision clock since C++ 11)
- auto-parallelization
and auto-vectorization in Visual C++ (Use of pragmas)
- with #pragma loop(ivdep) you indicate
that loop iterations are independent. Sometimes this
pragma is necessary.
- x86
vector instructions
- Multithreading
in C++ 11, another
short tutorial
- OpenMP
in Visual C++
- Activate OpenMP in Visual Studio: Project Properties ->
Configuration Properties -> C/C++ -> Language -> OpenMP
Support: Yes, OR add /openmp in All Options
- MPI in Visual C++
- MPI on the VUB/ULB Hydra cluster
Mini-project on optimization (10%
of marks)
Project VectorElementWiseProduct: Study
the different optimizations and compare 2 related array operations.
Every student or group of 2 students chooses a
different combination of 2 array operations and sends it to me for
OK.
- Array operations:
- element-wise array operations: add - mul - mul-add -
div - sqrt (NEW!) - max - bitwise operation (and)
- compare for instance mul with div, or mul with mul-add
- the above is bandwidth-limited; to reach the computational
peak performance we have to:
- reductions: sum, product, mul-add or max of an array
- compare for instance sum with add (same number of
operations)
- perform element-wise array operations multiple
times: what happens if we do the element-wise
operation repetitively? The memory load will decrease, so the
performance might increase. Test with different numbers of
iterations.
- compare different data types for the same operation:
byte, short integer (16-bit), integer, _int64, float, double
- Add the following optimizations/parallelizations:
- try the different optimization flags: O2 & Od, intrinsics Yes/No
- Q: is there a difference with intrinsics on and off? Check
the assembler code (though I noted that godbolt doesn't
always give the same as the generated assembler code!)
- Try to achieve the O2 performance with manual
optimization and level Od! Can you do as well as or better than
the compiler!?
- Inlining functions: using a multiplication function and
then inlining it does not seem to change a lot, so you don't have
to do it. (The versions with functions are in
Versions_Od_NoIntrinsics.cpp.)
- Try optimization pragmas: #pragma unroll, #pragma
loop(no_vector)
- they only make sense with optimization turned on (not with
Od)!
- Q: does it make a difference? take a look at the assembler
code
- manual loop unrolling
- vector instructions: making reductions with intrinsics
- pragma simd
- Howto-video for doing the last 3 techniques - Solutions-video
- ...
- Tips & tricks
- Make sure to compile your final code
in Release Mode!
- Check your clock frequency via the Task Manager and hardcode
it in your code if the query function doesn't give the right
result.
- I don't expect you to understand all the details of
compiler optimization, pragmas, etc. These exercises just aim
to give you a feeling for these concepts.
- The basic element-wise operation is bandwidth-limited, which
means that the CPU spends most of its time on memory
reads and writes! So don't expect large speedups.
- Compare the manual optimization with Od. With the optimization
flag O2, the compiler might have optimized better than you did
manually!
- Together with the code I provided a Visual Studio tips file.
Read it.
- Hand in a report (10% of marks for the course) containing:
- CPU info [extra: test on 2/3 different processors]
- performance results (generated table) for all optimizations
- briefly discuss the results, analyze them and draw conclusions
- For sequential code answer the following questions:
- How much can the compiler optimize the code? Are the loops
unrolled? The instructions vectorized?
- Which pragmas help the compiler in further optimizing the
code?
- Can you reach the same speed with manual optimization and
flag Od?
- For multi-threaded code, answer the following questions (no
need to compare with Od):
- Does multi-threading further speed up the code (O2 flag)?
- Does automatic parallelization (pragma, OpenMP) work?
- try to estimate the maximal GFlop/s and maximum GB/s
from your results
- Visual Studio code (zipped). Remove object and exe files
(and all other intermediate files), for instance with Build ->
Clean, or remove them manually. Minimally I need the source files,
the solution (.sln) and the project file (.vcxproj).
- Bonus points: check and discuss generated assembler code in
godbolt.org/
Exercises on multithreading
All solutions - note: to have speedup for the Counting3s,
compile without optimizations (Od)
Explanation of solution for exercises 1 & 3 (9 Nov 2020).
Explanation and solution of exercises 4, 5 & 6.
1. Implement a multithreaded fibonacci
algorithm [project Fibonacci].
- Base the calculations on the recursive definition. This
is of course not the most intelligent solution, actually
terribly dumb. Let's assume
that it is the only way of calculating fibonacci.
Since it generates a lot of computations, it is good for
obtaining speedup when run in parallel!
- Fibonacci(n) = Fibonacci(n-1) + Fibonacci(n-2) if
n > 2
- Fibonacci(n) = 1 if n = 1 or n = 2
- Fibonacci(n) = 0 if n < 1
- There are several ways to decompose the problem. Try to find
a solution for which you can parameterize the number of
threads.
- Study the performance as a function of the number of
threads.
- Try to get an ideal speedup (= number of cores).
- If this ideal is not met, try to find an explanation!
- Advanced questions:
Is it a load-balanced solution? Count the number of threads.
Count the amount of work of each thread, or time the
computation time of each thread.
- Use the task parallelism capacities of OpenMP: see
PPCP pages 209-212.
2. Implement a multithreaded
matrix multiplication [project MxM].
- Try to make a solution in which you can set the number of
threads.
- What speedup do you obtain?
- What is the maximal number of threads?
- How does the speedup evolve with an increasing number of
threads? And with increasing matrix size?
- Try to estimate the thread overhead by testing solutions
with different numbers of threads.
3. Test the performance of various Count3s
implementations [project Counting3s].
- The given first parallelized solution is wrong. Why?
- Explore different solutions (their correctness and their
performance).
4. Check Invariance-breaking (See
slides on MT: first example of thread safety) [project
InvariantBreaking].
- In the threads, do a
lot of updates of the upper and lower values to generate race
conditions. Check the invariant each time, so that you see when
it is broken.
- Don't put extra print statements in the threads. They will
considerably slow down the threads and lower the chance of a
race condition.
- Why don't you get a thread-safe solution with volatile? Which
solution gives a correct program?
- Test it with at least 10,000,000 iterations.
- Can something go wrong with CheckInvariant?
5. Implement a Barrier [project
Barrier].
- Implement the barrier function such that all threads wait for
the last thread to finish.
6. Implement a multithreaded solution for the
following matrix update algorithm [project
Wavefront].
- The matrix is updated row-by-row from left to right. Each
element is updated by adding the latest value of
the upper element and by subtracting the left element.
- Implement a highly-parallel solution which is correct. Check
the correctness of the parallel result. The performance is not
the major goal here.
- The parallelism lies in the wavefront, as indicated on the figure.
- Hint: use one
thread per row.
- How could you make it faster?
- Optional: do the
update multiple times.