Parallel Systems - Practica
Until 2016, the practicum was part of the course. Now, we discuss some of the exercises during the theory.
- edu.cmu.tetrad.data.LongColumn to average out the results of multiple experiments.
- be.ac.vub.ir.data.Chart to show data.
- be.ac.vub.ir.data.FrequencyTable to show a frequency table.
Room 4k228 computer specifications
- CPU speed: 2.66 GHz
- 64-bit processor
- Level 2 cache: 2 x 4 MB (each L2 cache is shared between 2 cores)
- Level 1 instruction cache: 4 x 32 KB
- Level 1 data cache: 4 x 32 KB
- Architecture
- RAM: 4 GB
- Graphics card: NVIDIA GeForce 9400 GT - has 16 'cores' and supports CUDA for general-purpose computing...
Exercises
Tips:
- Use -Xmx256M as a VM argument if more memory is needed (set it in Eclipse for sequential or multithreaded algorithms, or after mpjrun.bat for message-passing algorithms).
- Call the garbage collector (between 2 experiments) with Runtime.getRuntime().gc();
Memory Access Overhead (NOT THIS YEAR)
1. Test the cache miss penalty with the array-walking program ArrayWalking. Solution
- Can you estimate the cost of a Level 1 and a Level 2 cache miss?
- Can you estimate the relative memory latency Lambda (PPP page 66) for the different memory levels?
2. Test whether latency hiding can be obtained with file reading from the file system and from a URL in a separate thread (FileReading.java). Solution
- Test by running on multiple cores and by running on one core (set the affinity of the Eclipse process to 1 core in Windows' Task Manager).
3. Implement a multithreaded Fibonacci algorithm (Fibonacci.java). Solution!
- Base the calculations on the following recursive definition. This is of course not the most intelligent solution (actually terribly dumb), but let's assume that it is a good way of calculating Fibonacci numbers. Since it generates a lot of computations, it is good for obtaining speedup when run in parallel!
- Fibonacci(n) = Fibonacci(n-1) + Fibonacci(n-2) if n > 2
- Fibonacci(n) = 1 if n = 1 or n = 2
- Fibonacci(n) = 0 if n < 1
- There are several ways to decompose the problem. Try to find a solution for which you can parameterize the number of threads.
- Study the performance as a function of the number of threads.
- Try to get an ideal speedup (= number of cores).
- If this ideal is not met, try to find an explanation!
- Advanced questions: Is it a load-balanced solution? Count the number of threads. Count the amount of work of each thread, or time the computation time of each thread.
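One possible decomposition (a sketch only, with invented names, not the course's reference solution): unfold the recursion until there are at least as many independent subproblems as threads, then let each thread sum its share of the subproblems.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class FibThreads {
    // Recursive definition from the exercise: fib(1) = fib(2) = 1, fib(n) = 0 if n < 1.
    static long fib(int n) {
        if (n < 1) return 0;
        if (n <= 2) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    // Compute fib(n) with the given number of threads by first unfolding the
    // recursion into enough independent subproblems, always splitting the
    // largest one (this also helps load balancing).
    static long parallelFib(int n, int threads) throws InterruptedException {
        PriorityQueue<Integer> tasks = new PriorityQueue<>(Collections.reverseOrder());
        tasks.add(n);
        while (tasks.size() < threads && tasks.peek() > 2) {
            int m = tasks.poll();
            tasks.add(m - 1);      // fib(m) = fib(m-1) + fib(m-2), so the sum is preserved
            tasks.add(m - 2);
        }
        List<Integer> subproblems = new ArrayList<>(tasks);
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // Each worker sums the subproblems assigned to it round-robin.
                for (int i = id; i < subproblems.size(); i += threads)
                    partial[id] += fib(subproblems.get(i));
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        long sum = 0;
        for (long p : partial) sum += p;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parallelFib(30, 4)); // prints 832040, the same as fib(30)
    }
}
```

Note that the subproblems have different sizes, which ties in directly with the load-balancing question above.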
4. Implement a multithreaded matrix multiplication. Solution
- Try to make a solution in which you can set the number of threads.
- What speedup do you obtain?
- What is the maximal number of threads?
- How does the speedup evolve with an increasing number of threads? And with increasing matrix size?
- Try to estimate the thread overhead by testing solutions with different numbers of threads.
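As a starting point, a sketch (class and method names invented here) that splits the result rows into contiguous blocks over a configurable number of threads:

```java
public class MatMulThreads {
    // Multiply a (n x m) by b (m x p); each thread computes a contiguous
    // block of rows of the result, so no synchronization is needed on c.
    static double[][] multiply(double[][] a, double[][] b, int threads)
            throws InterruptedException {
        int n = a.length, m = b.length, p = b[0].length;
        double[][] c = new double[n][p];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int from = t * n / threads, to = (t + 1) * n / threads;
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++)
                    for (int k = 0; k < m; k++)      // k-before-j loop order walks b row-wise (cache-friendly)
                        for (int j = 0; j < p; j++)
                            c[i][j] += a[i][k] * b[k][j];
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return c;
    }
}
```

With this row-block decomposition the maximal useful number of threads is the number of rows; beyond the number of cores, extra threads only add overhead.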
5. Test the performance of various Count3s implementations. Solution
- The given first solution is wrong. Why?
- What happens if you limit your program to run on 1 core?
- Explore the following solutions (their correctness and their performance):
- This is the given solution without any form of synchronization.
- Try to protect the counter variable with a boolean preventing multiple access to the counter. Does this work?
- Use yield() to give the OS a hint to do a context switch.
- Do the race conditions disappear by making the count variable volatile?
- Use synchronized to protect the critical section. There are 2 ways to do this.
- Use a private counter to increase the performance.
- Lin & Snyder (page 37) claim that padding increases performance (by overcoming false sharing). Test this claim.
- Test whether false sharing happens when a (global) array is used in which each thread uses one element as a private counter (idea of student Thierry Renaux, 2011). Test whether padding helps to overcome the false sharing.
- For an example of how padding might improve performance, examine the following code: False Sharing (idea of student Christophe De Troyer, 2014)
- Visualization of the speedup on 8 cores
- Solve Count3s by making use of an atomic counter (see the java.util.concurrent.atomic package, Java API Doc).
- Correct solution 2 by making use of AtomicBoolean (idea of student Sven Claeys, 2009).
- An alternative for solution 5: let the main thread globalize the private counters (idea of student Robrecht Dewaele, 2010).
- Can you invent another solution?
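A sketch of the private-counter variant mentioned above, in which the main thread globalizes the partial counts after join() (class and method names invented, and only one of the many solutions to explore):

```java
public class Count3s {
    // Count occurrences of 3 in 'data' with a truly private counter per
    // thread; no synchronization is needed on the hot path, and the main
    // thread combines the partial results after joining the workers.
    static int count3s(int[] data, int threads) throws InterruptedException {
        int[] partial = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                int local = 0;                      // stack-local: immune to false sharing
                for (int i = id; i < data.length; i += threads)
                    if (data[i] == 3) local++;
                partial[id] = local;                // a single write per thread
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        int total = 0;
        for (int p : partial) total += p;
        return total;
    }
}
```

If the threads instead incremented partial[id] directly inside the loop, adjacent array elements would share a cache line, which is exactly the false-sharing situation the exercise asks you to test.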
6. Check invariance-breaking with volatile (Slide 55). Solution
- In the threads, do a lot of updates of the upper and lower values to generate race conditions. Check the invariant each time, so that you see when it is broken.
- Don't put extra print statements in the threads. They will considerably slow down the threads and lower the chance of a race condition.
- Why don't you get a thread-safe solution with volatile? Which solution gives a correct program?
- Test it with at least 10,000,000 iterations.
- Can something go wrong with CheckInvariant?
7. Break the 'monitorness' of an object so that two synchronized methods (here: f1 and f2) can be accessed simultaneously. Solution
- First verify the monitorness of the ObjectAsMonitor by comparing synchronized with unsynchronized methods.
8. Implement a Barrier (PPP page 193) and call the barrier in the previous exercise after each iteration. Solution
- Replace the join in the previous exercise with a barrier call at the end of the thread and in the main.
- Optional: Implement a split-phase barrier (PPP page 193) and use it in the previous example after each iteration. The condition is that a thread can only start the next iteration if everybody has finished f1(). f2() is therefore the useful work that can be done while waiting.
- As correctly pointed out by Christophe De Troyer, the solution is a bit more complex, especially if you are going to reuse the barrier object iteratively:
- We should take into account the possibility of spurious wakeups.
- If the same barrier is used repeatedly, you should ensure that all threads have left the barrier before early threads start entering it again. Indeed, Java's notifyAll does not guarantee that ownership is passed to the awakened threads: "after activation due to a notify, the inactive thread starts wanting to run in the synchronized block again, but is treated equally with all other threads that want it".
- Christophe De Troyer's November 2014 solution takes care of these issues!
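One possible design along these lines (a sketch with invented names, not Christophe's solution): a generation counter handles both spurious wakeups and safe reuse of the barrier object.

```java
public class BarrierSketch {
    private final int parties;
    private int waiting = 0;
    private long generation = 0;   // bumped each time the barrier trips

    public BarrierSketch(int parties) { this.parties = parties; }

    public synchronized void await() throws InterruptedException {
        long myGeneration = generation;
        waiting++;
        if (waiting == parties) {
            // The last thread to arrive releases everybody and resets the
            // barrier so it can be reused for the next iteration.
            waiting = 0;
            generation++;
            notifyAll();
        } else {
            // The while loop guards against spurious wakeups; comparing
            // generations prevents a fast thread that already re-entered the
            // barrier from being released by the previous notifyAll.
            while (myGeneration == generation)
                wait();
        }
    }
}
```

Because a re-entering thread gets a new generation number, it cannot corrupt the count of the generation that is still draining, which is exactly the reuse problem described above.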
9. Implement a multithreaded solution for the following matrix update algorithm. See the figure and Wavefront.java. Solution
- The matrix is updated row-by-row from left to right. Each element is updated by adding the latest value of the upper element and by subtracting the left element.
- Implement a highly parallel solution which is correct. Check the correctness of the parallel result. Performance is not the major goal here.
- The parallelism lies in the wavefront, as indicated on the figure.
- Hint: use one thread per row.
- How could you make it faster?
- Optional: do the update multiple times.
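Following the one-thread-per-row hint, a sketch (invented names): each row thread publishes how many columns it has finished, and row i busy-waits until row i-1 has passed column j. The sequential version is included so the parallel result can be checked, as the exercise asks.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class WavefrontSketch {
    // Sequential reference: row-by-row, left-to-right, using updated values.
    static void sequentialUpdate(int[][] m) {
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++) {
                if (i > 0) m[i][j] += m[i - 1][j];   // add the latest upper value
                if (j > 0) m[i][j] -= m[i][j - 1];   // subtract the latest left value
            }
    }

    // One thread per row. progress[i] = number of columns row i has finished;
    // row i may update column j once row i-1 has finished column j, which is
    // exactly the wavefront from the figure.
    static void parallelUpdate(int[][] m) throws InterruptedException {
        int rows = m.length, cols = m[0].length;
        AtomicIntegerArray progress = new AtomicIntegerArray(rows);
        Thread[] workers = new Thread[rows];
        for (int r = 0; r < rows; r++) {
            final int i = r;
            workers[r] = new Thread(() -> {
                for (int j = 0; j < cols; j++) {
                    if (i > 0)
                        while (progress.get(i - 1) <= j)
                            Thread.yield();           // spin until the upper element is ready
                    if (i > 0) m[i][j] += m[i - 1][j];
                    if (j > 0) m[i][j] -= m[i][j - 1];
                    progress.set(i, j + 1);           // volatile-style write publishes m[i][j]
                }
            });
            workers[r].start();
        }
        for (Thread w : workers) w.join();
    }
}
```

The atomic write to progress establishes a happens-before edge, so the reader of m[i-1][j] is guaranteed to see the updated value; making it faster (e.g. coarser-grained signalling per block of columns) is the follow-up question.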
10. Implement a multithreaded solution for this iterative algorithm. Start from BarrierWithReduction.java. Solution
- The given program is a simple version of an iterative algorithm in which local computations have to be followed by a global calculation (reduction), as shown by this figure. This structure appears in several parallel algorithms, such as a parallel Dijkstra algorithm. The simple program is given so that you can concentrate on the synchronization, which is not trivial.
- Create a thread that calculates the value of each array element. Create a separate 'master' thread that does the reduction.
- Make sure it is 100% thread-safe and all threads finish.
11. An application of the double condition problem (cf. PPP p. 173) with a deadlock: ApplesOrangesWithDeadlock.java. Make it deadlock-free.
- There are apple producer threads and orange producer threads. The former consume 1 apple and 1 orange to produce 2 apples; the latter consume 1 apple and 1 orange to produce 2 oranges.
- Note that the code is written in Pthread style, as you would do it in C. It uses Locks and Condition Variables.
- Note that if you remove the sleep() before the consume, the application doesn't seem to deadlock.
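One way to remove the synchronization deadlock (a sketch with invented names, not the given starter code): use a single condition variable with signalAll, so a producer of the 'wrong' fruit can never swallow the only wakeup.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class ApplesOrangesSketch {
    private int apples, oranges;
    private final Lock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition(); // one condition for both fruit types

    public ApplesOrangesSketch(int apples, int oranges) {
        this.apples = apples;
        this.oranges = oranges;
    }

    // Consume 1 apple and 1 orange, then produce 2 apples (true) or 2 oranges.
    // The while loop re-checks the double condition after every wakeup
    // (including spurious ones); signalAll wakes every waiter, so no wakeup
    // is lost to a thread whose own condition is still false.
    public void exchange(boolean produceApples) throws InterruptedException {
        lock.lock();
        try {
            while (apples < 1 || oranges < 1)
                changed.await();
            apples--; oranges--;
            if (produceApples) apples += 2; else oranges += 2;
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public int apples()  { lock.lock(); try { return apples; }  finally { lock.unlock(); } }
    public int oranges() { lock.lock(); try { return oranges; } finally { lock.unlock(); } }
}
```

Note that the problem itself can still starve: if one fruit type runs out completely, no exchange is enabled anymore, so the initial stock must be large enough for the workload.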
12. Computing the surface of the Mandelbrot set: Mandelbrot.txt.
- You are given a template to compute the surface of a Mandelbrot set. Implement a multithreaded version of this code.
- The task here is to get as good a speedup as possible. Think about how to decompose the problem to achieve this.
OPTIONAL: Implement an efficient and deadlock-free solution for the Dining Philosophers problem.
- Efficiency is in terms of the time the philosophers are hungry. You should try to minimize the starvation time.
Exercises 13-18 code
- FastMPJ Download
- MPJExpress Quick Start (Installation and user guide, Environment variables for FastMPJ, etc.)
- MPJ Express API doc
- Quick reference page 1 and page 2
13. Implement a message-passing CountingPrimes algorithm (start from Counting Primes and MPITest). Solution
Solution with Collective Communication (by Sarah ElShal)
- What is the efficiency of the parallel solution? Does it maximally exploit the parallel processing power? Can you optimize the solution?
- Study the speedup of CountingPrimes as a function of the array size and the number of processes.
- Compare it with the message-passing Counting3s solution. Which is more efficient?
- Study the message-passing Counting3s solution. Does it give a good speedup? Why (not)?
- If not done in your first solution, try to use MPI's collective communication operations in a second version.
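The efficiency question hinges mainly on how the range of numbers is split over the processes. A sketch of a block decomposition (invented names; the MPJ calls are deliberately left out so it runs standalone, but each loop body is what one rank would compute before the root combines the counts with a Reduce):

```java
public class PrimePartition {
    // Trial-division primality test, as in the Counting Primes example.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Block decomposition: rank r of 'procs' processes handles [lo(r), lo(r+1)).
    static int lo(int rank, int procs, int n) {
        return 2 + rank * (n - 2) / procs;
    }

    // Count the primes in [2, n). The inner loop is what a single rank would
    // run; the 'total +=' stands in for a collective Reduce with MPI.SUM.
    static int countPrimes(int n, int procs) {
        int total = 0;
        for (int r = 0; r < procs; r++) {
            int count = 0;
            for (int i = lo(r, procs, n); i < lo(r + 1, procs, n); i++)
                if (isPrime(i)) count++;
            total += count;
        }
        return total;
    }
}
```

Note that a block decomposition is load-imbalanced here: testing large numbers costs more than testing small ones, so the last rank works longest. A cyclic distribution is one way to optimize the solution.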
14. Study the default send & receive protocol (DefaultProtocol.java), which can result in a deadlock situation.
- MPJ Readme (check the protocol switch limit)
- Implement at least 3 different solutions that avoid the possibility of a deadlock.
15. Implement a message-passing version of the BarrierWithReduction algorithm (exercise 10). Solution
16. Implement a message-passing version of the matrix update algorithm using the wavefront approach (exercise 9). Complete solution
17. Java provides automatic object serialization.
- Sun Documentation and examples
- Object2ByteArray shows into what bytestream an object is transformed.
- Object2File shows object writing to and reading from a file.
- Try changing the serialVersionUID.
- ObjectCommunication shows how the bytestream can be sent through MPI.
- Optional: The previous solution can be optimized:
- By creating your own OutputStream: ByteArrayOutputStream.toByteArray() gives a copy of the byte array.
- The byte array of the object serialization can be made more compact by implementing the Externalizable interface.
- Now that we have seen how to serialize and communicate objects, it is clear that sending objects is possible in MPI.
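The round trip that Object2ByteArray and ObjectCommunication rely on can be sketched as follows (names invented here); the resulting byte[] is the buffer you would hand to MPI as raw bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ObjectBytes {
    // Serialize any Serializable object into a byte array.
    static byte[] toBytes(Serializable obj) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
                oos.writeObject(obj);
            }
            return baos.toByteArray();   // note: this returns a COPY of the internal buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reconstruct the object from the byte array on the receiving side.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The extra copy made by toByteArray() is precisely the overhead the first optional optimization (a custom OutputStream) tries to avoid.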
18. NOT THIS YEAR. Test the communication performance of point-to-point communication.
- Measure the communication time per integer for increasing message sizes. SendModes.java gives you the opportunity to do so.
- This code lets you test different send modes.
- Run with the -H option to see the program arguments.
- How much computation do I have to do to make it worth doing it remotely? The computation time should exceed the time to send the data. Measure the boundary value between speedup and slowdown.
- Measure a message-passing equivalent of Lambda (the relative memory latency).
- Also measure the performance characteristics for processes running on the same processor.
Performance Analysis (NOT THIS YEAR)
23. Design and do a performance analysis of a parallel pipeline algorithm.
- This was an exam question in January 2008.
- Design (implementing is not necessary) a generic parallel solution for algorithms that can be pipelined. The problem class is defined as follows:
- We have nd data packages that have to be processed by np process objects.
- Each process object has to process the data packages in the right order, because its state is changed after the processing of each data package.
- Each data package can also be changed by the process. Therefore, it should also be processed by all process objects in the right order.
- A correct sequential algorithm looks like this:
Data[] data = getData();
Process[] processes = createProcesses();
for (Data d : data)
    for (Process p : processes)
        p.process(d);
- Do a performance analysis (see theory).
- Draw an execution profile of the parallel solution to explain it.
- Define the classes and major methods. You do not have to implement them, just sketch the solution.
- What are the overheads and possible performance bottlenecks? How can they be countered?
- Actually, there are 2 possible parallel solutions. For which solution should one opt? On what does that depend?
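Although only a design is asked, the stage-per-thread idea can be sketched compactly (all names invented): one thread per process object, connected by FIFO queues, so both required orderings hold automatically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PipelineSketch {
    // Stands in for the stateful process objects of the problem class.
    interface Stage<T> { T process(T d); }

    // Stage i is one thread: it takes items from queue i and puts the
    // processed result on queue i+1. Because each queue is FIFO and each
    // stage is a single thread, every stage sees the data packages in the
    // right order, and every data package visits the stages in order.
    static <T> List<T> run(List<T> data, List<Stage<T>> stages) throws InterruptedException {
        int n = stages.size();
        List<BlockingQueue<T>> q = new ArrayList<>();
        for (int i = 0; i <= n; i++) q.add(new LinkedBlockingQueue<>());
        Thread[] workers = new Thread[n];
        for (int s = 0; s < n; s++) {
            final int i = s;
            workers[s] = new Thread(() -> {
                try {
                    for (int k = 0; k < data.size(); k++)
                        q.get(i + 1).put(stages.get(i).process(q.get(i).take()));
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            workers[s].start();
        }
        for (T d : data) q.get(0).put(d);       // feed the first stage
        List<T> out = new ArrayList<>();
        for (int k = 0; k < data.size(); k++)   // drain the last queue in order
            out.add(q.get(n).take());
        for (Thread w : workers) w.join();
        return out;
    }
}
```

The execution profile of this design shows the classic pipeline fill and drain phases: with nd data packages and np stages, roughly nd + np - 1 steps instead of nd * np, and the slowest stage is the bottleneck.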
Generic Parallel Solutions (NOT THIS YEAR)
19. Design a two-fold generic parallel solution for data-parallel farming algorithms (assume independent data parallelism). Interface Definition Ruben's implementation
- Data parallelism: the same thing (called the task) has to be performed on different data elements.
- Independent means that the tasks do not need information from each other; they can run completely independently.
- Examples:
- Check whether each element of an array is a prime. Input: int array; output: boolean array.
- Check whether each element of an array is divisible by y (with y a parameter of the program). Input: int array; output: boolean array.
- Combine multithreading and message-passing.
- Implement a simple solution based on a general Task class.
- For testing: each task counts the number of primes in a given interval. Generate a number of tasks with random intervals.
- Start with the CountingPrimes example code (see the earlier exercises).
- List the functionalities that would improve the performance of your solution.
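A minimal multithreaded half of the general Task idea might look like this (a sketch under invented names): tasks go into a shared queue and a configurable number of workers farm them out; a message-passing layer could distribute the same Task objects over processes.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class TaskFarm {
    // The general task: here it counts primes in [from, to), matching the
    // suggested test case; any independent data-parallel job fits this shape.
    static class Task {
        final int from, to;
        Task(int from, int to) { this.from = from; this.to = to; }
        int run() {
            int count = 0;
            for (int i = from; i < to; i++)
                if (isPrime(i)) count++;
            return count;
        }
    }

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Farm the tasks out over 'threads' workers pulling from a shared queue;
    // pulling gives automatic load balancing for tasks of uneven size.
    static int farm(List<Task> tasks, int threads) throws InterruptedException {
        BlockingQueue<Task> queue = new LinkedBlockingQueue<>(tasks);
        AtomicInteger total = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                Task task;
                while ((task = queue.poll()) != null)   // empty queue means this worker is done
                    total.addAndGet(task.run());
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return total.get();
    }
}
```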
20. Transform the CountingPrimes program into a generic parallel solution that can be applied to a larger class of problems, without increasing the complexity of the solution.
- See the notes of the lesson on Good Programming Practice.
- Extract the parameters and methods that should be provided by the user. Define the abstract class with which the generic solution will work.
- What will the user have to provide in the case of counting the prime numbers?
- Change the CountingPrimes program so that it works with the abstract class.
- Clearly define the problem class for which your solution applies!
- There are several possible solutions. What are the differences between them?
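The kind of abstract class meant here might look like this (a sketch with invented names, and sequential for clarity): the user supplies only the predicate, while the framework owns the counting loop and would own the parallelization.

```java
public abstract class CountingProblem {
    // Provided by the user: the predicate that decides what is counted.
    protected abstract boolean matches(int value);

    // Owned by the generic framework: count the matches in [from, to).
    // A parallel version would split this range over threads or processes
    // without the user's subclass having to change at all.
    public final int count(int from, int to) {
        int c = 0;
        for (int i = from; i < to; i++)
            if (matches(i)) c++;
        return c;
    }
}

// What the user provides in the CountingPrimes case: just the primality test.
class PrimeProblem extends CountingProblem {
    @Override protected boolean matches(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }
}
```

The problem class this covers is "count the elements of an integer range that satisfy a user-defined, side-effect-free predicate"; stating such a contract explicitly is part of the exercise.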
21. Implement a message-passing generic framework for reduction operations.
22. Design a two-fold generic parallel solution for genetic algorithms.
- Define the classes and major methods. You do not have to
implement them, just sketch the solution.
23. Design a two-fold generic parallel solution for algorithms that can be pipelined.
- The problem class is defined as follows:
- We have nd data packages that have to be processed by np process objects.
- Each process object has to process the data packages in the right order, because its state is changed after the processing of each data package.
- Each data package can also be changed by the process. Therefore, it should also be processed by all process objects in the right order.
- A correct sequential algorithm looks like this:
Data[] data = getData();
Process[] processes = createProcesses();
for (Data d : data)
    for (Process p : processes)
        p.process(d);
- Draw an execution profile of the parallel solution to explain it.
- Define the classes and major methods. You do not have to implement them, just sketch the solution.
- What are the overheads and possible performance bottlenecks? How can they be countered?
- Actually, there are 2 possible parallel solutions. For which solution should one opt? On what does that depend?
- This was an exam question in January 2008.