Oct. 24, Assignment #2, due Nov. 1 in class, and Nov. 9 @ 11:59 pm (15%)

WARNING: This assignment has been used before and all previous submissions have been retained to compare with, so don't be tempted to copy a former student's code!

Software that does not compile, link, and execute on lockhart will receive 0%. The report does not have to be beautifully bound, but should be presentable and not be hand written.

Red/Blue simulation, Part 1

The "Red/Blue computation" is described in textbook exercise #10 on p. 111, and you need to read that. The algorithm is not fully described here!

Be aware that, despite some similarities, this is not a "game of life": the respective quantities of coloured cells do not change as the program runs. Rather, the initial colours migrate around the board. For Part 1, design a scalable parallel algorithm based on the factors we have studied:

Start with the "3-step approach" of Chap. 4, then make a first cut at the pseudocode.

Revisit the Successive Over-relaxation case study of Chap. 6, taking it as a model for applying considerations of work assignment, load balancing, granularity, and so on. See if you can improve your pseudocode.

The purpose of this part is to instigate some intelligent planning based on applying the theory and techniques we have studied. It will not be marked before you proceed to Part 2; however, the Instructor may give you feedback if it appears your understanding of the problem is seriously awry. Not getting feedback at this stage does not imply that the Instructor believes your pseudocode to be perfect! You're still responsible for program correctness.

Again, start coding your program as soon as you have some decent pseudocode to work from; don't wait until handing in Part 1 to start Part 2!

Clarification of the red/blue algorithm

Conceptually, all legal red moves must happen simultaneously for the entire board, followed by all legal blue moves. Of course, you have to produce that "conceptual effect" by means of the usual iterative program looping.

Consider the following 8-column row (@=space here, but you should print a real space):

[RRR@@BBB]

As a human in the 3rd dimension eyeballing the row, you can see that only one red is a candidate to move right. Thus, the legal result of the red half-step is:

[RR@R@BBB] correct

Don't go over the same row again and again to move reds into spaces that open up! That would produce:

[@@RRRBBB] wrong

Clarification of stopping conditions

The two conditions should be tested after each full step, and the density is tested first:

The maximum colour density condition is looking for a point where enough red or blue cells have shifted together so that some tile has become more than c% red or blue. We'll define this unambiguously so that our programs all stop at the same time when given the same initial board:

If a tile is t*t cells, then the maximum number of coloured cells for continuing the computation = t*t*c/100, all done in integer arithmetic .

For example, if tiles are 8 cells wide, and c is 70%, then up to 8*8*70/100 = 44 red cells or blue cells are allowed, but 45+ of either red or blue in any tile will stop the program.

Thinking through the effect, we know that random initial assignment means that any board will have around 1/3 of each colour cells overall, though for small tiles, their coloured fractions could vary widely. Therefore, specifying c<<33 will likely stop the computation immediately, while c>>33 will rarely be achieved--except with small tiles. So for large tiles, the useful range of c is probably a narrow band above 33. Try it and see.

Past experience has shown that for a given combination of board and tile sizes, there's a certain c% that can be reached "soon", meaning in under 200 iterations. Call that percentage "D" for max. density. If you want to go above D, by even 1%, you won't reach it no matter how long you run. It would be very interesting to find some formula that gives D as a function of board and tile sizes.

The second condition, reaching a maximum number of full steps, is straightforward. It is only checked if the density condition was not triggered.

Pthreads implementation, Part 2

Code your solution to run on lockhart using pthreads in Intel Parallel Studio XE 2016 according to the program specification below. You'll find the following APIs in Public Documents\win32(or 64)\include :

pthread.h: (scroll down to "Methods") thread attributes, mutexes, barriers, condition variables, thread specific data, spinlocks, read-write locks

semphore.h: semaphores

sched.h: probably nothing useful for this program

The easiest way to access documention for these POSIX functions is to Google "man function " or type that at a Linux command prompt. You can also try "man pthreads" but it may be too high-level.

You may initially develop/debug on Linux if you like. In that case, if you want to use barriers, you may need to define a feature macro. Since it causes a warning in Windows pthreads (though it does not seem to do any harm), you can avoid the warning like this:

#ifndef _WIN32 // if we're not on Windows
#define _POSIX_C_SOURCE 200112L // get barriers on Linux
#endif

You can assume that the maximum no. of threads is 16, for the purpose of defining data structures. The TA will not run your program with >16 threads. However, you may wish to allow for more threads, to observe whether "oversubscribing" the 16 cores helps performance or not. You may also choose to spawn additional threads that are not for computation and are not "bound" to cores.

The program is called rbs , and takes 5-7 command line arguments, in any order. Each one (except for "i") is made of a leading character followed by an integer N. The first 5 are required.

pN: no. of processors (threads) to use

bN: width of board >1, so board size is N x N cells

tN: width of one overlay tile, each N x N cells (b mod t=0)

cN: max. colour density in integer percent, 1-100 (stopping condition)

mN: max. no. of full steps (additional stopping condition)

sN: optional random seed

i: optional interactive mode switch

Print an error and abort if p<1, b <2, the tile width t is not a factor of the board width b, or the colour percentage is out of range. If a seed is specified, use it with srand( seed ) to generate the initial board layout (specified below), otherwise, obtain a seed by reading the system clock. Specifying a seed should result in the same output every time for the same set of numerical arguments on the same platform. But do not expect that Intel's rand() on lockhart will give the same pseudorandom sequence as gcc's rand() on your Linux box. Therefore, lockhart-produced outputs will be considered the "gold standard" for correctness checking.

"t" is an interesting parameter: With t1, the program will terminate immediately. (Why?) With tN=bN, running the program makes no difference to the colour density, so it will either terminate immediately or else run until the max. steps are exhausted. Also, note that "c100" effectively deactivates the colour stopping condition.

Initialize the board by calling rand()%3 for each cell in row major order. The values are: 0=white, 1=red, 2=blue. However, you do not have to fill the cells with integers 0, 1, and 2; you can think of another (more efficient?) encoding for the white, red, and blue if you wish, since we only care about the textual output.

When started in automatic mode, the program will run until it satisfies either stopping condition, and then print the final board contents on the file redblue.txt . (Therefore, when a seed is specified, the output file can be diff'ed against a known good file. The Windows command like "diff" is comp . See "comp /?" for possible arguments.)

For this assignment, the part that we want to time is just the simulation, excluding board initialization and final output. By disregarding a major inherently serial part in this way, more obvious speedup should be observed. Use the StartTime() and EndTime() functions supplied in Public Documents\pa-demo\PrimeSingle\wallclock.h. You need to enable OpenMP support (see header of .h file) or you will get linker errors.

Follow this output format precisely: Print one row per line, terminated by \n, using the following character map: white=' ', red='>', blue='V' (uppercase vee). Also print, as the last line of the file and on stdout, the command line arguments, the number of (full step) iterations completed, the highest tile colour density in the final board (integer percent), and wallclock execution time (to at least 2 decimal places), all on one line.

The purpose of interactive mode is to allow you to eyeball the results of one (half) step, and so to conveniently check your program's correctness. It is understood that interactive mode is for debugging and amusement using small boards. It is only worth 1 mark.

When started in interactive mode, the program will initialize the board, print a prompt and then loop accepting commands:

<Enter>: i.e., empty input; perform one full step

#: perform (integer) # full steps

h: perform one half step

c: continue in automatic mode to normal termination and output file

x: output current board into file and exit

The <Enter> and # commands should "resynchronize" by doing a blue half step if preceding "h" commands ended on a red half step. For example, if I initially enter "h" then "5", the program will do one red half step, followed by one blue half step and 4 full steps.

If a command cannot be interpreted, print the word "Error" and reprompt. There is no need for fancy error messages.

After any stepping command, print the board followed by the same numerical status as at the end of the program using iterations so far, and current maximum of tiles' single-colour percentage. You can easily obtain coloured shell output using the cross-platform functions in Public Documents\putcolour.c . They work in both the Windows and Linux environments. See putcolour.h for interface.

Part 2 Deliverables

Source code via electronic submission using Subversion under tag cis3090-A2 :

1. Pthreads red/blue simulation program "rbs" in Visual Studio "solution" folder. Each source file that you create must include a file header containing your name and student number, the filename, and a brief description of the purpose of the file. Submissions lacking proper file headers will be penalized.

2. Accompanying Visual Studio files in the same directory, so that the solution can be readily opened and built. The odd-named files that are important are: .sln, .vcxproj, .vcsproj.user, and .vcxproj.filters. Please avoid adding/committing in svn the numerous large files produced by Parallel Inspector and Parallel Amplifier! We also don't need the subdirectories named Debug and Release.

Documentation, printed and delivered to class:

1. Write your impressions of working with Parallel Studio and pthreads.

2. Describe any changes you made to your application's parallel architecture as you implemented the pseudocode and tuned the program. Justify the changes based on the concepts we've studied (e.g., reducing false sharing, load balancing, etc.).

3. Make a timing graph of execution time (Y axis) vs. number of cores (1-16 with at least 8 data points) running the Release (optimized O2) version using the set of arguments given directly below. Also make speedup and efficiency graphs based on the same data. Draw the linear speedup line on the speedup graph. Write some observations about the graphs. Given the results, how scalable was your parallel architecture?

p[varies 1-16] b875 t125 c100 m2000 s3090

c% is intentionally set so that the full 2000 iterations run.

4. For this assignment, you don't need to rerun and graph every trial with the unoptimized Debug version, but do enough testing with O0 so you can report whether compiler optimization made a difference and to what extent.

Part 2 Bonus marks

Bonuses of 15% (=entire assignment mark * 1.15), 10%, and 5% will be given for the 3 fastest programs (disregarding initialization and file output, as explained above). A benchmark set of arguments to be announced on CourseLink will be applied to all submissions, and the programs that compute the correct output file in the shortest times on an unloaded system. Timing will be determined by the average of 3 runs.

NOTE: Rematch provision! The 2nd and/or 3rd place runners-up may request a rematch. In that event, block time (exclusive use of machine) will be available for further profiling and tuning prior to the runoff match using a set of arguments with different "TBA" values. The bonuses will be redistributed among the 3 contenders according to the new times.

Marks for Assignment #2 Part 2

For the software, mark is 0-6 based chiefly on functionality:

5 = Parallel program that produces correct results and reports timings for any tested args

+1 for interactive version in colour

1 = Parallel program that runs cleanly, and is coded to address the problem, but results are all incorrect or missing

0 = Fails to compile and link, or crashes without producing any results

In addition, 1/2 mark penalties will be deducted for specific "sins": numerous compiler remarks/warnings, file headers missing/incomplete, Visual Studio "solution" folder incomplete, program name wrong, arguments mishandled, output filename incorrect, etc.

The documentation starts with 4 marks, then deduct 1 mark for any item missing, or 1/2 mark for any lame or poorly-done item. If you didn't have any working software to graph, you'll lose at least 1 more mark to be fair to those who conducted timing and produced their graphs.

Gardner Home | Course Home