Nov. 17, Assignment #4, due Nov. 30 @ 11:59 pm (10%)

Marking: For this assignment, the software is worth 60% and the documentation 40%. Software that does not build and execute on lockhart will receive 0%. The report does not have to be beautifully bound, but should be presentable and not be hand written.

Note: Special thanks to Dave McCaughan for the basis of this assignment, and the parallaldo generator program.

Where's Parallaldo? parallel pattern searching

"Where's Waldo?" is a popular series of books (ostensibly for children) in which they are challenged to find a particular character in a very complex picture. From a computational point of view, this can be seen as a form of template matching, in which we attempt to locate a region of a larger space which matches some small template example. This is easiest to visualize with images where the problem becomes one of finding a region of a larger image which matches a small exemplar pattern.

For this assignment, you will implement a parallel version of a template matching algorithm in which a large number of images are scanned in order to locate a smaller number of template images (called parallaldos to make our abuse of the real name of this book complete). To simplify the problem, we will use basic black-and-white images that can be represented in plain text as a two-dimensional array of 1-or-0 "pixels."

Example

In the example below, the parallaldo is found in two places (red) in the image, the second being a 90-degree clockwise rotation of the template.

Note how the findings are reported: (y,x,r) where (y,x) denotes the location in the image of the (unrotated) template's upper left corner, and r being the degrees of clockwise rotation. The convention of the first coordinate being the row (here numbered starting from 1) is normal for graphics and matrix data. Of course, you can store the data in memory however you want.

Also note the opportunity for ambiguity: If the parallaldo is symmetrical--which is true for the example--it could be validly matched in any of 2 or even 4 rotations. That is, (1,1,0) is the same finding as (3,4,180), and (6,19,90) is the same as (9,17,270).

For the purpose of further simplifying the program, we'll make the following rules:

1. There won't (intentionally) be more than one parallaldo in a given image. (If someone finds a test case with more than one, it will be replaced.)

2. Symmetrical parallaldos will be avoided, so that the rotation ambiguity does not arise.

These rules will insure determinism: that there is only one correct result for any pair of parallaldo and image files, regardless of how you write your search algorithm. Furthermore, if your program finds a match, it can terminate its search immediately without looking for additional matches or other rotations of the same parallaldo in the current image.

Data specification

Data will be furnished in two directories, one containing candidate parallaldos for which to search, the other images in which to search. Your program will search for every combination of parallaldo and image pairs.

The directories will be populated with text files consisting of a header line giving the dimensions of the rectangular image in R rows x C columns, followed by R rows of C 1s and 0s. Your program may assume that all files in the specified directories are intended as input.

To help with testing, you can find in Public Documents on lockhart the following files:

parallaldos: sample templates to search for (*.txt)

targets: sample images to be search in (*.img)

results.txt: correct output for above sample files

parallaldos.tgz (19 MB): tarball of the above in case you want to test on another computer

parallaldo_gen.c : program to generate random test cases (read the comments for how to run)

Program specification

NOTE: Your program should be capable of running in serial fashion when it is started with only one thread. This is not difficult to arrange if you plan it from the start. We need to have timing from a serial version in order to calculate speedup.

The search algorithm is completely up to you. It can be as "brute force" or as clever as you want. There are several places where parallelism can be exploited (suggestions below).

The program should be called aldo , and take 3 command line arguments:

aldo aldodir imagedir cores

aldodir is the directory populated with parallaldo templates

imagedir is the directory populated with images to search in

cores is the total number of cores to utilize (see below)

Print an error message and abort if the arguments are invalid, or if either directory is empty.

The cores argument is intended to reflect the number of cores in use for computation. Depending on your work allocation methodology, your main thread may be doing some non-CPU intensive I/O and supervision and then just waiting for the worker threads to join, in which case it does not need to be counted as a core. This means that with an argument of 8, for example, you could create 8 worker threads, or make a thread pool of size 8. This cores argument will later be used as the X axis on your graphs, so you should be clear what it represents.

What about cores > 16? This can happen if you are intentionally "oversubscribing" the physical cores. A legitimate use for this in a scalable program could be appointing threads to read files (which assumes they will mostly be blocked on I/O and hardly occupy a core). If you do all your file reading in the main thread, this comment does not apply, and you should only oversubscribe to see for yourself if/how it affects the performance.

The output of aldo is printed on stdout, where it can be captured in a log file using the command redirection operator ">". Start each line of result output with the character "$" so that meaningful output can be readily grepped out from among any other lines printed (e.g., in case you inserted printf's for debugging).

Each match is printed on a separate line using the following format exactly (to facilitate automatic verification):

$parfile imfile (y,x,r)

where parfile and imfile are the bare filenames (no directory path) of the parallaldo and its matching image, respectively, and (y,x,r) are bracketed integers printed without extra spaces or leading zeroes. Thus, the result line will have only two spaces. Nothing is printed for image files that have no parallaldo match. The order of output lines is not significant; it will be sorted for verification purposes.

End with a report of total elapsed time in seconds to at least 2 decimal places. This should be the final line printed on stdout.

The sample file results.txt is in the required format (except for missing the elapsed time report).

Suggestions for parallelizing

You're under no obligation to follow any of these suggestions. If you can think of a faster way to do something, go for it.

One way to structure the program is to start by reading all the parallaldos into memory (since there aren't a lot of them and they are small). The program can then have a main loop that reads in each target image (some of which are quite large), and an inner loop that searches for each parallaldo in turn. Note that when you find a match, you can break out of the inner loop and go on to the next image.

Opportunities for exploiting data parallelism include:

The main image file loop (these iterations are independent of one another).

The inner parallaldo search loop (these are independent, but once one match is found, further searching in that image can be aborted; this problem characteristic might have the potential to yield superlinear speedup if less work is done than for serial search).

The loops of the pixel search algorithm (going through the rows and columns of each image).

Consider carefully how you can overlap (slow) disk I/O operations with (fast) computation. One popular technique is known as "double buffering." This is where you are reading one image file into a buffer while the processor(s) are searching through another image already read in. There are a couple ways of approaching this. One is to use an asynchronous read (you start the read, then check the status later to see if it finished); another is to devote a separate thread to do a normal synchronous read (while it blocks, other threads are carrying on so no CPU time is wasted).

Keep load balancing in mind, because the image sizes vary greatly. Furthermore, if you find a match, the search could terminate quickly; but if you don't, you'll have to run right through the entire image.

Anyway, the trick is to find useful work to keep all 16 cores busy, thus reducing idle time. Generally speaking, the higher/coarser levels of parallelism will be easier to program, so if you can occupy all the cores that way it's good enough, and there may be nothing to be gained by parallelizing at the lowest/fine-grained level.

Java implementation

Code your solution to run on lockhart using Java 8 in NetBeans according to the program specification below. Don't forget that Parallel Amplifier claims to work with Java. It might help you optimize your program.

You can assume that the maximum no. of cores is 16, for the purpose of defining static data structures. As mentioned above, you may choose to spawn additional threads that are not for computation.

For this assignment, we will time the entire execution from start to finish. There are several ways to carry out execution timing in Java. We will use System.currentTimeMillis() which reports the current time to milliseconds, more than adequate for our needs. Save/subtract those values to calculate intervals. Since Java runs are subject to arbitrary garbage collection "pauses," be sure to run each set of arguments several times taking the average or median, and you may throw out wild outliers that could be due to garbage collection.

Deliverables

Source code via electronic submission using Subversion under tag cis3090-A4 :

1. Java program "aldo" in a NetBeans "project." Each .java source file that you create must include a file header containing your name and student number, the filename, and a brief description of the purpose of the file's class(es). Submissions lacking proper file headers will be penalized.

2. Accompanying NetBeans files in the same directory, so that the project can be readily opened and built. You can commit all the files and subdirectories in the project-level "aldo" directory, except that we don't need the subdirectory named build .

Documentation, printed and delivered to class.

1. Write your impressions of working with Java threading. Compare and contrast with pthreads and OpenMP.

2. Describe your application's parallel architecture and work distributions schemes. Explain how you went about utilizing threads in Java. Did you go for the "classic" approach or use a higher-level library?

3. Make a timing graph of total execution time (Y axis) vs. number of cores (1-16 with at least 8 data points) using the sample directories provided in Public Documents. Also make speedup and efficiency graphs based on the same data. Draw the linear speedup line on the speedup graph. Write some observations about the graphs. Given the results, how scalable was your parallel design's performance?

Bonus marks

Same terms as for A2, but no rematch provision. For contest purposes this time, we'll take 5 timings, throw out the worst one, and average the rest.

Marks for Assignment #2 Part 2

For the software, mark is 0-6 based chiefly on functionality:

6 = Parallel program that produces correct results and reports timings for any tested args

1.5-5.5 = Same as 6, but with deductions according to deficiencies

1 = Parallel program that runs cleanly, and is coded to address the problem, but results are all incorrect or missing

0 = Fails to build, or crashes without producing any results

In addition, 1/2 mark penalties will be deducted for specific "sins": numerous Java warnings, file headers missing/incomplete, program name wrong, Java errors printed at run time.

The documentation starts with 4 marks, then deduct 1 mark for any item missing, or 1/2 mark for any lame or poorly-done item. If you didn't have any working software to graph, you'll lose at least 1 mark to be fair to those who conducted timing and produced their graphs.

Gardner Home | Course Home