Sep. 22, Assignment #1, due Oct. 4 in class, and Oct. 12 @ 11:59 pm (10%)


Marking: For this assignment, the software is worth 60% and the documentation 40%. Software that does not compile, link, and execute on mako will receive 0%. The report does not have to be beautifully bound, but it should be presentable and not handwritten.

National Collision Database

A theme in government these days is "Open Data"--this is the "intentional" kind of open, not the Edward Snowden/WikiLeaks kind!--and so Canada has a website data.gc.ca. One can find there nearly 200,000 collections of data, some very large. The one we're going to use is called the NCDB (more info from Transport Canada and web query interface here). The full NCDB is "a database containing all police-reported motor vehicle collisions on public roads in Canada," but they don't give the public access to all that. Instead, they've posted a subset including "selected variables (data elements) relating to fatal and injury collisions" for the 14-year period 1999-2012. That is, collisions not resulting in at least an injury are not listed. Furthermore, they don't give enough details to pinpoint the date or location of the collision. Nonetheless, we can still ask some interesting questions about this data, and the size of the file is large enough to warrant HPC treatment.

The CSV file that can be downloaded is around 315 MB and contains over 5.2 million fixed-length records. I placed a copy of it on mako, including the zipped version in case you want to transfer it to your own computer. The first line of the file contains the column headings, and their explanations are given in the accompanying Data Dictionary, which you should grab for yourself and read (sorry, there is no French version!).

The organization of the ASCII text file is a bit tricky, most likely because it results from eliminating fields that they didn't want to disclose (and I don't guarantee I figured out all the tricks). Where I write "collision" below, I mean "involving an injury or fatality," but there's no sense repeating it.

  • Each record (=one line) represents one person who was involved in one collision. The exception is where a "dummy person" has been created to represent an empty parked car.
  • A record identifies its associated collision (by year, month, day of week 1-7, and hour 0-23) and associated vehicle (one of the collision's N vehicles)--the exception being a pedestrian, cyclist, etc., who has no vehicle.
  • Thus, the number of records for a single collision varies from one up to the number of people (and parked cars) involved. Only one person had to be injured or killed to get the collision into this dataset; the other people may be unharmed.
  • There is no unique identifier or "key" for a collision! The extent of a single collision's records must be inferred by observing a change in the association data (day/hour and/or vehicle).
    Data processing strategy

    A reasonable, scalable way to use a cluster of W worker processing elements (PEs) on this dataset is simply to divide it evenly into W chunks and put each PE in charge of answering queries about the data it owns, which should fit into its memory (so the big file only needs to be read once). This strategy minimizes disk I/O and puts the emphasis on computation. We don't want to split one collision's records between workers, so a little care must be taken at the chunk boundaries.
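    As a starting point for the even division, the nominal boundaries can be computed with integer arithmetic and then nudged so that no collision straddles two workers. This is a minimal sketch; R (total data records) and W (number of workers) are placeholders for values you compute at run time.

        /* Nominal (pre-adjustment) chunk boundaries: worker i's portion is
         * [nominal_start(R, W, i), nominal_start(R, W, i + 1)).  Each boundary
         * is then moved forward to the next record that begins a new collision. */
        long long nominal_start(long long R, int W, int i)
        {
            return R * i / W;
        }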

    Data specification

    On mako in /work/wgardner/a1 there's a single file called NCDB_1999_to_2012.csv, which for testing purposes you may wish to cut down in size. In principle, you can open the file in a spreadsheet application. The program will accept an argument giving the input filename with an absolute or relative path. The fastest-access place you can put your copy of input files is /scratch/login (where login is your username). /scratch is per-cluster storage whose contents may automatically disappear after 2 months. Every node in the cluster can access files in your /home, /work, and /scratch directories. (There's also a per-node /tmp directory, but I doubt you'll need to use that.)

    Don't forget that the first line represents the column headings and is longer (145 bytes). After that, every line is the same length (61 bytes). Each line is terminated by two end-of-line (EOL) characters (CR+LF) in the Windows/DOS style. Avoid hardcoding known constants like 145 and 61; instead, use #define or "const int" to make them symbolic. When you make smaller files for testing, it would be smart to paste in the header line; otherwise your counts may be off.
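    For example, the layout constants and the record-number-to-byte-offset calculation might look like the sketch below. It assumes the 145- and 61-byte lengths do not include the two EOL characters; verify that against the actual file before relying on it.

        #define HEADER_LEN 145   /* header line, excluding EOL -- verify */
        #define RECORD_LEN 61    /* data line, excluding EOL -- verify   */
        #define EOL_LEN    2     /* CR+LF */

        /* Byte offset of data record recno (0-based), suitable for fseek. */
        long record_offset(long recno)
        {
            return (HEADER_LEN + EOL_LEN) + recno * (long)(RECORD_LEN + EOL_LEN);
        }

        /* Number of data records implied by a file of filesize bytes. */
        long record_count(long filesize)
        {
            return (filesize - (HEADER_LEN + EOL_LEN)) / (RECORD_LEN + EOL_LEN);
        }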

    Program specification

    The program should be called bang and take at least one command line argument:

    sqsub ... bang csvfile queries [ Pilot options ]

  • csvfile includes the relative or absolute path to the file. Make sure that your program accepts both forms, a relative and absolute path! This is for the sake of the TA who will run it.
  • queries is a series of numbers (which can include repeats) indicating which queries will be performed, or if no numbers are supplied, the program just loads up the database and prints the number of records and collisions read in.
  • These would be some valid commands:

    bang /work/jdoe/short.csv 1 2 3 4 5
    bang ../data/my.csv 3 -pisvc=d
    bang just2000.csv
    bang /work/wgardner/*.csv 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

    Note that when you call PI_Configure, the Pilot options (if any) will be removed from argc/argv, after which you can check for the arguments you expect to be typed. Print an error message and abort if the fopen fails. You may ignore any invalid query arguments. (Don't forget that until reaching PI_StartAll, every node executes the same statements, so you don't want them duplicating work unless that makes logical sense. You certainly don't want 100 nodes all printing "Cannot open file.")
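    A minimal sketch of collecting the query codes once PI_Configure has stripped its own options is shown below; the function name and the MAX_QUERIES cap are placeholders, and invalid arguments are silently skipped as the spec allows.

        #include <stdlib.h>

        #define MAX_QUERIES 64          /* arbitrary cap for this sketch */

        /* Collect valid query codes (1..5) from argv; argv[1] is csvfile. */
        int collect_queries(int argc, char *argv[], int queries[])
        {
            int n = 0;
            for (int i = 2; i < argc && n < MAX_QUERIES; i++) {
                char *end;
                long q = strtol(argv[i], &end, 10);
                if (*end == '\0' && q >= 1 && q <= 5)
                    queries[n++] = (int)q;
                /* anything else is silently ignored, per the spec */
            }
            return n;    /* 0 => just load the database and report counts */
        }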

    The output of bang is printed by PI_MAIN on stdout, where it can be captured in a text file specified using the -o option of sqsub. Start each line of required result output with the character "$" so that meaningful output can be readily grepped out from among the lines printed by Pilot, MPI, and mako's job scheduler. You can print anything additional that you like, as long as the line doesn't start with "$". Detailed output examples are below. The format must be followed exactly to enable automatic checking!

    Your program will utilize Pilot processes in a master/worker pattern. I suggest the following division of labour after PI_StartAll:

    PI_MAIN:

  • It knows how many workers were created; call that number W. Get the input filename from argv[1] and try to open it. Find out its length in bytes ("man fseek"), from which you can calculate its number of data records, call that R. Don't forget about the header record.
  • Essentially, you want to assign each worker R/W records, but it won't be that neat due to possible multi-record collisions. For each worker (other than the first one, which is a special case), seek to the record where you would like its portion to start, then read forward until you find a record that starts a new collision. Make a note of that record's number as the starting point for that worker. Repeat until you've discovered a logical starting point for each worker.
  • Using PI_Broadcast, ship all the workers their starting record numbers (broadcasting also lets each worker know the next one's starting point, hence its own end of data).
  • Then wait for the workers to finish reading the file, and get back the number of records and number of collisions each read in. Collect these using PI_Reduce to add them up, then compare the total records to what you were expecting as a sanity check (PI_MAIN doesn't know the correct number of collisions, only the total records from the size of the file). If the sanity check fails, call PI_Abort to halt the program with an error message.
    Worker(i):

  • PI_Read its starting record number, then open the file (filename comes from argv, now known to be good) and fseek to that file position. Read till the end of its portion, storing every record in memory (data structure is up to you), also recording the starting index of each collision in another array. Now the worker will be able to either scan the data by collision or by vehicle/person, depending on what kind of query it gets.
  • PI_Write back to the master the number of records read in and number of collisions found in those records. Since PI_MAIN is calling PI_Reduce, don't forget that the corresponding PI_Write formats also need the same reduction operator "%+/..."
  • Hint: Be sure to make your initial debugging runs using -picheck=2 or 3, so that Pilot will check your sending/receiving format strings to verify that they match up properly (level 2), and that your input arguments (for PI_Read and PI_Reduce) are really variable/array locations and not values (level 3, that's a typical newbie mistake for fscanf that can also happen with PI_Read/Reduce).

    Doing the above gets your program into a fully initialized state where the data is sitting in W workers' memories. Then the fun starts by answering some questions. From this point, Master is driving the processing by interpreting the series of numeric query codes from the command line and sending them to the workers using PI_Broadcast. Meanwhile, each worker is waiting at PI_Read for a query code. It processes the query, sends back the results, and repeats until receiving some kind of "done" code from Master, at which point it returns from its work function. Finally, Master calls PI_StopMain to coordinate the shutdown, and returns from main().
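    The rough structural sketch below shows one way to wire up this division of labour. The process/channel/bundle creation calls, format strings, and bundle-type constants (PI_CreateProcess, PI_CreateChannel, PI_CreateBundle with PI_BROADCAST and PI_REDUCE) are assumptions modelled on the Pilot labs; check pilot.h for the real prototypes and format syntax before copying anything. load_portion() and answer_query() stand for your own code, and 0 is used here as the "done" sentinel.

        #include <stdio.h>
        #include <stdlib.h>
        #include <pilot.h>

        #define MAX_W 64

        static PI_CHANNEL *to_worker[MAX_W];    /* master -> worker i           */
        static PI_CHANNEL *from_worker[MAX_W];  /* worker i -> master           */
        static PI_BUNDLE  *bcast, *sum;         /* broadcast and reduce bundles */
        static int W;                           /* number of workers            */

        /* Stubs standing in for your own code: */
        static void load_portion(int i, long first, long end, long *nrec, long *ncoll)
        { /* ... fseek, read, parse, store records ... */ *nrec = *ncoll = 0; }
        static void answer_query(int i, int q)
        { /* ... compute the answer and PI_Write it back ... */ }

        static int worker_fn(int i, void *arg)
        {
            long start[MAX_W + 1], nrec, ncoll;
            int q, k;

            /* Receive all W+1 boundaries sent by PI_Broadcast below. */
            for (k = 0; k <= W; k++)
                PI_Read(to_worker[i], "%ld", &start[k]);

            load_portion(i, start[i], start[i + 1], &nrec, &ncoll);

            /* Reduction operator must match the master's PI_Reduce format. */
            PI_Write(from_worker[i], "%+/ld %+/ld", nrec, ncoll);

            for (;;) {                          /* query loop */
                PI_Read(to_worker[i], "%d", &q);
                if (q == 0) break;              /* "done" sentinel (our choice) */
                answer_query(i, q);             /* PI_Writes its results back   */
            }
            return 0;
        }

        int main(int argc, char *argv[])
        {
            int i, np = PI_Configure(&argc, &argv);  /* total Pilot processes    */
            W = np - 1;                              /* one reserved for PI_MAIN */

            for (i = 0; i < W; i++) {
                PI_PROCESS *p = PI_CreateProcess(worker_fn, i, NULL);
                to_worker[i]   = PI_CreateChannel(PI_MAIN, p);
                from_worker[i] = PI_CreateChannel(p, PI_MAIN);
            }
            bcast = PI_CreateBundle(PI_BROADCAST, to_worker, W);
            sum   = PI_CreateBundle(PI_REDUCE, from_worker, W);

            PI_StartAll();                      /* only PI_MAIN continues here */

            /* 1. Open argv[1], find its size, compute start[0..W] as described,
             *    then ship each boundary: PI_Broadcast(bcast, "%ld", start[k]);
             * 2. Collect the counts:
             *        long tot_rec, tot_coll;
             *        PI_Reduce(sum, "%+/ld %+/ld", &tot_rec, &tot_coll);
             *    and sanity-check tot_rec against the file size.
             * 3. For each query code: PI_Broadcast(bcast, "%d", q), PI_Read the
             *    workers' results, combine and print them ("$Qn,...").
             * 4. Broadcast the "done" sentinel so workers return, then: */
            PI_StopMain(0);
            return 0;
        }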

    Note that the approach described above has two distinct phases: the I/O phase, where the W workers are each reading/storing their portion of the file, and the query-answering phase, which is pure computation. We want to record the time taken by each phase separately so that we can see whether multiple processors help the I/O or only the computation, and to what degree of speedup. Since the I/O goes through the OS file system and eventually down to a single disk file, it is hard to guess how much it can be parallelized.

    Queries are given below by query number, along with hints for processing and the required output (pattern first, followed by a sample).

    GENERAL RULE: If the specific field you need to evaluate for a query is marked "N", "U" (when it means Unknown), or "X", then do not include that record in the query results! This does not mean you should ignore every record that has "N", "U", or "X" somewhere in it, because you may not need that data to answer the present query.
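    A tiny helper along these lines keeps the rule in one place; the literal codes below are placeholders, since each field's unknown/not-applicable codes (and their widths) are defined per field in the Data Dictionary.

        #include <string.h>

        /* Returns 1 if this field's value disqualifies the record from the
         * current query.  The codes listed are placeholders -- check the Data
         * Dictionary for the codes each field actually uses. */
        static int field_excluded(const char *field)
        {
            static const char *bad[] = { "N", "NN", "U", "UU", "X", "XX" };
            for (size_t k = 0; k < sizeof bad / sizeof bad[0]; k++)
                if (strcmp(field, bad[k]) == 0)
                    return 1;
            return 0;
        }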

    Query 1 is probably the hardest. Once you get that pattern rolling, the rest should be fairly routine.

    Query 1: What is the worst month of the year for collisions (by quantity, by fatalities)?

    Let each worker fill a 14 x 12 x 2 array: 14 years, 12 months, total collisions and total fatalities for that month. Master can (1) combine the arrays and print out for each year the month with most collisions and the month with most fatalities; and (2) further reduce the data to a 12 x 2 array and print out (summing all years) the month with most collisions and the month with most fatalities.
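    For step (2), once the workers' 14 x 12 x 2 arrays have been element-wise summed and collapsed into a 12 x 2 all-years array, picking and printing the worst months might look like this sketch (the array layout is an assumption: [m][0] = collisions, [m][1] = fatalities for 0-based month m).

        #include <stdio.h>

        /* totals[m][0] = collisions, totals[m][1] = fatalities for month m. */
        static void print_overall_worst(const long totals[12][2])
        {
            int worst_coll = 0, worst_fatal = 0;
            for (int m = 1; m < 12; m++) {
                if (totals[m][0] > totals[worst_coll][0])  worst_coll = m;
                if (totals[m][1] > totals[worst_fatal][1]) worst_fatal = m;
            }
            printf("$Q1,9999,%d,%d\n", worst_coll + 1, worst_fatal + 1);
        }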

     

    $Q1,year,highest_collision_month,highest_fatality_month //for each of 1999..2012
    $Q1,9999,highest_collision_month,highest_fatality_month //over all years

    $Q1,1999,6,12
    $Q1,2000,11,1
    ...
    $Q1,2012,4,1
    $Q1,9999,6,1

    Query 2: Who is more likely to be killed in a collision, men or women?

    Let each worker run through its persons with fatal P_ISEV and add up the number of men and women, respectively. Master will print the grand totals, call them Km and Kw. Then the respective probabilities will be Km/(Km+Kw) and Kw/(Km+Kw). Print those to 2 decimal places.

     

    $Q2,men_killed,women_killed,P(man killed),P(woman killed)

    $Q2,12046,8934,0.57,0.43

    Query 3: What was the largest number of vehicles ever involved in a single collision during the 14 years, and when did it take place?

    Let each worker run through its collisions and report the maximum number of vehicles C_VEHS along with the collision's date. Master will print the max. of all workers. It can do this with a single "%max/" reduction if you use a classic programmer's trick: Pack the number of vehicles (which can't exceed 99) along with the date into a single integer like so: VVYYYYMMD. This way the date goes along for the ride, and after the max. is determined based on the high-order digits, unpack the date for printing.
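    The packing trick boils down to a couple of integer operations; the sketch below packs and unpacks VVYYYYMMD (the day of week is a single digit, so the whole value fits comfortably in a 32-bit int).

        #include <stdio.h>

        /* Pack C_VEHS (at most 99), year, month, and day-of-week as VVYYYYMMD
         * so one max-reduction carries the date along with the vehicle count. */
        static int pack_q3(int vehs, int year, int month, int day)
        {
            return ((vehs * 10000 + year) * 100 + month) * 10 + day;
        }

        static void print_q3(int packed)
        {
            int day   = packed % 10;
            int month = (packed / 10) % 100;
            int year  = (packed / 1000) % 10000;
            int vehs  = packed / 10000000;
            printf("$Q3,%d,%d,%d,%d\n", vehs, year, month, day);
        }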

     

    $Q3,most_vehicles,year,month,day

    $Q3,26,2005,12,3

    Query 4: In a year on average, how many people typically wreck their new car (defined as vehicle year >= collision year), and how old was the average vehicle in a collision over the 14 years?

    Let each worker run through its vehicles, reporting the number whose V_YEAR >= C_YEAR, the total number of vehicles, and the sum of vehicle ages as C_YEAR-V_YEAR+1 (avoids counting new model vehicle as -1). Master will calculate total "new" cars/14 (to get an annual average), and total ages/total vehicles (to get average age). Print the average age to 1 decimal place. Don't forget that vehicles are repeated with each person! Just include each unique vehicle one time.
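    One way to count each vehicle once is to walk a collision's records and only tally when the vehicle association changes. The sketch below assumes records for the same vehicle are adjacent within a collision and uses placeholder names (struct record, and v_seq for whatever vehicle-sequence field the Data Dictionary defines, with 0 meaning "no vehicle"); records whose V_YEAR is unknown should also be skipped per the general rule.

        struct record {          /* placeholder: only the fields this sketch needs  */
            int v_seq;           /* vehicle sequence within the collision, 0 = none */
            int v_year, c_year;  /* vehicle model year, collision year              */
        };

        struct q4_tally { long new_cars, vehicles, age_sum; };

        /* Accumulate Query 4 tallies over one collision's nrec records. */
        static void tally_q4(const struct record *rec, int nrec, struct q4_tally *t)
        {
            int last_seq = -1;
            for (int k = 0; k < nrec; k++) {
                if (rec[k].v_seq <= 0 || rec[k].v_seq == last_seq)
                    continue;            /* pedestrian, or vehicle already counted */
                last_seq = rec[k].v_seq;
                t->vehicles++;
                t->age_sum += rec[k].c_year - rec[k].v_year + 1;
                if (rec[k].v_year >= rec[k].c_year)
                    t->new_cars++;
            }
        }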

     

    $Q4,avg_wrecked_new_cars,avg_wreck_age

    $Q4,478,4.2

    Query 5: Where is the most likely place to have a collision?

    Let each worker run through its collisions and add up each different type of road configuration C_RCFG, returning a 13-element array (let index 0 stand for "QQ" Other). Master prints the grand totals: code of most likely place, total code 01, code 02, ..., code 12, code QQ.
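    Mapping the two-character C_RCFG code onto the 13-element array can be isolated in a small helper, sketched here; it returns -1 for codes the general rule says to exclude.

        /* "QQ" (Other) -> index 0, "01".."12" -> 1..12, anything else -> -1. */
        static int rcfg_index(const char code[])
        {
            if (code[0] == 'Q' && code[1] == 'Q')
                return 0;
            if (code[0] >= '0' && code[0] <= '9' && code[1] >= '0' && code[1] <= '9') {
                int v = (code[0] - '0') * 10 + (code[1] - '0');
                return (v >= 1 && v <= 12) ? v : -1;
            }
            return -1;    /* "UU", "XX", etc. -- excluded per the general rule */
        }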

     

    $Q5,most_likely_place,total_place_01,total_place_02,...,total_place_12,total_place_QQ

    $Q5,11,135,5890,2047,120,548,149,188,236,67,7302,14762,230,23

    Timing reports:

  • PI_MAIN will use PI_StartTime and PI_EndTime to provide timing reports. Call PI_StartTime just after PI_Configure. To preserve that starting point, don't call PI_StartTime again. Just keep calling PI_EndTime, and save/subtract those values to calculate intervals. Print times as seconds to 1 decimal place. To make it easier to collect the data for drawing your performance graphs, also print the number of Pilot processes (as returned from PI_Configure, not W) used in the run (so 1 = serial, 2 = master + 1 worker, etc.) along with each elapsed time.
  • Initialization time: Report the time just after all workers return their number of records and collisions read (the "1" here means a serial run), so this includes all the initial I/O: $T0,1,9.5
  • Query time: Report the time with the query number (here doing query 2 with 24 processes): $T2,24,16.1
  • Computation time: To exclude the initialization time, report the time from T0 to just before calling PI_StopMain (here "9" represents "the end"): $T9,8,126.9
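    The bookkeeping for these reports is just a matter of saving successive PI_EndTime values. The fragment below assumes (per the description above) that PI_EndTime returns the seconds elapsed since the single PI_StartTime call, and that np holds the value returned by PI_Configure.

        double t_init, t_prev, t_now;

        /* ...after all workers report their record/collision counts: */
        t_init = t_prev = PI_EndTime();
        printf("$T0,%d,%.1f\n", np, t_init);

        /* ...after finishing each query q: */
        t_now = PI_EndTime();
        printf("$T%d,%d,%.1f\n", q, np, t_now - t_prev);
        t_prev = t_now;

        /* ...just before PI_StopMain -- computation time since T0: */
        printf("$T9,%d,%.1f\n", np, PI_EndTime() - t_init);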
    In terms of software modularity, I suggest breaking out three functions:

  • One that takes a record number and returns the byte offset you would pass to fseek. This function can account for the odd-length header record, and calculate the right offset based on the record number and the record length plus EOL characters.
  • One that parses a record of NCDB data and fills a struct. Strictly speaking, "parsing" (in the sense of scanning character by character) is not necessary because the data is fixed in known columns. You can basically ignore the commas and just interpret an entire line using fscanf with a lengthy format string. This is a trivial problem that can be made needlessly complicated if you try. Note that numeric fields may be populated with alphabetic codes to represent odd cases like Unknown. Read the Data Dictionary to avoid treating characters as numbers by mistake!
  • One that compares a just-read record with the previously-read record and returns whether they belong to the same collision or not. This is the kind of logic that is easy to get wrong, and needs fixing up when you discover special cases you didn't think of at first. We can post on the forum how many collisions we think are in the database. If there are ambiguous cases, we need to agree on the decision rules so that everyone can arrive at the same output. By isolating this logic into a function by itself, you can tinker with it until you get consistent results.
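    A starting point for the third function is sketched below. The fields compared (date/hour association data plus restarting vehicle/person sequence numbers) and the struct/field names are placeholders; expect to refine the logic as special cases and the forum's agreed decision rules emerge.

        struct record {                           /* placeholder struct              */
            int c_year, c_mnth, c_wday, c_hour;   /* collision association data      */
            int v_seq, p_seq;                     /* vehicle/person sequence numbers */
        };

        /* Return 1 if cur appears to belong to the same collision as prev. */
        static int same_collision(const struct record *prev, const struct record *cur)
        {
            if (cur->c_year != prev->c_year || cur->c_mnth != prev->c_mnth ||
                cur->c_wday != prev->c_wday || cur->c_hour != prev->c_hour)
                return 0;                         /* association data changed */
            /* Same date/hour: guess that a new collision starts when the
             * vehicle/person sequence numbers restart rather than continue. */
            if (cur->v_seq < prev->v_seq)
                return 0;
            if (cur->v_seq == prev->v_seq && cur->p_seq <= prev->p_seq)
                return 0;
            return 1;
        }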
    Serial mode

    Your program must be capable of running in serial fashion when it detects that only one Pilot process is available. In serial mode, it does not send messages to workers, but instead does all the work in PI_MAIN (presumably by calling functions). This is not difficult to arrange if you plan it from the start so that the work is factored into functions that can be called by either PI_MAIN or a worker. We need timing from a serial version in order to calculate speedup. Make sure you run the serial version using "sqsub -n 1" just like the multiprocess version, even though sqsub will want to put it in the "serial" job queue, not the "mpi" queue. Avoid using mpirun instead of sqsub, because you'll be competing with everyone using mako's login node and the run time will be affected.
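    Factored that way, the serial case collapses to a small dispatch in PI_MAIN; the fragment below reuses the placeholder helpers from the earlier sketch (load_portion, answer_query, and W = number of workers).

        if (W == 0) {                 /* only one Pilot process: do it all here */
            long nrec, ncoll;
            load_portion(0, 0, total_records, &nrec, &ncoll);
            for (int k = 0; k < nqueries; k++)
                answer_query(0, queries[k]);
        } else {
            /* normal master/worker message flow */
        }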

    Suggestions for using Pilot
  • Use the return value of PI_Configure to calculate how many worker processes can be created. Don't forget that one process is reserved for PI_MAIN. If the value is one, you must run in serial mode.
  • Load balancing is handled by dividing the dataset evenly amongst the workers. The time that each worker spends answering a query is going to be pretty comparable, so there's no need to consider elaborate schemes such as work stealing, nor to wait for whichever process finishes computing a query first.
  • To make sure all the processes finish cleanly, send every worker a special query code that essentially signals "done." The workers can then return from their functions.
  • Start with a small number of workers until you're sure the logic is correct and the answers are consistent. Obviously, the answers must remain the same as you crank up the number of processes.
  • Don't forget to use maximum Pilot error checking and deadlock detection initially until you are confident of your program's logical correctness. If you're having trouble figuring out what your program is doing, try the Jumpshot log (-pisvc=j) which visualizes all your message traffic.
    Performance measurement

    We are interested in how the T0 initialization (I/O) time and T9 computation time vary as the number of processors increases. We hope it's going to decrease in a fairly linear fashion, since this is an "embarrassingly parallel" computation. Nonetheless, it is still subject to Amdahl's Law, and won't decrease indefinitely.

    For these timings, run queries "1 2 3 4 5", and if it turns out that this is too brief, you'll get instructions on forcing repetitions. We want your best times, both with and without maximum compiler optimization.

    To present the results, draw six graphs--three for each execution time of interest--being careful to label your axes:

  • Execution time (T0 or T9) vs. number of Pilot processes.
  • Speedup (=serial time/parallel time) vs. number of Pilot processes. Serial time is for one process. The speedup graph must also contain a straight line that shows perfect speedup (proportional to the number of processes) for comparison.
  • Efficiency (=speedup/no. of processes) vs. no. of Pilot processes. Ideal efficiency would stay at 1.0 as processes increase, but this is not likely.
  • The 2nd and 3rd graphs are just different ways of presenting the 1st graph's numbers.

    For the above graphs' X axis, do timings for 1-8 processors, and then 12, 16, and 20. You may go higher if you wish. This means that you are plotting a minimum of 11 points, where point 1 is serial execution, and point 2 (master + one worker) may be even worse than serial.

    To make sure the timings are fairly consistent, and aren't being unduly disturbed by other mako users, repeat some of the runs a few times and verify that the timing variance is small.

    IMPORTANT: Try disabling all compiler optimizations and rerun the timings, drawing that line on the graph as well.

    NOTE: The Intel C/C++ compiler, unlike gcc, defaults to high optimization -O2. (You can also try other, more complex optimization options.) To compare the impact of optimization, you have to turn it off explicitly with -O0. The Pilot library has been compiled with -O0 for possible debugging purposes.

    For extra credit (see below), you may also explore what happens to the timing when you force each process onto a node by itself instead of letting the job scheduler pack multiple processes onto the same node.

    WARNING: Mako may have non-SOCS users who are directly running programs on some processors without going through the job scheduler (sqsub). This is the nature of an experimental cluster, and it could impact your run times if you are unlucky enough to get scheduled onto such nodes. This is one reason you want to take multiple timings and check their variance.

    Don't underestimate how long it will take to do these timing runs, record the measurements, and draw the graphs! Your assignment isn't finished just because your program compiles and runs correctly.

    Part 1 Deliverables (due Oct. 4 in class)

    The purpose of Part 1 is to get people working with Pilot and the mako environment, and get the file read in and records counted. Your output is a paper submission consisting of:

  • 1. A process/channel diagram (with any bundles indicated) showing the architecture of your Pilot program (can be hand drawn).
  • 2. Printed log file of a sample run, demonstrating that your skeleton program runs, that the channels shown in your diagram have been exercised, that PI_MAIN is capable of distributing the work and the workers are capable of reading/counting their records. The printed grand total should be correct. Of course, it's better if you also count the collisions correctly, but that's not required at this stage. Mark up the printout so that the TA can easily understand what it is showing.
  • Do not submit source code for Part 1. Not handing in Part 1 on time, or an incomplete/lame submission, will result in loss of functional marks (see below).

    Part 2 Deliverables (due Oct. 12)
  • Source code via electronic submission using Subversion:
  • 1. Pilot program "bang" in .c source file(s). Each source file that you create must include a file header containing your name and student number, the filename, and a brief description of the purpose of the file. Submissions lacking proper file headers will be penalized.
  • 2. Makefile that compiles bang on mako using mpicc when make is executed. You may use the Makefile for the Pilot labs as a model. You can introduce compiler flags by adding a CFLAGS +=... line. Have a clean target in your makefile.
  • Documentation, printed and delivered to class:
  • 1. Write your impressions of working with the Pilot library. We're looking for feedback! What did you like/dislike? Was there anything you found hard to understand or to use properly? Did you use the Jumpshot visual log feature? If so, what did you think about it? If not, why not? Do you have any suggestions to make it more useful? In your opinion, would Pilot be easy for a novice scientific programmer to learn? Why or why not?
  • 2. Describe your application's parallel architecture and work distribution schemes. See the section "Suggestions for using Pilot" above. Either explain how you implemented these suggestions, or explain the approach you adopted.
  • 3. Print your six graphs and write some observations about them. Given the speedup and efficiency results, how scalable was your parallel architecture? Was it the same for the I/O phase as for the computation phase?
  • 4. Did compiler optimization make a difference? Did it affect both execution time and speedup, or just execution time? (That is, did it especially help the parallel version, or did it equally help the serial version?)
    Marks for Assignment #1

    For the software, mark is 0-6 based chiefly on functionality:

    6 = Parallel program that produces correct results for all queries, reports timings, and runs in serial mode when started with one process

    1.5-5.5 = Same as 6, but with deductions according to deficiencies

    1 = Parallel program that runs cleanly, and is coded to address the problem, but results are all incorrect or missing

    0 = Fails to compile and link, or crashes without producing any results

    In addition, 1/2 mark penalties will be deducted for specific "sins": numerous compiler remarks/warnings, file headers missing/incomplete, makefile problematic, program name wrong.

    Concerning Part 1, 2 functional marks will be deducted for no submission, or 1 mark if the submission does not meet the requirements.

    The documentation starts with 4 marks, then deduct 1 mark for any item missing, or 1/2 mark for any lame or poorly-done item. If you didn't have any working software to graph, you'll lose at least 1 mark to be fair to those who conducted timing and produced their graphs.

    Extra credit

    As was discussed in lab, when you use plain "sqsub -n P" on mako, the job scheduler is free to start your requested P processes on whichever node(s) 1-32 it wants. It will likely put the first 8 on one node, the next 8 on another, and so on. The upside of sharing a node is that your Pilot channel communications will take place in that node's RAM (much faster than a network message); the downside is that they will compete for the node's 8 or 16 GB of available RAM. It would be interesting to see whether the location of your processes makes any noticeable performance difference for this bang application, either in spreading the I/O (possibly faster?) or the computation (slower?).

    To get up to 10% extra credit (defined as assignment mark * 110%), set up the following test cases and compare the total execution times (using full compiler optimization):

  • "sqsub -n 8" scheduled onto a single mako node
  • same job, but scheduled onto 8 different nodes (this requires the special --ppn argument to sqsub)
  • Using the uname() library function, which you can call in main() before PI_Configure, make your Pilot processes print out the nodename that each is running on, in order to prove that they are the same/different.
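    For the node-name printout, the standard POSIX uname call is enough; a minimal sketch follows (print_nodename is a placeholder name, and the output deliberately avoids a leading "$").

        #include <stdio.h>
        #include <sys/utsname.h>

        /* Call before PI_Configure so every process prints its own node. */
        static void print_nodename(void)
        {
            struct utsname u;
            if (uname(&u) == 0)
                printf("running on node %s\n", u.nodename);
        }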

    You should make a few runs and average the times to make sure that there's a real pattern. Don't use the deadlock checker or Jumpshot log because those will introduce other factors to influence timing.

    In your report, include the log printout with the node names and the timings, and write an explanation for the results you find. To get the full 10%, your explanation has to be thoughtful and logical.
