5. Data Processing and Analysis
After questionnaire development, pretesting the instrument and designing the sample, fieldwork – or the actual gathering of the required data – must be undertaken. However, we will not be discussing the complex and expensive tasks associated with fieldwork as part of this course.
Once the results start to come back from the field, the information needs to be prepared for input in order to be tabulated and analyzed. Before the questionnaires are given to someone for data-entry, they must be edited and coded. There should be no ambiguity as to what the respondent meant and what should be entered. This may sound simple, but what do you do in the following case:
So is it their first trip or not? And what do you instruct the data-entry person to do? In spite of clear instructions, this type of confusing response is not as rare as we might think, particularly in self-administered surveys.
If the questionnaire was not pre-coded, the researcher does the coding at the same time as the editing. Coding involves assigning a label to each question or variable (as in "q15" or "1sttrip") and a number or value to each response category (for instance 1 for "yes" and 2 for "no"). Sometimes people will write in a response such as "can’t remember" or "unsure", and the editor must decide what to do. Such a response can either be ignored, or a new code and/or value can be added for it. All of these decisions, as well as the questions and their codes, are summarized in a "codebook" for future reference. Pamela Narins and J. Walter Thompson of SPSS have prepared some basic guidelines for preparing for data entry that you should be sure to read.
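To make the idea of a codebook concrete, here is a minimal sketch in Python. The question wording and the extra code 9 for write-in answers are assumptions made purely for illustration; a real codebook would follow the actual questionnaire.

```python
# Hypothetical codebook: each variable gets a label, and each response
# category gets a numeric value. Code 9 is an assumed choice for the
# write-in answers ("unsure", "can't remember"); it is not prescribed.
CODEBOOK = {
    "q15": {
        "label": "Is this your first trip?",
        "values": {1: "yes", 2: "no", 9: "unsure / can't remember"},
    },
}


def encode(variable, response):
    """Return the numeric code for a written response, or None if uncoded."""
    answer = response.strip().lower()
    for code, text in CODEBOOK[variable]["values"].items():
        if answer in text.split(" / "):
            return code
    return None


print(encode("q15", "Yes"))             # 1
print(encode("q15", "can't remember"))  # 9
print(encode("q15", "maybe"))           # None
```

A response that returns None is one the editor still has to make a decision about, and that decision should then be recorded in the codebook.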
Even in a structured questionnaire, you may have one or two open-ended questions, which do not lend themselves to coding. The responses to this type of question need to be content-analyzed and, ideally, grouped into meaningful categories. At this point, they can either be tabulated manually or have codes established for them.
Once the data has been input into the computer, usually with the assistance of a statistical package such as SPSS, it needs to be ‘cleaned’. This is the process of ensuring that the data entry was correctly executed and correcting any errors. There are a number of ways to check for accuracy:
- Double entry: the data is entered twice and any discrepancies are verified against the original questionnaire;
- Running frequency distributions and scanning for errors in values based on the original questionnaire (if only four responses are possible, there should be no value "5", for instance); and
- Data listing: the values for all cases that have been entered are printed out, and a random sample of cases is verified against the original questionnaires.
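The first two checks in the list above can be sketched in a few lines of Python. The case numbers, variable names and valid-value sets below are hypothetical; in practice they would come from the entered data file and the codebook.

```python
# Hypothetical data: case id -> {variable: entered value}, keyed twice
# by two separate data-entry passes.
first_entry = {
    101: {"q1": 1, "q2": 3},
    102: {"q1": 2, "q2": 5},
    103: {"q1": 1, "q2": 4},
}
second_entry = {
    101: {"q1": 1, "q2": 3},
    102: {"q1": 2, "q2": 2},
    103: {"q1": 1, "q2": 4},
}
# Assumed codebook extract: the values allowed for each variable.
VALID_VALUES = {"q1": {1, 2}, "q2": {1, 2, 3, 4}}


def double_entry_discrepancies(first, second):
    """List (case, variable) pairs where the two entry passes disagree."""
    return [
        (case, var)
        for case in sorted(first)
        for var in sorted(first[case])
        if first[case][var] != second.get(case, {}).get(var)
    ]


def out_of_range(entries, valid):
    """List (case, variable, value) where the value is not a valid code."""
    return [
        (case, var, value)
        for case in sorted(entries)
        for var, value in sorted(entries[case].items())
        if value not in valid[var]
    ]


print(double_entry_discrepancies(first_entry, second_entry))  # [(102, 'q2')]
print(out_of_range(first_entry, VALID_VALUES))                # [(102, 'q2', 5)]
```

Both checks only flag cases for attention. The correct value still has to be looked up in the original questionnaire.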
The objective is of course to achieve more accurate analysis through data cleaning, as explained by Pamela Narins and J. Walter Thompson of SPSS.
The data is now ready for tabulation and statistical analysis. This means that we want to do one or more of the following:
- Describe the background of the respondents, usually using their demographic information;
- Describe the responses made to each of the questions;
- Compare the behaviour of various demographic categories to one another to see if the differences are meaningful or simply due to chance;
- Determine if there is a relationship between two characteristics as described; and
- Predict whether one or more characteristics can explain the difference that occurs in another.
In order to describe the background of the respondents, we need to add up the number of responses and report them as percentages in what is called a frequency distribution (e.g. "Women accounted for 54% of visitors."). Similarly, when we describe the responses made to each of the questions, this information can be provided as a frequency, but with added information about the "typical" or "average" response, which is also referred to as a measure of central tendency (e.g. "On average, visitors returned 13 times in the past five years.").
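As a rough illustration, a frequency distribution and two common measures of central tendency can be computed as follows. The responses are invented for this sketch and are not survey results.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical responses from ten visitors.
gender = ["F", "M", "F", "F", "M", "F", "M", "F", "M", "F"]
return_visits = [2, 5, 13, 8, 1, 20, 13, 4, 13, 11]


def frequency_distribution(values):
    """Return {category: (count, percentage)} for a list of responses."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


# Frequency distribution of a demographic variable.
for category, (count, percent) in frequency_distribution(gender).items():
    print(f"{category}: {count} respondents ({percent}%)")

# Two measures of central tendency for a numeric question.
print("mean:", mean(return_visits))
print("median:", median(return_visits))
```

Reporting the median alongside the mean is useful when a few respondents give very large values, since the mean is pulled toward them while the median is not.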
In order to compare the behaviour of various demographic categories to one another to see if the differences are meaningful or simply due to chance, we determine the statistical significance of those differences by tabulating two or more variables against each other in a cross-tabulation (e.g. "There is clear evidence of a relationship between gender and attendance at cultural venues. Attendance by women was statistically higher than men’s.").
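One common way to test a cross-tabulation for significance is Pearson's chi-square test; the text does not name a specific test, so treat this choice as an assumption. The sketch below handles only a 2x2 table, applies no continuity correction, and uses invented counts.

```python
import math
from collections import Counter

# Hypothetical respondents: (gender, attended a cultural venue?).
responses = (
    [("F", "yes")] * 40 + [("F", "no")] * 14
    + [("M", "yes")] * 22 + [("M", "no")] * 24
)


def crosstab(pairs):
    """Count how often each (row, column) combination occurs."""
    return Counter(pairs)


def chi_square_2x2(table, rows, cols):
    """Pearson chi-square statistic and p-value for a 2x2 table (df = 1)."""
    total = sum(table.values())
    row_tot = {r: sum(table[(r, c)] for c in cols) for r in rows}
    col_tot = {c: sum(table[(r, c)] for r in rows) for c in cols}
    chi2 = 0.0
    for r in rows:
        for c in cols:
            # Count expected in this cell if the two variables were unrelated.
            expected = row_tot[r] * col_tot[c] / total
            chi2 += (table[(r, c)] - expected) ** 2 / expected
    # With one degree of freedom the p-value has this closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


table = crosstab(responses)
chi2, p = chi_square_2x2(table, rows=("F", "M"), cols=("yes", "no"))
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

A p-value below the conventional 0.05 threshold would suggest that the difference between the groups is unlikely to be due to chance alone.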
If we wish to determine whether there is a relationship between two characteristics, for instance the importance of predictable weather on a vacation and the ranking of destination types, then we are calculating the correlation. And finally, when trying to predict whether one or more characteristics can explain the difference that occurs in another, we might answer a question such as "Are gender, education and/or income levels linked to the number of times a person attends a cultural venue?" This kind of question is usually addressed with regression analysis.
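The two ideas in the paragraph above can be sketched as a Pearson correlation coefficient and a least-squares line with a single predictor. All figures are hypothetical, and the question quoted in the text involves several predictors, which would call for multiple regression in a statistical package rather than this simplified version.

```python
import math

# Hypothetical ratings: importance of predictable weather (1-7) and the
# rank each respondent gave to a beach destination.
weather_importance = [1, 2, 3, 4, 5, 6, 7]
beach_ranking = [2, 1, 4, 3, 6, 5, 7]

# Hypothetical data: years of education and visits to cultural venues.
years_education = [10, 12, 12, 14, 16, 16, 18]
venue_visits = [1, 3, 2, 4, 5, 6, 7]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(x, y):
    """Slope and intercept of the line that best predicts y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mx) * (b - my) for a, b in zip(x, y))
        / sum((a - mx) ** 2 for a in x)
    )
    return slope, my - slope * mx


r = pearson_r(weather_importance, beach_ranking)
slope, intercept = least_squares_line(years_education, venue_visits)
print(f"correlation r = {r:.2f}")
print(f"visits = {intercept:.2f} + {slope:.2f} * years of education")
```

A value of r near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one. The slope estimates how much the outcome changes for each additional unit of the predictor.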