Critical Issues to Consider in Research Methodology


Errors in Research
Survey Methods
Tactics to Increase Response Rate in Surveys
Data Collection by Observation
Experimental Research
Types of Scales
Reliability and Validity of Questionnaires and Psychometric Tests
The Art of Drafting Questionnaires
Problems Unique to Interview Schedules
Sampling Methods

Errors in Research

While it is important to ensure that you do not make serious errors, there is no such thing as perfect research or research that is completely free from errors. Every piece of research, even those published in top journals, has some degree of error. How can this be possible? Permit me to explain. Total error can be divided into two categories: random sampling error and systematic error. Random sampling error is the error caused by chance variation. Imagine, for instance, that you have a sack containing 100 balls: 50 of the balls are white and the rest are black. You pick out 10 balls from that sack. In an ideal situation, you would expect to retrieve 5 black balls and 5 white balls. However, isn't it possible that you end up picking 10 balls that are all black, or all white? This is what is meant by chance variation or random sampling error. This error can be minimised by using other sampling methods such as stratified random sampling (which will be explained later in this chapter).
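To make chance variation concrete, here is a minimal simulation sketch in Python (not part of the original text; the 50/50 sack and the draw of 10 balls are simply the hypothetical figures used above):

```python
# Minimal sketch: chance variation when drawing 10 balls from a sack of
# 50 white and 50 black balls (illustrative only; the numbers are the
# hypothetical figures used in the text).
import random
from collections import Counter

sack = ["white"] * 50 + ["black"] * 50

counts = Counter()
for _ in range(10_000):                      # repeat the draw many times
    draw = random.sample(sack, 10)           # draw 10 balls without replacement
    counts[draw.count("black")] += 1         # record how many were black

for black, freq in sorted(counts.items()):
    print(f"{black} black balls: {freq / 10_000:.1%} of draws")
# Most draws give roughly 5 black balls, but lopsided draws do occur in a
# small percentage of runs -- that spread is random sampling error.
```

Running such a simulation shows that although 5 black balls is the most common outcome, quite lopsided draws turn up in a small share of trials, which is precisely the chance variation described above.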

The next category of error is systematic error and this can be divided into two categories: administrative error and respondent error. Administrative error is where the error is caused by some improper administration or execution of the research task such as data-processing error, sample selection error, interviewer error and interviewer cheating. Data-processing error can occur where, for example, due to some computer error, the number 4 appears whenever you strike the number 2 key. Note that in such an instance the error is indeed systematic. This error should be avoided at all costs.

Sample selection error can occur, for example, when you select your sample from a telephone book: unlisted numbers are not included. Interviewer error can occur when the interviewer is not able to write fast enough to record answers verbatim. Interviewer cheating is when the interviewer fills in the questionnaires themselves. In view of this, I would strongly urge PhD students to collect the data themselves and not to contract out this task. A lack of honesty, dependability, diligence or accuracy in any one of your research assistants can seriously jeopardise the quality of your data and hence the quality of the results that you get. It can even result in you failing to find the significant correlations that you have predicted, which would seriously undermine your chances of successfully completing your PhD. All the errors mentioned in this paragraph should be avoided at all costs, which is why I always advocate that PhD students collect the data themselves rather than subcontract the task. Doing so ensures the highest quality of data and increases your chances of establishing the predicted relationships between variables. Furthermore, collecting the data yourself gives you invaluable experience as a researcher.

I stated earlier that systematic error is made up of administrative error and respondent error. We have already discussed administrative error; now I shall describe respondent error. Respondent error can be categorised as non-response error and response bias. Non-response error occurs in the following instance. You send out 1,000 questionnaires but receive only 200 back. You analyse the 200 questionnaires you have received and find that, overall, the respondents are highly dissatisfied with the product. There are a few highly satisfied respondents in the sample, but most are highly dissatisfied. However, isn't it possible that the other 800 people found the product fairly satisfactory and, because of that, did not bother to respond? If you had received responses from all 1,000 people, the results might have been entirely different, i.e. the respondents would be neutral or even slightly satisfied. You see, this is natural human behaviour: people tend to respond only if they feel strongly about the product, either favourably or unfavourably. People whose opinions or attitudes are in the middle tend not to respond. Also, isn't it more likely for a person to say something when they hate a product than when they like it? There is not much you can do about this other than trying to increase your response rate. How can you do this? This is explained later in this chapter.

Response bias (either deliberate or unconscious) can be divided into the following categories: acquiescence bias, extremity bias, central tendency bias, interviewer bias, auspices bias and social desirability bias. Acquiescence bias is the tendency for respondents to agree with a statement. Consider the statements ‘I find it easy to get up in the morning’ and ‘I find it difficult to get up in the morning.’ These are opposing statements: the first is framed in the positive whereas the latter is framed in the negative. Respondents have a tendency to agree with statements, whether positive or negative, so the way in which questions are phrased becomes critical. The best way to reduce acquiescence bias is to have approximately the same number of questions phrased in the positive and in the negative. Extremity bias is the tendency of respondents to respond at the extremes of the scale; phrasing all questions either in the positive or in the negative tends to increase this. Central tendency bias is the tendency for respondents to give marks at the middle point of the scale. This may occur where respondents are not really interested in filling in your questionnaire and want to complete it with the minimum amount of thought, effort and time. I have rarely come across this, although some of my friends have. If, after you have conducted your pilot study, you find that most respondents ticked the middle point of the scale, then you will have to use even-numbered scales (such as 1, 2, 3, 4) rather than odd-numbered scales (1, 2, 3, 4, 5). Respondents are then forced to make a stand and are not allowed to sit on the fence. Interviewer bias occurs when the presence of the interviewer (predominantly in face-to-face interviews) influences the answers of the respondents. Consider this situation. You are a married man with kids and along comes a beautiful young lady interviewer who asks you, ‘What is your income?’, ‘What car do you drive?’ and ‘What is your marital status?’ If your real income is RM 4,000, you drive a small car and you are married, wouldn't you be tempted to say, ‘I earn RM 10,000’, show her your BMW car key ring and say that you are single? Of course not! Not you! But everyone else would!
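On the acquiescence point above, negatively phrased items are normally reverse-coded before a scale score is computed, so that agreement with a negative statement counts against the trait rather than for it. The sketch below is not from the original text; it assumes a 5-point agree/disagree scale, and the items and response values are invented for illustration:

```python
# Minimal sketch: reverse-coding negatively phrased items on a 1-5 scale
# before summing a scale score. Items and responses are hypothetical.

# True = phrased in the positive, False = phrased in the negative
items = {
    "I find it easy to get up in the morning": True,
    "I find it difficult to get up in the morning": False,
}

responses = {  # one respondent's answers: 1 (disagree) to 5 (agree)
    "I find it easy to get up in the morning": 4,
    "I find it difficult to get up in the morning": 4,   # acquiescent answer
}

SCALE_MAX = 5
score = 0
for item, positive in items.items():
    raw = responses[item]
    score += raw if positive else (SCALE_MAX + 1 - raw)  # reverse-code negatives

print("Scale score:", score)
# Agreeing with both opposing statements (acquiescence) now partly cancels
# out instead of inflating the total.
```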

Auspices bias is the tendency to shape your response because of the organisation conducting the study. It is somewhat related to the next bias, which is social desirability bias. Consider first being asked this question by a PhD student conducting a study: ‘Do you make an effort to send your waste paper to a recycling centre?’ Suppose you don't. You may be honest with the PhD student and admit it, but even then some of us are shy to admit it because of social desirability bias. In other words, social desirability bias is the bias caused by the respondent's desire, whether conscious or unconscious, to appear in a favourable social position. Now, what if the person conducting the study is from Greenpeace or some other conservationist group? The tendency to say ‘yes’ to that question, although the truthful answer is ‘no’, is even stronger. This is auspices bias: a bias in the response of the subject caused by the organisation conducting the study. You can reduce all the biases mentioned in this paragraph by being aware of them in the first place and then consciously trying to avoid them. The saying ‘Forewarned is forearmed’ applies here.

Survey Methods

There are different ways of communicating with respondents in a survey: door-to-door, mall intercept, telephone and mail surveys. Comparisons of these different methods are summarised below.

Comparison of the different methods of survey (door-to-door personal interviews (office or home), mall intercept interviews, telephone surveys and mail surveys):

  • Speed of response during data collection: door-to-door, moderate to fast; mall intercept, fast; telephone, very fast; mail, slow (the researcher has little control over the return of questionnaires).
  • Cost: door-to-door, highest; mall intercept, moderate to high; telephone, lower but more than mail; mail, lowest.
  • Geographical coverage: door-to-door, limited; mall intercept, confined to urban areas; telephone, wide; mail, very wide.
  • Respondent co-operation: door-to-door, moderate in offices and low in homes; mall intercept, moderate to low as shoppers are busy; telephone, moderate (better if calls are kept short); mail, low (even lower for poorly designed questionnaires).
  • Questionnaire length: door-to-door, longer than telephone; mall intercept, moderate to short; telephone, short; mail, longest, especially where incentives are provided.
  • Versatility of questioning: door-to-door, versatile because the interviewer can clarify and modify; mall intercept, versatile for the same reason; telephone, moderate; mail, low as a highly standardised format is needed.
  • Item non-response: door-to-door, low; mall intercept, low; telephone, low to medium; mail, high.
  • Degree of interviewer influence on answers: door-to-door, highest; mall intercept, high; telephone, moderate; mail, none.
  • Anonymity of respondent: door-to-door, lowest; mall intercept, low; telephone, moderate; mail, highest.
  • Ease of call-back or follow-up: door-to-door, easy to trace but respondents may not co-operate a second time; mall intercept, impossible if no address is obtained; telephone, easy but respondents may not co-operate a second time; mail, easy if you can identify the respondent from the questionnaire received.
  • Possibility of respondent misunderstanding: door-to-door, low; mall intercept, low; telephone, low to average; mail, highest as no interviewer is present for clarification.
  • Supervision of interviewers: door-to-door, difficult to moderate; mall intercept, moderate if interviewers are in one shopping mall; telephone, high where telephonists are located in one room; mail, not applicable.
  • Special features: door-to-door, visual materials may be shown or demonstrated and taste tests are possible; mall intercept, the same; telephone, only audio materials can be played; mail, pictures and samples may be enclosed.

Tactics to Increase Response Rate in Surveys

Among the most effective tactics I would recommend that you use in order to increase response rates in mail surveys are:

  • give advance notification
  • write a ‘sales oriented’ cover letter
  • stimulate respondents' interest with interesting questions
  • have a questionnaire with a good layout that is easy to read
  • enclose a stamped return envelope
  • conduct follow-up enquiries on those who have yet to return the questionnaires
  • use the name of the organisation conducting the study.

If all else fails, then money is a good motivator; however, researchers usually have very limited budgets. One PhD student at the Manchester Business School used a lottery ticket to induce participants to respond. The lottery ticket cost merely one British pound, but the prize for the winner is a minimum of one million pounds! However, the use of a four-digit lottery ticket in Malaysia may not have the same motivating effect. Many Muslims would object to it on religious grounds, and it may also be less motivating for non-Muslims because the prize money would be far less than that of the British lottery.

As an alternative to mail questionnaires, the drop-off method can be used. Here the researcher makes an appointment to see the respondent, explains the questionnaire briefly during the meeting and then leaves it with the respondent to complete at his or her leisure. A deadline is usually set, at which point the researcher returns to collect the completed questionnaire. I found this method to have a much better response rate than the ordinary mail method. However, for my PhD I found that the best method was a variation of the drop-off method. I would first explain the questionnaire to a group of respondents and then, rather than leaving, wait for them to fill in the questionnaires in my presence. In other words, the questionnaires were self-administered in my presence. The advantages of this method were that I did not allow respondents to ‘conveniently forget’ to complete my questionnaire; I was available should any respondent have questions; the respondents did not have to worry about returning or posting the questionnaire; and I could check each completed questionnaire for any questions that had inadvertently been left unanswered.

With the Internet, it is now possible to send questionnaires by email instead of traditional mail. The pros and cons of email are somewhat similar to traditional mail but with some notable exceptions. In the case of email,

  • it is free of charge (if using university or office facilities)
  • the speed of distribution and return is faster
  • there is an increased likelihood of response as the respondent does not need to place the questionnaire in a letterbox
  • multimedia messages in the form of videos, audio and 3-D pictures can be sent to respondents
  • keying in of responses by respondents in electronic form can reduce the need for subsequent data entry by researchers
  • the researcher can easily trace those who have not responded to the questionnaire.

There is no such thing as the best method for all situations; if there were, everyone would be using it. Cost seems to be the main factor in deciding which method to use. However, this does not mean that all PhD students, despite their extremely tight budgets, opt for mail surveys. In fact, I know several students who have used the personal interview method, the most expensive of all; of course, they conduct the interviews themselves to save costs. This method is preferred because researchers can be relatively confident that the respondents fully understand the questions and that the answers have been accurately recorded. Students also tend to use this method when they have relatively few respondents to interview, i.e. fewer than 100. Having fewer than 100 respondents in a PhD research project is acceptable where the units of analysis are companies rather than individuals. However, where larger numbers of respondents distributed over a wide geographic area need to be reached, a mail or email survey is preferred.

A good question to ask at this juncture is ‘Can you use more than one method in the same survey?’ I would say ‘yes’. For example, there is nothing wrong with sending questionnaires by traditional mail and email, then following up by calling the respondents. This was exactly what was done by my classmate at the Manchester Business School. She initially sent all her questionnaires by mail. When some did not reply, she followed up with a telephone call. The respondents said that they had misplaced the questionnaires and offered to give their responses by telephone, so she proceeded to collect the data by telephone interview. This method was accepted by the examiners as valid and she passed her PhD viva. However, be prepared to be questioned by the examiners whenever you use different methods of data collection in a single study. I feel that the examiners ask this merely to find out whether you understand the implications of using different methods, and the pros and cons of each, rather than to raise an outright objection.

Data Collection by Observation

What can be observed?

  • Physical actions, for example shoppers' movements in the store, teachers' movements in the classroom, etc.
  • Verbal behaviour, e.g. statements made by customers who wait in line.
  • Expressive behaviour, e.g. facial expressions, tone of voice, or other forms of body language of customers or employees.
  • Spatial relations and locations, e.g. how close people stand facing each other when they talk.
  • Temporal patterns, e.g. how long customers in a restaurant wait for their order to be served.
  • Physical objects, e.g. items on supermarket shelves.
  • Verbal and pictorial records, e.g. bar codes on product packages.
  • Physical trace evidence, e.g. finding out what a household consumes by looking into its garbage.

Categories of observation

Observation can be:

  • done by humans or by mechanical counters, and
  • visible or hidden.

Prior consent should be obtained where observation is hidden, for example, where the cameras cannot be seen by the respondents. Ethics demands that people should be informed in advance that their movements and actions in the store will be recorded. This is usually done by way of a notice at the entrance of the store. If they continue to walk into the store, they are deemed to consent to being videotaped. Customers normally forget about the notice and the cameras soon after they enter the store. They go about their activities as they normally would do and their actions and behaviour can be regarded as fairly normal.

Some benefits of observation of human behaviour include the following:

  • Communication with respondents is not necessary.
  • There is no need to rely on the respondents' memory.
  • Data is collected quickly and without distortions due to self-report.
  • Environmental conditions may be recorded.
  • Data may be combined with the survey to provide supplementary evidence.

Consider this situation. You might want to study the eating habits of weight-watchers; more precisely, you may wish to find out how often they raid the fridge. If you asked them how many times they take food from the fridge in a day, they may not remember. Worse still, some may remember but lie to you (i.e. exhibit social desirability bias). The camera, however, never lies. Researchers, with the consent of the house owners of course, can install cameras in the living rooms, dining rooms and kitchens (installing cameras in toilets and bedrooms is usually not done). Thus the movements and actions of all members of each household can be recorded accurately. However, this kind of research is more commonly done by professional market research firms than by PhD students.

There are many obvious limitations of collecting data by observations of human behaviour:

  • Cognitive behaviour cannot be observed. You cannot tell what a person is thinking about just by looking at his or her face.
  • Interpretation of data may be a problem and observer bias is possible. This problem is often aggravated in cross-cultural studies.
  • Not all activity can be recorded. As previously mentioned, the bedroom and toilets are ‘no-go’ areas.
  • Only short periods can be observed as most people would not want to have cameras installed in their houses for more than a few weeks. In the case of a store, only movements within the store can be recorded.
  • There is a possible invasion of privacy.

There are some terms commonly used in observation which are useful to know. They are:

  • Scientifically contrived observation, i.e. the creation of an artificial environment to test a hypothesis.
  • Response latency, i.e. recording the decision time necessary to make a choice between two alternatives. The shorter the time, the stronger the preference for the chosen alternative.
  • Content analysis, i.e. obtaining data by observing and analysing the content of advertisements, letters, articles, etc. to see the content of the message itself and/or the frequency of its occurrence.
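As a simple illustration of the last of these, content analysis often comes down to systematic counting. The sketch below is not from the original text: the advertisement texts and keywords are invented, and a real study would use a proper coding scheme, but the counting idea is the same.

```python
# Minimal sketch of a simple content analysis: counting how often chosen
# keywords occur in a set of advertisement texts (texts are made up).
from collections import Counter
import re

ads = [
    "Save money today with our low-price guarantee",
    "Luxury you deserve - premium quality, premium feel",
    "Low price, high quality, every day",
]
keywords = ["price", "quality", "luxury"]

counts = Counter()
for ad in ads:
    words = re.findall(r"[a-z]+", ad.lower())   # crude tokenisation
    for kw in keywords:
        counts[kw] += words.count(kw)

print(dict(counts))   # e.g. {'price': 2, 'quality': 2, 'luxury': 1}
```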

Some electrical/mechanical devices used in PhD research in the area of psychology include:

  • Eye-tracking monitors which record eye movements, thus providing information on how the subject actually reads or views an advertisement.
  • Pupilometer, a device which observes and records changes in the diameter of the subjects' pupils—the more excited the subject, the more dilated the pupils.
  • Psychogalvanometer, a device which measures the galvanic skin response, i.e. the electrical resistance of the skin or physiological changes which usually accompany the varying emotional states of the subject.
  • Voice pitch analysis which measures the emotional reactions through the physiological changes in a person's voice.

However, such equipment is expensive, and I have not come across any of it in the local universities.

Experimental Research

In experimental research, conditions are controlled. One independent variable is manipulated at a time and its effect on the dependent variable is measured. All other variables are controlled i.e. kept constant. Experiments are often used to test hypotheses.

The easiest way to understand the mechanics of an experiment is to use an example. Suppose you wish to test the effect of light levels on workers' productivity. You formed the following hypothesis:

H1: Increased light levels will result in increased productivity.

The way to test this is by experiment. You set up two rooms—room A (which we will call the experimental group) and room B (the control group). Both rooms have five workers each. All the workers have been randomly selected. Further checks are made to ensure that the workers in room A are somewhat similar in terms of skills and experience to those in room B. At the initial stage of the experiment, the light levels in both rooms are the same. Measurements of workers' productivity, known as O1, are taken from both rooms A and B. The productivity levels of both groups should be roughly equal. After the measurements are taken, the light level in room A is increased, whereas it remains unchanged in room B. Then, a second measurement of productivity known as O2 is taken from both rooms A and B.

In order to prove their hypothesis, the researchers would need to find the following: at O1 the productivity levels of workers in both rooms should be roughly equal. At O2 however, the productivity levels of workers in room A should increase whereas those in room B should be the same as O1. All other conditions should remain constant. If that happens, the researchers can be confident that the increase in productivity was caused by the increase in the light levels. Note that in this experiment, the light level is the only independent variable that has changed. All other variables are kept constant. Also, the increase in productivity of workers in room A occurred immediately after the increase in light levels. Thus one can safely conclude that the increase in light levels improves productivity.
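For readers who like to see the logic in code, here is a minimal sketch of the light-level experiment using simulated data. This is not part of the original text: all the numbers (a baseline productivity of about 100 units and an assumed true effect of +8 units in room A) are invented purely for illustration, and a t-test on the change scores is just one reasonable way to compare the two groups.

```python
# Minimal sketch of the light-level experiment with simulated data
# (all numbers are invented for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# O1: baseline productivity, both rooms under the same light level
room_a_o1 = rng.normal(100, 5, size=5)   # experimental group (room A)
room_b_o1 = rng.normal(100, 5, size=5)   # control group (room B)

# Treatment: light level increased in room A only.
# O2: assume (purely for illustration) a true effect of +8 units in room A.
room_a_o2 = rng.normal(108, 5, size=5)
room_b_o2 = rng.normal(100, 5, size=5)

# Compare the change (O2 - O1) between the two groups
change_a = room_a_o2 - room_a_o1
change_b = room_b_o2 - room_b_o1
t, p = stats.ttest_ind(change_a, change_b)
print(f"t = {t:.2f}, p = {p:.3f}")
# A small p-value would support H1; with only 5 workers per room the test
# has little power, which is itself a useful lesson about sample size.
```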

The above example is a description of what researchers in the Hawthorne experiments hoped to find. However, what happened in that experiment was that the productivity of workers in both rooms A and B improved. The researchers correctly concluded that productivity was not increased as a result of the increased light levels (as the productivity levels continued to increase even though the light levels were lowered). However, they wrongly concluded from the experiment that it must have been the attention paid to the workers that improved their productivity.

Hawthorne's findings have been severely criticised: if attention was indeed the variable influencing work performance, where was the control group? Attention was received by workers in both rooms A and B, so there was no real control group as far as the variable of ‘attention’ was concerned. This is the major experimental flaw of the Hawthorne studies, and to this day the research is heavily criticised by research purists, although managers are quite content to accept the findings and act upon them.

Types of Scales

Nominal scale

A nominal scale is a scale in which the numbers or letters assigned to objects serve merely as labels for identification or classification; the number 2 does not indicate more of a characteristic than the number 1. Suppose you assign 1 for Malays, 2 for Chinese and 3 for Indians. This does not mean that Chinese are in any sense ‘more’ than Malays, or Indians ‘more’ than Chinese; the numbers simply identify the groups. Nominal scales are often used to identify race, gender and location.

Ordinal scale

An ordinal scale is a scale that arranges objects or alternatives according to their magnitudes or rank. Consider for example your favourite car. In order of ranking, 1 being highest, it could be 1 for Proton, 2 for Honda and 3 for Peugeot. This means that of all the brands of cars, you prefer a Proton. However, it does not mean that the magnitude of your preference for Proton over Honda is equal to your preference for Honda over Peugeot.

Interval scale

An interval scale is a scale that not only arranges objects according to their magnitudes but also distinguishes this ordered arrangement in units of equal intervals. The best example is the temperature scale; the thermometer has markings at equal intervals. However, an important distinction between interval and ratio scales (see below) is that there is no absolute zero point in interval scales. If the temperature is 100 degrees Celsius, it cannot be said to be twice as hot as 50 degrees Celsius: the zero point is arbitrary and does not represent the total absence of heat (on the absolute Kelvin scale these readings are 373.15 K and 323.15 K, a ratio of only about 1.15, not 2). As we all know, temperatures can drop below zero. Other examples of indicators that use the interval scale are the consumer price index (CPI) and the intelligence quotient (IQ).

Ratio scales

Ratio scales deal with absolute rather than relative quantities. A ratio scale has the same characteristics as an interval scale but with the added benefit of an absolute zero point. For example, money and weight can be considered ratio scales. If you have no money in the bank, then your account balance is zero. No doubt, if you are bankrupt and owe money, your bank balance will be negative. But the point is that the zero point in your bank account does indicate a total absence of money. That is why it is a ratio scale.

Reliability and Validity of Questionnaires and Psychometric Tests

Two concepts which are absolutely critical to know and understand are reliability and validity.

Reliability

Reliability , in the context of tests, has two distinct meanings. One refers to the stability of scores from a test over time, the second to internal consistency of the items that purport to measure the construct in question. The reliability of scores of a test over time is known as test-retest reliability (rtt).

Test-retest reliability

If a person takes the same test twice, the scores on the first occasion should be identical to those on the second if the test is totally reliable. Consider this example. I take a measuring tape and use it to measure the length of a table. I do so by placing the correct end of the tape at one edge of the table and pulling it taut along the table to the other edge. I observe that the reading is 6 feet. I then remove the tape and take another measurement a minute later. The reading is now 5 feet 8 inches instead of 6 feet. Somebody must have chopped a piece off the table, you say! Suppose that has not happened and there has been no change in the table's length. Then it must have been the tape, you say.

What if I reveal to you that the measuring tape I used was made of stretchable rubber, and that in the second measurement I had stretched it more? Since the tape is elastic, it is not a reliable measuring instrument: each time it is used, it will give a different measurement depending on how much I stretch it. For a questionnaire, test-retest reliability is measured by correlating the scores from the same set of subjects who take the same test on two occasions. The correlation coefficient measures the degree to which the two sets of scores agree with each other. The more they agree, the higher the correlation coefficient, and thus the higher the test-retest reliability.

Correlation coefficients range from +1 (complete agreement) to −1 (complete disagreement). It is extremely rare for the correlation coefficient in such cases to be −1. That would mean that a person scores at one extreme (high or low) when doing the test for the first time and at the other extreme when doing the same test for the second time. For example, in an intelligence test with a test-retest reliability of −1, a person's first attempt might produce scores indicating that he is a genius while his second attempt at the same test produces scores suggesting very low intelligence. In the case of a personality test of low test-retest reliability, the person may obtain scores indicating that he is an extrovert on the first test and scores suggesting that he is an introvert when doing the same test, say, a week later. A correlation of zero indicates that there is absolutely no relationship between the two sets of scores. Most correlations tend to fall between 0 and +1.

Squaring the correlation coefficient indicates the extent of agreement between the two sets of scores; for example, a correlation of 0.8 shows an agreement between the sets of scores of 64 per cent. Obviously, the closer the test-retest reliability of a test is to +1, the better the test. If scores are vastly different on two occasions, there is no reason to trust either of them. A correlation of 0.7 is often regarded as a minimum figure. Also, the sample should have at least 100 subjects who are representative of the population for whom the test is intended. Of course, no real change in the characteristics of the respondents being measured should occur between the two test administrations.
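As a small illustration, test-retest reliability can be computed as the Pearson correlation between the two sets of scores. The sketch below is not from the original text; the scores are invented, and only 10 subjects are used for brevity (far fewer than the 100 or more recommended above):

```python
# Minimal sketch: test-retest reliability as the Pearson correlation
# between two administrations of the same test (scores are invented).
import numpy as np

first_test  = np.array([12, 18, 25, 30, 22, 15, 28, 20, 17, 24])
second_test = np.array([14, 17, 27, 29, 20, 16, 30, 19, 18, 25])

r_tt = np.corrcoef(first_test, second_test)[0, 1]
print(f"test-retest reliability r_tt = {r_tt:.2f}")
print(f"shared variance r_tt^2       = {r_tt**2:.0%}")
# By the rule of thumb in the text, an r_tt of at least .7 (measured on
# 100+ representative subjects, not the 10 shown here) is desirable.
```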

Factors influencing the measurement of test-retest reliability

Changes in subjects

The time period between the first test and the second should be a reasonable period, i.e. not too long or too short. There is no fixed rule as to the ideal time frame as this would depend on the nature and complexity of the test involved. A three-month gap can be considered as reasonable. However, if the test involves measuring personalities of children, it is quite possible that they may have changed during the period between the first and the second tests. For instance, during the first test, the child may have been an introvert but by the time the second test was done five years later, the child may have turned into an extrovert. Similarly, during the first test, the child may have obtained low scores on an ability test, but subsequently, the child's mental ability has improved such that he or she obtained much higher scores doing the same test a second time. In such cases, the correlation coefficients would suggest that the test-retest reliability is lower than it actually is. In other words, the changes in scores that contributed to lowered reliability are assumed to be errors of measurement, not real changes in variables.

In connection with the above, I myself discovered how the personality of adult subjects can change during the period between the two tests. During the administration of a personality test, the subject, who was completing the test for the second time a mere six months later, remarked, ‘It's funny how my perspective has changed since I became pregnant!’

Measurement error

Measurement errors can be divided into the following categories:

Environmental factors. Suppose that the first time around, the subject completes the test in a conference room with adequate lighting, ventilation and working space, free from outside noise and distraction (the way tests should be administered). The second time, the subject completes the test in an environment totally opposite to the earlier one, i.e. in dark, stuffy, cramped and noisy conditions. The scores on the two tests may vary. This is especially so where intelligence tests are concerned.

Fatigue of subjects. Subjects will not feel at their best when they are suffering from physical ailments such as colds, flu or headaches, or when they have had a late night out and are nursing a hangover. If the subjects are sick when doing the test on one occasion but healthy on the other, then scores may vary as a direct result of these differences in health.

Poor test instructions. If the test instructions are ambiguous, the subjects may not grasp what they are supposed to do when completing the test for the first time. They may only figure it out when doing the test the second time. Their scores may then be better in the second test (in the case of ability tests), or more accurate (in the case of personality tests).

Subjective scoring. It is important for the scoring of personality and intelligence tests to be objective, i.e. the test scores should be the same regardless of who happens to be the examiner. Subjective scoring can result in differences arising between two sets of scores, either in a case involving different scorers or when the same scorer is used on different occasions.

Guessing. This mainly affects ability tests. Unlike environmental factors, poor test instructions and subjective scoring, this source of unreliability is often beyond the control of the test administrator. Subjects will still guess despite being expressly told not to do so, as I have found in numerous surveys. In one test, the Ingleton Word Recognition Test (a perceptual ability test), respondents were required to identify sixty-four partially printed words. Before doing the test, they were expressly told not to guess if they were not sure what the words were. Despite the explicit, unambiguous instructions, subjects went ahead and guessed, getting some of the answers wrong.

Factors that can artificially boost test-retest reliability

Two factors can artificially increase the reliability score without representing true test-retest reliability in the sense of stable, error-free measurement.

Too short a time gap. If we were to administer a test to a group of subjects and, immediately after they have finished, ask them to do it again, we would obtain a high test-retest reliability (provided that the subjects are not bored or fatigued). This would be simply because the subjects remember their answers, so the correlation coefficient would be especially high.

Too simple a test. If the items are too easy for the subjects, then the subjects will always get them right and test-retest reliability will be unusually high. This, of course, applies to ability tests, as there are no right or wrong answers in personality tests.

Internal consistency reliability

This is the second meaning of reliability. Most psychometric test constructors aim to make their psychological tests as internally consistent as possible. This means that the different items (questions) that purport to measure the same construct should be highly correlated with each other if the test is to be internally consistent. If they are not highly correlated, then some of the items may be measuring one construct and other items may be measuring other constructs. Consider this example. Question 1, measuring stress, states: ‘I find it difficult to not worry about my work.’ Question 2, also measuring stress, states: ‘I frequently have so much fear that it affects my health.’ Scores on question 1 should be highly (but not perfectly) correlated with those on question 2. On this view, high internal consistency or reliability is a prerequisite for high validity; this is the more popular view held by psychometric theorists. Internal consistency is often used as a basis for retaining items in a test that correlate highly with each other and deleting items that do not.

There is, however, a significant dissenting view against the argument that internal consistency must always be high. Some have argued that an ideal questionnaire should have individual items correlating highly with the criterion but not with each other; this is an important feature of the 16 Personality Factors test (16PF). Many other tests contain items that are virtually paraphrases of each other, and it comes as no surprise that these items are highly correlated. Such items can be referred to as bloated specifics, and they have low validity (or, more precisely, low incremental validity). However, this is not to say that high reliability always precludes validity. What you should be concerned with is that questionnaires attempting to measure multi-dimensional constructs of considerable breadth and complexity, such as extraversion, should have items that tap the different aspects of that construct. Hence, lower internal consistency should actually be preferred in such circumstances, contrary to the popular belief that test constructors should always strive for high internal consistency.

From the arguments above, we can safely conclude that internal consistency should be high, but not too high, especially where the construct has considerable breadth, such as certain personality traits.

Statistical methods of measuring internal consistency reliability

Split-half reliability. This technique is accomplished by splitting a multi-item scale in half and then correlating the summed scores of the first half with those of the second half. It can be done in a variety of ways. One method is to use the first half and the second half of the test; another is to use the scores on the odd- and even-numbered items. Sometimes it is done by randomly assigning items to one half or the other. However, the major problem with the split-half technique is that the estimate of the reliability coefficient depends wholly on the manner in which the items were split, i.e. different splits give different estimates of reliability. Consequently, this method of estimation is no longer popular in the age of computerisation. The next method is more popular.
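Before turning to that method, here is a minimal sketch (not from the original text) of the odd/even split just described, using a small invented matrix of item scores in which rows are respondents and columns are items. In practice the half-test correlation is usually stepped up with the Spearman-Brown formula, which is noted in a comment but not applied here:

```python
# Minimal sketch of split-half reliability using an odd/even item split
# (item scores are invented; rows = respondents, columns = items).
import numpy as np

scores = np.array([
    [4, 3, 4, 5, 3, 4],
    [2, 2, 3, 2, 1, 2],
    [5, 4, 5, 5, 4, 5],
    [3, 3, 2, 3, 3, 3],
    [1, 2, 1, 2, 2, 1],
])

odd_half  = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6

r_halves = np.corrcoef(odd_half, even_half)[0, 1]
print(f"correlation between the two halves = {r_halves:.2f}")
# In practice this figure is usually adjusted upwards with the
# Spearman-Brown formula, and a different split gives a different
# estimate, which is the weakness noted in the text.
```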

Cronbach Coefficient Alpha measure of internal consistency. This overcomes the weakness of the split-half technique. The Cronbach Alpha technique computes the mean of the reliability estimates for all possible ways of splitting the set of items in half. With the aid of computers and software packages such as SPSS (Statistical Package for the Social Sciences), the Cronbach Coefficient Alpha can be obtained in seconds. There are even options within SPSS to obtain the reliability coefficient that would result if each item of the scale were deleted. This is useful for identifying whether deleting one of the items would make the scale much more reliable. However, it rests on the assumption that higher reliability should always be sought (see the earlier argument against this view). In practice, I have found that only negligible improvements can be obtained where the scales involved are already well established; this may not be the case where a new questionnaire is being developed.
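The sketch below, which is not from the original text, computes Cronbach's alpha from the standard variance-based formula, alpha = (k/(k-1)) x (1 - sum of item variances / variance of the total score), using the same invented matrix of item scores as the split-half sketch:

```python
# Minimal sketch: Cronbach's coefficient alpha from the standard
# variance-based formula (rows = respondents, columns = items; data invented).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([
    [4, 3, 4, 5, 3, 4],
    [2, 2, 3, 2, 1, 2],
    [5, 4, 5, 5, 4, 5],
    [3, 3, 2, 3, 3, 3],
    [1, 2, 1, 2, 2, 1],
])

print(f"Cronbach alpha = {cronbach_alpha(scores):.2f}")
# The 'alpha if item deleted' figure reported by packages such as SPSS
# corresponds to recomputing this after dropping one column at a time.
```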

Bearing in mind that internal consistency reliabilities ought to be high, but not too high (see arguments earlier), Cronbach Coefficient Alpha should ideally be high, around .9, especially for ability tests, and should not drop below .7.

Validity

A measurement scale is valid if it measures what it is supposed to measure. Let me ask you a question. What would you use to measure your height? A ruler or a measuring tape are valid instruments to measure height. Would you use a thermometer to measure your height? Of course not. Thermometers are valid instruments to measure temperature but not height. Hence if you purport to measure your height using a thermometer, it would be an invalid measure.

The following are examples of validity of questionnaires:

Face (content) validity

It is important to point out that face validity is not a requirement of true validity. A test is said to be face valid if, on the face of it, it appears to measure what it claims to measure. From the point of view of selection, the question that arises is whether tests used in selection should always be face valid. It has been suggested that tests used in personnel selection should be face valid since this can increase the motivation of the subjects to expend their best efforts when completing the tests, an essential element if testing is to be valid. For example, when trying to select pilots, face-valid tests of tracking ability or rapid reaction time will motivate full cooperation because subjects believe them to be valid indicators of flying skill. On the other hand, if these subjects were subjected to tests that required them to make animal noises, or to add up numbers while distracted by jokes (genuine objective tests in the compendium of tests by a distinguished psychologist), many would refuse, thinking them absurd, even if they were valid. The subjects would object to having to spend valuable time and energy completing tests whose nature, in their view, is totally irrelevant to the job that they have applied for. If forced to do so, they will give little thought to completing the questionnaires, thus affecting the test scores. This problem becomes even more serious in third world countries, where the completion of questionnaires is relatively rare compared with the US.

However, as with the issue of reliability mentioned earlier, a minority are of the dissenting view that tests should not always be face valid. This is because in a face-valid test, subjects can guess what the questions are really trying to measure. Hence it is likely to induce fake or deliberate distortion, especially during personnel selection. A person applying for the job of a salesperson will not admit that he or she prefers quiet interests such as reading books as opposed to meeting people, even if it were true. Job applicants will have a picture in their minds of the ideal person that the company seeks. The applicant would want the company to think that he or she is that person. Consequently, applicants' answers will try to reflect the desired qualities.

In conclusion, personality questionnaires should not be too face valid. Some of the items in the 16PF were deliberately worded in such a way that they were not face valid. For instance, it is extremely unlikely for respondents to be able to deduce that preferring chess to bowling or preferring violin solos to military band marches means that one is more open to change. Respondents are equally unlikely to deduce that preferring Shakespeare to Columbus means that one is more sensitive or that having fantastic and ridiculous dreams means that one is not a perfectionist. Also, questions on whether one should take a gamble or play it safe would have been thought to relate more to risk tendency than to dominance. The rationale for making questions less face valid is to prevent subjects from determining exactly what the tests are measuring and hence make it more difficult for subjects to give fake answers during selection. The greater complexity and length of the test makes it less ‘transparent’ to the applicant and so less susceptible to faking.

Concurrent validity

A test is said to possess concurrent validity if it can be shown to correlate highly with another test of the same variable which was administered at the same time. Correlations above .75 would be regarded as good support for concurrent validity of a test where there are benchmark criterion tests. It would be extremely difficult to obtain correlations above .9 as this would indicate that the tests are virtually identical in their ability to measure the construct in question.

There are some problems regarding benchmark tests. First of all, there are very few psychometric tests that can properly be regarded as benchmark tests. The next logical question is: if there is a test so good that it can be taken as the standard, what is the point of a new test? There are several reasons why new tests are sometimes still constructed. First, a new test may have been constructed for purely economic (and self-serving) interests. Researchers may not wish to continue paying royalties for the use of existing questionnaires, so they develop their own; they may even charge others for using theirs. Secondly, the new questionnaire may have been developed as a shorter alternative to the original. Sometimes the short version will contain only some of the items taken from the original questionnaire; for example, there is a 20-item short version of the original 100-item Minnesota Satisfaction Questionnaire. In other instances, the short version may be totally different from the original. For instance, Ahmad (2001) compared two scales that measure job satisfaction: the Job Descriptive Index (JDI) (Smith, Kendall & Hulin, 1969), a well-established job satisfaction questionnaire containing 72 items, and the Numerical Facet Satisfaction Scales (NFSS), which has five items. The JDI consists of five separate sections, each measuring one facet, i.e. the job itself, pay, promotion, supervision and co-workers. The NFSS has five scales, one for each facet, and these facets are identical to those in the JDI. Each NFSS scale ranges from 1 to 20 (1 being the lowest and 20 the highest level of satisfaction). The correlations between the two scales were found to be as follows:

Correlations between JDI and NFSS facet satisfaction scores (** significant at the 0.01 level):

  • JDI work facet satisfaction and NFSS work facet satisfaction: .626**
  • JDI pay facet satisfaction and NFSS pay facet satisfaction: .633**
  • JDI promotion facet satisfaction and NFSS promotion facet satisfaction: .408**
  • JDI supervision facet satisfaction and NFSS supervision facet satisfaction: .757**
  • JDI co-worker facet satisfaction and NFSS co-worker facet satisfaction: .675**

The presence of a highly significant correlation between facet satisfaction as measured by the JDI and that measured by the NFSS was expected; in fact, it would be very surprising if the correlation were not significant. However, one would expect the correlations to be stronger, since the two instruments were supposed to measure exactly the same thing and the respondents completed both instruments at the same sitting. This raises the question of why the correlation coefficients were not as high as expected. One possible cause is that the JDI uses items that require respondents to indicate ‘yes’, ‘no’ or ‘undecided’ against a particular description of the work environment. The JDI assumes that a given description of the work situation indicates either satisfaction or dissatisfaction regardless of who the worker happens to be. However, there are a few (only a few) descriptions of the work situation in the JDI (e.g. ‘routine’, ‘simple’) that, if true, would cause satisfaction in some people and dissatisfaction in others, depending on their personality. This is one weakness of the JDI. This is not to say that the NFSS has no limitations either: human feelings cannot be precisely quantified as integers on a 20-point scale. After taking into account the weaknesses of both instruments, one would not expect the correlation between the two to exceed the .80 mark, and the average correlation obtained was in fact only slightly below the .75 threshold mentioned earlier. Indeed, a correlation above .90 would theoretically not be possible. In conclusion, the correlations found were reasonably satisfactory but could be better.

Predictive validity

A test can be regarded as having predictive validity if it is able to predict some criterion or other. This is extremely important in personnel selection. For instance, the predictive validity of an intelligence test might be demonstrated by correlating intelligence with success at the job. The problem with this is that it assumes that success at work depends, at least in part, on intelligence. Success at work could be measured in terms of the number and amount of salary increments. Since intelligence is only one factor in success at work (other factors may be the number of contacts the person has, experience, technical knowledge, etc.), a moderate (but significant) correlation of .3 or .4 can properly be regarded as evidence of the predictive validity of a test.

The Art of Drafting Questionnaires

The problem with unstructured, open-ended verbal interviews can be explained by the following example: if forty different interviewers conduct interviews in different parts of the country, the data they collect will not be comparable unless they follow specific guidelines and ask questions and record answers in a standard form. Thus there is a pressing need for the proper development of a questionnaire.

Objectives of a questionnaire

First, it must translate the information needed into a set of specific questions that the respondents can and will answer. This is not as easy as it sounds. Two apparently similar ways of presenting a question may yield different information.

Second, a questionnaire must motivate and encourage the respondent to become involved in the questionnaire answering process. Incompletely answered questionnaires will result in missing values. In designing a questionnaire, the researchers should strive to minimise respondents' fatigue, boredom and effort in order to reduce incomplete answers and non-response.

Third, a questionnaire should minimise response errors. Response error can be defined as the error that arises when respondents give inaccurate answers or when their answers are mis-recorded or misanalysed. A questionnaire can be a major source of response error. Minimising this error is an important objective of questionnaire design.

How to draft questions

Zikmund¹ (2000, pp. 310–326) lists the following points to be borne in mind when drafting questionnaires:

  1. Questions have to be relevant and accurate.
  2. Questions can be phrased as open-ended or closed-ended questions.
  3. Phrasing questions can be different for mail, telephone and personal interview surveys.
  4. Avoid leading questions.
  5. Avoid loaded questions.
  6. Avoid ambiguity: be as specific as possible.
  7. Avoid double-barrelled items.
  8. Avoid making assumptions.
  9. Avoid burdensome questions that may tax the respondent's memory.
  10. Be aware of order bias.
  11. Pay attention to the layout of the questionnaire.

Questions have to be relevant and accurate

A question should only be asked if it is relevant. Respondents may be unwilling to divulge information that they do not see as serving a legitimate purpose. Why should a firm marketing cereals want to know the respondent's age, income and occupation? Explaining why the data are needed can make the request for information more legitimate and may increase the respondents' willingness to answer. A statement such as ‘To determine how the consumption of cereal and preferences for cereal brands vary among people of different ages, incomes and occupations, we need information on …’ can make the request more legitimate.

1 From Business Research Methods, 6th edition, by W.G. Zikmund. © 2000. Reprinted with permission of South-Western, a division of Thomson Learning.

Questions can be phrased as open-ended or closed-ended questions

Open-ended questions allow respondents to answer in any way they choose. A closed-ended question, on the other hand, requires the respondent to choose from a set of alternatives given by the question designer. Closed-ended questions help respondents to complete the questionnaire faster by merely ticking the appropriate box. They also help the researcher to code the answers more easily, as numbers have already been assigned to them. However, the disadvantage of closed-ended questions is that it is not always possible for the researcher to list all the possible answers to a question. In other words, the answer that the respondent wishes to give may not be in the list. This may happen in exploratory studies where the researcher does not yet have a full grasp of the phenomena of interest. Therefore, the extent to which a questionnaire should contain open-ended questions is often a matter of judgement and common sense.

Phrasing questions can be different for mail, telephone and personal interview surveys

Generally speaking, telephone and personal interview questions should be shorter compared to mail interview questions. Questionnaires for telephone and personal interviews should be written and read in a conversational manner.

Avoid leading questions

‘What is it that makes you like your job?’ is a leading question because it assumes that the respondent likes his or her job. Even if it is an open-ended question and the respondent can answer ‘I don't’, the seed of bias has already been sown in the respondent's mind. The respondent must really hate his job to say ‘I don't’. An important point to note is that although the questions per se may not be leading, the manner in which they are arranged can make them so. For instance, if respondents are asked how often they pray followed by how religious they are, they are more likely to say that they are religious if they have already answered that they pray a lot. It is the respondents' desire for consistency that makes them answer this way. In this case, the answer given to an earlier question has affected the respondent's answer to later questions.

Avoid loaded questions

Loaded questions are sensitive questions or questions that have a social desirability bias. Respondents can sometimes answer untruthfully or refuse to answer sensitive questions. One approach may be to desensitise the issue. For example one might say, ‘As you know, many people throw litter on the streets without a second thought. Do you happen to have thrown litter in such a manner yourself?’ This attempts to describe the event in question as commonplace and nothing unusual. It might not work with more sensitive topics such as murder and theft. Another method is to ask, ‘Do you know of any people who have thrown litter on the streets?’ Pause for the reply and then ask, ‘How about yourself?’ The first part of the question totally detaches the answer from the respondent as a person. The respondent is answering on behalf of other people. While some psychologists may argue that the respondent is not actually talking about other people but actually himself or herself (some kind of projective technique), there is no way of knowing exactly when he is talking about himself and when he is talking about other people. Nevertheless, even if the first part of the question manages to detach the issue from the respondent, the following question brings it right back home to the respondent. However, this method may not work for more sensitive questions.

Another aspect of social desirability, often not mentioned in textbooks, is the reluctance of respondents to admit ignorance. Studies have shown that respondents often answer questions even when it appears that they know very little about the topic. To avoid this problem, the researcher should start with a question of whether the respondent knows the topic or not. Furthermore, the question should not be phrased or asked by the interviewer in such a way as to make it embarrassing for the respondent to admit ignorance.

Avoid ambiguity: be as specific as possible

Although the questions may appear perfectly clear to the researcher, they may not be so to the respondent. For instance, the questionnaire distributed to a respondent at his workplace may contain the question, ‘Are you happy?’ The respondent may have interpreted the question as, ‘Are you happy at your workplace?’ whereas in fact the researcher intended to enquire about the feelings of the respondent at work as well as outside it. Pretesting the questionnaire can help bring out ambiguities.

Avoid double-barrelled items

Consider the question, ‘Do you enjoy drinking coffee and tea?’ A ‘yes' answer will presumably be clear but what if the answer is ‘no’? Does this mean that the respondent enjoys:

  • Coffee but not tea?
  • Tea but not coffee?
  • Neither tea nor coffee?

Such a question is called a double-barrelled question because two or more questions are combined into one. Catch-all questions can even combine more than two questions in one. Such questions can be confusing to respondents and result in ambiguous responses.

Avoid making assumptions

Consider the following question:

Should the University of Antarctica continue its excellent Eskimo-training programme?

  1. Yes
  2. No

This question contains the implicit assumption that people believe that the Eskimo-training programme is excellent.

Avoid burdensome questions that may tax the respondent's memory

Some questions may require respondents to recall experiences from the past that are hazy in their memory. Most people are able to recall important or unusual details in their lives but forget minor details or insignificant events. A researcher should avoid asking questions requiring recollection of insignificant events such as, ‘What did you have for lunch last Wednesday?’ However, questions involving one-off and significant events, like ‘Where did you get married?’, will usually elicit more accurate answers. The same event can be remembered more accurately by people most affected by it than by people who were less affected. However, even a significant event, if cognitively painful to respondents, is likely to be repressed, forgotten and pushed out of their conscious minds.

Be aware of order bias

Order bias is bias caused when earlier questions influence the answers given to later ones. As a rule of thumb, general questions should precede specific questions. This prevents the specific questions from biasing the responses to the general ones. Consider the following sequence of questions:

  1. What considerations are important to you in selecting a supermarket?
  2. In selecting a department store, how important is the location?

Note that the first question is general whereas the second is specific. Going from general to specific is called the funnel approach. This is a strategy for ordering questions in a questionnaire in which the sequence starts with general questions, followed by progressively more specific ones. The funnel approach is particularly useful when information has to be obtained about respondents' general choice behaviour and their evaluation of specific products.

The handling of filter questions is extremely important. Sometimes whole blocks of questions should be left out for some respondents because they simply do not apply to them. Branching questions should be designed carefully. The purpose of branching questions is to guide an interviewer through a survey by directing the interviewer to different spots on the questionnaire, depending on the answers given. Failure to make this clear will result in frustration for the interviewer and the respondent, not to mention wasted time and an appearance of unprofessionalism.

Also, placing difficult and sensitive questions at the beginning of the questionnaire may put respondents off completing the remainder of it. As a matter of practice, sensitive and difficult questions should be placed towards the end of the questionnaire, after rapport between the respondents and the interviewer (through the questionnaire) has been developed and after the respondents have come to understand why such questions need to be asked. In short, the sequence of the questionnaire should be such that the respondent is led from questions of a general nature to those that are more specific, and from questions that are relatively easy to answer to those that are progressively more difficult. This facilitates the smooth progress of the respondents through the questionnaire.

Pay attention to the layout of the questionnaire

The layout of the questionnaire must promote fluency of questioning. The formatting, spacing and positioning of questions can have a significant effect on the results, particularly in self-administered questionnaires. The tendency to crowd questions together to make the questionnaire look shorter is arguably false economy and should be avoided. Overcrowded questionnaires with little blank space between questions can increase respondent fatigue and eyestrain and consequently lead to errors in data collection. Moreover, they give the impression that the questionnaire is complex, which can result in lower co-operation and therefore lower completion rates. Although shorter questionnaires are more desirable than longer ones, the reduction in length should not be obtained at the expense of crowding. It is good practice to divide a questionnaire into several parts.

If the questionnaire is reproduced on poor quality paper or is otherwise shabby in appearance, the respondents will think that the project is unimportant and the quality of the response will be adversely affected. Therefore, the questionnaire should be reproduced on good quality paper and have a professional appearance.

Instructions for individual questions should be placed as close to the questions as possible. Instructions should be clear, unambiguous, easy to understand and in simple everyday language. It is often helpful to repeat the instructions on the top of every page in respect of sections which take up more than one page.

The order in which questions are asked should be logical. All questions relating to a particular topic should be asked before beginning a new topic. When switching topics, brief transitional phrases should be used to help respondents switch their train of thought.

A researcher should avoid splitting a question, including its response categories, across two pages. Split questions can mislead the interviewer or the respondents into thinking that the question has ended at the bottom of a page, so that they overlook the responses on the following page. This will result in answers based on incomplete questions or incomplete alternative answers. Vertical response columns should be used for individual questions. It is generally easier for the interviewer and the respondents to read down a single column rather than sideways across several columns. The convention is to have questions on the left-hand side and the choice of answers on the right. This assumes that most respondents are right-handed. If open-ended questions are given, sufficient space should be provided to enable the interviewer or respondent to record the answers verbatim. If there is a provision for ‘others' then adequate space should be left to record what that other is. Where more than one answer/code can be ringed, the instruction should be displayed clearly beside the question.

Other issues

The language of the questionnaire

The language in the questionnaire should approximate the level of understanding of the respondents. Convoluted instructions and jargon are comprehensible only to lawyers and therefore should not be used in questionnaires intended for the layman. Some respondents who have not had higher education will not understand certain technical words or phrases. The best way to uncover such words is through pretesting. Furthermore, certain slang words common in Western culture but totally alien to Eastern culture should be avoided. For example, the word ‘mugging’ means robbery in the West but it means studying hard in Singapore. There are differences in the use of words even among Western countries. For instance, the phrase ‘highly strung’ (meaning tense) is commonly used in America. However, when the author used the same phrase in a questionnaire distributed to Welsh and English respondents, the author was asked to clarify its meaning. Even within America itself, different words can mean different things. For instance, the word ‘bad’ can even mean ‘good’ among certain groups of people in America. In short, the interpretation of words often depends on the respondents' background, culture and education. Another important point to note is that the researcher should be sensitive to cultural differences, and the questionnaire must be adapted if it is intended to be used in different countries. A seemingly innocent question like, ‘What did you have for lunch today?’ can be extremely insulting to a Muslim during the fasting month of Ramadhan. One should undertake preliminary pilot work in order to understand the potential respondents, their assumptions and perspectives, and what sorts of questions are seen as legitimate.

Positively and negatively worded items

It is advisable to phrase some questions negatively as well as positively, instead of wording every question positively. Doing so minimises the tendency of respondents to mechanically tick points along one end of the scale. The disadvantage of negatively worded items is that they are often complex and confusing: a respondent has to give a negative answer to a negative question, when it would be much easier to give a positive answer to a positively worded question. However, although wording all questions positively makes them easier to answer, research has shown that this leads to extremity of answers. The recommended practice is to alternate positively and negatively worded items.
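If negatively worded items are used, they must be reverse-scored before the items are summed or averaged; otherwise the positive and negative items will cancel each other out. The following is a minimal sketch in Python, assuming a five-point scale; the item names, wordings and responses are purely hypothetical.

  # Reverse-scoring negatively worded items before computing a total score.
  # Assumes a 1-5 Likert scale; item names and responses are hypothetical.
  responses = {
      "q1_pos": 4,   # "I find my work meaningful"      (positively worded)
      "q2_neg": 2,   # "I often dread coming to work"   (negatively worded)
      "q3_pos": 5,   # "My colleagues support me"       (positively worded)
      "q4_neg": 1,   # "My workload is unbearable"      (negatively worded)
  }
  NEGATIVE_ITEMS = {"q2_neg", "q4_neg"}
  SCALE_MAX = 5

  def reverse_score(value, scale_max=SCALE_MAX):
      # On a 5-point scale, 1 becomes 5, 2 becomes 4, and so on.
      return scale_max + 1 - value

  scored = {item: (reverse_score(v) if item in NEGATIVE_ITEMS else v)
            for item, v in responses.items()}
  total = sum(scored.values())
  print(scored)  # {'q1_pos': 4, 'q2_neg': 4, 'q3_pos': 5, 'q4_neg': 5}
  print(total)   # 18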

Operationalising the concept

Questions such as ‘How do you rate your job satisfaction with your current job?’ can prove awkward. Job satisfaction is already a technical term; what makes it even more difficult is that such a question applies a uni-dimensional scale to a multi-dimensional concept. Abstract concepts like job satisfaction should be broken down into a series of meaningful questions like, ‘Is your pay sufficient to meet your living expenses?’, ‘Are you able to make friends at work?’, etc. Job satisfaction can be more accurately measured by measuring its individual facets such as pay, the work itself, promotion, supervision and co-workers.
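To illustrate, once the facets have been measured with separate items, a score can be computed for each facet and the facets examined individually. The sketch below in Python is only illustrative; the facet names, item keys and responses are hypothetical, and all items are assumed to have already been scored in the same direction on a 1 to 5 scale.

  # Computing facet scores for a multi-dimensional concept (job satisfaction).
  # Facet names, item keys and responses are hypothetical; all items are
  # assumed to be scored 1-5 in the same direction.
  facets = {
      "pay":         ["pay1", "pay2"],
      "coworkers":   ["cow1", "cow2"],
      "supervision": ["sup1", "sup2"],
  }
  respondent = {"pay1": 2, "pay2": 3, "cow1": 5, "cow2": 4, "sup1": 3, "sup2": 4}

  facet_scores = {facet: sum(respondent[i] for i in items) / len(items)
                  for facet, items in facets.items()}
  print(facet_scores)  # {'pay': 2.5, 'coworkers': 4.5, 'supervision': 3.5}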

Number of points in a scale

Attitude scales can be dichotomous (yes/no) or polytomous (more than two alternatives). I feel that polytomous scales are usually better than dichotomous scales as the former are more sensitive. Odd-numbered scales, such as those ranging from 1 to 5, have the advantage of a midpoint, in contrast to even-numbered scales (ranging, say, from 1 to 4).

Complex preference questioning

When respondents are asked to rank too many items in order of their preference, response errors will result. Paired comparisons will yield more accurate results.
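One practical way to set up paired comparisons is to present every possible pair of items and ask the respondent to choose one item from each pair. Here is a minimal sketch in Python; the items listed are hypothetical. Note that the number of pairs grows as n(n-1)/2, so this works best with a modest number of items.

  # Generating paired comparisons instead of asking for a full ranking.
  # The items are hypothetical.
  from itertools import combinations

  items = ["Brand A", "Brand B", "Brand C", "Brand D"]
  pairs = list(combinations(items, 2))  # n(n-1)/2 = 6 pairs for 4 items

  for a, b in pairs:
      print(f"Which do you prefer: {a} or {b}?")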

Hypothetical questions

People will not always do in an actual situation what they think they will do when asked hypothetically. How many times in our lives have we said that we would do something or feel a certain way if something happened? For instance, an angry mother would say that if the prodigal son were to return, she would cast him away, only to find that she could not do so when the situation in fact occurred. Furthermore, consider a question such as, ‘How much will I have to pay you before you agree to allow me to cut off your right hand?’ It is unrealistic to think that the respondent can make an objective and rational judgement about this.

Therefore, the use of hypothetical questions should be avoided as much as possible. It would be better to ask questions, say, about intention to leave a job by asking whether the respondent has ever considered leaving in the past. In this way the question is no longer hypothetical but requires a description of actual past behaviour, i.e. whether the respondent has thought of leaving in the past.

Secondary questions

This is when a respondent becomes a proxy for another respondent. It is a mistake to assume that a husband knows subjective things like habits, tastes and preferences of his wife and vice versa. Care should be taken even when asking objective questions of fact such as the household income. You should ask the respondent himself or herself if you want to get an accurate picture.

Problems Unique to Interview Schedules

While some of the problems we have discussed apply just as much to interview schedules as they do to self-administered questionnaires, the following are some of the potential problems which are unique to interview schedules.

When the respondent, in giving an answer, actually answers two or more questions at once

The interviewer may either record all the answers at once or if he is unsure, read out the subsequent questions exactly as worded as if the respondent had not already answered them earlier.

When the respondent is unclear on a term and explanations or definitions are not provided by the interviewer

This is an inherent weakness in the interview schedule. Where possible, interviewers should be trained and briefed regarding the questionnaire before they begin collecting data. Interviewers cannot expect respondents to understand the questions if they do not understand the questions themselves.

When there are too many open-ended questions and the respondent is talking too fast while the researcher is too slow in recording

As mentioned earlier, a questionnaire should not contain too many open-ended questions but if this is truly unavoidable then the respondent should be asked to speak more slowly. The researcher must be able to record the answers speedily so as not to impede the flow of the interview. In extreme cases, a tape-recorder can be used. Taping the interview, however, will probably result in the respondents not answering certain questions truthfully, especially those that are sensitive and have social desirability bias.

When the respondent's answers are contradictory, clearly showing that the respondent has not understood the question

The answers can be read out again to the respondent to ensure that it was not the interviewer who recorded the answers wrongly. This procedure should, however, not be overdone as it disrupts the smooth flow of the interview process. The interviewer should also not give his own interpretations to the questions or suggest answers to the respondents. The researcher should probe inadequate answers non-directively.

Overuse of show cards in an interview

This can be counter-productive if the interviewer ends up fumbling, creating confusion and even leaving out some cards. A reasonable number of show cards should be used. Show cards should assist and smooth the interview process, not disrupt it. Furthermore, untidy or difficult-to-read cards are a nuisance rather than an aid.

Non-standardised Interview

Interview schedules can be standardised or non-standardised. Standardised schedules use identically worded questions for all respondents, whereas in non-standardised interviews the interviewer compiles a list of information required from the respondent and formulates the questions as he goes along. Non-standardised interviews are heavily dependent on the interviewer's skill, hence are extremely prone to errors and are not recommended. They are suitable only for focus groups or in-depth questioning in exploratory studies.

In conclusion, it can be seen that some of the problems involving questionnaires and interview schedules can be solved whereas some cannot. In spite of the existence of problems which can never be solved, research involving questionnaires and interview schedules should nevertheless continue. Such research results are still regarded as valid and are a great contribution to knowledge.

However, this brings to our attention problems in questionnaires and interview schedules which can be solved but were not. A large number (too large to be ignored) of questionnaires were drafted seemingly in ignorance of the above problems and in violation of principles of proper questionnaire drafting techniques. One does begin to wonder about the validity of a vast number of studies in which researchers have not gone through painstaking efforts to ensure that their questionnaires do not suffer from the above problems. No wonder there is such a mass of research with conflicting results!

Sampling Methods

A sample is a subset of a larger population. There are pragmatic reasons why a sample should be taken rather than a complete census. PhD students have budget and time constraints. They simply cannot afford the time and money to do a census, i.e. interview or distribute questionnaires to every single person in the population. The irony of it is that sometimes a sample can be more accurate than a census. In a census of a large population, there is a greater likelihood of errors caused by the sheer volume of work. Data may be recorded wrongly due to exhaustion. A small group of well-trained and closely supervised research assistants can do a better job than a large group of poorly trained, poorly coordinated and poorly supervised research assistants. A census may also be impossible to conduct where it involves the destruction of units. Some research projects in quality-control testing require the destruction of the items being tested, for example, tyres, electrical fuses, etc. If such a census were done, there would be no product left.

Having established the usefulness of sampling, I shall go on to describe the two main categories of sampling: probability sampling and non-probability sampling. Probability sampling is a sampling technique in which every member of the population has a known, non-zero probability of selection. In order for that to be possible, you need to know the exact number of units of analysis (respondents) that comprise the entire population. Take, for example, practising lawyers in Malaysia. We know the exact number because we can obtain the list (or sampling frame) from the Bar Council. Other professional associations keep records of their accountants, doctors and engineers. But what about managers? Do they need to register anywhere? The answer to that is no. We do not know the exact number of managers in Malaysia. What about the number of managers in any named company in Malaysia? Yes, the exact number can be obtained from the company itself. Hence the question that you have to ask yourself is, ‘Can I find out the exact number of elements in the population?’ If the answer to that is yes, then you can choose probability sampling methods. If the answer is no, then you have to opt for non-probability sampling methods.

Let us look at probability sampling methods. The first is the simple random sampling procedure. Rolling a fair die, flipping a coin, picking balls from a bag or picking the winning raffle ticket from a large drum are typical examples of simple random sampling. Assuming that the person drawing the ticket is not looking into the drum, and the tickets are thoroughly stirred, each ticket should have an equal chance of being picked.
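In practice, a simple random sample is usually drawn from a sampling frame using random numbers rather than a physical drum. The sketch below, in Python, shows one way this could be done; the sampling frame and sample size are hypothetical.

  # Drawing a simple random sample from a known sampling frame.
  # The frame and sample size here are hypothetical.
  import random

  sampling_frame = [f"Lawyer {i}" for i in range(1, 1001)]  # N = 1,000 names
  sample_size = 50

  # random.sample draws without replacement; every name has an equal chance.
  sample = random.sample(sampling_frame, sample_size)
  print(len(sample), sample[:3])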

Since every unit has an equal chance of selection, simple random sampling appears to be a fair method of selection. However, random sampling errors can still occur in that the final sample can be unrepresentative of the population. I gave an example earlier in this chapter about picking 10 balls that are all white from a sack that contains 50 white and 50 black balls. So there has to be a sampling method that can ensure that the sample is drawn from different parts of the population. One such technique is the systematic sampling procedure, in which an initial starting point is selected by a random process and then every nth name on the list is selected, for example every 100th name that appears in the telephone directory. This ensures that the sample is relatively well spread out throughout the telephone book. There is only one problem with this procedure: when the list itself has a systematic pattern, i.e. it is not random. Suppose you decide to draw your sample from the residents of a housing estate. Your first random selection results in the owner of a corner lot being selected. You then proceed to choose every seventh house along the street, which coincidentally happens to be a corner-lot unit each and every time. Your entire sample would therefore consist of corner-lot residents, who may be different in characteristics from intermediate-lot owners; corner-lot owners are usually wealthier than intermediate-lot owners. This problem is called periodicity and it occurs whenever the sampling frame or list has a systematic pattern.
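A minimal sketch of the systematic procedure in Python follows; the sampling frame, sample size and interval are hypothetical. Note that the code cannot detect periodicity for you, so the list must still be inspected for repeating patterns.

  # Systematic sampling: a random start, then every nth unit down the list.
  # The frame and sample size are hypothetical.
  import random

  sampling_frame = [f"Household {i}" for i in range(1, 1001)]  # N = 1,000
  sample_size = 100
  interval = len(sampling_frame) // sample_size                # n = 10

  start = random.randint(0, interval - 1)   # random starting point
  sample = sampling_frame[start::interval]  # every nth unit thereafter
  print(len(sample))                        # 100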

Stratified sampling is a probability sampling method in which sub-samples are drawn from within the different strata of the population. Examples of strata are rural and urban. Variations of stratified sampling are:

  • Proportional stratified sample in which the number of sampling units drawn from each stratum is in proportion to the relative size of that stratum.
  • Disproportional stratified sample in which the sample size for each stratum is not proportionate to the relative size of that stratum, because the number selected from each stratum is decided by analytical considerations.

There is a third option called the optimal allocation stratified sample which purportedly is based on both the size and variation of each stratum. Cluster sampling is another form of probability sampling. However, discussions of both these methods are beyond the scope of this book.
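To make proportional allocation concrete, here is a minimal sketch in Python; the strata, their sizes and the overall sample size are hypothetical. Rounding the per-stratum allocations can make the total deviate slightly from the intended sample size, so the figures should be checked.

  # Proportional stratified sampling: the number drawn from each stratum is
  # proportional to that stratum's share of the population. Strata are hypothetical.
  import random

  strata = {
      "urban": [f"Urban resident {i}" for i in range(1, 801)],  # 800 units (80%)
      "rural": [f"Rural resident {i}" for i in range(1, 201)],  # 200 units (20%)
  }
  population_size = sum(len(frame) for frame in strata.values())  # 1,000
  sample_size = 100

  sample = []
  for name, frame in strata.items():
      n_stratum = round(sample_size * len(frame) / population_size)  # 80 urban, 20 rural
      sample.extend(random.sample(frame, n_stratum))

  print(len(sample))  # 100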

There are four types of non-probability sampling methods:

  1. Convenience
  2. Judgement or purposive
  3. Quota
  4. Snowball

Convenience sampling is conducted by obtaining units or people who are most conveniently available. For example, you may choose your office colleagues, classmates, neighbours, relatives and friends as your respondents. The advantage of this method is that data can be collected quickly and economically. The disadvantage is that the selection criteria are unscientific and the sample is more likely to be unrepresentative of the population. Perhaps a slightly better method is judgement or purposive sampling, in which the respondents are selected based upon the judgement of the researcher about some appropriate characteristic of the sample members. The difference between judgement and convenience sampling is therefore small, and I have known people to claim that they have used judgement sampling just because it sounds better whereas in fact they have used convenience sampling. Now, the question that you might ask is, ‘Is convenience sampling acceptable in a PhD research?’ I would say that using the convenience sampling method is not fatal and can be accepted in PhD research. However, it would be better to use judgement sampling because here you can demonstrate that you have made efforts to ensure that the respondents selected are in fact suitable because they fulfil certain criteria in your research. In addition, you can establish that your sample is not peculiar or atypical and that it is somewhat representative of the population, be it shopfloor workers, managers or professionals.

Quota sampling is a non-probability sampling procedure that ensures that certain characteristics of a population sample are represented to the extent that the researcher desires. Quota sampling is similar to stratified sampling—the difference is that the former is a non-probability sampling method whereas the latter is a probability sampling method. Quota sampling can be described by using a very simple example. Suppose you decide that you wish to have 100 respondents comprising 60 males and 40 females. You commence by selecting first the female respondents. Once you have managed to interview 40 females, you proceed to interview only male respondents—up to 60 of them. The advantage of quota sampling is that you can ensure that the sample you have selected is somewhat representative of the entire population (at least as far as the characteristics that you have selected are concerned). Note that I said ‘somewhat’. This is because, since the exact number of elements in the population is not known, we can only approximate.
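The logic of filling quotas can be expressed as a simple rule: accept a respondent only while the quota for his or her group is still open. A minimal sketch in Python follows; the quota figures and the stream of walk-in respondents are hypothetical.

  # Quota sampling: keep recruiting until each quota is filled, then stop
  # accepting respondents from that group. Quotas and respondents are hypothetical.
  import random

  quotas = {"male": 60, "female": 40}
  counts = {"male": 0, "female": 0}
  accepted = []

  def try_accept(respondent_id, sex):
      # Accept the respondent only if his or her quota is not yet full.
      if counts[sex] < quotas[sex]:
          counts[sex] += 1
          accepted.append((respondent_id, sex))
          return True
      return False  # quota already filled; politely decline

  for i in range(1, 201):  # a stream of 200 walk-in respondents
      try_accept(i, random.choice(["male", "female"]))

  print(counts)  # typically {'male': 60, 'female': 40} once both quotas fill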

Finally, there is the snowball sampling procedure. An MBA student of mine used this method in his research involving expatriates. He initially went to Bangsar, an area in Kuala Lumpur well known for its nightspots, which expatriates loved to visit. He would buy them drinks, talk to them and distribute questionnaires. In addition, he got them to pass the questionnaires on to some of their friends. Hence, in the snowball method, the initial respondents are selected by you, but the subsequent respondents are ‘referrals'. This method was very effective for his study because he was able to get access to expatriates who do not hang out in bars; these are usually married expatriates with families. If he had only interviewed expatriates in bars, the sample would have consisted mostly of single male expatriates. By using referrals, he was able to reach married expatriates with families. Therefore his final sample was more heterogeneous and representative of the population of expatriates in the Klang Valley in Malaysia.