Additional tools

Factor Analysis
A more complex statistical procedure is called factor analysis. There are a number of variants of this procedure, and most statistical packages nowadays offer them. The bones of the procedure are that it is hypothesised that there are one or more invisible (latent) factors which determine the responses of users to the actual questions. Each question is said to "load" onto one or more of these invisible factors. It is up to the analyst to decide whether there is really only one invisible factor underlying the data, or whether there are two, or more. This is never an open-and-shut case which can be decided on purely statistical grounds. You have to look at the wordings of the questions as well as their loadings on the invisible factors that emerge. What often happens is that there is one factor onto which a lot of the questions load quite heavily, and then a slew of factors onto which the questions load with less and less weight. The analyst decides which questions to retain on the basis of an acceptably heavy loading (whatever that loading may be - again, nothing set in concrete here.)
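If you want to experiment, most statistics environments will do the heavy lifting for you. Here is a minimal sketch in Python using scikit-learn's FactorAnalysis; the data, the number of items, and the choice of two factors are purely illustrative, and with real questionnaire data you would of course use your own response matrix.

```python
# Minimal factor-analysis sketch. The data and the choice of two factors are
# illustrative only; substitute your own respondents x items response matrix.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 8))   # 200 respondents, 8 items on a 1-5 scale

fa = FactorAnalysis(n_components=2)             # hypothesise two invisible (latent) factors
fa.fit(responses)

# components_ holds the loadings: one row per factor, one column per question.
for i, loadings in enumerate(fa.components_, start=1):
    print(f"Factor {i} loadings:", np.round(loadings, 2))
```

The loadings are only the starting point: deciding how many factors are real, and what they mean, is still the analyst's job.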

The last judgement
It is important to remember that statistical criteria are not the final judgement. When a set of questions has emerged from the statistical analysis, the analyst must review both them and any questions that have been rejected and decide on the basis of their professional judgement:

what questions can be passed on to the second iteration,
what questions may need a little tweaking, and
what questions are really not adding much to the information the questionnaire provides.

Do not abdicate responsibility to your statistical methods! And now you see why iteration is a critical part of the process. The questionnaire that results from one iteration is a hypothesis. Hypotheses have to be tested.
A very real risk a developer runs when constructing a scale is that they start to "model the data." That is, they take items in and out and they compute their statistics, but their conclusions are only applicable to the present sample of respondents. What the developer must do next is to try the new questionnaire (with items re-worded and rejected items thrown out) on a fresh sample, and re-compute all the above statistics again. If the statistics hold on the fresh sample, then well and good. If not, then more analysis and another run will be needed.

Warning: one sometimes sees some very good-looking statistics reported on the basis of analysis of the original sample, without any check on a fresh sample. Take these with a large pinch of salt. The statistics will most probably be a lot less impressive when re-sampled.

In general, in answer to the question: is this a real Likert scale or not, the onus is on the person who created the scale to tell you to what extent the above criteria have been met. If you are not getting this level of re-assurance from the scale designer, then it really is a fishy business. And beware: a scale item which may work very nicely in one questionnaire may be totally out of place in another.

How many response options should there be in a numeric questionnaire?
There are two sets of issues here. One is: should we have an odd or even number of response options? The general answer to give here is that, if there is a possibility of having a "neutral" response to a set of questions, then you should have an odd number of response options, with the central point being the neutral position. On the other hand, if it is an issue of whether something is good/bad, male/female (what we can call bi-polar) then basically you are looking at an even number of response options.
If you wish to assess the strength of the response you are actually asking two questions in one: firstly, is it good or bad, and secondly, is it really very good or very bad? This leads you to consider more than two (or three) response options.

Some people use even numbers of response options to "force" the respondents to go one way or another on an issue which is in fact not at all bi-polar. What happens in practice here is that respondents end up giving random responses between the two middle items. Not very useful.

Odd numbers of response options somehow feel more natural: the central value feels like a dividing line between positive and negative, and subjective issues are usually not considered bi-polar in our civilisation. There is always the possibility of a state of doubt about one's feelings. This state of doubt may also be exacerbated by the respondent not understanding the question, or judging that the question does not actually correspond to any experience they may have had, or that the question is simply not relevant. We do hope that by the time the questionnaire has emerged from the development process these issues have been dealt with, and that the questions are all concise, clear, and to the point (but sometimes, maybe not - so, back to the drawing board?) So usually we hope that the central response option feels like the dividing line, on which some respondents may feel they are allowed to settle - but that most won't want to.

The other set of issues is how wide the set of response options should be. A scale of 1 to 3, 1 to 5, or even 1 to 100? The usual answer is five. But this misses the point. It depends on how accurately the majority of respondents can distinguish between flavours of meaning in the questions. If you suspect that the majority of respondents are going to be fairly uninformed about the topic or vague in their judgements, then stick with a small number of response options. If you are going to be dealing with people who can give nuanced responses, then you can use a much larger set of response options.

A sure way of telling if you are using too many response options is to listen to the respondents talking after they have done the questionnaire. When people have to differentiate between fine shades of meaning that may be beyond their ability, they will complain that the questionnaire was "long" and "hard."

How do you get numbers out of a Likert questionnaire?
A Likert scale will consist of a number of statements (let's say, for the sake of an example, 10 statements, why not.) Each statement will have a response surface - a set of boxes labelled from "Agree" or "Strongly Agree" to "Disagree" or "Strongly Disagree". Let's say, continuing our example, that we have a 5-choice response surface.
We code the "Strongly Agree" response as a 5, down to the "Strongly Disagree" response as a 1, for each statement. If some of the statements are negative then we do those statements the other way round. That is, an extreme positive response ("Strongly Agree" to a positive statement or "Strongly Disagree" to a negative) is always a 5 and an extreme negative is always a 1. That makes sense, doesn't it? The stronger, the larger. We now add up the numbers for each respondent. In our example scale, we will have numbers between 10 (that is, all 1s, "Strongly Disagrees") and 50 (all "Strongly Agrees".)

A bit of jargon here: 10 is the base of our questionnaire, and 50 - 10 = 40 is the range.
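As a concrete illustration, here is a small Python sketch of the scoring just described; the indices of the negatively worded statements are hypothetical and would obviously depend on your own questionnaire.

```python
# Scoring sketch for a 10-statement, 5-point Likert questionnaire.
# Responses are coded 1 ("Strongly Disagree") to 5 ("Strongly Agree");
# the indices of the negatively worded statements below are hypothetical.
N_ITEMS, MIN_CODE, MAX_CODE = 10, 1, 5
NEGATIVE_ITEMS = {2, 5, 8}

def score_respondent(answers):
    total = 0
    for i, a in enumerate(answers):
        if i in NEGATIVE_ITEMS:
            a = (MAX_CODE + MIN_CODE) - a   # reverse-code: 5 becomes 1, 4 becomes 2, ...
        total += a
    return total

base = N_ITEMS * MIN_CODE                   # 10: all "Strongly Disagree"
score_range = N_ITEMS * MAX_CODE - base     # 40: the range

print(score_respondent([4, 5, 2, 3, 4, 1, 5, 4, 2, 3]))   # a score between 10 and 50
```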

Now, if we score all our respondents, we should find a distribution of scores between 10 and 50, hopefully with most scores lying somewhere in the middle and extreme scores being fairly rare. There are many definitions of what a "normal distribution" should look like (and tests galore to determine normality), but for the jobbing analyst, the best definition is a distribution with a bump in the middle and symmetrical fringes at either end. If your distribution doesn't look like that, then it might be an accident of sampling, or the questionnaire itself may be flawed (attitude questionnaires usually exhibit a "ceiling effect" - that is, the data tends to get squeezed up into the high end.) Careful questionnaire development should guard against consistent "floor" or "ceiling" effects, so the chances are that if you get a skewed distribution from a sample, this is what statisticians would call an accident of your sampling (ie, the software you are measuring is actually considered to be very good by your respondents.)

So you are perfectly entitled to compute an arithmetic mean (average) of your data, over all your respondents. This is your numeric summary. But to be honest, this does not look very user-friendly. So we have to do some transformations to get this numeric summary into a more palatable form. See the next section.
How do you transform numeric data?
There are at least four ways of transforming data into a more palatable form. These are:
Straight linear
Percentile rank from data
z-scores
Percentile rank from parameters
Let's tackle each in turn.

Straight linear
This is the simplest kind of transformation but it is really only number magic. The following example shows you how you can transform data from a 10 - 50 distribution to a 0 - 100 distribution for a questionnaire with 10 statements. Please remember that this is not a percentile transformation. It is not a percentage. The number 50 has no specific meaning.
We do the following for each respondent's data, after we have summed it to give us a value we will call X. The process will give us a value we will call T, or transformed value.
We first of all note the base, which is what happens when the respondent has scored 1 for each statement. In our example, base = 10.
We note the range, which is the difference between the base and the situation where a respondent has scored 5 for each statement. In our example, range = 50 - 10 = 40.
Suppose the sum of our respondent's statements is 35, that is, X = 35. Then:
T = ((X - base) / range) * 100 = ((35 - 10) / 40) * 100 = 62.5
We can check that if X = 10, T comes to zero. And if X = 50, T comes to 100. But do note that we do not have all the numbers between 0 and 100. The difference between Ts when X = 10 and X = 11 is 2.5. That is, we go up the scale from zero to 100 in steps of 2.5.
This slightly embarrassing feature can get neatly hidden if we take the data from many respondents and average their T values. Hey presto! It looks as if we have a scale that uses all the numbers between zero and 100.
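For what it's worth, the whole transformation is a one-liner in code. A quick Python sketch, using the base and range from the example above:

```python
# Straight linear transform from the raw 10-50 scale to 0-100,
# using the base (10) and range (40) from the worked example above.
def linear_transform(x, base=10, score_range=40):
    return (x - base) / score_range * 100

print(linear_transform(35))                           # 62.5
print(linear_transform(10), linear_transform(50))     # 0.0 100.0
print(linear_transform(11) - linear_transform(10))    # 2.5 -- the step size mentioned above
```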

Percentile rank from data
Using this method, we scale our data in terms of the percentile rank achieved by each respondent with reference to the collection of respondents we have sampled. Statisticians define a percentile as a score at or below which a given percentage of the rest of the data falls. So if a respondent is right in the middle, their percentile rank is 50. Percentile ranks are easy to understand (who doesn't understand a score of 100%, right?) but they have some undesirable mathematical properties, which I summarise at the end of this section.
It's easy to compute with a spreadsheet if you don't have much data. Let's say you have n items of data; in our case, n = 10. First of all you order your data from lowest value to highest (Col. X, below), and then assign a rank so that the lowest is a rank of 1. It doesn't matter if some of your data items are repeated at this stage (Col. R, below.) Next, if there are repeated data items (as there surely will be!) you assign to each bunch of repeated data items the highest rank of the repeated bunch (Col. R', below.) Finally you compute PR. The example is given for computing the PR of a score of 30 in our sample data set:
PR = R'/n = 7/10 = 0.7
Percentile ranks are usually expressed as whole numbers between zero and 100, so we multiply the result by 100. Thus a raw score of 30 produces a percentile rank of 70%.

 X    R    R'    PR
10    1    2     20
10    2    2     20
11    3    3     30
12    4    4     40
15    5    5     50
30    6    7     70
30    7    7     70
50    8   10    100
50    9   10    100
50   10   10    100

One important warning. As you can see, the numeric difference between the percentile ranks will not reflect the actual differences in the corresponding raw data values. So the difference between a raw score of 10 and 11 in percentile rank terms is 10. But so is the difference between a raw score of 12 and 15! In fact your percentiles will vary greatly the more data you collect. This makes percentiles non-linear. One consequence is that computing averages and standard deviations on percentiles is frowned upon by the statistical community. You should compute medians and other rank order statistics instead.
Some analysts prefer to count the number of percentiles in different ranges, so in our sample above you could say that only 3 / 10 cases (30%) were above the 75th percentile, or were worthy of an "A" rating. The dividing lines are, of course, extremely arbitrary. Let the buyer beware.
If you have a very large collection of data from your questionnaire (counted in the thousands, preferably) you can actually compute a percentile for every value from the base to the maximum and use this data as a way of re-expressing your raw data for any one sample just by looking up the table. Although this is better than working from tiny samples, please note that this data too is subject to change as your big reference data set changes. The distribution of the big reference data set will only reflect the conditions under which it was collected.
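If you would rather not do this in a spreadsheet, here is a short Python sketch that reproduces the worked example above (ties are given the highest rank of their bunch, exactly as in the table):

```python
# Percentile ranks computed as in the worked example: ties take the highest
# rank of their bunch, then PR = R'/n * 100.
def percentile_ranks(data):
    ordered = sorted(data)
    n = len(ordered)
    # later duplicates overwrite earlier ones, so each value keeps its highest rank
    highest_rank = {x: i + 1 for i, x in enumerate(ordered)}
    return {x: highest_rank[x] / n * 100 for x in ordered}

sample = [10, 10, 11, 12, 15, 30, 30, 50, 50, 50]
print(percentile_ranks(sample)[30])   # 70.0, matching the table above
```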

z-scores
A standard normal score (sometimes called a z-score) is the distance of a score from the mean, expressed in units of the standard deviation. All the caveats about the normality of the population distribution from which our sample is obtained apply. None the less, do remember that accidents of sampling can produce some extremely strange sample distributions, and that for the kinds of sample sizes we normally deal with (what statisticians call "small samples", ie, less than 5,000), all tests for "normality of distribution" are actually pretty vague. Either you trust that the questionnaire you are using is capable of producing normally distributed data or you don't. If you don't, don't use this method.
I'm using the same data set as in the previous section, only this time I've used my spreadsheet to compute the mean (called AVERAGE() in most spreadsheets) and the population standard deviation with a divisor of n (usually called STDEVP() in most spreadsheets: please consult an introductory statistics textbook for an explanation of the difference between using n and n - 1 as a divisor.) In this example, the mean is 26.80 and the standard deviation (or StDevP) is 16.76. Thus the z-score of the raw score of 30 is:
z = (X - MEAN) / StDevP = (30 - 26.80) / 16.76 = 0.19
The computation of z-scores for the entire sample is given below. However, z-scores with their means of zero and standard deviations of 1 are not very eye-catching, and so for presentation purposes I advise transforming z-scores into a 50/10 distribution: that is, one in which the mean is 50 and the standard deviation is 10. Do consult a statistics textbook about the implications of this kind of re-scaling. To go from z-scores to this distribution the reverse of the previous formula is used. That is, for a z-score Z = 0.19 corresponding to a raw score of 30, the transformed 50/10 score, or X', with NEW_MEAN = 50 and NEW_SD = 10 is:
X' = (Z * NEW_SD) + NEW_MEAN = (0.19 * 10) + 50 = 51.91
The worksheet is presented here:

 X      Z     50/10
10    -1.00   39.97     Mean = 26.80
10    -1.00   39.97     StDevP = 16.76
11    -0.94   40.57
12    -0.88   41.17     New Mean = 50
15    -0.70   42.96     New SD = 10
30     0.19   51.91
30     0.19   51.91
50     1.38   63.85
50     1.38   63.85
50     1.38   63.85

Mean     26.80    0.00   50.00
StDevP   16.76    1.00   10.00

Now, if you have a very large database from the questionnaire (counted in the thousands, of course) you can consider this as an estimation base and take from it the population parameters: the two parameters of interest being the population mean (PM) and the population standard deviation (PSD.) You are now allowed to use these values instead of your sample mean and sample standard deviation (the StDevP) in computing the value of Z - from which you can compute the value on the 50/10 distribution. This has the advantage of showing your audience an implicit comparison between the data you have obtained and the data you have accumulated in the "shopping basket" of your estimation base. Does your evaluation show that you are above or below the overall standard? If over 50, then yes, above. If over 60, then very clearly above (see an introductory statistics textbook for how to interpret standard deviations in terms of probabilities.)
I do stress however that the techniques shown in this section will give a false reading if the questionnaire you are using does not have a normal distribution as evidenced by its large estimation base (if the estimation base is in the thousands, then tests for normality of distribution can begin to apply - with some caution.) If you don't have this kind of data my advice is: stick to percentile ranks.
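For anyone who prefers code to a spreadsheet, here is the basic worked example again in Python, using the population (divisor n) standard deviation:

```python
# z-score and 50/10 re-scaling for the worked example above, using the
# population standard deviation (divisor n), as the spreadsheet STDEVP() does.
import statistics

sample = [10, 10, 11, 12, 15, 30, 30, 50, 50, 50]
mean = statistics.mean(sample)     # 26.80
sd = statistics.pstdev(sample)     # 16.76

def to_50_10(x, mean, sd, new_mean=50, new_sd=10):
    z = (x - mean) / sd
    return z * new_sd + new_mean

print(round(to_50_10(30, mean, sd), 2))   # 51.91, as in the worksheet
```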
Percentile rank from parameters
The procedure outlined in this section really does depend on the estimation base of the questionnaire exhibiting normal distribution properties. If it hasn't been shown to do so, then this procedure will simply create nonsense.
Instead of computing percentiles from the actual sample, you can convert the sample to z-scores using the population mean and standard deviation from the estimation base. If you then want to convert these to percentile ranks you use a function usually called NORMSDIST() in most spreadsheets.
If you had a raw score of 30, and the population mean is given as 25 and the standard deviation as 12, the computation of the Population Percentile Rank (PPR) is as follows:
PPR = INT(NORMSDIST((30 - 25) / 12) * 100) = 66
NORMSDIST produces values between zero and 1, so we have to scale the output up to 100, and then, since it is a (population) percentile rank, we truncate it to the nearest whole number.
All the reservations about percentile ranks computed from samples as stated above apply here as well. Be warned!
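The spreadsheet NORMSDIST() function is just the cumulative distribution function of the standard normal distribution, so the same calculation can be sketched in Python with scipy (the population mean and standard deviation below are the ones from the worked example):

```python
# Population percentile rank from parameters. Only meaningful if the
# estimation base really is (approximately) normally distributed.
from scipy.stats import norm

def population_percentile_rank(x, pop_mean, pop_sd):
    z = (x - pop_mean) / pop_sd
    return int(norm.cdf(z) * 100)      # norm.cdf plays the role of NORMSDIST()

print(population_percentile_rank(30, 25, 12))   # 66, as in the worked example
```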

What if people complain about the questionnaire?
People always complain. It's a fact of life. And everybody thinks of themselves as a "questionnaire expert." If you get the odd grumble from your respondents, this usually means that the person doing the grumble has something extra they want to tell you, beyond the questionnaire. So listen to them.
If you get a lot of grumbles, this may mean that you have badly miscalculated and it's time to go back to the drawing board. When you listen to people complaining about a questionnaire, listen carefully: are they unhappy about what the questionnaire is attempting to measure, or are they unhappy about the wordings of some of your items?

What other kinds of questionnaires are there?
You mean, what other kinds of techniques can you employ to construct a questionnaire? There are two main other varieties:
Semantic differential type questionnaires, in which the user is asked to say where their opinion lies between two anchor points which have been shown to represent some kind of polar opposition in the respondent's mind.
Guttman scaling type questionnaires which are a collection of statements which gradually get more extreme, and you calculate at what statement the respondent begins to answer negatively rather than positively.
Of the two, semantic differential scales are more frequently encountered in practice, although they are not used as much as Likert scales, and professionals seem to have relegated Thurstone and Guttman scaling techniques into the research area. As a footnote, Likert starts his famous article by complaining about how difficult it is to get Thurstone scaling right.

There is also a (third) set of methodologies collected together under the title of Rasch Measurement. This set of methodologies is rarely used in User Experience work, mainly because it involves a lot of intensive statistical computation, and it does not produce easily interpretable results. In contrast, Classical Test Theory as presented here is intuitive to most people (once you get past the tricky business of actually developing the scale!)

Should questions be devised so they are always positive?
The jury (as always) is out on this one with claims and counter-claims. The one thing to avoid most strenuously is to frame questions with an explicit negative. Saying "no" to a negative question involves mental contortions. The reason for having both negative and positive statements in a questionnaire is that it is feared that response bias will come into play. If all your statements are positive up-beat ones, a respondent can simply check off all the "agrees" without having to consider each statement carefully. So you have no guarantee that they've actually responded to your statements -- they could be working on "auto-pilot". Of course, such questionnaires will also produce fairly impressive statistical reliabilities, but again, that could be a cheat.


However, it has also been mooted that some users will get confused with a reversal of direction and will always hit what they think is the "agree" button even when the statement expresses a negative opinion which they don't want to endorse. I sometimes look at the response patterns of users. If there is a considerable number of responses on the left or the right, irrespective of the question asked, I presume the respondent has got themselves hopelessly lost with regard to the direction of statements. So I usually delete their record (with suitable warnings.) The interesting thing is that this does not happen very often at all in my experience.
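A crude automated version of this eyeballing is sketched below; the thresholds and the set of negatively worded items are entirely arbitrary and only meant to illustrate the idea.

```python
# Crude check for respondents who may have ignored the direction of statements:
# strong agreement with both the positive and the negative items is suspicious.
# Thresholds and the set of negative items are arbitrary and illustrative.
def looks_confused(answers, negative_items, agree=4, min_pos=3, min_neg=2):
    pos_agree = sum(1 for i, a in enumerate(answers) if i not in negative_items and a >= agree)
    neg_agree = sum(1 for i, a in enumerate(answers) if i in negative_items and a >= agree)
    return pos_agree >= min_pos and neg_agree >= min_neg

print(looks_confused([5, 5, 5, 4, 5, 5, 4, 5, 5, 5], negative_items={2, 5, 8}))   # True
```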

Is a long questionnaire better than a short one? How short can a questionnaire be?
You have to ensure that you have enough statements to cover the most common shades of opinion about the construct being rated. But this has to be balanced against the need for conciseness: you can produce a long questionnaire that has fantastic reliabilities and validities when tested under controlled conditions with well-motivated respondents, but ordinary respondents may just switch off and respond at random after a while. In general, because of statistical artifacts, long questionnaires will tend to produce good reliabilities with well-motivated respondents, and shorter questionnaires will produce less impressive reliabilities, but they may be a better test of overall opinion in practice.
A questionnaire should not be judged by its statistical reliability alone. Because of the nature of statistics, especially the so-called law of large numbers, we will find that what was only a trend with a small sample becomes statistically significant with a large sample. This is as true of the number of respondents you have as it is of the number of questions in your questionnaire. Statistical "significance" is a technical term with a precise mathematical meaning. Significance in the everyday sense of the word is a much broader concept.

So high statistical reliability is not the "gold standard" to aim for?
If a short (say 8 - 10 items) questionnaire exhibits high reliabilities (above 0.85, as a rule of thumb) then you should look at the items carefully and examine them for spurious repetitions. Longer questionnaires (12 - 20 items) if well constructed should yield reliability values of 0.70 or more.
I stress these are rules of thumb: there is nothing absolute about them.
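For what it's worth, the reliability figure quoted in this kind of work is usually Cronbach's alpha; assuming that is the statistic in question, a minimal sketch of the computation looks like this (the random data is only there to make the example run):

```python
# Minimal Cronbach's alpha sketch, assuming alpha is the reliability statistic
# in question. `data` is a respondents x items matrix of coded responses.
import numpy as np

def cronbach_alpha(data):
    data = np.asarray(data, dtype=float)
    k = data.shape[1]                                # number of items
    item_variances = data.var(axis=0, ddof=1).sum()
    total_variance = data.sum(axis=1).var(ddof=1)    # variance of the summed scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(1)
fake = rng.integers(1, 6, size=(50, 12))             # random data, so expect a value near zero
print(round(cronbach_alpha(fake), 2))
```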

What's the minimum and maximum figure for reliability?
Theoretically, the minimum is 0.00 and the maximum is 1.0. Suspect a questionnaire whose reliability falls below 0.50 unless it is very short (3-4 items) and there is a sound reason to adopt it.
The problem with questionnaires of low reliability is that you simply don't know whether they are telling you the truth about what you are trying to measure or not. It's the lack of assurance that's the problem.

Can you tell if a respondent is lying?
The polite way of saying this, is, can you tell if the respondent is giving you "socially desirable" answers. You can, but the development of a social desirability scale within your questionnaire (so-called "lie scale") is a topic all of its own. "Lie scales" work on the principle that if someone is trying to make themselves look good, they will also strongly agree to an inordinate number of statements that ask about impossible behaviours, such as
"I have never been late for an appointment in my life."
"I always tell the truth no matter what the cost."
Now, some respondents may strongly agree with some of these items but they'd have to be a saint to be able to honestly agree to all of them.
Lie scales just bulk up a questionnaire and are generally not used in HCI. If you are really concerned with your respondents giving you socially desirable answers, you could always put a social desirability questionnaire into the test booklet and look hard at those respondents who give you high scores on social desirability.

Why do some questionnaires have sub-scales?
Suppose that the overall construct you are getting the respondents to rate is complex: there are different components to the construct. Thus for instance, overall user satisfaction is a complex construct that can arise from a number of separate issues, like "attractiveness of product", "helpfulness", "feelings of efficiency" and so on. If you can identify these components, it makes sense to create a number of sub-scales in your questionnaire, each of which is a mini questionnaire in its own right, measuring one component, but which also contributes to the overall construct.
How do you go about identifying component sub-scales?
The soundest way of doing this is to carry out a statistical procedure called factor analysis on a large set of questions, to find out how many underlying (latent) factors the respondents are operating with (see above on factor analysis), but often, received opinion or expert analysis of the overall construct may be used instead. Or even reading the statements carefully in a group discussion activity! The crucial questions are:
Are these factors truly independent? That is, if they are, we would expect items that make up the factors to be more highly correlated with each other than with items from other factor scales (a quick check along these lines is sketched after this list.)
What use can the analyst make of the different factors? Extracting a bunch of factors that actually contributes little to our understanding of what is going on is pseudo-science. On the other hand, separating factors which are fairly highly inter-correlated but which make sense to separate out practically makes for a more usable questionnaire. For instance, "screen layout" and "menu structure" are two factors which may be fairly strongly inter-correlated in a statistical sense but separately they may give the analyst useful information about these two aspects of an interface.
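A rough-and-ready check of the first question is sketched below: for each item in a proposed sub-scale, compare its average correlation with the rest of its own sub-scale against its average correlation with the remaining items. The response matrix and the sub-scale membership are, of course, hypothetical.

```python
# Rough check that a proposed sub-scale hangs together: each of its items should
# correlate more with the rest of its own sub-scale than with the other items.
# `responses` is respondents x items; the sub-scale membership is hypothetical.
import numpy as np

def own_vs_other_correlations(responses, subscale_items):
    responses = np.asarray(responses, dtype=float)
    corr = np.corrcoef(responses, rowvar=False)      # item x item correlation matrix
    others = [j for j in range(corr.shape[1]) if j not in subscale_items]
    for i in subscale_items:
        own = np.mean([corr[i, j] for j in subscale_items if j != i])
        other = np.mean([corr[i, j] for j in others])
        print(f"item {i}: own-scale r = {own:.2f}, other-items r = {other:.2f}")
```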
How much can I change wordings in a standardised questionnaire?
In general, if a questionnaire has been through the standardisation process the danger in changing, deleting, or adding items is that you undo the statistical basis for the questionnaire: you set yourself back by unknown amounts. You are generally advised not to do this unless you have all the background statistical data and have access to user samples on which you can re-validate your amended version.
There is one general exception. If statements in the questionnaire refer to something like "this system" or "this software" you can usually change these words to refer explicitly to the system you are evaluating without doing too much damage to the questionnaire. For instance:

(1) Using this system gives me a headache.
(2) Using Word-Mate gives me a headache.
Changing (1) to (2) is called focusing the questionnaire and is usually no problem.
You may be tempted to make a more radical change of focus, hopefully without affecting the statistical properties too much. Suppose for instance you were to change all occurrences of (3) to (4):

(3) using this system...