In his influential paper on measurement theory, Stevens (1946) argued that statistical operators (e.g. the mean), and therefore also the statistical tests that use them, are only permissible on certain measurement scales. Whether a statistical operator is appropriate on a scale depends on whether it is invariant under the transformations that the scale permits. A transformation is a formula applied to every data point, for example the simple multiplication x’ = x * 4. If a statistical operator is not invariant, then the conclusions drawn from statistical tests that use that operator will differ depending on how the attribute was measured. As Scholten and Borsboom (2009) explain: “For instance, it is possible that when scores on the aforementioned mathematical proficiency test are analyzed for sex differences with a t test, different results are obtained for the original and transformed scores. Boys may significantly outperform girls when analyzing the original scores, while boys and girls may not differ significantly in their performance when analyzing the transformed scores (or vice versa; see Hand (2004), for some interesting examples). Since there is no sense in which the original scores are preferable or superior to the transformed, squared scores, this means that research findings and conclusions depend on arbitrary, and usually implicit, scaling decisions on part of the researcher.”
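The situation in the quotation can be sketched in a few lines of Python. This is only an illustration: the scores are invented, and `welch_t` is a hypothetical helper written for this example (Welch's form of the t statistic), not something taken from the papers cited. Multiplying every score by 4 leaves the t statistic unchanged, whereas squaring the scores changes it and, for these particular numbers, even reverses its sign.

```python
from math import isclose, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Invented test scores, for illustration only.
boys = [1, 1, 1, 1, 10]
girls = [2, 3, 3, 3, 4]

t_original = welch_t(boys, girls)
# x' = x * 4: the statistic is invariant under this transformation.
t_scaled = welch_t([4 * x for x in boys], [4 * x for x in girls])
# x' = x ** 2: order-preserving for positive scores, but the statistic changes.
t_squared = welch_t([x ** 2 for x in boys], [x ** 2 for x in girls])

print(f"original: {t_original:.3f}")
print(f"times 4:  {t_scaled:.3f}")
print(f"squared:  {t_squared:.3f}")
```

With these numbers the girls have the higher mean on the original scores and the boys have the higher mean on the squared scores, which is exactly the kind of dependence on scaling decisions that the quotation warns about.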

Stevens distinguished four measurement scales, each with its own permissible transformations and statistical operators. The statistics that are permissible on a scale remain permissible on every higher scale, whereas the set of permissible transformations becomes narrower as the scale becomes higher.

The first scale is the nominal scale. At this scale numbers are used to classify data into mutually exclusive, exhaustive categories on which no order can be imposed. Nominal assignments can be made at the individual level (football players’ jersey numbers) or at the group level (a person’s religion). Permissible transformations are any one-to-one or many-to-one transformation, although a many-to-one transformation loses information. Permissible statistics are the number of cases, the mode and association statistics (http://salises.mona.uwi.edu/sa63c/Nominal%20Measures%20of%20Association.htm).
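A minimal sketch of what "permissible" means at the nominal level, using invented category codes and a hypothetical `mode` helper: after a one-to-one relabelling, the mode of the recoded data is simply the relabelled mode of the original data, and the category counts are unchanged. The mean, in contrast, depends entirely on which arbitrary labels were chosen.

```python
from collections import Counter
from statistics import mean


def mode(values):
    """Most frequent value (assumes a single most frequent category)."""
    return Counter(values).most_common(1)[0][0]


# Invented category codes, e.g. 1, 2 and 3 standing for three religions.
codes = [1, 2, 2, 3, 2, 1]

# A one-to-one relabelling: permissible on a nominal scale.
relabel = {1: 30, 2: 10, 3: 20}
recoded = [relabel[c] for c in codes]

# The mode "follows" the relabelling: category 2 is now called 10.
print(mode(codes), mode(recoded))

# The category counts are the same before and after relabelling.
print(sorted(Counter(codes).values()), sorted(Counter(recoded).values()))

# The mean does not follow the relabelling and has no interpretation here.
print(mean(codes), mean(recoded))
```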

The second scale is the ordinal scale. Ordinal measurements describe order, but not the relative size or degree of difference between the items measured. For example, scores in tennis are rank ordered but cannot meaningfully be subtracted: the differences 30 - 15 = 15 and 40 - 30 = 10 are meaningless. According to Stevens, most psychological measurements are made on an ordinal scale. Examples are measurements of intelligence and personality traits. Permissible transformations are any monotone increasing transformation, although a transformation that is not strictly increasing loses information. Permissible statistics are the median and percentiles. Note that the mean is not a permissible statistic at this scale.
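Why the median is permissible here and the mean is not can be shown with a small sketch. The ratings are invented, and squaring is used as one example of a transformation that is strictly increasing for positive numbers. The median of the transformed data equals the transformed median, so the comparison between the groups is preserved, while the comparison of the means reverses.

```python
from statistics import mean, median

# Invented ordinal ratings for two groups, for illustration only.
group_a = [1, 2, 8]
group_b = [3, 4, 5]

# Squaring is strictly increasing for positive values, so it is a
# permissible transformation on an ordinal scale.
squared_a = [x ** 2 for x in group_a]
squared_b = [x ** 2 for x in group_b]

# The median commutes with the transformation; group B stays ahead.
print(median(group_a), median(group_b))      # 2 and 4
print(median(squared_a), median(squared_b))  # 4 and 16

# The means disagree about which group is ahead.
print(mean(group_a) < mean(group_b))      # True: B ahead on the original scores
print(mean(squared_a) < mean(squared_b))  # False: A ahead on the squared scores
```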

The third scale is the interval scale. Data points on the interval scale are ordered, and equal differences between data points represent equal differences in the attribute everywhere on the scale. For example, the difference between 10˚ and 5˚ is the same as the difference between 35˚ and 30˚: both are 5˚. However, the zero point of the scale is arbitrary, so ratios cannot meaningfully be calculated. Permissible transformations are any affine transformation t(m) = c * m + d, where c and d are constants and c is positive (the general linear group, x’ = ax + b in Stevens). Permissible statistics are the mean, standard deviation, rank-order correlation and product-moment correlation.
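As a worked example, take the conversion from Celsius to Fahrenheit, which is an affine transformation with c = 1.8 and d = 32. The sketch below reuses the temperatures from the paragraph above and shows three things: the mean commutes with the conversion, equal differences stay equal, and ratios are not preserved. The function name `celsius_to_fahrenheit` is just a label chosen for this example.

```python
from math import isclose
from statistics import mean


def celsius_to_fahrenheit(c):
    """Affine transformation t(m) = 1.8 * m + 32."""
    return 1.8 * c + 32


temps_c = [5.0, 10.0, 30.0, 35.0]
temps_f = [celsius_to_fahrenheit(t) for t in temps_c]

# The mean commutes with the transformation.
print(isclose(mean(temps_f), celsius_to_fahrenheit(mean(temps_c))))

# Equal differences in Celsius remain equal differences in Fahrenheit.
diff_low_f = temps_f[1] - temps_f[0]   # 10 and 5 degrees Celsius
diff_high_f = temps_f[3] - temps_f[2]  # 35 and 30 degrees Celsius
print(isclose(diff_low_f, diff_high_f))

# Ratios are not preserved: "twice as warm" is not meaningful here.
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]
print(ratio_c, ratio_f)
```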

The fourth scale is the ratio scale. This scale is very similar to the interval scale, except that it has an absolute zero point. Examples of measurements on the ratio scale are temperature in kelvin, monthly salary and weight. (Degrees Celsius, by contrast, are on an interval scale, because the zero point is arbitrary.) Permissible transformations are any similarity transformation t(m) = c * m, where c is a positive constant, for example a change of unit from kilograms to grams. The (new) permissible statistic is the coefficient of variation.
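The coefficient of variation (standard deviation divided by mean) illustrates why an absolute zero matters. In the sketch below, with invented weights and a hypothetical helper `coefficient_of_variation`, a change of unit leaves the statistic untouched, while adding a constant, which is permissible on an interval scale but not on a ratio scale, changes it.

```python
from math import isclose
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Invented body weights in kilograms.
weights_kg = [50.0, 60.0, 70.0, 80.0]

# Similarity transformation t(m) = c * m: kilograms to grams.
weights_g = [1000 * w for w in weights_kg]

# Affine shift t(m) = m + d: moves the zero point of the scale.
shifted = [w + 100 for w in weights_kg]

cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)
cv_shifted = coefficient_of_variation(shifted)

print(isclose(cv_kg, cv_g))        # unchanged by the change of unit
print(isclose(cv_kg, cv_shifted))  # changed by the shift of the zero point
```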

Lord (1953) wrote a satirical comment on Stevens’ conclusions. He tells the story of a professor who distributed jersey numbers to his students and who often administered tests to them. In secret he compared the means and standard deviations of the test results of students with different jersey numbers. He taught his students very carefully: “Test scores are ordinal numbers, not cardinal numbers. Ordinal numbers cannot be added”. He knew very well that, according to the latest theories of measurement, his own comparisons were not permissible.

After a while the freshmen accused the professor of giving them low numbers. The professor consulted a statistician, who simply calculated that the probability of the freshmen getting such a low average number, had the numbers been distributed at random, was very small.

When the professor argued that the statistician could not use multiplication on measurements taken on a nominal scale, the statistician replied: “If you doubt my conclusions… I suggest you try and see how often you can get a sample of 1,600 numbers from your machine with a mean below 50.3 or above 58.3.” So the professor started drawing samples and indeed found that a mean below 50.3 or above 58.3 was very unlikely.
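The statistician's suggestion amounts to a small simulation, which can be sketched as follows. Note the assumptions, none of which are given in the text above: the population of numbers in the machine is modelled as normally distributed, its mean is set to 54.3 (the midpoint of the two limits in the quotation) and its standard deviation to 16. Under these assumed values the standard error of a mean of 1,600 numbers is 16 / sqrt(1600) = 0.4, so both limits lie ten standard errors away from the population mean.

```python
import random

# ASSUMED population parameters; the text does not state them.
POPULATION_MEAN = 54.3  # midpoint of the limits 50.3 and 58.3
POPULATION_SD = 16.0    # hypothetical value chosen for illustration
SAMPLE_SIZE = 1600
N_SAMPLES = 200


def count_extreme_means(n_samples, seed=0):
    """Count random samples whose mean is below 50.3 or above 58.3."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    extreme = 0
    for _ in range(n_samples):
        sample = [rng.gauss(POPULATION_MEAN, POPULATION_SD)
                  for _ in range(SAMPLE_SIZE)]
        sample_mean = sum(sample) / len(sample)
        if sample_mean < 50.3 or sample_mean > 58.3:
            extreme += 1
    return extreme


extreme = count_extreme_means(N_SAMPLES)
print(f"{extreme} of {N_SAMPLES} sample means fell outside 50.3 to 58.3")
```

Nothing in this procedure asks what the numbers stand for: it only describes how the machine behaves when it hands out numbers at random.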

Lord thus argues that statistical methods can be used regardless of the scale of measurement: “The numbers do not know where they came from” (p. 751). However, in his paper Lord draws inferences about the measurements rather than about the attributes. Or, as Scholten and Borsboom (2009) argue: “... it is argued that the football numbers do not represent just the nominal property of non-identity of the players; they also represent the amount of bias in the machine. It is a question about this property – not a property that relates to the identity of the football players – that the statistical test is concerned with”. Scholten and Borsboom show that, when the bias of the machine is assessed, the data are actually on an interval scale, and that Lord’s article therefore actually supports Stevens’ view.

To illustrate the problem that most measurements in psychological science are made on an ordinal rather than an interval scale, I quote a text I found on the web (http://blog.csdn.net/aris_zzy/article/details/2071923):

“Suppose we are doing a two-sample t-test; we are sure that the assumptions of ordinal measurement are satisfied, but we are not sure whether an equal-interval assumption is justified. A smooth monotone transformation of the entire data set will generally have little effect on the p value of the t-test. A robust variant of a t-test will likely be affected even less (and, of course, a rank version of a t-test will be affected not at all). It should come as no surprise then that a decision between an ordinal or an interval level of measurement is of no great importance in such a situation, but anyone with lingering doubts on the matter may consult the simulations in Baker, Hardyck, and Petrinovich (1966) for a demonstration of the obvious.

On the other hand, suppose we were comparing the variability instead of the location of the two samples. The F test for equality of variances is not robust, and smooth monotone transformations of the data could have a large effect on the p value. Even a more robust test could be highly sensitive to smooth monotone transformations if the samples differed in location.

Measurement level is of greatest importance in situations where the meaning of the null hypothesis depends on measurement assumptions. Suppose the data are 1-to-5 ratings obtained from two groups of people, say males and females, regarding how often the subjects have sex: frequently, sometimes, rarely, etc. Suppose that these two groups interpret the term ‘frequently’ differently as applied to sex; perhaps males consider ‘frequently’ to mean twice a day, while females consider it to mean once a week. Females may report having sex more ‘frequently’ than men on the 1-to-5 scale, even if men in fact have sex more frequently as measured by sexual acts per unit of time. Hence measurement considerations are crucial to the interpretation of the results.”
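The first two points of the quoted passage can be sketched with invented data. The helpers `mann_whitney_u` and `variance_ratio` are written for this example only: the first is a rank-based statistic, the second is the statistic behind the F test for equal variances. The two samples have identical spread but differ in location, and the exponential function serves as the smooth monotone transformation.

```python
from math import exp, isclose
from statistics import variance


def mann_whitney_u(a, b):
    """Number of (a, b) pairs with a > b, counting ties as one half."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)


def variance_ratio(a, b):
    """Ratio of sample variances, as used in the F test."""
    return variance(a) / variance(b)


# Invented samples: same spread, but sample_b is shifted up by 2.5.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [3.5, 4.5, 5.5, 6.5]

# A smooth, strictly increasing transformation of the whole data set.
exp_a = [exp(x) for x in sample_a]
exp_b = [exp(x) for x in sample_b]

# The rank-based statistic is not affected at all.
print(mann_whitney_u(sample_a, sample_b), mann_whitney_u(exp_a, exp_b))

# The variance ratio goes from 1 to a value far below 1.
print(variance_ratio(sample_a, sample_b), variance_ratio(exp_a, exp_b))
```

Because the samples differ in location, the transformation stretches the higher sample much more than the lower one, which is why the comparison of variances is so sensitive here.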

To conclude: always be aware of which attribute you measure, on what scale you are likely measuring it, which statistics are permissible on that scale, and how the results of the analysis relate to the attribute.

Lecture by Angélique Cramer

Baker, B. O., Hardyck, C. D., and Petrinovich, L. F. (1966). Weak measurement vs. strong statistics: An empirical critique of S. S. Stevens’ proscriptions on statistics. Educational and Psychological Measurement, 26, 291-309.

Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8, 750-751.

Scholten, A. Z. and Borsboom, D. (2009). A reanalysis of Lord’s statistical treatment of football numbers. Journal of Mathematical Psychology, 53, 69-75.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.