
IOE Blog


Expert opinion from IOE, UCL's Faculty of Education and Society


People having a pop at PISA should give it a break…

By Blog Editor, IOE Digital, on 30 July 2013

John Jerrim

For those who don’t know, the Programme for International Student Assessment (PISA) is a major cross-national study of 15-year-olds’ academic abilities. It covers three domains (reading, maths and science), and since 2000 has been conducted triennially by the OECD. This study is widely respected, and highly cited, by some of the world’s leading figures – including our own Secretary of State for Education Michael Gove.
Unfortunately, not everyone agrees that PISA is such an authoritative assessment. Over the last month it has come in for serious criticism from academics, including Svend Kreiner (PDF) and Hugh Morrison (PDF). These interesting and important studies have been followed by a number of media articles criticising PISA – including a detailed analysis in the Times Educational Supplement last week.
As someone who has written about (PDF) some of the difficulties with PISA, I have read these studies (and subsequent media coverage) with interest. A number of valid points have been raised, and they point to various ways in which PISA may be improved (the need for PISA to become a panel dataset – following children throughout school – raised by Harvey Goldstein is a particularly important point). Yet I have also been frustrated to see PISA being described as “useless”.
This is a gross exaggeration. No data or test is perfect, particularly when it is tackling a notoriously difficult task such as cross-country comparisons, and that includes PISA. But to suggest it cannot tell us anything important or useful is very wide of the mark. For instance, if PISA really told us nothing about children’s academic ability, then it should not correlate very highly with our own national test measures. But this is not the case. Figure 1 illustrates the strong (r = 0.83) correlation between children’s PISA maths test scores and performance in England’s old Key Stage 3 national exams. This illustrates that PISA scores are in fact strongly associated with England’s own measures of pupils’ academic achievement.
Figure 1. The correlation between PISA maths and Key Stage 3 maths test scores
Source: https://www.education.gov.uk/publications/eOrderingDownload/RR771.pdf page 100
To take another example, does the recent criticism of PISA mean we actually don’t know how the educational achievement of school children in England compares to other countries? Almost certainly not. To demonstrate this, it is very useful to draw upon another major international study of secondary school pupils’ academic achievement, TIMSS. This has different strengths and weaknesses relative to PISA, though it at least partially overcomes some of the recent criticisms. The key question is: does it tell us the same broad story about England’s relative position?
The answer to this question is yes – and this is shown in Figure 2. PISA 2009 maths test scores are plotted along the horizontal axis and TIMSS 2011 maths test scores along the vertical axis. I have fitted a regression line to illustrate the extent to which the two surveys agree over the cross-national ranking of countries. Again, the correlation is very strong (r = 0.88). England is hidden somewhat under a cloud of points, but is highlighted using a red circle. Whichever study we use to look at England’s position relative to other countries, the central message is clear. We are way behind a number of high-performing East Asian nations (the likes of Japan, Korea and Hong Kong) but are quite some way ahead of a number of low- and middle-income countries (for example Turkey, Chile, Romania). Our exact position in the rankings may fluctuate a little (due to sampling variation, differences in the precise skills tested and sample design), but the overall message is that we are doing okay, while other countries are doing a lot better.
Figure 2. The correlation between PISA 2009 and TIMSS 2011 Maths test scores
Source: Appendix 3 of http://johnjerrim.files.wordpress.com/2013/07/main_body_jpe_resubmit_final.pdf
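For readers curious how a correlation coefficient and fitted line of the kind shown in Figure 2 are obtained, here is a minimal Python sketch. It is not the code behind the figures: the country scores are invented purely for illustration, and the names pearson_r, fit_line, survey_a and survey_b are hypothetical.

```python
from math import sqrt


def _sums(xs, ys):
    """Return the means and the centred sums of squares/products."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    _, _, sxx, syy, sxy = _sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares line y = intercept + slope * x."""
    mean_x, mean_y, sxx, _, sxy = _sums(xs, ys)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up country mean scores (NOT real PISA or TIMSS results).
survey_a = [420.0, 450.0, 480.0, 495.0, 510.0, 545.0]
survey_b = [410.0, 465.0, 470.0, 505.0, 500.0, 560.0]

r = pearson_r(survey_a, survey_b)
slope, intercept = fit_line(survey_a, survey_b)
print(f"r = {r:.2f}; fitted line: y = {intercept:.1f} + {slope:.2f} * x")
```

With real data, the two lists would hold each country's mean score in the two surveys, with one entry per country that took part in both.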
I think what needs to be realised is that drawing international comparisons is intrinsically difficult. PISA is not perfect, as I have pointed out in the past, but it does still contain useful and insightful information. Indeed, there are a number of other areas – ‘social’ (income) mobility being one – where cross-national comparisons are on a much less solid foundation. Perhaps we in the education community should be a little more grateful for the high quality data that we have rather than focusing on the negatives all the time, while of course looking for further ways it can be improved.
For details on my work using PISA, see http://johnjerrim.com/papers/


7 Responses to “People having a pop at PISA should give it a break…”

  • 1
    kristianstill wrote on 30 July 2013:

    I continue to learn and enjoy reading hour posts. I can only imagine the scope and challenge of international comparison and, somewhat aligned with your commentary here, found the recent media criticism a little lightweight and superficial.

  • 2
    kristianstill wrote on 30 July 2013:

    #your posts

  • 3
    thelearningprofessor wrote on 31 July 2013:

    Lightweight is normal in the media. What we have to guard against is (a) policy makers using these reports as an excuse for ploughing out of PISA (as the Scottish Government has done with TIMSS); and (b) our fellow academics, including many in other disciplines, assuming that they can dismiss PISA as inherently worthless. Thanks, John, for a timely and constructive posting.

  • 4
    Richard wrote on 6 August 2013:

    Two random examples to prove a pre-determined point. Pretty lightweight as far as serious academic analysis goes

  • 5
    Finding out more about education Dutch style | behrfacts wrote on 19 August 2013:

    […] Finland is an obvious choice as it does well on international PISA comparisons, but there is some evidence to suggest that this is due to very unique circumstances, plus that all is not completely well in Finnish school mathematics. Separately OECD’s approach to international educational testing is under attack from various q…. […]

  • 6
    Profe Mauro wrote on 24 September 2013:

    Reblogged this on Being tech smart teachers.

  • 7
    Stephen Elliott wrote on 27 April 2014:

    John Jerrim
    Perhaps you can revise your support for OECD Pisa in light of this article by Dr Hugh Morrison? Readers would benefit from your “expertise” which no doubt will be evident in your expert rebuttal of his arguments. Have a go.
    When psychometricians claimed to be able to measure, they used the term ‘measurement’ not just for political reasons but also for commercial ones. … Those who support scientific research economically, socially and politically have a manifest interest in knowing that the scientists they support work to advance science, not subvert it. And those whose lives are affected by the application of what are claimed to be ‘scientific findings’ also have an interest in knowing that these ‘findings’ have been seriously investigated and are supported by evidence. (Michell, 2000, p. 660)
    This essay is a response to the claim by the Department of Education that: “The OECD is at the forefront of the academic debate regarding item response theory [and] the OECD is using what is acknowledged as the best available methodology [for international comparison studies].”
    Item Response Theory plays a pivotal role in the methodology of the PISA international league table. This essay refutes the claim that item response theory is a settled, well-reasoned approach to educational measurement. It may well be settled amongst quantitative psychologists, but I doubt if there is a natural scientist on the planet who would accept that one can measure mental attributes in a manner which is independent of the measuring instrument (a central claim of item response theory). It will be argued below that psychology’s approach to the twin notions of “quantity” and “measurement” has been controversial (and entirely erroneous) since its earliest days. It will be claimed that the item response methodology, in effect, misuses the two fundamental concepts of quantity and measurement by re-defining them for its own purposes. In fact, the case will be made that PISA ranks are founded on a “methodological thought disorder” (Michell, 1997).
    Given the concerns of such a distinguished statistician as Professor David Spiegelhalter, the Department of Education’s continued endorsement of PISA is difficult to understand. This essay extends the critique of PISA and item response theory beyond the concerns of Spiegelhalter to the very data from which the statistics are generated. Frederic Lord (1980, pp. 227-228), the father of modern psychological measurement, warned psychologists that when applied to the individual test-taker, item response theory produces “absurd” and “paradoxical” results. Given that Lord is one of the architects of item response theory, it is surprising that this admission provoked little or no debate among quantitative psychologists. Are politicians and the general public aware that item response theory breaks down when applied to the individual?
    In order to protect the item response model from damaging criticism, Lord proposed what physicists call a “hidden variables” ensemble model when interpreting the role probability plays in item response theory. As a consequence item response models are deterministic and draw on Newtonian measurement principles. “Ability” is construed as a measurement-independent “state” of the individual which is the source of the responses made to test items (Borsboom, Mellenbergh, & van Heerden, 2003). Furthermore, item response theory is incapable of taking account of the fact that the psychologist participates in what he or she observes. Richardson (1999) writes: “[W]e find that the IQ-testing movement is not merely describing properties of people: rather, the IQ test has largely created them” (p. 40). The participative nature of psychological enquiry renders the objective Newtonian model inappropriate for psychological measurement. This prompted Robert Oppenheimer, in his address to the American Psychological Association, to caution: “[I]t seems to me that the worst of all possible misunderstandings would be that psychology be influenced to model itself after a physics which is not there anymore, which has been quite outdated.”
    Unlike psychology, Newtonian measurement has very precise definitions of “quantity” and “measurement” which item response theorists simply ignore. This can have only one interpretation, namely, that the numerals PISA attaches to the education systems of countries aren’t quantities, and that PISA doesn’t therefore “measure” anything, in the everyday sense of that word. I have argued elsewhere that item response theory can escape these criticisms by adopting a quantum theoretical model (in which the notions of “quantity” and “measurement” lose much of their classical transparency). However, that would involve rejecting one of the central tenets of item response theory, namely, the independence of what is measured from the measuring instrument. Item response theory has no route out of its conceptual difficulties.
    This represents a conundrum for the Department of Education. In endorsing PISA, the Department is, in effect, supporting a methodology designed to identify shortcomings in the mathematical attainment of pupils, when that methodology itself has serious mathematical shortcomings.
    Modern item response theory is founded on a definition of measurement promulgated by Stanley Stevens and addressed in detail below. By this means, Stevens (1958, p. 384) simply pronounced psychology a quantitative science which supported measurement, ignoring established practice elsewhere in the natural sciences. Psychology refused to confront Kant’s view that psychology couldn’t be a science because mental predicates couldn’t be quantified. Wittgenstein’s (1953, p. 232) scathing critique had no impact on quantitative psychology: “The confusion and barrenness of psychology is not to be explained by calling it a “young science”; its state is not comparable with that of physics, for instance, in its beginnings. … For in psychology there are experimental methods and conceptual confusion. … The existence of the experimental method makes us think we have the means of solving the problems which trouble us; though problem and method pass one another by.”
    Howard Gardner (2005, p. 86), the prominent Harvard psychologist, looks back in despair to the father of psychology itself, William James:
    On his better days William James was a determined optimist, but he harboured his doubts about psychology. He once declared, “There is no such thing as a science of psychology,” and added “the whole present generation (of psychologists) is predestined to become unreadable old medieval lumber, as soon as the first genuine insights are made.” I have indicated my belief that, a century later, James’s less optimistic vision has materialised and that it may be time to bury scientific psychology, at least as a single coherent undertaking.
    I will demonstrate in a follow-up paper to this essay an alternative approach which solves the measurement problem as Stevens presents it, but in a manner which is perfectly in accord with contemporary thinking in the natural sciences. None of the seemingly intractable problems which attend item response theory trouble my account of measurement in psychology.
    However, my solution renders item response theory conceptually incoherent.
    In passing it should be noted that some have sought to conflate my analysis with that of Svend Kreiner, suggesting that my concerns would be assuaged if only PISA could design items which measured equally from country to country. Nothing could be further from the truth; no adjustment in item properties can repair PISA or item response theory. No modification of the item response model would address its conceptual difficulties.
    The essay draws heavily on the research of Joel Michell (1990, 1997, 1999, 2000, 2008) who has catalogued, with great care, the troubled history of the twin notions of quantity and measurement in psychology. The following extracts from his writings, in which he accuses quantitative psychologists of subverting science, counter the assertion that item response theory is an appropriate methodology for international comparisons of school systems.
    From the early 1900s psychologists have attempted to establish their discipline as a quantitative science. In proposing quantitative theories they adopted their own special definition of measurement and treated the measurement of attributes such as cognitive abilities, personality traits and sensory intensities as though they were quantities of the type encountered in the natural sciences. Alas, Michell (1997) presents a carefully reasoned argument that psychological attributes lack additivity and therefore cannot be quantities in the same way as the attributes of Newtonian physics. Consequently he concludes: “These observations confirm that psychology, as a discipline, has its own definition of measurement, a definition quite unlike the traditional concept used in the physical sciences” (p. 360).
    Boring (1929) points out that the pioneers of psychology quickly came to realise that if psychology was not a quantitative discipline which facilitated measurement, psychologists could not adopt the epithet “scientist” for “there would … have been little of the breath of science in the experimental body, for we hardly recognise a subject as scientific if measurement is not one of its tools” (Michell, 1990, p. 7).
    The general definition of measurement accepted by most quantitative psychologists is that formulated by Stevens (1946) which states: “Measurement is the assignment of numerals to objects or events according to rules” (Michell, 1997, p. 360). It seems that psychologists assign numbers to attributes according to some pre-determined rule and do not consider the necessity of justifying the measurement procedures used so long as the rule is followed. This rather vague definition distances measurement in psychology from measurement in the natural sciences. Its near universal acceptance within psychology and the reluctance of psychologists to confirm (via empirical study) the quantitative character of their attributes casts a shadow over all quantitative work in psychology. Michell (1997, p. 361) sees far-reaching implications for psychology:
    If a quantitative scientist (i) believes that measurement consists entirely in making numerical assignments to things according to some rule and (ii) ignores the fact that the measurability of an attribute presumes the contingent … hypothesis that the relevant attribute possesses an additive structure, then that scientist would be predisposed to believe that the invention of appropriate numerical assignment procedures alone produces scientific measurement.
    Historically, Fechner (1860) – who coined the word “psychophysics” – is recognised as the father of quantitative psychology. He considered that the only creditworthy contribution psychology could make to science was through quantitative approaches and he believed that reality was “fundamentally quantitative.” His work focused on the instrumental procedures of measurement and dismissed any requirement to clarify the quantitative nature of the attribute under consideration.
    His understanding of the logic of measurement was fundamentally flawed in that he merely presumed (under some Pythagorean imperative) that his psychological attributes were quantities. Michell (1997) contends that although occasional criticisms were levied against quantitative measurement in psychology, in general the approach was not questioned and became part of the methodology of the discipline. Psychologists simply assumed that when the study of an attribute generated numbers, that attribute was being measured.
    The first official detailed investigation of the validity of psychological measurement from beyond its professional ranks was conducted – under the auspices of the British Association for the Advancement of Science – by the Ferguson Committee in 1932. The non-psychologists on the committee concluded that there was no evidence to suggest that psychological methods measured anything, as the additivity of psychological attributes had not been demonstrated. Psychology moved to protect its place in the academy at all costs. Rather than admitting the error identified by the committee and going back to the drawing board, psychologists sought to defend their modus operandi by attempting a redefinition of psychological measurement. Stevens’ (1958, p. 384) definition that measurement involved “attaching numbers to things” legitimised the measurement practices of psychologists who subsequently were freed from the need to test the quantitative structure of psychological predicates.
    Michell (1997, p. 356) declares that presently many psychological researchers are “ignorant with respect to the methods they use.” This ignorance permeates the logic of their methodological practices in terms of their understanding of the rationale behind the measurement techniques used. The immutable outcome of this new approach to measurement within psychology is that the natural sciences and psychology have quite different definitions of measurement.
    Michell (1997, p. 374) believes that psychology’s failure to face facts constitutes a “methodological thought disorder” which he defines as “the sustained failure to see things as they are under conditions where the relevant facts are evident.” He points to the influence of an ideological support structure within the discipline which serves to maintain this idiosyncratic approach to measurement. He asserts that in the light of commonly available evidence, interested empirical psychologists recognise that “Stevens’ definition of measurement is nonsense and the neglect of quantitative structure a serious omission” (Michell, 1997, p. 376).
    Despite the writings of Ross (1964) and Rozeboom (1966), for example, Stevens’ definition has been generally accepted as it facilitates psychological measurement by an easily attainable route. Michell (1997, p. 395) describes psychology’s approach to measurement as “at best speculation and, at worst, a pretence at science.”
    [W]e are dealing with a case of thought disorder, rather than one of simple ignorance or error and, in this instance, these states are sustained systemically by the almost universal adherence to Stevens’ definition and the almost total neglect of any other in the relevant methodology textbooks and courses offered to students. The conclusion that follows from this history, especially that of the last five decades, is that systemic structures within psychology prevent the vast majority of quantitative psychologists from seeing the true nature of scientific measurement, in particular the empirical conditions necessary for measurement. As a consequence, number-generating procedures are consistently thought of as measurement procedures in the absence of any evidence that the relevant psychological attributes are quantitative. Hence, within modern psychology a situation exists which is accurately described as systemically sustained methodological thought disorder. (Michell, 1997, p. 376)
    To make my case, let me first make two fundamental points which should shock those who believe that the OECD is using what is acknowledged as the best available methodology for international comparisons. Both of these points should concern the general public and those who support the OECD’s work. First, the numerals that PISA publishes are not quantities, and second, PISA tables do not measure anything.
    To illustrate the degree of freedom afforded to psychological “measurement” by Stevens it is instructive to focus on the numerals in the PISA table. Could any reasonable person believe in a methodology which claims to summarise the educational system of the United States or China in a single number? Where is the empirical evidence for this claim? Three numbers are required to specify even the position of a single dot produced by a pencil on one line of one page of one of the notebooks in the schoolbag of one of the thousands of American children tested by PISA. The Nobel Laureate, Sir Peter Medawar refers to such claims as “unnatural science.” Medawar (1982, p. 10) questions such representations using Philip’s (1974) work on the physics of a particle of soil:
    The physical properties and field behaviour of soil depends on particle size and shape, porosity, hydrogen ion concentration, material flora, and water content and hygroscopy. No single figure can embody itself in a constellation of values of all these variables in any single real instance … psychologists would nevertheless like us to believe that such considerations as these do not apply to them.
    Quantitative psychology, since its inception, has modelled itself on the certainty and objectivity of Newtonian mechanics. The numerals of the PISA tables appear to the man or woman in the street to have all the precision of measurements of length or weight in classical physics. But, by Newtonian standards, psychological measurement in general, and item response theory in particular, simply have no quantities, and do not “measure,” as that word is normally understood.
    How can this audacious claim to “measure” the quality of a continent’s education provision and report it in a single number be justified? The answer, as has already been pointed out, is to be found in the fact that quantitative psychology has its own unique definition of measurement, which is that “measurement is the business of pinning numbers on things” (Stevens, 1958, p. 384). With such an all-encompassing definition of measurement, PISA can justify just about any rank order of countries. But this isn’t measurement as that word is normally understood.
    This laissez-faire attitude wasn’t always the case in psychology. It is clear that, as far back as 1905, psychologists like Titchener recognised that their discipline would have to embrace the established definition of measurement in the natural sciences: “When we measure in any department of natural science, we compare a given measurement with some conventional unit of the same kind, and determine how many times the unit is contained in the magnitude” (Titchener, 1905, p. xix). Michell (1999) makes a compelling case that psychologists adopted Stevens’ ultimately meaningless definition of measurement – “according to Stevens’ definition, every psychological attribute is measurable” (Michell, 1999, p. 19) – because they feared that their discipline would be dismissed by the “hard” sciences without the twin notions of quantity and measurement.
    The historical record shows that the profession of psychology derived economic and other social advantages from employing the rhetoric of measurement in promoting its services and that the science of psychology, likewise, benefited from supporting the profession in this by endorsing the measurability thesis and Stevens’ definition. These endorsements happened despite the fact that the issue of the measurability of psychological attributes was rarely investigated scientifically and never resolved. (Michell, 1999, p. 192)
    The mathematical symbolism in the next paragraph makes clear the contrast between the complete absence of rigorous measurement criteria in psychology and the onerous demands placed on the classical physicist.
    An essential step in establishing the validity of the concepts “quantity” and “measurement” in item response theory is an empirical analysis centred on Hölder’s conditions. The reader will search in vain for evidence that quantitative psychologists in general, and item response theorists in particular, subject the predicate “ability” to Hölder’s conditions.
    This is because the definition of measurement in psychology is so vague that it frees psychologists of any need to address Hölder’s conditions and permits them, without further ado, to simply accept that the predicates they purport to measure are quantifiable.
    Quantitative psychology presumed that the psychological attributes which they aspired to measure were quantitative. … Quantitative attributes are attributes having a quite specific structure. The issue of whether psychological attributes have that sort of structure is an empirical issue … Despite this, mainstream quantitative psychologists … not only neglected to investigate this issue, they presumed that psychological attributes are quantitative, as if no empirical issue were at stake. This way of doing quantitative psychology, begun by its founder, Gustav Theodor Fechner, was followed almost universally throughout the discipline and still dominates it. … [I]t involved a defective definition of a fundamental methodological concept, that of measurement. … Its understanding of the concept of measurement is clearly mistaken because it ignores the fact that only quantitative attributes are measurable. Because this … has persisted within psychology now for more than half a century, this tissue of errors is of special interest. (Michell, 1999, pp. xi – xii)
    This essay has sought to challenge the Department of Education’s claim that in founding its methodology on item response theory, PISA is using the best available methodology to rank order countries according to their education provision. As Sir Peter Medawar makes clear, any methodology which claims to capture the quality of a country’s entire education system in a single number is bound to be suspect. If my analysis is correct PISA is engaged in rank-ordering countries according to the mathematical achievements of their young people, using a methodology which itself has little or no mathematical merit.
    Item response theorists have identified two broad interpretations of probability in their models: the “stochastic subject” and “repeated sampling” interpretations. Lord has demonstrated that the former leads to absurd and paradoxical results without ever investigating why this should be the case. Had such an investigation been initiated, quantitative psychologists would have been confronted with the profound question of the very role probability plays in psychological measurement. Following a pattern of behaviour all too familiar from Michell’s writings, psychologists simply buried their heads in the sand and, at Lord’s urging, set the stochastic subject interpretation aside and emphasised the repeated sampling approach.
    In this way the constitutive nature of irreducible uncertainty in psychology was eschewed for the objectivity of Newtonian physics. This is reflected in item response theory’s “local hidden variables” ensemble model in which ability is an intrinsic measurement-independent property of the individual and measurement is construed as a process of merely checking up on what pre-exists measurement. For this to be justified, Hölder’s seven axioms must apply.
    In order to justify the labels “quantity” and “measurement” PISA must produce the relevant empirical evidence against the Hölder axioms. Absent such evidence, it seems very difficult to justify the Department of Education’s claims that (i) “the OECD is at the forefront of the academic debate regarding item response theory,” and (ii) “the OECD is using what is acknowledged as the best available methodology [for international comparison studies].”
    Dr Hugh Morrison
    Boring, E.G. (1929). A history of experimental psychology. New York: Century.
    Borsboom, D., Mellenbergh, G.J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203-219.
    Fechner, G.T. (1860). Elemente der Psychophysik. Leipzig: Breitkopf & Härtel. (English translation by H.E. Adler, Elements of Psychophysics, vol. 1, D.H. Howes & E.G. Boring (Eds.). New York: Holt, Rinehart & Winston.)
    Gardner, H. (2005). Scientific psychology: Should we bury it or praise it? In R.J. Sternberg (Ed.), Unity in psychology (pp. 77-90). Washington DC: American Psychological Association.
    Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
    Medawar, P.B. (1982). Pluto’s republic. Oxford University Press.
    Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
    Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 353-385.
    Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge: Cambridge University Press.
    Michell, J. (2000). Normal science, pathological science and psychometrics. Theory and Psychology, 10, 639-667.
    Michell, J. (2008). Is psychometrics pathological science? Measurement: Interdisciplinary Research and Perspectives, 6, 7-24.
    Oppenheimer, R. (1956). Analogy in science. The American Psychologist, 11, 127-135.
    Philip, J.R. (1974). Fifty years progress in soil physics. Geoderma, 12, 265-280.
    Richardson, K. (1999). The making of intelligence. London: Weidenfeld & Nicolson.
    Ross, S. (1964). Logical foundations of psychological measurement. Copenhagen: Munksgaard.
    Rozeboom, W.W. (1966). Scaling theory and the nature of measurement. Synthese, 16, 170-223.
    Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 667-680.
    Stevens, S.S. (1958). Measurement and man. Science, 127, 383-389.
    Titchener, E.B. (1905). Experimental psychology: A manual of laboratory practice, vol. 2. London: Macmillan.
    Wittgenstein, L. (1953). Philosophical Investigations. Oxford: Blackwell.