Assessment in education: 15+ myths

[Paper read at ResearchEd Amstelveen 2017]

In 1983 I published a small book, in Dutch, on achievement test item writing. A short characterization: the book wanted to do away with the (still too common) notion that item writing is an art, not a craft. The book is in need of rewriting. Because that is a major project, a good start might be some in-depth treatment of myths about assessment. I present here a listing of 15+ major myths. They will get a workout in a long series of blogs. For example, myth #1, Assessment is measurement, is an important one, with many ramifications in somewhat faulty educational practices. Yet there are good alternatives available. Myth #1 by itself will be good for a long series of blogs treating all its aspects.

  • Ben Wilbrink (1983). Toetsvragen schrijven [Item writing]. Utrecht: Het Spectrum, Aula 809. [Original Dutch version; download 1.4 Mb pdf]
  • Ongoing revision website [further work will be, as much as possible, in this new blog series]

The list of myths might amaze you, in the sense that you may not even recognize some of them as common opinions on assessment. I will therefore provide some clues in the form of one or two references to key publications, or maybe some key quotes from the literature. Full treatments will only be given in the promised blog series. On February 25 I opened the following thread.

How many myths on educational assessment might there be? Is it possible to do a fast inventory? What do I think to be the top fifteen myths? thread opening

Myth #1 Assessment is measurement
It isn’t; it is sample-based. Of course, you have always known assessments to be samples: any particular sample could be more friendly to you, or less, depending on what? On luck?
The main idea of psychometrics is that psychological tests are measurements. Michell (2008) argues that psychometrics is therefore a pathological science. The problems with this faulty idea of tests being measurements are, among many others, (1) that what is being ‘measured’ is something thing-like in the person, and (2) that these ‘measurements’ supposedly may be transformed in the same kinds of ways as measurements of temperature (Celsius or Fahrenheit; or even Kelvin’s absolute scale). Measurement would be of something hidden in the brain, more often than not called ‘latent traits’, ‘ability’, ‘intelligence’. There is nothing we know now about the psychology of our brain that corresponds to those constructs! See also Myth #7.
For the concept of assessments as samples a number of models are available, beginning with the models of mathematical statistics itself. In my opinion the best models are decision-theoretic (following the lead of Cronbach & Gleser, 1965, 2nd edition). For educational assessments the main decision-makers are the pupils or students themselves (amazed? you’d better be!). Bob van Naerssen (University of Amsterdam) presented a basic model in 1970.
Why the fuss, what difference does it make? Well, there is a lot of disinformation and misunderstanding in the literature on assessment, predicated on the idea that assessments are measurements, not samples.
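
To make the sampling point concrete, here is a tiny simulation; it is my illustration with made-up numbers (mastery 0.70, tests of 20 items), not anything from the literature. A pupil who knows 70% of a large domain of items sits ten parallel tests, each a fresh random sample from that domain, and her scores scatter by the luck of the draw alone.

    # A pupil 'knows' 70% of a large domain of items. Each test samples
    # 20 items at random from that domain, so her score varies from test
    # to test by sampling luck alone; her mastery never changes.
    import random

    random.seed(1)
    MASTERY, N_ITEMS, N_TESTS = 0.70, 20, 10   # illustrative numbers

    scores = [sum(random.random() < MASTERY for _ in range(N_ITEMS))
              for _ in range(N_TESTS)]
    print("scores on ten parallel tests:", scores)
    # Typically the scores scatter over five points or more: the same
    # 'true' mastery, quite different grades.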

  • Lee J. Cronbach & Goldine C. Gleser (1965). Psychological tests and personnel decisions. University of Illinois Press.
  • Joel Michell (2008). Is Psychometrics Pathological Science? Measurement, 6, 7-24. pdf
  • Robert van Naerssen (1970). Over optimaal studeren en tentamens combineren [On optimal studying and combining exams]. Inaugural lecture. [Dutch] webpage

Myth #2. Assessment is the prerogative of teacher/school/etcetera
It isn’t; they are constrained by justice and law. In Dutch higher education it is possible to appeal (parts of) examinations in real courts, which still seems to be rather unique in the world. Job Cohen (later mayor of Amsterdam) wrote a dissertation on the practice, which, with his permission, I have made available.

Myth #3. The more objective/reliable, the better
No: pupils (or students) and assessors are gaming each other. I told you so. By the way: they have always done so, through the ages, wherever in the world.

  • Ben Wilbrink (1992). The first year examination as negotiation; an application of Coleman’s social system theory to law education data. European Conference on Educational Research. paper [this paper sports an MTMM matrix with very high validities, an indication that something serious is tapped here] [tags: grade inflation; predictability of exam results] [added: correspondence with James Coleman]

Myth #4. It’s okay for intelligent pupils to have an edge
No; this is education, not psychological diagnosis or personnel selection. Pupils should be able to prepare efficiently, that’s all there is to it. It is quite difficult for teachers to grasp the fine point that differences in intellectual capacities should have played out in the preparatory trajectory, and should not really affect the achievements themselves. Another way to say this is: it is the task of educators to educate, not to test for differences in intellectual capacities.
Suppose the goal of a course, lesson, whatever, is mastery of the subject for every pupil. Of course, there will be differences between pupils, making it necessary for some of them to invest more time to reach that goal than others need. Accommodate for the differences in time needed. Some allowance might be made for differences in aspiration levels, if the character of the subject allows for those differences. A simple mathematical model (Tromp & Wilbrink, 1977) captures the situation, and might be used as a heuristic to conceptually clear up many situations involving individual differences. The model itself is a fully recursive structural equation model: if you’ve got data, fill them in. (A simulation sketch follows the references below.)

  • A. D. de Groot (1970). Some badly needed non-statistical concepts in applied psychometrics. Nederlands Tijdschrift voor de Psychologie, 25, 360-376. webpage
  • Dick Tromp & Ben Wilbrink (1977). Het meten van studietijd [Measuring preparation times]. Congresboek OnderwijsResearchDagen. [Dutch] [path diagram: top: exogenous variables such as capability; left: aspiration level; bottom: time spent; right: achievement] html
    For the assessment to be truly curriculum-aligned, the coefficient of the direct path from exogenous variables to achievement should be near zero: intellectual differences work out through time spent and the aspiration level posed.
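
As a heuristic, here is a small simulation in the spirit of that path model; it is my sketch, and all coefficients are invented, not Tromp & Wilbrink’s estimates. Capability acts on achievement only through aspiration level and time spent, and ordinary least squares on the simulated data recovers the near-zero direct path.

    # Simulated fully recursive path model: capability reaches achievement
    # only *through* aspiration level and time spent; the direct path is
    # set to zero, as it should be in curriculum-aligned assessment.
    # All coefficients are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    capability  = rng.normal(size=n)                     # exogenous variable
    aspiration  = 0.4 * capability + rng.normal(size=n)  # aspiration level posed
    time_spent  = (-0.5 * capability                     # abler pupils need less time
                   + 0.6 * aspiration + rng.normal(size=n))
    achievement = (0.0 * capability                      # direct path: zero
                   + 0.5 * aspiration + 0.7 * time_spent
                   + rng.normal(size=n))

    # Recover the structural coefficients by ordinary least squares,
    # as one would with real data.
    X = np.column_stack([capability, aspiration, time_spent, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, achievement, rcond=None)
    print("estimated paths (capability, aspiration, time):", coef[:3].round(2))
    # The capability path comes out near zero: intellectual differences
    # have already played out in the preparation, not in the test itself.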

Myth #5. Item writing is an art
That it is still treated as one is a bloody shame. Item writing should be a craft, grounded in cognitive psychology as well as epistemology. Yet even ‘professional’ item writing for the big tests (US: ACT, SAT; Netherlands: Cito examinations) is approached as an art of the teachers involved.
Theory about the content of achievement test items is virtually non-existent. Yet quality of content evidently is the primary factor determining the quality of achievement tests. It is therefore very disappointing to find textbooks on educational measurement uncritically repeating the old adage (see for example Wesman in Thorndike (1970), Educational measurement) that the writing of achievement test items is an art one can only master through years and years of practice. Books on test item writing, for example Roid and Haladyna’s, also play down issues of content, concentrating instead on issues of form. Even the magnificent work of Bloom, Madaus and others in the seventies, on educational objectives and how to test them, does not really concern content issues either.

  • Benjamin S. Bloom, J. Thomas Hastings, and George F. Madaus (1971). Handbook on formative and summative evaluation of student learning. McGraw-Hill.
  • Gale Roid & Thomas M. Haladyna (1982). A technology for test-item writing. Academic Press.
  • A. G. Wesman (1970). Writing the test item. In R. L. Thorndike (Ed.): Educational measurement. American Council on Education.

Myth #6. Psychometrics is the methodology of choice for assessment
It isn’t; assessment is quite different from psychological diagnostics. In a 1986 paper I explained the crucial difference between psychological testing and educational assessment: for testing, the assumption is that testees are not in any way specifically prepared; for assessment, the assumption is that the assessment is curriculum-aligned, and that the pupils have prepared themselves specifically for it. Two different worlds. Yet most of you will have noticed that the literature on assessment uses the methodology of the world of psychological testing without blinking. In the Netherlands my analysis had a direct impact on the 1988 edition (still the latest edition) of the Richtlijnen voor tests en toetsen [Standards for tests and assessments].

  • NIP (1986). Richtlijnen voor ontwikkeling en gebruik van psychologische tests en studietoetsen [Guidelines for the development and use of psychological tests and educational achievement tests]. Amsterdam: Nederlands Instituut van Psychologen. Second edition. Chapter 8: Toetsgebruik in het onderwijs [Test use in education]. pdf
  • Ben Wilbrink (1986). Toetsen en testen in het onderwijs. In S.V.O. Jaarverslag / Jaarboek 1985. Den Haag: S.V.O., 275-288. webpage [Dutch]

Myth #7. Assessment is about pupils’ brain contents
No, we don’t know anything enabling us to pull off tricks like that. The closest we can get now is by using models of cognitive architecture, such as the ACT-R model by John R. Anderson and his co-workers.
There is a lot of talk in edutopia about knowing, understanding, and all kinds of psychological constructs supposedly describing what is happening in our brain cells outside of our abilities of introspection. Think of bad old Bloom’s cognitive taxonomy, and more recent variants thereof. It takes a lot of sophisticated psychological theory (based on experiments, of course) to explain the simple as well as the more complex happenings we suspect to be there when we speak of ‘thinking’ pupils, teachers, or presidents. Using pseudo-psychological jargon in speaking of the assessment of learning results is not helpful; let’s stop doing that.

  • Richly filled website of the ACT-R research group. Filled with research articles, that is, most of them available for reading or download. http://act-r.psy.cmu.edu
  • Stellan Ohlsson (2011). Deep Learning: How the Mind Overrides Experience. Cambridge University Press. info
  • Christine Counsell (Jan 11, 2017). Genericism’s children. blog [The problem exemplified in history education]

Myth #8. True assessment is on transfer to real life situations (PISA-type tests)
This misconception is a direct threat to much of our education. PISA Math illustrates the problem: it is not truly about math, nor about ‘real life situations’, yet the test holds the OECD world hostage. Amazingly, the OECD gets away with its failure to offer empirical evidence on the most important validity issues involved here. The crucial point: according to the Standards of the American associations APA, NCME and AERA, a test without sufficient information on its validity should not be used. Ever.
Having a word for a supposed phenomenon does not prove the existence of the phenomenon. There is a lot of controversy on transfer, even though a century ago Edward Thorndike explained the issue perfectly well. As far as new situations go, there is no guarantee whatsoever that we will use the knowledge of ours that is relevant to that situation. Already some contradictions are surfacing: if we have relevant knowledge, why is the situation ‘new’, then? Forget the word transfer, and follow Stellan Ohlsson’s (2011, mentioned earlier) hunch that new situations require new learning. That’s all.
In the problem-solving literature of the seventies (Newell & Simon) it was made perfectly clear that a precondition for problem solving is the availability of the productions that make it possible to reach solutions (Ben Wilbrink, 1983, chapter 7).

  • Allen Newell (1990). Unified theories of cognition. Harvard University Press. info
  • Allen Newell & Herbert A. Simon (1972). Human problem solving. Prentice Hall.
  • Ben Wilbrink (ongoing, on item writing) 7. Probleemoplossen [problem solving]. webpage

Myth #9. Knowledge is not that important any more (e.g., Schleicher, OECD)
It’s a pity Andreas Schleicher can’t introspect his own automated knowledge. The cutting-edge psychology of learning (= acquiring knowledge): Stellan Ohlsson (2011).
This myth is an especially dangerous one, impacting assessment in a really destructive way by emphasizing the testing of generic skills (they do not exist; see Myth #10), and in that way impacting education also.
The amazing thing is that experimental cognitive psychology has made it abundantly clear that there is no way for non-trivial domain-specific creativity and problem solving to exist without a firm grounding in knowledge. Knowledge in the brain, that is, not in the cloud. The explanation is fairly simple: all cognitive activity must pass through our limited-capacity working memory; it is therefore absolutely out of the question that picking up knowledge ‘on the fly’ from whatever external source (calculator, the world wide web) will allow any meaningful cognitive activity at all. Ohlsson will explain; you’ve never read such stuff in your life! (In the tradition of Newell & Simon’s cognitive architecture models.)

  • Andreas Schleicher (2016). The world economy no longer pays you for what you know; Google knows everything. [YouTube, at 2’07”]
  • Stellan Ohlsson (2011). Deep Learning: How the Mind Overrides Experience. Cambridge University Press. info Stellan Ohlsson will present at ‘Make shift happen’, Amsterdam, October 2017 (as will Anders Ericsson, Yana Weinstein, David Didau, and Lucy Crehan).

Myth #10. Generic (21st century) skills are important, assess them (think tanks everywhere)!
Generic skills do not exist. Domain-specific ones do.

  • Ben Wilbrink (May 10, 2016). 21st century skills in Dutch ‘ed reform 2032’. OECD in denial of psychological research? blog

Myth #11. Soft skills might be game changers (OECD again; hobby horse of economists)
No. They’re personality traits. Do not invade privacy.

  • Ben Wilbrink (May 17, 2016). Personal development: OECD’s social and emotional skills. Is it science? blog

Myth #12. Assessments can be used for multiple purposes
No, this is the social world, not the physical one. Subjects will behave differently as soon as the stakes are changed, invalidating everything.

  • American Statistical Association (ASA) (2014). ASA Statement on Using Value-Added Models for Educational Assessment.
    https://www.amstat.org/policy/pdfs/ASA_VAM_Statement.pdf
  • Sharon L. Nichols and David C. Berliner (2005). The Inevitable Corruption of Indicators and Educators Through High-Stakes Testing. Education Policy Studies Laboratory, Arizona State University pdf (180 pp.)

Myth #13. ‘I know exactly what I am doing in my assessments’
Is that so? You should be able to give it a mathematical model, then. Yes?
How would you begin? From Myth #1 you know already that an assessment is a sample. In fact, every item is a sample. Assuming items are random samples from a large collection of applicable items, remember some basic statistics and see how a binomial distribution is a good model. It will help if you understand that for the pupil having to sit the assessment, the (say) 20 items are a random sample from that (possibly imaginary) collection, no matter how you, as a teacher, have in fact assembled the assessment.
This way of modeling assessments is a secret weapon of a handful (or fewer) of Dutch edumetrists, starting with Robert van Naerssen (1970), the work being continued by myself. The full-blown model is decision-theoretic. The main decision-maker is the pupil preparing to sit the assessment and having to decide: ‘is my preparation sufficient already?’ The teacher/school is the secondary decision-maker, designing the assessment in such a way as to enable pupils to prepare efficiently, as well as to achieve acceptable results. The utility of this kind of modeling is mainly to show that merely speculating on the workings of assessments is woefully insufficient for the design of systems of assessments (examinations).
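
Here is a minimal sketch of that decision problem; it is my illustration, not van Naerssen’s full model, and the test length, cutoff, mastery levels and study cost are all made-up numbers. The pupil compares her pass probability now with her pass probability after one more evening of study, under the binomial sample model of Myth #1.

    # Pupil's decision: sit the test now, or study one more evening?
    # Binomial sample model: a test of 20 items drawn at random from a
    # large item pool, pass mark 11 correct. All numbers are illustrative.
    from math import comb

    def pass_probability(n_items, cutoff, mastery):
        # P(score >= cutoff) when each item is known with probability `mastery`
        return sum(comb(n_items, k) * mastery**k * (1 - mastery)**(n_items - k)
                   for k in range(cutoff, n_items + 1))

    N_ITEMS, CUTOFF = 20, 11
    COST = 0.10   # utility cost of one more evening of study (assumed)

    p_now  = pass_probability(N_ITEMS, CUTOFF, mastery=0.60)
    p_more = pass_probability(N_ITEMS, CUTOFF, mastery=0.70)

    # With utility 1 for passing and 0 for failing, study more if and only
    # if the gain in pass probability exceeds the cost of the extra evening.
    print(f"P(pass) now: {p_now:.3f}, after one more evening: {p_more:.3f}")
    print("decision:", "study more" if p_more - COST > p_now else "sit the test now")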

  • Wolfram offers a free online service to evaluate, for example, binomial distributions. Think of an assessment of 20 items. Suppose a pupil thinks she will be able to answer 60% correct, that is, for every item the chance that it will be an item she knows will be 0.6. Play with it a bit. http://www.wolframalpha.com/input/?i=binomial+distribution+20+0.6
  • Robert van Naerssen (1970). Over optimaal studeren en tentamens combineren [On optimal studying and combining exams]. Inaugural lecture. [Dutch] webpage
  • Ben Wilbrink (ongoing). Strategic preparation for achievement tests. A model. webpage

Myth #14. It is only natural for assessments to compare different pupils’ achievements
No, that’s only traditional selective/competitive educational culture. Until the end of the 19th century (and even later) ranking pupils in class on the basis of points of merit and demerit was usual. Exam results, too, were reported as the rank achieved among the peers participating. Quite interesting history, by the way. Modern systems of grading evolved from these methods of ranking; grading systems essentially are just stylized systems of ranking. Knowing this history might teach the researcher, and the teacher, some humility in handling evaluations of assessment results.

  • Ben Wilbrink (1997). Assessment in historical perspective. Studies in Educational Evaluation, 23, 31-48. html

Myth #15. Good standardized testing of young pupils is harmless
But for: scapegoating, stereotyping, humiliation, and violations of children’s rights.
Repeatedly and unceasingly telling children that they belong to the worst achievers in class, or to the lowest percentile group of the country (a standard practice in Dutch primary schools), might be mental abuse in a severe form. Especially so if it is clear from the example of other schools, or even countries (Hirsch, 2016), that it is possible to reduce SES-related gaps in, for example, vocabulary. Testing children is never harmless; professional codes prohibit unnecessary testing, testing without consent of parents, interpreting results without being qualified to do so, archiving results longer than, say, three years, and letting third parties in on test results. Regrettably, schoolchildren are not protected by the professional codes of psychologists. Abuse of tests will result in stereotyping children, and might result in scapegoating where test results are regarded as ‘explanations’ of lagging achievements.

  • Convention on the Rights of the Child. text
  • E. D. Hirsch, Jr. (2016). Why knowledge matters. Rescuing our children from failed educational theories. Harvard Education Press. prologue pdf
  • James Murphy [Horatio Speaks] (March 5, 2017). Intelligence-ism. blog

Myth #16. ‘Well-designed multiple choice questions (MCQs) are just fine’
A tweet (March 4) prompted me to add this Myth #16. Of course, some questions are multiple choice in a natural kind of way; no problem with that. ‘Which planet circles the Sun closest?’ Other MCQs are artificial formats that do not belong in education. They give pupils the wrong message about the world, and they de-emphasize the importance of expressing oneself in writing.

  • Ben Wilbrink (1983). Toetsvragen schrijven [Item writing]. Par. 2.2: Keuzevragen [Multiple-choice questions]. [Dutch]
  • Ben Wilbrink (1977). Verborgen vooroordeel tegen andere dan meerkeuzevraagvormen [Hidden bias against formats other than multiple choice]. ORD, Amsterdam VU. webpage [Dutch] [Direct confrontation between a young researcher and the Dutch testing service, Cito, promoting the dogma that MCQ tests are the only acceptable achievement tests because of their ‘objectivity’]

[I might still add/correct some details. And add myths ;-)]


