Unfairness in testing: Random effects

Unfairness in testing is as old as tests themselves, and tests have been with us for centuries. Even Thomas Aquinas worried about his master’s exam. Beyond the test itself, unfairness arises from the effects of the student, the teacher, and administration policies. With respect to the test, unfairness comes from several directions, including the number of questions, the difficulty distribution, language, answer traps, question structure, question type (multiple choice, essay, short answer), and random effects. Random effects, as we will see, have considerable effects on grade outcomes all by themselves.

When I recently spoke at ICTCM16, I showed a short simulation I often use with students (you can email me for the code). The simulation is based on various probability density functions, and it shows how test difficulty and student ability interact to produce grade distributions. What we cannot simulate is what the instructor does. Briefly, the instructor (a) knows the abilities of the students, at least by final exam time, (b) knows student skills on particular problem types well, (c) gives partial credit, and (d) curves the grades as needed. All of these constitute another form of unfairness, but paradoxically one in the interest of fairness and, of course, grade consistency. This is evident in the strong resistance most instructors offer to giving tests they did not personally prepare. The teaching process is closely linked to perceived and actual testing fairness.

Testing agencies have different advantages. They have a bank of question types generated over many years of testing, and for each type they have complete information on relative difficulties. A skilled test engineer can assemble a test covering a specific curriculum, knowing how to mix difficulties and problem types so that the average student takes exactly the required xx minutes to complete it. When the problems must be hand graded by a reviewer, highly strict rubrics are constructed and enforced.

When a wholly new test is required, as with the Core Curriculum, which has quite a different curriculum, test engineers are challenged to match all of these factors and generate a fair test. This charge is extremely difficult and, with the Core Curriculum, proved to be impossible. The reasons are many, but saved for another day’s discussion.

Just three variables are considered: the statistics of the class abilities, the statistics of the item difficulties, and the number of students involved. We model these with probability distributions, usually normal with given means and standard deviations.

For the typical instructor, repeating a ten-item quiz from one year to the next controls for item difficulties but does not control for the class. The first experiment concerns such a quiz, given to a class of 20 students for 15 semesters. Only the students differ; their abilities are drawn from a normal distribution with a fixed mean and standard deviation. The theoretical score for every test given is 50%.
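The experiment just described is easy to reproduce. The following is a minimal stand-in sketch in Python, not the author’s code (which is available by email); it assumes an illustrative logistic link between student ability and item difficulty, so that with abilities and difficulties both standard normal the expected score is 50%.

```python
import math
import random
import statistics

def simulate_semesters(n_items=10, n_students=20, n_semesters=15, seed=1):
    """Give the SAME quiz (fixed item difficulties) to a freshly drawn
    class each semester; return the class average for each semester."""
    rng = random.Random(seed)
    # Fixed test: item difficulties drawn once and reused every semester.
    difficulties = [rng.gauss(0, 1) for _ in range(n_items)]
    averages = []
    for _ in range(n_semesters):
        scores = []
        for _ in range(n_students):
            ability = rng.gauss(0, 1)  # new class, same ability distribution
            # Assumed model: chance of a correct answer rises with ability
            # and falls with difficulty (a logistic, IRT-style link).
            correct = sum(rng.random() < 1 / (1 + math.exp(d - ability))
                          for d in difficulties)
            scores.append(100 * correct / n_items)
        averages.append(statistics.mean(scores))
    return averages

avgs = simulate_semesters()
print(f"class averages: {min(avgs):.1f}% to {max(avgs):.1f}%"
      f" (spread {max(avgs) - min(avgs):.1f} points)")
```

Running this a few times with different seeds makes the point directly: nothing about the test changed, yet the semester averages wander simply because each class is a fresh random draw.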

For this situation, we observe a grade spread of about 30%, or three letter grades. For the same 10-item test given over 15 semesters to 50 students, the grade spread is about 13%; given to 200 students, it is about 8%. Finally, if the test is given to 2000 students, a very large class, the grade spread is about 3%.

Now suppose we give the same 50-item test to varying numbers of students for 15 semesters. For 20 students, the grade spread is about 25%. For 50 students, it is about 15%; for 200 students, about 5%; and for our very large class of 2000 students, about 2%. These percentages vary widely. Now change the test by only a small amount (less than 1%) and the variation can be considerably greater.
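The shrinking spread with class size is what the law of large numbers predicts: the standard deviation of a class average scales like one over the square root of the class size. A hedged sketch that varies the class size, again assuming a simple logistic link between ability and difficulty (an illustrative choice, not necessarily the model behind the numbers above):

```python
import math
import random
import statistics

def grade_spread(n_students, n_items=50, n_semesters=15, seed=7):
    """Range (max - min) of semester class averages, in percentage points,
    when the same n_items test is given to a new class every semester."""
    rng = random.Random(seed)
    difficulties = [rng.gauss(0, 1) for _ in range(n_items)]  # fixed test
    averages = []
    for _ in range(n_semesters):
        scores = []
        for _ in range(n_students):
            ability = rng.gauss(0, 1)
            correct = sum(rng.random() < 1 / (1 + math.exp(d - ability))
                          for d in difficulties)
            scores.append(100 * correct / n_items)
        averages.append(statistics.mean(scores))
    return max(averages) - min(averages)

# Larger classes average out student-to-student luck, so the spread of
# semester averages shrinks roughly like 1/sqrt(class size).
for n in (20, 50, 200, 2000):
    print(f"{n:5d} students: spread about {grade_spread(n):.1f} points")
```

The exact numbers depend on the seed and on the assumed response model, but the qualitative pattern — a spread that shrinks sharply as the class grows — matches the percentages reported above.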

As is evident, the spread of grade averages is considerable, depending substantially on the number of students taking the test. Moreover, note that giving the same test is not much of a quality or consistency control, and relative certainty as to test fairness is at best vague. It is no wonder we sometimes call our current class better or worse than previous classes. It is true! Changing questions from year to year introduces even more variance into the grade spreads.

We’ll save our discussion about truly large scale, e.g. high-stakes, tests for another time.


Dr. Allen was one of our featured speakers at ICTCM 2016. Access more than 30 dynamic sessions by registering through the virtual track. Or if you have an idea for next year, submit a proposal.


About the Author
Dr. G. Donald Allen

Dr. Allen has been a professor of mathematics at Texas A&M University for more than three decades. He is currently Director of the Center for Technology-Mediated Learning in the Department of Mathematics. His mathematical research has been in the areas of probability, functional analysis, numerical analysis, neutronics, and mathematical modeling. His education research is in technology, survey design, and other subjects. Allen has co-developed an online calculus course and online texts in linear algebra and the history of mathematics. In addition, he has co-developed a fully online Master of Science degree in mathematics, one of only two nationally, and the only one specifically designed for teachers. For the master’s program, he developed more than seven online courses. Recently, Allen co-developed a “course-in-a-box” pre-calculus course by combining content, pedagogy, assessment, videos, animations, and interactivity. He is also active on two NSF-funded grants and state grants.

Allen has been working with technology for teaching collegiate mathematics for more than two decades, and has produced an array of interventions at all levels, K-16.

Allen is associate editor for the Schools Science and Mathematics Journal and the Focus on Mathematics Pedagogy and Content. Allen, with more than 60 publications, has given nearly 40 professional development workshops and over 150 seminars throughout North America, Africa, and Europe. In particular, he has participated in numerous professional development workshops, primarily for Texas high school teachers, including those in technology, algebra, pre-calculus, and problem solving. He has also developed a number of educational Flash interactive applets for teaching at various levels of mathematics, physics, and statistics.