Statement by Dr. Walter (Denny) Way, Senior Vice President for Measurement Services, Regarding the Use of Item Response Theory to Score Standardized Tests
August 2, 2012
We’d like to set the record straight on recent news coverage about UT-Austin Professor Walter Stroup’s study of standardized tests in Texas. Contrary to these reports, the tests—and the methodology used to score them—are grounded in solid scientific evidence.
As the assessment vendor supporting Texas (and nearly 20 other states), Pearson completely disagrees with Dr. Stroup’s assertion that the statewide tests are “fundamentally flawed” because they are scored using Item Response Theory. Simply put, this assertion is not supported by valid research and will not stand up to a rigorous review by qualified experts.
Item Response Theory is a well-documented statistical methodology used to understand test data and ensure that different editions of tests are fair and comparable. In fact, Item Response Theory is routinely used on educational tests throughout the world—including the National Assessment of Educational Progress (NAEP), the Programme for International Student Assessment (PISA), the Graduate Record Examination (GRE) and the Graduate Management Admission Test (GMAT), to name a few.
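To make the methodology concrete, the following is a minimal sketch of the three-parameter logistic (3PL) model, one common form of Item Response Theory; the parameter values are illustrative assumptions, not figures from any Texas test:

```python
import math

def irt_probability(theta, a=1.0, b=0.0, c=0.0):
    """Probability that an examinee of ability `theta` answers an item
    correctly under the three-parameter logistic (3PL) IRT model.

    a: item discrimination (how sharply the item separates abilities)
    b: item difficulty (ability level at the curve's midpoint)
    c: pseudo-guessing parameter (lower asymptote for low abilities)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With no guessing (c = 0), an examinee whose ability equals the item's
# difficulty has a 50% chance of answering correctly.
print(round(irt_probability(theta=0.0, a=1.0, b=0.0, c=0.0), 2))  # 0.5
```

Because each item's parameters are estimated on a common scale, scores from different test editions can be placed on that same scale — which is the comparability property the statement refers to.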
Dr. Stroup claims that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned. He confuses the application of a measurement model (Item Response Theory) with the more fundamental process of designing and developing tests based on the purposes for which they will be used.
Furthermore, Dr. Stroup misinterprets his supporting data and draws unfounded conclusions. For example, Dr. Stroup says that 72% of students’ scores this year can be predicted simply by knowing their scores from last year, rather than by what they actually learned over the course of the school year. What his statistical analysis of the Texas Assessment of Knowledge and Skills (TAKS) actually shows is that, at most, only 50% of the variance in 2007 TAKS test scores is shared with the variance in 2006 TAKS scores—and even then, it is a leap to attribute any amount of that shared variance to test-taking skills. There is simply no evidence to support these claims. More likely, this finding reflects that students are retaining what they learned in previous years’ instruction and are building on that knowledge in the expected way.
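The gap between the two figures can be illustrated with a line of arithmetic. If the 72% figure is read as a correlation coefficient between consecutive years’ scores (an illustrative assumption, not a claim about Dr. Stroup’s actual computation), then the proportion of variance shared between the two years is the square of that correlation — roughly the 50% the statement cites:

```python
# Illustrative only: a correlation of r between two years' scores
# implies that r**2 of the variance in one year is "shared" with
# the other. The 0.72 value is a hypothetical reading of the 72% figure.
r = 0.72
shared_variance = r ** 2
print(f"correlation r = {r}, shared variance r^2 = {shared_variance:.2f}")
# prints: correlation r = 0.72, shared variance r^2 = 0.52
```

This is why "can be predicted from" (a statement about correlation) and "variance shared" (a statement about r squared) are not interchangeable figures.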
The process of creating and scoring standardized state tests has many steps. Before a student ever takes the test, there are multiple reviews by testing and subject matter experts, state education officials and classroom teachers. Each state has a technical advisory committee of national experts who provide independent guidance and oversight for the testing programs. In addition, state testing programs undergo rigorous reviews by the U.S. Department of Education.
Parents, educators and public officials should know that Pearson encourages and embraces a fair and open exchange of ideas about how to create better learning opportunities for all students. There are varying opinions about the use of high-stakes standardized testing in the U.S. We welcome legitimate discourse, but unsubstantiated research claims have no place in this dialogue.