“Standardized testing” is one of the most frequently used yet most commonly misunderstood expressions in education today. Some consider standardized testing to be an important, foundational component in any school improvement effort. Others believe it is a disruptive process that serves mainly to narrow the curriculum, stifle the creativity of teachers, and limit learning opportunities for students.
In truth, standardized testing is inherently neither good nor bad. It is simply a way of gathering data that involves measurement of an attribute or trait in a consistent or “standard” manner.
A standardized test includes the same or comparable items administered and scored in the same way. In other words, all individuals take the test under the same conditions, their performance is evaluated using the same methods, and scores are determined in the same manner. “Standardizing” the testing conditions doesn’t remove the influence of those conditions on examinees’ performance; it simply ensures those conditions are the same for all examinees. In this way, results can be used to make comparisons of individuals over time or among different groups of individuals who take the test in different settings.
Standardized testing can involve any assessment format. Many standardized tests are timed assessments composed primarily of multiple-choice items, like AP exams, IB assessments, and the SAT and ACT. These types of assessments are widely used because they yield the greatest amount of information about examinees in the least amount of testing time and at the least expense. But common formative assessments, skill performances, and projects scored with a common rubric can be considered standardized tests as well. So too are personality inventories and professional certification exams. In the area of health, even measurements of blood pressure and heart rate are standardized tests.
The problems most people associate with standardized testing come not from the form of the test or testing procedures, but from how the results are interpreted and used. These are issues of validity.
Contrary to what most people think, validity is not a characteristic of a test itself, and no test is inherently valid or invalid on its own. Rather, the results from a test can be interpreted and used in valid or invalid ways (Shepard, 1997). For example, results from a mathematics test composed of items that require only basic numerical calculations could validly be interpreted as evidence of examinees’ computational skills, but not of their mathematical reasoning or their skill in solving complex math problems. Invalid interpretations of test results occur most frequently when a test designed for one purpose is used for another purpose it was never intended to serve.
All tests are designed for a specific purpose. The SAT and ACT, for example, are designed to maximize differences among students in their knowledge and skill in specific subjects in order to inform college admission decisions, and both do that well. If many students earn the same score on an assessment, it is difficult to distinguish among them. So both the SAT and ACT are purposefully constructed to spread out students’ scores and yield the broadest possible range. The ideal item on these tests is one that only 50 percent of students answer correctly, because that provides maximum discrimination and the greatest variation in scores.
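A bit of simple arithmetic shows why, assuming (as is typical on these exams) that each item is scored simply right or wrong: if a proportion p of examinees answer an item correctly, the variation (variance) in scores on that item is p × (1 − p). That product is largest, 0.5 × 0.5 = 0.25, when exactly half of examinees answer correctly, and it shrinks toward zero as an item becomes very easy or very hard; an item that 90 percent of examinees answer correctly, for instance, yields only 0.9 × 0.1 = 0.09. Items near the 50 percent mark therefore do the most to separate examinees from one another.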
To serve this purpose, the SAT and ACT are devised to be “instructionally insensitive” (Popham, 2007). This means that if teachers teach a vitally important concept so well that nearly all students answer the item measuring it correctly, that item is typically eliminated from the test. Why? Not because it is a poor item or because it addresses an inconsequential concept, but because it doesn’t serve the test’s primary purpose of discriminating among students.
Measuring how well students have mastered a particular curriculum, however, requires an “instructionally sensitive” test. Such a test must be well aligned with the curriculum and measure precisely what students were expected to learn and be able to do. If instruction is effective, all students should perform well on the test and attain similarly high scores. In other words, with an instructionally sensitive test, effective instruction serves to minimize the differences among students. Its purpose is not to spread out students’ scores but to verify students’ level of achievement with regard to the curriculum.
Hence, it makes little sense to use the results from instructionally insensitive tests like the SAT and ACT, known to be poorly aligned with the school’s curriculum and designed primarily to maximize differences among students, to evaluate the quality of a school’s instructional program. This is not the purpose for which these tests were developed and not what they are designed to do. Using SAT and ACT results in this way is a prime example of invalid interpretation and use of standardized test results.
Standardized testing is neither a good thing nor a bad thing. Standardized test results provide important and reliable information that can be highly useful to educators at all levels. But that information also has important limitations. To ensure validity in interpretation and use, educators must know the purpose for which a standardized test was developed and use the results for that specific purpose only. Especially in high-stakes situations, using standardized test results for purposes other than those for which the test was designed is not only inappropriate and invalid; it can also have serious negative consequences for schools, teachers, and particularly students.
References
Popham, W. J. (2007). Instructional insensitivity of tests: Accountability’s dire drawback. Phi Delta Kappan, 89(2), 146–150.
Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–24.