One unavoidable issue when discussing education is that of assessment, which unfortunately is usually limited to testing, although it could include portfolios, performance exercises, and other measures. The new Elementary and Secondary Education Act, commonly known as “No Child Left Behind” (NCLB), requires, as a condition of the receipt of Federal aid, testing of all children in grades 3-8 in math and reading, with all kinds of additional requirements in testing. While some testing as a measure of school performance has been required in 10th grade (largely because that is what Texas had done), we now see moves to further expand high school testing.
Today’s diary will offer a quite negative evaluation of how we are doing our assessments, written by one of America’s great experts on the subject.
The article from which the material in blockquote is selected, entitled “F for Assessment,” was written by W. James Popham, who started his career as a high school teacher in Oregon and is professor emeritus at the University of California, Los Angeles School of Education and Information Studies. Author of 25 books, he is a former president of the American Educational Research Association, which, with tens of thousands of members, is the preeminent professional association for those who do research on educational issues. The article appears in the current issue of Edutopia, a magazine produced by The George Lucas Educational Foundation, and is available for free both in print and on-line. The Foundation makes available a variety of resources for free at its website, and I urge anyone interested in education to take the time to explore.
What follows are selected passages from the Popham article. I also urge those reading this to go and read the entire article, which can be found here in html, and which will also provide a link to download it as a PDF.
The article begins with a very blunt statement:
by W. James Popham
For the last four decades, students’ scores on standardized tests have increasingly been regarded as the most meaningful evidence for evaluating U.S. schools. Most Americans, indeed, believe students’ standardized test performances are the only legitimate indicator of a school’s instructional effectiveness. Yet, although test-based evaluations of schools seem to occur almost as often as fire drills, in most instances these evaluations are inaccurate. That’s because the standardized tests employed are flat-out wrong.
Popham notes the prevalence of the current style of tests goes back to the first major Federal expenditures of funds on education in 1965, which required some form of assessment to prove the moneys provided were being “well-spent”. He then offers the following:
Popham then proceeds to offer definitions of key terms. He begins by defining a “standardized test”
and notes how SAT and ACT exams are the primary examples of this. He immediately warns that
Popham gives two examples of the kinds of test “ill-suited” for such purposes. The first is standardized achievement tests like the Iowa Test of Basic Skills, which use a comparative measurement strategy, comparing the score of the individual test taker against those of some predetermined norm group. Popham gives a clear explanation of what this means. Note especially what I have placed in bold in the second paragraph of what immediately follows:
Statistically, a question that creates the most score-spread on standardized achievement tests is one that only about half the students answer correctly. Over the years, developers of standardized achievement tests have learned that if they can link students’ success on a question to students’ socioeconomic status (SES), then that item is usually answered correctly by about half of the test takers. If an item is answered correctly more often by students at the upper end of the socioeconomic scale than by lower-SES kids, that question will provide plenty of score-spread. After all, SES is a delightfully spread-out variable and one that isn’t quickly altered. As a result, in today’s nationally standardized achievement tests, there are many SES-linked items.
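For readers who want to see the arithmetic behind the first sentence of that passage, here is a minimal sketch. It assumes each question is simply scored right or wrong, in which case the spread a question contributes is the fraction answering correctly times the fraction answering incorrectly. The function name and the sample fractions are purely illustrative and do not come from Popham's article or from any real test.

```python
# Illustrative only: spread (variance) of a right/wrong test item as a
# function of the fraction of test takers who answer it correctly.

def item_variance(p_correct: float) -> float:
    """Variance of a 0/1 item answered correctly by a fraction p_correct."""
    return p_correct * (1.0 - p_correct)

# Hypothetical difficulty levels, not taken from any real test.
fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
variances = {p: item_variance(p) for p in fractions}

# The item with the largest variance separates test takers the most.
most_spread = max(variances, key=variances.get)

for p, v in variances.items():
    print(f"fraction correct {p:.1f} -> item variance {v:.2f}")
print("largest spread at fraction correct =", most_spread)
```

Under this simple assumption the spread peaks when about half the test takers get the item right, which is consistent with the claim in the passage above.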
We have now been confronted with an important reality: that much of what the most commonly used tests are really measuring is neither innate ability nor what has been learned, but rather the socio-economic status, SES, of the students sitting for the exam. As Popham bluntly notes (and the emphasis in the final sentence is mine):
The second category of tests is those developed to measure mastery of officially approved lists of skills and content, also referred to as goals or content aims, and commonly called content standards. Many statewide tests, such as those in Florida, fall into this category. This category is generally described as “standards based.”
Let me simply offer without comment Popham’s next two paragraphs:
Whether or not the targets make sense, there tend to be a lot of them, and the effect is counterproductive. A state’s standards-based tests are intended to evaluate schools based on students’ test performances, but teachers soon become overwhelmed by too many targets. Educators must guess about which of this multitude of content standards will actually be assessed on a given year’s test. Moreover, because there are so many content standards to be assessed and only limited testing time, it is impossible to report any meaningful results about which content standards have and haven’t been mastered.
Popham then notes that since it becomes impossible to properly cover all of the standards in the class time available, teachers begin to pay less and less attention both to the standards and to the test, with the result being
Now that he has provided the background, Popham goes on to examine what appears under the subtitle “Wrong Tests, Wrong Consequences”:
But, as Popham notes, with few exceptions all of the testing regimes currently being used by the states fall into one of the two categories above. He then describes the three “adverse classroom consequences” which he considers an inevitable outcome of such an approach. Let’s go through these one at a time:
In an effort to boost their students’ NCLB test scores, many teachers jettison curricular content that — albeit important — is not apt to be covered on an upcoming test. As a result, students end up educationally shortchanged.
For this we can note that in some cases people have tried to force precisely this outcome. Thus we saw the state school board in Kansas remove evolution from the testable content, thereby hoping to pressure teachers not to cover it in their instruction, since to do so might adversely affect scores on tests that would not include such material. Fortunately this created such a backlash that the newly elected replacement members were able to reverse this decision.
Because it is essentially impossible to raise students’ scores on instructionally insensitive tests, many teachers — in desperation — require seemingly endless practice with items similar to those on an approaching accountability test. This dreary drilling often stamps out any genuine joy students might (and should) experience while they learn.
When people refer to “drill and kill” this is what they mean. If your child seems to spend an excessive amount of time on worksheets, on practice items for such a test, this is what you are seeing.
The third consequence:
Some teachers, frustrated by being asked to raise scores on tests deliberately designed to preclude such score raising, may be tempted to adopt unethical practices during the administration or scoring of accountability tests. Students learn that whenever the stakes are high enough, the teacher thinks it’s OK to cheat. This is a lesson that should never be taught.
Please note, these remarks do not mean that either Popham or any other opponent of such tests believes that this behavior is justifiable. We are pointing out that when the stakes become high enough, including things like job security and bonuses, we are likely to see this behavior with greater frequency. Those who pay attention have in fact seen an increase in stories about precisely this kind of behavior, and not just in failing inner-city schools. Because of the requirement under NCLB to show Adequate Yearly Progress (AYP) in all disaggregated subgroups, we have seen examples in high-performing schools in places like Montgomery County, Maryland.
Popham believes that things do not have to be this way. He presents three attributes, proposed in 2001 by an important national group before the adoption of NCLB, that should be part of an “instructionally supportive” test:
To avoid overwhelming teachers and students with daunting lists of curricular targets, an instructionally supportive accountability test should measure students’ mastery of only an intellectually manageable number of curricular aims, more like a half-dozen than the 50 or so that a teacher may encounter today. However, because fewer curricular benchmarks are to be measured, they must be truly significant.
* Lucid descriptions of aims.
An instructionally helpful test must be accompanied by clear, concise, and teacher-palatable descriptions of each curricular aim to be assessed. With clear descriptions, teachers can direct their instruction toward promoting students’ mastery of skills and knowledge rather than toward getting students to come up with correct answers to particular test items.
* Instructionally useful reports.
Because an accountability test that supports teaching is focused on only a very limited number of challenging curricular aims, a student’s mastery of each subject can be meaningfully measured, letting teachers determine how effective their instruction has been. Students and their parents can also benefit from such informative reports.
Popham states that a test based on such principles will accurately evaluate schools and (what I consider of AT LEAST equal importance) improve instruction. He notes that Wyoming already has such a testing scheme (congratulations, btw, to any “cowboys” reading this), and urges other states to follow its example.
One question I have often been asked as a result of the various education diaries I have posted is “what can I do?” Popham offers an answer that I hope readers will find useful. It is with this that I will conclude:
With a better understanding of why it is so inane — and destructive — to evaluate schools using students’ scores on the wrong species of standardized tests, you can persuade anyone who’ll listen that policy makers need to make better choices. Our 40-year saga of unsound school evaluation needs to end. Now.