One unavoidable issue when discussing education is that of assessment, which unfortunately is usually limited to testing, although it could include portfolios, performance exercises, and other measures.  The new Elementary and Secondary Education Act, commonly known as “No Child Left Behind” (NCLB), requires, as a condition of the receipt of Federal aid, the testing of all children in grades 3-8 in math and reading, along with many additional testing requirements.  While some testing as a measure of school performance has been required in 10th grade (largely because that is what Texas had done), we now see moves to further expand high school testing.

Today’s diary will offer a quite negative evaluation of how we are doing our assessments, written by one of America’s great experts on the subject.
The article from which the material in blockquote is selected, entitled “F for Assessment,” was written by W. James Popham, who started his career as a high school teacher in Oregon and is professor emeritus at the University of California, Los Angeles School of Education and Information Studies. Author of 25 books, he is a former president of the American Educational Research Association, which, with tens of thousands of members, is the preeminent professional association for those who do research on educational issues.  The article appears in the current issue of Edutopia, a magazine produced by The George Lucas Educational Foundation, and is available for free both in print and online.  The Foundation makes a variety of resources available for free at its website, and I urge anyone interested in education to take the time to explore.

What follows are selected passages from the Popham article.  I also urge those reading this to go and read the entire article, which can be found here in html, and which will also provide a link to download it as a PDF.

The article begins with a very blunt statement:


by W. James  Popham

For the last four decades, students’ scores on standardized tests have increasingly been regarded as the most meaningful evidence for evaluating U.S. schools. Most Americans, indeed, believe students’ standardized test performances are the only legitimate indicator of a school’s instructional effectiveness. Yet, although test-based evaluations of schools seem to occur almost as often as fire drills, in most instances these evaluations are inaccurate. That’s because the standardized tests employed are flat-out wrong.  

Popham notes that the prevalence of the current style of tests goes back to the first major Federal expenditure of funds on education in 1965, which required some form of assessment to prove the moneys provided were being “well-spent”.  He then offers the following:

But how, you might ask, could a practice that’s been so prevalent for so long be mistaken? Just think back to the many years we forced airline attendants and nonsmokers to suck in secondhand toxins because smoking on airliners was prohibited only during takeoff and landing. Some screwups can linger for a long time. But mistakes, even ones we’ve lived with for decades, can often be corrected once they’ve been identified, and that’s what we must do to halt today’s wrongheaded school evaluations. If enough educators — and noneducators — realize that there are serious flaws in the way we evaluate our schools, and that those flaws erode educational quality, there’s a chance we can stop this absurdity.

Popham then proceeds to offer definitions of key terms.  He begins by defining a “standardized test”:

any test that’s administered, scored, and interpreted in a standard, predetermined manner. Standardized aptitude tests are designed to make predictions about how a test taker will perform in a subsequent setting  

and notes that the SAT and ACT exams are the primary examples of this.  He immediately warns that

Although students’ scores on standardized aptitude tests are sometimes unwisely stirred into the school-evaluation stew, scores on standardized achievement tests are typically the ones used to judge a school’s success.  

Popham gives two examples of the kinds of tests “ill-suited” for such purposes.  The first is standardized achievement tests like the Iowa Test of Basic Skills, which use a comparative measurement strategy, comparing the score of the individual test taker against those of some predetermined norm group.  Popham gives a clear explanation of what this means.  Note especially what I have placed in bold in the second paragraph of what immediately follows:

Because of the need for nationally standardized achievement tests to provide fine-grained, percentile-by-percentile comparisons, it is imperative that these tests produce a considerable degree of score-spread — in other words, plenty of differences among test takers’ scores. So producing score-spread often preoccupies those who construct standardized achievement tests.

Statistically, a question that creates the most score-spread on standardized achievement tests is one that only about half the students answer correctly. Over the years, developers of standardized achievement tests have learned that if they can link students’ success on a question to students’ socioeconomic status (SES), then that item is usually answered correctly by about half of the test takers. If an item is answered correctly more often by students at the upper end of the socioeconomic scale than by lower-SES kids, that question will provide plenty of score-spread. After all, SES is a delightfully spread-out variable and one that isn’t quickly altered. As a result, in today’s nationally standardized achievement tests, there are many SES-linked items.  
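As an aside, the statistical logic behind the “score-spread” point can be illustrated in a few lines of code: for a single right/wrong test item answered correctly by a proportion p of test takers, the spread (variance) of scores on that item is p × (1 − p), which is largest when p = 0.5. This is a minimal sketch of my own to make the arithmetic concrete, not code or data from the article:

```python
# Variance of a single right/wrong (Bernoulli) test item as a function of p,
# the proportion of test takers who answer it correctly.
# The variance p * (1 - p) peaks at p = 0.5, which is why items answered
# correctly by about half of students produce the most score-spread.

def item_variance(p):
    return p * (1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  variance = {item_variance(p):.2f}")
```

Items that nearly everyone gets right (or wrong) contribute almost nothing to spreading test takers apart, which is exactly why test developers favor questions near the 50 percent mark.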

We have now been confronted with an important reality: that much of what the most commonly used tests are really measuring is neither innate ability nor what has been learned, but rather the socioeconomic status (SES) of the students sitting for the exam.  As Popham bluntly notes (and the emphasis in the final sentence is mine):

Unfortunately, this kind of test tends to measure not what students have been taught in school but what they bring to school. That’s the reason there’s such a strong relationship between a school’s standardized-test scores and the economic and social makeup of that school’s student body. As a consequence, most nationally standardized achievement tests end up being instructionally insensitive. That is, they’re unable to detect improved instruction in a school even when it has definitely taken place. Because of this insensitivity, when students’ scores on such tests are used to evaluate a school’s instructional performance, that evaluation usually misses the mark.  

The second category of tests comprises those developed to measure mastery of officially approved lists of skills and content, also referred to as goals or content aims, and commonly called content standards.  Many statewide tests, such as those in Florida, fall into this category.  This category is generally described as “standards-based.”  

Let me simply offer, without comment, Popham’s next two paragraphs:

Because these customized standards-based tests were designed (almost always with the assistance of an external test-development contractor) to be aligned with a state’s curricular aspirations, it would seem that they would be ideal for appraising a school’s quality. Unfortunately, that’s not the way it works out. When a state’s education officials decide to identify the skills and knowledge that students should master, the typical procedure for doing so hinges on the recommendations of subject-matter specialists from that state. For example, if authorities in Ohio or New Mexico want to identify their state’s official content standards for mathematics, then a group of, say, 30 math teachers, math-curriculum consultants, and university math professors are invited to form a statewide content-standards committee. Typically, when these committees attempt to identify the skills and knowledge the students should master, their recommendation — not surprisingly — is that students should master everything. These committees seem bent on identifying skills that they fervently wish students would possess. Regrettably, the resultant litanies of committee-chosen content standards tend to resemble curricular wish lists rather than realistic targets.

Whether or not the targets make sense, there tend to be a lot of them, and the effect is counterproductive. A state’s standards-based tests are intended to evaluate schools based on students’ test performances, but teachers soon become overwhelmed by too many targets. Educators must guess about which of this multitude of content standards will actually be assessed on a given year’s test. Moreover, because there are so many content standards to be assessed and only limited testing time, it is impossible to report any meaningful results about which content standards have and haven’t been mastered.  

Popham then notes that, since it becomes impossible to properly cover all of the standards in the class time available, teachers begin to pay less and less attention both to the standards and to the test, with the result being

students’ performances on this type of instructionally insensitive test often become dependent upon the very same SES factors that compromise the utility of nationally standardized achievement tests when used for school evaluation.  

Now that he has provided the background, Popham goes on to examine what appears under the subtitle “Wrong Tests, Wrong Consequences”:

Bad things happen when schools are evaluated using either of these two types of instructionally insensitive tests. This is particularly true when the importance of a school evaluation is substantial, as it is now. All of the nation’s public schools are evaluated annually under the provisions of the federal No Child Left Behind Act (NCLB). Not only are the results of the NCLB school-by-school evaluations widely disseminated, there are also penalties for schools that receive NCLB funds yet fail to make sufficient test-based progress. These schools are placed on an improvement track that can soon “improve” them into nonexistence. Educators in America’s public schools obviously are under tremendous pressure to improve their students’ scores on whatever NCLB tests their state has chosen.  

But, as Popham notes, with few exceptions all of the testing regimes currently being used by the states fall into one of the two categories above.  He then describes the three “adverse classroom consequences” which he considers an inevitable outcome of such an approach.  Let’s go through these one at a time:


* Curricular reductionism.

In an effort to boost their students’ NCLB test scores, many teachers jettison curricular content that — albeit important — is not apt to be covered on an upcoming test. As a result, students end up educationally shortchanged.  

For this we can note that in some cases people have tried to force precisely this outcome.  Thus we saw the state school board in Kansas remove evolution from the testable content, thereby hoping to pressure teachers not to cover it in their instruction, since to do so might adversely affect scores on tests that would not include such material.  Fortunately this created such a backlash that the newly elected replacement members were able to reverse this decision.


* Excessive drilling.

Because it is essentially impossible to raise students’ scores on instructionally insensitive tests, many teachers — in desperation — require seemingly endless practice with items similar to those on an approaching accountability test. This dreary drilling often stamps out any genuine joy students might (and should) experience while they learn.  

When people refer to “drill and kill,” this is what they mean.  If your child seems to spend an excessive amount of time on worksheets and on practice items for such a test, this is what you are seeing.

The third consequence:

* Modeled dishonesty.

Some teachers, frustrated by being asked to raise scores on tests deliberately designed to preclude such score raising, may be tempted to adopt unethical practices during the administration or scoring of accountability tests. Students learn that whenever the stakes are high enough, the teacher thinks it’s OK to cheat. This is a lesson that should never be taught.  

Please note, these remarks do not mean that either Popham or any other opponent of such tests believes that this consequence is justifiable behavior.  We are pointing out that when the stakes become high enough, including things like job security and bonuses, we are likely to see this behavior with greater frequency.  Those who pay attention have in fact seen an increase in stories about precisely this kind of behavior, and not just in failing inner-city schools.  Because of the requirements under NCLB to show Adequate Yearly Progress (AYP) in all disaggregated subgroups, we have seen examples in high-performing schools in places like Montgomery County, Maryland.

Popham believes that things do not have to be this way.  He offers three attributes, proposed in 2001 by an important national group before the adoption of NCLB, that should be part of an “instructionally supportive” test:

* A modest number of supersignificant curricular aims.

To avoid overwhelming teachers and students with daunting lists of curricular targets, an instructionally supportive accountability test should measure students’ mastery of only an intellectually manageable number of curricular aims, more like a half-dozen than the 50 or so that a teacher may encounter today. However, because fewer curricular benchmarks are to be measured, they must be truly significant.

* Lucid descriptions of aims.

An instructionally helpful test must be accompanied by clear, concise, and teacher-palatable descriptions of each curricular aim to be assessed. With clear descriptions, teachers can direct their instruction toward promoting students’ mastery of skills and knowledge rather than toward getting students to come up with correct answers to particular test items.

* Instructionally useful reports.

Because an accountability test that supports teaching is focused on only a very limited number of challenging curricular aims, a student’s mastery of each subject can be meaningfully measured, letting teachers determine how effective their instruction has been. Students and their parents can also benefit from such informative reports.  

Popham states that a test based on such principles will both accurately evaluate schools and (what I consider of AT LEAST equal importance) improve instruction.  He notes that Wyoming already has such a testing scheme (congratulations, by the way, to any “cowboys” reading this), and urges other states to follow their example.

One question I have often been asked as a result of the various education diaries I have posted is “what can I do?”  Popham offers an answer that I hope readers will find useful.   It is with this that I will conclude:

If you want to be part of the solution to this situation, it’s imperative to learn all you can about educational testing. Then learn some more. For all its importance, educational testing really isn’t particularly complicated, because its fundamentals consist of commonsense ideas, not numerical obscurities. You’ll not only understand better what’s going on in the current mismeasurement of school quality, you’ll also be able to explain it to others. And those “others,” ideally, will be school board members, legislators, and concerned citizens who might, in turn, make a difference. Simply hop on the Internet or head to your local library and hunt down an introductory book or two about educational assessment. (I’ve written several such books that, though not as engaging as a crackling good spy thriller, really aren’t intimidating.)

With a better understanding of why it is so inane — and destructive — to evaluate schools using students’ scores on the wrong species of standardized tests, you can persuade anyone who’ll listen that policy makers need to make better choices. Our 40-year saga of unsound school evaluation needs to end. Now.  
