Review: Making the Grades

Chris McNutt
June 26, 2018
I always bite my tongue when I hear educators defend the testing industry, even if they don’t outright support it. “Standardized testing is just one tool in the toolbox!”

I never wanted to cause an argument. Perhaps if standardized testing weren’t used as a funding tool, and were used for students’ learning (and not as punishment for everyone else), it would be useful. However, Making the Grades: My Misadventures in the Standardized Testing Industry by Todd Farley ended any doubts I had about the fallacy of standardized assessment. Testing like this is not a tool. It has literally zero value to educators, and Farley gives an enlightening, often humorous, account of why.

Making the Grades chronicles Farley’s journey of over a decade in the testing industry. Starting as a simple scorer, he worked his way up the ranks to become a testing consultant, despite all of the deplorable events he witnessed (he himself admits it was all for the money). In the epilogue, Farley explains:

“…I’ve spent the better part of two years writing this very book, some 75,000 words I believe illustrate the many, many reasons no one in their right mind would ever entrust decisions made about this country’s students, teachers, and schools to this industry. I don’t know how anyone who’s seen what I’ve seen could come to any other conclusion.”

And he has seen a lot. Starting off, Farley was assigned to score a 4th grade standardized test prompt. Students were asked to make a public service announcement poster that demonstrated an element of bike safety. In theory, and in training, this was quite simple: give students points if they showed an element of bike safety, such as both hands on the handlebars or stopping at a stop sign; otherwise, they receive no points.

Of course, the real world isn’t this simple. If you give thousands of children this prompt, you’ll get a wide array of responses. Is a crashed bike in a street a sign of bike safety, or a warning? What about someone who’s wearing a helmet but riding in front of a car? In one situation, Farley was called in by his administrator to explain his score of “0” on a paper featuring a bike, loaded on a pick-up truck, at a stop sign. The administrator explained that this was bike safety, because the rubric said, “stopped at a stop sign.”

Despite all this, Farley wanted to believe in what he was doing:

“ At this point I may have wondered about the scoring of test items, but never did I waver from the idea there existed virtual legions of education experts — surely in white lab coats, wearing glasses and holding clipboards, probably at some bastion of Ivy League learning — that could make perfect sense of it all.”

Really, this is just the tip of the iceberg.

Testing centers track reliability numbers: scorers are given some of the same answers as their peers, and their scores must match at least 70–80% of the time, or else they’re thrown out. What happens when a test scorer is consistently off or the prompt is simply too difficult to assess? Well, the score is manipulated! At many points, Farley demonstrates how managers would either A) change one scorer’s number to match another’s (whoever they trusted more) or B) just assess the prompt themselves and override the numbers.

More and more, he’s disgusted by what he sees. He explains to a friend, “I don’t mean to sound naive, but I thought we were in the business of education.” His friend responds, “…I’d say we are in the business of education.”

After his initial assignment, he returned to the agency but failed to meet the 70% minimum on the entrance exam (in which one scores already-assessed papers). Soon after, the agency let him in regardless (with a 60%), and weeks later, many more employees showed up who made no mention of any test. The fact of the matter is that this isn’t a stellar job: it paid $8 an hour and required no experience. Despite systems in place to screen potential employees (Farley was put on “probation” for not passing initially, but his manager didn’t know what that was), the testing agency was clamoring for anyone who could bolster its numbers.

Image caption: This is not the rubric from the book, but you’d be hard-pressed to find a school not assessing in this way.

As Farley continued to take on jobs, the rubrics became more and more complex, especially in the higher grades. Instead of pass/fail, these rubrics had 6 levels (excellent, good, adequate, inconsistent, weak, unacceptable) and 6 categories (grammar, sentence structure, etc.). I’m sure most teachers would lament how ridiculous it is to subjectively grade even your own class in this way, let alone thousands of essays. As one would imagine, many debates ensued:

“…‘I’m with him. It should be a 5,’ she said. ‘Look at the vocabulary words in this essay. Alacrity, perspicacious, audacity. Those are nice word choices. At least “good” word choices, if not “excellent” ones.’

‘Yes,’ Maria [the testing manager] said, ‘those are decent word choices. Someone’s been doing their SAT prep. But Anchor Paper #5 has nice vocabulary, too. Nonetheless, succinctly, beforehand.’

‘Beforehand?’ the woman asked, spitting the word out as if a curse. ‘Beforehand is a good word choice?’

‘Pretty good,’ Maria said.

…‘Is there any way…I could judge how different vocab words compare to each other? Do you have some sort of reference book I can use to compare pairs or trios of words, so I would know if words are 4-like or 5-like?’”

Disagreements come up constantly throughout the book. Scorers ranged from professors to refrigerator mechanics, and no matter how much of an expert someone was, they could never assess “properly.” The testing administrators were almost Ministry of Truth-level, convincing scorers that they would be mad to score differently and that the answers provided were obviously right. On one occasion, an administrator laid out an entire argument for why a paper was a “3,” noting how simple it was to understand, only to double-check and realize it was, in fact, a “4.”

And, with poorly trained employees and really no “why” behind the position, manipulation was commonplace. Because scorers were assigned the same answers as a peer (to check their reliability ratings), they were sometimes handed papers with the previous assessor’s scores already filled in. Common practice was to memorize these scores in order and then copy them down for your fresh stack, giving you a healthy break from the monotony.

Students are placed in a radical, unforgiving, and confusing process that gives them no opportunity to succeed, let alone one that makes sense. One rubric required a 5-paragraph essay, no matter how well the response was written. I believe these anchor papers (which scorers refer back to) speak for themselves:

Image caption: Anchor Paper #6 received a higher score because it is in a traditional 5-paragraph format. The instructions say to write a 5-paragraph response.

But it doesn’t stop there! In one instance, the head testing coordinator visited the facility and noticed that too many “2s” were being assigned. Testing facilities need to watch trend data: because the same answers are given over time, their scores should not change drastically (hence, standardization). Because the trend data was off, she simply told all employees to give more “3s” in place of “2s,” meaning that all the scores up to that point were incorrect (and never corrected).

Farley eventually moved up the ranks to range finder, one of the people in charge of determining how rubrics are scored. The range finder meets with a group of teachers to make this happen and, as with everything else, it makes no sense. For example, one test question was, “What is your favorite food? What flavor does it have (bitter, sweet, etc.)? What part of the tongue does this affect?” (a diagram was provided). For the small group of teachers and Farley, this seemed obvious. Nonetheless, when thousands of students responded, they said things like “Pizza is sweet.” Is this wrong? What about pineapple pizza? After case after case of subjective results, the final decision was simply to give students credit for stating any descriptor (again, not taking into account all the prior responses, which had been assessed differently).

Farley eventually became a test manager. He recounts how he would often change scores to match those of the person he trusted. Under him, employees included English language learners, senior scorers (who were never correct, but had been there so long that previous managers simply never used their scores), and, frankly, people who should never be assessing children on an English test. Simply stated, it’s an $8 an hour job: as with your average McDonald’s employee, there usually isn’t a huge amount of concern, effort, or diligence in mindlessly looking at essays.

Notably, there was one situation where Farley believed the testing system would work. After realizing that none of his team’s scores matched at all, he sat down with another administrator, and their scores matched 90% of the time. However, he was told this was too accurate: the psychometrics (which, again, sounds like something from 1984) indicated that agreement should be 70 to 80%. So, they brought in another assessor who made their scoring less accurate.

These are just a selection of examples from the book, which explains through an enormous number of scenarios why standardized testing makes absolutely no sense. Set aside the subjectivity of the standards themselves, the cultural bias that can exist in testing, and the fact that we measure teacher pay and school effectiveness this way: standardized testing simply cannot work because it is standardized.

People are people. They don’t conform their answers to neat responses — nor would we want them to! Why would we create a system where innovation is inherently impossible? Why would we want students, in the modern age, to all know the same thing? Why would we want them to display their “knowledge” in one day for the entire course of the year? And — most importantly based on this evidence — why would we think that any of this information is relevant in the slightest?

In addition, this book should shed light on why grading makes no sense. Grading is inherently subjective. Yes, a single teacher may be able to grade without bias (despite their current mood, tiredness, liking for certain students, attraction to certain topics or interests, preference for a certain writing style, political leanings, and more), but what does that actually mean for a student? That they passed through your hoops? What happens if they go to another class, submit the same paper (or better yet, submit the same paper to you later) and receive a different grade? Are they more or less intelligent? Do they know more or less? If you use them, it’s worth taking a look at those paradoxical rubrics.

I encourage any educator, but especially administrators and districts who place any emphasis on standardized testing, to read this book. I would be shocked if someone could read these accounts, dismiss them, and continue to do this to our students. Yes, it’s not easy standing up to the government or the testing culture at large, but we must do what makes logical sense. This isn’t a matter of losing a week to testing: testing has manufactured an entire education system that supports it the entire year. We need common-sense decisions that encourage students to learn for themselves, express their creativity, and find authentic solutions to our world’s problems.

Chris McNutt
Chris McNutt is the co-founder and executive director of Human Restoration Project, a nonprofit organization focused on student engagement, well-being, and motivation. His work centers on realizing systems-based change, examining how progressive pedagogical shifts (e.g. PBL, ungrading) reimagine school to best suit the needs of students and teachers alike. He was a public high school digital media & design educator who focused on experiential learning, portfolio-driven assessment, and community involvement.