Friday, March 02, 2012

Data dumped on

(Photo by The Gates Foundation used under CC license)

I like to think the best of people. I try to start with the assumption that everyone believes, honestly, in what she considers good reasons for something, even if that something and those reasons turn out to be misguided about something extremely important.

The New York City Department of Education recently released performance rankings for 18,000 New York City public school teachers. In the New York Times’s summary: “The reports, which name teachers as well as their schools, rank teachers based on their students’ gains on the state’s math and English exams over five years and up until the 2009-10 school year.” In the report, each teacher (or set of teachers) receives a percentile ranking indicating whether the teacher is Low (0-5%), Below Average (6-25%), Average (25-75%), Above Average (75-95%), or High (95-100%).

Everyone pretty much agrees by now (or should at any rate) that performance on standardized tests reveals precious little about individual students and even less about their teachers. Here’s a simple example: Scores on a single test may be indicative of academic level (and, more tenuously, of teacher competence), but such scores may better represent inadequate breakfasts, wrong sides of beds gotten up on, bad pencil luck, and on and on and on. Not to mention the deep and abiding structural walls/ramps of economic strata and cultural pressures of family and community. Average scores may also vary widely across classes because, for very real example, one teacher’s students arrived on day one better prepared than another’s and therefore test as such. A teacher with the lower average score could in fact be “adding more value”* than his colleagues: his students could have scored far better due to his instruction than they would have without it, even if they ultimately scored lower than fourth graders a door or county over. In sum, standardized test scores could easily be the result of a galaxy of external factors that have nothing whatsoever to do with teacher potential and performance, which means they make an exceedingly poor metric for evaluating teachers in any real way.

At this point one could just jettison the whole project of trying to judge teachers (and perhaps even students) via standardized test scores, or one could try to evaluate the scores in a more sophisticated way. The NYC DOE and other places have gone for the latter, calculating the very personal percentages dumped last week via what’s called a “Value Added Model” (VAM). VAMs more generally are mathematical models designed to factor in (and therefore factor out) the swarm of influences on students to isolate the contribution of the teacher to student performance, thereby giving the “value added” by the instructor. A very (very) simple version of this is having students take an exam early on in the year to serve as benchmarks, testing them again later in the year, and then comparing scores to gauge improvement. VAMs try to account for all sorts of external factors so that the internal cause of student performance (viz., teaching) can be revealed. (For a great and extremely accessible primer on VAMs, see a paper by John Ewing in the May 2011 Notices of the American Mathematical Society.)
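To make the very (very) simple version concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the scores are invented, `average_gain` is a hypothetical helper, and nothing here resembles the DOE's actual model, which tries to adjust for many more factors. It only shows how a class with the lower final average can still show the larger gain.

```python
from statistics import mean


def average_gain(pre_scores, post_scores):
    """Mean of per-student (post - pre) differences.

    A toy stand-in for "value added": it ignores every external
    factor a real VAM would try to account for.
    """
    return mean(post - pre for pre, post in zip(pre_scores, post_scores))


# Hypothetical scores, invented for this sketch.
# Class A arrived better prepared; class B started far behind.
class_a_pre, class_a_post = [80, 85, 90], [82, 86, 93]
class_b_pre, class_b_post = [50, 55, 60], [62, 66, 70]

gain_a = average_gain(class_a_pre, class_a_post)
gain_b = average_gain(class_b_pre, class_b_post)

print("Class A final average:", mean(class_a_post), "gain:", gain_a)
print("Class B final average:", mean(class_b_post), "gain:", gain_b)
```

Ranked by final average, class A's teacher wins easily; ranked by gain, class B's does. Which of those numbers says anything about teaching is exactly the question the toy version cannot answer.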

Not a bad idea if it works (this is me still trying to think the best of everybody), but it’s not at all clear that it does. As some have pointed out, the data collected is shot through with errors, from mis-attributed test scores to altogether absent data to sample sizes too small for the statistics they support (classes of 10 students, etc.). These kinds of errors can perhaps be avoided with tighter reporting, assuming public schools have the funds and personnel to dedicate to that sort of thing.** The more troubling inaccuracy, though, is with the VAM itself. As you can tell from the examples just two paragraphs back, the factors contributing to student preparation and success prove exceedingly complex and quickly vexatious. The accuracy of VAMs hinges on our ability to sort the causally relevant from the irrelevant and the external causes of performance from internal ones. But how do we even go about knowing when we’ve sufficiently accounted for external factors and properly weighed them? How would we feel confident that the model is comprehensive enough to produce meaningful results about classroom performance that could be leaned heavily upon in tenure and promotion decisions? Again, not at all clear, at least not at this point. VAMs therefore turn out to be extra pernicious because they give the powerful but false impression that we’re talking directly and purely and with precision about how an individual teacher contributed to overall student learning. And this type of report will be used by the State of New York to formally evaluate teachers, which is, simply put, a travesty.

The NYC DOE apparently recognizes and admits*** the report's limitations by listing individual teacher scores with margins of error of 35 percentage points for math and 53 for English. Doing so, though, reveals the rankings as farce as well as travesty. Even a 35-point swing means a teacher rated dead average by the DOE’s model (a 50) could for all anyone knows be an Above Average 85 or a fireable Below Average 15. The swing is even more dramatic and therefore more risible for English scores. As commenters on various New York Times pieces quickly pointed out, metrics with this kind of error slack would be laughed right out of most anywhere else. Consider: “In 2012 Jeremy Lin had a field-goal percentage between 27 and 80%”; “Last year this fund had a rate of return of 9%, give or take 35%”; and, more pointedly, “Chancellor Dennis M. Walcott performed his duties somewhere around the Below Average or Average or Above Average level,” etc., etc.
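The arithmetic is easy to check. The sketch below is mine, not the DOE's: it assumes the margin applies as plus-or-minus around the reported percentile, clamps the result to 0-100, and resolves the overlapping band edges in the report (25, 75, 95) by assigning each boundary to the lower band. The function names are hypothetical.

```python
def category(percentile):
    """Map a percentile to the report's label.

    Assumption: a boundary value (5, 25, 75, 95) goes to the lower band,
    since the published ranges overlap at their edges.
    """
    if percentile <= 5:
        return "Low"
    if percentile <= 25:
        return "Below Average"
    if percentile <= 75:
        return "Average"
    if percentile <= 95:
        return "Above Average"
    return "High"


def plausible_categories(score, margin):
    """Every label reachable within score +/- margin, clamped to 0-100."""
    low = max(0, score - margin)
    high = min(100, score + margin)
    labels = []
    for percentile in range(low, high + 1):
        label = category(percentile)
        if label not in labels:
            labels.append(label)
    return labels


math_labels = plausible_categories(50, 35)     # math margin from the report
english_labels = plausible_categories(50, 53)  # English margin from the report

print("Math:", math_labels)
print("English:", english_labels)
```

Under those assumptions, a dead-average 50 in math is compatible with three of the five labels, and the same 50 in English is compatible with all five.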

Look, I’m not one of those Teaching Is An Art Not A Science people. I’ve been working in higher education for around 15 years as an instructor and administrator. Colleges and universities have likewise been swept up in assessment mania, and I see and hear lots of resistance to attempts at evaluating what students gain by attending a particular institution. Push-back in many cases is warranted. But principled resistance (resistance to the very idea of evaluating what you do in your classrooms as a teacher and/or an administrator) sells people and disciplines and institutions extremely short. A teacher of any sort should be able to say pretty explicitly what she wants students to learn skill- and content-wise, how she plans to get her students there, and how she will figure out whether they’ve made it.**** Assessing teaching and learning isn’t easy, of course, on any level, but it’s worth doing and doing well. The problem is, “well” most likely means either more individualized and extensive personal evaluations of students and their teachers (hyperlocal) or, if you like your tests standardized, measurements of entire districts or systems looked at as aggregates à la Finland. Either way, “well” means not viewing the results as opportunities for punishment and reward. The NYC teacher data dumped last week doesn’t get close to assessing well: It’s too flawed, too personalized, too high-stakes, too susceptible to misunderstanding and misuse.

I could go on. I could speculate about why the NYC DOE and the state would distribute this report and base decisions on it, but that exercise always pushes me downhill into despair to the point where I begin to wonder exactly when we stopped being serious about important things. I’ll stare into these abysses on my own time.

I do, though, want to spend a little time here not understanding why the New York Times acted as it did. As the Times itself will tell you, the paper sued to obtain the teacher data and won, despite the protestations of the United Federation of Teachers. No problem there; the data being used to hire, fire, and promote public school teachers should be looked at by people other than those who hire and fire. However, the real story turned out to be the extent to which this data is flawed and misused. But the Times published the data itself anyway in an easily searchable form, thereby making it nearly effortless to gawk at the results that various of its writers wring hands over.

If the data is so bad and misleading, why publish it? Here’s the Times’s answer:

“Why did SchoolBook decide to publish these evaluations?

The New York Times and WNYC, who jointly publish SchoolBook, believe that the public has the right to know how the Department of Education is evaluating our teachers. Since the value-added assessments were being used for tenure and other high-stakes decisions, we sued for access to the reports. While we share some critics’ concerns about the high margins of error and other flaws in the system, we believe it is our responsibility to provide the information, along with appropriate caveats and context, for readers to evaluate.”

Sorry, but I don’t buy it. If Times writers and editors really did “share some critics' concerns” to the degree indicated by their treatment of this whole issue, they wouldn't have made it so simple to see the 36 given to your fifth-grader's teacher. The data could have been anonymized and/or excerpted for publication. As is, the concerns and caveats preclude any proper context for these results: I have no way of knowing whether the competence of my kids' teachers falls on the thick black error bar let alone where. Putting up a mechanism for teachers and the general public to respond doesn't help either; those kinds of things don't receive the same level of attention as the numbers themselves. In my more cynical moments, I begin to think that the actual context for the story involves the Times's pageview counters.

Teaching is good, hard work. I remain puzzled why we seem so determined to make being a teacher so difficult.

*Pardon the cringe-inducing quotes.
**A yellow-bus-sized assumption there, of course.
***I did have “pretends to recognize” and “sort of admits,” but I'm still trying to exude a generosity of spirit here.
****In many ways the problem for higher ed is more acute because college-level teaching involves somewhere around zero hours of required faculty development or teacher training, even though it's almost a theorem that research performance is inversely proportional to teaching ability. Lord knows I struggled early on in my classrooms, and I still find myself confounded sometimes. K-12 teachers at least have to go through education courses and student teach, say what you will about their training.