The United Federation of Teachers - A Union of Professionals

July 4, 2008  

Insight

‘Gotcha’ gets data

But should teacher evaluations be numerical?

The surprise revelation over the Martin Luther King Jr. holiday weekend that the Department of Education is looking at ways to evaluate teachers based on student test scores seemed of a piece with the DOE’s earlier efforts to “improve teacher quality.” When the department last went public on this front, it was to announce that it had hired a stable full of high-priced lawyers to help principals get rid of “bad” teachers.

(That new unit, which UFT President Randi Weingarten dubbed the “gotcha squad,” was announced, coincidentally, on the day when the DOE had to report national test results showing New York City 4th- and 8th-graders have not advanced appreciably in math or reading since 2003.)

Trying to identify and eliminate “bad” teachers is a teacher quality initiative in the same way that punishing children for bad grades is an education initiative. It is hardly going to elevate the enterprise.

Of course, evaluating teachers in some way, shape or form is always going to be essential. Teachers know they work as a team, and build on the work of others. Few, if any, want poorly prepared, unqualified or indifferent colleagues teaching students. The question is how to do it so it is fair and accurate.

Using student outcomes

Measuring teacher effectiveness by looking at standardized test scores strikes noneducators as an advance over simple classroom observations, with their potential for subjective or biased evaluations. But hard data also have their limitations. For one thing, test results are not always reliable measures of achievement. For another, these data rarely give a full picture of a teacher. Indeed, in New York, test publishers say the tests are not valid for teacher evaluation.

Test scores may well be part of the evaluation mix, along with other quantifiable outcomes like promotion rates. But most people understand that the art of teaching requires skills that are essential but can’t be measured the same way — enthusiasm, the ability to differentiate instruction, a good rapport with kids, a facility for collaboration, strong communication skills, etc., etc., etc.

Value-added modeling

If outcomes, especially test scores, are to be part of the mix, there are a host of technical problems to address. For example, how do you compare teachers whose students start the year at very different academic levels?

One promising way to do this is with “value-added assessment,” also called “value-added modeling” (VAM), a statistical tool developed over the last few years. The Feb. 25, 2004, “Insight” column described VAM as having the potential to improve assessment despite its hugely complex methodology.

VAM tries to solve the problem of student variability by looking at each student’s progress over time, rather than using the “snapshot” view that standardized test scores provide. The design can control for variables like family income and educational background that make comparisons difficult. It measures students against their own predicted rates of growth based on previous results, rather than comparing them to their classmates.
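To make the idea concrete, here is a deliberately simplified sketch of the "actual versus predicted growth" logic. It is not Sanders's model or the DOE's, both of which are far more elaborate. The records are invented for illustration, and the helper names (`fit_line`, `value_added`) are hypothetical. It predicts each student's current score from the prior year's score with a single straight-line fit across all students, then averages the gap between actual and predicted scores for each teacher.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, prior-year score, current-year score).
# These numbers are made up purely for illustration.
records = [
    ("A", 600, 620),
    ("A", 650, 668),
    ("A", 700, 712),
    ("B", 610, 615),
    ("B", 660, 662),
    ("B", 690, 690),
]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def value_added(records):
    """Average (actual - predicted) current score per teacher."""
    priors = [prior for _, prior, _ in records]
    currents = [current for _, _, current in records]
    intercept, slope = fit_line(priors, currents)
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        predicted = intercept + slope * prior
        residuals[teacher].append(current - predicted)
    return {teacher: mean(gaps) for teacher, gaps in residuals.items()}

estimates = value_added(records)
for teacher, estimate in sorted(estimates.items()):
    print(f"Teacher {teacher}: {estimate:+.1f} points vs. predicted")
```

With only three students per teacher, an estimate like this would swing wildly on one good or bad test day. A real model would also need several years of scores and controls for factors such as family income.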

The concept is simple enough, and it has won a lot of attention, especially as states now give tests every year in grades 3-8 so annual data is available. However, actually isolating what part of a student’s progress is due to a single teacher is immensely difficult, and may not ever be accurate enough at the individual teacher level.

Even William Sanders, the statistician who developed the best-known value-added model, warns against using it in this way. And in a major study of VAM in 2003, the RAND Corporation warned, “The current research base is insufficient to support the use of VAM for high-stakes decisions, and applications of VAM must be informed by an understanding of the potential sources of errors in teacher effects.”

VAM and the DOE

Despite this, the DOE is pursuing VAM as a way to measure the work of individual teachers in a pilot study in 140 schools. Teachers in the study don’t even know they are part of it, yet the department has occasionally threatened to use the VAM measures in evaluation and tenure decisions and to award merit pay. (Weingarten said she would fight this on multiple fronts: educational, legal and technical. Both the contract and state law prevent such uses.)

It’s a shame VAM is making its New York City debut in such a negative way. The DOE could actually set back research that is trying to improve this data-based assessment method if the DOE model is shown to be unreliable or if it is used inappropriately.

But the problems are not just technical. VAM also has to meet the test of practice. Aside from the statistical challenges, there are educational questions that practitioners would immediately identify:

  • Education is cumulative, the result of instruction by a succession of teachers. VAM attributes a single year of scores to one teacher.
  • Education is collaborative, with most students exposed to several teachers every year. VAM isolates teachers.
  • Learning is not linear — students often acquire knowledge in fits and starts. VAM assumes a straight path.

Researchers would probably share educators’ concerns about the limits of the data itself. For example, testing data is available only for ELA and math teachers in grades 3 through 8. The contributions of early-grade teachers, high school teachers, and teachers of art, social studies, science or languages are not included. In addition, there are questions about the comparability of test data. Comparing students across grades requires a statistical manipulation to equate the tests, which is a potential source of error. And stepping back further, there is the question of the quality of the state tests.

As Thomas Toch and Robert Rothman write in “Rush to Judgment,” a new report about measuring teachers, “A majority of the standardized tests that would be used in teacher evaluations today — statewide tests required by NCLB — focus on low-level skills such as the recall or restatement of information and on only a few subjects, primarily reading, math and science. They don’t measure more advanced skills such as expository writing or an ability to think creatively or analytically, and they sidestep history, art, music and other subjects. As a result, they can’t capture a teacher’s skill in energizing students to learn astronomy or in scaffolding a series of lessons that draw students into the life of a novel.”

Caveat educator

“Mr. VAM” himself, William Sanders, has said that with a large enough data set, VAM can pretty reliably identify the very top teachers and the very weakest ones based on their students’ scores. But it is not sensitive enough to sort out the differences among the majority of teachers in the vast middle range. This is not surprising given that standardized tests are designed as rough measures of a single benchmark. (Are students over the Level 3 cutoff or not?)

The process of ensuring that every student has a high-quality teacher may eventually include value-added measures, but even a perfected VAM should never be a sole or predominant measure. Teaching takes place on many fronts at once and requires multiple skills. Teacher assessments must reflect this.

Even more important, teacher quality is a process, too, not a snapshot. Ensuring quality teaching is less about evaluations aimed at kicking out a few bad apples than it is about improving every teacher’s practice.

The opinions expressed in this column are those of the author and not necessarily the UFT.

Copyright © 2008 United Federation of Teachers