Dryer, Dylan B., and Irvin Peckham. “Social Contexts of Writing Assessment: Toward an Ecological Construct of the Rater.” WPA: Writing Program Administration 38.1 (2014): 12-41.
Dylan B. Dryer and Irvin Peckham argue for a richer understanding of the factors affecting the validity of writing assessments. A more detailed picture of how the assumptions of organizers and raters as well as the environment itself drive results can lead to more thoughtful design of assessment processes.
Drawing on Stuart MacMillan’s “model of Ecological Inquiry,” Dryer and Peckham conduct an “empirical, qualitative research study” (14), becoming participants in a large-scale assessment organized by a textbook publisher to investigate the effectiveness of the textbook. Nineteen raters including Dryer and Peckham, all experienced college-writing teachers, examined reflective pieces from the portfolios of more than 1800 composition students using the criteria of Rhetorical Knowledge; Critical Thinking, Reading, and Writing; Writing Processes; and Knowledge of Conventions from the WPA Outcomes Statement 1.0 (15). In addition to scores on each of these criteria, raters assigned holistic scores that served as the primary data for Dryer and Peckham’s study. Raters were introduced to the purpose of the assessment and to the criteria and benchmark papers by a “chief reader” characterized by the textbook publisher as “a national leader in writing assessment research” (19). The room set-up consisted of four tables with three to four raters and a table leader charged with maintaining the scoring protocols presented by the chief reader. Dryer and Peckham augmented their observations and assessment data with preliminary questionnaires, interviews, exit surveys, and focus groups.
Dryer and Peckham adapted MacMillan’s four-level model by dividing the environment into “social contexts” of field, room, table, and rater (14). Field variables involved the extent to which the raters were attuned to general assumptions common to composition scholarship, such as definitions of concepts and how to prioritize the four criteria (19-22). The “room” system consisted of the expectations established by the chief reader and the degree to which raters worked within those expectations as they applied the criteria (22-24). Table-specific variables were based on the recognition that each table operated with its own microecology growing out of such components as interpersonal interactions among the raters and the interventions of the table leaders (25-30). Finally, the individual-rater system encompassed factors such as how each rater negotiated the space between his or her own responses to the process and the expectations and pressures of the field, room, and table (30-33).
Field-level findings included the observation that most of the raters agreed with the ordered ranking of the criteria chosen by the WPA team that developed the outcomes (20-21). The authors maintain that their study's ability to identify the outliers (three of seventeen raters) who considered Writing Processes and Knowledge of Conventions most important offers a sense of how widely the field's values have spread throughout the profession (22). Collecting a complete "scoring history" for each rater, including the number of "overturned" scores, or scores that deviated from the room consensus, revealed that ranking the four criteria differently from the field consensus correlated with a high percentage of overturned scores (21).
The room-level conclusions exposed the assumption that there actually is a "real" or correct score for each paper and that the benchmark papers adequately represent how a paper measured against the selected criteria can earn this score. This assumption, the authors argue, tends to pervade assessment environments (23). Raters were extolled for bringing "professional" expertise to the process (19); however, raters whose scores deviated too far from the "real" score were judged "out of line" (22). Interviews and surveys reveal that raters were concerned with the fit between their own judgments and the "room" score and sometimes struggled to adjust their scores to match the room consensus more closely (e.g., 23-24).
At the table level, observations and interviews revealed the degree to which some raters' behavior and perceived attitudes influenced other raters' decisions (e.g., 28). Table leaders' ability to keep the focus on the specific criteria, redirecting raters away from other, more individual criteria, affected overall table results (25-27). A comparison of the range of table scores with the overall room score enabled the authors to designate some tables as "timid," unwilling to risk awarding high or low scores, and others as "bold," able to assign scores across the entire 1-6 range (25). Dryer and Peckham note that some raters consciously opted for a 2 rather than a 1, for example, because they felt that the 2 would be "adjacent" to either a 1 or a 3 and thus "safe" from being declared incorrect (28).
Discussion of the rater-level social system focused on the “surprising degree” to which raters did not actually conform to the approved rubric to make their judgments (31). For example, raters responded to the writer’s perceived gender as well as to suppositions about the English program from which particular papers had been drawn (30-31). Similarly, at the table level, raters veered toward criteria not on the rubric, such as “voice” or “engagement” (24). These raters’ resistance to the room expectations showed up overtly in the exit surveys and interviews but not in the data from the assessment itself.
Dryer and Peckham recommend four adjustments to standard procedures for such assessments. First, awareness of the “ecology of scoring” can suggest protocols to head off the most likely deviations from consistent use of the rubric (33-34). Second, this same awareness can prevent overconfidence in the power of calibration and norming to disrupt individual preconceptions about what constitutes a good paper (34-35). Third, the authors recommend more opportunities to discuss the meaning and value of key terms and to air individual concerns with room and field expectations (35). Fourth, the collection of data like individual and table-level scoring as well as measures of overall and individual alignment with the field should become standard practice. Rather than undercutting the validity of assessments, the authors argue, such data would underscore the complexity of the process and accentuate the need for care and expertise both in evaluating student writing and in applying the results, heading off the assumption that writing assessment is a simple or mechanical task that can easily be outsourced (36).
March 25, 2015 at 12:07 pm
Thank you, Virginia, for this precise summary. I want to add a minor amendment and then a story that Dylan & I didn’t get in the article (we had enough information for five books).
Our work really emphasized scores being the consequence of social systems as much as (if not more than) they were existential decisions. At times, it seemed as if the interpersonal relationships, the rub of people against each other (which can be broadened to include the rub of a person against the field of rhetoric/composition), were a primary determinant of how a score was assigned. We are trying to make the point that stakeholders who treat the score as a professional judgment based on an exchange between a trained reader and the text may be acting naively (I really mean, probably are). Any "use" of the score should take into account how that score might be simply an articulation of a social system (and systems within systems) as much as it is an objective ranking of a text determined by a set of criteria, which themselves are articulations of a social system. I think it would be somewhat naive to think that any of these social systems has got it "right."
I once interviewed an AP reader (history) who was trying to, let's say, advance himself, to find other things to do than "just" teach. He wanted to be a table leader for AP, hoping that kind of recognition would lead to professional opportunities. In his first AP reading, he watched one reader who seemed to have a "star" ranking as a reader. My interviewee saw that this reader was the fastest reader at the table and one whose scores were never called into question. Jake (let's call him) finally figured out what was going on: when the "fast" reader got a stack of papers (25 or so), he would do a quick sort by an ID number that identified the school district. A set of IDs was readily identifiable as coming from "poor" school districts. The fast reader would sort each stack, put the poor districts on top (all done quickly), and rip through that half of the stack, assigning low scores, 1s, 2s, and 3s (2s and 3s more than 1s), after just glancing at the essays. Then he read the remaining essays more seriously, essentially cutting his reading time in half.
I think that we can learn quite a bit from an assessment if we always remember to interpret the scores as indicators of something when viewed in the aggregate. But we need to think carefully about what that "something" is. When interpreted at the individual writer level, well, if validity is use, then that interpretation has a good chance of being invalid.