Dryer, Dylan B., and Irvin Peckham. “Social Contexts of Writing Assessment: Toward an Ecological Construct of the Rater.” WPA: Writing Program Administration 38.1 (2014): 12-41.
Dylan B. Dryer and Irvin Peckham argue for a richer understanding of the factors affecting the validity of writing assessments. A more detailed picture of how the assumptions of organizers and raters as well as the environment itself drive results can lead to more thoughtful design of assessment processes.
Drawing on Stuart MacMillan’s “model of Ecological Inquiry,” Dryer and Peckham conduct an “empirical, qualitative research study” (14), becoming participants in a large-scale assessment organized by a textbook publisher to investigate the effectiveness of the textbook. Nineteen raters including Dryer and Peckham, all experienced college-writing teachers, examined reflective pieces from the portfolios of more than 1800 composition students using the criteria of Rhetorical Knowledge; Critical Thinking, Reading, and Writing; Writing Processes; and Knowledge of Conventions from the WPA Outcomes Statement 1.0 (15). In addition to scores on each of these criteria, raters assigned holistic scores that served as the primary data for Dryer and Peckham’s study. Raters were introduced to the purpose of the assessment and to the criteria and benchmark papers by a “chief reader” characterized by the textbook publisher as “a national leader in writing assessment research” (19). The room set-up consisted of four tables with three to four raters and a table leader charged with maintaining the scoring protocols presented by the chief reader. Dryer and Peckham augmented their observations and assessment data with preliminary questionnaires, interviews, exit surveys, and focus groups.
Dryer and Peckham adapted MacMillan’s four-level model by dividing the environment into “social contexts” of field, room, table, and rater (14). Field variables involved the extent to which the raters were attuned to general assumptions common to composition scholarship, such as definitions of concepts and how to prioritize the four criteria (19-22). The “room” system consisted of the expectations established by the chief reader and the degree to which raters worked within those expectations as they applied the criteria (22-24). Table-specific variables were based on the recognition that each table operated with its own microecology growing out of such components as interpersonal interactions among the raters and the interventions of the table leaders (25-30). Finally, the individual-rater system encompassed factors such as how each rater negotiated the space between his or her own responses to the process and the expectations and pressures of the field, room, and table (30-33).
Field-level findings included the observation that most of the raters agreed with the ordered ranking of the criteria that had been chosen by the WPA team that developed the outcomes (20-21). The authors maintain that because their study could identify the outliers (three of seventeen raters) who considered Writing Processes and Knowledge of Conventions most important, it offers a sense of how widely the field’s values have spread throughout the profession (22). Collecting a complete “scoring history” for each rater, including the number of “overturned” scores, or scores that deviated from the room consensus, allowed the authors to find that raters who ranked the four criteria differently from the field consensus produced a high percentage of such “incorrect” scores (21).
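The “scoring history” described above amounts to a per-rater tally of how often a score was overturned. The following is a minimal sketch of that tally, not the authors’ procedure: the record layout, the function name `overturn_rates`, the rater labels, and the sample records are all hypothetical, and the summary does not say how an overturned score was actually flagged.

```python
from collections import defaultdict


def overturn_rates(scoring_history):
    """Return each rater's share of overturned scores.

    scoring_history: iterable of (rater_id, was_overturned) pairs.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    overturned = defaultdict(int)
    for rater_id, was_overturned in scoring_history:
        totals[rater_id] += 1
        if was_overturned:
            overturned[rater_id] += 1
    # Fraction of each rater's scores that were overturned.
    return {rater: overturned[rater] / totals[rater] for rater in totals}


# Invented records for illustration only; not data from the study.
history = [
    ("rater_a", False), ("rater_a", False), ("rater_a", True), ("rater_a", False),
    ("rater_b", True), ("rater_b", True), ("rater_b", False), ("rater_b", False),
]
rates = overturn_rates(history)
```

Setting rates like these beside each rater’s ranking of the four criteria would be one way to look for the relationship the authors report.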
The room-level findings revealed an operating assumption that there actually is a “real” or correct score for each paper and that the benchmark papers adequately represent how a paper measured against the selected criteria can earn this score. This assumption, the authors argue, tends to pervade assessment environments (23). Raters were extolled for bringing “professional” expertise to the process (19); however, raters whose scores deviated too far from the correct score were judged “out of line” (22). Interviews and surveys revealed that raters were concerned with the fit between their own judgments and the “room” score and sometimes struggled to adjust their scores to more closely match the room consensus (e.g., 23-24).
At the table level, observations and interviews revealed the degree to which some raters’ behavior and perceived attitudes influenced other raters’ decisions (e.g., 28). Table leaders’ ability to keep the focus on the specific criteria, redirecting raters away from other, more individual criteria, affected overall table results (25-27): A comparison of the range of table scores with the overall room score enabled the authors to designate some tables as “timid,” unwilling to risk awarding high or low scores, and others as “bold,” able to assign scores across the entire 1-6 range (25). Dryer and Peckham note that some raters consciously opted for a 2 rather than a 1, for example, because they felt that the 2 would be “adjacent” to either a 1 or a 3 and thus “safe” from being declared incorrect (28).
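The “timid” and “bold” designations above rest on how much of the 1-6 scale each table actually used. The sketch below is one possible way to express that comparison, under stated assumptions: the function name `label_tables`, the table names, the sample scores, and especially the cutoff (a table counts as “bold” only if it awarded both the lowest and the highest score) are hypothetical, since the summary does not report the authors’ exact criterion.

```python
SCALE_MIN, SCALE_MAX = 1, 6  # the 1-6 holistic scale mentioned in the summary


def label_tables(table_scores):
    """Label each table by how much of the scale its scores cover.

    table_scores: dict mapping a table name to a list of holistic scores.
    The "bold" cutoff used here is an assumption, not the authors' rule.
    """
    labels = {}
    for table, scores in table_scores.items():
        uses_full_range = min(scores) == SCALE_MIN and max(scores) == SCALE_MAX
        labels[table] = "bold" if uses_full_range else "timid"
    return labels


# Invented scores for illustration only; not data from the study.
sample = {
    "table_1": [3, 4, 3, 2, 4, 3],
    "table_2": [1, 6, 3, 2, 5, 4],
}
labels = label_tables(sample)
```

A looser cutoff, such as comparing each table’s spread with the spread of the whole room, would fit the description equally well. The point is only that the designation can be derived from routinely collected table-level scores.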
Discussion of the rater-level social system focused on the “surprising degree” to which raters did not actually rely on the approved rubric in making their judgments (31). For example, raters responded to the writer’s perceived gender as well as to suppositions about the English program from which particular papers had been drawn (30-31). Similarly, at the table level, raters veered toward criteria not on the rubric, such as “voice” or “engagement” (24). These raters’ resistance to the room expectations showed up overtly in the exit surveys and interviews but not in the data from the assessment itself.
Dryer and Peckham recommend four adjustments to standard procedures for such assessments. First, awareness of the “ecology of scoring” can suggest protocols to head off the most likely deviations from consistent use of the rubric (33-34). Second, this same awareness can prevent overconfidence in the power of calibration and norming to disrupt individual preconceptions about what constitutes a good paper (34-35). Third, the authors recommend more opportunities to discuss the meaning and value of key terms and to air individual concerns with room and field expectations (35). Fourth, the collection of data such as individual and table-level scores, as well as measures of overall and individual alignment with the field, should become standard practice. Rather than undercutting the validity of assessments, the authors argue, such data would underscore the complexity of the process and accentuate the need for care and expertise both in evaluating student writing and in applying the results, heading off the assumption that writing assessment is a simple or mechanical task that can easily be outsourced (36).