Moxley, Joseph M., and David Eubanks. “On Keeping Score: Instructors’ vs. Students’ Rubric Ratings of 46,689 Essays.” Journal of the Council of Writing Program Administrators 39.2 (2016): 53-80. Print.
Joseph M. Moxley and David Eubanks report on a study of their peer-review process in their two-course first-year-writing sequence. The study, involving 16,312 instructor evaluations and 30,377 student reviews of “intermediate drafts,” compared instructor responses to student rankings on a “numeric version” of a “community rubric” using a software package, My Reviewers, that allowed for discursive comments but also, in the numeric version, required rubric traits to be assessed on a five-point scale (59-61).
Exploring the literature on peer review, Moxley and Eubanks note that most such studies are hindered by small sample sizes (54). They note a dearth of “quantitative, replicable, aggregated data-driven (RAD) research” (53), finding only five such studies that examine more than 200 students (56-57), with most empirical work on peer review occurring outside of the writing-studies community (55-56).
This large-scale empirical study asked whether peer review was a “worthwhile” practice for writing instruction (53). More specific questions addressed whether student rankings correlated with those of instructors, whether these correlations improved over time, and whether the research would suggest productive changes to the process currently in place (55).
The study took place at a large research university where the composition faculty, consisting primarily of graduate students, practiced a range of options in their use of the My Reviewers program. For example, although all commented on intermediate drafts, some graded the peer reviews, some discussed peer reviews in class despite the anonymity of the online process, and some included training in the peer-review process in their curriculum, while others did not.
Similarly, the My Reviewers package offered options including comments, endnotes, and links to a bank of outside sources, exercises, and videos; some instructors and students used these resources while others did not (59). Although the writing program administration does not impose specific practices, the program provides multiple resources as well as a required practicum and annual orientation to assist instructors in designing their use of peer review (58-59).
The rubric studied covered five categories: Focus, Evidence, Organization, Style, and Format. Focus, Organization, and Style were broken down into the subcategories of Basics (“language conventions”) and Critical Thinking (“global rhetorical concerns”). The Evidence category also included the subcategory Critical Thinking, while Format encompassed Basics (59). For the first year and a half of the three-year study, instructors could opt for the “discuss” version of the rubric, though the numeric version tended to be preferred (61).
The authors note that students and instructors provided many comments and other “lexical” items, but that their study did not address these components. In addition, the study did not compare students based on demographic features, and, due to its “observational” nature, did not posit causal relationships (61).
A major finding was that, while there was some “low to modest” correlation between the two sets of scores (64), students generally scored the essays more positively than instructors did; this difference was statistically significant when the researchers looked at individual traits (61, 67). Differences between the two sets of scores were especially evident on the first project in the first course; correlation did increase over time. The researchers propose that students learned “to better conform to rating norms” after their first peer-review experience (64).
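To make the two measures in this finding concrete, the following minimal Python sketch computes a Pearson correlation and a mean score gap for paired ratings on a five-point scale. The scores, the variable names, and the `pearson` helper are invented for illustration; they are not data or code from the study, and the authors' actual statistical procedure may differ.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical paired ratings of the same eight drafts (1 = weakest, 5 = strongest).
instructor = [2, 3, 3, 4, 5, 1, 3, 4]
student = [4, 4, 3, 4, 5, 3, 4, 5]

# How closely the two sets of ratings move together.
r = pearson(instructor, student)

# Average amount by which student ratings exceed instructor ratings.
mean_gap = sum(s - i for s, i in zip(student, instructor)) / len(student)

print(f"correlation: {r:.2f}, mean student-minus-instructor gap: {mean_gap:.2f}")
```

A positive gap together with a correlation below 1 is the general pattern the summary describes: student raters track instructor judgments to some degree but score more generously.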
The authors discovered that peer reviewers were easily able to distinguish between very high-scoring papers and very weak ones, but struggled to make distinctions between papers in the B/C range. Moxley and Eubanks suggest that the ability to distinguish levels of performance is a marker for “metacognitive skill” and note that struggles in making such distinctions for higher-quality papers may be commensurate with the students’ overall developmental levels (66).
These results lead the authors to consider whether “using the rubric as a teaching tool” and focusing on specific sections of the rubric might help students more closely conform to the ratings of instructors. They express concern that the inability of weaker students to distinguish between higher-scoring papers might “do more harm than good” when they attempt to assess more proficient work (66).
Analysis of scores for specific rubric traits indicated to the authors that students’ ratings differed more from those of instructors on complex traits (67). Closer examination of the large sample also revealed that students whose own work received high scores from instructors produced ratings that correlated more closely with the instructors’ scores. These students also demonstrated more variance than did weaker students in the scores they assigned (68).
Examination of the correlations led to the observation that all of the scores for both groups were positively correlated with each other: papers with higher scores on one trait, for example, had higher scores across all traits (69). Thus, the traits were not being assessed independently (69-70). The authors propose that reviewers “are influenced by a holistic or average sense of the quality of the work and assign the eight individual ratings informed by that impression” (70).
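The observation about non-independent traits can be illustrated with a small Python sketch that computes pairwise correlations among trait ratings. The three trait lists below are hypothetical stand-ins for the rubric's eight ratings, and the helper is an assumption for illustration, not the authors' analysis.

```python
from itertools import combinations
from statistics import fmean, pstdev

def pearson(xs, ys):
    """Pearson correlation using population covariance and standard deviations."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = fmean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical ratings of five papers on three rubric traits (five-point scale).
ratings = {
    "Focus": [2, 3, 4, 4, 5],
    "Evidence": [2, 3, 3, 4, 5],
    "Style": [3, 3, 4, 5, 5],
}

# Correlation for every pair of traits.
pairwise = {
    (a, b): pearson(ratings[a], ratings[b])
    for a, b in combinations(ratings, 2)
}

# If every pair is positively correlated, the traits are not being rated independently.
all_positive = all(r > 0 for r in pairwise.values())

for (a, b), r in pairwise.items():
    print(f"{a} vs. {b}: {r:.2f}")
print("all trait pairs positively correlated:", all_positive)
```

When every pair comes out strongly positive, as in this invented sample, the separate trait scores carry little information beyond an overall impression, which is the reasoning behind the authors' interest in holistic scoring.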
If so, the authors suggest, isolating individual traits may not necessarily provide more information than a single holistic score. They posit that holistic scoring might not only facilitate assessment of inter-rater reliability but also free raters to address a wider range of features than are usually included in a rubric (70).
Moxley and Eubanks conclude that the study produced “mixed results” on the efficacy of their peer-review process (71). Students’ improvement with practice and the correlation between instructor scores and those of stronger students suggested that the process had some benefit, especially for stronger students. Students’ difficulty with the B/C distinction and the low variance in weaker students’ scoring raised concerns (71). The authors argue, however, that there is no indication that weaker students do not benefit from the process (72).
The authors detail changes to their rubric resulting from their findings, such as creating separate rubrics for each project and allowing instructors to “customize” their instruments (73). They plan to examine the comments and other discursive components in their large sample, and urge that future research create a “richer picture of peer review processes” by considering not only comments but also the effects of demographics across many settings, including in fields other than English (73, 75). They acknowledge the degree to which assigning scores to student writing “reifies grading” and opens the door to many other criticisms, but contend that because “society keeps score,” the optimal response is to continue to improve peer review so that it benefits the widest range of students (73-74).