Pruchnic, Jeff, Chris Susak, Jared Grogan, Sarah Primeau, Joe Torok, Thomas Trimble, Tanina Foster, and Ellen Barton. “Slouching Toward Sustainability: Mixed Methods in the Direct Assessment of Student Writing.” Journal of Writing Assessment 11.1 (2018). Web. 27 Nov. 2018.
[Page numbers from pdf generated from the print dialogue]
Jeff Pruchnic, Chris Susak, Jared Grogan, Sarah Primeau, Joe Torok, Thomas Trimble, Tanina Foster, and Ellen Barton report on an assessment of “reflection argument essay[s]” from the first-year-composition population of a large, urban, public research university (6). Their assessment used “mixed methods,” including a “thin-slice” approach (1). The authors suggest that this method can address difficulties faced by many writing programs in implementing effective assessments.
The authors note that many stakeholders to whom writing programs must report value large-scale quantitative assessments (1). They write that the validity of such assessments is often measured in terms of statistically determined interrater reliability (IRR) and samples considered large enough to adequately represent the population (1).
Administrators and faculty of writing programs often find that implementing this model requires time and resources that may not be readily available, even for smaller programs. Critics of this model note that one of its requirements, high interrater reliability, can too easily come to stand in for validity (2); in the view of Peter Elbow, such assessments favor “scoring” over “discussion” of the results (3). Moreover, according to the authors, critics point to the “problematic decontextualization of program goals and student achievement” that large-scale assessments can foster (1).
In contrast, Pruchnic et al. report, writing programs have tended to value the “qualitative assessment of a smaller sample size” because such models are more likely to produce the information needed for “the kinds of curricular changes that will improve instruction” (1). Writing programs, the authors maintain, have turned to redefining a valid process as one that can provide this kind of information (3).
Pruchnic et al. write that this resistance to statistically sanctioned assessments has created a bind for writing programs. They cite scholars like Peggy O’Neill (2) and Richard Haswell (3) to posit that when writing programs refuse the measures of validity required by external stakeholders, they risk having their conclusions dismissed and may well find themselves subject to outside intervention (3). Haswell’s article “Fighting Number with Number” proposes producing quantitative data as a rhetorical defense against external criticism (3).
In the view of the authors, writing programs are still faced with “sustainability” concerns:
The more time one spends attempting to perform quantitative assessment at the size and scope that would satisfy statistical reliability and validity, the less time . . . one would have to spend determining and implementing the curricular practices that would support the learning that instructors truly value. (4)
Hoping to address this bind, Pruchnic et al. write of turning to a method developed in social studies to analyze “lengthy face-to-face social and institutional interactions” (5). In a “thin-slice” methodology, raters use a common rubric to score small segments of the longer event. The authors report that raters using this method were able to predict outcomes, such as the number of surgery malpractice claims or teacher-evaluation results, as accurately as those scoring the entire data set (5).
To test this method, Pruchnic et al. created two teams, a “Regular” and a “Research” team. The study compared interrater reliability, “correlation of scores,” and the time involved to determine how closely the Research raters, scoring thin slices of the assessment data, matched the work of the Regular raters (5).
Pruchnic et al. provide a detailed description of their institution and writing program (6). The university’s assessment approach is based on Edward White’s “Phase 2 assessment model,” which involves portfolios with a final reflective essay, the prompt for which asks students to write an evidence-based argument about their achievements in relation to the course outcomes (8). The authors note that limited resources gradually reduced the amount of student writing that was actually read, as raters moved from full-fledged portfolio grading to reading only the final essay (7). The challenges of assessing even this limited amount of student work led to a sample that consisted of only 6-12% of the course enrollment.
The authors contend that this is not a representative sample; as a result, “we were making decisions about curricular and other matters that were not based upon a solid understanding of the writing of our entire student body” (7). The assessment, in the authors’ view, therefore did not meet necessary standards of reliability and validity.
The authors describe developing the rubric to be used by both the Research and Regular teams from the precise prompt for the essay (8). They used a “sampling calculator” to determine that, given the total of 1,174 essays submitted, 290 papers would constitute a representative sample; instructors were asked for specific, randomly selected papers to create a sample of 291 essays (7-8).
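The summary does not say which “sampling calculator” the authors used or what settings they chose. As a minimal sketch, assuming the standard sample-size formula with a finite population correction, a 95% confidence level (z = 1.96), a 5% margin of error, and a response proportion of 0.5 (all assumptions, not details from the article), the figure of 290 for 1,174 essays can be reproduced as follows. The function name `representative_sample_size` is hypothetical.

```python
import math


def representative_sample_size(population, z=1.96, margin=0.05, proportion=0.5):
    """Estimate a representative sample size for a finite population.

    Assumed defaults (not stated in the article): 95% confidence (z = 1.96),
    5% margin of error, and the most conservative proportion of 0.5.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # Finite population correction shrinks the requirement for small populations.
    n = n0 / (1 + (n0 - 1) / population)
    # Round up, since a fractional essay cannot be sampled.
    return math.ceil(n)


print(representative_sample_size(1174))  # prints 290
```

Under these assumed settings the result matches the 290 papers reported, which suggests, but does not confirm, that the authors used comparable parameters.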
The Regular team worked in pairs, both members of each pair reading the entire essay, with third readers called in as needed (8): “[E]ach essay was read and scored by only one two-member team” (9). The authors used “double coding” in which one-fifth of the essays were read by a second team to establish IRR (9). In contrast, the 10-member Research team was divided into two groups, each of which scored half the essays. These readers were given material from “the beginning, middle, and end” of each essay: the first paragraph, the final paragraph, and a paragraph selected from the middle page or pages of the essay, depending on its length. Raters scored the slices individually; the average of the five team members’ scores constituted the final score for each paper (9).
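The Research team's procedure can be sketched in a few lines. This is an illustration only: the function names `thin_slice` and `final_score` are hypothetical, paragraphs are assumed to be separated by blank lines, and the middle slice is taken as the paragraph at the midpoint of the paragraph list, a simplification of the article's selection from the “middle page or pages.” The scores in the example are invented for demonstration.

```python
from statistics import mean


def thin_slice(essay_text):
    """Return the first, a middle, and the final paragraph of an essay.

    Assumption: paragraphs are separated by blank lines, and the "middle"
    slice is the paragraph at the midpoint of the paragraph list.
    """
    paragraphs = [p.strip() for p in essay_text.split("\n\n") if p.strip()]
    if len(paragraphs) <= 3:
        # Too short to slice; raters would see the whole essay.
        return paragraphs
    middle = paragraphs[len(paragraphs) // 2]
    return [paragraphs[0], middle, paragraphs[-1]]


def final_score(rater_scores):
    """Average the individual raters' scores into one score per paper."""
    return mean(rater_scores)


# Invented example: a seven-paragraph essay and five raters' scores.
essay = "\n\n".join(f"Paragraph {i}" for i in range(1, 8))
print(thin_slice(essay))            # ['Paragraph 1', 'Paragraph 4', 'Paragraph 7']
print(final_score([3, 4, 3, 2, 3]))  # 3
```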
Pruchnic et al. discuss in detail their process for determining reliability and for correlating the scores given by the Regular and Research teams to determine whether the two groups were scoring similarly. Analysis of interrater reliability revealed that the Research team’s IRR was “one full classification higher” than that of the Regular readers (12). Scores correlated at the “low positive” level, but the correlation was statistically significant (13). Finally, the Research team as a whole spent “a little more than half the time” scoring that the Regular group did, while the average individual scoring time for Research team members was less than half that of the Regular members (13).
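The summary does not name the statistic used to correlate the two teams' scores. As a hedged illustration, assuming a Pearson correlation coefficient (an assumption, not the authors' stated method), the comparison of per-essay scores could be sketched as below. The function name `pearson` is hypothetical, and the score lists are invented and do not reproduce the study's data or its “low positive” result.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    mx, my = mean(xs), mean(ys)
    # Sum of products of deviations from each list's mean.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Root sum of squared deviations for each list.
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented per-essay scores for the two teams, for demonstration only.
regular_scores = [2, 3, 3, 4, 1, 2]
research_scores = [2.4, 2.8, 3.6, 3.2, 2.0, 2.6]
print(round(pearson(regular_scores, research_scores), 2))
```

A value near 1 would indicate that the two teams rank essays almost identically; values closer to 0 indicate weaker agreement.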
Additionally, the assessment included holistic readings of 16 essays randomly selected to represent the four quantitative result classifications of Poor through Good (11). This assessment allowed the authors to determine the qualities characterizing essays ranked at different levels and to address the pedagogical implications within their program (15, 16).
The authors conclude that thin-slice scoring, while not the best choice in every context (16), “can be added to the Writing Studies toolkit for large-scale direct assessment of evaluative reflective writing” (14). Future research, they propose, should address the use of this method to assess other writing outcomes (17). Paired with a qualitative assessment, they argue, a mixed-method approach that includes thin-slice analysis as an option can help satisfy the need for statistically grounded data in administrative and public settings (16) while enabling strong curricular development, ideally resulting in “the best of both worlds” (18).