New York State's education system has undergone significant reforms, including shifts to the Common Core Learning Standards and, later, the Next Generation Learning Standards. State assessment scores are crucial for teacher and principal evaluations, influencing employment decisions. Essay and constructed-response questions are now as important as multiple-choice questions, but their scoring, which relies on human raters, presents challenges. Research highlights a range of factors affecting scoring variability and reliability, including raters' backgrounds, scoring methods, criteria, tolerance for errors, training, and the number of raters. The New York State Testing Program (NYSTP) requires holistic scoring by a single rater, raising concerns about score reliability. This study employed generalizability (G-) theory to examine the impact of that practice on the NYSTP grades four and six ELA assessments and grades four and five mathematics assessments.
Literature Review
The literature extensively discusses the challenges of assessing essay and constructed-response questions due to the subjectivity inherent in evaluating written work. Holistic and analytic scoring methods are the common approaches, each with advantages and disadvantages. Holistic scoring, while efficient, lacks diagnostic detail and is susceptible to rater drift from established criteria. Analytic scoring, although more time-consuming, offers detailed feedback and potentially higher reliability. The number of raters strongly influences reliability; studies show that multiple raters improve reliability, but at greater cost. Generalizability (G-) theory is a powerful method for analyzing rater variability and reliability, particularly in performance assessments, because it identifies and quantifies the separate sources of error variance; the standard paper-by-rater decomposition is sketched below.
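As background for the results summarized below, a fully crossed paper-by-rater (p × r) random-effects design decomposes each observed score into paper, rater, and residual effects. This is standard G-theory notation, not material taken from the study itself:

```latex
% Decomposition of an observed score X_{pr} in a crossed p x r design:
X_{pr} = \mu + (\mu_p - \mu) + (\mu_r - \mu) + (X_{pr} - \mu_p - \mu_r + \mu)

% The total score variance splits into three components:
\sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}

% Relative G-coefficient for a D-study with n'_r raters per paper:
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e} / n'_r}
```

Here \(\sigma^2_p\) is the paper (object of measurement) component, \(\sigma^2_r\) the rater component, and \(\sigma^2_{pr,e}\) the confounded paper-by-rater interaction plus error; only \(\sigma^2_p\) counts as true-score variance, which is why a large residual component drags the G-coefficient down.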
Methodology
The study received IRB approval. Constructed-response samples were selected from the 2015 New York State tests (grades 4 and 6 ELA, grades 4 and 5 mathematics) and categorized as "good," "fair," and "poor" based on their original scores. Ten certified New York State educators were recruited through convenience sampling, with attention to variation in gender, years of teaching experience, assessment experience, and certification level. Rater training mirrored New York State's procedures, including training sets, practice sets, and consistency assurance sets. Raters scored the samples holistically using the New York State scoring rubrics. Data analysis employed the GENOVA computer program to perform twelve paper-by-rater (p × r) random-effects G-studies, one per task, and to calculate a G-coefficient for each task as the index of reliability; a sketch of this computation follows.
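The study itself used GENOVA for these computations. As an illustrative stand-in, the Python sketch below estimates the p × r variance components from a papers-by-raters score matrix via the usual ANOVA expected-mean-squares equations. All function and variable names are mine, not GENOVA's; this is a minimal sketch under the crossed-design assumption, not the study's code:

```python
import numpy as np

def g_study_pxr(scores):
    """Estimate variance components for a fully crossed p x r
    random-effects G-study.

    scores: 2-D array, rows = papers (persons), columns = raters,
            one holistic score per cell.
    """
    scores = np.asarray(scores, dtype=float)
    n_p, n_r = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)   # per-paper means
    r_means = scores.mean(axis=0)   # per-rater means

    # Mean squares from the two-way ANOVA with one observation per cell.
    ms_p = n_r * np.sum((p_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((r_means - grand) ** 2) / (n_r - 1)
    resid = scores - p_means[:, None] - r_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-squares equations; negative estimates
    # are truncated to zero, as is conventional.
    var_pr_e = ms_pr                          # residual (pr interaction + error)
    var_p = max((ms_p - ms_pr) / n_r, 0.0)    # paper (object of measurement)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)    # rater
    return var_p, var_r, var_pr_e

def g_coefficient(var_p, var_pr_e, n_raters):
    """Relative G-coefficient for a D-study with n_raters raters per paper."""
    return var_p / (var_p + var_pr_e / n_raters)
```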
Key Findings
Twelve paper-by-rater (p × r) random-effects G-studies were conducted. For both ELA grades, the "paper" (object of measurement) variance component was largest, followed by the residual component. Rater variance was relatively small but varied across tasks and grades: in ELA, raters scored some tasks more consistently than others, while in mathematics rater variance was consistently minimal across all tasks and grades. The G-coefficients showed that in ELA the current one-rater scoring practice did not reach the acceptable threshold of 0.80; two or three raters per task were needed to meet this standard. In mathematics, only some tasks reached an acceptable G-coefficient with one rater; the others required two or three raters (see the projection sketch below).
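To make the rater projections concrete, the usage sketch below plugs hypothetical variance components (chosen for illustration only, not the study's estimates) into the relative G-coefficient formula from the earlier sketch. The pattern mirrors the reported finding that one rater can fall short of 0.80 while two or three raters clear it:

```python
# Hypothetical variance components for illustration only;
# these are NOT the estimates reported in the study.
var_p, var_pr_e = 0.60, 0.35

for n_raters in (1, 2, 3):
    g = var_p / (var_p + var_pr_e / n_raters)  # relative G-coefficient
    print(f"{n_raters} rater(s): G = {g:.3f}")
# 1 rater(s): G = 0.632
# 2 rater(s): G = 0.774
# 3 rater(s): G = 0.837  (first crosses the 0.80 threshold)
```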
Discussion
The findings reveal substantial rater variability and reliability problems in the ELA assessments, calling into question New York State's current one-rater holistic scoring practice. The mathematics assessments showed minimal rater variability but still required more than one rater on several tasks to reach adequate reliability. These results warrant a reevaluation of the current NYSTP scoring policy. The study also underscores the need for greater assessment literacy among educators: quality assessment principles must be understood and applied beyond test development, and scoring decisions should weigh the resources that reliable assessment requires rather than financial constraints alone.
Conclusion
The study demonstrates that the current one-rater holistic scoring practice in the NYSTP ELA and mathematics assessments is inadequate, producing scores of unacceptably low reliability. Increasing the number of raters per task is recommended to improve score reliability and the validity of the inferences drawn from those scores. Future research should expand the sample size and explore the implications for computer-based testing and the Regents Examinations.
Limitations
The study's limitations include small samples of both constructed-response papers and raters, which limit generalizability. The exclusively quantitative design precludes a full understanding of the rating process from the raters' perspectives, and limited access to secure state assessment materials restricted the training materials that could be used.