Rater variability and reliability of constructed response questions in New York state high-stakes tests of English language arts and mathematics: implications for educational assessment policy

Education


J. Huang and P. B. Whipple

Discover how a study by Jinyan Huang and Patrick B. Whipple examined the reliability of single-rater holistic scoring of constructed response questions on New York State’s high-stakes ELA and mathematics tests. The findings raise significant concerns for assessment policy and challenge the current one-rater scoring practice.

Introduction
New York State’s accountability reforms (e.g., adoption of the Common Core Learning Standards and the subsequent Next Generation Standards) intensified the use of grades 3–8 ELA and mathematics assessments in educator evaluations under the Annual Professional Performance Review (APPR) system. These assessments include constructed response items that, under NYSED policy, are typically scored holistically once per response by a single rater. Given known sources of rater-related variability, the study investigates whether this one-rater holistic scoring practice yields sufficient reliability. Using generalizability (G) theory, the study estimates rater-related variance and overall score reliability for constructed responses in grades 4 and 6 ELA and grades 4 and 5 mathematics. Research questions: (a) What are the sources of rater variability in scoring these constructed response questions across grades and subjects? (b) What are the rater reliabilities (G-coefficients for norm-referenced interpretations) for these tasks under the current single-rater condition and with increased numbers of raters?
Literature Review
The paper reviews reliability as a core indicator of assessment quality within classical test theory (CTT) and G-theory frameworks. Reliability (consistency) is a prerequisite for valid score interpretations, and inter- and intra-rater reliability are central concerns in performance assessments scored by humans. Holistic scoring offers efficiency and, with training, can support acceptable inter-rater agreement, but it provides little diagnostic detail and is susceptible to rater drift. Analytic scoring can enhance both inter- and intra-rater reliability and diagnostic value, but it is more time- and labor-intensive, limiting its practicality in large-scale accountability contexts. Prior large-scale language assessments (e.g., TOEFL, IELTS) typically use multiple raters to improve reliability, and increasing the number of tasks or raters raises reliability. G-theory is favored because it decomposes multiple error sources (persons, raters, and their interactions) and estimates dependability, guiding design changes (e.g., the number of raters) needed to reach a target reliability. In New York’s NYSTP, single-rater holistic scoring is mandated, raising concerns about score reliability given increased constructed-response demands.
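As a concrete illustration of this decomposition in the crossed persons-by-raters design used in the study, a minimal sketch follows (notation assumed for illustration, not quoted from the paper):

```latex
% Crossed p x r decomposition and the norm-referenced (relative) G-coefficient.
% sigma^2_p: paper (person) variance; sigma^2_r: rater variance;
% sigma^2_{pr,e}: paper-by-rater interaction confounded with residual error.
\[
X_{pr} = \mu + (\mu_p - \mu) + (\mu_r - \mu) + e_{pr},
\qquad
\sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}
\]
\[
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n_r}
\]
```

Here n_r is the number of raters whose scores are averaged per response; in standard G-theory the rater main effect enters absolute (criterion-referenced) dependability but not the relative coefficient reported in this study.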
Methodology
IRB approval was obtained prior to data collection.

Materials: Constructed response samples were drawn from 2015 NYSTP released materials. For ELA (grades 4 and 6): two 4-point constructed-response tasks per grade; for each grade, three “good,” three “fair,” and three “poor” responses per task were selected (18 per grade; 36 total). For Mathematics (grades 4 and 5): two 2-point and two 3-point constructed-response items per grade; similarly, three “good,” three “fair,” and three “poor” responses per task were selected (36 per grade; 72 total). Responses were chosen to maintain comparable lengths to avoid confounding with score.

Raters: Ten New York State-certified educators (5 male, 5 female) were selected via convenience sampling with attention to diversity of certification, experience, and districts. Teaching experience: 6 with 1–15 years, 4 with 16–25 years; assessment experience: 4 with <5 years, 6 with >5 years; certifications were evenly distributed across PK–6, reading/literacy, secondary ELA 7–12, students with disabilities, and school/district administration.

Training: Raters received NYSED-like training for holistic scoring using anchor papers, practice sets, and consistency assurance sets for ELA (2- and 4-point rubrics) and mathematics (2- and 3-point rubrics). Each training session lasted about three hours per rating method.

Scoring: The ten raters independently scored all samples holistically following NYSED rubrics; samples were anonymized and coded.

Analysis: Using GENOVA (Crick & Brennan, 1983), twelve random-effects p × r G-studies were conducted, one per task (ELA grade 4 tasks 1–2; ELA grade 6 tasks 1–2; Math grade 4 tasks 1–4; Math grade 5 tasks 1–4), to estimate variance components (paper/person, rater, and paper × rater residual). G-coefficients (norm-referenced) were computed for the current one-rater condition and projected for increased numbers of raters.
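For readers who want to see the arithmetic behind a single p × r G-study, here is a minimal sketch in Python. It is a stand-in for the GENOVA analysis, using a hypothetical score matrix; the function and variable names are illustrative, not the authors' code.

```python
# Minimal sketch (assumption: not the authors' GENOVA run) of a random-effects
# p x r G-study for one task. `scores` holds one holistic score per
# (paper, rater) cell; the norm-referenced G-coefficient is paper variance
# over paper variance plus residual error for the chosen number of raters.
import numpy as np

def p_by_r_gstudy(scores: np.ndarray, n_raters_dstudy: int = 1):
    """scores: shape (n_papers, n_raters), one observation per cell."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    paper_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Sums of squares for the fully crossed design with one score per cell.
    ss_p = n_r * np.sum((paper_means - grand) ** 2)
    ss_r = n_p * np.sum((rater_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ss_res = ss_total - ss_p - ss_r

    # Mean squares and the usual ANOVA estimators of the variance components.
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    var_res = ms_res                           # paper x rater + residual
    var_p = max((ms_p - ms_res) / n_r, 0.0)    # paper (object of measurement)
    var_r = max((ms_r - ms_res) / n_p, 0.0)    # rater

    # Norm-referenced (relative) G-coefficient when averaging over n raters.
    g = var_p / (var_p + var_res / n_raters_dstudy)
    return {"var_p": var_p, "var_r": var_r, "var_res": var_res, "G": g}

# Illustration with simulated data: 18 papers scored by 10 raters on a 0-4 rubric.
rng = np.random.default_rng(0)
fake_scores = rng.integers(0, 5, size=(18, 10)).astype(float)
print(p_by_r_gstudy(fake_scores, n_raters_dstudy=1))
```

Substituting the estimated components into the relative G-coefficient gives the one-rater reliabilities and the multi-rater projections reported below.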
Key Findings
- Variance components: Across all tasks, the object of measurement (paper/person) accounted for the largest share of variance; the residual (paper × rater and other unexplained sources) was the second largest; rater variance was smaller but non-negligible in ELA and very small in mathematics.
  - ELA Grade 4: Task 1: p 57.18%, r 2.47%, pr 40.34%; Task 2: p 62.42%, r 8.22%, pr 29.36%.
  - ELA Grade 6: Task 1: p 48.91%, r 16.11%, pr 34.98%; Task 2: p 62.14%, r 3.73%, pr 34.13%.
  - Math Grade 4: Tasks 1–4 rater variance ≈ 2.27%, 0.94%, 0.98%, 1.00% (with p ranging 52.50%–81.65%).
  - Math Grade 5: Tasks 1–4 rater variance ≈ 1.65%, 3.77%, 2.33%, 1.12% (with p ranging 63.00%–91.22%).
- Reliability (G-coefficients, one-rater scenario; see the projection sketch after this list):
  - ELA Grade 4: Task 1 = 0.59; Task 2 = 0.68. Increasing to 3 raters (Task 1) and 2 raters (Task 2) raises both to ≈ 0.81.
  - ELA Grade 6: Task 1 = 0.58; Task 2 = 0.65. Using 3 raters increases these to ≈ 0.81 and 0.85, respectively.
  - Math Grade 4: Tasks 1–4 = 0.96, 0.61, 0.82, 0.75. Increasing raters improves the low tasks: Task 2 with 3 raters → 0.83; Task 4 with 2 raters → 0.86.
  - Math Grade 5: Tasks 1–4 = 0.83, 0.73, 0.64, 0.92. Increasing raters: Task 2 with 2 raters → 0.85; Task 3 with 3 raters → 0.84.
- Interpretation: The current single-rater holistic scoring does not achieve acceptable reliability (≥ 0.80) for the ELA tasks or for several mathematics tasks. Mathematics tasks generally showed very small rater variance, and some already exceeded the reliability threshold with one rater, but others require multiple raters to reach ≥ 0.80.
- Residual variance was substantial across tasks (up to ~45%), suggesting additional unmeasured facets contribute to unexplained error.
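The multi-rater projections follow directly from the reported variance shares via the norm-referenced G-coefficient. A minimal sketch, treating the published percentages as illustrative variance components rather than the original GENOVA output:

```python
# Norm-referenced G-coefficient when constructed responses are averaged over
# n raters: paper variance / (paper variance + residual variance / n).
def g_coefficient(var_paper, var_residual, n_raters):
    return var_paper / (var_paper + var_residual / n_raters)

# ELA grade 4, task 1: paper 57.18%, paper-by-rater residual 40.34%.
print(round(g_coefficient(57.18, 40.34, 1), 2))  # 0.59 with one rater
print(round(g_coefficient(57.18, 40.34, 3), 2))  # 0.81 with three raters
```

The same formula reproduces the other projections in the list; for example, ELA grade 6 task 2 moves from 0.65 with one rater to about 0.85 with three.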
Discussion
The findings address the research questions by quantifying rater-related variance and dependability under the mandated single-rater holistic scoring. Person (paper) variance dominated, as expected, confirming meaningful performance differences among examinees. However, substantial residual variance indicates hidden facets (e.g., task features, rubric clarity, rater-by-task interactions) influencing scores. In ELA, rater variance was appreciable for some tasks (e.g., 16.11% in grade 6 task 1), and G-coefficients fell well below 0.80 under one-rater conditions; thus, the one-rater policy provides insufficient reliability for ELA constructed responses. In mathematics, rater variance was consistently minimal and some tasks reached acceptable reliability with one rater, yet several tasks still fell short and require additional raters to reach ≥ 0.80. Overall, moving from one to multiple raters meaningfully improves dependability, consistent with G-theory expectations and with common practice in large-scale writing assessments. The results suggest that current NYSTP scoring policies may undermine reliability for ELA and some mathematics tasks, with implications for the validity of inferences used in accountability and educator evaluation decisions.
Conclusion
This study applies G-theory to NYSTP constructed-response scoring in grades 4 and 6 ELA and grades 4 and 5 mathematics, showing that single-rater holistic scoring often fails to achieve acceptable reliability, particularly in ELA. Mathematics tasks exhibited very low rater variance and, in several cases, acceptable reliability with one rater; however, multiple raters are still needed for several tasks. The work contributes empirical evidence to inform state and local assessment policy, supporting the use of multiple raters to attain G ≥ 0.80. Future research should: (1) increase the number of tasks, responses, and raters to improve generalizability; (2) include qualitative components (e.g., rater interviews) to elucidate scoring processes and sources of error; (3) evaluate scoring under computer-based testing implementations; and (4) extend the framework to Regents Examinations to support high-stakes decision-making.
Limitations
- Small sample sizes of constructed responses and raters limit generalizability; access constraints to secure NYSTP materials reduced the breadth and depth of training and practice sets.
- Only quantitative methods were used; the absence of qualitative data (e.g., rater interviews) limits understanding of how raters applied rubrics and made decisions.
- Residual variance suggests unmodeled facets (e.g., task characteristics or rater-task interactions) that were not captured in the design.