Evaluating sport-for-development outcome measures used in a living lab setting: Process, improvements, and insights

B. Sharma, J. Robinson, et al.

This study by Bhanu Sharma, Jackie Robinson, Benjamin B. Arhen, Brian W. Timmons, Bryan Heald, and Marika Warner evaluates the reliability of 11 Likert-style outcome measures in a sport-for-development context. Analyzing data from over 2,600 questionnaire completions, the research highlights the importance of context-appropriate tools and ongoing validation for enhancing research quality.
Introduction

The study addresses how well existing and adapted outcome measures capture intended physical, mental, and social outcomes of sport-for-development (SFD) programming when implemented within a living lab—a real-world, participant-centered environment with less experimental control. The research question focuses on evaluating psychometric performance (per Classical Test Theory, CTT) of SFD outcome measures used at MLSE LaunchPad to determine their reliability, validity, and feasibility in a complex, real-world setting. Contextually, SFD programs aim to drive positive youth development beyond athletics, benefiting underserved youth by building transferable skills and improving health and social outcomes. However, living labs pose barriers to standardized measurement (e.g., variable engagement, logistics, less control), and conventional clinical measures are often too long, deficit-based, or linguistically inaccessible. The purpose and importance of this study lie in establishing a context-appropriate metrics framework with psychometrically sound tools to improve evaluation quality, inform program improvement, and support broader SFD research and practice.

Literature Review

The paper synthesizes literature establishing SFD as a model for promoting positive youth development and social outcomes, especially among underserved populations. It highlights the living lab approach as ecologically valid but methodologically challenging for standardized measurement. Prior critiques of SFD research point to insufficient evidence from small, observational cohorts and a lack of suitable, feasible measures for community-based settings. Conventional gold-standard clinical measures (e.g., resilience, self-efficacy, self-esteem) are often unsuitable for SFD living labs due to length, complex language, and deficit orientations. Classical Test Theory (CTT) is presented as a practical framework widely used in survey development and patient-reported outcomes to evaluate item and scale performance (e.g., missingness, endorsement patterns, inter-item relations, internal consistency, test–retest). The authors’ prior scoping work found no suitable measures tailored to SFD living labs, underscoring the need for a contextually appropriate metrics framework aligned with MLSE LaunchPad’s MISSION Measurement Methodology (Minimal, I-statements, Short, Strengths-based, Involve coaches, Online, No neutrality).

Methodology

Design: Secondary analysis of prospectively collected evaluation data from cohorts of youth (ages 6–29) participating in SFD programs (2019–2024) at MLSE LaunchPad (MLSE LP) and satellite sites in Toronto, Canada.

Setting: MLSE LP is a 42,000-square-foot downtown facility serving approximately 20,000 youth annually, many from low-income and racialized communities (over 60% reporting household income < $20,000; 94% racialized). Programs align with four pillars: Healthy Body, Healthy Mind, Ready for School, Ready for Work. The living lab integrates program delivery and evaluation staff to support continuous improvement and youth-first values.

Participants and programs: 2656 questionnaire completions were analyzed. Participants had an average age of 11.3 years; 42% were girls, 1% non-binary or unspecified, and 81% racialized (largest groups: Black 38%, Middle Eastern 19%, East Asian 14%). Inclusion criteria: ages 6–29, enrollment in at least one MLSE LP SFD program, and English proficiency. Programs spanned sport-plus and plus-sport formats across the pillars (e.g., leadership, social competence, resilience, grit, self-regulation, belonging, work readiness, self-esteem, critical thinking).

Protocols: Registration, survey completion, and incentives were managed via the MLSE Scoreboard digital platform using gamification (points redeemable for items). Baseline surveys were typically completed within the first week of a program; follow-up surveys in the final program week. Staff and coaches prompted completion.

Outcome measures: Eleven self-report Likert-style questionnaires (8–20 items) assessed Critical Thinking, Resilience, Social Competence, Grit, Leadership, Self-Esteem, Social Capital, Ready for Work, Gritty Resilience, Belonging & Inclusion, and Self-Regulation. Scales used either 4-point (Strongly Disagree to Strongly Agree) or 6-point response formats (adding Somewhat Disagree/Agree) and were developed in-house or adapted to align with the MISSION framework (strengths-based, brief, accessible language, no neutral option). Measure development followed five steps: literature review; pilot analysis with CTT; content validation with stakeholders (youth, coaches, staff); drafting per MISSION; and pilot testing with revision.

Analyses: Using CTT, each outcome measure was assessed against specified thresholds: missingness (<10% per item); floor/ceiling effects (<80% of responses at the minimum/maximum); single-item endorsement (<50% endorsement of any one response category); skewness (|skew| < 2); inter-item correlations (r < 0.50, to avoid redundancy and ensure variance); item-total correlations (r > 0.30); internal consistency (Cronbach's alpha ≥ 0.70); and test–retest reliability (r > 0.50). Data were summarized at the item and scale levels, and iterative revisions (e.g., 6-point versions) were evaluated for improvements. Detailed instrument-by-instrument performance is provided in an online supplement.
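The item-level screening thresholds described under "Analyses" can be sketched as a short routine. This is a minimal illustration, assuming numpy and complete-case data for the correlation step; the function name `screen_items` is illustrative and this is not the study's actual analysis code.

```python
import numpy as np

def screen_items(responses, n_points):
    """Item-level CTT screening for a (respondents x items) Likert matrix.

    responses: 2D float array, NaN marks a missing response.
    n_points:  number of response options (4 or 6 in this framework).
    Returns a dict of pass/fail flags, one per criterion.
    """
    n, k = responses.shape
    out = {}

    # Missingness: fewer than 10% missing responses per item.
    out["missingness"] = bool((np.isnan(responses).mean(axis=0) < 0.10).all())

    # Single-item endorsement: no category chosen by >= 50% of respondents.
    endorsement_ok = True
    for j in range(k):
        col = responses[:, j]
        col = col[~np.isnan(col)].astype(int)
        counts = np.bincount(col, minlength=n_points + 1)[1:]
        if counts.max() / counts.sum() >= 0.50:
            endorsement_ok = False
    out["endorsement"] = endorsement_ok

    # Skewness: |moment-based skew| < 2 for every item.
    centered = responses - np.nanmean(responses, axis=0)
    skew = np.nanmean(centered**3, axis=0) / np.nanstd(responses, axis=0) ** 3
    out["skewness"] = bool((np.abs(skew) < 2).all())

    # Inter-item correlations: pairwise r < 0.50 flags redundant item pairs
    # (assumes no missing data in this step).
    r = np.corrcoef(responses, rowvar=False)
    out["inter_item"] = bool((r[~np.eye(k, dtype=bool)] < 0.50).all())

    return out
```

Item-total correlations, internal consistency, and test–retest reliability would be computed at the scale level in the same spirit.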

Key Findings
  • Data volume and sample: 2656 questionnaire completions across youth ages 6–29 in multiple SFD programs and sites (2019–2024).
  • Overall CTT performance (across outcome measures; Table 5):
  • Missingness: 93.8% of outcome measures met the <10% threshold, indicating minimal non-response.
    • Floor/ceiling effects: 100% met criteria.
    • Single-item endorsement: 37.5% met the criterion; the remaining 62.5% showed excessive endorsement of a single response category, indicating response binning.
    • Skewness: 93.8% met threshold (|skew| < 2).
    • Inter-item correlations: 18.8% met the r < 0.50 criterion; most measures showed item redundancy/overlap.
    • Internal consistency: 100% had Cronbach’s alpha ≥ 0.70.
    • Test–retest reliability: 57.1% achieved r > 0.50.
  • Specific examples from instrument-level results (Table 4):
    • Critical Thinking: alpha 0.797; test–retest r = 0.56 (n = 289). 6-point version: alpha 0.913.
    • Social Competence: alpha 0.899; test–retest r = 0.71 (n = 74). 6-point version: alpha 0.898.
    • Grit: alpha 0.905; test–retest r = 0.58 (n = 20).
    • Leadership: alpha 0.940; test–retest r = 0.46 (n = 117). 6-point version: alpha 0.926; test–retest r = 0.79 (n = 46).
    • Self-Esteem: alpha 0.965; test–retest r = 0.32 (n = 105). 6-point version: alpha 0.930.
    • Resilience: alpha 0.915; test–retest r = 0.39 (n = 38).
    • Belonging: alpha 0.832.
    • Self-Regulation: alpha 0.914.
    • Ready for Work: alpha 0.748; 6-point version alpha 0.787.
  • Iterative improvements: Introducing 6-point response options for five measures (Critical Thinking, Social Competence, Leadership, Self-Esteem, Ready for Work) improved single-item endorsement in 80% of cases with modest changes to internal consistency.
  • Practical insight: Measures generally showed excellent internal consistency and low missingness but struggled with endorsement concentration and item redundancy, which can obscure change detection and compromise sensitivity in living lab contexts.
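The internal-consistency values reported above (alpha 0.748 to 0.965) follow from the standard Cronbach's alpha formula: alpha = k/(k-1) × (1 − sum of item variances / variance of the total score). A minimal sketch, assuming numpy and complete data; the helper name `cronbach_alpha` is illustrative, not from the study:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a complete (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```

Test–retest coefficients like those above are simply Pearson correlations (e.g., via `np.corrcoef`) between baseline and follow-up scores for respondents with both administrations.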

Discussion

The evaluation addressed the central question of how existing/adapted SFD outcome measures perform under real-world living lab conditions. Findings show that while instruments exhibit strong internal consistency and feasibility (minimal missingness, acceptable skewness, no floor/ceiling problems), many suffer from response binning (single-item endorsement) and item redundancy (high inter-item correlations). These patterns limit discriminative capacity and sensitivity to change, challenging repeated-measures use in dynamic youth programs. By applying CTT, the study identified actionable avenues for improvement: revising response formats (e.g., 4-point to 6-point Likert), pruning redundant items to enhance brevity and reduce burden, and consolidating overlapping constructs (e.g., developing a combined “Gritty Resilience” scale). The improved test–retest for some revised measures (e.g., Leadership 6-point r = 0.79) suggests that format and content changes can enhance reliability without sacrificing internal consistency. Embedding iterative validation within the living lab aligns measurement with practical constraints and youth-centered principles, ultimately strengthening both research quality and program decision-making.

Conclusion

This study demonstrates a pragmatic pathway for validating and refining SFD outcome measures for living lab use. The measures generally showed strong feasibility and internal consistency but revealed issues with endorsement concentration and item redundancy. Iterative revisions—including expanded Likert response options and scale streamlining—improved measurement properties in key cases. Future work will continue iterating instrument formats, testing newly developed measures (e.g., Gritty Resilience), and building a standardized, shareable metrics framework to enable cross-site comparability. Ongoing stakeholder engagement (youth, coaches, practitioners) and complementary psychometric approaches (e.g., IRT/Rasch) are recommended to enhance sensitivity, interpretability, and generalizability of SFD evaluations.

Limitations
  • Single-site context: Data were collected from one SFD facility, potentially limiting generalizability to other settings.
  • Limited repeated measures: Many instruments lacked repeated administrations, constraining longitudinal reliability and change detection analyses.
  • Psychometric scope: Analyses relied primarily on CTT; incorporating IRT or Rasch models may provide deeper item-level insights.
  • Subgroup analyses: No stratified assessments (e.g., by age, gender) were conducted; future work should examine measurement invariance and subgroup performance.