Interdisciplinary Studies
Assessing scale reliability in citizen science motivational research: lessons learned from two case studies in Uganda
M. G. Ashepet, L. Vranken, et al.
The paper examines how commonly used psychometric frameworks for motivation and behavior—specifically the Volunteer Functions Inventory (VFI) and the Theory of Planned Behaviour (TPB)—perform when applied to citizen science (CS) participants in the Global South. Citizen science relies on recruiting and retaining motivated volunteers, yet most motivation studies and instruments are developed and validated in Global North contexts. The research questions are: How reliable are the VFI and TPB scales in measuring motivations and behavioral intentions among CS participants in Uganda? Do standard reliability indices such as Cronbach’s alpha hold under the data structures typical of CS in this context? The study aims to assess the internal-consistency reliability of VFI and TPB factors among highly motivated groups (active CS participants and candidate CS participants) in two Ugandan CS networks. It evaluates Cronbach’s alpha, checks its assumptions (normality, unidimensionality, uncorrelated errors, tau-equivalence), explores data transformations, and reports alternative reliability indices (McDonald’s omega and the Greatest Lower Bound, GLB). The broader purpose is to inform robust measurement practice and the potential need for context-specific instrument development for CS in the Global South.
Prior work in CS motivation often adopts frameworks from volunteerism and psychology. The functional approach to volunteering underpins the VFI, which measures six motives: values, understanding, social, career, protective, and enhancement. TPB predicts behavior via intention, shaped by attitude, subjective norms, and perceived behavioral control (PBC), with extensions including self-identity and moral obligation. These frameworks are widely used and generally show strong psychometric properties across fields, but their adaptation and validation for CS—especially in the Global South—lag behind. Internal consistency is commonly reported using Cronbach’s alpha, yet alpha is sensitive to restrictive assumptions: continuous and normally distributed data, unidimensionality, uncorrelated errors, and tau-equivalence. Violations can bias alpha unpredictably. Alternatives discussed in the literature include McDonald’s omega (better under congeneric models, skew, small samples, mild multidimensionality) and GLB (provides a lower bound to reliability; may exceed alpha and omega but can be inflated in small samples). The literature recommends verifying dimensionality and reporting multiple reliability indices rather than relying solely on alpha.
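For reference, the two headline indices have standard closed forms (textbook definitions, not reproduced from the paper itself). For $k$ items with sample variances $\sigma_i^2$ and total-score variance $\sigma_X^2$,

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),$$

while omega total, from a one-factor (congeneric) model with loadings $\lambda_i$ and residual variances $\theta_i$, is

$$\omega_t = \frac{\left(\sum_i \lambda_i\right)^2}{\left(\sum_i \lambda_i\right)^2 + \sum_i \theta_i}.$$

Alpha coincides with omega only under tau-equivalence (all $\lambda_i$ equal), which is one reason the two diverge when that assumption fails.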
Study setting and participants: Two CS networks in southwest Uganda were studied: (1) Geo-observers (GO), monitoring seven natural hazards (landslides, floods, earthquakes, droughts, lightning, windstorms, hailstorms) under the D-SIRe (2017/2019) and HARISSA (2019) projects (60 active CSs); and (2) ATRAP (2020), monitoring freshwater snails relevant to schistosomiasis transmission (25 active CSs). A control group comprised candidate CSs who met the recruitment criteria but were not selected at the time (GO control ≈60; ATRAP control 30). Recruitment involved nominations by local leaders followed by project selection; CSs received training, equipment (smartphones, protective gear), identifiers, and cost reimbursements.
Measures: The VFI used 30 items (six functions × five items) on seven-point Likert scales (1–7). The TPB used 26 items: attitude (six semantic-differential items), subjective norms (six items), PBC (five items), and intention (three items), plus the extensions self-identity (three) and moral obligation (three), all on seven-point Likert scales. Items were adapted from prior literature and contextualized to network tasks; several TPB items were reverse-coded.
Data collection: Semi-structured, face-to-face interviews in both individual and group-based sessions allowed clarification and back-translation where needed. Interview periods spanned 2019–2021. Datasets: ATRAP_I (n=53), ATRAP_G (n=58), GO_G (n=107), GO_I (n=100). Personal demographics were also collected.
Data analysis: Stage 1 computed descriptive statistics (means, SDs, skewness, kurtosis), reverse-coded negatively worded items, and estimated Cronbach’s alpha per factor in R (psych package). Missing data (notably in GO group interviews) were excluded. Stage 2 verified alpha’s assumptions: item–total correlations (>0.2 considered adequate); normality tests (Shapiro–Wilk; |skewness| > 2 as a threshold), with log and inverse transformations tried after reflecting items; and confirmatory factor analyses (CFA) specifying one-factor models with MLR estimation (lavaan).
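The per-factor alpha computation in Stage 1 was done with R's psych package; the underlying formula is simple enough to sketch in plain Python (NumPy only — an illustration of the statistic, not the authors' R workflow):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: 2-D array, one row per respondent, one column per item
    (all items assumed already coded in the same direction).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)
```

Because the total-score variance is driven by inter-item covariances, near-duplicate items push alpha toward 1, while weakly or negatively correlated items pull it toward 0 or below — the situation reported for some TPB factors here.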
Unidimensionality was judged against CFI/TLI ≥ 0.93 and RMSEA/SRMR ≤ 0.08. Residual correlation matrices were inspected; residuals >0.1 indicated notable correlated errors. Tau-equivalence was assessed via (a) comparison of freely estimated versus equal-loadings models (likelihood ratio test) and (b) the tau.test robust F-statistic (coefficientalpha package). Model selection used LR tests or AIC as appropriate. Stage 3 estimated alternative reliability indices: omega total (semTools) from the freely estimated CFAs, and the GLB (psych). The threshold for acceptable reliability was set at ≥0.70 for all indices. Qualitative fieldnotes documented respondent reactions to item wording; items drawing more than two reactions were flagged and compared against item–total correlations. All analyses used R 4.2.2; significance level p ≤ 0.05.
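Omega total in Stage 3 comes from the fitted CFA via semTools; its defining formula can be sketched directly from a one-factor model's estimates (Python illustration of the formula, not of the semTools implementation):

```python
import numpy as np

def omega_total(loadings, residual_vars):
    """McDonald's omega total for a one-factor (congeneric) model.

    loadings: estimated factor loadings (lambda_i) from the CFA
    residual_vars: residual (unique) variances (theta_i)
    """
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(residual_vars, dtype=float)
    common = lam.sum() ** 2          # variance attributable to the common factor
    return common / (common + theta.sum())
```

When all loadings are equal (the tau-equivalent case) omega and alpha agree; the more the loadings vary within a factor, the more alpha can understate reliability, which fits the pattern of omega generally exceeding alpha reported in the results.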
- Sample characteristics: Predominantly male (74%); mean age 34 years (SD 8); 38% with tertiary/university education (mostly in GO); 72% self-employed.
- A priori reliability (Cronbach’s alpha): VFI factors had generally higher alphas than TPB; across datasets, some VFI factors surpassed 0.70 (e.g., protective ATRAP_G α=0.83), but others were moderate to low (e.g., values ATRAP_G α=0.34). TPB showed acceptable alpha mainly for attitude (up to 0.91 in GO_G) and sometimes subjective norms; PBC, intention, moral obligation, and self-identity often had very low or even negative alphas (e.g., PBC GO_I α≈−0.03).
- Score distributions: Items exhibited high means and pervasive negative skew; Shapiro–Wilk tests rejected normality for all items. Data transformations (reflected log/inverse) reduced skew but did not restore normality nor systematically improve alpha values.
- Internal structure: Many one-factor CFA models failed recommended fit thresholds (CFI/TLI <0.93; RMSEA/SRMR >0.08), especially for TPB PBC, intention, moral obligation, and self-identity, indicating frequent violations of unidimensionality. Residual correlations >0.1 were common in many factors, violating uncorrelated error assumptions. Factor loadings varied widely within factors, indicating non–tau-equivalence. Tests of tau-equivalence (LR tests and tau.test) provided mixed evidence; overall, tau-equivalence could not be consistently supported.
- Alternative reliability indices: Omega total and GLB were generally higher than alpha. GLB often exceeded 0.70 for most VFI factors and for TPB attitude and subjective norms, while PBC, intention, moral obligation, and self-identity remained below 0.70 across indices. Where tau-equivalence held better, omega and alpha were closer.
- Item diagnostics and qualitative insights: Problematic items (e.g., V2, P3, certain PBC items, some attitude and social/subjective norm items) drew confusion or contextual mismatch and frequently had weak or negative item-total correlations. Removing weak items improved omega in sensitivity analyses.
- Robust pattern across networks and interview settings: Despite data collected months apart and differing interview modes (group vs individual), factor mean scores and reliability patterns were consistent.
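The reflect-then-transform step applied to the negatively skewed items can be sketched as follows (a common recipe for negative skew; the 1–7 anchors come from the measures described above, and the function name is illustrative):

```python
import numpy as np

def reflect_log(x, max_score=7):
    """Reflect a negatively skewed Likert item, then log-transform it.

    Reflection maps the scale so the long tail sits on the right, where
    the log can compress it; a score of max_score maps to log(1) = 0.
    """
    x = np.asarray(x, dtype=float)
    reflected = (max_score + 1) - x   # 7 -> 1, 1 -> 7
    return np.log(reflected)
```

As the paper reports, such transformations reduced skew but restored neither normality nor alpha's other assumptions, so they were not a fix on their own.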
The study set out to evaluate the internal consistency reliability of VFI and TPB scales among citizen scientists in Uganda and to test whether Cronbach’s alpha is appropriate under real-world data conditions typical of CS in the Global South. Findings show that alpha is often low for several TPB factors (PBC, intention, self-identity, moral obligation) and that its assumptions (normality, unidimensionality, uncorrelated errors, tau-equivalence) are frequently violated. These violations likely bias alpha downward and make it an unreliable indicator of internal consistency in this context. Alternative indices (omega, GLB) offered higher and more plausible reliability estimates; however, even with these, several TPB factors remained below acceptable thresholds, suggesting genuine construct or item issues (e.g., low item-total correlations, multidimensionality). Qualitative evidence indicates that certain items were culturally or contextually mismatched (e.g., phrasing about “less fortunate,” reverse-worded items, and PBC phrasing clashing with formal commitments via MoUs), potentially contributing to skewed responses, ceiling effects, and weak inter-item coherence. The project’s formalized design (training, equipment, contracts) and a relatively homogeneous, highly motivated sample likely increased score homogeneity, further affecting internal consistency statistics. Overall, the results argue for careful verification of reliability assumptions, reporting multiple indices, and adapting or redeveloping psychometric instruments to local CS contexts to validly capture motivations and behavioral drivers.
Cronbach’s alpha applied a priori to the VFI and TPB in Ugandan CS samples frequently yielded low estimates under violated assumptions, indicating that alpha should not be used uncritically. More robust indices (omega, GLB) generally produced higher reliability, though several TPB factors still underperformed, suggesting item and construct mismatches in this context. The study contributes practical guidance: investigate dimensionality and error structure before selecting reliability indices; report multiple reliability estimates; and refine or develop context-specific psychometric tools for CS, especially in the Global South. Future work should conduct qualitative pretesting and pilot studies, consider categorical methods for Likert data, explore single-item measures where multi-item coherence is weak, and assess test–retest reliability in more diverse samples to enhance generalizability.
- Primary data collection was designed to study motivations and intentions, not full psychometric validation; thus, other reliability forms (e.g., test-retest) were not assessed due to long intervals between interviews.
- Items were relatively long and Likert anchors may have been complex; questionnaires were not translated into local languages, potentially affecting comprehension.
- Sample sizes were modest, and the control group’s characteristics were similar to those of active CSs, limiting contrast.
- Interview-based data may be influenced by social desirability and interviewer–respondent power dynamics, especially given participants’ interest in remaining in or joining CS networks.
- The GLB can be inflated in small samples and with weak item correlations; interpretation requires caution.