A systematic review of AI literacy scales
T. Lintner
The paper addresses the growing societal integration of AI across domains such as medicine, education, and science, alongside labor market shifts demanding AI skills and the proliferation of deepfakes that affect information processing and public trust. In this context, AI literacy—viewed as an advanced form of digital literacy encompassing understanding, interaction with, and critical evaluation of AI systems and outputs—has become a key concern. Prior conceptual work highlights core aspects of AI literacy (know and understand, use, evaluate, and ethics), with debate around the inclusion of higher-order skills like creating and evaluating AI. Although AI literacy is being integrated into curricula from early childhood through higher education, instrument development for assessing AI literacy has lagged. Purpose: to systematically review and critically assess the quality of AI literacy scales using COSMIN, providing guidance on instrument selection for different contexts and populations. Objectives: (1) provide a comprehensive overview of available AI literacy scales; (2) critically assess their quality; (3) guide researchers on scale selection by quality and suitability.
The paper situates AI literacy within broader digital literacy and educational technology literatures, noting rapid AI adoption in society and education and calls for AI literacy from early childhood to higher education. Conceptualizations commonly include technical understanding, societal impact, and ethics as core competencies, with debate over the inclusion of creation and critical evaluation as core components (e.g., based on Ng et al.’s framework). Existing reviews document growing AI education initiatives but highlight a gap in validated assessment instruments. The review also notes diverse target populations (general public, students, teachers) and overlapping competencies across scales despite differing theoretical sources.
The review followed PRISMA 2020 guidelines and was preregistered on OSF (https://osf.io/tcjaz). Searches were conducted in Scopus and arXiv (initially on January 1, 2024; updated June 18, 2024) without limits on date, type, or stage; Scopus searches were limited to English. Search strings combined AI, literacy/skills/knowledge, and scale/test/questionnaire/survey terms. Additional sources were identified through reference list checks and reverse searches in Scopus and Google Scholar (June 20, 2024). Inclusion criteria: studies that (1) developed or revalidated an AI literacy scale, (2) provided full item lists, (3) described item formulation, (4) described participants, and (5) described validation techniques. Data extracted included authors, publication date, scale type, item count/type, languages, target population, participant characteristics, factor methods/structure, and metrics for quality assessment. Authors were contacted for missing data; published datasets were used when available to compute interpretability and feasibility indicators (e.g., completion time, missing data, floor/ceiling effects). Quality assessment employed COSMIN for measurement properties (content validity, structural validity, internal consistency, cross-cultural validity, measurement invariance, reliability, measurement error, construct validity, responsiveness). Criterion validity was not assessed due to absence of a gold standard. Interpretability (score distributions, missing data, floor/ceiling effects) and feasibility (available languages, completion time) were also evaluated. Evidence was synthesized per property per scale; overall ratings applied COSMIN criteria and the GRADE approach. For revalidated scales, estimates were pooled via random-effects meta-analysis (R metafor package).
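The pooling step for revalidated scales can be illustrated with a short sketch. The review reports using the R metafor package; the Python below is not that code but a plain DerSimonian-Laird random-effects estimator, swapped in for illustration (metafor's own default estimator may differ). The function name `pool_random_effects` and the input values are hypothetical and do not come from the review.

```python
import math


def pool_random_effects(estimates, variances):
    """Pool study-level estimates with a DerSimonian-Laird random-effects model.

    estimates: one effect estimate per study (e.g., a reliability coefficient).
    variances: the sampling variance of each estimate.
    """
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching variances")
    k = len(estimates)

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)

    # Cochran's Q measures how much the studies disagree.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))

    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to each study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "se": se,
        "tau2": tau2,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
    }


if __name__ == "__main__":
    # Hypothetical estimates from an original study and a revalidation.
    print(pool_random_effects([0.80, 0.90], [0.01, 0.01]))
```

When the studies agree closely, the between-study variance is truncated to zero and the result matches a fixed-effect pooled estimate; with only two or three studies per scale, that variance is estimated imprecisely, so pooled intervals should be read with caution.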
- Search results: 5574 records identified; 5560 after deduplication; 59 full texts screened; 22 studies included, covering 16 distinct scales.
- Populations: general population, higher education students, secondary education students, and teachers. Three performance-based scales vs 13 self-report scales.
- Common competencies across scales: technical understanding of AI, societal impact, and ethics; varying inclusion of higher-order skills (creation and critical evaluation) as core.
- Psychometric overview: generally good structural validity and internal consistency. Few scales assessed content validity, reliability, construct validity, or responsiveness. None assessed cross-cultural validity or measurement error. Interpretability indicators (distributions, missing data, floor/ceiling) and feasibility (completion time) were often unreported; raw data were rarely available.
- Scale-specific highlights:
  • AI literacy test (higher education; performance-based): single-factor IRT; limited content validity evidence; high structural validity, internal consistency, construct validity; German and English versions (English not yet revalidated).
  • AI-CI (middle school; performance-based): 20 MCQs; good content and structural validity; high internal consistency and responsiveness; English.
  • AILQ (secondary; self-report): four-factor model (affective, behavioral, cognitive, ethical); moderate content validity; high structural validity and internal consistency; very low evidence on responsiveness; English.
  • AILS (general population; self-report): four factors (awareness, use, evaluation, ethics); high structural validity and internal consistency; low evidence for reliability; very low content validity; revalidated in Turkish (no cross-cultural validation).
  • AISES (general population; self-report): high structural validity and internal consistency; content validation not performed; English.
  • ChatGPT literacy (college; self-report): Delphi-based; good content validity; high structural validity, internal consistency, construct validity; English.
  • GSE-6AI (medical students; self-report): brief 6-item AI self-efficacy; high structural validity, internal consistency, gender invariance; no content validation; Spanish and English.
  • Hwang et al. (college; self-report): six-factor scale on AI-era digital literacy; high structural validity and internal consistency; no content validation; English.
  • Intelligent TPACK (teachers; self-report): five-factor including ethics; high structural validity and internal consistency; no content validation; English.
  • Kim & Lee (middle school; self-report): six factors; limited content validity; high structural validity, internal consistency, construct validity; Korean.
  • MAILS (general population; self-report): modular; strong structural validity, internal consistency, construct validity; evidence of minimal floor/ceiling effects; feasible completion (<20 min); content validation not performed; German and English (English not revalidated).
  • MAIRS-MS (medical students; self-report): four-factor; limited content validity; high structural validity, internal consistency, gender invariance; Turkish and Persian (no cross-cultural validation).
  • Pinski & Benlian (general; self-report): limited content and structural validity due to small N; medium internal consistency; English.
  • SAIL4ALL (general; performance-based and Likert formats with four subscales): mixed structural validity and internal consistency; “What can AI do?” subscale poor; ceiling effects indicated; English; completion ~9 min.
  • SNAIL (general and students; self-report): three-factor (technical understanding, critical appraisal, practical application); high structural validity, internal consistency; limited reliability and responsiveness; content validation not performed; possible floor effects; quick completion (~6–8 min); English, German, Turkish.
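Two indicators that recur in the findings above, internal consistency and floor/ceiling effects, can be computed directly from item-level responses when raw data are available. The following is a minimal Python sketch under stated assumptions, not the review's own procedure: it uses Cronbach's alpha as one common internal-consistency statistic (the summary does not say which statistic each primary study reported), and the function names and example data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items table.

    item_scores: list of respondents, each a list of numeric item scores.
    """
    n_items = len(item_scores[0])
    if n_items < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [
        variance([row[j] for row in item_scores]) for j in range(n_items)
    ]
    # Sample variance of the total (sum) score.
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def floor_ceiling(total_scores, min_score, max_score):
    """Return the share of respondents at the lowest and highest possible score."""
    n = len(total_scores)
    floor = sum(s == min_score for s in total_scores) / n
    ceiling = sum(s == max_score for s in total_scores) / n
    return floor, ceiling


if __name__ == "__main__":
    # Hypothetical responses: 4 respondents, 3 Likert items scored 1-5.
    responses = [[1, 2, 1], [3, 3, 4], [5, 5, 5], [5, 5, 5]]
    totals = [sum(row) for row in responses]
    print(cronbach_alpha(responses))
    print(floor_ceiling(totals, min_score=3, max_score=15))
```

A widely used rule of thumb treats a floor or ceiling effect as present when more than 15% of respondents score at the extreme; that threshold is an assumption here and is not stated in this summary.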
The review met its objectives by cataloging 16 AI literacy scales across 22 validation studies and appraising their psychometric quality. Findings show solid evidence for structural validity and internal consistency across many instruments, but widespread gaps in content validation with target populations, reliability, and responsiveness, and a complete absence of cross-cultural validity and measurement error assessments. The lack of interpretability data and open datasets further limits confidence and replicability. Recommendations by population: (1) General population: AILS presents the most robust evidence (some content validity, reliability, strong structural validity and internal consistency); SNAIL and MAILS are promising but require content validation, along with investigation of floor effects (SNAIL) and broader validation (MAILS). Pinski & Benlian’s instrument needs larger-sample validation; SAIL4ALL offers a performance-based option but with mixed psychometrics. (2) Higher education: the AI literacy test and the ChatGPT literacy scale show relatively stronger evidence (including content validity); MAIRS-MS is relevant for medical students; GSE-6AI, Hwang et al., and Chan & Zhou’s EVT subscale are promising pending content validation. (3) Secondary education: AI-CI and AILQ have stronger evidence bases, including responsiveness; Kim & Lee’s instrument is also supported but currently available only in Korean. (4) Teachers: Intelligent TPACK is the only instrument to assess AI-related pedagogical readiness, though content validation is pending. Overall, addressing the identified psychometric gaps (especially content validity with target populations, cross-cultural validity, reliability, and measurement error) will enhance the field’s capacity to measure AI literacy reliably across contexts.
This systematic review consolidates and evaluates the current landscape of AI literacy scales, providing practical guidance for researchers and educators selecting instruments by population and use case. The main contributions include: (1) a comprehensive inventory of 16 scales and their characteristics; (2) a COSMIN- and GRADE-based quality appraisal revealing strong structural validity/internal consistency but limited evidence on content validity, reliability, responsiveness, and no assessments of cross-cultural validity or measurement error; (3) recommendations for scale selection across populations. Future work should prioritize: content validation with target groups; cross-cultural validation and measurement error; routine reporting of interpretability and feasibility indicators; open data to support replication; development of additional performance-based instruments beyond higher education; and cross-validation between self-report and performance-based measures. Periodic updates will be necessary as AI and AI literacy evolve.
The review was conducted by a single author, introducing potential bias in quality assessment despite using standardized COSMIN criteria. The search was limited to Scopus and arXiv, which may have missed relevant scales in grey literature; however, reverse searches and reference list screening were used to mitigate this risk. Many primary studies lacked open data and key interpretability/feasibility indicators, constraining secondary analyses and possibly affecting the precision of quality judgments.