
A genome-wide association study of Chinese and English language phenotypes in Hong Kong Chinese children
Y. Lin, Y. Shi, et al.
This genome-wide association study examines the genetic basis of Chinese and English language abilities in 1046 bilingual children from Hong Kong. It reports genome-wide significant loci for several language skills and polygenic overlap with educational attainment and cognitive performance, most notably for English literacy.
Introduction
Literacy and language skills are crucial for children’s academic development and life outcomes. These complex traits are influenced by both environmental and genetic factors and show substantial heritability and polygenicity. Prior family and twin studies indicate strong heritable components, but the specific variants and genes remain unclear because the phenotypes are complex and sample sizes have been limited. Most genome-wide association studies (GWAS) to date have focused on European-ancestry cohorts, often on dyslexia as a binary outcome or on a limited set of reading/language phenotypes. This focus may obscure domain-specific biology and limit generalizability across languages with different orthographic and linguistic features. GWAS of Chinese-language phenotypes and of English as a second language (ESL) skills in Chinese populations are scarce. Given possible differences in the cognitive and neurobiological underpinnings of Chinese versus alphabetic literacy, dedicated studies in Chinese populations are needed. The present study aims to systematically investigate the genetic architecture of 34 Chinese and English (ESL) literacy- and language-related phenotypes in Cantonese-speaking Hong Kong children using SNP-, gene-, and pathway-level analyses, transcriptome-informed methods, and polygenic risk score (PRS) analyses of overlap with neuropsychiatric and cognitive/educational traits.
Literature Review
Previous GWAS on reading and language abilities in Europeans have identified associated loci for dyslexia and related skills using both case-control and quantitative designs. However, many focused on single binary outcomes or a narrow set of phenotypes (e.g., rapid naming, word reading), limiting insight into the breadth of language domains. Cross-ancestry transferability is uncertain because allele frequencies and LD patterns differ across populations, and some variants may be easier to detect in non-European populations. Few GWAS have examined Chinese-language phenotypes directly or English as a second language in Chinese children. A large dyslexia GWAS (Doust et al.) reported multiple loci with some replication in Chinese reading measures, but the discovery samples were predominantly European. Eising et al. (GenLang) meta-analyzed up to about 34,000 individuals on English reading/language skills and reported genome-wide signals for word reading. A recent Chinese GWAS (Wang et al.) examined reading abilities but covered fewer phenotypes. This study extends the field by comprehensively profiling Chinese-language phenotypes alongside ESL measures in a Chinese cohort to assess both novel associations and genetic overlaps with prior European-based findings.
Methodology
Participants: 1048 typically developing Hong Kong Chinese-English bilingual children aged 5–12 years (274 MZ twins [137 pairs], 350 DZ twins [175 pairs], 424 singletons). Cantonese was the first language and English the second. Zygosity was confirmed via STR markers (QF-PCR). Assessments were administered in lab, school, or home settings by trained assistants. A total of 34 literacy/language-related phenotypes were analyzed, covering Chinese and English skills (see Table 10 in paper for detailed definitions). Except for rapid naming measures, higher scores indicate better performance.
Genotyping and QC: Samples were genotyped on the Illumina Human Infinium OmniZhongHua-8 v1.3 array. QC in PLINK 1.9 removed SNPs with HWE P<1e-5, MAF<1%, or marker missingness >10%, and individuals with genotype missingness >10%. After QC, 911,178 SNPs and 1046 individuals remained. Imputation was conducted on the Michigan Imputation Server using the 1000 Genomes Phase 3 v5 reference; variants with imputation quality (Rsq) >0.3 were retained (12,475,316 SNPs). Rsq is reported alongside the association results.
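The QC thresholds above can be written as simple predicates. The following is a minimal sketch, not PLINK's implementation: the `VariantStats` record and both function names are hypothetical, and it assumes the per-variant and per-sample summary statistics have already been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Hypothetical per-variant summary used only for this illustration."""
    snp_id: str
    maf: float           # minor allele frequency
    hwe_p: float         # Hardy-Weinberg equilibrium test p-value
    missing_rate: float  # fraction of samples with a missing genotype


def passes_variant_qc(v, min_maf=0.01, min_hwe_p=1e-5, max_missing=0.10):
    """Keep a variant unless MAF<1%, HWE P<1e-5, or missingness >10%."""
    return (v.maf >= min_maf
            and v.hwe_p >= min_hwe_p
            and v.missing_rate <= max_missing)


def passes_sample_qc(sample_missing_rate, max_missing=0.10):
    """Keep an individual unless more than 10% of genotypes are missing."""
    return sample_missing_rate <= max_missing
```

Values exactly at a threshold are kept here, because the text describes removal only for values strictly beyond each cutoff.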
Phenotype processing: Both untransformed and rank-based inverse normal transformed (RINT) phenotypes were examined. QQ-plots were inspected; four traits (BDS_Total, CVA_Total, CDRAN_Mean, EDRAN_Mean) showed improved calibration after RINT and were analyzed in transformed form; others used untransformed values. Lambda-GC suggested no inflation (largest lambda-GC ≈ 1.0255; 29/34 traits <1.02).
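The two quantities in this step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the Blom offset `c = 3/8` is a common convention that the summary does not specify, and lambda-GC is computed with the usual definition (median observed 1-df chi-square statistic divided by its expected median under the null). Function names are hypothetical.

```python
from statistics import NormalDist, median


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform (RINT).

    Ties receive their average rank. The offset c=3/8 (Blom) is an
    assumption; the source does not state which offset was used.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


def lambda_gc(p_values):
    """Genomic inflation factor from two-sided p-values in (0, 1]."""
    nd = NormalDist()
    # Convert each p-value to a 1-df chi-square statistic (squared z-score).
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.75) ** 2  # about 0.455 for 1 df
    return median(chi2) / expected_median
```

A lambda-GC close to 1, as reported for all 34 traits, indicates that the bulk of the test statistics follows the null distribution.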
GWAS: Univariate linear mixed models (GEMMA) were fitted per phenotype, with age and sex as fixed covariates and a GRM random effect to account for relatedness and population stratification. SNP-based tests used MAF≥0.05; genome-wide significance threshold p<5e-8. Independent loci were defined by LD-clumping in PLINK (r²=0.01, 1000 kb window), using 1000 Genomes East Asian reference; additionally, lead SNPs required Rsq>0.3 and at least two correlated SNPs (r²>0.5) with p<1e-3.
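The idea behind LD-clumping into independent loci can be illustrated with a simplified greedy procedure. This is a sketch only and does not reproduce PLINK's `--clump` behavior in detail: the tuple layout, the `r2_lookup` dictionary, and the function name are hypothetical, and pairwise r² values are assumed to come from a reference panel.

```python
def clump(snps, r2_lookup, r2_threshold=0.01, window_bp=1_000_000, p_max=5e-8):
    """Greedy clumping sketch.

    snps: list of (snp_id, chrom, pos_bp, p_value) tuples.
    r2_lookup: dict mapping frozenset({id_a, id_b}) -> r^2 (hypothetical
        precomputed LD; pairs that are absent are treated as r^2 = 0).
    Returns the lead SNP ids, most significant first.
    """
    ordered = sorted(snps, key=lambda s: s[3])
    assigned = set()
    leads = []
    for snp_id, chrom, pos, p in ordered:
        if snp_id in assigned or p > p_max:
            continue
        leads.append(snp_id)
        assigned.add(snp_id)
        # Absorb nearby SNPs in LD with this lead into its clump.
        for other_id, o_chrom, o_pos, _ in ordered:
            if other_id in assigned:
                continue
            if o_chrom != chrom or abs(o_pos - pos) > window_bp:
                continue
            if r2_lookup.get(frozenset((snp_id, other_id)), 0.0) >= r2_threshold:
                assigned.add(other_id)
    return leads
```

With the very low r² threshold of 0.01 used for loci here, almost any SNP correlated with a lead within 1000 kb is folded into that lead's locus, which gives a conservative count of independent signals.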
Gene- and transcriptome-level analyses: MAGMA v1.06 aggregated SNP associations into gene-level statistics; FDR control per trait was applied (FDR<0.05 significant; 0.05–0.2 suggestive). TWAS used S-PrediXcan (GTEx v7 elastic net weights; 13 brain tissues) with FDR per trait across tissues; S-MultiXcan combined across tissues to increase power and identified the best representative tissue per gene.
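The per-trait FDR labeling used for gene-level results can be sketched as follows. The summary does not name the FDR procedure, so the Benjamini-Hochberg step-up method is an assumption here, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values are monotone in the original p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def fdr_label(q):
    """Apply the reporting bands described in the text."""
    if q < 0.05:
        return "significant"
    if q < 0.2:
        return "suggestive"
    return "not significant"
```

Applying the correction separately to each trait, as described, controls the FDR within a trait but not across all 34 traits jointly.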
Pathway analysis: GAUSS subset-based gene-set testing was performed on MSigDB canonical pathways and GO gene sets (C2, C5). Significant results (generally FDR<0.05 or <0.2 as noted) and core subsets of driving genes were reported.
Polygenic risk score (PRS) analyses: PRS were constructed from external GWAS of educational attainment (EA; N≈1.13M), cognitive performance (CP; N≈258k), autism spectrum disorders (ASD), ADHD, schizophrenia (SCZ), bipolar disorder (BP), and major depressive disorder (MDD), using clumping+thresholding in PLINK across 11 p-value thresholds (LD clumping r²=0.1, 1000 kb). Associations with the 34 phenotypes were tested using GEMMA linear mixed models adjusted for age and sex, with a GRM random effect; FDR correction was stratified by exposure-outcome pair. SBayesR (Bayesian multiple regression on summary statistics) was also used as an alternative PRS method that does not require p-value thresholds.
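The scoring step of clumping+thresholding can be sketched as a weighted sum of effect-allele dosages over SNPs that pass a p-value threshold. This is a minimal illustration, not PLINK's scoring routine: the function name, data layout, SNP ids, and numbers are hypothetical, and the SNP list is assumed to have been LD-clumped already.

```python
def polygenic_score(dosages, weights, p_threshold):
    """Score one individual at one p-value threshold.

    dosages: dict snp_id -> effect-allele dosage (0 to 2) for the individual.
    weights: dict snp_id -> (beta, p_value) from the external training GWAS,
        assumed to contain only LD-clumped SNPs.
    Returns (score, number_of_snps_used).
    """
    score = 0.0
    n_used = 0
    for snp_id, (beta, p_value) in weights.items():
        if p_value <= p_threshold and snp_id in dosages:
            score += beta * dosages[snp_id]
            n_used += 1
    return score, n_used


# Hypothetical toy data, for illustration only.
toy_weights = {"rsA": (0.10, 1e-8), "rsB": (-0.05, 0.003), "rsC": (0.02, 0.4)}
toy_dosages = {"rsA": 2.0, "rsB": 1.0, "rsC": 0.5}
scores_by_threshold = {
    t: polygenic_score(toy_dosages, toy_weights, t) for t in (5e-8, 0.01, 1.0)
}
```

Repeating the score over a grid of thresholds (11 in the study; the specific values are not given in this summary) is what creates the multiple-testing burden that the FDR correction addresses.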
Genetic overlap with external studies: Due to limited summary data from Doust et al. (top 10,000 SNPs; p≈1.31e-6 after clumping) and Wang et al. (top SNPs p<1e-5), SNP-set and gene-set overlap tests (Simes, ACAT) were conducted by extracting top SNPs/genes from those studies and testing enrichment in the HK data across the 34 phenotypes. For GenLang (Eising et al.), full GWAS summary statistics allowed genome-wide dependence testing using Hoeffding’s test (implemented in R package independence) across HK traits versus five GenLang traits (word reading, non-word reading, spelling, phoneme awareness, non-word repetition), after LD-clumping. LDSC was attempted but yielded negative heritability estimates given modest sample size, so LDSC-based genetic correlations were not reported. Exploratory correlations of effect sizes for top SNPs (p≤1e-5) between HK and GenLang were computed (Pearson/Spearman) with caveats.
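The two p-value combination rules named above for the SNP-set and gene-set overlap tests have compact standard forms. The sketch below implements the Simes combined p-value and the ACAT (Cauchy combination) p-value; equal weights are an assumption, the function names are hypothetical, and extremely small p-values (below roughly 1e-15) would need a more careful numerical treatment than this plain formula.

```python
import math


def simes(p_values):
    """Simes combined p-value: min over i of m * p_(i) / i."""
    m = len(p_values)
    ordered = sorted(p_values)
    return min(1.0, min(m * p / (i + 1) for i, p in enumerate(ordered)))


def acat(p_values, weights=None):
    """ACAT (Cauchy combination) p-value for p-values strictly in (0, 1).

    Each p-value is mapped to a standard Cauchy variate, the variates are
    averaged with the given weights, and the average is converted back to
    a p-value using the Cauchy tail.
    """
    m = len(p_values)
    if weights is None:
        weights = [1.0 / m] * m  # equal weights (an assumption)
    total_weight = sum(weights)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    ) / total_weight
    return 0.5 - math.atan(statistic) / math.pi
```

Both rules remain usable when the combined p-values are correlated, which matters here because top SNPs drawn from an external study can still be in partial LD.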
Key Findings
- SNP-based GWAS: Five independent loci (LD-clumped at r²=0.01; MAF>0.05; Rsq>0.3; with ≥2 correlated SNPs r²>0.5 at p<1e-3) reached genome-wide significance (p<5e-8). Traits implicated included Chinese vocabulary, character and word reading, digit rapid naming, and English lexical decision. Notably:
• rs6905617 near MANEA/MANEA-AS1 on chr6 (p=3.29×10^-9) associated with English Lexical Decision (ELD_Total).
• rs56024259 near SLC46A2 (chr9) with Chinese Character Reading (CCR_Total).
• rs3847795 near PLXNC1 (chr12) with Chinese Digit Rapid Naming (CDRAN_Mean).
• rs4865143 near HOPX (chr4) with Chinese Vocabulary—Expressive (CVB_Total) and Chinese Word Reading (CWR_Total) (r between traits=0.63).
Genomic inflation was minimal (largest lambda-GC ≈1.0255 across untransformed traits).
- TWAS:
• S-PrediXcan: DUS3L showed significant associations (FDR<0.05) with English Word Reading (EWR_Total) in amygdala, caudate, putamen, and cerebellar hemisphere.
• S-MultiXcan: 248 significant gene-level associations (FDR<0.05). The strongest was HSD3B7 with English Vocabulary—Receptive (EVA_Total) (best tissue: cortex; FDR-adjusted p=9.55×10^-20). HSD3B7 also associated with EVB, EVK, EVD, EDRAN, and EWR. For Chinese skills, SEMA6C was most significant with CVB_Total (best tissue: cerebellar hemisphere; FDR-adjusted p=2.77×10^-12).
- MAGMA gene-based tests (top examples; FDR-corrected p):
• KCNC1 with PureC_Total (FDR=1.49×10^-5).
• CATSPERD with EWR_Total (FDR=2.22×10^-3) and EVB_Total (FDR=4.40×10^-3).
• SLC2A12 and RSPH1 with EIS_Total (FDR=2.27×10^-3 and 2.49×10^-3).
• GTF3C1 with CVD_Total (FDR=5.90×10^-4) and CVK_Total (FDR=3.30×10^-3).
• MAPK10 with CVB_Total (FDR=3.20×10^-3).
• SMKR1 and RFX8 with MS_Total (both FDR=3.25×10^-3).
- Pathway/GO enrichment (GAUSS):
• Pathways (FDR<0.05): Reactome RNA polymerase III transcription (WO_Total; FDR=1.60×10^-4), BioCarta Deregulation of CDK5 in Alzheimer’s Disease (EWR_Total; FDR=1.62×10^-3), Reactome P2Y receptors (CVK_Total and CVD_Total; FDR≈1.88×10^-3 and 2.50×10^-2), Reactome Kinesins (BDS_Total; FDR=3.37×10^-3).
• GO sets (FDR<0.05): sphingolipid-mediated signaling pathway (CDICT_Total; FDR=4.07×10^-5), glycerophospholipid catabolic process (PureC_Total; FDR=3.78×10^-4), proton-transporting V-type ATPase complex (CWR_Norm; FDR=7.13×10^-4), and ATP hydrolysis coupled transmembrane transport (EWR_Total; FDR=1.31×10^-2), among others.
- PRS overlap with external traits:
• EA and CP PRS showed the most consistent positive associations across many of the 34 traits (EA significant for 20/34; CP for 16/34 at FDR≤0.1 using clumping+thresholding), with particularly strong associations for English literacy phenotypes. SBayesR yielded broadly consistent patterns.
• Using GenLang (Eising et al.) training GWAS, PRS for non-word reading showed significant positive associations with multiple HK Chinese language traits (e.g., Chinese word reading, discourse skills, morphosyntax, reading comprehension, word order). PRS for spelling associated with English vocabulary knowledge. Dyslexia PRS showed largely concordant effect directions (higher dyslexia PRS → poorer reading), though few FDR-significant associations given limitations of available SNPs.
• Some positive associations were observed between ASD PRS and language/literacy measures (notably with SBayesR), aligning with prior reports of positive genetic correlations between ASD and EA/CP, but warranting replication.
- Genetic dependence with GenLang: Hoeffding’s test indicated significant dependence (FDR<0.1) between multiple HK traits and GenLang word reading; overall, 22 of the 170 tested trait pairs (34 HK traits × 5 GenLang traits) had FDR<0.1. The strongest signals with word reading were DS_Total (scaled statistic 6.382; FDR=4.47×10^-4), EIS_Total (6.199; FDR=4.47×10^-4), and CVA_Total (5.853; FDR=4.88×10^-4).
- Biological context of implicated loci: Lead SNPs mapped near genes previously tied to cognitive/educational and neuropsychiatric phenotypes, including MANEA/MANEA-AS1 and PLXNC1; additional suggestive genes included STXBP6 and NRXN3 for Chinese vocabulary/reading measures.
Discussion
This study addresses a major gap by interrogating genetic contributions to a wide range of Chinese and ESL literacy/language phenotypes in a Chinese bilingual cohort. The discovery of five genome-wide significant loci spanning both Chinese and English measures, along with significant gene- and pathway-level associations and TWAS signals, suggests underlying biological mechanisms involving synaptic/neuronal processes, membrane transport, and transcriptional regulation (e.g., HSD3B7, SEMA6C, KCNC1, SLC2A12, GTF3C1, MAPK10; pathways like RNA Pol III transcription, CDK5 dysregulation, P2Y signaling). PRS analyses demonstrate substantial shared polygenicity between literacy/language skills and EA/CP, particularly for ESL measures, suggesting that cognitive/educational genetic architecture overlaps more strongly with English literacy acquisition in this bilingual context. Significant genetic dependence between HK traits and GenLang word reading further supports cross-study overlap in genetic signals, despite differences in age ranges, ancestries, and phenotypic definitions. Together, these findings advance understanding of the polygenic basis of Chinese and English literacy skills, highlight novel candidate genes and pathways, and indicate both shared and potentially language-specific genetic influences.
Conclusion
This work represents one of the first comprehensive GWAS of both Chinese- and ESL-related literacy/language phenotypes in a Chinese cohort. It identifies multiple genome-wide significant loci, highlights candidate genes (e.g., HSD3B7, KCNC1, SLC2A12, MAPK10, SEMA6C) and pathways, and demonstrates robust polygenic overlap with educational attainment and cognitive performance, especially for English-language skills. Evidence of genetic dependence with large external reading GWAS (GenLang) underscores shared biology across cohorts. Future directions include replication in larger and diverse Chinese cohorts, fine-mapping and functional validation, expansion to rare variant and sequencing analyses, improved cross-ancestry PRS, and application of methods like LDSC in adequately powered samples to estimate heritability and genetic correlations. Such advances may ultimately inform early identification and targeted interventions for language and reading difficulties.
Limitations
- Modest sample size (N=1046) limits power to detect small-effect variants and precluded reliable LDSC-based heritability/correlation estimates (negative h2 observed).
- Single population context (Hong Kong Cantonese-speaking bilingual children) may limit generalizability across Chinese dialects, other ESL contexts, and non-Chinese populations.
- PRS training GWAS were largely of European ancestry; limited cross-ancestry portability can attenuate predictive performance and effect estimates.
- Imputation relied on 1000 Genomes Phase 3 reference; Chinese-specific panels were not broadly available at analysis time (though imputation quality metrics were applied).
- Focused on common variants; rare variant effects were not assessed and warrant sequencing studies.
- Some external datasets provided only top SNPs/genes, limiting replication/overlap power.
- Statistical associations (SNP/gene/pathway/TWAS/PRS) require independent replication and functional validation; false positives cannot be excluded.