Agriculture
Multi-generation genomic prediction of maize yield using parametric and non-parametric sparse selection indices
M. Lopez-Cruz, Y. Beyene, et al.
Genomic Selection (GS) has been widely adopted across livestock and crop breeding, supported by increasingly large genomic and phenotypic datasets. However, as training datasets grow across generations, genetic heterogeneity rises, potentially reducing prediction accuracy due to differences in allele frequencies and linkage disequilibrium (LD) between training and target populations. Evidence shows that including distantly related or older generations may not always improve, and can sometimes reduce, prediction accuracy, raising the question of which data to include when retraining models each cycle. The study investigates whether selecting custom, individual-specific training subsets via a sparse selection index (SSI) can improve accuracy over standard genomic-BLUP (GBLUP), and whether combining SSI with non-linear kernel methods (RKHS/Gaussian kernels) enhances prediction when using multi-generation maize data. The central hypothesis is that per-individual, sparse, locally weighted training sets will improve predictive accuracy in heterogeneous, multi-generational contexts, and that kernel methods combined with sparsity can further enhance performance.
Prior work highlights the strong influence of family relationships on genomic prediction accuracy and suggests diminishing or negative returns from adding genetically distant individuals to training sets. Studies in broilers and maize have shown higher accuracy using recent generations or related biparental families. Approaches to handling heterogeneity include explicit modeling of group-specific SNP effects and multivariate or SNP-by-group interaction models, which are effective when populations can be partitioned into clear groups. Training set optimization methods (from threshold-based rules to algorithms minimizing prediction error variance or maximizing reliability) generally assume a single optimal training set for all prediction candidates, which may be suboptimal when different individuals benefit from different reference subsets. RKHS regression with Gaussian kernels has often outperformed linear additive models (GBLUP) by enabling local, distance-dependent covariance structures controlled by the kernel bandwidth. The SSI method introduces an L1 penalty into selection index theory, enabling per-individual sparse training subset selection; it has previously outperformed GBLUP in wheat by 5–10%.
Data: 3,722 doubled haploid (DH) maize lines from 54 biparental families developed at CIMMYT (KALRO Kiboko, Kenya) across 2017–2020. Some parents were shared across years, creating half-sib connections. Selected DH lines advanced to multi-location yield trials: 923 lines (14 trials) in 2017, 1,423 (34) in 2018, 722 (17) in 2019, and 654 (13) in 2020. Trials used alpha-lattice designs with two replicates, grown at two optimal (well-watered) locations and one managed-drought location per year. Agronomic management included a target plant density of 53,333 plants ha−1 and standard fertilization. Traits recorded: grain yield (GY, t ha−1, adjusted to 12.5% moisture), anthesis date (AD, days), and plant height (PH, cm).
Genotyping: Leaf tissue from all 3,722 lines was genotyped using rAmpSeq (Buckler et al., 2016). From 5,465 dominant (0/1) markers, segregation-distortion filtering (5% FDR) removed 61 markers; minor allele frequency filtering (MAF < 0.05) yielded 4,612 markers for analysis.
Phenotype preprocessing: Mixed models were fitted per trait–environment–year. For the optimal environment (two locations), BLUEs were computed within year across locations using META-R, with fixed effects for location, replicate (nested within location), genotype, and genotype×location, and random incomplete blocks. Adjusted phenotypes were obtained by subtracting the estimated nuisance effects. For drought (single location), a linear model with fixed replicate and genotype effects and random incomplete blocks was used; adjusted phenotypes were computed analogously. After QC, n = 3,527 lines with both markers and phenotypes remained: n2017 = 901, n2018 = 1,418, n2019 = 722, n2020 = 486.
Models:
- GBLUP: y = u + e with u ~ N(0, σ²_a G), e ~ N(0, σ²_e I), where G is computed from centered and scaled markers Z (VanRaden method). Predictions for the PS are û_PS = B_G y_TS with B_G = G_PS,TS (G_TS + λ0 I)^(−1) and λ0 = σ²_e/σ²_a.
- RKHS (KBLUP): Replace G with Gaussian kernels Kθ, K_ij = exp(−θ d^2_ij), where d^2_ij is scaled squared Euclidean marker distance. Three single-kernel bandwidths: θ1=0.2 (K1), θ2=1 (K2), θ3=5 (K3). Also kernel averaging (KA) with three random effects using K1, K2, K3.
- Sparse Selection Index (SSI): For each individual i in the PS, solve the L1-penalized selection-index problem b_i(λ) = argmin_b { ½ b′(G_TS + λ0 I)b − b′g_i + λ ||b||_1 }, where g_i is the vector of genomic relationships of i to the TS; the kernel version (KSSI) replaces G with K. Coordinate descent is used for optimization, and λ is tuned by cross-validation within the TS. The sparse hat matrix B(λ) has rows b_i(λ).
Variance components: For each trait–environment–TS combination, Bayesian models (BGLR R package) estimated σ²_a (additive or kernel-specific) and σ²_e; for KA, kernel-specific variances were summed to obtain the total genetic variance, and the combined kernel K_A = K1 + K2 + K3 was used for the BLUP form. Proportion of variance explained: h² = σ²_a/(σ²_a + σ²_e).
Prediction accuracy assessment: The target prediction set was cycle 2020; scenarios:
- Random 85/15 split within 2020 (413 TS, 73 PS) for baseline.
- Predict 85% set of 2020 (n=413) using prior years as TS: 2017 (901), 2018 (1,418), 2019 (722), 2018+2019 (2,140), 2017+2018+2019 (3,041).
- Augment each prior-year TS by adding 5% (25), 10% (49), or 15% (73) of 2020 individuals, yielding TS sizes per Table 1 (e.g., 2017-2019 + 15% → 3,114). Each scenario repeated 100 times with different random 2020 partitions. Accuracy = cor(y_PS, ŷ_PS). Hyperparameters: λ by 10-fold CV within TS; variance components estimated on TS only. Software: R; BGLR for variance components; SFSI package for SSI computation; multi-core parallelization supported.
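The BLUP prediction equations and Gaussian kernels described above can be sketched as follows. This is a minimal NumPy illustration on synthetic markers, not the study's pipeline: variance components (and hence λ0) are assumed known here, whereas the study estimates them with BGLR, and the VanRaden-style G is computed from simulated 0/1 markers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: n lines, p dominant 0/1 markers.
n, p = 200, 500
X = rng.integers(0, 2, size=(n, p)).astype(float)

# VanRaden-style relationship matrix from centered/scaled markers Z.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
G = Z @ Z.T / p

def gaussian_kernel(Z, theta):
    """K_ij = exp(-theta * d2_ij), d2 = scaled squared Euclidean distance."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    d2 = np.maximum(d2, 0) / d2.mean()  # scale squared distances by their mean
    return np.exp(-theta * d2)

K1 = gaussian_kernel(Z, 0.2)  # the study used bandwidths theta = 0.2, 1, 5

# Split into training (TS) and prediction (PS) sets; toy phenotype.
ts, ps = np.arange(0, 150), np.arange(150, 200)
y = Z @ rng.normal(0, 0.1, p) + rng.normal(0, 0.5, n)

def blup_predict(Kmat, y_ts, ts, ps, lam0):
    """u_PS = K_PS,TS (K_TS + lam0*I)^(-1) y_TS."""
    B = Kmat[np.ix_(ps, ts)] @ np.linalg.inv(
        Kmat[np.ix_(ts, ts)] + lam0 * np.eye(len(ts)))
    return B @ y_ts

lam0 = 0.5  # sigma2_e / sigma2_a, assumed known in this sketch
u_gblup = blup_predict(G, y[ts], ts, ps, lam0)
u_kblup = blup_predict(K1, y[ts], ts, ps, lam0)

acc = np.corrcoef(y[ps], u_gblup)[0, 1]  # accuracy = cor(y_PS, u_PS)
```

Swapping `G` for a kernel matrix turns GBLUP into KBLUP, mirroring how the study reuses the same BLUP form across relationship structures.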
- Kernel vs. linear: KBLUP (Gaussian kernels, especially K1, K2, KA) generally outperformed GBLUP by 1–15% (≈0.01–0.06 correlation points) across scenarios; KA provided robust performance comparable to the best single kernel.
- SSI benefits: Sparse models (GSSI and KSSI) increased accuracy relative to their non-sparse counterparts by ≈0.02–0.08 points when any 2020 data were included in TS. Using SSI with additive relationships (GSSI) led to 5–17% increases versus GBLUP, depending on TS composition.
- Training set composition: Using only the most recent prior generation (2019) as TS yielded the highest accuracies when 2020 was not included. Adding older generations (2018+2019 or 2017–2019) reduced accuracy by ≈5–30% (e.g., GBLUP reduction of 0.01–0.08 points vs. 2019 alone). Similar reductions observed under drought.
- Importance of contemporaries: Including small proportions of same-cycle (2020) individuals in the TS sharply increased accuracy across models. For GBLUP, adding 5% (25 individuals) raised accuracy by ≈88% to well over 100% in some single-year TS scenarios (e.g., 2017: 0.02→0.30; 2018: 0.08→0.32; 2019: 0.18→0.34). Gains were even larger when adding 15% (73 individuals).
- Low-accuracy cases: When no 2020 data were in TS (lowest baseline accuracy), standard KBLUP sometimes matched or underperformed GBLUP by ~0.01–0.03, while SSI variants still achieved gains of 0.01–0.08 in many cases.
- Kernel bandwidth and sparsity: For highly local kernels (large bandwidth K3), adding sparsity did not systematically improve accuracy; optimal λ often zero. For K1, K2, and KA, sparsity improved accuracy in ≥80% of partitions.
- Variance explained vs. predictive accuracy: RKHS models showed higher fitted “heritability” in training (≈0.20–0.35 higher than GBLUP), but only modest gains in prediction accuracy, suggesting potential overfitting or increased model complexity.
- Automatic selection behavior: SSI selected custom support sets per prediction genotype, prioritizing more closely related individuals (especially from 2020). Typical GSSI support sizes corresponded to 4–42% of TS for G; 9–52% (K1) and 20–62% (K2); KA similar to K2. With 2017–2019 + 15% 2020, GSSI used on average 322 of 3,114 training genotypes (~10%). As more 2020 individuals were added to TS, fewer prior-generation individuals appeared in SSI supports.
- Computational: Per-individual SSI timings (with TS=3,114) averaged ~0.11 s (GSSI) and ~2.01 s (KASSI); fully parallelizable across prediction genotypes.
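The per-individual selection behavior reported above follows from the L1-penalized index. A hypothetical Python sketch (not the SFSI R implementation) illustrates coordinate descent on the penalized index ½ b′Pb − b′g_i + λ||b||_1 with P = G_TS + λ0 I, using a small synthetic relationship matrix; the `ssi_weights` and `soft` helpers are invented for this illustration. Nonzero entries of b define the custom support set for one prediction candidate.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator for the L1 penalty."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def ssi_weights(P, g, lam, n_iter=200):
    """Coordinate descent for b = argmin 1/2 b'Pb - b'g + lam*||b||_1,
    where P = G_TS + lam0*I and g holds the candidate-to-TS relationships."""
    b = np.zeros(len(g))
    for _ in range(n_iter):
        for j in range(len(g)):
            # Partial residual excluding coordinate j, then thresholded update.
            r_j = g[j] - P[j] @ b + P[j, j] * b[j]
            b[j] = soft(r_j, lam) / P[j, j]
    return b

# Toy data: relationship matrix from standardized synthetic markers.
rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 100))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
G = Z @ Z.T / 100

P = G[:20, :20] + 0.5 * np.eye(20)  # TS block plus lam0*I (lam0 = 0.5 assumed)
g_i = G[20, :20]                    # relationships of candidate i to the TS

b_dense = ssi_weights(P, g_i, lam=0.0)    # lam = 0 recovers the BLUP weights
b_sparse = ssi_weights(P, g_i, lam=0.05)  # weights of distant lines shrink to 0

y_ts = rng.normal(size=20)
u_i = b_sparse @ y_ts  # predicted value of candidate i from its support set
```

With λ = 0 the update converges to b = P⁻¹g_i, i.e., row i of the GBLUP hat matrix; increasing λ trades a little index accuracy for sparsity, which is what lets the SSI drop weakly informative training individuals per candidate.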
The study addresses whether all available multi-generation data should be used to train genomic prediction models or whether individualized, sparse selections can improve accuracy. Results show that training exclusively on the most recent related generation is often superior to pooling older, heterogeneous data, and that adding a small number of contemporaries markedly boosts accuracy. This underscores the dominant role of realized relationships over LD in driving prediction in these data. SSI explicitly tailors the training subset per target genotype by shrinking to zero the contributions from distant or weakly informative individuals, thereby delivering consistent gains over GBLUP and complementing kernel methods. Where kernels already enforce locality (e.g., very local Gaussian bandwidths), additional sparsity confers limited benefit; otherwise, sparse kernel models (KSSI) commonly improve accuracy over KBLUP. Although RKHS models fit training phenotypes better (higher variance explained), their generalization gains are modest, likely reflecting the bias–variance tradeoff. Overall, the findings validate per-genotype training set optimization in multi-generational, admixed breeding data and demonstrate that "bigger is not always better": inclusion of heterogeneous, distant generations can dilute the predictive signal unless weighted or pruned via local kernels or SSI.
Sparse Selection Indices (SSI) provide a practical, per-individual mechanism to optimize training data usage in heterogeneous, multi-generation breeding datasets. In CIMMYT maize, SSI with additive relationships consistently improved prediction over GBLUP (≈5–17%), and kernel methods (KBLUP) generally outperformed GBLUP, with sparse kernels (KSSI) often further enhancing accuracy. Pooling older generations reduced accuracy relative to using the most recent generation, while adding small subsets of contemporaries substantially increased accuracy. Both SSI and local kernels enable local, relationship-driven predictions that mitigate heterogeneity. Future work could: (i) extend SSI and kernel-sparse frameworks to multi-trait and genotype×environment interaction models; (ii) evaluate across more crops, traits, and larger horizons; (iii) develop dynamic retraining strategies that optimally refresh supports across cycles; (iv) integrate sequence-level data and causal variant information; and (v) improve computational efficiency and automated hyperparameter tuning for large-scale deployment.
- No common hybrids across years; mixed-effects preprocessing did not model genotype×year interactions, relying solely on genomic relationships across years.
- Marker set comprised 4,612 rAmpSeq dominant markers; conclusions may differ with denser or causal-variant-enriched panels.
- Most detailed results focus on grain yield under optimal and drought environments; generalizability to other traits/environments may vary.
- In the absence of any same-cycle data, baseline accuracies were low and kernel gains were limited; sparse models sometimes also showed small losses in these hardest settings.
- K3 (very local) kernel showed limited benefit from added sparsity, indicating sensitivity to kernel bandwidth choices.
- Computational cost of SSI is higher than BLUP per prediction, though parallelizable.