Multi-generation genomic prediction of maize yield using parametric and non-parametric sparse selection indices



M. Lopez-Cruz, Y. Beyene, et al.

This research, led by Marco Lopez-Cruz with Yoseph Beyene, Manje Gowda, Jose Crossa, Paulino Pérez-Rodríguez, and Gustavo de los Campos, shows that combining sparse selection indices with kernel methods substantially improves genomic prediction accuracy for maize grain yield in multi-generation data.

Introduction
Genomic selection (GS), first proposed by Meuwissen et al. (2001), has been widely adopted in animal and crop breeding. Large genomic datasets, while advantageous, often carry increased genetic heterogeneity because they combine multiple generations and complex admixture patterns, and several studies have shown that using all available data is not always optimal for genomic prediction. For example, Wolc et al. (2016) found higher prediction accuracy when training on the last three generations rather than five, and Riedelsheimer et al. (2013) and Jacobson et al. (2014) reported higher accuracy when training on biparental families sharing at least one parent. Habier et al. (2010) highlighted the impact of family relationships on prediction accuracy, with distantly related individuals contributing little; indeed, including distantly related individuals in the training set can reduce accuracy because of heterogeneity in allele frequencies and linkage disequilibrium (LD) patterns between training and prediction sets (Lorenz and Smith 2015).

Several approaches have been proposed to deal with this heterogeneity, including SNP-by-group interaction models and multivariate models for multi-breed genomic prediction (Olson et al. 2012; Lehermeier et al. 2015; Rio et al. 2020), as well as training set optimization methods (Clark et al. 2012; Rincent et al. 2012; Lorenz and Smith 2015; Akdemir and Isidro-Sanchez 2019; Roth et al. 2020). A limitation of these methods is the assumption of a single optimal training set for all prediction genotypes. Lopez-Cruz and de los Campos (2021) introduced the sparse selection index (SSI) to address this by identifying a customized training set for each individual in the prediction set.

Reproducing kernel Hilbert spaces (RKHS) regression, of which GBLUP is a special case based on a linear kernel, has shown good predictive performance, and several studies suggest that non-linear kernels, such as Gaussian kernels, may yield higher accuracy (Crossa et al. 2010; de los Campos et al. 2010; Morota and Gianola 2014; Bandeira e Sousa et al. 2017; Cuevas et al. 2016, 2017, 2018). This study investigates whether combining SSIs with kernel methods improves prediction accuracy when models are trained with multi-generation data.
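To make the kernel terminology concrete, the minimal R sketch below (R being the language used for the analyses) builds the two kinds of kernels from a marker matrix: the additive (linear) kernel underlying GBLUP and a Gaussian kernel. The marker matrix X, its scaling, and the bandwidth h are illustrative placeholders, not the study's data or settings.

    # X: n x p matrix of marker codes (rows = lines, columns = markers), assumed already loaded
    X  <- scale(X, center = TRUE, scale = FALSE)   # center each marker

    # Additive (linear) kernel: GBLUP is RKHS regression with this kernel
    G  <- tcrossprod(X) / ncol(X)                  # one common construction of a genomic relationship matrix

    # Gaussian kernel: a non-linear function of the squared Euclidean marker distance
    D2 <- as.matrix(dist(X))^2
    D2 <- D2 / mean(D2)    # one common scaling, so the bandwidth is unit-free
    h  <- 1                # bandwidth; the study compared three bandwidths (K1, K2, K3) and a multi-kernel model (KA)
    K  <- exp(-h * D2)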
Literature Review
The literature review covers existing genomic selection (GS) methods and their limitations when applied to multi-generational data. It highlights the challenges of data heterogeneity, emphasizing that simply increasing training set size does not guarantee improved accuracy, and cites key studies on the impact of family relationships and the potential negative effects of including distantly related individuals. Previous approaches to handling data heterogeneity are reviewed, including SNP-by-group interaction models and various training set optimization techniques, along with their main limitation: the assumption of a single optimal training set for all prediction individuals. The review also introduces reproducing kernel Hilbert spaces (RKHS) regression and its application in genomic prediction, noting the potential advantages of non-linear kernels such as Gaussian kernels over linear kernels. Finally, it introduces sparse selection indices (SSI), an approach that addresses the limitations of existing methods by identifying a customized training set for each prediction individual.
Methodology
This study used four years (2017-2020) of doubled haploid (DH) maize data from the International Maize and Wheat Improvement Center (CIMMYT). The data consisted of 3722 DH lines derived from 54 biparental families, genotyped with rAmpSeq markers, which target repetitive genomic sequences; after filtering, 4612 markers were used in the analysis. Grain yield (GY) was measured under optimal and drought conditions, and adjusted means for GY were obtained from mixed-effects models accounting for location, replicate, and block effects.

Four prediction models were compared: genomic best linear unbiased prediction (GBLUP) using additive genomic relationships; reproducing kernel Hilbert spaces (RKHS) regression using Gaussian kernels (KBLUP); the sparse selection index with additive relationships (GSSI); and the SSI with a Gaussian kernel (KSSI). Three Gaussian kernels (K1, K2, K3) with different bandwidth parameters and a multi-kernel model (KA) were considered. Variance components were estimated with Bayesian genomic models using the BGLR R-package. Prediction accuracy was assessed as the correlation between observed and predicted values in the prediction set (the 2020 lines). Training sets were composed of data from a single previous year (2017, 2018, or 2019), combinations of previous years (2018+2019, 2017+2018+2019), and these sets augmented with progressively larger proportions (0%, 5%, 10%, 15%) of the 2020 data. The penalty parameter λ of the SSIs was tuned by 10-fold cross-validation. All analyses were performed in R using the BGLR and SFSI packages.
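As a rough illustration of the fitting pipeline, the R sketch below first fits a single-kernel RKHS model with BGLR (GBLUP when K is the additive kernel, KBLUP when K is a Gaussian kernel) and then, as a package-free illustration of the SSI idea, solves the L1-penalized selection-index problem for one 2020 line by coordinate descent; setting lambda = 0 recovers the GBLUP/KBLUP index. Here y is assumed to be the vector of adjusted GY means, trn and tst are assumed index vectors for the training and 2020 prediction lines, K is one of the kernels sketched above, and the MCMC settings and lambda value are placeholders; the study itself fit the GSSI/KSSI models with the SFSI package and tuned lambda by 10-fold cross-validation.

    library(BGLR)

    # Single-kernel RKHS fit: the 2020 lines are masked so BGLR predicts them
    yNA      <- y
    yNA[tst] <- NA
    fm  <- BGLR(y = yNA, ETA = list(list(K = K, model = "RKHS")),
                nIter = 12000, burnIn = 2000, verbose = FALSE)
    acc <- cor(fm$yHat[tst], y[tst])   # prediction accuracy in the 2020 set

    # Sparse selection index for one prediction line (illustrative, not SFSI code):
    # beta minimizes 0.5*t(b) P b - t(b) c + lambda*sum(|b|), with
    # P = K[trn, trn] + (varE/varU)*I and c = K[trn, i]
    ssi_coef <- function(P, ci, lambda, maxit = 250, tol = 1e-6) {
      beta <- rep(0, length(ci))
      for (it in seq_len(maxit)) {
        beta_old <- beta
        for (j in seq_along(beta)) {
          r       <- ci[j] - sum(P[j, -j] * beta[-j])              # partial residual for coordinate j
          beta[j] <- sign(r) * max(abs(r) - lambda, 0) / P[j, j]   # soft-thresholding update
        }
        if (max(abs(beta - beta_old)) < tol) break
      }
      beta
    }

    theta  <- fm$varE / fm$ETA[[1]]$varU               # residual-to-genomic variance ratio from the BGLR fit
    P      <- K[trn, trn] + theta * diag(length(trn))
    i      <- tst[1]                                   # one line from the 2020 prediction set
    beta_i <- ssi_coef(P, K[trn, i], lambda = 0.1)     # lambda is arbitrary here; the study tuned it by 10-fold CV
    u_i    <- sum(beta_i * (y[trn] - mean(y[trn])))    # SSI prediction of line i's genetic merit
    n_sup  <- sum(beta_i != 0)                         # size of the customized training set ("support points")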
Key Findings
The study revealed several key findings about the behavior of the different genomic prediction models on multi-generational maize data. First, the kernel-based models (KBLUP), particularly those using the Gaussian kernels K1, K2, and KA, consistently outperformed the standard GBLUP model, with gains in prediction accuracy of 1-15%, underscoring the benefit of capturing non-linear relationships between genotypes when predicting grain yield. Second, the sparse selection indices improved prediction accuracy regardless of the kernel used (additive or Gaussian): GSSI yielded accuracy gains of 5-17% over GBLUP, while KSSI gave a more modest additional improvement over KBLUP, particularly for the kernels with smaller bandwidths (K1 and K2). The gains were most pronounced when the underlying GBLUP model achieved low accuracy, which typically occurred when the training data did not include any lines from the target prediction year (2020).

The analysis also showed that training only on the most recent generation (2019) did not always yield the highest accuracy: training on older generations, particularly when combined with a portion of the 2020 data, often gave similar or even better results. By contrast, accumulating older generations in the training set decreased accuracy relative to using the most recent generation alone, suggesting that genetic heterogeneity across generations penalizes the standard GBLUP model.

Finally, the SSI methodology performed an automated form of training set optimization, selecting a customized training set (the "support points") for each genotype in the prediction set, which highlights the value of tailored training strategies. The sparsity of the SSIs varied, with GSSI exhibiting 4-42% sparsity and KSSI becoming sparser as the bandwidth parameter increased.
Discussion
This study's findings address the challenges of training genomic prediction models with multi-generational data. The improvements in prediction accuracy obtained with both kernel methods and sparse selection indices (SSI) highlight the importance of accounting for non-linear genetic relationships and the benefits of tailored training set optimization. The stronger performance of kernel methods, particularly those using Gaussian kernels, underscores the value of moving beyond linear kernels to capture complex genetic architecture, while the accuracy gains from the SSI demonstrate its potential to mitigate the negative impact of the genetic heterogeneity often present in multi-generational data. The demonstration that customized training sets, rather than a single optimal set shared by all prediction individuals, improve prediction accuracy has practical implications for genomic prediction applications. Overall, the research reinforces the notion that 'bigger is not always better' in terms of training set size and emphasizes the importance of strategic training set composition.
Conclusion
This study demonstrates the effectiveness of combining sparse selection indices (SSI) with kernel methods to improve the accuracy of genomic prediction in multi-generation maize data. SSIs offer a powerful approach to optimize training set selection by identifying a customized training set for each individual in the prediction set, mitigating the negative impacts of genetic heterogeneity. Future research could explore the application of SSIs to other crops and traits, investigate the optimal strategies for incorporating multi-generational data in genomic prediction, and further refine the computational efficiency of SSI methods.
Limitations
The study focused on a specific maize dataset from a particular breeding program. The generalizability of these findings to other populations or breeding programs requires further investigation. The computational cost of SSIs can be relatively high compared to standard BLUP methods; however, this can be mitigated using parallel computing. Further research is needed to explore alternative methods for tuning the parameters in SSI and kernel models.