Comparison of genome-wide association and genomic prediction methods for milk production traits in Korean Holstein cattle

Objective The objectives of this study were to compare identified informative regions through two genome-wide association study (GWAS) approaches and determine the accuracy and bias of the direct genomic value (DGV) for milk production traits in Korean Holstein cattle, using two genomic prediction approaches: single-step genomic best linear unbiased prediction (ss-GBLUP) and Bayesian Bayes-B. Methods Records on production traits such as adjusted 305-day milk (MY305), fat (FY305), and protein (PY305) yields were collected from 265,271 first parity cows. After quality control, 50,765 single-nucleotide polymorphic genotypes were available for analysis. In GWAS for ss-GBLUP (ssGWAS) and Bayes-B (BayesGWAS), the proportion of genetic variance for each 1-Mb genomic window was calculated and used to identify informative genomic regions. Accuracy of the DGV was estimated by a five-fold cross-validation with random clustering. As a measure of accuracy for DGV, we also assessed the correlation between DGV and deregressed-estimated breeding value (DEBV). The bias of DGV for each method was obtained by determining regression coefficients. Results A total of nine and five significant windows (1 Mb) were identified for MY305 using ssGWAS and BayesGWAS, respectively. Using ssGWAS and BayesGWAS, we also detected multiple significant regions for FY305 (12 and 7) and PY305 (14 and 2), respectively. Both single-step DGV and Bayes DGV also showed somewhat moderate accuracy ranges for MY305 (0.32 to 0.34), FY305 (0.37 to 0.39), and PY305 (0.35 to 0.36) traits, respectively. The mean biases of DGVs determined using the single-step and Bayesian methods were 1.50±0.21 and 1.18±0.26 for MY305, 1.75±0.33 and 1.14±0.20 for FY305, and 1.59±0.20 and 1.14±0.15 for PY305, respectively. Conclusion From the bias perspective, we believe that genomic selection based on the application of Bayesian approaches would be more suitable than application of ss-GBLUP in Korean Holstein populations.


INTRODUCTION
High production ability has been used for primary selection in dairy breeding schemes. In particular, milk yield, fat yield, and protein yield are the most important economic traits for dairy cattle selection. To date, genetic improvement of these economic traits has been performed successfully based on traditional best linear unbiased prediction (BLUP), and the breeding values of economic traits have been applied with selection indices in Korean dairy breeding systems. The BLUP used in combination with individual records and estimated breeding value (EBV) has resulted in considerable genetic progress in the dairy industry [1]. In recent years, however, genomic information in the form of commercial single-nucleo-tide polymorphic (SNP) marker panels from various companies (i.e., Illumina, San Diego, CA, USA; Neogen-GeneSeek, Lincoln, NE, USA; and Affymetrix, Santa Clara, CA, USA) have become available for genetic evaluations, as a consequence of improvements in genotyping technology and statistical methods after introduction by Meuwissen et al [2] in 2001. Accordingly, genomic prediction using genotypic data has been widely applied for various livestock.
Genomic selection (GS) involves selection of bulls based on genomic breeding values, which are derived from the combination of EBVs and direct breeding values (DGVs) based on SNPs using several blending formulae [3,4] or single-step methods (e.g., single-step genomic best linear unbiased prediction [ss-GBLUP]) [5] and single-step super hybrid model [6]). The advantages of GS are simplicity and resistance to preselection bias [7,8] and more reliable prediction than traditional BLUP [1,9,10]. When GS schemes are applied in the field, the use of young bulls should be the most effective in terms of reliability. For example, in young Holstein bulls in the United States, reliabilities for predicted transmitting abilities for milk yield based on genomic information ranged from 73% to 79% [11].
Typically, there are two approaches to performing GS. The first method is multiple-step GS. In step 1 of this method, pseudo-phenotypes (i.e., EBV or deregressed-EBVs), which include information related to genotyped and ungenotyped animals, are calculated for the genotyped animals; in step 2, DGV is calculated using the pseudo-records and genotyped data (i.e., Bayesian and GBLUP approaches); and in step 3, the traditional EBV and DGV are combined into genomic-enhanced EBVs (GE-EBVs). The second method is ss-GBLUP. To construct a blended relationship matrix (H-matrix) [5] using ss-GBLUP, a numeric relationship matrix (NRM) is replaced with a genomic relationship matrix (GRM) and then these can be blended with an NRM [10]. In ss-GBLUP, the accuracy obtained for milk yield is greater than that obtained using multiple-step GS [10]. However, a drawback of ss-GBLUP is that it cannot be applied to non-linear estimates, although some solutions to ss-GBLUP non-linear estimations have been presented in the literature [10].
The objectives of this study were to compare identified informative regions through two different genome-wide association study (GWAS) approaches and assess the accuracy and bias of DGVs for milk production in Korean dairy cattle using genomic prediction approaches (i.e., ss-GBLUP and Bayesian).

Phenotypic data
Raw data for the period from 1998 to 2018 were obtained from data collected by the National Agricultural Cooperative Federation's dairy cattle improvement center by way of its milk testing program, which is nationally based. The pedigree data for this analysis were obtained from the Korean Animal Improvement Association. Traits considered in this study were adjusted 305-day (d) milk yield (MY305), adjusted 305-d fat yield (FY305), and adjusted 305-d protein yield (PY305). The data set included records for Holstein cows in the first parity with full pedigree information and excluded records with extreme milk production (MY305, <2,500 or >16,000 kg; FY305, <70 or >600 kg; and PY305, <80 or >500 kg), age at calving (<17 or >31 months). The final number of edited records was 265,271. Table 1 shows the basic statistics of the data.

Genotypic data
Genotypic data were obtained using two SNP panels: Bo-vineSNP50 v2 and BovineSNP50 v3 (Illumina Inc., USA). These two SNP panels were imputed to BovineSNP50 version 3 using Fimpute version 2.2 [12]. After excluding unmapped SNPs and SNPs on sex chromosomes, the available number of SNP markers was 54,931. After performing marker quality control, genotypes at each locus were excluded based on the following criteria: average call rate lower than 0.90; minor allele frequency less than 0.01; markers not in Hardy-Weinberg equilibrium, with a chi-square value (χ 2 ) greater than 95%; and SNPs in extreme linkage disequilibrium (LD, r 2 >0.99). After editing, 50,765 SNP genotypes were available for analysis. Furthermore, genotyped animals were excluded from analysis based on the following criteria: duplicate animals, twin animals, and animals that failed parentage tests. Duplicate animals and twin animals were removed based on marker call rates. Furthermore, genotype identification that could not be matched to the corresponding animal in the phenotypic data set was removed from a total of 2,032 Holstein dairy cattle. Finally, for ss-GBLUP for all traits, the genotype data set comprised 1,919 animals, whereas for Bayes-B, the number of animals available for MY305, FY305, and PY305 was 963, 943, and 946, respectively.

Statistical model
Genetic components, breeding values, and corresponding reliabilities of milk production traits were estimated using following mixed-model equation: where y ijk is the observation; HYS i is the fixed effect of the ith herd-year season; age j is the fixed effect of the jth calving age; a k is the random genetic effect of animal k; and e ijk is the residual effect. Using a univariate animal model, the covariance between traits was assumed to be zero. In matrix notation, the statistical model with single traits was as follows: where y is the matrix of observations for the traits; X and Z are the known incidence matrices for fixed and random effects; b is the vector of fixed effects; a is the vector of additive genetic effects for each animal, and e is the vector of the residual effect. Total phenotypic variance ( = + + + 122 calving age; ak is the random genetic effect of animal k; and eijk is the residual effect. Using a univariate animal 125 model, the covariance between traits was assumed to be zero. In matrix notation, the statistical model with 126 single traits was as follows: 127 where y is the matrix of observations for the traits; X and Z are the known incidence matrices for fixed and 131 random effects; b is the vector of fixed effects; a is the vector of additive genetic effects for each animal, and e 132 is the vector of the residual effect. 133 Total phenotypic variance ( 2 ) was defined as the sum of additive ( 2 ), and residual ( 2 ) variance. Thus, 134 the heritability was calculated as ℎ 2 = 2 2 , where ℎ 2 is the estimate of heritability. 135 The reliability of breeding value was then calculated as: √1 − / 2 , where PEV is the prediction error 136 variance. The variance components and EBVs were estimated using the expectation maximization restricted 137 maximum likelihood (EM-REML) algorithm in the REMLF90 and the BLUPF90 software module from the 138 BLUPF90 family [13]. ) was defined as the sum of additive ( t of the i th herd-year season; agej is the fixed effect of the j th al k; and eijk is the residual effect. Using a univariate animal to be zero. In matrix notation, the statistical model with ; X and Z are the known incidence matrices for fixed and the vector of additive genetic effects for each animal, and e the sum of additive ( 2 ), and residual ( 2 ) variance. Thus, 2 is the estimate of heritability. ere yijk is the observation; HYSi is the fixed effect of the i th herd-year season; agej is the fixed effect of the j th lving age; ak is the random genetic effect of animal k; and eijk is the residual effect. Using a univariate animal del, the covariance between traits was assumed to be zero. In matrix notation, the statistical model with gle traits was as follows: ere y is the matrix of observations for the traits; X and Z are the known incidence matrices for fixed and ndom effects; b is the vector of fixed effects; a is the vector of additive genetic effects for each animal, and e the vector of the residual effect.
The reliability of breeding value was then calculated as: √1 − / 2 , where PEV is the prediction error riance. The variance components and EBVs were estimated using the expectation maximization restricted aximum likelihood (EM-REML) algorithm in the REMLF90 and the BLUPF90 software module from the UPF90 family [13].
Single-step method: The ss-GBLUP was used to predict DGVs (DGVss) and analyze genome-wide sociation study data (ssGWAS). The ss-GBLUP method is a modification of BLUP. The numerator lationship matrix (A -1 ) was replaced by an H -1 matrix (a combination of numerator relationship matrix and M) as follows: , where h 2 is the estimate of heritability.
The reliability of breeding value was then calculated as: ect of the i th herd-year season; agej is the fixed effect of the j th mal k; and eijk is the residual effect. Using a univariate animal ed to be zero. In matrix notation, the statistical model with its; X and Z are the known incidence matrices for fixed and is the vector of additive genetic effects for each animal, and e s the sum of additive ( 2 ), and residual ( 2 ) variance. Thus, ℎ 2 is the estimate of heritability. , where PEV is the prediction error variance. The variance components and EBVs were estimated using the expectation maximization restricted maximum likelihood (EM-REML) algorithm in the REMLF90 and the BLUPF90 software module from the BLUPF90 family [13].
Single-step method: The ss-GBLUP was used to predict DGVs (DGV ss ) and analyze genome-wide association study data (ssGWAS). The ss-GBLUP method is a modification of BLUP. The numerator relationship matrix (A -1 ) was replaced by an H -1 matrix (a combination of numerator relationship matrix and GRM) as follows: where A22 is a numerator relationship matrix for genotyped animals and G is a GRM. The genomic matrix (G) is 147 formed based on [3] as follows: 148 where Z is the incidence matrix for markers, D is a weight matrix for SNPs (initially D = I), and q is a weighting 152 factor. The weighting factor can be obtained by using either SNP frequency [3] or by guaranteeing that the 153 average diagonal in G approaches that of A22 [8]. In the present study, for increasing the weights of SNPs with 154 large effects and decreasing those with small effects, the SNP effect and weighting factor were derived using 155 several steps, which are described by Wang et al [14]. In the present study, the weighting factor used second 156 iteration and all procedures were performed using the BLUPF90 family [13]. (dividing by the reliability of EBV) with parental information [9], were used as response variables for prediction 160 where A 22 is a numerator relationship matrix for genotyped animals and G is a GRM. The genomic matrix (G) is formed based on [3] as follows: where Z is the incidence matrix for markers, D is a weight matrix for SNPs (initially D = I), and q is a weighting factor. The weighting factor can be obtained by using either SNP frequency [3] or by guaranteeing that the average diagonal in G approaches that of A 22 [8]. In the present study, for increasing the weights of SNPs with large effects and decreasing those with small effects, the SNP effect and weighting factor were derived using several steps, which are described by Wang et al [14]. In the present study, the weighting factor used second iteration and all procedures were performed using the BLUPF90 family [13].
Bayesian method: Deregressed-estimated breeding value (DEBV) adjusting for parental average (DEBV-PA) values, which contained only phenotypic information for individuals and their descendants used for deregression (dividing by the reliability of EBV) with parental information [9], were used as response variables for prediction of DGV Bayes and GWAS (BayesGWAS) in the multiple-step process. To ensure the quality of DEBV-PA values, those animals with a reliability of less than 0.10 were removed. To account for the heterogeneous variance of DEBV, the response variable was weighted because each animal has different reliabilities. The weighting factor (w i ) [15] for animal i was calculated as follows: several steps, which are described by Wang et al [14]. In the present study, th 156 iteration and all procedures were performed using the BLUPF90 family [13]. 157 Bayesian method: Deregressed-estimated breeding value (DEBV) adjusting 158 values, which contained only phenotypic information for individuals and their de 159 (dividing by the reliability of EBV) with parental information [9], were used as 160 of DGVBayes and GWAS (BayesGWAS) in the multiple-step process. To ensure 161 those animals with a reliability of less than 0.10 were removed. To account fo 162 DEBV, the response variable was weighted because each animal has different re 163 ( ) [15] for animal was calculated as follows: 164 where 2 is the reliability of EBV, ℎ 2 is the heritability of the trait, and 168 variation that could not be explained by the genetic information (i.e., SNP 169 assumed to be equal to 0.40 [16] . 170 To estimate SNP marker effects, the Bayes-B method was used [2] with π s 171 assumes that some proportion (π) of SNP markers has zero effects and that each 172 variance, which contrasts with the Bayes-C method. For each trait, marker e 173 where several steps, which are described by Wang et al [14]. In the present study, t values, which contained only phenotypic information for individuals and their d 159 (dividing by the reliability of EBV) with parental information [9], were used as 160 of DGVBayes and GWAS (BayesGWAS) in the multiple-step process. To ensure 161 those animals with a reliability of less than 0.10 were removed. To account f 162 DEBV, the response variable was weighted because each animal has different r 163 ( ) [15] for animal was calculated as follows: 164 where 2 is the reliability of EBV, ℎ 2 is the heritability of the trait, and 168 variation that could not be explained by the genetic information (i.e., SNP 169 assumed to be equal to 0.40 [16] . 170 To estimate SNP marker effects, the Bayes-B method was used [2] with π 171 assumes that some proportion (π) of SNP markers has zero effects and that eac 172 variance, which contrasts with the Bayes-C method. For each trait, marker 173 is the reliability of EBV, h 2 is the heritability of the trait, and c is the the proportion of genetic variation that could not be explained by the genetic information (i.e., SNP markers). In this study, c was assumed to be equal to 0.40 [16] .
To estimate SNP marker effects, the Bayes-B method was used [2] with π set to 0.99. The Bayes-B method assumes that some proportion (π) of SNP markers has zero effects and that each SNP marker has locus-specific variance, which contrasts with the Bayes-C method. For each trait, marker effects were estimated using the following model equation: Metropolis-Hastings samples were run for 10 iterations. The prior genotypic a 186 results of REML were used, which were implemented in the BLUPf90 family. Al 187 using GenSel4R software [17]. 188 189

Detection of informative regions or loci based on single SNPs may result in nois 191
high ratio between the number of SNPs and the number of genotyped animals [1 192 in high LD with the same quantitative trait locus (QTL) in high-density SNP p 193 where y i is DEBV on animal i for the respective trait; μ is the population mean; k is the number of markers; Z ij is the allelic state at locus j in individual i; u j is the random substitution effect for marker j, which follows a mixed distribution for this random substitution effect according to indicator variable (δ j ), a random 0/1 variable indicating the absence or presence of marker j in the model, with u j assumed to be normally distributed N(0, 7 g model equation: is DEBV on animal i for the respective trait; μ is the population mean; k is the number of markers; Zij lelic state at locus j in individual i; uj is the random substitution effect for marker j, which follows a istribution for this random substitution effect according to indicator variable (δj), a random 0/1 variable g the absence or presence of marker j in the model, with uj assumed to be normally distributed N(0, 2 ) = 1; and ei is a random residual effect assumed to be normally distributed N(0, 2 ).
posterior distributions of the parameters and effects were obtained using Gibbs sampling. We performed v chain Monte Carlo (MCMC) simulation of 110,000 rounds with Gibbs sampling, of which the first iterations were discarded as burn-in. To estimate posterior means and variances of marker effects, lis-Hastings samples were run for 10 iterations. The prior genotypic and residual variances from the f REML were used, which were implemented in the BLUPf90 family. All procedures were implemented nSel4R software [17].
-wide association study analysis n of informative regions or loci based on single SNPs may result in noise or underestimation due to the o between the number of SNPs and the number of genotyped animals [14], and adjacent SNPs may be LD with the same quantitative trait locus (QTL) in high-density SNP panels because the effect of the ) when δ j = 1; and e i is a random residual effect assumed to be normally distributed N(0, where yi is DEBV on animal i for the respective trait; μ is the population mean; k is the number of markers; Zij 178 is the allelic state at locus j in individual i; uj is the random substitution effect for marker j, which follows a 179 mixed distribution for this random substitution effect according to indicator variable (δj), a random 0/1 variable 180 indicating the absence or presence of marker j in the model, with uj assumed to be normally distributed N(0, 2 ) 181 when δj = 1; and ei is a random residual effect assumed to be normally distributed N(0, 2 ). 182 The posterior distributions of the parameters and effects were obtained using Gibbs sampling. We performed 183 a Markov chain Monte Carlo (MCMC) simulation of 110,000 rounds with Gibbs sampling, of which the first 184 10,000 iterations were discarded as burn-in. To estimate posterior means and variances of marker effects, 185 Metropolis-Hastings samples were run for 10 iterations. The prior genotypic and residual variances from the 186 results of REML were used, which were implemented in the BLUPf90 family. All procedures were implemented 187 using GenSel4R software [17]. 188 189

Genome-wide association study analysis 190
Detection of informative regions or loci based on single SNPs may result in noise or underestimation due to the 191 high ratio between the number of SNPs and the number of genotyped animals [14], and adjacent SNPs may be 192 in high LD with the same quantitative trait locus (QTL) in high-density SNP panels because the effect of the 193 ). The posterior distributions of the parameters and effects were obtained using Gibbs sampling. We performed a Markov chain Monte Carlo (MCMC) simulation of 110,000 rounds with Gibbs sampling, of which the first 10,000 iterations were discarded as burn-in. To estimate posterior means and variances of marker effects, Metropolis-Hastings samples were run for 10 iterations. The prior genotypic and residual variances from the results of REML were used, which were implemented in the BLUPf90 family. All procedures were implemented using GenSel4R software [17].

Genome-wide association study analysis
Detection of informative regions or loci based on single SNPs may result in noise or underestimation due to the high ratio between the number of SNPs and the number of genotyped animals [14], and adjacent SNPs may be in high LD with the same quantitative trait locus (QTL) in high-density SNP panels because the effect of the QTL would be spread over all SNPs in high LD [18]. For this reason, non-overlapping 1-Mb windows, which is the proportion of genetic variance in each region consisting of a 1-Mb genome window, were calculated and used to identify informative genomic regions accounting for LD, which is more appropriate than using single SNPs.
The significance level of the informative 1-Mb window region in ssGWAS and BayesGWAS was, respectively, 1.0% and 0.5% of additive genetic variance, which was estimated as a portion of the total genetic variance explained by all SNPs.

Accuracy of the direct genomic value
To estimate the accuracy of DGVs, we applied five-fold crossvalidation with random clustering, whereby we set up training data sets, each of which was each constructed by masking the phenotype in the SS-method (i.e., setting the phenotype of genotyped cows and daughters of genotyped sires and their "unknown") and the response variable in the Bayesian method (i.e., setting the response variable "unknown"), whereby 20% of the total individuals is set to random without replacement so as to be masked precisely once in the training data sets. Using these steps, we produced five training and testing sets. This results in each genotyped animal having DGV ss and DGV Bayes values from the masking data set, as derived using the single-step and Bayesian methods, respectively. The correlation coefficient between the DGV and DEBV values was calculated and used as a measure of the accuracy of DGV. Additionally, the bias (spread) of DGV for each method was assessed using regression coefficients. Table 2 summarizes the number of masked animals and phenotypes in each data set.

Genetic parameter estimation
Variance components and heritability were estimated from regular phenotypic BLUP based on a univariate animal model. The estimated heritabilities for MY305, FY305, and PY305 were 0.26, 0.21, and 0.22, respectively (Table 3).

Genome-wide association study
Using association analysis based on ssGWAS and BayesGWAS, we detected the most significant regions for SNP markers on the Illumina BovineSNP50 panel. Figures 1, 2 shows plots of genetic variance accounted for by 1-Mb windows, within a chromosome, based on different methods. Table 4 shows the results of GWAS for milk production traits. The GWAS results include the chromosomal position and fraction of variance of 1-Mb genome windows by informative regions (greater than 0.5% or 1.0%). Using BayesGWAS, there were 2,521 regions, with an average number of 20 SNPs, whereas for ssGWAS, there were 2,024 regions with an average number of 20 SNPs.
A total of nine and five significant windows (1-Mb) were identified for MY305 using ssGWAS and BayesGWAS, respectively. The most informative window was detected on chromosome Bos taurus autosomes 15 (BTA15) at 23 Mb using ssGWAS and on BTA14 at 21 Mb using BayesGWAS, which explained 15.73% and 1.0%, respectively. An informative window common to both ssGWAS and BayesGWAS was identified on BTA14 at 1 Mb, which explained 1.54% and 0.79%, respectively. For FY305, we detected 12 significant QTLs using ssGWAS and seven significant QTLs using BayesGWAS. The region of BTA14 at 1 Mb was the most significant 1-Mb window region and a common significant region detected using  both methods, which indicated that 11.25% (ssGWAS) and 12.12% (BayesGWAS) of the additive genetic variance was captured, respectively. For PY305, we identified 14 and two significant regions using ssGWAS and BayesGWAS, respectively. Using ssGWASs and BayesGWAS, the most informative window was detected on BTA15 at 24 Mb and on BTA13 at 31 Mb, respectively. A common informative window obtained using both methods was detected on BAT13 at 31 Mb. The BTA14 region has received considerable attention from many scientists as this region has been reported to harbor a large number of QTLs having an effect on milk production. The diacylglycerol O-acyltransferase 1 (DGAT1) gene located at 1 Mb on BTA14 is generally accepted to be a major gene for milk production [21,22]. In addition to the DGAT1 gene, the 1-Mb region of BTA14 also harbors a number of other genes with linkage to DGAT1, such as cytochrome P450 family 11 subfamily B member 1 [22,23]. Accordingly, using both ssGWAS s and BayesGWAS, the 1 Mb region of BTA14 was identified to be a region associated milk and fat yield. Although the 1-Mb region on BTA14 has also previously been shown to be informative with respect to milk protein [21,22], in the present study, we were unable to detect this window with regards to milk protein. This could be attributable to the fact that the collection system for milk protein yield data in Korea was recently changed due to problems associated with the standard solution used. Accordingly, the data for milk protein are not standardized. Therefore, further research is required to obtain uniform milk protein data.
Our findings relating to the 1-Mb region on BTA14, along with other significant regions, are consistent with previously identified regions that have a potential influence on milk production in the Animal QTL database (https://www.animalgenome.org/cgi-bin/QTLdb/BT/index).
Despite the significantly higher level of genetic variance associated with using ssGWAS than when using BayesGWAS, the former was able to identify a larger number of significant regions. Moreover, it is notable that few significant regions were detected using both ssGWAS and BayesGWAS approaches, which can probably be attributed to the differences in methodologies. Methods like Bayes-B are strongly affected by priors, and by the proportion of SNPs assumed not to have an effect (π) [14,15,24]. In contrast to Bayesian methods, ss- GWAS analysis is based on available pedigree relationships, and does not depend on deregression [14]. Previous studies have investigated different GWAS approaches using simulated data sets, and found that the different methods were able to detect the same regions [25,26]. In contrast, however, Wang et al [14] found that few common regions were detected using different methods. These disparate findings can probably be explained in terms of the limitations of simulations, which do not capture the complexities of real data.

Accuracy of direct genomic value
On the basis of our previous GWAS results, we identified common QTL regions using two different approaches (i.e., ss-GBLUP and Bayes-B). However, we were unable to accurately determine the location and effect size of true QTLs. Therefore, we also compared the accuracy and bias of DGVs when using the two approaches. Table 5 shows the accuracy and bias of the DGVs determined using the ss-GBLUP (single-step method) and Bayes-B (Bayesian method) approaches. To gain estimates of the accuracy and degree of bias of DGVs, we calculated the averages of correlation and regression coefficients in predicting the masking individual in the validation set for analysis of the non-masking individual in the training set, respectively.
For the three studied traits, we noted small differences in the accuracy of the DGVs obtained using the two methods. The prediction accuracy for trait MY305 (Milk [Acc.]) was lower than that for the other milk production traits (fat and protein). However, compared with Bayes-B, the single-step method for MY305, FY305, and PY305 had a higher bias. By using weighting factors in the Bayes-B method, the more reliably genotyped animals made a greater contribution in estimating SNP marker effects and the prediction of DGVs. We did not apply weighting factors when using ss-GBLUP as real phenotypes were used as response variables when using this method. A direct comparison of the accuracy and bias of DGV determined in the present study with those determined previously is difficult given differences in populations and methodologies, such as clustering methodologies (e.g., K-means vs random vs identity by state IBS clustering), the models used, assessments of method accuracy (e.g., genetic correlation vs simple vs variable setting), and other reasons [9]. Furthermore, accuracy depends on various parameters, including the reference population size and its genetic structure [27]. In this regard, in a previous study on Danish Holsteins using a five-fold crossvalidation, Su et al [28] reported that the accuracy of DGV (r DGV,EBV ) for milk production ranged from 0.64 to 0.70. Similarly, Ding et al [29] in their study of Chinese Holsteins, reported that the accuracy of DGV (r DGV, EBV ) in five-fold cross-valida-tion using Bayes-B with priors (π = 0.99) and GBLUP for milk production ranged from 0.317 to 0.380, whereas Luan et al [30] reported an accuracy for milk production of 0.54 to 0.56 in their study on Norwegian red cattle.
We found that the mean accuracies of DGVs for milk productions in the present study were smaller than those obtained previously, which can probably be explained by the fact that the reference population size in our study was smaller than that used in other studies, which was at least 2,000 bulls. Therefore, we intend to increase the size of our reference population by continuously updating data on genotyped animals and phenotypes. This will accordingly improve the accuracy of our genomic predictions. Similarly, if real variants (true QTLs) identified from putative informative regions based on GWAS