# Accurate Estimation of Effective Population Size in the Korean Dairy Cattle Based on Linkage Disequilibrium Corrected by Genomic Relationship Matrix

## Article information

## Abstract

Linkage disequilibrium between markers or genetic variants underlying interesting traits affects many genomic methodologies. In many genomic methodologies, the effective population size (N_{e}) is important to assess the genetic diversity of animal populations. In this study, dairy cattle were genotyped using the Illumina BovineHD Genotyping BeadChips for over 777,000 SNPs located across all autosomes, mitochondria and sex chromosomes, and 70,000 autosomal SNPs were selected randomly for the final analysis. We characterized more accurate linkage disequilibrium in a sample of 96 dairy cattle producing milk in Korea. Estimated linkage disequilibrium was relatively high between closely linked markers (>0.6 at 10 kb) and decreased with increasing distance. Using formulae that related the expected linkage disequilibrium to N_{e}, and assuming a constant actual population size, N_{e} was estimated to be approximately 122 in this population. Historical N_{e}, calculated assuming linear population growth, was suggestive of a rapid increase N_{e} over the past 10 generations, and increased slowly thereafter. Additionally, we corrected the genomic relationship structure per chromosome in calculating *r ^{2}* and estimated N

_{e}. The observed N

_{e}based on

*r*corrected by genomics relationship structure can be rationalized using current knowledge of the history of the dairy cattle breeds producing milk in Korea.

^{2}**Keywords:**Linkage Disequilibrium; Dairy Cattle; Effective Population Size; SNP

## INTRODUCTION

The recent availability of SNP genotyping allows for the application of genomic techniques to domestic animals. Genomic techniques such as genome-wide association studies and genomic selection depend on the extent of linkage disequilibrium and its rate of decline with distance between loci within a population. These genomic tools depend on sample size and the quality of linkage disequilibrium estimation. Therefore, accurate characterization of linkage disequilibrium will assist in planning future studies using genomic techniques.

Linkage disequilibrium can provide insights into the evolutionary history of a population through the effective population size (N_{e}) (Falconer et al., 1996). Using N_{e}, we can monitor genetic diversity in livestock populations and explain the observed range and pattern of genetic variation with regard to population genetics. In addition, if pedigrees are incomplete or unavailable, N_{e} provides information into the inbreeding level of a population. The strength of linkage disequilibrium at different genetic distances between makers can be used to infer ancestral N_{e}. The pattern of historical N_{e} in animal populations can increase our understanding of the impact of selective breeding strategies on the genetic variation within the framework of population genetics.

Researchers have already applied SNP chips for genome-wide association study (Huang et al., 2010; Jiang et al. 2010; Sahana, 2010) and genomic selection (Schaeffer, 2006; Hayes et al., 2009) in dairy cattle. The pattern of linkage disequilibrium in cattle has been characterized, and predictions of N_{e} estimated using SNP chip data have already been made. Flury (2010), based on 128 Swiss Eringer breed genotyped using the Illumina BovineSNP50 BeadChip, estimated linkage disequilibrium-based actual N_{e} and ancestral pedigree-based N_{e} (Flury et al., 2010). Other studies have evaluated the historical N_{e} of a variety of cattle breeds, all of which suggested a continuous decrease in N_{e} since the time of domestication (Thevenon, et al., 2007; De Roos et al., 2008).

N_{e} is an important measure in the global dairy cattle industry as well as the Korean dairy cattle industry in identifying the state of genetic resource. As Korea is a major-semen importing country in the dairy cattle industry (Table S4), dairy cattle in Korea have diverse genetic sources. In this study, we used *r ^{2}* corrected by the genomic relationship structure based on SNPs, which is more accurate than the existing

*r*to estimate linkage disequilibrium. We then estimated the current N

^{2}_{e}and inferred an accurate N

_{e}. These results are compared with other studies and considered in the context of current knowledge for the establishment of genomics methods in dairy cattle in Korea.

## MATERIALS AND METHODS

### Genotypic data

Almost all of the dairy cattle in Korea are Holstein, which have been produced by selected domestic seed bull or imported semen. Therefore, the Korean dairy cattle population in this manuscript is the Holsteins population in Korea. The country of origin for imported semen is shown in Table S4. The 96 Holstein samples were collected from four field dairy farms located in Gyeonggi and Gangwon provinces. The samples were collected by the Animal and Plant Quarantine Agency of Korea (a governmental agency) by random sampling for a survey on population disease resistance in Korea. The collected samples for this study represent the Korean Holstein cattle population despite the limitations of the sample size and is treated as a single population sample representing the dairy cattle of Korea. There are few regional differences in the dairy cattle in Korea because the supply and management of dairy cattle semen or seed bull takes place at the national level.

The Illumina BovineHD Genotyping BeadChip, which contains 777 kb SNP located across all autosomes, sex chromosomes, and mitochondria was used. These informative SNPs were selected from the Bovine Genome Database. Genotyping data were analyzed using Plink for quality control (GENO>0.05, MAF<0.1, HWE test p≤ 0.0001), which removed 226,156 SNP from the analysis because of poor genotyping quality (Purcell et al., 2007). As the Illumina BovineHD Genotyping BeadChip contains numerous SNP at a high density, the distance between adjacent SNPs is very short making the computing time long. To reduce the computational time, we used 70,000 randomly selected SNPs including only autosomal SNPs after quality control. Interbull (Uppsala, Sweden; www.interbull.org) uses BovineSNP50 Genotyping BeadChips, which contains 54,609 SNP to estimate genetic parameters and hence, we considered 70,000 SNPs sufficient for our purposes.

### Linkage disequilibrium

We estimated linkage disequilibrium using the “LDcorSV” package implemented in R. This package provides a set of functions to measure the existing *r ^{2}* and the

*r*corrected by the sample structure (Mangin et al., 2012). Estimating the

^{2}*r*of all pairwise per chromosome using high-density SNP chips is time-consuming. To reduce computing time, the input for each chromosome was split into files containing 100 loci, and

^{2}*r*

^{2}was calculated for all syntenic marker pairs in each file as has been done in previous studies (Flury et al., 2010).

The standard measures of existing *r ^{2}* are respectively equivalent to the covariance and the correlation between alleles at two different loci (Hill and Robertson, 1968), computed as:

_{A}, P

_{a}, P

_{B}, and P

_{b}are the respective frequencies of alleles A, a, B and b, and D is P

_{AB}−P

_{A}P

_{B.}Sample structure information is required for the corrected

*r*. We used the genomic relationship matrix based on SNP data instead of pedigree data. In this study, the genomic relationship value of individual pairs was the number of common SNP between two individuals/number of total allele sites. In this way, we proposed a simple genomic relationship matrix (96×96). A process of calculating

^{2}*r*corrected by the genomic relationship structure based on SNP was identical to the existing

^{2}*r*calculation.

^{2}Details on the physical position of the markers can be found in the Illumina product literature (http://support.illumina.com/array/array_kits/bovinehd_dna_analysis_kit/downloads.ilmn). To determine linkage disequilibrium in relation to the physical distance between markers, marker pairs were divided into distance bins. We established two kinds of classes (0 to 0.5Mb, 0 to 5Mb) and subsequently, applicable marker pairs to each class were put into 50 distance bins with bin ranges dependent on the class (see Table S2). The two types of mean *r ^{2}* for each of the distance bins were then plotted against the median of the distance bin range (Mb). This estimation of

*r*was performed on a chromosome by chromosome basis; the pooled results are presented in Figure 2.

^{2}Also *r ^{2}* was calculated for a random selection of non-syntenic SNP. Across the autosome, 700 SNP were randomly selected and

*r*values were calculated for only non-syntenic marker pairs. This resulted in a total of 250,096 pairwise comparisons that were not corrected by the genomic relationship structure.

^{2}### Construction model of linkage disequilibrium with distance

Under the assumption of an isolated population with random mating, Sved (1971) derived an approximate expression for the expectation of *r ^{2}* (Sved, 1971):

*r*) given by Sved (1971). Based on this formula, a non-linear least-squares approach to statistically model the observed

^{2}*r*was implemented within R (nls function) using this model:

^{2}_{i}is the value of

*r*for marker pair i, at linkage distance d

^{2}_{i}in Morgans. Parameters a and b were estimated iteratively using the least-squares method. Chromosome-specific megabase-to-centimorgan conversion rates were calculated based on total physical chromosome length as stated on the UCSC Web site, and total chromosome genetic length derived from the bovine linkage map (Arias et al., 2009) (see Table S2). We used marker pairs for which

*r*values were higher than the mean of

^{2}*r*for non-syntenic marker pairs. This model was applied to each chromosome in turn and the parameters were estimated. Similar to Corbin (2010), estimated parameters were combined by meta-analysis in R using an inverse variance method for pooling and random effects method based on the DerSimonian-Laird method (DerSimonian and Laird, 1986; Corbin et al., 2010).

^{2}### Ancestral effective population size estimation

Equation (2) can predict the effective population size at a given point in time, expressed as generations in the past (Hayes et al., 2003; De Roos et al., 2008).

_{T}is the effective population size t generations ago, c is the distance between markers in Morgans,

*r*

_{c}

^{2}is the mean value of

*r*for marker pairs c Morgans apart, and c = (2t)

^{2}^{−1}when assuming linear growth (Hayes et al., 2003). As previously mentioned, marker pairs with linkage distances less than the mean of

*r*for non-syntenic marker pairs (<0.014248) were excluded from this analysis. To estimate N

^{2}_{T}, the number of prior generations was selected and a suitable range for c was calculated. The binning process was designed to ensure sufficient marker pairs within each bin and obtain a representative

*r*mean (see Table S3). This process was performed for markers pooled across autosomes.

^{2}### Estimation of effective population size based on the genomic relationship structure per chromosome

For the 96 genotyped individuals, we were able to establish genomic relationship structure per chromosome using SNP information of each chromosome. We proposed *r ^{2}* corrected by the genomic relationship structure per chromosome using “LDcorSV” package implemented in R (Mangin et al., 2012). A process of estimating N

_{e}based on genomic relationship per chromosome and ancestral N

_{e}was identical to that described above.

## RESULTS

### Genotypic data

Of the 734,862 genotyped autosomal SNPs, 508,707 (69.2%) remained after quality control processes and 70,000 were selected for analysis. The number of SNPs per autosome remaining after filtering and selection ranged from 1,300 to 4,280 and was closely related to chromosome length and total number of SNP as shown in Figure 1. The minor allele frequency of remaining SNP followed a uniform distribution and averaged (±SD) to be 0.31±0.11. The average distance between marker pairs (±standard deviation) for this analysis was 1,202.77±932 kb, with the distance between markers ranging from 0.134 kb to 8,398 kb.

### Linkage disequilibrium

Linkage disequilibrium declined with increasing distances between marker pairs, as shown in Figure 2a and b. The most rapid decrease was seen over the first four bins, with the mean *r ^{2}* decreasing by about half. The mean existing

*r*decreased more slowly with increasing distance and was constant after 0.2 Mb of distance. The mean

^{2}*r*corrected by the genomic relationship structure based on SNP with distance between marker pairs was slightly less than that of the existing

^{2}*r*. However, change patterns in the

^{2}*r*means with distance for both methods were similar. The decline in linkage disequilibrium showed slight differences with log-transformed distance (Figures S1 and S2). According to the existing

^{2}*r*, 23,797 of the 3,385,800 marker pairs were in complete linkage disequilibrium; 12,225 of these were adjacent pairs. For

^{2}*r*corrected by the genomic relationship structure based on SNP, 23,794 of the 3,385,800 marker pairs were in complete linkage disequilibrium; 12,223 of these were adjacent pairs. The mean (±standard deviation)

^{2}*r*between random non-syntenic marker pairs was 0.014248±0.0197. 835,324 using the existing

^{2}*r*measure and 861,676 using

^{2}*r*corrected by the genomic relationship structure based on SNP are less than the mean of non-syntenic marker pairs.

^{2}### Construction model of linkage disequilibrium with distance

The non-linear regression model of the declining linkage disequilibrium with distance resulted in both parameters, a and b, being significantly different from zero. For parameter estimation using the existing *r ^{2}*, the mean estimate and 95% confidence interval by meta-analysis across autosomes for parameters a and b were 2.95 (2.84; 3.07) and 106.32 (92.95; 119.70), respectively. The parameters a and b for

*r*corrected by the genomic relationship structure based on SNP were 2.87 (2.75; 2.99) and 122.28 (107.21; 137.34), respectively. The predicted

^{2}*r*from the non-linear regression equation was similar to the mean observed

^{2}*r*of regions with massed bins with discrepancies occurring in other regions for both two types of

^{2}*r*(Figures S1 and S2). Parameter b showed greater variability between chromosomes than parameter a. No such relationship was observed between the estimated parameters (a, b) and chromosome length (cM) as shown in Figure 3 and 4. This reason for the lack of relationship and interpretation of parameter b in this non-linear regression model as an estimated N

^{2}_{e}is demonstrated in the discussion.

### Ancestral effective population size estimation

We observed rapid increase in N_{e} over the past 10 generations with the values increasing fivefold (close to 500) by 10 generations ago (Figure 5). From the past 10 generations to the past 100 generations, N_{e} increased slowly. Our results suggest that a continuous increase in N_{e} occurred over the last 100,000 generations (Figure S3). Overall, the tendencies for the two different *r ^{2}* were similar, but N

_{e}based on

*r*corrected by the genomic relationship structure based on SNP was slightly higher than N

^{2}_{e}based on the existing

*r*. In Figure 5, although the recent difference between the two estimated measures was small, the difference in value increased with time.

^{2}### Estimation of effective population size based on genomic relationship per chromosome

Linkage disequilibrium corrected by the genomic relationship structure per chromosome declined with increasing distance between marker pairs, as shown in Figure S4a and b. Over the first four bins, the decline was greater than two previous results. The mean *r ^{2}* corrected by the genomic relationship structure per chromosome with distance between marker pairs was less than the two previous mean but the change pattern was similar. (Figure S5) In total, 23,783 of marker pairs were in complete linkage disequilibrium; 12,220 of these were adjacent pairs. For parameter estimation using

*r*corrected by the genomic relationship structure per chromosome, the mean estimate and 95% confidence interval by meta-analysis across autosomes for parameters a and b were 2.11 (2.04; 2.18) and 361.73 (335; 339.47), respectively. Parameter b was approximately three times greater than the two previous values for parameter b. The decline in linkage disequilibrium was almost linear with log-transformed distance and the decline in linkage disequilibrium with log-transformed distance was better fitted with predicted values than that of the other two. No such relationship was observed between parameters (a, b) and chromosome length (cM) in

^{2}*r*corrected by the genomic relationship structure per chromosome as shown in Figure S6. Similar to previous patterns, we observed a rapid increase in N

^{2}_{e}over the past 10 generations and then a slow increase up to 100 generations ago. However, the values increased fivefold (close to 1,500) at 10 generation than two kinds of

*r*(Figure S7). From 10 generation ago, our results suggest that continuous increase in N

^{2}_{e}has occurs over the 100,000 generations. (Figure S8) Overall, these tendencies were similar with that of the previous two but N

_{e}based on

*r*corrected by the genomic relationship structure per chromosome was greater than the two previous kinds of

^{2}*r*measures. Although recent difference with two estimated measures was big, their difference in value decreased with generations ago.

^{2}## DISCUSSION

This study provides an overview of linkage disequilibrium in dairy cattle in Korea using a high density SNP panel. Validation work by Khatkar (2008) on their cattle data suggested that the *r ^{2}* correlation between 100 and 1,000 samples is 0.94 (Khatkar et al., 2008). Thus, we can obtain an unbiased picture of linkage disequilibrium (3,385,801 pairs) in our population of 96 dairy cattle. The pattern of decline of linkage disequilibrium in this population is consistent with that reported by Flury (2010) in 128 Swiss cattle (Flury et al., 2010) with existing

*r*remaining higher than approximately 0.3 for distance up to 0.05 Mb. The linkage disequilibrium observed is higher at short distances and more extensive than that observed in human populations (Shifman et al., 2003). Overall,

^{2}*r*values corrected by the genomic relationship structure based on SNP were slightly lower than the existing

^{2}*r*because the correction process prevented the overestimation of linkage disequilibrium.

^{2}Our population consisted of individuals from several genetic origins, which produces linkage disequilibrium between unlinked loci because of differences in allele frequencies. Consequently, such a structured sample can lead to a biased estimate of linkage disequilibrium, which may increase false positives rates. Therefore *r ^{2}* was corrected for by inferring N

_{e}to take into account the non-independence of loci due to population differentiations (Yu et al., 2005). Not corrected

*r*can be a biased estimation of linkage disequilibrium, which could lead to estimated genetic parameters that reduce the power of analysis (Mangin et al., 2012). We used an R package (“LDcorSV”) to calculate

^{2}*r*corrected by the genomic relationship structure based on SNP, which was slightly smaller than the existing

^{2}*r*. This was expected, as we speculated that the calculated

^{2}*r*corrected by the genomic relationship structure based on SNP was more accurate. Thus we can infer an accurate N

^{2}_{e}of dairy cattle in Korea.

The mean value of *r ^{2}* between non-syntenic markers was 0.014248, which provides an approximation of the linkage disequilibrium that can be expected by chance. The value observed here is higher than the value observed by Khatkar (2008) in a sample of over 1,500 cattle (0.0032) (Khatkar et al., 2008). Sample size and genetic sampling (drift) affect the mean of non-syntenic

*r*values, and hence the mean may be expected to decrease with an increase in sample size. The larger non-syntenic value observed by Khatkar (2008) may be more affected by large populations. Since the sample size was smaller in this study than the other studies, the larger non-syntenic values of our dataset are reasonable. We used marker pairs with more than mean of

^{2}*r*for non-syntenic marker pairs as a standard for estimating linkage disequilibrium. Many low

^{2}*r*can cause overestimated N

^{2}_{e}more than expected. Therefore we decided to use marker pairs more than mean of non-syntenic marker pairs for inferring N

_{e}.

Using Sved’s (1971) formula for the expected *r ^{2}*, a nonlinear regression model was fitted to the data to describe the relationship between linkage distance and linkage disequilibrium. Our estimate of parameter a supports an alternative version of Sved’s (1971) equation (Sved, 1971), derived by Tenesa (2007), which accounts for mutation and puts a = 2 (Tenesa et al., 2007). While estimating parameters, the initial value of parameter a was two with this approach. The estimated parameter a ranged from 2 to 3. For the heterogeneity of variance of the observed

*r*, variance of

^{2}*r*declined with increasing distances between markers, which may have impacted our results. A significant negative relationship between chromosome length (cM) and estimates of parameter b from the nonlinear model have been observed (Corbin et al., 2010), while others have observed a positive relationship in domestic livestock species (Khatkar et al., 2008; Muir et al., 2008). In this study, all maker pairs were calculated in each bin so

^{2}*r*was not affected by chromosome length. Thus, we could not observe a relationship between chromosome length (cM) and estimates of b.

^{2}Our estimate of b represents an estimated N_{e} assuming a constant population size. However, this assumption is difficult to maintain, b represents a conceptual average of N_{e} over the period inferred from the range of marker pairs distances (Toosi et al., 2010). Two measures of *r ^{2}* resulted in two different estimates of N

_{e}. N

_{e}based on the existing measure of

*r*is about 106 and N

^{2}_{e}based on

*r*corrected by the genomic relationship structure based on SNP was about 122. Assuming that

^{2}*r*corrected by the genomic relationship structure based on SNP was more accurate than the existing

^{2}*r*, we predict that N

^{2}_{e}of dairy cattle in Korea is about 122. Figure 5, S3, S7, and S8 show historical N

_{e}assuming a linear population following Hayes et al. (2003). The observed pattern shows a rapid increase in N

_{e}up to around 10 generations ago. Several explanations exist for this pattern including bottlenecks associated with domestication, selection and breed formation, and endangerment of the breed. Therefore, it is useful to consider our results in the context of the demographic history of the dairy cattle in Korea. The reliability of this method depends both on the technical implementation and approached used in a previous study approach (Corbin et la., 2010).

To the best of our knowledge, this is a novel study on N_{e} estimation based on *r ^{2}* corrected by the genomic relationship structure per chromosome resulting in an underestimation of inbreeding and thus an overestimation of the current population size. The estimates for recent N

_{e}for dairy cattle in Korea based on

*r*corrected by the genomic relationship structure based on SNP was around 120 individuals in contrast to estimates in the range of 361 using

^{2}*r*corrected by the genomic relationship structure per chromosome. Interesting thing is that recent N

^{2}_{e}for Swiss Eringer breed using pedigree information covering 15 generations (the range of 110 to 321) is three times higher than the linkage disequilibrium-based estimates for recent N

_{e}(around 100 individuals). Thus, we guess that the number of genomic relationship matrix that covers generation depends on how we establish the genomic relationship structure. These differences are important for inferring N

_{e}.

The extent of linkage disequilibrium in a population can be used to estimate the SNP density required for genome-wide association studies to be effective, as well as providing some indication of genomic selection. This has generated thresholds for useful linkage disequilibrium described as the proportion of QTL (quantitative trait locus) variance explained by a marker (Zhao et al., 2005). The consensus is that an average *r ^{2}*>0.3 will permit reasonable sample sizes for genome-wide association studies (Ardlie et al., 2002; Du et al., 2007; Khatkar et al., 2008). In this data set, markers 200 kb apart achieved an average linkage disequilibrium of

*r*= 0.303 excluding marker pairs less than the mean of non-syntenic marker pairs. However, marker pairs with

^{2}*r*= 1, which represent the high variability of

^{2}*r*values at small distances are typically excluded in genomic selection. This is likely due to underestimation of the actual number of SNP needed. Genomic selection appears to be effective at lower average

^{2}*r*with simulation results demonstrating accuracies of up to 0.65 with an average

^{2}*r*between adjacent markers as low as 0.2 and a trait heritability of 0.1 (Calus et al., 2008). Deterministic equations demonstrates that the accuracy of genomic selection can be expressed as a function of the effective number of loci in a population (Daetwyler et al., 2010). The effective number of loci in a population relates to the number of independent chromosome segments and assumes a random mating population. Our dataset covered an effective number of loci. However, because the estimated N

^{2}_{e}was greater than our sample size, we required more preparation to predict the potential accuracy of genomic selection in dairy cattle populations.

We used dense SNP genotype data to characterize linkage disequilibrium and infer the ancestral N_{e} for a sample of dairy cattle. In the population studied, linkage disequilibrium extended for long distances, reaching baseline levels at more than 5 Mb. From the decay in linkage disequilibrium with genetic distance, we inferred the ancestral N_{e} and observed a recent rapid increase in N_{e} which reached approximately 500, 10 generations ago followed by a decrease until the present time. The final results were that we used correction by the genomic relationship structure to ensure accurate derivation of N_{e}, resulting in 122 individuals.

## Supplementary Data

## Acknowledgements

The work was supported by Project (PJ0092602013) of the National Livestock Research Institute of Rural Development Administration, Republic of Korea.