Analysis of genetic characteristics of pig breeds using information on single nucleotide polymorphisms

Objective This study was undertaken to investigate the genetic characteristics of Berkshire (BS), Landrace (LR), and Yorkshire (YS) pig breeds raised in great grandparent (GGP) pig farms using single nucleotide polymorphism (SNP) information. Methods A total of 25,921 SNP markers common to the three pig breeds were used to estimate the expected heterozygosity (HE), polymorphism information content, F-statistics (FST), linkage disequilibrium (LD), and effective population size (Ne). Results The chromosome-wise FST values among the BS, LR, and YS populations were within the range of 0–0.36, and the average FST was estimated to be 0.07±0.06, indicating some level of genetic segregation. The average LD (r2) for the BS, LR, and YS breeds was estimated to be approximately 0.41. The average Ne over the last five generations was 19.9 (BS), 31.4 (LR), and 34.1 (YS). The Ne of the BS, LR, and YS breeds decreased at a consistent rate from 50 to 10 generations ago. The relatively faster decline in Ne over the last 10 generations is possible evidence of intensive selection practices in pigs in the recent past. Conclusion To develop customized chips for the genomic selection of various breeds, it is important to select and utilize SNP based on the genetic characteristics of each breed. Since the improvement efficiency of breeding pigs increases sharply with population size, it is important to increase the number of test units, and it is desirable to establish a pig improvement network system that expands the unit of improvement through genetic connections among breeding pig farms.


INTRODUCTION
Investigating the genetic architecture is the first important step towards genomic selection for the improvement of pig breeds. Genetic information on breeding pigs is now accumulating. Once a reference population is fully established, genomic selection will become possible and can increase selection accuracy by combining genomic and phenotypic data with pedigree information [1].
Comprehensive information on genetic diversity and introgression is essential for improving national breeding programs as well as for designing conservation programs. In the past, genetic diversity in pigs was mostly reported using microsatellite markers [2,3] and mtDNA [4]. However, single nucleotide polymorphisms (SNP) have advantages over microsatellites and mtDNA: they represent the major source of genetic variation, show low mutation rates, and are associated with complex heritable traits [5]. Thousands of SNP are now readily available with the advent of next generation sequencing technology [6]. Among the various high-density SNP panels, the Illumina Porcine 60k Bead Chip allows a more precise and comprehensive genome-wide investigation of genetic diversity and of the degree of admixture among pig breeds [7][8][9].
Linkage disequilibrium (LD) within a population can help determine the relationships among SNP that affect economic traits, map quantitative trait loci (QTL), and select tagging SNP. Additionally, the LD between SNP at a specific physical distance can be used to estimate the effective population size and to characterize genetic diversity [10]. Furthermore, the QTL governing economic traits in pigs have been studied frequently, primarily through genome-wide association studies using single marker regression [11][12][13].
This experiment was conducted to investigate the genetic characteristics and effective population sizes of Berkshire, Landrace, and Yorkshire pig breeds raised in the great grandparent (GGP) farms using the SNP information.

MATERIALS AND METHODS
Description of single nucleotide polymorphism data
A total of 3,710 pigs, consisting of Berkshire (1,615), Landrace (1,041), and Yorkshire (1,054), were genotyped using Porcine SNP 60k, and 61,565 SNP were collected.
To ensure the quality of the genotypic data, the following SNP were excluded from the analysis: SNP on the sex chromosomes, SNP without chromosome information, SNP with a missing rate higher than 10%, SNP without polymorphism (all homozygous or all heterozygous), SNP with a minor allele frequency below 1%, and SNP with a Hardy-Weinberg disequilibrium chi-squared value above 23.93 (p<10^-6). Animals with an SNP missing rate above 10% were also excluded.
In the Berkshire, Landrace, and Yorkshire breeds, 30, 3, and 19 pigs, respectively, had an SNP missing rate higher than 10%. Therefore, the numbers of pigs (SNP) remaining after quality control in Berkshire, Landrace, and Yorkshire were 1,585 (38,962), 1,038 (26,392), and 1,035 (40,783), respectively. In this study, only the 25,921 SNP common to the three breeds were used for analyses.
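The marker filters described above can be sketched in a few lines. This is a minimal illustration, not the software used in the study: the function names `hwe_chi2` and `passes_qc`, the genotype-count inputs, and the example counts are hypothetical, while the thresholds (10% missing rate, 1% minor allele frequency, chi-squared value of 23.93) are those stated in the text.

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-squared statistic (1 df) for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Genotype classes with zero expectation are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              max_missing=0.10, min_maf=0.01, max_chi2=23.93):
    """Return True if one SNP survives the marker-level filters (hypothetical helper)."""
    n_called = n_aa + n_ab + n_bb
    n_total = n_called + n_missing
    if n_called == 0 or n_missing / n_total > max_missing:
        return False  # missing rate higher than 10%
    p = (2 * n_aa + n_ab) / (2 * n_called)
    if min(p, 1.0 - p) < min_maf:
        return False  # monomorphic or minor allele frequency below 1%
    if n_ab == n_called:
        return False  # every animal heterozygous: no usable polymorphism
    return hwe_chi2(n_aa, n_ab, n_bb) <= max_chi2


# Hypothetical genotype counts for a single SNP in 1,000 animals.
print(passes_qc(250, 500, 250, 0))   # genotypes in exact Hardy-Weinberg proportions
print(passes_qc(400, 200, 400, 0))   # strong heterozygote deficit
```

The animal-level filter (SNP missing rate above 10% per pig) would be applied in the same way along the other axis of the genotype matrix.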

Statistical models
Expected heterozygosity: The expected heterozygosity (HE) of a locus is defined as the probability that an individual is heterozygous in the population.
$$H_E = 1 - \sum_{i=1}^{n} p_i^2$$
where p_i is the frequency of the ith allele of the n alleles [14].
Polymorphism information content: The polymorphism information content (PIC) refers to the value of a marker for detecting polymorphism within a population, depending on the number of detectable alleles and the distribution of their frequencies [15].
$$PIC = 1 - \sum_{i=1}^{n} p_i^2 - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 2 p_i^2 p_j^2$$
where n is the total number of alleles, and p_i and p_j are the frequencies of the ith and jth alleles in the population, respectively [16]. The PIC is defined as the probability that the marker genotype of a given offspring will allow deduction, in the absence of crossing over, of which of the two marker alleles of the affected parent it received [17].
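As a small worked example of the two diversity measures, the sketch below evaluates HE and PIC from a vector of allele frequencies. It assumes the standard definitions, HE = 1 - sum(p_i^2) and PIC = HE - sum over i<j of 2 p_i^2 p_j^2; the function names and the example frequencies are hypothetical.

```python
def expected_heterozygosity(freqs):
    """HE = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs):
    """PIC = HE minus the sum over allele pairs i < j of 2 * p_i^2 * p_j^2."""
    n = len(freqs)
    cross = sum(2.0 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(n - 1)
                for j in range(i + 1, n))
    return expected_heterozygosity(freqs) - cross


# Biallelic SNP with hypothetical allele frequencies 0.7 and 0.3.
freqs = [0.7, 0.3]
print(expected_heterozygosity(freqs))           # about 0.42
print(polymorphism_information_content(freqs))  # about 0.33
```

For a biallelic SNP the subtracted term is 2p²q², which is positive whenever both alleles are present, so PIC is always below HE at a polymorphic marker; the maxima, reached at p = 0.5, are 0.5 for HE and 0.375 for PIC.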
F-statistics (FIS, FST, FIT): The F-statistics were used for comparing the genetic characteristics among the breeds. Three parameters were considered: F, the correlation of genes within individuals; θ, the correlation of genes of different individuals in the same population; and f, the correlation of genes within individuals within populations.
These parameters are related to Wright's F-statistics [18] as:
$$F = F_{IT}, \quad \theta = F_{ST}, \quad f = F_{IS}$$
For F-statistics, FIT represents the degree of genetic fixation of a breed, FIS represents the degree of inbreeding of individuals in a population, and FST indicates the degree of genetic segregation of the populations. Using the F-statistics, paired t-tests were performed among breeds with all SNP.
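To make the meaning of FST concrete, the sketch below computes a simple Wright-style FST for one biallelic SNP as (HT - HS)/HT from the allele frequencies of each breed. This is an illustrative simplification with equal breed weights by default, not the estimator used in the study; the function name and the example frequencies are hypothetical.

```python
def fst_biallelic(freqs, sizes=None):
    """Wright-style FST = (HT - HS) / HT for one biallelic SNP.

    freqs: frequency of the same allele in each subpopulation (breed).
    sizes: optional subpopulation sizes used as weights (equal weights if omitted).
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = 2.0 * p_bar * (1.0 - p_bar)   # heterozygosity of the pooled population
    if h_t == 0.0:
        return 0.0                       # marker fixed everywhere: no differentiation
    h_s = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, freqs))
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in three breeds.
print(fst_biallelic([0.5, 0.5, 0.5]))  # identical frequencies
print(fst_biallelic([0.2, 0.4, 0.6]))  # moderate differentiation
```

Identical allele frequencies give FST = 0, and breeds fixed for different alleles give FST = 1, which is why larger values are read as stronger genetic segregation.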
Linkage disequilibrium: The LD between two loci can be measured by the standardized LD value D' [19] or by r2 [20]. However, D' depends on the frequencies of the individual alleles and can be overestimated when the population size or the allele frequency is small, whereas r2 is less dependent on allele frequencies. Therefore, LD was estimated using r2. The measure of LD was expressed as the square (r2) of the correlation coefficient between SNP pairs, and was calculated between each allele at locus A and each allele at locus B [21]. The r2 value was calculated by the formula:
$$r^2 = \frac{D^2}{P_A P_a P_B P_b}$$
where D = P_AB - P_A P_B, and P_A, P_a, P_B, and P_b are the frequencies of alleles A, a, B, and b, respectively. P_AB is the frequency of the haplotype AB.
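The contrast between the two LD measures can be illustrated numerically. The sketch below assumes the usual definitions, r2 = D²/(P_A P_a P_B P_b) and D' = D/Dmax with D = P_AB - P_A P_B; the function names and the haplotype frequencies are hypothetical.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation r2 between alleles A and B from haplotype frequency p_ab."""
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("r2 is undefined when a locus is monomorphic")
    return d * d / denom


def ld_d_prime(p_ab, p_a, p_b):
    """Standardized disequilibrium D' = D / Dmax (absolute value)."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    if d_max == 0.0:
        raise ValueError("D' is undefined when a locus is monomorphic")
    return abs(d) / d_max


# Rare allele A (0.1) that always occurs with allele B (0.5).
print(ld_d_prime(0.1, 0.1, 0.5))  # D' reaches its maximum
print(ld_r2(0.1, 0.1, 0.5))       # r2 stays low
```

In this example D' equals 1 while r2 is only about 0.11, which shows how strongly D' responds to a rare allele and why r2 is the more conservative summary.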
Haplotypes within haploblocks were obtained using the expectation maximization (EM) algorithm, similar to the partition/ligation method [22]. When both loci are homozygous, or when one of the two loci is homozygous, the haplotype frequencies can be counted directly; when both loci are heterozygous, however, coupling (A1B1/A2B2) and repulsion (A1B2/A2B1) cannot be distinguished by DNA chip analysis. Therefore, the EM algorithm [22], which determines maximum likelihood estimates of the parameters of a probability model that depends on unobserved latent variables, was used to calculate the conditional probabilities of coupling (A1B1/A2B2) and repulsion (A1B2/A2B1), and the LD value was estimated iteratively until the change between iterations was less than 10^-5 [23].
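The EM step for two loci can be sketched as follows. This is a minimal two-locus illustration under stated assumptions, not the partition/ligation implementation cited in the text: the function name, the 3x3 genotype-count layout, and the choice to apply the 10^-5 stopping rule to the haplotype frequencies are all hypothetical.

```python
def em_haplotype_freqs(counts, tol=1e-5, max_iter=1000):
    """Estimate two-locus haplotype frequencies by EM.

    counts[i][j] = number of animals carrying i copies of allele A1
    and j copies of allele B1 (i, j in 0..2).
    Returns a dict with keys 'A1B1', 'A1B2', 'A2B1', 'A2B2'.
    """
    n = sum(sum(row) for row in counts)
    n_dh = counts[1][1]  # double heterozygotes: phase unknown
    # Haplotype counts from genotypes whose phase is unambiguous.
    n11 = 2 * counts[2][2] + counts[2][1] + counts[1][2]
    n12 = 2 * counts[2][0] + counts[2][1] + counts[1][0]
    n21 = 2 * counts[0][2] + counts[1][2] + counts[0][1]
    n22 = 2 * counts[0][0] + counts[1][0] + counts[0][1]
    f = [0.25, 0.25, 0.25, 0.25]  # starting values
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is in coupling phase.
        coupling = f[0] * f[3]
        repulsion = f[1] * f[2]
        total = coupling + repulsion
        p_c = coupling / total if total > 0 else 0.5
        # M-step: update haplotype frequencies with the expected counts.
        new = [(n11 + n_dh * p_c) / (2.0 * n),
               (n12 + n_dh * (1.0 - p_c)) / (2.0 * n),
               (n21 + n_dh * (1.0 - p_c)) / (2.0 * n),
               (n22 + n_dh * p_c) / (2.0 * n)]
        converged = max(abs(a - b) for a, b in zip(new, f)) < tol
        f = new
        if converged:
            break
    return dict(zip(("A1B1", "A1B2", "A2B1", "A2B2"), f))


# Hypothetical counts: only A1B1 and A2B2 haplotypes segregate (complete LD).
counts = [[25, 0, 0],
          [0, 50, 0],
          [0, 0, 25]]
print(em_haplotype_freqs(counts))
```

With these counts the estimates converge to frequencies close to 0.5 for A1B1 and A2B2 and close to 0 for the other two haplotypes, so every double heterozygote is assigned to the coupling phase.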
Effective population size: The effective population size was determined based on a simple expectation from the amount of LD and a given chromosome segment. Since LD breaks down more rapidly over the generations for loci that are further apart, LD at large distances reflects Ne at recent generations [24].
$$N_e = \frac{1}{4c} \left( \frac{1}{\bar{r}^2} - 1 \right)$$
where Ne is the effective population size t generations ago, c is the recombination distance between the SNP in Morgans, c = 1/(2t), and \bar{r}^2 is the mean value of r2 for markers that are c Morgans apart. It was assumed that 1 Mb of physical distance corresponds to 1 cM of genetic distance.
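The link between marker distance, generations, and Ne can be sketched numerically. The example assumes the commonly used expectation r2 = 1/(4 Ne c + 1), which rearranges to Ne = (1/(4c)) (1/r2 - 1), together with t = 1/(2c) and the stated conversion of 1 Mb to 1 cM; the function names and the r2 value are hypothetical and are not results of the study.

```python
def mb_to_morgan(distance_mb):
    """Convert physical distance to Morgans, assuming 1 Mb = 1 cM = 0.01 Morgan."""
    return distance_mb / 100.0


def generations_ago(c_morgan):
    """Generation t reflected by markers c Morgans apart, from c = 1 / (2t)."""
    return 1.0 / (2.0 * c_morgan)


def ne_from_r2(mean_r2, c_morgan):
    """Effective population size from the mean r2 of markers c Morgans apart."""
    return (1.0 / (4.0 * c_morgan)) * (1.0 / mean_r2 - 1.0)


# Markers 10 Mb apart with a hypothetical mean r2 of 0.1.
c = mb_to_morgan(10)
print(generations_ago(c))   # 5 generations ago
print(ne_from_r2(0.1, c))   # Ne of 22.5
```

Because t = 1/(2c), widely spaced marker pairs inform recent generations and closely spaced pairs inform the distant past; for example, pairs 1 Mb apart correspond to 50 generations ago.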

RESULTS AND DISCUSSION
Expected heterozygosity
Figure 1 illustrates the distribution of expected heterozygosity for chromosome-wise SNP in the studied pig breeds. All three pig breeds showed a similar trend in the HE estimates. The estimates of the average HE in Berkshire, Landrace, and Yorkshire were 0.33±0.15, 0.36±0.14, and 0.36±0.14, respectively. The average HE was lowest in Berkshire and the same in Landrace and Yorkshire. Ai et al [25], in a study of the genetic diversity of 18 pig breeds using the 60K SNP chip, reported a similar expected heterozygosity (0.38) for Landrace and Large White, which agrees with the results of this study. The average HE was highest in Sus scrofa chromosome 6 (SSC6; 0.36±0.15) of Berkshire, SSC18 (0.38±0.12) of Landrace, and SSC14 (0.38±0.13) and SSC16 (0.38±0.13) of Yorkshire. In contrast, the lowest HE estimates were found in SSC15 (0.29±0.16), SSC10 (0.34±0.13), and SSC1 (0.34±0.14) in Berkshire, Landrace, and Yorkshire pigs, respectively (Figure 1).

Polymorphism information contents
The estimates of PIC obtained using the HE values represented polymorphism information on each gene locus [16]. The estimates of the average PIC in Berkshire, Landrace, and Yorkshire breeds were 0.26±0.11, 0.28±0.10, and 0.29±0.10, respectively.
Across the chromosomes, the estimates of PIC for the SNP were highest in SSC6 (0.28±0.10) of Berkshire, SSC18 (0.30±0.08) of Landrace, and SSC14 and SSC16 (0.30±0.09) of Yorkshire. On the other hand, the lowest values of PIC were observed for SSC15 (0.23±0.12) in Berkshire, for SSC10 (0.27±0.09) in Landrace, and for SSC1 (0.27±0.10) in Yorkshire. Overall, the estimates of PIC were lower than those of the average HE (Table 1).

Pairwise t-test
Using the estimates of the average HE and PIC in each breed, pairwise t-tests were performed across the breeds.
For the HE estimates, no significant difference (at p<0.05) was found in SSC1 and SSC8 between Berkshire and Landrace; in SSC6 and SSC8 between Berkshire and Yorkshire; or in SSC2, SSC3, SSC8, SSC10, SSC12, SSC15, and SSC17 between Landrace and Yorkshire. For the average PIC estimates, no significant difference (at p<0.05) was found in SSC1 and SSC8 between Berkshire and Landrace; in SSC6 and SSC8 between Berkshire and Yorkshire; or in SSC1, SSC2, SSC3, SSC8, SSC10, SSC12, and SSC17 between Landrace and Yorkshire (Table 2).
However, the pairwise t-tests using all SNP revealed significant differences (p<0.01) in the estimates of the average HE and PIC among the Berkshire, Landrace, and Yorkshire breeds (Table 2). Edea et al [26] reported that the HE estimate was lowest in the Berkshire breed (0.31±0.17) and highest in the Landrace breed (0.42±0.22), while that of the Yorkshire breed was 0.35±0.17. The results of this study were consistent with those of Edea et al [26], with the estimates of expected heterozygosity showing the same pattern (Berkshire, 0.327±0.017; Landrace, 0.363±0.012; and Yorkshire, 0.361±0.011).

F-statistics
To investigate differences in the genetic characteristics, F-statistics were estimated among the Berkshire, Landrace, and Yorkshire populations. The estimates of FST by chromosome among breeds were in the range of 0 to 0.36, and the distributions of FST for chromosome-wise SNP are shown in Figure 2. A previous study reported FST values of 0.22 for Berkshire vs Landrace, 0.24 for Berkshire vs Yorkshire, and 0.20 for Landrace vs Yorkshire [26]. As the FST value by chromosome among breeds increased, the frequency of SNP decreased, and the same trend was shown in all chromosomes.
The FST value among breeds was less than 0.05 for 12,008 SNP (46.3%), between 0.05 and 0.2 for 12,901 SNP (49.8%), and more than 0.2 for 1,012 SNP (3.9%). The average FST across all chromosomes was 0.07±0.06. This result indicated that some genetic segregation has occurred.

Linkage disequilibrium
The average physical distance between adjacent SNP pairs by chromosome was largest in SSC6 (126.59 kb) and smallest in SSC14 (66.73 kb), and the overall average distance was 94.09 kb (Table 3). A total of 22,571,445 SNP pairs were used to estimate LD (r2). The estimates of the average r2 between adjacent SNP were 0.411, 0.408, and 0.413 in Berkshire, Landrace, and Yorkshire, respectively. Similar results were reported for Landrace, Yorkshire, Hampshire, and Duroc in the USA, with estimates of 0.36, 0.39, 0.44, and 0.46, respectively [27]. However, Uimari and Tapio [28] reported estimates of 0.47 (Yorkshire) and 0.49 (Landrace) in Finland, which were higher than those of our results. Across the chromosomes, the estimate of r2 between adjacent markers was highest in SSC1 of the Berkshire breed (0.47), SSC14 of the Landrace breed (0.49), and SSC1, SSC13, and SSC14 (0.47) of the Yorkshire breed (Table 4).
The values of r2 decreased with increasing distance between SNP pairs (Figure 3), and the most rapid decline was observed over the first 2 Mb. Beyond that, r2 decreased more slowly with increasing distance and was constant after 5 Mb [28]. In each breed, the pattern and magnitude of the LD decline at distances of less than 10 Mb were similar.

Effective population size
It can be predicted that when the LD (r2) between SNP located at close physical distances is low, genetic recombination at that locus occurred a long time ago. Similarly, when the r2 between SNP located at far physical distances is high, genetic recombination at that locus occurred recently. The extent of genetic recombination can be estimated from the population size, while the Ne across the generations can be estimated from the r2 [10,29]. The Ne for the Berkshire, Landrace, and Yorkshire over the last one to five generations was estimated to be 19.87, 31.41, and 34.09 pigs, respectively (Figure 4). A previous study reported that the Ne of the Landrace and Yorkshire in Finland was approximately 80 and 55 pigs, respectively [28]. The effective population sizes estimated here were small compared with those of countries with advanced pig industries, since the scale of domestic GGP farms was relatively small. Additionally, closed herds have been maintained and inbreeding mating systems have been applied.
In Berkshire, the past Ne changed noticeably from 50 to 5 generations ago, from 97.7 to 50, with a gradual increase in the rate of decline per generation (0.8% to 9.7%). Similar declines in Ne were also observed in Landrace (100.2 to 50) and Yorkshire (102.3 to 34.1) pigs, with broadly similar rates of decline. The Ne for the Berkshire, Landrace, and Yorkshire decreased at a constant slope from 50 to 10 generations ago, with a sharp decrease in the most recent 10 generations. Similar results were reported by Uimari and Tapio [28]. These results suggest that intensive artificial selection has been practiced over the most recent 10 generations (Figure 4).

CONCLUSION
To develop customized chips for the genomic selection of various breeds, it is important to select and utilize SNP based on the genetic characteristics of each breed. Since the improvement efficiency of breeding pigs increases sharply with population size, it is important to increase the number of test units, and it is desirable to establish a pig improvement network system that expands the unit of improvement through genetic connections among breeding pig farms.

CONFLICT OF INTEREST
We certify that there is no conflict of interest with any financial organization regarding the material discussed in the manuscript.