ACCEPTED AUTHOR VERSION OF THE MANUSCRIPT: Genome-wide selection of discriminant SNP markers for breed assignment in indigenous sheep breeds

The assignment of an individual to the true population of origin is one of the most important applications of genomic data for practical use in animal breeding. The aim of this study was to develop a statistical method and then, to identify the minimum number of informative SNP markers from high-throughput genotyping data that would be able to trace the true breed of unknown samples in indigenous sheep breeds. The total numbers of 217 animals were genotyped using Illumina OvineSNP50K BeadChip in Zel, Lori-Bakhtiari, Afshari, Moqani, Qezel and a wild-type Iranian sheep breed. After SNP quality check, the principal component analysis (PCA) was used to determine how the animals allocated to the groups using all genotyped markers. The results revealed that the first principal component (PC 1 ) separated out the two domestic and wild sheep breeds, and all domestic breeds were separated from each other for PC 2 . The genetic distance between different breeds was calculated using F ST and Reynold methods and the results showed that the breeds were well differentiated. A statistical method was developed using the stepwise discriminant analysis (SDA) and the linear discriminant analysis (LDA) to reduce the number of SNPs for discriminating 6 different Iranian sheep populations and K-fold cross-validation technique was employed to evaluate the potential of a selected subset of SNPs in assignment success rate. The procedure selected reduced pools of markers into 201 SNPs that were able to exactly discriminate all sheep populations with 100% accuracy. Moreover, a discriminate analysis of principal components (DAPC) developed using 201 linearly independent SNPs revealed that these markers were able to assign all individuals into true breed. Finally, these 201 identified SNPs were successfully used in an independent out-group breed consisting 96 samples of Baluchi sheep breed and the results indicated that these markers are able to correctly allocate all unknown samples to true population of origin. In general, the results of this study indicated that the combined use of the SDA and LDA techniques represents an efficient strategy for selecting a reduced pool of highly discriminant markers.

individual to the true population of origin is one of the most appealing opportunities to exploit genomic data for practical use (Dimauro et al., 2013;Panetto et al., 2017). Assigning animals to their true populations, where training data (known population membership) are available, is known as discriminant analysis in the statistical literature or supervised learning in the machine learning literature . An assignment method relies on the use of genetic information and statistical procedure to determine the origin of unknown individuals (for a review see Manel et al., 2005;Negrini et al., 2009). In livestock species, DNA-based assignment tests can be used in breed assignment, improvement, conservation and are generally used to trace the origin of animal products (Hulsegge et al., 2019;Schwägele, 2005). Recently, the assignment analysis has been mostly focused on cattle breeding (eg. Negrini et al., 2009;Bertolini et al., 2018); while, given the weakness of pedigree information and consequence limitations of sheep breeding, the development of these methods in sheep, especially indigenous breeds can be on priority of animal breeding researches.
Assignment tests were traditionally carried out by using microsatellites (e.g., Dalvit et al., 2008) or AFLP markers (e.g., Negrini et al., 2007). Compared to these markers, single nucleotide polymorphisms (SNP) offer the advantage such as lower rates of genotyping errors (Weller et al., 2006) and very abundant over the genome (Heaton et al., 2014). Numerous SNPs have been identified in the genome of domestic animals, for example, in the dog (> 14 million) (Plassais et al., 2019), chicken (> 10.4 million) (Ni et al., 2017) and cattle (> 10 million) (Choi et al., 2014). This has led to the technological development of standard products commonly termed 'SNP Chips', which enable the rapid automated large-scale production of genomic data. The high throughout SNP Chips are now available for several species including the Illumina Ovine 50kSNP BeadChip (Illumina Inc., San Diego, CA) for sheep. Although this high density (HD) SNP chips provide the best coverage of the genomes for genomic evaluations, they may not be cost-effective for assigning goals and currently, HD SNP chips are still too expensive for most agricultural genomic applications. Consequently, interest has increased in the use of low density (LD) SNP chips as a cost-effective solution . The practical application of SNP markers for assigning of unknown individuals requires the use of smaller, cost effective and flexible multiplex panels for genotyping (Heaton et al., 2014).
A few statistical approaches have been proposed for assignment goals yet. One of the simplest methods use the Delta values that measures allele frequency differences, and has been explored by Wilkinson et al. (2011) for cattle. Bowcock et al. (1994) suggested that informative genetic markers may be identified using Wright's FST (Wright, 1950) and its derivatives (Weir and Cockerham, 1984). However, these approaches are computationally intensive and their execution may be prohibitively slow with large datasets (Wilkinson et al., 2011). Principal component analysis (PCA) has also been more recently proposed as an alternative method to determine population informative SNP markers. This method has been already used in human populations to characterize their structure based on SNP genotyping data (Paschou et al., 2007) and in cattle to identify breed informative SNPs (Bertolini et al., 2018). A drawback of this method is that it does not easily translate into estimated breed proportions; and its main use would be to verify that an animal is (or is not) allocated to the recorded breed . A popular method for understanding population structure is the model-based clustering method implemented in the program STRUCTURE (Pritchard et al., 2000). This method has been applied to surveys of breed variation in sheep (Kijas et al., 2012). However, it does not produce a prediction equation and requires a re-analysis when new data is added . Discriminant analysis (DA) is a powerful classification technique that is suited when classes are well-separated and predictors have a common covariance structure (Hastie et al., 2009). DA is used when groups are known a priori (unlike in cluster analysis) and has been successfully applied to bovine breed assignment for traceability purposes (Dimauro et al., 2013).
Iran is recognized as the reservoir of genetically unique breeds playing crucial roles in the livelihood of the human populations of this area (Zeder, 2012). Archeological evidence suggests that Iran has been one of the main origins of sheep domestication (Zeder, 2008) and due to its geo-climatic conditions, represents a biodiversity hotspot, with approximately 30 million heads of sheep including 27 breeds and ecotypes well adapted to a patchwork of extremely heterogeneous harsh habitats (Moradi et al., 2012). Livestock has always played a vital role in the agricultural economies of Iran, where commercial livestock enterprises, smallholder and communal farming contributes to food production, and the general wellbeing of rural households. Pedigree information is essential for accurate genetic evaluation; however, in Iranian sheep populations the rate of known sires can vary widely, from a few percent in hardy breeds reared in the rural areas up to 100% in specialized industrial reared breeds (Vatankhah et al., 2004). The lack of complete pedigrees and misidentification of sires affects the accuracy of genetic evaluation and the efficiency of breeding programs (Tortereau et al., 2017). In this situation developing of a statistical method for selecting the discriminant SNPs for breed assigning in Iranian sheep breeds could be crucial where, no report is available yet.
The aim of this study was to identify the minimum number of informative SNPs from high throughput genotyping data for assigning unknown individuals to the true population of origins in Iranian sheep breeds. In this study, two complementary DA statistical techniques including the stepwise discriminant analysis (SDA) and the linear discriminant analysis (LDA) were used to assign independent individuals to the proper group using genomic data.
The results of this study could be further used to develop a low cost customized essay to trace the breed or derived foodstuffs.

Animal samples
The genotypic data of 217 animals, came from distinct geographical conditions of Iran, were collected for this study. They consisted Zel (47 heads), Lori Bakhtiari (47 heads), Afshari (41 heads), Qezel (35 heads), Moghani (35 heads), and a wild-type Iranian sheep breed located in Markazi province (12 heads). The geographic distribution of these breeds has been shown in Figure 1. The breeds used in this study are the most common and main indigenous sheep breeds reared in Iran that have been prioritized over the past few years by animal breeding researches and private investment sectors (Moradi et al., 2017).  (Moradi et al., 2012). The genotype data from other Iranian domestic sheep breeds of Afshari, Ghezel and Moghani were obtained from the international sheep HapMap project that have already described by Kijas et al. (2009). Data on Iranian wild-type sheep breed were also collected from Haftad Qolleh protected area located in the central region of Iran, in Markazi province, Arak. Extraction of DNA in Zel, Lori-Bakhtiari and wild-type sheep breeds was performed using standard protocols. Genotyping was carried out in all animals using Illumina Ovine SNP 50K BeadChip in the Centre for Genomics and Reproduction, Invermay, New Zealand.

SNP quality check
Several steps of quality control were performed by the PLINK v1.07 software (Purcell et al., 2007) to remove animals or SNPs not useful for the analysis in this study. First, all animals with less than 95% of genotyping call rate were discarded from next analysis. It should be noted that the implementation of this step in Zel, Lori-Bakhtiari, and Wild-type breeds was carried out in this study, and for Afshari, Moghani and Qezel breeds were earlier performed in the Sheep HapMap project. Then, according to the standards defined by the Consortium of Wellcome Trust Case, SNPs with call rate less than 95% within each breed and minor allele frequency (MAF) less than 2% in all samples were excluded (Moradi et al., 2012). For the remaining SNPs, the Hardy-Weinberg equilibrium was tested within each of the six populations and SNPs that were not at equilibrium in at least one population (p-value<10 -6 ) were discarded as genotype errors (Moradi et al., 2012). The SNP locations were mapped for ovine genome assembly version 3.1 using information from the Sheep HapMap dataset (http://www.sheephapmap.org) and finally, the SNP markers whose genomic location was unknown were removed from the total markers.

Genetic distances
In this research, the distances between breeds were evaluated using two important methods: Reynolds' distance (Reynolds et al., 1983) and the degree of differentiation FST (Weir and Cockerham, 1984). The Reynolds assumes that the genetic differentiation occurs only by genetic drift and it is based on coancestry coefficients constructed for multi loci data (Dimauro et al., 2015). While, Weir and Cockerham's FST estimator is widely used because it is asymptotically unbiased with respect to sample size, and can compensate for overestimates particularly at low levels of genetic differentiation (Willing et al., 2012). The Reynolds's distance was calculated by using the following formula in R version 3.4.1 software: Where, Xu and Yu, are standing for the frequency of u th alleles in the i th position of X and Y populations, respectively (Reynolds et al., 1983). Wright, (1950) introduced F-statistics to describe the proportion of genetic diversity within and among populations. Unbiased estimates of Weir and Cockerham FST were estimated as functions of variance components and fixation indices as described by Weir and Cockerham, (1984) and was calculated using R version 3.4.1 software in this study as: Where a is standing for the variance of allele frequencies among populations, b is the variance of allele frequencies among individuals within populations, and c showing the variance of allele frequencies among gametes within individuals.

Principal component analysis (PCA)
In this study, to determine how the animals allocated to the groups using all genotyped markers, a principal component analysis (PCA) was performed using the prcomp function in the R version 3.4.1 software. For summarizing data from many variables into a few variables which describe as much of the variation in the data as possible, the variance-covariance matrix of independent variables was first calculated and principal components were extracted.
Each new variable has an associated eigenvalue that measures the respective amount of explained variance .

Discriminant analysis (DA) and validation
To detect a pool of highly discriminant markers based on their correlation structure using a multivariate technique, two complementary approaches including the stepwise discriminant analysis (SDA) and the linear discriminant analysis (LDA) were used based on Illumina's OvineSNP50 BeadChip genotyping data in Iranian sheep breeds. The DA methods have previously been implemented in some other animal species (eg. Dimauro et al., 2013), however, some modifications have been applied in the current study to decrease the false discovery rate (FDR). Since, the number of animals (207) was much lower than the number of markers (~33000) in the current study, the markers located in each chromosome were divided into the subsets according to their locations firstly, so that the number of markers in each subset was lower than the number of animals. The following steps were then used to select markers with the most information for assigning of unknown animals into the true population of origins: 1-the first subset of markers was used for assigning of animals by using SDA and simultaneously the k-fold cross validation was applied. The SDA is a statistical technique specifically conceived to select the subset of variables that better discriminate groups (Hastie et al., 2009). In particular, to determine which variables will mostly contribute in discriminating groups, all variables are reviewed and evaluated in the stepwise process. In this research, the adopted criterion for selecting variables was the squared partial correlation R 2 (Baba et al., 2004). 2-The second subset of markers, along with the discriminant markers identified in the first step, was subjected to the same analysis, to recognize the new markers that able to assign animals to their true populations. The stepwise selection process stops when all variables included in the model meet the criterion to stay and none of the other variables meets the criterion to enter. 3-The third subset of markers was used along with the markers identified in the last steps to perform subsequent analysis as described earlier. These steps were performed for all subsets and applied for all chromosomes to finally identify the SNP markers that are able to assign animals with higher accuracy. One of the benefits of this algorithm was that the discriminant markers identified in each step, were used and analyzed again in the subsequent step, so some of them could be removed. This algorithm is expected to reduce the FDR, or expected proportion of discoveries which are falsely rejected (Korthauer et al., 2019) and increase the ability of markers to assign unknown animals.
The LDA was then used to assign animals to their groups using the largest number of linearly independent markers previously selected by the SDA. In practice, these functions applied to each animal to produce a discriminant score as described by Hastie et al. 2009. An individual was assigned to a particular group if its discriminant score was lower than the cutoff value obtained by calculating the weighted mean distance among group centroids (Hastie et al., 2009). Finally, a new run of the SDA was developed to obtain the minimum number of markers able to significantly separate breeds and to correctly assign individuals.
KLaR package in R v.3.4.1 (Weihs et al., 2005) was used to develop the SDA and LDA analysis in the current study, and the modifications have been also applied to decrease the FDR.
To test the ability of the selected SNPs in assigning new animals to the proper breed and to increase the confidence level at the selection step, K-fold cross-validation technique was employed to evaluate the potential of a selected subset of SNPs in assignment success rate (Dimauro et al., 2013). Genotyped animals that passed the quality control steps (207) were allocated to ten mutually exclusive subsets for cross-validation. Of the ten subsets, nine were used as training data, and the remaining subset was considered as validation data; hence, each subset was predicted once from the other subsets. The correlation between predicted and real population of all animals in the validation population was defined as the predictive ability of discriminant SNPs for each fold. The accuracy was obtained as the mean of accuracies for ten-fold cross-validations.
In order to have a general view of the demographic composition of the animals and studied breeds using the markers identified during the discriminant analysis (DA), the analysis of the discriminant analysis of principal components (DAPC) was also performed by these markers alone using the adegent package in R v 3.4.1 software (Jombart et al., 2008). DAPC was conducted without a piori group assignment by inferring the most likely number of genetic clusters (K) using the find.clusters function in the adgent package. This function utilizes Kmeans clustering to calculate a Bayesian information criterion (BIC) value for each potential value of K (the most likely K has the lowest BIC value) and delineates individual group assignments for DAPC.
Finally, to test the ability of selected markers in assigning unknown individuals to their true population of origin in Iranian indigenous sheep breeds, an additional testing has also been implemented for assessing the traceability of predictive model at the out-group breed.
In particular, 96 animals belong to Baluchi sheep breed were used for this goal, that reared in eastern parts of Iran and genotyped using the Illumina Ovine SNP50k Beadchip that fully described previously by (Gholizadeh et al., 2014).
The presence of the genes closes to the discriminant markers identified in this study, were acquired by the use of the data mining tool Biomart (Hubbard et al., 2009), by exploring 20kb upstream and downstream from the marker (Carignano et al., 2018).

Genetic distances
One of the common questions in population genetics and assignment analysis is the degree of differentiation between populations (Dimauro et al., 2015). The genetic distances between different populations using Reynolds' distance and the degree of differentiation FST have been shown in Table 2.

Principal component analysis (PCA)
Scatterplot of the first four principal components using genotypes information was shown in Figure 2. The results of PCA revealed that all animals allocated to the groups that they sampled and the principal components of PC1 separated out the two domestic and wild sheep breeds and all domestic breeds were separated from each other for PC2, although, there is some limitation for the separation of the Moghani and Qezel sheep breeds (Figure 2). In this analysis, PC1, PC2, PC3 and PC4 accounted for 35%, 17%, 12%, and 8% of the total variance, respectively.    The population clustering of the animals and breeds by using 201 discriminant SNP markers identified in this study, was evaluated by discriminant analysis of principal components (DAPC) and the DAPC chart is shown in Figure 5. Based on the Bayesian information criterion (BIC) statistic, the optimal number of clusters in the data set with the lowest BIC value was K=4, showing four clusters generated by DAPC.

Several studies have shown the usefulness of SNP data for identifying breed informative
SNPs for discrimination among breeds in a variety of species like cattle (e.g., Negrini et al., 2009), pigs (Ramos et al., 2011), goats (Talenti et al., 2016) and bees (Muñoz et al., 2015).
However, a few studies have already been set up for developing the SNP panels in sheep at the International level (Tortereau et al., 2017), and no report is available for selecting the discriminant SNPs for breed assigning in Iranian sheep breeds.
As shown in Table 1 The results of genetic distances indicated that the lowest genetic distance was observed between Qezel-Moghani. These two sheep breeds are commonly distributed in the same geographical regions in northwestern of Iran, located in Ardebil and East Azerbaijan provinces. The lowest genetic distance observed between these breeds corresponded to their geographical distribution. Similar results were obtained by Ciani et al. (2014) who observed, a phylogeographic gradient, almost completely overlapping the areas of origin of breeds in the genetic variability of Italian breeds. Dimauro et al. (2015) were also reported that the genetic distances between Italian sheep breeds are reflecting the geographic distributions using Reynolds' distances between groups confirming the ability of the this method for representing the genetic differences between groups.
Another interesting result of the FST test was observed for comparing the genetic distance of wild-type Iranian sheep breeds with other domesticated breeds. As shown in Table 2 and Supplementary Figure 1, the lowest genetic distance between the wild-type sheep breed was obtained with the Zel breed (0.185). Zel is the only Iranian thin-tail sheep breed (Moradi et al., 2012). This breed is also known as the Aryan breed since the historical evidence shows that the Aryans, who were living in these areas, attempted to domesticate these animals.
Hence, there is a general belief that Zel is the contour of wild and domesticated strains in Iran.
The nearest distance observed in this study between Zel and wild-type sheep breeds may also be used to confirm this hypothesis.
PCA is an unsupervised linear technique for dimension reduction and allows extracting axes of maximal variation from data sets (Jollife and Cadima, 2016). This method is often used for clustering analysis when classes are not well-known and animals of the same breed tend to be located close together in plots of the PCA (Manel et al., 2005). Since, it was assumed that some of the animals used in the current study may not be pure and unmanaged crossbreeding could be occurring in their herds, the PCA was used to determine how the animals allocated to the groups using all genotyped markers. The results of this study indicated that although the PCA approach was able to separate the breeds from each other, but face some limitation for the separation of the Moghani and Qezel breeds where showed closest genetic distances using FST and Reynolds methods (Figure 2).

Discriminant analysis (DA) is a powerful classification and a supervised statistical learning
technique that is suited when classes are well-separated and predictors have a common covariance structure (Manel et al., 2005). Then DA, unlike clustering analysis, uses a priori knowledge on which observation belongs to which group. Our results indicated that a total number of 201 SNP markers were able to exactly discriminate all sheep populations (even Qezel and Moghani animal samples) with 100% accuracy. In general, it is difficult to compare the results obtained here to other studies, since first, most previous studies used microsatellite or AFLP markers (e.g., Dalvit et al., 2008;Negrini et al., 2007) and, second, these studies had only a limited number of markers or have used different statistical approaches. These studies also primarily focused on the practicality of assigning individuals among different breeds with the available markers and were not concerned with how many markers would be required to achieve confident assignment of individual genotypes (Wilkinson et al., 2011). Maudet et al. (2002) reported 23 microsatellite loci that were able to assign 93% of individuals to their breed of origins in French cattle breeds. Assignment tests based on genotypes derived from microsatellite markers (Dalvit et al., 2008) and AFLP (Negrini et al., 2007) have also been reported in Italian cows. Negrini et al. (2009)  It should be considered that the number of markers required for population assignment will depend on the species, the level of genetic differentiation and the statistical methods under consideration (Wilkinson et al., 2011). For instance, within dogs 27% of the genetic variation is found between breeds, whereas for humans the level between populations is only 5%-10% (Parker and Ostrander, 2005). Meadows et al. (2005) studied the mitochondrial genome diversity in different Asian and European sheep breeds, and reported that sheep has largest gene flow and weakest population structure in domestic animals, so that only 2.7% of the total genetic diversity is intercontinental diversity, whereas it is 10 and 50 percent for goat and cow, respectively. Moradi et al. (2017)

using mitochondrial markers in Iranian indigenous
sheep breeds reported that only 3% of the observed genetic variation is found between breeds, compared to 97% variation within breeds. As a result, the number of SNP markers required for individual assignment and discrimination amongst populations (breeds) will differ between species under consideration and it seems it would be encountered with more restrictions in sheep than many other domestic animals.
As mentioned earlier, a few studies applied assignment techniques in sheep breeds. As shown in figure 4, around 40% of breed selection markers identified in the current study, located in the first three chromosomes, whereas the others were arranged in the remaining portion of the genome. These results were in agreement with Dimauro et al. (2015) where reported that more than 40% linearly independent markers able to significantly discriminate 21 Italian sheep breeds based on their geographic distribution were located in the first three chromosomes.
In the DAPC method, the researcher does not suggest the number of populations, but it is suggested by statistical method. This approach is based on the assumption that the markers are not related and the mating in the populations is randomized (Jombart et al., 2010). Furthermore, the model independent of DAPC method is better than PCA for population stratification correction, because it increases the variance between groups and decreases the variance within the groups. The correct assignment rate or accuracy of the DAPC method in assigning individuals to groups varies from 80% to 97% depending on the number of replicates (Jombart et al., 2010). With this in mind, the results of the DAPC analysis of Iranian sheep breeds ( Figure 5) indicated that these breeds are clustered in at least four different groups, which have a fair distance to each other in terms of genetic variation, although the relation between the Afshari, Maghoni and Qezel breeds is closer. The results of this analysis are also consistent with the geographical distribution of these breeds, with the least distance between the Qezel and Moghani breeds, and also the Zel breed (The only Iranian thin tailed sheep breed) can be seen between the wild type sheep breed and the domesticated fat tailed breeds ( Figure 5).
Finally, to assess the traceability of predictive model at the out-group breed, the method developed in this study was applied in an external dataset including individuals of Baluchi sheep breed by only using these 201 independent SNP markers. This breed has the largest population than other breeds of sheep in Iran (about 30% sheep population) that is fat-tailed and dual-purpose breed (wooly and meaty), which is well adapted to the dry and hot climate conditions of eastern Iran (Gholizadeh et al., 2014). Our results indicated that by only using these 201 linearly independent markers, all sheep in the external validation datasets were correctly assigned (Supplementary Figure 2).
In general, the results of our study revealed that the statistical method used in this study was able to identify a panel of 201 SNPs markers that exactly discriminate all Iranian sheep populations in the breeds used to select these markers, and also in the new data from external out-group breed. Likewise, it was able to cover some drawbacks encountered in other standard statistical methods, for example the slow execution with large datasets in FST (Wilkinson et al., 2011), the limitations for discriminating of closely related breeds in PCA , and the requirement of re-analysis for adding new data in Structure . The results of our study can provide valuable information for developing a method in designing a low-cost SNP assay for assigning unknown animals to their true population of origins in other sheep breeds. It is expected that the genotyping of 201 discriminant SNP loci selected from ~50000 SNPs, available on Illumina Ovine 50KSNP BeadChip, will commercially decrease the cost of the genotyping for assignment goals using assays such as Sequenom multiplex MassARRAY (Heaton et al., 2014 (Allen et al., 2010). This result was in agreement with previously published manuscript (Sottile et al., 2018).

Conclusion
In this study, the genomic information of Iranian indigenous sheep breeds was used to develop a statistical method and therefore, to identify a minimal set of informative SNPs for assignment of breeds. At first, the results of genetic distance statistics showed that the different breeds were well differentiated and their genomic information could be used for determining of discriminant SNP markers. In order to have a general view of the population structure of the studied breeds, the principal components analysis (PCA) was performed by all SNP markers that passed the quality control steps. The results of this analysis showed that all sheep breeds differentiated from each other on the PC1 and PC2. Next, to identify the SNP markers with maximum information for assigning animals to their true populations, discriminant analysis functions were applied. The results showed that using the total number of 201 SNP markers from the total markers available on the Ovine 50k SNP Chip, we are able to completely separate all animals with 100% accuracy, that about 40% of the selected SNP markers were located on the first three chromosomes. The DAPC analysis using 201 independent discriminant SNP markers shows that Iranian sheep breeds are clustered in at least four different groups, which are distinguished from each other in terms of genetic distances. Finally, to verify the effectiveness of the selected SNPs, these markers were used for assigning the samples of an independent Iranian sheep breed namely Baluchi, and the results showed that we can distinguish and assign all new samples to their true breed of origin by only using these 201 markers. In general, the statistical methods developed in this study and SNP markers identified can provide valuable information to facilitate the design of accurate and low-cost SNP assay for assigning unknown animals to their true population of origins in indigenous sheep breeds.

Conflict of interest
The authors declare that they have no conflict of interest.