Polyploids, organisms with more than two copies of their genome, are ubiquitous in the plant kingdom, predominant in agriculture, and important drivers of evolution. Biologists have thus spent significant resources studying these organisms. I was surprised, then, to find that very little work had been done to characterize and estimate linkage disequilibrium (LD) in polyploids.

LD is the statistical association between two alleles at different loci. It is used all over the place in computational biology and population genetics, but the literature says very little about LD in polyploids. So in Gerard (2021a), I provided a series of characterizations of LD in polyploids. I discuss measures that are both "haplotypic" (alleles are on the same haplotype) and "composite" (alleles may be on separate haplotypes). The haplotypic measures are old hat; the composite measures are the real novelty. They are useful because you only need genotype (dosage) information to estimate them, which is a bonus for polyploids since haplotypic phase is really hard to come by with current technologies.
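
To make the "dosage only" point concrete, here is a minimal Python sketch of a composite LD calculation. It is an illustration under stated assumptions, not the estimator from Gerard (2021a): the function name `composite_ld` and the example dosages are hypothetical, and dividing the dosage covariance by the ploidy is motivated by the fact that, when the haplotypes within an individual are independent, the covariance between dosages equals the ploidy times the haplotypic LD coefficient.

```python
from math import sqrt


def composite_ld(dosage_a, dosage_b, ploidy):
    """Composite LD from genotype dosages alone (no phase needed).

    Returns (delta, r): the sample covariance of the dosages divided by
    the ploidy, and the Pearson correlation between the dosage vectors.
    """
    n = len(dosage_a)
    if n != len(dosage_b) or n < 2:
        raise ValueError("need two dosage vectors of equal length >= 2")
    mean_a = sum(dosage_a) / n
    mean_b = sum(dosage_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(dosage_a, dosage_b)) / (n - 1)
    var_a = sum((a - mean_a) ** 2 for a in dosage_a) / (n - 1)
    var_b = sum((b - mean_b) ** 2 for b in dosage_b) / (n - 1)
    if var_a == 0 or var_b == 0:
        raise ValueError("a monomorphic locus has no defined correlation")
    return cov / ploidy, cov / sqrt(var_a * var_b)


# Hypothetical tetraploid dosages (0 to 4 copies of the reference allele).
locus_a = [0, 1, 2, 3, 4, 2, 1, 3]
locus_b = [0, 1, 2, 4, 4, 1, 1, 3]
delta, r = composite_ld(locus_a, locus_b, ploidy=4)
print(f"composite D = {delta:.3f}, composite r = {r:.3f}")
```

Nothing here needs to know which alleles sit on the same haplotype, which is exactly why composite measures are convenient when phase is unavailable.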

But actually estimating LD in polyploids is tricky because of genotype uncertainty. In polyploids, it is much harder to be sure of your genotypes because of the many different types of heterozygosity, as well as data quirks like allele bias and overdispersion, which matter more in polyploids. If you aren't sure of your genotypes and you plug estimated genotypes into an LD calculation, the resulting LD estimates will be strongly biased toward 0. Statisticians call this "attenuation", and it's a known (120-year-old) consequence of measurement error.
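
Attenuation is easy to see in a toy simulation. The sketch below is only an illustration: the haplotype frequencies, the sample size, and especially the additive Gaussian noise are assumptions I picked for simplicity, and real genotyping error (allele bias, overdispersion) is messier than this. It draws octoploid dosages at two linked loci, perturbs them, and compares the dosage correlation before and after.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_attenuation(n=2000, ploidy=8, noise_sd=1.0, seed=1):
    """Return (r_true, r_noisy) for simulated dosages at two loci."""
    rng = random.Random(seed)
    # Hypothetical haplotype frequencies for AB, Ab, aB, ab.
    haplotypes = [(1, 1), (1, 0), (0, 1), (0, 0)]
    weights = [0.4, 0.1, 0.1, 0.4]
    true_a, true_b, noisy_a, noisy_b = [], [], [], []
    for _ in range(n):
        draws = rng.choices(haplotypes, weights=weights, k=ploidy)
        a = sum(h[0] for h in draws)  # dosage at locus A
        b = sum(h[1] for h in draws)  # dosage at locus B
        true_a.append(a)
        true_b.append(b)
        # Stand-in for genotype uncertainty: independent additive noise.
        noisy_a.append(a + rng.gauss(0.0, noise_sd))
        noisy_b.append(b + rng.gauss(0.0, noise_sd))
    return pearson(true_a, true_b), pearson(noisy_a, noisy_b)


r_true, r_noisy = simulate_attenuation()
print(f"r from true dosages:  {r_true:.3f}")
print(f"r from noisy dosages: {r_noisy:.3f}")
```

With these settings, the correlation computed from the noisy dosages should come out noticeably closer to 0 than the one computed from the true dosages, which is the attenuation described above.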

In Gerard (2021b), I came up with a way to get rid of this bias that's super fast. Below are LD estimates (y-axis) for a simulated octoploid species where the true LD is the red dashed line. Blue uses estimated genotypes, black is the MLE (from Gerard, 2021a), and orange is my new way (from Gerard, 2021b).



Black and orange behave about the same, but orange is loads faster. For 2000 SNPs, black would take days of computation time to estimate LD for all pairs of SNPs, while orange would take seconds. I used a simple approach that just requires posterior moments of each individual's genotype, which you can get from most genotyping software. Check it out!

See here for a presentation I made on my scalable LD estimates.

The software that implements these approaches is on CRAN: https://cran.r-project.org/package=ldsep


References

Gerard, D. (2021a). Pairwise Linkage Disequilibrium Estimation for Polyploids. Molecular Ecology Resources, 21, p. 1230--1242. doi:10.1111/1755-0998.13349 | bioRxiv:2020.08.03.234476

Gerard, D. (2021b). Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty. Heredity, 127(4), p. 357--362. doi:10.1038/s41437-021-00462-5 | bioRxiv:2021.02.08.430270