Phased Or Unphased LD (pould)

v1.0.1 (October 8, 2020)

The pould package calculates four linkage disequilibrium (LD) statistics – D', Wn and the two conditional asymmetric LD (cALD) measures, WA/B and WB/A – for genotype data from pairs of genetic loci, and can treat these data as either phased or unphased for these calculations. In addition, pould includes LDWrap(), a wrapper function that parses genotype data in BIGDAWG/PyPop input format or haplotype data in HaplObserve output format; LD.sign.test(), which applies a sign test to LD values for phased and unphased haplotypes generated by LDWrap() for a given dataset; and a function that generates PNG-formatted heat-map plots for each LD measure.

For examples of the application of the pould package, see:
Osoegawa et al. Hum Immunol. 2019;80(9):633-643.
Osoegawa et al. Hum Immunol. 2019;80(9):644-660.

For more information about cALD, see: Thomson G, Single RM. Conditional asymmetric linkage disequilibrium (ALD): extending the biallelic r2 measure. Genetics. 2014;198(1):321-31.
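
As a rough orientation to what the four statistics measure, the sketch below computes them in base R from a table of haplotype frequencies. It is not pould's implementation: the helper name ld_stats and the example frequencies are hypothetical, and the formulas are assumed to be the standard ones (Hedrick's multi-allelic D', Wn as Cramer's V, and the homozygosity-based cALD definitions of Thomson and Single). It also assumes every allele has a frequency strictly between 0 and 1.

```r
# Illustrative sketch only: NOT pould's internal code.
# `ld_stats` is a hypothetical helper. `hap` is a matrix of haplotype counts or
# frequencies (rows = alleles at locus A, columns = alleles at locus B).
ld_stats <- function(hap) {
  hap <- hap / sum(hap)   # normalise so the table sums to 1
  p   <- rowSums(hap)     # allele frequencies at locus A
  q   <- colSums(hap)     # allele frequencies at locus B
  pq  <- outer(p, q)      # haplotype frequencies expected under no LD
  D   <- hap - pq         # disequilibrium coefficient for each allele pair

  # Multi-allelic D': sum of p_i * q_j * |D_ij| / Dmax_ij
  dmax_pos <- pmin(outer(p, 1 - q), outer(1 - p, q))  # bound used when D_ij > 0
  dmax_neg <- pmin(pq, outer(1 - p, 1 - q))           # bound used when D_ij <= 0
  dmax     <- ifelse(D > 0, dmax_pos, dmax_neg)
  d_prime  <- sum(pq * abs(D) / dmax)

  # Wn: Cramer's V computed on the haplotype frequency table
  wn <- sqrt(sum(D^2 / pq) / min(nrow(hap) - 1, ncol(hap) - 1))

  # cALD: homozygosity at one locus conditioned on the other locus
  f_a         <- sum(p^2)
  f_b         <- sum(q^2)
  f_a_given_b <- sum(sweep(hap^2, 2, q, "/"))  # sum over i,j of f_ij^2 / q_j
  f_b_given_a <- sum(sweep(hap^2, 1, p, "/"))  # sum over i,j of f_ij^2 / p_i
  # max(0, ...) guards against tiny negative values from rounding error
  w_a_b <- sqrt(max(0, (f_a_given_b - f_a) / (1 - f_a)))
  w_b_a <- sqrt(max(0, (f_b_given_a - f_b) / (1 - f_b)))

  c(D.prime = d_prime, Wn = wn, W.A.given.B = w_a_b, W.B.given.A = w_b_a)
}

# Hypothetical example: three alleles at locus A, two at locus B.
# Each A allele occurs with only one B allele, but B2 occurs only with A3
# while B1 is split between A1 and A2.
hap <- matrix(c(0.25, 0.00,
                0.25, 0.00,
                0.00, 0.50),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A1", "A2", "A3"), c("B1", "B2")))
res <- ld_stats(hap)
print(round(res, 4))
```

In this made-up table the B allele is fully determined by the A allele, so WB/A is 1, whereas knowing the B allele does not fully determine the A allele, so WA/B is lower (about 0.775). D' and Wn are both 1 and cannot show this asymmetry, which is the motivation for the two conditional measures.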

The pould package can be installed from GitHub using the R devtools package – devtools::install_github("IHIW/pould/pould", build_vignettes = TRUE).

Note: When installing pould from GitHub on a Windows system that does not have Rtools v3.5 installed, the following warning message may appear:

In untar2(tarfile, files, list, exdir) : skipping pax global extended headers

This warning does not affect the function of the package. Installing Rtools v3.5 will prevent it.


In addition to simply calculating LD values, pould can be used to compare LD for phased and unphased versions of the same dataset, e.g., to examine the extent to which phasing via segregation analysis impacts LD relative to phase estimation via the expectation-maximization (EM) algorithm. In the example below, DRB1 and DQB1 genotype data were extracted from six-locus haplotypes that had been phased using the EM method (Mack et al. Genes Immun. 2018). In the first application of cALD(), that phasing information is ignored, and the EM algorithm is applied to estimate haplotypes. In the second application of cALD(), the original six-locus phasing information is retained.

By comparing the resulting LD values, it becomes clear that LD is uniformly lower for the pre-phased DRB1~DQB1 haplotypes than for the de novo EM-estimated haplotypes. This suggests that the EM algorithm may not accurately estimate low-frequency (counts < 4) haplotypes for individual locus pairs during multi-locus haplotype estimation, as the number of EM-estimated haplotypes evaluated (53) is considerably lower than the number of pre-phased haplotypes evaluated (106).

## Comparing LD values for haplotypes generated by the EM algorithm (default = unphased) to LD values for haplotypes for which phase is known.
#> Calculating D', Wn and conditional ALD for 53 unphased genotypes at the DRB1 and DQB1 loci.
#> D' for DRB1~DQB1 haplotypes: 0.958463648286022 (0.9585) 
#> Wn for DRB1~DQB1 haplotypes: 0.811184751666017 (0.8112) 
#> Variation of DQB1 conditioned on DRB1 (WDQB1/DRB1) = 0.903300936956993 (0.9033)
#> Variation of DRB1 conditioned on DQB1 (WDRB1/DQB1) = 0.778712698006812 (0.7787)

#> Calculating D', Wn and conditional ALD for 106 phased genotypes at the DRB1 and DQB1 loci.
#> D' for DRB1~DQB1 haplotypes: 0.878076460805524 (0.8781) 
#> Wn for DRB1~DQB1 haplotypes: 0.733800978595899 (0.7338) 
#> Variation of DQB1 conditioned on DRB1 (WDQB1/DRB1) = 0.822989521285103 (0.823)
#> Variation of DRB1 conditioned on DQB1 (WDRB1/DQB1) = 0.721861349887199 (0.7219)
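
The size of the difference between the two runs can be tabulated directly from the rounded values reported above. The following base R sketch uses no pould functions; the data frame ld and its column names are hypothetical and exist only for this illustration.

```r
# LD values copied (rounded to four decimals) from the output above.
ld <- data.frame(
  measure  = c("D'", "Wn", "WDQB1/DRB1", "WDRB1/DQB1"),
  unphased = c(0.9585, 0.8112, 0.9033, 0.7787),  # EM-estimated run (53)
  phased   = c(0.8781, 0.7338, 0.8230, 0.7219)   # pre-phased run (106)
)

# Absolute and relative reduction in LD when the original phase is retained
ld$difference    <- ld$unphased - ld$phased
ld$relative_drop <- ld$difference / ld$unphased

print(ld)

# TRUE when every measure is lower for the pre-phased haplotypes
all_lower <- all(ld$difference > 0)
print(all_lower)
```

All four differences are positive, which is the pattern described in the text; with a single locus pair this is a descriptive comparison only, not a significance test.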