Basic usage • twigstats

The main use-case of Twigstats is to compute f2-statistics for pairs of populations.

This should be straight forward: We compute these using a single function call in R, using the output files of Relate as input. The output can then directly be fed into functions of the admixtools2 R package (https://uqrmaie1.github.io/admixtools/index.html) to compute any other f-statistic.

Each function implemented in the Twigstats package is described on this page: leospeidel.com/twigstats/reference.

Here, we will demonstrate key features using an example stored under inst/sim/, which contains simulated data for five populations as described in our paper (Figure 1).

library(twigstats)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: tidyr

Computing f2-statistics from Relate genealogies

The first use-case is to compute f2-statistics on Relate genealogies (https://myersgroup.github.io/relate/).

The central function is called f2_blocks_from_Relate() which takes Relate output files as input.
Please type ?f2_blocks_from_Relate for more details:

We will illustrate a basic Twigstats run using example files that are included in the Twigstats package. In this example, population PX is admixed between populations P2 and P3, with a mixture proportion of 20% and 80%, respectively.

These lines assign filenames to variables file_anc, file_mut, poplabels, and file_map.

#see https://myersgroup.github.io/relate/getting_started.html#Output for file formats
file_anc  <- system.file("sim/msprime_ad0.8_split250_1_chr1.anc.gz", package = "twigstats")
file_mut  <- system.file("sim/msprime_ad0.8_split250_1_chr1.mut.gz", package = "twigstats")
#see https://myersgroup.github.io/relate/input_data.html for file formats
poplabels <- system.file("sim/msprime_ad0.8_split250_1.poplabels", package = "twigstats")
file_map  <- system.file("sim/genetic_map_combined_b37_chr1.txt.gz", package = "twigstats") #recombination map (three column format)

We will first calculate f2 statistics as normal without a cutoff.
We will then plug these f2 statistics into an f4-ratio statistic which computes an estimate of the admixture proportion.

#Calculate regular f2s between all pairs of populations
f2_blocks1 <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map)
f4_ratio(f2_blocks1, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#>   popO popI pop1 pop2 popX       val        se
#> 1   P4   P1   P2   P3   PX 0.9974217 0.3076228

As we can see, the inferred admixture proportion is not close to the true value of 80% and the standard errors are large. This shows that we are underpowered to infer the admixture proportions using this f4-ratio statistic.

Now we will specify a twigstats cutoff of 500 generations and plug these into an f4-ratio statistic:

#Now calculate f2s using a twigstats cutoff of 500 generations. 
#This should give us a big power boost.
f2_blocks2 <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, t = 500)
f4_ratio(f2_blocks2, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#>   popO popI pop1 pop2 popX       val         se
#> 1   P4   P1   P2   P3   PX 0.8110266 0.02085804

The standard errors are much reduced!

Use argument use_muts to compute f2-statistics on the (age-ascertained) mutations, instead of branch lengths of Relate genealogies.
Use argument blgsize to change jackknife block size
Set argument transitions to False to exclude transitions

Using admixtools2

We can now directly feed this into admixtools2 functions:

library(admixtools)
#no twigstats ascertainment
f4(f2_blocks1, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#>   pop1  pop2  pop3  pop4      est       se     z         p
#>   <chr> <chr> <chr> <chr>   <dbl>    <dbl> <dbl>     <dbl>
#> 1 P4    P3    P2    PX    0.00127 0.000314  4.04 0.0000543
#with twigstats ascertainment
f4(f2_blocks2, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#>   pop1  pop2  pop3  pop4       est        se     z         p
#>   <chr> <chr> <chr> <chr>    <dbl>     <dbl> <dbl>     <dbl>
#> 1 P4    P3    P2    PX    0.000635 0.0000253  25.1 1.37e-138

If PX exclusively clustered with P2 and did not carry gene flow from P3, we expect this f4 statistic to not be significantly deviating from 0. We observe a highly significant attraction of P3 to PX, especially when ascertaining coalescences of the last 500 generations using Twigstats, consistent with PX being a mixture of P2 and P3.

Computing f2-statistics on age-ascertained SNPs

Sometimes, we are prefer not to graft samples into genealogies (e.g. due to low sequencing coverage, or if we don’t trust the imputation).
Therefore, the second use-case of Twigstats is to compute f2-statistics on age ascertained mutations.

The central function is called f2_blocks_from_RelateAges() which takes plink (bed/bim/fam) files and mutation ages (.mut) as input.
Please type ?f2_blocks_from_RelateAges for more details:

We will begin with assigning filenames to the following variables. In our case, all files are stored under path and have prefix pref.

path <- paste0(system.file("sim/", package = "twigstats"),"/")
pref <- "msprime_ad0.8_split250_1"
file_plink <- paste0(path,pref) #only need prefix
file_mut  <- paste0(path,pref) #only need prefix (here same name as plink file but can be different)

Next, we compute f2-statistics as normal on the plink format and plug these into an f4-ratio statistic to estimate the admixture proportion.

#Compute regular f2 statistics between all pairs of populations. You can use pops to only calculate f2s between specified populations.
f2_blocks1 <- f2_blocks_from_RelateAges(pref = file_plink, file_mut)
f4_ratio(f2_blocks1, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#>   popO popI pop1 pop2 popX      val        se
#> 1   P4   P1   P2   P3   PX 1.082681 0.2931867

As before, we can see that the standard error of the inferred dmixture proportion is very large. This shows we are lacking power using the standard f-statistics approach.

We can now ascertain mutations by age. Here we choose a cutoff of 500 generations:

#Use a cutoff of 500 generations
f2_blocks2 <- f2_blocks_from_RelateAges(pref = file_plink, file_mut, t = 500)
f4_ratio(f2_blocks2, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#>   popO popI pop1 pop2 popX      val         se
#> 1   P4   P1   P2   P3   PX 0.810077 0.03097378

The standard errors are much reduced! So while we are unable to get a reliable estimate without twigstats ascertainment, we estimate PX to carry approximately 80% P2 and 20% P3 ancestry with twigstats.

Use argument pops to analyse a subset of populations and fam to change population assignment of individuals
Use argument blgsize to change jackknife block size
Set argument transitions to False to exclude transitions
Use file_mut = “./1240k/1240k” to calculate f2 statistics at 1240k sites
Use file_mut = “./1000G_ages/1000GP_Phase3_mask_prene_nosingle_chr1” to calculate f2 statistics at 1240k sites

Using admixtools2

We can now directly feed this into admixtools2 functions:

library(admixtools)
#no twigstats ascertainment
f4(f2_blocks1, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#>   pop1  pop2  pop3  pop4      est       se     z        p
#>   <chr> <chr> <chr> <chr>   <dbl>    <dbl> <dbl>    <dbl>
#> 1 P4    P3    P2    PX    0.00116 0.000305  3.81 0.000139
#with twigstats ascertainment
f4(f2_blocks2, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#>   pop1  pop2  pop3  pop4       est        se     z         p
#>   <chr> <chr> <chr> <chr>    <dbl>     <dbl> <dbl>     <dbl>
#> 1 P4    P3    P2    PX    0.000642 0.0000281  22.9 9.35e-116

As before, we find a highly significant attraction of P3 to PX using Twigstats, consistent with PX being a mixture of P2 and P3.