The main use-case of Twigstats is to compute f2-statistics for pairs of populations.
This should be straight forward: We compute these using a single
function call in R, using the output files of Relate as input. The
output can then directly be fed into functions of the admixtools2 R
package (https://uqrmaie1.github.io/admixtools/index.html) to
compute any other f-statistic.
Each function implemented in the Twigstats package is described on this page: leospeidel.com/twigstats/reference.
Here, we will demonstrate key features using an example stored under
inst/sim/
, which contains simulated data for five
populations as described in our paper (Figure 1).
library(twigstats)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
The first use-case is to compute f2-statistics on Relate genealogies
(https://myersgroup.github.io/relate/).
The central function is called f2_blocks_from_Relate()
which takes Relate output files as input.
Please type
?f2_blocks_from_Relate
for more details:
We will illustrate a basic Twigstats run using example files that are included in the Twigstats package. In this example, population PX is admixed between populations P2 and P3, with a mixture proportion of 20% and 80%, respectively.
These lines assign filenames to variables file_anc
,
file_mut
, poplabels
, and
file_map
.
#see https://myersgroup.github.io/relate/getting_started.html#Output for file formats
file_anc <- system.file("sim/msprime_ad0.8_split250_1_chr1.anc.gz", package = "twigstats")
file_mut <- system.file("sim/msprime_ad0.8_split250_1_chr1.mut.gz", package = "twigstats")
#see https://myersgroup.github.io/relate/input_data.html for file formats
poplabels <- system.file("sim/msprime_ad0.8_split250_1.poplabels", package = "twigstats")
file_map <- system.file("sim/genetic_map_combined_b37_chr1.txt", package = "twigstats") #recombination map (three column format)
We will first calculate f2 statistics as normal without a
cutoff.
We will then plug these f2 statistics into an f4-ratio
statistic which computes an estimate of the admixture proportion.
#Calculate regular f2s between all pairs of populations
f2_blocks1 <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map)
f4_ratio(f2_blocks1, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#> popO popI pop1 pop2 popX val se
#> 1 P4 P1 P2 P3 PX 0.9974217 0.3076228
As we can see, the inferred admixture proportion is not close to the true value of 80% and the standard errors are large. This shows that we are underpowered to infer the admixture proportions using this f4-ratio statistic.
Now we will specify a twigstats cutoff of 500 generations and plug these into an f4-ratio statistic:
#Now calculate f2s using a twigstats cutoff of 500 generations.
#This should give us a big power boost.
f2_blocks2 <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, t = 500)
f4_ratio(f2_blocks2, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#> popO popI pop1 pop2 popX val se
#> 1 P4 P1 P2 P3 PX 0.8110266 0.02085804
The standard errors are much reduced!
We can now directly feed this into admixtools2 functions:
library(admixtools)
#no twigstats ascertainment
f4(f2_blocks1, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#> pop1 pop2 pop3 pop4 est se z p
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 P4 P3 P2 PX 0.00127 0.000314 4.04 0.0000543
#with twigstats ascertainment
f4(f2_blocks2, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#> pop1 pop2 pop3 pop4 est se z p
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 P4 P3 P2 PX 0.000635 0.0000253 25.1 1.37e-138
If PX exclusively clustered with P2 and did not carry gene flow from P3, we expect this f4 statistic to not be significantly deviating from 0. We observe a highly significant attraction of P3 to PX, especially when ascertaining coalescences of the last 500 generations using Twigstats, consistent with PX being a mixture of P2 and P3.
Sometimes, we are prefer not to graft samples into genealogies
(e.g. due to low sequencing coverage, or if we don’t trust the
imputation).
Therefore, the second use-case of Twigstats is to
compute f2-statistics on age ascertained mutations.
The central function is called
f2_blocks_from_RelateAges()
which takes plink (bed/bim/fam)
files and mutation ages (.mut) as input.
Please type
?f2_blocks_from_RelateAges
for more details:
We will begin with assigning filenames to the following variables. In
our case, all files are stored under path
and have prefix
pref
.
path <- paste0(system.file("sim/", package = "twigstats"),"/")
pref <- "msprime_ad0.8_split250_1"
file_plink <- paste0(path,pref) #only need prefix
file_mut <- paste0(path,pref) #only need prefix (here same name as plink file but can be different)
Next, we compute f2-statistics as normal on the plink format and plug these into an f4-ratio statistic to estimate the admixture proportion.
#Compute regular f2 statistics between all pairs of populations. You can use pops to only calculate f2s between specified populations.
f2_blocks1 <- f2_blocks_from_RelateAges(pref = file_plink, file_mut)
f4_ratio(f2_blocks1, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#> popO popI pop1 pop2 popX val se
#> 1 P4 P1 P2 P3 PX 1.082681 0.2931867
As before, we can see that the standard error of the inferred dmixture proportion is very large. This shows we are lacking power using the standard f-statistics approach.
We can now ascertain mutations by age. Here we choose a cutoff of 500 generations:
#Use a cutoff of 500 generations
f2_blocks2 <- f2_blocks_from_RelateAges(pref = file_plink, file_mut, t = 500)
f4_ratio(f2_blocks2, popX="PX", popI="P1", pop1="P2", pop2="P3", popO="P4")
#> popO popI pop1 pop2 popX val se
#> 1 P4 P1 P2 P3 PX 0.810077 0.03097378
The standard errors are much reduced! So while we are unable to get a reliable estimate without twigstats ascertainment, we estimate PX to carry approximately 80% P2 and 20% P3 ancestry with twigstats.
We can now directly feed this into admixtools2 functions:
library(admixtools)
#no twigstats ascertainment
f4(f2_blocks1, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#> pop1 pop2 pop3 pop4 est se z p
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 P4 P3 P2 PX 0.00116 0.000305 3.81 0.000139
#with twigstats ascertainment
f4(f2_blocks2, "P4", "P3", "P2", "PX")
#> # A tibble: 1 × 8
#> pop1 pop2 pop3 pop4 est se z p
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 P4 P3 P2 PX 0.000642 0.0000281 22.9 9.35e-116
As before, we find a highly significant attraction of P3 to PX using Twigstats, consistent with PX being a mixture of P2 and P3.