A small real data example • twigstats

The overall process should be straight forward:

We first run Relate starting from a phased VCF file
Then we run Twigstats on these inferred trees, using a single R function called f2_blocks_from_Relate.
This object is then fed into admixtools2 to compute any f-statistic.

We will illustrate how to run Twigstats on a small example of imputed ancient genomes. We will use chromosome 1 of 25 publicly available low coverage shotgun genomes from Anglo-Saxon contexts, British Iron/Roman Age, Irish Bronze Age, and the Scandinavian Early Iron Age (Cassidy et al, PNAS 2016; Martiniano et al, Nature Communications 2016; Anastasiadou et al, Communications Biology 2023; Schiffels et al Nature Communications 2016; Gretzinger et al Nature 2022; Rodriguez-Varela et al Cell 2023).

To look up where preprepared input files are stored on your computer, please clone the Twigstats github repository and execute the following command:

git checkout documentation
ls inst/example/

In this example, we have already executed the steps detailed under “Preparing the data”, so you can skip ahead to “Running Relate”.

Preparing the data

Here, we will follow the standard pipeline for imputing ancient shotgun genomes. These steps have already been executed and you can skip ahead to “Running Relate” if you like.

Imputation

We imputed the ancient genomes with GLIMPSE (Rubinacci et al, Nature Genetics, 2021; https://odelaneau.github.io/GLIMPSE/) and using the 1000 Genomes Project reference panel. We then merged these imputed genomes with a subset of the 1000 Genomes Project dataset. You can see the genomes we will use here in the box below:

popl <- read.table(system.file("example/1000GP_Phase3_sub_aDNA.poplabels", package = "twigstats"), header = T)
print(popl)
#>         ID                              POP   GROUP    SEX
#> 1  NA12249                              CEU     EUR female
#> 2  NA12830                              CEU     EUR female
#> 3  NA07048                              CEU     EUR   male
#> 4  NA12815                              CEU     EUR female
#> 5  NA06986                              CEU     EUR   male
#> 6  NA18635                              CHB     EAS   male
#> 7  NA18609                              CHB     EAS   male
#> 8  NA18593                              CHB     EAS female
#> 9  NA18546                              CHB     EAS   male
#> 10 NA18595                              CHB     EAS female
#> 11 HG00233                              GBR     EUR female
#> 12 HG00151                              GBR     EUR   male
#> 13 HG00118                              GBR     EUR female
#> 14 HG00160                              GBR     EUR   male
#> 15 HG00140                              GBR     EUR   male
#> 16 NA18907                              YRI     AFR female
#> 17 NA19257                              YRI     AFR female
#> 18 NA18504                              YRI     AFR   male
#> 19 NA18498                              YRI     AFR   male
#> 20 NA19131                              YRI     AFR female
#> 21   3DT16             Britain.IronRoman.SG      UK      M
#> 22   6DT18             Britain.IronRoman.SG      UK      M
#> 23   6DT21             Britain.IronRoman.SG      UK      M
#> 24   6DT22             Britain.IronRoman.SG      UK      M
#> 25   6DT23             Britain.IronRoman.SG      UK      M
#> 26  C10427             Britain.IronRoman.SG      UK      U
#> 27   I0156             Britain.IronRoman.SG      UK      M
#> 28   I0160             Britain.IronRoman.SG      UK      M
#> 29   I0789             Britain.IronRoman.SG      UK      F
#> 30   M1489             Britain.IronRoman.SG      UK      F
#> 31  HAD001                 England_Saxon.SG      UK      M
#> 32  HAD013                 England_Saxon.SG      UK      F
#> 33   I0161                 England_Saxon.SG      UK      F
#> 34   I0774                 England_Saxon.SG      UK      F
#> 35   I0777                 England_Saxon.SG      UK      F
#> 36  OAI004                 England_Saxon.SG      UK      F
#> 37  OAI007                 England_Saxon.SG      UK      F
#> 38   rath1                    Ireland_BA.SG Ireland      M
#> 39   rath2                    Ireland_BA.SG Ireland      M
#> 40   rath3                    Ireland_BA.SG Ireland      M
#> 41  als001 Scandinavian_Peninsula_EIA(I).SG  Sweden      M
#> 42  lov001 Scandinavian_Peninsula_EIA(I).SG  Sweden      F
#> 43  snb010 Scandinavian_Peninsula_EIA(I).SG  Sweden      M
#> 44  snb013 Scandinavian_Peninsula_EIA(I).SG  Sweden      M
#> 45  snb014 Scandinavian_Peninsula_EIA(I).SG  Sweden      M

Given this merged VCF, we then remove SNPs with missing data, and filter the data by imputation quality (e.g. requiring a minimum imputation posterior or minimum INFO score).

Removing all SNPs where any genotype has an imputation posterior <80% is achieved by running something like the following command

bcftools view --max-alleles 2 --exclude-types indels 1000G_aDNA_chr1.vcf.gz | 
bcftools +setGT -- -t q -n . -i'SMPL_MAX(FORMAT/GP)<=0.8' | 
bcftools view -i 'F_MISSING==0.0' -O z -o 1000G_aDNA_filtered_chr1.vcf.gz

This usually works well with a small number of samples.

For larger data sets, this is likely too restrictive. Please note that Relate cannot accept missing data as input, so if you would like to e.g. retain only sites where <2% of genotypes have a posterior <80%, then you will need to first generate a list of BP positions of SNPs to remove:

bcftools view --max-alleles 2 --exclude-types indels 1000G_aDNA_chr1.vcf.gz |
    bcftools +setGT -- -t q -n . -i'SMPL_MAX(FORMAT/GP)<=0.8' |
    bcftools view -i 'F_MISSING>0.02' |
    bcftools query -f "%CHROM\t%POS\n" > BP_excl_chr${chr}.txt

We can then subset the original imputed vcf using bcftools:

bcftools view -T ^BP_excl_chr1.txt 1000G_aDNA_chr1.vcf.gz -O z -o 1000G_aDNA_filtered_chr1.vcf.gz

Other filtering approaches are also possible, e.g., we can use bcftools to annotate each SNP with an INFO score (see impute-info plugin of bcftools) and require a minimum INFO score for each SNP. This score measures how well a SNP is imputed using surrounding haplotype information.

Genome mask and ancestral genome

You can download the 1000 Genomes Project accessibility masks from this URL for hg37 and hg38.

The ancestral genome is available here for hg37 and hg38.

These files, and a recombination map for hg37 and hg38, can also be downloaded from here.

Subsetting to transversions (optional)

We may also opt to only analyse transversions, which we can extract using the following command

bcftools view -i 'TYPE="snp" && ( (REF!="A" || ALT!="G") && (REF!="G" || ALT!="A") && (REF!="C" || ALT!="T") && (REF!="T" || ALT!="C"))' 1000G_aDNA_filtered_chr1.vcf.gz -O z -o 1000G_aDNA_transv_chr1.vcf.gz

Conversion to Relate input format

Once we have a VCF containing SNPs that we are happy with, we can now use a script provided by the Relate package to convert this into the Relate input format. For details please consult the Relate documentation:

#convert from vcf to haps/sample
~/Documents/genomics/software/relate/bin/RelateFileFormats \
    --mode ConvertFromVcf \
  -i 1000G_aDNA_transv_chr1 \
    --haps 1000G_aDNA_transv_chr1.haps \
    --sample 1000G_aDNA_transv_chr1.sample

gzip -f 1000G_aDNA_transv_chr1.haps
gzip -f 1000G_aDNA_transv_chr1.sample

#apply a genomic mask and polarise alleles according to ancestral/derived state
~/Documents/genomics/software/relate/scripts/PrepareInputFiles/PrepareInputFiles.sh \
    --haps 1000G_aDNA_transv_chr1.haps.gz \
    --sample 1000G_aDNA_transv_chr1.sample.gz \
    --ancestor human_ancestor_chr1.fa.gz \
    -o 1000G_aDNA_mask_transv_chr1 \
    --remove_ids remove.txt \
    --mask ./StrictMask_chr1.fa.gz

Here, we used a human ancestral genome and mappability mask. These files are avaliable for GRCh38 as well: human ancestral genome and mappability mask.

Running Relate

Once we have prepared the data, we can now run Relate (Speidel et al, Nature Genetics 2019; MBE 2021). Please consult the Relate documentation for details on how to run Relate. Here, we will run Relate using a mutation rate of 4e-9, corresponding approximately to the average transversion mutation rate in humans, and we will use pre-inferred coalescence rates stored in 1000G_auto.coal.

path="/opt/homebrew/lib/R/4.3/site-library/twigstats/example/"

Relate --mode All \
    --haps ${path}/1000G_aDNA_mask_transv_chr1.haps.gz \
    --sample ${path}/1000G_aDNA_mask_transv_chr1.sample.gz \
    --dist ${path}/1000G_aDNA_mask_transv_chr1.dist.gz \
    --map ${path}/genetic_map_combined_b37_chr1.txt \
    -o 1000G_aDNA_mask_transv_chr1 \
    -m 4e-9 \
    --coal ${path}/1000G_auto.coal \
    --sample_ages ${path}/sample_ages.txt \
    --memory 5 \
    --seed 1

gzip -f 1000G_aDNA_mask_transv_chr1.anc
gzip -f 1000G_aDNA_mask_transv_chr1.mut
#> 
#> *********************************************************
#> ---------------------------------------------------------
#> Relate
#>  * Authors: Leo Speidel, Marie Forest, Sinan Shi, Simon Myers.
#>  * Doc:     https://myersgroup.github.io/relate
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Using:
#>   /opt/homebrew/lib/R/4.3/site-library/twigstats/example//1000G_aDNA_mask_transv_chr1.haps.gz
#>   /opt/homebrew/lib/R/4.3/site-library/twigstats/example//1000G_aDNA_mask_transv_chr1.sample.gz
#>   /opt/homebrew/lib/R/4.3/site-library/twigstats/example//genetic_map_combined_b37_chr1.txt
#> with mu = 4e-09 and coal = /opt/homebrew/lib/R/4.3/site-library/twigstats/example//1000G_auto.coal.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Parsing data..
#> Warning: Will use min 0.00019GB of hard disc.
#> CPU Time spent: 2.027816s; Max Memory usage: 55Mb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Read 90 haplotypes with 229231 SNPs per haplotype.
#> Expected minimum memory usage: 1.5Gb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Starting chunk 0 of 0.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Painting sequences...
#> CPU Time spent: 2.788373s; Max Memory usage: 92Mb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Estimating topologies of AncesTrees in sections 0-0...
#> [0/0]CPU Time spent: 10.669859s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Propagating mutations across AncesTrees...
#> CPU Time spent: 13.252204s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Inferring branch lengths of AncesTrees in sections 0-0...
#> [0/0]CPU Time spent: 83.471486s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Combining AncesTrees in Sections...
#> CPU Time spent: 84.624383s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Finalizing...
#> Number of not mapping SNPs: 414
#> Number of flipped SNPs    : 77
#> CPU Time spent: 86.521146s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#> 
#> ---------------------------------------------------------
#> Done.
#> ---------------------------------------------------------
#> *********************************************************

Running Twigstats

Finally, once we have inferred Relate trees, we can now run Twigstats. We will aim to infer an admixture in Anglo-Saxon individuals between the preceding British Iron/Roman Age and Early Iron Age Scandinavia.

To illustrate a power gain by using Twigstats, we will run Twigstats in two different ways: First, we will compute regular f2-statistics and second, we will compute f2-statistics on only recent coalescences.

We first load the Twigstats package and assign filenames to variables.

library(twigstats)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: tidyr
file_anc  <- "1000G_aDNA_mask_transv_chr1.anc.gz"
file_mut  <- "1000G_aDNA_mask_transv_chr1.mut.gz"
poplabels <- system.file("example/1000GP_Phase3_sub_aDNA.poplabels", package = "twigstats")
file_map  <- system.file("example/genetic_map_combined_b37_chr1.txt", package = "twigstats")

Next, we compute regular f2 statistics without a time cutoff, and then infer the admixture proportion using an f4-ratio statistic:

#Calculate f2s between all pairs of populations
f2_blocks <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, use_muts = T)
f4_ratio(f2_blocks, popX="England_Saxon.SG", popI="Ireland_BA.SG", pop1="Britain.IronRoman.SG", pop2="Scandinavian_Peninsula_EIA(I).SG", popO="YRI")
#>   popO          popI                 pop1                             pop2
#> 1  YRI Ireland_BA.SG Britain.IronRoman.SG Scandinavian_Peninsula_EIA(I).SG
#>               popX      val       se
#> 1 England_Saxon.SG 3.031811 60.95149

As we can see, the inferred admixture proportion is not between 0 and 1 and the standard errors are extremely large, giving us no meaningful inference.

We now compute f2 statistics on lineages of the past 500 generations using Twigstats:

#Use a cutoff of 500 generations
f2_blocks <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, t = 500)
f4_ratio(f2_blocks, popX="England_Saxon.SG", popI="Ireland_BA.SG", pop1="Britain.IronRoman.SG", pop2="Scandinavian_Peninsula_EIA(I).SG", popO="YRI")
#>   popO          popI                 pop1                             pop2
#> 1  YRI Ireland_BA.SG Britain.IronRoman.SG Scandinavian_Peninsula_EIA(I).SG
#>               popX       val        se
#> 1 England_Saxon.SG 0.3811487 0.1750392

The standard errors are much reduced and we get a meaningful admixture proportion of approximately 38% Scandinavian EIA ancestry! Note that the standard error is still relatively large (though excludes 0 and 1 as feasible proportions). The obvious way to improve power is to use all other chromosomes; here we only used chromosome 1. Additionally, we could build Relate trees with more samples to get more accurate constrains on dates.

Multiple chromosomes

If you have multiple chromosomes, you can use the argument chr in Twigstats functions, which takes an R array with chromosome names as input. For file_anc, file_mut, and file_map, you will only need to specify the filename prefix, such that e.g., the anc files are named paste0(file_anc,"_chr1.anc.gz") etc.

E.g., the above Twigstats runs are equivalent to

file_anc  <- "1000G_aDNA_mask_transv" #filename prefix only
file_mut  <- "1000G_aDNA_mask_transv" #filename prefix only

path      <- paste0(system.file("example/", package = "twigstats"),"/") #just necessary to get the path to the poplabels file
poplabels <- paste0(path,"1000GP_Phase3_sub_aDNA.poplabels")
file_map  <- paste0(path,"genetic_map_combined_b37") #filename prefix only
chrs <- 1:1

#Calculate f2s between all pairs of populations
#We specify the argument chr, which takes an R array as input
f2_blocks <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, chr = chrs, t = 500)

#Once we have f2_blocks we can compute f-statistics as before
f4_ratio(f2_blocks, popX="England_Saxon.SG", popI="Ireland_BA.SG", pop1="Britain.IronRoman.SG", pop2="Scandinavian_Peninsula_EIA(I).SG", popO="YRI")
#>   popO          popI                 pop1                             pop2
#> 1  YRI Ireland_BA.SG Britain.IronRoman.SG Scandinavian_Peninsula_EIA(I).SG
#>               popX       val        se
#> 1 England_Saxon.SG 0.3811487 0.1750392