The overall process should be straight forward:
f2_blocks_from_Relate
.We will illustrate how to run Twigstats on a small example of imputed ancient genomes. We will use chromosome 1 of 25 publicly available low coverage shotgun genomes from Anglo-Saxon contexts, British Iron/Roman Age, Irish Bronze Age, and the Scandinavian Early Iron Age (Cassidy et al, PNAS 2016; Martiniano et al, Nature Communications 2016; Anastasiadou et al, Communications Biology 2023; Schiffels et al Nature Communications 2016; Gretzinger et al Nature 2022; Rodriguez-Varela et al Cell 2023).
To look up where preprepared input files are stored on your computer, please clone the Twigstats github repository and execute the following command:
git checkout documentation
ls inst/example/
In this example, we have already executed the steps detailed under “Preparing the data”, so you can skip ahead to “Running Relate”.
Here, we will follow the standard pipeline for imputing ancient shotgun genomes. These steps have already been executed and you can skip ahead to “Running Relate” if you like.
We imputed the ancient genomes with GLIMPSE (Rubinacci et al, Nature Genetics, 2021; https://odelaneau.github.io/GLIMPSE/) and using the 1000 Genomes Project reference panel. We then merged these imputed genomes with a subset of the 1000 Genomes Project dataset. You can see the genomes we will use here in the box below:
popl <- read.table(system.file("example/1000GP_Phase3_sub_aDNA.poplabels", package = "twigstats"), header = T)
print(popl)
#> ID POP GROUP SEX
#> 1 NA12249 CEU EUR female
#> 2 NA12830 CEU EUR female
#> 3 NA07048 CEU EUR male
#> 4 NA12815 CEU EUR female
#> 5 NA06986 CEU EUR male
#> 6 NA18635 CHB EAS male
#> 7 NA18609 CHB EAS male
#> 8 NA18593 CHB EAS female
#> 9 NA18546 CHB EAS male
#> 10 NA18595 CHB EAS female
#> 11 HG00233 GBR EUR female
#> 12 HG00151 GBR EUR male
#> 13 HG00118 GBR EUR female
#> 14 HG00160 GBR EUR male
#> 15 HG00140 GBR EUR male
#> 16 NA18907 YRI AFR female
#> 17 NA19257 YRI AFR female
#> 18 NA18504 YRI AFR male
#> 19 NA18498 YRI AFR male
#> 20 NA19131 YRI AFR female
#> 21 3DT16 Britain.IronRoman.SG UK M
#> 22 6DT18 Britain.IronRoman.SG UK M
#> 23 6DT21 Britain.IronRoman.SG UK M
#> 24 6DT22 Britain.IronRoman.SG UK M
#> 25 6DT23 Britain.IronRoman.SG UK M
#> 26 C10427 Britain.IronRoman.SG UK U
#> 27 I0156 Britain.IronRoman.SG UK M
#> 28 I0160 Britain.IronRoman.SG UK M
#> 29 I0789 Britain.IronRoman.SG UK F
#> 30 M1489 Britain.IronRoman.SG UK F
#> 31 HAD001 England_Saxon.SG UK M
#> 32 HAD013 England_Saxon.SG UK F
#> 33 I0161 England_Saxon.SG UK F
#> 34 I0774 England_Saxon.SG UK F
#> 35 I0777 England_Saxon.SG UK F
#> 36 OAI004 England_Saxon.SG UK F
#> 37 OAI007 England_Saxon.SG UK F
#> 38 rath1 Ireland_BA.SG Ireland M
#> 39 rath2 Ireland_BA.SG Ireland M
#> 40 rath3 Ireland_BA.SG Ireland M
#> 41 als001 Scandinavian_Peninsula_EIA(I).SG Sweden M
#> 42 lov001 Scandinavian_Peninsula_EIA(I).SG Sweden F
#> 43 snb010 Scandinavian_Peninsula_EIA(I).SG Sweden M
#> 44 snb013 Scandinavian_Peninsula_EIA(I).SG Sweden M
#> 45 snb014 Scandinavian_Peninsula_EIA(I).SG Sweden M
Given this merged VCF, we then remove SNPs with missing data, and filter the data by imputation quality (e.g. requiring a minimum imputation posterior or minimum INFO score).
Removing all SNPs where any genotype has an imputation posterior <80% is achieved by running something like the following command
bcftools view --max-alleles 2 --exclude-types indels 1000G_aDNA_chr1.vcf.gz |
bcftools +setGT -- -t q -n . -i'SMPL_MAX(FORMAT/GP)<=0.8' |
bcftools view -i 'F_MISSING==0.0' -O z -o 1000G_aDNA_filtered_chr1.vcf.gz
This usually works well with a small number of samples.
For larger data sets, this is likely too restrictive. Please note that Relate cannot accept missing data as input, so if you would like to e.g. retain only sites where <2% of genotypes have a posterior <80%, then you will need to first generate a list of BP positions of SNPs to remove:
bcftools view --max-alleles 2 --exclude-types indels 1000G_aDNA_chr1.vcf.gz |
bcftools +setGT -- -t q -n . -i'SMPL_MAX(FORMAT/GP)<=0.8' |
bcftools view -i 'F_MISSING>0.02' |
bcftools query -f "%CHROM\t%POS\n" > BP_excl_chr${chr}.txt
We can then subset the original imputed vcf using bcftools:
bcftools view -T ^BP_excl_chr1.txt 1000G_aDNA_chr1.vcf.gz -O z -o 1000G_aDNA_filtered_chr1.vcf.gz
Other filtering approaches are also possible, e.g., we can use bcftools to annotate each SNP with an INFO score (see impute-info plugin of bcftools) and require a minimum INFO score for each SNP. This score measures how well a SNP is imputed using surrounding haplotype information.
You can download the 1000 Genomes Project accessibility masks from this URL for hg37 and hg38.
The ancestral genome is available here for hg37 and hg38.
These files, and a recombination map for hg37 and hg38, can also be downloaded from here.
We may also opt to only analyse transversions, which we can extract using the following command
bcftools view -i 'TYPE="snp" && ( (REF!="A" || ALT!="G") && (REF!="G" || ALT!="A") && (REF!="C" || ALT!="T") && (REF!="T" || ALT!="C"))' 1000G_aDNA_filtered_chr1.vcf.gz -O z -o 1000G_aDNA_transv_chr1.vcf.gz
Once we have a VCF containing SNPs that we are happy with, we can now use a script provided by the Relate package to convert this into the Relate input format. For details please consult the Relate documentation:
#convert from vcf to haps/sample
~/Documents/genomics/software/relate/bin/RelateFileFormats \
--mode ConvertFromVcf \
-i 1000G_aDNA_transv_chr1 \
--haps 1000G_aDNA_transv_chr1.haps \
--sample 1000G_aDNA_transv_chr1.sample
gzip -f 1000G_aDNA_transv_chr1.haps
gzip -f 1000G_aDNA_transv_chr1.sample
#apply a genomic mask and polarise alleles according to ancestral/derived state
~/Documents/genomics/software/relate/scripts/PrepareInputFiles/PrepareInputFiles.sh \
--haps 1000G_aDNA_transv_chr1.haps.gz \
--sample 1000G_aDNA_transv_chr1.sample.gz \
--ancestor human_ancestor_chr1.fa.gz \
-o 1000G_aDNA_mask_transv_chr1 \
--remove_ids remove.txt \
--mask ./StrictMask_chr1.fa.gz
Here, we used a human ancestral genome and mappability mask. These files are avaliable for GRCh38 as well: human ancestral genome and mappability mask.
Once we have prepared the data, we can now run Relate (Speidel et al,
Nature Genetics 2019; MBE 2021). Please consult the Relate documentation for
details on how to run Relate. Here, we will run Relate using a mutation
rate of 4e-9, corresponding approximately to the average transversion
mutation rate in humans, and we will use pre-inferred coalescence rates
stored in 1000G_auto.coal
.
path="/opt/homebrew/lib/R/4.3/site-library/twigstats/example/"
Relate --mode All \
--haps ${path}/1000G_aDNA_mask_transv_chr1.haps.gz \
--sample ${path}/1000G_aDNA_mask_transv_chr1.sample.gz \
--dist ${path}/1000G_aDNA_mask_transv_chr1.dist.gz \
--map ${path}/genetic_map_combined_b37_chr1.txt \
-o 1000G_aDNA_mask_transv_chr1 \
-m 4e-9 \
--coal ${path}/1000G_auto.coal \
--sample_ages ${path}/sample_ages.txt \
--memory 5 \
--seed 1
gzip -f 1000G_aDNA_mask_transv_chr1.anc
gzip -f 1000G_aDNA_mask_transv_chr1.mut
#>
#> *********************************************************
#> ---------------------------------------------------------
#> Relate
#> * Authors: Leo Speidel, Marie Forest, Sinan Shi, Simon Myers.
#> * Doc: https://myersgroup.github.io/relate
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Using:
#> /opt/homebrew/lib/R/4.3/site-library/twigstats/example//1000G_aDNA_mask_transv_chr1.haps.gz
#> /opt/homebrew/lib/R/4.3/site-library/twigstats/example//1000G_aDNA_mask_transv_chr1.sample.gz
#> /opt/homebrew/lib/R/4.3/site-library/twigstats/example//genetic_map_combined_b37_chr1.txt
#> with mu = 4e-09 and coal = /opt/homebrew/lib/R/4.3/site-library/twigstats/example//1000G_auto.coal.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Parsing data..
#> Warning: Will use min 0.00019GB of hard disc.
#> CPU Time spent: 2.155993s; Max Memory usage: 55Mb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Read 90 haplotypes with 229231 SNPs per haplotype.
#> Expected minimum memory usage: 1.5Gb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Starting chunk 0 of 0.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Painting sequences...
#> CPU Time spent: 2.977409s; Max Memory usage: 1.2e+02Mb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Estimating topologies of AncesTrees in sections 0-0...
#> [0/0]CPU Time spent: 11.160444s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Propagating mutations across AncesTrees...
#> CPU Time spent: 13.732716s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Inferring branch lengths of AncesTrees in sections 0-0...
#> [0/0]CPU Time spent: 87.471586s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Combining AncesTrees in Sections...
#> CPU Time spent: 88.622565s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Finalizing...
#> Number of not mapping SNPs: 414
#> Number of flipped SNPs : 77
#> CPU Time spent: 90.528020s; Max Memory usage: 2.2e+03Mb.
#> ---------------------------------------------------------
#>
#> ---------------------------------------------------------
#> Done.
#> ---------------------------------------------------------
#> *********************************************************
Finally, once we have inferred Relate trees, we can now run Twigstats. We will aim to infer an admixture in Anglo-Saxon individuals between the preceding British Iron/Roman Age and Early Iron Age Scandinavia.
To illustrate a power gain by using Twigstats, we will run Twigstats in two different ways: First, we will compute regular f2-statistics and second, we will compute f2-statistics on only recent coalescences.
We first load the Twigstats package and assign filenames to variables.
library(twigstats)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
file_anc <- "1000G_aDNA_mask_transv_chr1.anc.gz"
file_mut <- "1000G_aDNA_mask_transv_chr1.mut.gz"
poplabels <- system.file("example/1000GP_Phase3_sub_aDNA.poplabels", package = "twigstats")
file_map <- system.file("example/genetic_map_combined_b37_chr1.txt", package = "twigstats")
Next, we compute regular f2 statistics without a time cutoff, and then infer the admixture proportion using an f4-ratio statistic:
#Calculate f2s between all pairs of populations
f2_blocks <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, use_muts = T)
f4_ratio(f2_blocks, popX="England_Saxon.SG", popI="Ireland_BA.SG", pop1="Britain.IronRoman.SG", pop2="Scandinavian_Peninsula_EIA(I).SG", popO="YRI")
#> popO popI pop1 pop2
#> 1 YRI Ireland_BA.SG Britain.IronRoman.SG Scandinavian_Peninsula_EIA(I).SG
#> popX val se
#> 1 England_Saxon.SG 3.031811 60.95149
As we can see, the inferred admixture proportion is not between 0 and 1 and the standard errors are extremely large, giving us no meaningful inference.
We now compute f2 statistics on lineages of the past 500 generations using Twigstats:
#Use a cutoff of 500 generations
f2_blocks <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, t = 500)
f4_ratio(f2_blocks, popX="England_Saxon.SG", popI="Ireland_BA.SG", pop1="Britain.IronRoman.SG", pop2="Scandinavian_Peninsula_EIA(I).SG", popO="YRI")
#> popO popI pop1 pop2
#> 1 YRI Ireland_BA.SG Britain.IronRoman.SG Scandinavian_Peninsula_EIA(I).SG
#> popX val se
#> 1 England_Saxon.SG 0.3812473 0.1750409
The standard errors are much reduced and we get a meaningful admixture proportion of approximately 38% Scandinavian EIA ancestry! Note that the standard error is still relatively large (though excludes 0 and 1 as feasible proportions). The obvious way to improve power is to use all other chromosomes; here we only used chromosome 1. Additionally, we could build Relate trees with more samples to get more accurate constrains on dates.
If you have multiple chromosomes, you can use the argument
chr
in Twigstats functions, which takes an R array with
chromosome names as input. For file_anc
,
file_mut
, and file_map
, you will only need to
specify the filename prefix, such that e.g., the anc files are named
paste0(file_anc,"_chr1.anc.gz")
etc.
E.g., the above Twigstats runs are equivalent to
file_anc <- "1000G_aDNA_mask_transv" #filename prefix only
file_mut <- "1000G_aDNA_mask_transv" #filename prefix only
path <- paste0(system.file("example/", package = "twigstats"),"/") #just necessary to get the path to the poplabels file
poplabels <- paste0(path,"1000GP_Phase3_sub_aDNA.poplabels")
file_map <- paste0(path,"genetic_map_combined_b37") #filename prefix only
chrs <- 1:1
#Calculate f2s between all pairs of populations
#We specify the argument chr, which takes an R array as input
f2_blocks <- f2_blocks_from_Relate(file_anc, file_mut, poplabels, file_map, chr = chrs, t = 500)
#Once we have f2_blocks we can compute f-statistics as before
f4_ratio(f2_blocks, popX="England_Saxon.SG", popI="Ireland_BA.SG", pop1="Britain.IronRoman.SG", pop2="Scandinavian_Peninsula_EIA(I).SG", popO="YRI")
#> popO popI pop1 pop2
#> 1 YRI Ireland_BA.SG Britain.IronRoman.SG Scandinavian_Peninsula_EIA(I).SG
#> popX val se
#> 1 England_Saxon.SG 0.3812473 0.1750409