UK Biobank imputation pipelines
Genotype imputation pipelines for the UK Biobank Research Analysis Platform
Genotype imputation is a computational technique that estimates missing genotypes in SNP array data using a reference panel of haplotypes. The same approach applies to low-coverage whole genome sequencing data, where it fills in missing genotypes and refines uncertain genotype calls derived from the sequencing reads.
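To make the idea concrete, here is a deliberately simplified Python sketch of haplotype-based imputation. It is a toy illustration only and not the method used by IMPUTE5 or GLIMPSE2, which rely on probabilistic haplotype-copying models. The function name `impute_from_nearest_haplotype` and the tiny panel are hypothetical: the sketch assumes a single target haplotype with missing sites marked as `None`, picks the reference haplotype with the fewest mismatches at the observed sites, and copies its alleles into the gaps.

```python
from typing import List, Optional


def impute_from_nearest_haplotype(
    target: List[Optional[int]],
    reference: List[List[int]],
) -> List[int]:
    """Fill missing alleles (None) by copying from the closest reference haplotype.

    Toy model: closeness is the number of mismatches at observed sites only.
    """
    def mismatches(hap: List[int]) -> int:
        # Count disagreements only where the target allele is observed.
        return sum(1 for t, r in zip(target, hap) if t is not None and t != r)

    best = min(reference, key=mismatches)
    # Keep observed alleles, copy the reference allele where the target is missing.
    return [r if t is None else t for t, r in zip(target, best)]


# Hypothetical reference panel: 3 haplotypes over 5 variant sites (0 = ref, 1 = alt).
reference_panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
]

# Target haplotype typed at sites 0, 2 and 4 only (as on a sparse SNP array).
target_haplotype = [1, None, 0, None, 0]

imputed = impute_from_nearest_haplotype(target_haplotype, reference_panel)
print(imputed)
```

In this toy case the second reference haplotype matches every observed site, so its alleles fill the two missing positions. Real imputation tools instead average over many reference haplotypes and report probabilities rather than a single hard call.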
We provide two distinct genotype imputation pipelines, one for SNP array data and one for low-coverage whole genome sequencing (WGS) data, both using the UK Biobank reference panel (>200,000 samples; 700M variants). To keep costs down, the pipelines rely on efficient state-of-the-art tools: IMPUTE5 (Rubinacci et al., 2020) for SNP array imputation and GLIMPSE2 (Rubinacci et al., 2023) for low-coverage WGS imputation.
The pipelines accept either a multi-sample VCF/BCF file of SNP array genotypes or a set of low-coverage BAM/CRAM files. Imputation against the UK Biobank reference panel runs through applets and dx command jobs tailored to the UK Biobank Research Analysis Platform (UKB RAP). Each pipeline produces one multi-sample BCF file per chromosome containing genotype posteriors, dosages, and phased best-guess genotypes. Additional outputs, such as haploid dosages, can be obtained by setting the appropriate options in the imputation software.
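The three output quantities are closely related. As a minimal sketch, assuming a biallelic variant in a diploid sample with genotype posteriors ordered as (hom-ref, het, hom-alt), the alternate-allele dosage is the expected number of alternate alleles and the best-guess genotype is the most probable one. The function name `dosage_and_best_guess` is hypothetical and this is not code from the pipelines themselves; phasing information is ignored here.

```python
from typing import Sequence, Tuple


def dosage_and_best_guess(gp: Sequence[float]) -> Tuple[float, int]:
    """Derive dosage and best-guess genotype from diploid genotype posteriors.

    gp: posteriors for (hom-ref, het, hom-alt); assumed to sum to ~1.
    Returns (expected alt-allele count, most probable alt-allele count).
    """
    if len(gp) != 3:
        raise ValueError("expected three genotype posteriors")
    p_ref, p_het, p_alt = gp
    if abs((p_ref + p_het + p_alt) - 1.0) > 1e-3:
        raise ValueError("genotype posteriors should sum to 1")

    # Expected number of alternate alleles: 0*P(0/0) + 1*P(0/1) + 2*P(1/1).
    dosage = p_het + 2.0 * p_alt
    # Best-guess genotype: index (alt-allele count) with the highest posterior.
    best_guess = max(range(3), key=lambda g: gp[g])
    return dosage, best_guess


# A confidently heterozygous call and a more uncertain one.
confident = dosage_and_best_guess((0.05, 0.90, 0.05))
uncertain = dosage_and_best_guess((0.50, 0.40, 0.10))
print(confident, uncertain)
```

The second call shows why dosages are often preferred downstream: the best-guess genotype is hom-ref, but the dosage of about 0.6 retains the uncertainty that the hard call discards.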
If you use the pipelines in your research, please cite the following papers:
Reference panel
Low-coverage WGS imputation
SNP array imputation
The UK Biobank imputation pipelines are developed by Simone Rubinacci and Olivier Delaneau.
The UK Biobank imputation pipelines are distributed under the MIT license.