Several users have run into problems when using genomes from UCSC or NCBI in conjunction with the VCF file from the Mouse Genomes Project (MGP, http://www.sanger.ac.uk/science/data/mouse-genomes-project).
The reason for this is that the MGP uses Ensembl chromosome names (i.e. 1, 2, 3, X, MT), whereas UCSC uses chromosome names that look like this: chr1, chr2, chr3, chrX, chrM.
We have recently added a check to the SNPsplit genome preparation script that will abort if a chromosome name discrepancy is detected (FelixKrueger#4). It is however possible to convert the VCF file into a UCSC-compatible version by (a) changing the chromosome name in each data line from e.g. 1 to chr1 and (b) changing the chromosome names in the ID field of the contig lines in the VCF file header. It is normally not necessary to change the name of the mitochondrial genome from MT to chrM because no SNP positions are recorded for MT anyway.
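A name discrepancy of this kind can be spotted up front by comparing the sequence names in the genome FASTA headers with the CHROM column of the VCF file. The following Python snippet is only a rough sketch of that idea and not the check implemented in SNPsplit itself. The function names are made up for illustration, and it assumes uncompressed, tab-separated input.

```python
def fasta_names(lines):
    """Yield sequence names from FASTA headers ('>chr1 some text' -> 'chr1')."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def vcf_names(lines):
    """Yield each distinct CHROM value of the VCF data lines, in order of first appearance."""
    seen = set()
    for line in lines:
        # Skip header lines (## meta lines and the #CHROM line) and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.add(chrom)
            yield chrom


def missing_from_genome(fasta_lines, vcf_lines):
    """Return the VCF chromosome names that have no matching FASTA sequence name."""
    genome = set(fasta_names(fasta_lines))
    return [chrom for chrom in vcf_names(vcf_lines) if chrom not in genome]
```

Both arguments can be any iterable of lines, for example two open file handles. A non-empty result such as 1, 2, 3, X against a UCSC genome would point to the Ensembl/UCSC mismatch described above.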
Here is a one-line awk script that does an Ensembl => UCSC conversion, but you could of course also run an equivalent script in Perl or Python:
awk '{if($1 ~ "^#") {gsub("contig=<ID=", "contig=<ID=chr"); gsub("contig=<ID=chrMT", "contig=<ID=chrM"); print} else {gsub("^MT", "M"); print "chr"$0}}' mgp.v5.merged.snps_all.dbSNP142.vcf
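For readers who prefer Python, the same conversion could look roughly like the sketch below. It is meant to mirror the awk one-liner rather than to be an official part of SNPsplit. The function name `ensembl_to_ucsc` is made up for illustration. Unlike the awk version, it only renames MT when the whole CHROM field is exactly MT, and it leaves blank lines untouched.

```python
import sys


def ensembl_to_ucsc(line):
    """Convert a single VCF line from Ensembl to UCSC chromosome naming."""
    if line.startswith("#"):
        # Header: prefix the contig IDs with 'chr', then turn chrMT into chrM.
        line = line.replace("contig=<ID=", "contig=<ID=chr")
        return line.replace("contig=<ID=chrMT", "contig=<ID=chrM")
    if not line.strip():
        return line
    # Data line: the chromosome name is the first tab-separated field.
    chrom, sep, rest = line.partition("\t")
    if chrom == "MT":
        chrom = "M"
    return "chr" + chrom + sep + rest


if __name__ == "__main__":
    # Read the Ensembl-style VCF from STDIN, write the UCSC-style VCF to STDOUT.
    for vcf_line in sys.stdin:
        sys.stdout.write(ensembl_to_ucsc(vcf_line))
```

Saved under a name of your choice (say, the hypothetical ensembl_to_ucsc.py), it would be run as `python ensembl_to_ucsc.py < mgp.v5.merged.snps_all.dbSNP142.vcf > converted.vcf`, where converted.vcf is likewise just a placeholder for the output file.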
SNPsplit is written in Perl and is executed from the command line. To install SNPsplit simply download the latest release of the code from the Releases page and extract the files into a SNPsplit installation folder.
SNPsplit requires the following tools installed and ideally available in the PATH:
- The SNPsplit documentation can be found here: SNPsplit User Guide
- SNPsplit publication at F1000 Research:
- Here is a link to the SNPsplit project site at the Babraham Institute.
SNPsplit was written by Felix Krueger, part of the Babraham Bioinformatics group.