
1. IMGT2Seq

WansonChoi edited this page Nov 9, 2022 · 2 revisions

Usage Example

$ python3 -m IMGT2Seq \
	--hg 18 \
	--imgt 3320 \
	--imgt-dir example/IMGTHLA3320 \
	--HLA A B C DMA DMB DOA DOB DPA1 DPB1 DQA1 DQA2 DQB1 DRA DRB1 DRB5 E F G MICA MICB \
	--out MyIMGT2Seq/ \
	--multiprocess 8

Introduction

IPD-IMGT/HLA database

The IPD-IMGT/HLA database is a specialist database that provides the official amino acid and DNA sequences of each HLA allele. This information can be used to generate a marker panel that covers the polymorphism of the HLA region. By running association tests on this marker panel, researchers can analyze the signals arising from the HLA region.

  1. Features of the IMGT database

Objective

Although the IPD-IMGT/HLA database provides the amino acid and DNA sequences of HLA alleles, in practice they are hard for researchers to use as they are. IMGT2Seq preprocesses this sequence information into a more usable form so that it can be used in the subsequent steps of HATK.

Output

IMGT2Seq generates (1) Sequence dictionary, (2) HLA Allele Table, and (3) Maptable files, which are used in bMarkerGenerator, NomenCleaner, and Heatmap.

(1) Sequence dictionary

  1. HLA_DICTIONARY_AA.hg18.imgt3320.txt
  2. HLA_DICTIONARY_AA.hg18.imgt3320.map
  3. HLA_DICTIONARY_SNPS.hg18.imgt3320.txt
  4. HLA_DICTIONARY_SNPS.hg18.imgt3320.map
  5. HLA_ALLELE_TABLE.imgt3320.hat

(You can find examples in the 'example/RESULT_EXAMPLE/' folder.)

File 1 is a dictionary of amino acid residue sequences ranging from the 1st exon to the last exon. File 2 holds the genomic positions of those amino acid residues. Files 3 and 4 are the counterparts of files 1 and 2 for the DNA base pair sequences, which include not only the exons but also the introns between them. File 5 is for NomenCleaner: all HLA allele names in the given version of the database are processed into it, and NomenCleaner uses it to convert a given HLA allele name to the updated name.

(1-1) Genomic Position

The genomic positions of the amino acids and SNPs in files 2 and 4 were assigned as follows.

  1. The starting position of the 1st exon of each HLA gene in the human genome build (e.g. hg18, hg19, or hg38) is taken from IGV (https://software.broadinstitute.org/software/igv/).
  2. Each following base is assigned the position of the previous base incremented by 1, consecutively.
  3. For amino acid residues, the position of the middle base of the codon's 3 base pairs is used as the genomic position.

Note that these positions are not perfectly compatible with those of the human genome build (e.g. hg18, hg19, hg38), mainly because the IPD-IMGT/HLA database is updated more frequently than the human genome. IMGT2Seq therefore shares only the start position of the 1st exon of each HLA gene with the genome build, based on IGV (https://software.broadinstitute.org/software/igv/). All other positions are assigned by consecutively incrementing (or decrementing) from that start position. Be aware that the genomic positions generated by HATK may be unsuitable for research that is very sensitive to genomic position values.
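The position assignment described above can be illustrated with a minimal Python sketch. It is not the IMGT2Seq implementation: the function names, the `step` parameter (used here to model decrementing positions), and the start coordinate are hypothetical and chosen only for illustration.

```python
from typing import List


def assign_base_positions(exon1_start: int, n_bases: int, step: int = 1) -> List[int]:
    """Give each base a position by counting consecutively from the exon 1 start.

    step=+1 counts upward; step=-1 counts downward (an assumption about how
    decremented positions would be produced).
    """
    return [exon1_start + step * i for i in range(n_bases)]


def codon_midpoints(coding_positions: List[int]) -> List[int]:
    """Return the position of the middle base of every complete codon."""
    n_codons = len(coding_positions) // 3
    return [coding_positions[3 * i + 1] for i in range(n_codons)]


# Hypothetical start coordinate, not a real exon 1 position.
bases = assign_base_positions(exon1_start=1000, n_bases=9)
print(codon_midpoints(bases))  # [1001, 1004, 1007]
```

The sketch treats the nine bases as one uninterrupted coding stretch. In a real gene the codons are taken from exon bases only, so the list passed to `codon_midpoints` would have to skip the intron positions.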

(1-2) Special Characters

Some characters in the IMGT2Seq output, such as '*' (asterisk), '.' (dot), or 'x', may be confusing. To understand them, users need the concepts of 'Official Reference Sequence' and 'Virtual Sequence' defined by IPD-IMGT/HLA. The official, detailed explanation can be found at https://www.ebi.ac.uk/ipd/imgt/hla/nomenclature/alignments.html.

Briefly, every HLA gene has an official reference sequence, e.g. A*01:01:01:01 for HLA-A and B*07:02:01:01 for HLA-B. Meanwhile, for each HLA allele, all the individual sequence entries are submitted to the IPD-IMGT/HLA database and a virtual sequence is created from their alignment. Insertions and deletions (indels) arise during this alignment and are marked as '.' (dot).

Researchers should distinguish the dots ('.') in the official reference sequence from the dots in the virtual sequence. The former are not assigned numbering (see the 'Numbering of the Sequence Alignment' section - https://www.ebi.ac.uk/ipd/imgt/hla/nomenclature/alignments.html). This implies that the former dots mark an INSERTION while the latter mark a DELETION. In HATK, the former dots are processed to 'Z' or 'z', where 'Z' means there is an insertion and 'z' means there is no insertion (normal status). These spots are represented by markers of the form 'INS_~'. The dots in the virtual sequence, on the other hand, mean "Deletion in the virtual sequence".

'*' (asterisk) represents 'unknown at any point in the alignment'. IMGT2Seq converts this character to 'x'.

In summary,

  1. '*'(asterisk): Unknown at any point in the alignment. IMGT2Seq converts this to 'x'.
  2. 'x': Same as the '*'(asterisk).
  3. 'Z': There is an insertion.
  4. 'z': There is no insertion(Normal status).
  5. '.'(dot): There is a deletion.
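As a rough illustration of the character rules summarized above, the following Python sketch masks unknown positions and flags insertion spots. It is not the HATK code: the function names and toy sequences are hypothetical, and emitting one 'Z'/'z' flag per run of reference dots is an assumption made for this example.

```python
import re
from typing import List


def mask_unknown(seq: str) -> str:
    """Replace every '*' (unknown) with 'x'."""
    return seq.replace("*", "x")


def insertion_flags(reference: str, allele: str) -> List[str]:
    """Flag each run of dots in the aligned reference sequence.

    'Z' = the allele has residues at that spot (insertion),
    'z' = the allele has only dots there (no insertion, normal status).
    One flag per run of dots is an assumption of this sketch.
    """
    if len(reference) != len(allele):
        raise ValueError("aligned sequences must have equal length")
    flags = []
    for match in re.finditer(r"\.+", reference):
        segment = allele[match.start():match.end()]
        has_residue = any(char != "." for char in segment)
        flags.append("Z" if has_residue else "z")
    return flags


# Toy aligned sequences, invented for illustration only.
reference = "MAV..RP"
print(insertion_flags(reference, "MAVGGRP"))  # ['Z']
print(insertion_flags(reference, "MAV..R*"))  # ['z']
print(mask_unknown("MAV..R*"))                # MAV..Rx
```

Dots that appear only in the allele's sequence, not in the reference, are left untouched here, matching the rule that they denote deletions.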

(2) HLA Allele Table

(3) Maptable

Version 2