Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gmap-based method to identify allelic contig table #16

Closed
starsyi opened this issue Aug 13, 2019 · 22 comments
Closed

Gmap-based method to identify allelic contig table #16

starsyi opened this issue Aug 13, 2019 · 22 comments

Comments

@starsyi
Copy link

starsyi commented Aug 13, 2019

您好!
您这个ALLHiC软件要想鉴定contigs之间的等位基因,需要一个contigs的注释文件和cds文件。也就是说需要对contigs进行从头注释得到这两个信息文件,才能鉴定等位基因,从而去除染色体之间的噪音,是这样吗?
因为我这里看来,做一个大的基因组,注释是非常消耗时间和计算资源的,我在想怎么只通过已经发表的cds序列blastn到contigs上,来鉴定contigs之间的等位基因,这样可以吗?
祝好!!!

@tangerzhang
Copy link
Owner

Hi starsyi,
Thanks for your question. I have realized that annotation is also quite time-comsuming and have developed a gmap-based method to generate allele.ctg.table. Below are the steps and command lines:

  1. run gmap get a gff3 file
gmap_build -D . -d DB target.genome
gmap -D . -d DB -t 12 -f 2 -n $N reference.cds.fasta > gmap.gff3

Note: 1) target.genome is the contig level of polyploid genome assembly
2) $N could be the ploidy of your target genome, for example $N=4 if it is a tetraploid
3) reference.cds.fasta is the coding sequences of diploid genome, which could be reference to get allelic table
2. generate the allelic.ctg.table

perl gmap2AlleleTable.pl referenece.gff3

Note: 1) reference.gff3 is the gff3 annotation of the diploid genome
2) gmap2Alleletable.pl could be get in the following link: https://github.com/tangerzhang/ALLHiC/blob/master/scripts/gmap2AlleleTable.pl

@tangerzhang tangerzhang changed the title 关于等位基因鉴定 Gmap-based method to identify allelic contig table Aug 15, 2019
@Rachel-pku
Copy link

hi, tangerzhang !
sorry for disturb you. when i use the script perl gmap2AlleleTable.pl referenece.gff3 ,the allelic.ctg.table had no contents and the log file has no contents.

@tangerzhang
Copy link
Owner

hi, tangerzhang !
sorry for disturb you. when i use the script perl gmap2AlleleTable.pl referenece.gff3 ,the allelic.ctg.table had no contents and the log file has no contents.

Could you please share a couple of lines of your reference.gff3 file and gmap.gff3 file?

@yuxuanyuan
Copy link

I also got an empty table. Maybe it's better to give an option in gmap2AlleleTable.pl to let users decide which gene name should be used in the reference.gff3, 'Name' or 'ID'?

@tangerzhang
Copy link
Owner

Thanks for the suggestion. Since some gff3 files are not compatible with the script, I will suggest that the users can make a bed file for the reference annotation and then use the revised script (Attached). The BED file should contain at least four columns, including: ChrID Start_Position End_Position and GeneID.
The command line to use this script is below:

perl gmap2AlleleTableBED.pl ref.bed

gmap2AlleleTableBED.pl.txt

@Zachary-Wu
Copy link

Thanks for the suggestion. Since some gff3 files are not compatible with the script, I will suggest that the users can make a bed file for the reference annotation and then use the revised script (Attached). The BED file should contain at least four columns, including: ChrID Start_Position End_Position and GeneID.
The command line to use this script is below:

perl gmap2AlleleTableBED.pl ref.bed

gmap2AlleleTableBED.pl.txt

hi, tangerzhang,

I got my bed file by a python script.

Then, I got an empty table when I use the script {Perl gmap2AlleleTableBED.pl reference.gff3}.

So, I recomposed gmap2AlleleTableBED.pl code by adding {$gene =~ s/;.*//g;} next to {my $gene = $1 if(/Name=(\S+)/); } and altering {print OUT "$data[0]$data[3];} to {print OUT "$data[0]$data[2];}.

Finally, the output looks like the figure of allelic.ctg.table.

I wonder if it is OK that there is a considerable discrepancy between the allelic.ctg.table(based on blased) and the allelic.ctg.table(based on Gmap) .

@tangerzhang
Copy link
Owner

Hi @Zachary-Wu
I had a couple of tests before to compare blast-based and gmap-based methods. It turns out that they are slightly different and the difference will not affect your further results.

@babala111
Copy link

babala111 commented Dec 28, 2020

hi, tangerzhang !
sorry for disturb you. when i use the script perl gmap2AlleleTable.pl referenece.gff3 ,the allelic.ctg.table had no contents and the log file has no contents.

Could you please share a couple of lines of your reference.gff3 file and gmap.gff3 file?

gmap.gff3

Generated by GMAP version 2020-10-14 using call: gmap.sse42 -D . -d DB -t 12 -f 2 -n 2 /home02/mqyin/reference/reference.CDS.fasta

tig00001992_np12 DB gene 14871 16501 . - . ID=Cg8g024190.2.1.path1;Name=Cg8g024190.2.1;Dir=sense
tig00001992_np12 DB mRNA 14871 16501 . - . ID=Cg8g024190.2.1.mrna1;Name=Cg8g024190.2.1;Parent=Cg8g024190.2.1.path1;Dir=sense;coverage=100.0;identity=99.6;matches=556;mismatches=2;indels=0;unknowns=0
tig00001992_np12 DB exon 16399 16501 99 - . ID=Cg8g024190.2.1.mrna1.exon1;Name=Cg8g024190.2.1;Parent=Cg8g024190.2.1.mrna1;Target=Cg8g024190.2.1 1 103 +
tig00001992_np12 DB exon 15949 16298 99 - . ID=Cg8g024190.2.1.mrna1.exon2;Name=Cg8g024190.2.1;Parent=Cg8g024190.2.1.mrna1;Target=Cg8g024190.2.1 104 453 +

reference.gff3
chr1 C.maxima_DB_2016 gene 417 731 . - . ID=Cg1g000010;Name=Cg1g000010
chr1 C.maxima_DB_2016 mRNA 417 731 . - . ID=Cg1g000010.1;Parent=Cg1g000010;Name=Cg1g000010.1
chr1 C.maxima_DB_2016 exon 417 731 . - . ID=Cg1g000010.1.exon1;Parent=Cg1g000010.1
chr1 C.maxima_DB_2016 CDS 417 731 . - 0 ID=Cg1g000010.1.cds1;Parent=Cg1g000010.1

@ardy20
Copy link

ardy20 commented Mar 3, 2021

Hi

I was not able to find the script gmap_build! Could you please point me where that might be?

@tangerzhang
Copy link
Owner

Hi

I was not able to find the script gmap_build! Could you please point me where that might be?

This requires to install GMAP program, which can be found here (http://research-pub.gene.com/gmap/).

@Ahahaha3
Copy link

dear @tangerzhang, i got the Allele.ctg.table file by 'perl gmap2AlleleTableBED.pl ref.bed', but the file like this:
1OTG35741utg000533lutg000533l
1OTG35741utg000533lutg000533l
1OTG35741utg000533lutg000533l
1OTG35741utg000533lutg000533l
1OTG35741utg000533lutg000533l
1OTG35743utg000928l
1OTG35743utg000928l
1OTG35743utg000928l
Column in the file have no delimiters.

@sc-zhang
Copy link
Collaborator

@Ahahaha3, would mind paste some lines of you ref.bed file here, and we will check why it happened.

@Ahahaha3
Copy link

@sc-zhang, Thanks for your reply. There are some lines of my ref.bed.
1 1602 2074 OTG35738
1 2517 4007 OTG35738
1 1922 2074 OTG35738
1 2517 3884 OTG35738
1 17215 18564 OTG35739
1 17215 18564 OTG35739
1 219845 220039 OTG35740
1 219845 220039 OTG35740
1 241552 241972 OTG35741
1 242584 242684 OTG35741
1 243337 243511 OTG35741
1 243652 243815 OTG35741
1 244029 244130 OTG35741
1 244206 244310 OTG35741
1 249037 249485 OTG35741
1 241552 241972 OTG35741
1 242584 242684 OTG35741
1 243337 243511 OTG35741
1 243652 243815 OTG35741
1 244029 244130 OTG35741
1 244206 244310 OTG35741
1 249037 249171 OTG35741
1 296366 296455 OTG35742
1 296592 297164 OTG35742
1 297276 297596 OTG35742
1 296390 296455 OTG35742
1 296592 297164 OTG35742
1 297276 297596 OTG35742

@sc-zhang
Copy link
Collaborator

@Ahahaha3 , sorry, I also need some lines of gmap.gff3 to check why it happened.

@lxingze
Copy link

lxingze commented Mar 16, 2021

gmap.gff3 is
tig00002906 DB gene 4106495 4106584 . - . ID=ATCG00210.1.path1;Name=ATCG00210.1
tig00002906 DB mRNA 4106495 4106584 . - . ID=ATCG00210.1.mrna1;Name=ATCG00210.1;Parent=ATCG00210.1.path1;coverage=100.0;identity=92.2;matches=83;mismatches=7;indels=0;unknowns=0
tig00002906 DB exon 4106495 4106584 92 - . ID=ATCG00210.1.mrna1.exon1;Name=ATCG00210.1;Parent=ATCG00210.1.mrna1;Target=ATCG00210.1 1 90 +
tig00002906 DB CDS 4106495 4106584 92 - 0 ID=ATCG00210.1.mrna1.cds1;Name=ATCG00210.1;Parent=ATCG00210.1.mrna1;Target=ATCG00210.1 1 90 +

reference.cds.fasta is
Chr1 3631 5899 AT1G01010.1
Chr1 6788 9130 AT1G01020.1
Chr1 6788 9130 AT1G01020.4
Chr1 6788 9130 AT1G01020.3
Chr1 6788 9130 AT1G01020.5
Chr1 6788 8737 AT1G01020.2
Chr1 6788 8737 AT1G01020.6
Chr1 11649 13714 AT1G01030.1
Chr1 11649 13714 AT1G01030.2
Chr1 23416 31120 AT1G01040.2
Chr1 23121 31227 AT1G01040.1

i got the Allele.ctg.table file by 'perl gmap2AlleleTableBED.pl ref.bed', but the file like this:
Chr1AT1G01040.2tig00009238
Chr1AT1G01040.1tig00009238
Chr1AT1G01050.1tig00003879tig00009117
Chr1AT1G01050.2tig00003879tig00009117
Chr1AT1G01090.1tig00008635
Chr1AT1G01120.1tig00009175tig00009424
Chr1AT1G01140.3tig00006398tig00006398
Chr1AT1G01140.2tig00006398tig00006398
Chr1AT1G01140.1tig00006398tig00006398
Chr1AT1G01220.1tig00009238tig00009238
Chr1AT1G01220.7tig00009238tig00009238
Chr1AT1G01220.3tig00009238tig00009238
Chr1AT1G01220.8tig00009238tig00009238
Chr1AT1G01220.4tig00009238tig00009238
Chr1AT1G01220.2tig00009238
Chr1AT1G01220.6tig00009238

@sc-zhang
Copy link
Collaborator

sc-zhang commented Mar 16, 2021

@lxingze , I cannot get the same result as you paste here, so I want to know which version of perl you use, and then make a further test.

@lxingze
Copy link

lxingze commented Mar 16, 2021

@lxingze , I cannot get the same result as you paste here, so I want to know which version of perl you use, and then make a further test.

This is perl 5, version 26, subversion 2 (v5.26.2)

@sc-zhang
Copy link
Collaborator

@lxingze , maybe you need check if you use the latest version of gmap2AlleleTableBED.pl. If it still not work, maybe you can try codes below, and check if it works.

#!/usr/bin/perl -w

die "Usage: perl $0 ref.bed\n" if(!defined ($ARGV[0]));
my $refGFF = $ARGV[0];
open(IN, "grep 'gene' gmap.gff3 |") or die"";
while(<IN>){
    chomp;
    my @data = split(/\s+/,$_);
    my $gene = $1 if(/Name=(\S+)/);
    $infordb{$gene} .= $data[0]."\t";
    }   
close IN; 

open(OUT, "> Allele.ctg.table") or die"";
open(IN, $refGFF) or die"";
while(<IN>){
    chomp;
    my @data = split(/\s+/,$_);
    my $gene = $data[3];
       $gene =~ s/;.*//g;
    next if(!exists($infordb{$gene}));
    my @tdb = split(/\s+/,$infordb{$gene});
    my %tmpdb = (); 
    map {$tmpdb{$_}++} @tdb;
    print OUT $data[0]."\t".$data[3]."\t";
    map {print OUT $_."\t"} keys %tmpdb;
    print OUT "\n";
    }   
close IN; 
close OUT;

@xingchang-web
Copy link

gmap.gff3
老师,您好。请问对于异源四倍体基因组,需要一个还是两个近缘的二倍体reference呢?

1 similar comment
@xingchang-web
Copy link

gmap.gff3
老师,您好。请问对于异源四倍体基因组,需要一个还是两个近缘的二倍体reference呢?

@Huangxc01
Copy link

Hi, @tangerzhang, When I want to use gmap2AlleleTable.pl to aquire the Allelic.contig.table I encountered an error as below:
[Use of uninitialized value $gene in substitution (s///) at gmap2AlleleTable.pl line 20, line 28232.
Use of uninitialized value $gene in exists at gmap2AlleleTable.pl line 21, line 28232.]

My gmap.gff3 contents like this:
ctg000450 gramineusDB gene 5520161 5699876 . + . ID=contig029.path1;Name=contig029;Dir=sense
ctg000450 gramineusDB mRNA 5520161 5699876 . + . ID=contig029.mrna1;Name=contig029;Parent=contig029.path1
ctg000450 gramineusDB exon 5520161 5522253 52 + . ID=contig029.mrna1.exon1;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5522283 5522345 96 + . ID=contig029.mrna1.exon2;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5523878 5524494 91 + . ID=contig029.mrna1.exon3;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5524509 5524876 86 + . ID=contig029.mrna1.exon4;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5524906 5524917 100 + . ID=contig029.mrna1.exon5;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5524945 5526152 92 + . ID=contig029.mrna1.exon6;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5531519 5531763 77 + . ID=contig029.mrna1.exon7;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5531803 5531904 39 + . ID=contig029.mrna1.exon8;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5531927 5531944 88 + . ID=contig029.mrna1.exon9;Name=contig029;Parent=contig029
ctg000450 gramineusDB exon 5545430 5545482 88 + . ID=contig029.mrna1.exon10;Name=contig029;Parent=contig02
ctg000450 gramineusDB exon 5545656 5545933 78 + . ID=contig029.mrna1.exon11;Name=contig029;Parent=contig02
ctg000450 gramineusDB exon 5550215 5550222 100 + . ID=contig029.mrna1.exon13;Name=contig029;Parent=contig02
ctg000450 gramineusDB exon 5552476 5552491 81 + . ID=contig029.mrna1.exon14;Name=contig029;Parent=contig02
ctg000450 gramineusDB exon 5552498 5553060 84 + . ID=contig029.mrna1.exon15;Name=contig029;Parent=contig02
ctg000450 gramineusDB exon 5597057 5597064 100 + . ID=contig029.mrna1.exon16;Name=contig029;Parent=contig02

And my reference.gff3 format like this:
chr1 evm gene 109407 114443 . + . ID=ATA1.1
chr1 evm mRNA 109407 114443 . + . ID=ATA1.1.t1;Parent=ATA1.1
chr1 evm exon 109407 110469 . + . ID=ATA1.1.t1.exon1;Parent=ATA1.1.t1
chr1 evm CDS 109407 110469 . + 0 ID=ATA1.1.t1.CDS1;Parent=ATA1.1.t1
chr1 evm exon 110734 110808 . + . ID=ATA1.1.t1.exon2;Parent=ATA1.1.t1
chr1 evm CDS 110734 110808 . + 2 ID=ATA1.1.t1.CDS2;Parent=ATA1.1.t1
chr1 evm exon 111325 111539 . + . ID=ATA1.1.t1.exon3;Parent=ATA1.1.t1
chr1 evm CDS 111325 111539 . + 2 ID=ATA1.1.t1.CDS3;Parent=ATA1.1.t1
chr1 evm exon 112391 112614 . + . ID=ATA1.1.t1.exon4;Parent=ATA1.1.t1
chr1 evm CDS 112391 112614 . + 0 ID=ATA1.1.t1.CDS4;Parent=ATA1.1.t1
chr1 evm exon 113798 114443 . + . ID=ATA1.1.t1.exon5;Parent=ATA1.1.t1
chr1 evm CDS 113798 114443 . + 1 ID=ATA1.1.t1.CDS5;Parent=ATA1.1.t1
chr1 evm gene 118993 126684 . + . ID=ATA1.2
chr1 evm mRNA 118993 126684 . + . ID=ATA1.2.t1;Parent=ATA1.2
chr1 evm five_prime_UTR 118993 119178 . + . ID=ATA1.2.t1.utr5p1;Parent=ATA1.2.t1
chr1 evm exon 118993 119317 . + . ID=ATA1.2.t1.exon1;Parent=ATA1.2.t1
chr1 evm CDS 119179 119317 . + 0 ID=ATA1.2.t1.CDS1;Parent=ATA1.2.t1
chr1 evm exon 119923 119996 . + . ID=ATA1.2.t1.exon2;Parent=ATA1.2.t1
chr1 evm CDS 119923 119996 . + 2 ID=ATA1.2.t1.CDS2;Parent=ATA1.2.t1
chr1 evm exon 121241 121324 . + . ID=ATA1.2.t1.exon3;Parent=ATA1.2.t1
chr1 evm CDS 121241 121324 . + 0 ID=ATA1.2.t1.CDS3;Parent=ATA1.2.t1
chr1 evm exon 121423 121503 . + . ID=ATA1.2.t1.exon4;Parent=ATA1.2.t1
chr1 evm CDS 121423 121503 . + 0 ID=ATA1.2.t1.CDS4;Parent=ATA1.2.t1

I think the format of reference.gff3 was normal format, of course, I've tried using gmap2AlleleTableBED.pl and encountered the same error, could you please tell me how to modify perl script or file contents to acquire the Allelic.contig.table I want.

@Wenwen012345
Copy link

Thanks for the suggestion. Since some gff3 files are not compatible with the script, I will suggest that the users can make a bed file for the reference annotation and then use the revised script (Attached). The BED file should contain at least four columns, including: ChrID Start_Position End_Position and GeneID.
The command line to use this script is below:

perl gmap2AlleleTableBED.pl ref.bed

gmap2AlleleTableBED.pl.txt

hi, tangerzhang,

I got my bed file by a python script.

Then, I got an empty table when I use the script {Perl gmap2AlleleTableBED.pl reference.gff3}.

So, I recomposed gmap2AlleleTableBED.pl code by adding {$gene =~ s/;.*//g;} next to {my $gene = $1 if(/Name=(\S+)/); } and altering {print OUT "$data[0]$data[3];} to {print OUT "$data[0]$data[2];}.

Finally, the output looks like the figure of allelic.ctg.table.

I wonder if it is OK that there is a considerable discrepancy between the allelic.ctg.table(based on blased) and the allelic.ctg.table(based on Gmap) .

@tangerzhang
Hi, I used the same method to get the correct result. Perhaps you may need to modify the gmap2AlleleTableBED.pl script in the original file for convenience? Thanks.

@Zachary-Wu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests