Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agat_sp_merge_annotations.pl output incompatible with cellranger #457

Closed
LliliansCalvo opened this issue May 6, 2024 · 40 comments
Closed

Comments

@LliliansCalvo
Copy link

agat_sp_merge_annotations.pl output incompatible with cellranger
I have two gff files I want to merge.
One is an old annotation, and the other is a new annotation I have just made using braker3.

In order to merge them i am using agat_1.0.0 in the singularity container.
Here is the code with all the steps I have done:

# GFF to GTF using agat
singularity exec agat_1.0.0--pl5321hdfd78af_0.sif   agat_convert_sp_gxf2gxf.pl  --gff braker.gff3  -o agat_braker.gff3.gtf

singularity exec agat_1.0.0--pl5321hdfd78af_0.sif   agat_convert_sp_gxf2gxf.pl  --gff replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf                 -o agat_replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf 

# Merge the two files

singularity exec   agat_1.0.0--pl5321hdfd78af_0.sif  agat_sp_merge_annotations.pl         --gff agat_braker.gff3.gtf         --gff agat_replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf --out agat_sanitized_sp_merge_braker3_old_annotation

head agat_sanitized_sp_merge_braker3_old_annotation
##gff-version 3
JAPIVC010000919.1	.	gene	556	8241	.	+	.	ID=evm.model.ctg717.1;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	mRNA	556	8241	.	+	.	ID=nbis-mrna-11951;Parent=evm.model.ctg717.1;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	exon	556	1010	.	+	.	ID=exon-92011;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	exon	1430	1864	.	+	.	ID=exon-92012;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	exon	2267	2447	.	+	.	ID=exon-92013;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	exon	3334	3475	.	+	.	ID=exon-92014;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	exon	5986	6266	.	+	.	ID=exon-92015;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	exon	6682	6893	.	+	.	ID=exon-92016;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1	.	exon	7047	7328	.	+	.	ID=exon-92017;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1

# Doesn't  work!
cellranger mkref  --genome=C_fellah  --fasta=mod_GCA_030586385.1_ASM3058638v1_genomic.fna  --genes=agat_sanitized_sp_merge_braker3_old_annotation

[error] mkref has failed: error building reference package
Error while parsing GTF file agat_sanitized_sp_merge_braker3_old_annotation
Property 'transcript_id' not found in GTF line 4: JAPIVC010000919.1	.	exon	556	1010	.	+	.	ID=exon-92011;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1

awk 'NR==4' agat_sanitized_sp_merge_braker3_old_annotation
JAPIVC010000919.1	.	exon	556	1010	.	+	.	ID=exon-92011;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1


# To try and solve this I then did:

singularity exec   agat_1.0.0--pl5321hdfd78af_0.sif   agat_sp_manage_attributes.pl --gff agat_sanitized_sp_merge_braker3_old_annotation --att gene_id/gene_name --cp  -o step1_agat_sanitized_sp_merge_braker3_old_annotation

singularity exec agat_1.0.0--pl5321hdfd78af_0.sif  agat_convert_sp_gff2gtf.pl  --gff step1_agat_sanitized_sp_merge_braker3_old_annotation -o step2_agat_sanitized_sp_merge_braker3_old_annotation

head step2_agat_sanitized_sp_merge_braker3_old_annotation
##gtf-version 3
JAPIVC010000296.1	.	gene	4567	9586	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "evm.model.ctg150.1"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	transcript	4567	9586	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "nbis-mrna-12890"; Parent "evm.model.ctg150.1"; gene_name "evm.model.ctg150.1"; original_biotype "mrna"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	exon	4567	4848	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41546"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	exon	4966	5096	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41545"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	exon	5166	5284	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41544"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	exon	5380	5504	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41543"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	exon	5858	6127	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41542"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	exon	7027	7154	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41541"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1	.	exon	7246	7642	.	-	.	gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41540"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";


# Run cellranger again
 cellranger mkref  --genome=C_fellah  --fasta=mod_GCA_030586385.1_ASM3058638v1_genomic.fna  --genes=step2_agat_sanitized_sp_merge_braker3_old_annotation

[error] mkref has failed: error building reference package
Error while parsing GTF file step2_agat_sanitized_sp_merge_braker3_old_annotation
Error parsing GTF at line 8024.  Parsed attribute had a quote in the middle of a value.  Please ensure quotes are only used to encapsulate attribute values.
 Bad Attribute Value = transcript_id 

awk 'NR==8024' step2_agat_sanitized_sp_merge_braker3_old_annotation
JAPIVC010000802.1	.	gene	525047	606909	.	+	.	gene_id "evm.model.ctg61.39_evm" "evm.model.ctg61.40"; transcript_id "evm.model.ctg61.39_evm.model.ctg61" "evm.model.ctg61.40.1.5d03f9f9"; ID "evm.model.ctg61.39_evm"; gene_name "evm.model.ctg61.39_evm"; ortholog_ID "TITIN_DROME" "E2AEH5_CAMFO";

As you can see agat generates 2 gene_id and 2 transcript_id for this transcript. I have fixed manually this one but this also happens for other genes.
Hope you can help !
Thanks !!

@Juke34
Copy link
Collaborator

Juke34 commented Jun 3, 2024

It sounds the code involved in your issue has been updated after v1.0.0.
Could you give a try with the latest version (v1.4.0)?
You may go more straithgforward:

agat config  --expose --output_format GTF
agat_sp_merge_annotations.pl   --gff agat_braker.gff3.gtf    --gff agat_replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf --out agat_result.gtf

The sanitization is made by all script with _sp_ prefix, so no need to use agat_convert_sp_gxf2gxf.pl excepted if you want to keep track of the intermediate sanitized files (before to be merged).

@Juke34 Juke34 closed this as completed Jun 24, 2024
@mschilli87

This comment was marked as outdated.

@mschilli87
Copy link

We are still getting an error when trying to build a custom Cell Ranger index based on the AGAT-merged GTF, even when using v.1.4.0. I'll try to provide more details tomorrow.

@Juke34
Copy link
Collaborator

Juke34 commented Sep 4, 2024

Could you share a sample of the data used?

@mschilli87
Copy link

mschilli87 commented Sep 4, 2024

I suspect the problematic entries come from this GFF file which defines some pairs of genes with the exact same coordinates and structure but different IDs.

@Juke34
Copy link
Collaborator

Juke34 commented Sep 4, 2024

Following this post in parallel https://www.biostars.org/p/9601455/#9601546
I see two potential issues:
AGAT creating mRNA while transcripts is expected
And in our case hereI think it is related to multiple values attributes.

@Juke34 Juke34 reopened this Sep 4, 2024
@Juke34
Copy link
Collaborator

Juke34 commented Sep 4, 2024

To avoid to have mRNA instead of transcript in the output you should avoid converting in "relax" mode that allow all type of feature type in column 3.
To avoid this problem adapt the AGAT config file using agat config --expose --output_format GTF --gtf_output_version 3

@mschilli87

This comment was marked as resolved.

@Juke34
Copy link
Collaborator

Juke34 commented Sep 4, 2024

There are probably two issues. One is the fact that you have multiple value attributes like gene_id "name1" "name2";
I should check that it is really problematic before adding a trick into AGAT.
And the mRNA feature that supposed to be transcript.

Could you send me over the fasta file corresponding to the GFF file you provided?

@mschilli87

This comment was marked as resolved.

@mschilli87
Copy link

@Juke34: Is there anything I can do to troubleshoot this further or do I need to give up on using AGAT and find an alternative approach to generate my GTF?

@LliliansCalvo: Would you mind sharing your post-processing script that enabled you to feed AGAT's output into Cell Ranger?

@Juke34
Copy link
Collaborator

Juke34 commented Sep 12, 2024

@mschilli87 I can easily add patch to avoid gene_id attribute to have several values but I do need the GFF sample that provided the

JAPIVC010000802.1	.	gene	525047	606909	.	+	.	gene_id "evm.model.ctg61.39_evm" "evm.model.ctg61.40"; transcript_id "evm.model.ctg61.39_evm.model.ctg61" "evm.model.ctg61.40.1.5d03f9f9"; ID "evm.model.ctg61.39_evm"; gene_name "evm.model.ctg61.39_evm"; ortholog_ID "TITIN_DROME" "E2AEH5_CAMFO";

GTF output.

@mschilli87

This comment was marked as resolved.

@Juke34
Copy link
Collaborator

Juke34 commented Sep 26, 2024

I wanted the fasta to make some test directly with cell ranger.
I can share a patch with you that you can try otherwise

@Juke34
Copy link
Collaborator

Juke34 commented Sep 26, 2024

@mschilli87 Could you try the code from branch 457 ? And let me know if the error left?

@mschilli87

This comment was marked as outdated.

@mschilli87
Copy link

mschilli87 commented Oct 22, 2024

I am still seing some multi-value tags (incl. tag, gene_name, & transcript_name) with this version (I do get a lot of the 'keeping only the first one' messages though).


edit: Could this be related to me having agat_sp_merge_annotation.pl write GFFe output and converting this GFF3 file to GTF for CellRanger using agat_convert_sp_gff2gtf.pl (from the same branch!) in a second step?

@Juke34
Copy link
Collaborator

Juke34 commented Oct 22, 2024

Yes I remove double attributes only for gene_id and transcript_id as it is the only sensible attributes to get a consistent file.

P.S: Merging annotation should merge 2 isoforms in one only and only if they are 100% identical. But multi-values is not something problematic in general, except for ID/Parent. attributes in GFF and gene_id, transcript_id where it should be forbidden to keep file consistency.

Does it work with cellranger?

@mschilli87
Copy link

mschilli87 commented Oct 22, 2024

@Juke34:

Does it work with cellranger?

No. It seems to be throwing off there parser. A line with double tag value triggers an error about unexpected quotes in gene_name, even though only one value is given in that particular field.

@mschilli87
Copy link

@Juke34

Yes I remove double attributes only for gene_id and transcript_id as it is the only sensible attributes to get a consistent file.

Also, would it be possible to preserve that information somewhere? In our case, one gene_id might be an ENSEMBL Gene ID while the other is just a random number. While I don't expect AGAT to pick the more useful one, it would be great to be able to check if the random number ID masks a well know gene if it comes up as interesting in a downstream analysis. Maybe an altarnative IDs attribute or at least mentioning the IDs in a parsable way in the warning messages. This would also make it easier now to quantify the problem and trace some of the cases upstream to figure out why some genes have shadow copies in that annotation.

@Juke34
Copy link
Collaborator

Juke34 commented Oct 23, 2024

In the current code it is saved in agat_other_gene_id and agat_other_transcript_id attribute accordingly.

@Juke34
Copy link
Collaborator

Juke34 commented Oct 23, 2024

I still don't get how you end up with several values in gene_id and transcript_id, I should definitely fix that if it is AGAT that makes this mistake.

@mschilli87

This comment was marked as outdated.

@mschilli87
Copy link

OK. Here is an update:
In the GFF I linked above there are records with multiple tags. You'll find lines ending in

...; tag "basic"; tag "Ensembl_canonical"

After merging them with our annotation into GFF3, those lines are reformatted by AGAT to contain

...;tag=basic,Ensembl_canonical;...

From there we use AGAT fto convert the GFF3 to GTF where we get

...; tag "basic" "Ensembl_canonical"; ...

Which seems to be throwing off CellRanger:

[error] mkref has failed: error building reference package
Error while parsing GTF file <path-to-GTF-file>
Error parsing GTF at line 8.  Parsed attribute had a quote in the middle of a value.  Please ensure quotes are only used to encapsulate attribute values.
Bad Attribute Value = transcript_name

Note that the erro points towards the transcript_name attribute which is valid in line 8:

chrM    AGAT    gene    577     647     .       +       .       gene_id "ENSG00000210049.1"; transcript_id "ENST00000387314.1"; ID "ENSG00000210049.1"; gene_name "MT-TF"; gene_type "misc_RNA"; hgnc_id "HGNC:7481"; level "3"; tag "basic" "Ensembl_canonical"; transcript_name "MT-TF-201"; transcript_support_level "NA"; transcript_type "misc_RNA";

However, when substituting ; tag "basic" "Ensembl_canonical"; with ;tag "basic"; tag "Ensembl_canonical"; , CellRanger accepts this line and complains about another one further down the line.


Regarding the issue with several IDs values, I am still investigating and hope this turns out to be an error on my end somehow.

@mschilli87
Copy link

Update 2: I definitely still get multiple gene IDs after merging.

Input GFF:

chrM    ENSEMBL transcript      3230    3304    .       +       .       gene_id "26266"; transcript_id "26266"; gene_type "misc_RNA"; gene_name "26266"; transcript_type "misc_RNA"; transcript_name "26266"; level 3;
chrM    ENSEMBL exon    3230    3304    .       +       .       gene_id "26266"; transcript_id "26266"; gene_type "misc_RNA"; gene_name "26266"; transcript_type "misc_RNA"; transcript_name "26266"; level 3;
chrM    ENSEMBL transcript      3230    3304    .       +       .       gene_id "ENSG00000209082.1"; transcript_id "ENST00000386347.1"; gene_type "misc_RNA"; gene_name "MT-TL1"; transcript_type "misc_RNA"; transcript_name "MT-TL1-201"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:7490"; tag "basic"; tag "Ensembl_canonical";
chrM    ENSEMBL exon    3230    3304    .       +       .       gene_id "ENSG00000209082.1"; transcript_id "ENST00000386347.1"; gene_type "misc_RNA"; gene_name "MT-TL1"; transcript_type "misc_RNA"; transcript_name "MT-TL1-201"; exon_number 1; exon_id "ENSE00002006242.1"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:7490"; tag "basic"; tag "Ensembl_canonical";

Output GFF3:

chrM    AGAT    gene    3230    3304    .       +       .       ID=agat-gene-295;gene_id=26266,ENSG00000209082.1;gene_name=26266,MT-TL1;gene_type=misc_RNA;hgnc_id=HGNC:7490;level=3;tag=basic,Ensembl_canonical;transcript_id=26266,ENST00000386347.1;transcript_name=26266,MT-TL1-201;transcript_support_level=NA;transcript_type=misc_RNA
chrM    ENSEMBL transcript      3230    3304    .       +       .       ID=26266;Parent=agat-gene-295;gene_id=26266;gene_name=26266;gene_type=misc_RNA;level=3;merged_ID=ENST00000386347.1;merged_Parent=ENSG00000209082.1;merged_gene_id=ENSG00000209082.1;merged_gene_name=MT-TL1;merged_gene_type=misc_RNA;merged_hgnc_id=HGNC:7490;merged_level=3;merged_tag=basic,Ensembl_canonical;merged_transcript_id=ENST00000386347.1;merged_transcript_name=MT-TL1-201;merged_transcript_support_level=NA;merged_transcript_type=misc_RNA;transcript_id=26266;transcript_name=26266;transcript_type=misc_RNA

@Juke34
Copy link
Collaborator

Juke34 commented Oct 25, 2024

Thank you, this helps a lot.
I will see what I can don when I will be back from vacation.

@mschilli87

This comment was marked as resolved.

@Juke34
Copy link
Collaborator

Juke34 commented Nov 26, 2024

I have been swamped.
Thank you for remembering me. I have now fixed it. I need to update plenty of related tests. It should be soon ready and merged in Master branch.

@mschilli87

This comment was marked as resolved.

@Juke34
Copy link
Collaborator

Juke34 commented Nov 27, 2024

You can try branch 457. It will be available in the next release (I didn't plan any soon...)

@Juke34
Copy link
Collaborator

Juke34 commented Nov 28, 2024

Fetaures may still have several values in an attributes while merging, but now we avoid that for gene_id, ID, Parent attributes (Which are sensible because used to understand the relationship between the features).

If cellranger do need to absolute avoid several values attrributes, I may add an extra script to clean line by line this issue.

@mschilli87

This comment was marked as outdated.

@mschilli87
Copy link

Dear @Juke34,

We are still getting the same error from CellRanger.
It keeps complaining about multiple transcript_names in line 8, even though line 8 has only one transcript_name. It does, however, contain several tags. Additionally, there are still lines with multiple gene_names and (actually) transcript_names but since CellRanger exists upon the first error, I can't easily tell if those cause problems as well.

If possible I would like to test with a version that has single-value attributes throughout, to make sure CellRanger processes such a file just fine.

Then, we could maybe try to come up with a reasonable interface to specify the attributes that must not contain > 1 value so I could test which ones I could preserve (if any) for the cellRanger analysis.

Thank you again for you support.

@Juke34
Copy link
Collaborator

Juke34 commented Nov 29, 2024

Ok then I will add an extra option for the GFF2GTF to deduplicate multi values "attributes" i.e.

gene_name "NameA" "NameB" "NameC"

becomes

gene_name ''NameA"; gene_name1 "NameB"; gene_name1 "NameC"

@mschilli87
Copy link

That sounds very useful. Thank you!

@Juke34
Copy link
Collaborator

Juke34 commented Dec 3, 2024

Should be fine by now. Get the last commit from the 457 branch. Do the install :

  • make
  • make install

then activate the new param:
agat config --expose --output_format gtf --deflate_attribute

then run your test:
agat_convert_sp_gxf2gxf.pl --gff in.gff -o out.gtf

@Juke34
Copy link
Collaborator

Juke34 commented Dec 3, 2024

Actually I will make a new release. I will wait your feedback to integrate also this fix in the new release.

@mschilli87
Copy link

I'll try to get back to you by tomorrow.

@mschilli87
Copy link

Sorry for the delay. While I cannot report 100 % success, this seams to be due to a problem with the genome FASTA rather than the GTF. At least the error message above is gone now. So feel free to close this issue with the new release. Thank you again for your support in this.

@Juke34
Copy link
Collaborator

Juke34 commented Dec 5, 2024

Thx for your feedback

@Juke34 Juke34 closed this as completed in 9ecbcc6 Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants