Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transcript ID in reference and expression profile do not match #237

Open
SimonHegele opened this issue Oct 30, 2024 · 8 comments
Open

Transcript ID in reference and expression profile do not match #237

SimonHegele opened this issue Oct 30, 2024 · 8 comments
Labels

Comments

@SimonHegele
Copy link

SimonHegele commented Oct 30, 2024

After quantification I tried simulating

IDs in expression profile:
NM_001001130.3

IDs in reference:
>NM_001001130.3 Mus musculus zinc finger protein 85 (Zfp85), mRNA

@lcoombe
Copy link
Member

lcoombe commented Oct 30, 2024

Hi @SimonHegele,

I'm not sure that I understand the issue - It looks like NM_001001130.3 is found in both the expression profile and reference? In fasta format, the string after > and before the first white space is considered the ID (ex. https://www.ncbi.nlm.nih.gov/genbank/fastaformat/). Anything following the space is part of the 'comment'.
If you have the unique ID in the expression profile, you can cross-reference with the fasta headers to see the comment part of the header line if you wish.

I hope that helps - thank you for your interest in NanoSim!
Lauren

@SimonHegele
Copy link
Author

Hi,
sry, my message missed some crucial information. And I digged in further what is the problem.

The error message:
"Please make sure transcript IDs in the expression profile match with those in reference transcriptome (example: both Ensembl IDs)"

I added two lines of code:
print(list(dict_exp.keys())[0]) print(list(dict_len.keys())[0])
to the make_cdf( ) function of the simulator.py script which throws this error to print yields this output:
NM_001001130 NM-001001130
So in the files the IDs are identical, but for the dict_len generated from the reference fasta somehow the underscore is replaced with a minus. The error did not occur after removing the underscore from both the reference fasta and the quantification file

@lcoombe
Copy link
Member

lcoombe commented Nov 4, 2024

Hi @SimonHegele,

Thanks for that extra information and context!

After tracing back the error message from the make_cdf function that you reference, I can see this line when reading in sequences, which I think explains the behaviour you are seeing: https://github.com/bcgsc/NanoSim/blob/master/src/simulator.py#L344-L345
So, it looks like it will split at underscores, then re-join with hyphens.
Honestly - I'm not really sure why it was implemented like that (I wasn't actively involved in the code development). Maybe @SaberHQ can comment?

Without knowing the background of why that was done in that way from those who initially implemented NanoSim and trans-NanoSim, I am hesitant to change the code. I think that your workaround makes sense (although I know it isn't ideal) - I can more information to that error messages suggesting your fix so that it is less cryptic in the future.

@SaberHQ
Copy link
Collaborator

SaberHQ commented Nov 4, 2024

Hi @SimonHegele, Thanks for using NanoSim and bringing this issue to our attention.

To better assist you, could you please share how you quantified the expression profile for the reference transcripts?

To clarify, NanoSim provides a built-in functionality to quantify expression profiles. If you use the same reference transcriptome file for both the quantify mode and the transcriptome mode in the read_analysis.py script, everything should work smoothly.

Just as a reminder, NanoSim allows users to employ any expression profile of their own (calculated by other tools or one with manually adjusted values), provided that the file adheres to a pre-defined format: a three-column, tab-delimited file with Transcript ID, Count, and TPM value, including a header.

NanoSim doesn’t necessarily detect variations in the IDs in different files automatically, so users need to ensure that the transcript IDs in the reference transcriptome and the quantification files do match with each other if they decide to provide their own file directly and not run NanoSim's quantify mode.

For more clarity and your reference to further look into: When running NanoSim in quantify mode (see line #390 in the read_analysis.py script), the align_transcriptome function executes with the quantification flag, ultimately invoking the EM_trans function in the get_primary_sam.py script which quantifies the expression values and outputs them in a tsv file. This file could be used in the simulation stage. Kindly refer to the readme file for more information.

Feel free to reach out if you need further assistance.

Thank you @lcoombe for looking into this. Cheers.

@lcoombe
Copy link
Member

lcoombe commented Nov 4, 2024

Thanks for looking at this @SaberHQ! Please correct me if I'm wrong, @SimonHegele - I believe the issue is that the IDs did match between the quantification file and the reference transcriptome, but were not being recognized as matching by NanoSim?

For example, NM_001001130 was being stored as NM-001001130 when loading in the reference transcriptome for some reason - I think it is due to the code that I linked above?

@SaberHQ
Copy link
Collaborator

SaberHQ commented Nov 4, 2024

You’re absolutely right @lcoombe —this is the key question, and it’s why I asked @SimonHegele for clarification earlier. Understanding how the expression file was created is crucial. Specifically, whether it was generated using NanoSim’s quantify mode or another method makes a big difference.

From our extensive benchmarking, we’ve consistently seen that IDs remain identical when the quantify mode is used. This has been reliable for years, so I’m hesitant to label this as a bug without further details. However, there’s always the possibility of a specific edge case we might have overlooked, which we should definitely investigate if needed.

I also suspect the issue might be linked to the following lines of code in get_primary_sam.py:

https://github.com/bcgsc/NanoSim/blob/master/src/get_primary_sam.py#L252-L261

@SimonHegele
Copy link
Author

@SaberHQ I used NanoSim for the quantification.
read_analysis.py quantify -i ont.fastq -rt rna.fna -o nanosim_out -t 56 -e trans
and then the simulation with
simulator.py transcriptome -rt rna.fna -rg GCF_000001635.27_GRCm39_genomic.fna -c human_NA12878_cDNA_Bham1_guppy/training -e nanosim_out_transcriptome_quantification.tsv -o nanosim_100000 -n 100000 -t 56
The transcriptome is from the GRCm39 mouse assembly (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/)

@SaberHQ
Copy link
Collaborator

SaberHQ commented Nov 20, 2024

Thanks for the clarification @SimonHegele

Then, looks like there is a bug and we will look into it for sure. I suspect that is either related to the code lines in get_primary_sam.py script I mentioned above or it might be due to the one pointed out by Lauren. I am adding a bug tag to this issue and we will keep you updated if/when it is resolved.

PS: Would you be able to also share a couple lines from the alignment file? I wanna make sure it is not due to minimap2 sam format output etc...

Thanks.

@SaberHQ SaberHQ added the bug label Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants