Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions Regarding RAT Tool and Contig Processing #133

Open
lucianhu opened this issue Oct 11, 2024 · 11 comments
Open

Questions Regarding RAT Tool and Contig Processing #133

lucianhu opened this issue Oct 11, 2024 · 11 comments

Comments

@lucianhu
Copy link

lucianhu commented Oct 11, 2024

Hello Team,

Thank you for the wonderful tool. However, I have two questions:

When I run RAT, it gives an error related to the reads with the following message. From what I understand, my FASTQ headers must be present in the contigs. My contigs were generated by MEGAHIT, and they only have names like >k127_10000, without any FASTQ names. What should I do in this case?

When I run RAT, it requires me to perform CAT.contig2classification.txt first, but your tutorial does not mention that.

Does this RAT method (mapping reads to contigs) require re-running the deduplication step to accurately calculate relative abundance?

Thank you and please assist,

Nhu

./CAT_pack reads --mode cr -c contigs.fa -1 R_1.fastq -2 R_2.fastq -d /path/to/20240422_CAT_nr/db/ -t /path/to/200422_CAT_nr/tax/ -o lala -p out.CAT.predicted_proteins.faa -a out.CAT.alignment.diamond --c2c out.CAT.contig2classification.txt

<frozen genericpath>:39: RuntimeWarning: bool is used as a file descriptor
# CAT_pack v6.0.

[2024-10-11 09:41:47] samtools found: samtools 1.21.
[2024-10-11 09:41:47] bwa found: Version: 0.7.18-r1243-dirty.

RAT is running. Mapping reads against assembly with bwa mem.

[2024-10-11 09:41:47] Running bwa mem for read mapping. File fastq.bwamem.sorted will be generated.Do not forget to cite bwa mem and samtools when using RAT in your publication!
[2024-10-11 09:41:47] Contigs fasta is already indexed.
[2024-10-11 09:41:47] Running bwa mem...
...
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 
[main] Real time: 118.429 sec; CPU: 2701.163 sec
[2024-10-11 09:43:52] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-10-11 09:44:00] Read mapping done!

[2024-10-11 09:44:00] contig2classification file supplied. Processing contig classifications.
[2024-10-11 09:44:00] Loading file 20240422_CAT_nr/tax/nodes.dmp.
[2024-10-11 09:44:03] Processing mapping file(s).

[2024-10-11 09:44:32] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-10-11 09:44:32] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-10-11 09:44:33] Contigs written! Appending forward reads...
Traceback (most recent call last):
  File "CAT_pack/CAT_pack/./CAT_pack", line 101, in <module>
    main()
    ~~~~^^
  File "CAT_pack/CAT_pack/./CAT_pack", line 85, in main
    reads.run()
    ~~~~~~~~~^^
  File "CAT_pack/CAT_pack/reads.py", line 435, in run
    make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                uncl_unm_fasta, 'fastq', 'a','_1')
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "CAT_pack/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta
    fasta_dict[seq]))
    ~~~~~~~~~~^^^^^
KeyError: 'FT100012261L1C013R00202177739'
@sheaster
Copy link

Hi Nhu,
i also use megahit and am getting the same error "... line 1260, in make_unclassified_seq_fasta
fasta_dict[seq])) ... KeyError: ...."
Did you find a solution?
Thanks,
Shea

@lucianhu
Copy link
Author

Hi Shea,

No, I haven't found a solution to the error yet.

Best,

Nhu

@sheaster
Copy link

I'm trying the latest version 6.0.1 and will let you know if that works for me.

@lucianhu
Copy link
Author

Thank you very much. I posted my question on Biostars, but no one replied.
Have a good day,
Nhu

@sheaster
Copy link

v6.0.1 still has this error.

@lucianhu
Copy link
Author

Thank you for your message. In my case, I used Kraken2 to classify reads, but it's actually not ideal because the classification percentage is very low.

@thauptfeld
Copy link
Collaborator

Hi there,
sorry about the wait. From what the error is, I don't think there is an issue with the contigs or the assembly. There seems to be something going wrong between the read names in the read files and the read names in the sorted bam file.
Could you post the first 10 lines of one of the read files and the first 10 lines of the corresponding sorted bam file (output of samtools view)?

@lucianhu
Copy link
Author

lucianhu commented Nov 1, 2024

Hello Thauptfeld,

Thank you for your response.

Here is my BAM file,

FT100012261L1C012R00301420211	99	k127_43926	1	60	150M	=	62	211	GCCGATCCGCTCGAGCCCGCGCCCAAAGGCGCCGCCGGGCAACTCGCCGAAGCGCACCAGCACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAA	IIIIEDIIIIDIIEIIIIIIIIIIEEEIIIIIIIIIIIIIEEIDIIIIIEEIIIIEIIEIIEIIEIIIEIIIDIIICDIIIIIIIEIIIIICIIIIIDDIDIIEEIIEIIIICDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEGIECIEE	NM:i:0	MD:Z:150	MC:Z:150M	MQ:i:60	AS:i:150	XS:i:0
FT100012261L1C012R00301420211	147	k127_43926	62	60	150M	=	1	-211	ACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGG	CIHCIIICIIICIIIDEEIIIGI?CIIIIIDIIIIIEEIDIICBIICIIIIEEGIIIIIIBCIIIICEIDCICIICIICIICIICDICCIDIIIIIDIICIIIIICIIDIICEIDIIDIEIIIDIIDIICIIEIIIIIDIIIIIEEIDII	NM:i:0	MD:Z:150	MC:Z:150M	MQ:i:60	AS:i:150	XS:i:0
FT100012261L1C006R00300498314	99	k127_43926	72	40	127S19M4S	=	72	19	GCCCCCCCTGTCTTCGGCTTCCTCGGCGCCTTCCTTCTTGATGTGACCCTCGACGTCGAGCTCCTCGTTGTTCAGCAGGTTGCTCGATAAGTTGAGCAGCTCATCCGACATGTCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCA	IIIIIIIIDICIDDIIBICD@IDIIIIDIIDDIIDDIDD@DCICIBIIICIF7IFCIIEIICIICIHCCCCCICHIBIICDI=CI=6C8DGCCCEAIEIIDIDCII6DIDCICICIICIICCII?IDHAIIICC:IIIII)?@IIGAII?	NM:i:0	MD:Z:19	MC:Z:15S19M116S	MQ:i:40	AS:i:19	XS:i:0
FT100012261L1C006R00300498314	147	k127_43926	72	40	15S19M116S	=	72	-19	TCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCAGTTACTCCTCCAGACGTGGTGGAGGAGCGAAAGCTTCCCGGCATTGTCTCGCTCTCCGAGAAGCGATTCAATCCCTCGATGAGCCGGTTCCGTGTGCTGGCGTCGTCCCTGC	DI?HIDGIBEIIIICIBIIIEEIIIIIIICIIHIEIIDIEECIDIIDIIDICIIEIIEIIDIIDIIIDDCIIDEIIIIIICEEIEIEIIIDIEIIIDIDCIIIDEEIDDEIIIEIIDDIDIIIIIEEIIIEIEIIEIIIIEIIEIIIDII	NM:i:0	MD:Z:19	MC:Z:127S19M4S	MQ:i:40	AS:i:19	XS:i:0
FT100012261L1C003R00101959697	163	k127_43926	111	60	150M	=	165	204	GGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGGAGCAGCGAAACGCGGGCACGCGGGCGCCCGAGCCGTTGCGCGGCTCGTG	IIDDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEIIECIEEIEIIIIIEIIEIIIIIEIIEIIECIEIIEICIIIEIIDIIDIICIIIIIEIIIIICCIDIIEIIEIIIEEEIIIIIIIEIIIIIIIIIIIIDIIIICCIIIIIIICIICI	NM:i:0	MD:Z:150	MC:Z:150M	MQ:i:60	AS:i:150	XS:i:0

and this is from the read files.

@FT100012261L1C001R00100000393/1
TTTGCCGGACAACCAGCCGCCGCCCAAGGGGGACCAAGGCAGCAGCCCGATCCCCGCATCGAGCGCCGCCGGGACGATCTCGAATTCGATGTCCCGGACAAGAAGGCTGTACTGCGGCTGCAGCCTCACCGGCGGGTTCCAGCCTTGGCT
+
CCCIIIF<6I@EII;;II=IIIIII>>IB?II.IIE:I:I@DI><IIIIC=IIIIII:CII5=I6IIIIIIIIDIH,BICA>E0:?II*BICIII=.@IE-GD;AI05(=5ICII0IIC/I<I)IBI-I0@FI.FIB=II,-II*C6II;
@FT100012261L1C001R00100000429/1
TCCAGCGGCACGCCTAGCTCCTGCGCGACCGTTTCCAACGACCGGGGCGTGCTCAGAATGTTTTTACGCTCGAAATTGCCCCGCTCGGTCACGTCGAAGTAGCGCAGCACCAGTGGTGCGTCGTCGGGTCCTACCACGGCAGTAATCTCG
+
CIIEDII<IDIIIICEIICIICIIIIIDIIICC<IIDEIFDIIIIIII<C;IBIEIE@CH5BCCCEIIICIHEDDCCIIIIIIHCI9IAIEIHBIHDBIC8IIHIDHIEIIEF<IACHII+IC7IA-IBIIC<IIEII/ID-CD@CICIA
@FT100012261L1C001R00100000695/1
CGAGGAAGAGGCCCGCGACCAGATCGAGACCAACCTGTTCGGCGCCCTGTGGATCACCCAGGCTGCGCTGCCCATCCTGCGTGAGCAGGGGAGCGGTCACATCATCCAGGTGTCGTCGATCGGGGGCATCTCCGCGTTCCCGCGCGTCGG

I look forward to hearing from you.

Best,

Nhu

@sheaster
Copy link

sheaster commented Nov 1, 2024

Hi All,
yesterday, I am pretty sure I was able to get RAT running with a similar bam file by changing "fasta_dict[seq]" to "fasta_dict[seq_id]" in line 1259 of reads.py: outf.write('>{0}{1}\n{2}\n'.format(seq_id, suffix, fasta_dict[seq]))

Cheers,
Shea

@lucianhu
Copy link
Author

lucianhu commented Nov 2, 2024

Hello Thauptfeld,

Thank you for considering my feedback. Yesterday, I tried your updated code, but there are still some issues.

[M::mem_process_seqs] Processed 628586 reads in 129.108 CPU sec, 5.694 real sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 /path/to/Phase_1_1.contigs.fa ./Phase_1_1_1.fastq ./Phase_1_1_2.fastq
[main] Real time: 114.846 sec; CPU: 2686.158 sec
[2024-11-02 10:05:31] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-11-02 10:05:39] Read mapping done!

[2024-11-02 10:05:39] contig2classification file supplied. Processing contig classifications.
[2024-11-02 10:05:39] Loading file /path/to/20240422_CAT_nr/tax/nodes.dmp.
[2024-11-02 10:05:41] Processing mapping file(s).

[2024-11-02 10:06:10] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-11-02 10:06:10] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-11-02 10:06:11] Contigs written! Appending forward reads...
Traceback (most recent call last):
  File "/path/to/CAT_pack", line 101, in <module>
    main()
    ~~~~^^
  File "/path/to/CAT_pack", line 85, in main
    reads.run()
    ~~~~~~~~~^^
  File "/path/to/CAT_pack/reads.py", line 435, in run
    make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                uncl_unm_fasta, 'fastq', 'a','_1')
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta
    fasta_dict[seq]))
    ~~~~~~~~~~^^^^^
KeyError: 'FT100012261L1C009R00400078634'

Sincerely,

Nhu

@thauptfeld
Copy link
Collaborator

Hi Nhu,

the issue with the run seems to be that in the read files, the read id's have a /1 at the end (or a /2 in the reverse file I assume), which is removed during the mapping. So, in the bam file they don't have the /1 at the end. Therefore, RAT is looking for a read id that doesn't exist in the read file. RAT is supposed to be able to handle this for '/1' and '_1', so I'll have to check what is going on.

Will let you know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants