Questions Regarding RAT Tool and Contig Processing #133

lucianhu · 2024-10-11T09:58:52Z

Hello Team,

Thank you for the wonderful tool. However, I have two questions:

When I run RAT, it gives an error related to the reads with the following message. From what I understand, my FASTQ headers must be present in the contigs. My contigs were generated by MEGAHIT, and they only have names like >k127_10000, without any FASTQ names. What should I do in this case?

When I run RAT, it requires me to perform CAT.contig2classification.txt first, but your tutorial does not mention that.

Does this RAT method (mapping reads to contigs) require re-running the deduplication step to accurately calculate relative abundance?

Thank you and please assist,

Nhu

./CAT_pack reads --mode cr -c contigs.fa -1 R_1.fastq -2 R_2.fastq -d /path/to/20240422_CAT_nr/db/ -t /path/to/200422_CAT_nr/tax/ -o lala -p out.CAT.predicted_proteins.faa -a out.CAT.alignment.diamond --c2c out.CAT.contig2classification.txt

<frozen genericpath>:39: RuntimeWarning: bool is used as a file descriptor
# CAT_pack v6.0.

[2024-10-11 09:41:47] samtools found: samtools 1.21.
[2024-10-11 09:41:47] bwa found: Version: 0.7.18-r1243-dirty.

RAT is running. Mapping reads against assembly with bwa mem.

[2024-10-11 09:41:47] Running bwa mem for read mapping. File fastq.bwamem.sorted will be generated.Do not forget to cite bwa mem and samtools when using RAT in your publication!
[2024-10-11 09:41:47] Contigs fasta is already indexed.
[2024-10-11 09:41:47] Running bwa mem...
...
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 
[main] Real time: 118.429 sec; CPU: 2701.163 sec
[2024-10-11 09:43:52] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-10-11 09:44:00] Read mapping done!

[2024-10-11 09:44:00] contig2classification file supplied. Processing contig classifications.
[2024-10-11 09:44:00] Loading file 20240422_CAT_nr/tax/nodes.dmp.
[2024-10-11 09:44:03] Processing mapping file(s).

[2024-10-11 09:44:32] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-10-11 09:44:32] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-10-11 09:44:33] Contigs written! Appending forward reads...
Traceback (most recent call last):
  File "CAT_pack/CAT_pack/./CAT_pack", line 101, in <module>
    main()
    ~~~~^^
  File "CAT_pack/CAT_pack/./CAT_pack", line 85, in main
    reads.run()
    ~~~~~~~~~^^
  File "CAT_pack/CAT_pack/reads.py", line 435, in run
    make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                uncl_unm_fasta, 'fastq', 'a','_1')
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "CAT_pack/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta
    fasta_dict[seq]))
    ~~~~~~~~~~^^^^^
KeyError: 'FT100012261L1C013R00202177739'

The text was updated successfully, but these errors were encountered:

sheaster · 2024-10-31T18:29:56Z

Hi Nhu,
i also use megahit and am getting the same error "... line 1260, in make_unclassified_seq_fasta
fasta_dict[seq])) ... KeyError: ...."
Did you find a solution?
Thanks,
Shea

lucianhu · 2024-10-31T18:32:56Z

Hi Shea,

No, I haven't found a solution to the error yet.

Best,

Nhu

sheaster · 2024-10-31T18:37:01Z

I'm trying the latest version 6.0.1 and will let you know if that works for me.

lucianhu · 2024-10-31T18:47:16Z

Thank you very much. I posted my question on Biostars, but no one replied.
Have a good day,
Nhu

sheaster · 2024-10-31T19:52:12Z

v6.0.1 still has this error.

lucianhu · 2024-10-31T19:59:14Z

Thank you for your message. In my case, I used Kraken2 to classify reads, but it's actually not ideal because the classification percentage is very low.

thauptfeld · 2024-11-01T07:28:49Z

Hi there,
sorry about the wait. From what the error is, I don't think there is an issue with the contigs or the assembly. There seems to be something going wrong between the read names in the read files and the read names in the sorted bam file.
Could you post the first 10 lines of one of the read files and the first 10 lines of the corresponding sorted bam file (output of samtools view)?

lucianhu · 2024-11-01T13:45:40Z

Hello Thauptfeld,

Thank you for your response.

Here is my BAM file,

FT100012261L1C012R00301420211	99	k127_43926	1	60	150M	=	62	211	GCCGATCCGCTCGAGCCCGCGCCCAAAGGCGCCGCCGGGCAACTCGCCGAAGCGCACCAGCACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAA	IIIIEDIIIIDIIEIIIIIIIIIIEEEIIIIIIIIIIIIIEEIDIIIIIEEIIIIEIIEIIEIIEIIIEIIIDIIICDIIIIIIIEIIIIICIIIIIDDIDIIEEIIEIIIICDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEGIECIEE	NM:i:0	MD:Z:150	MC:Z:150M	MQ:i:60	AS:i:150	XS:i:0
FT100012261L1C012R00301420211	147	k127_43926	62	60	150M	=	1	-211	ACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGG	CIHCIIICIIICIIIDEEIIIGI?CIIIIIDIIIIIEEIDIICBIICIIIIEEGIIIIIIBCIIIICEIDCICIICIICIICIICDICCIDIIIIIDIICIIIIICIIDIICEIDIIDIEIIIDIIDIICIIEIIIIIDIIIIIEEIDII	NM:i:0	MD:Z:150	MC:Z:150M	MQ:i:60	AS:i:150	XS:i:0
FT100012261L1C006R00300498314	99	k127_43926	72	40	127S19M4S	=	72	19	GCCCCCCCTGTCTTCGGCTTCCTCGGCGCCTTCCTTCTTGATGTGACCCTCGACGTCGAGCTCCTCGTTGTTCAGCAGGTTGCTCGATAAGTTGAGCAGCTCATCCGACATGTCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCA	IIIIIIIIDICIDDIIBICD@IDIIIIDIIDDIIDDIDD@DCICIBIIICIF7IFCIIEIICIICIHCCCCCICHIBIICDI=CI=6C8DGCCCEAIEIIDIDCII6DIDCICICIICIICCII?IDHAIIICC:IIIII)?@IIGAII?	NM:i:0	MD:Z:19	MC:Z:15S19M116S	MQ:i:40	AS:i:19	XS:i:0
FT100012261L1C006R00300498314	147	k127_43926	72	40	15S19M116S	=	72	-19	TCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCAGTTACTCCTCCAGACGTGGTGGAGGAGCGAAAGCTTCCCGGCATTGTCTCGCTCTCCGAGAAGCGATTCAATCCCTCGATGAGCCGGTTCCGTGTGCTGGCGTCGTCCCTGC	DI?HIDGIBEIIIICIBIIIEEIIIIIIICIIHIEIIDIEECIDIIDIIDICIIEIIEIIDIIDIIIDDCIIDEIIIIIICEEIEIEIIIDIEIIIDIDCIIIDEEIDDEIIIEIIDDIDIIIIIEEIIIEIEIIEIIIIEIIEIIIDII	NM:i:0	MD:Z:19	MC:Z:127S19M4S	MQ:i:40	AS:i:19	XS:i:0
FT100012261L1C003R00101959697	163	k127_43926	111	60	150M	=	165	204	GGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGGAGCAGCGAAACGCGGGCACGCGGGCGCCCGAGCCGTTGCGCGGCTCGTG	IIDDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEIIECIEEIEIIIIIEIIEIIIIIEIIEIIECIEIIEICIIIEIIDIIDIICIIIIIEIIIIICCIDIIEIIEIIIEEEIIIIIIIEIIIIIIIIIIIIDIIIICCIIIIIIICIICI	NM:i:0	MD:Z:150	MC:Z:150M	MQ:i:60	AS:i:150	XS:i:0

and this is from the read files.

@FT100012261L1C001R00100000393/1
TTTGCCGGACAACCAGCCGCCGCCCAAGGGGGACCAAGGCAGCAGCCCGATCCCCGCATCGAGCGCCGCCGGGACGATCTCGAATTCGATGTCCCGGACAAGAAGGCTGTACTGCGGCTGCAGCCTCACCGGCGGGTTCCAGCCTTGGCT
+
CCCIIIF<6I@EII;;II=IIIIII>>IB?II.IIE:I:I@DI><IIIIC=IIIIII:CII5=I6IIIIIIIIDIH,BICA>E0:?II*BICIII=.@IE-GD;AI05(=5ICII0IIC/I<I)IBI-I0@FI.FIB=II,-II*C6II;
@FT100012261L1C001R00100000429/1
TCCAGCGGCACGCCTAGCTCCTGCGCGACCGTTTCCAACGACCGGGGCGTGCTCAGAATGTTTTTACGCTCGAAATTGCCCCGCTCGGTCACGTCGAAGTAGCGCAGCACCAGTGGTGCGTCGTCGGGTCCTACCACGGCAGTAATCTCG
+
CIIEDII<IDIIIICEIICIICIIIIIDIIICC<IIDEIFDIIIIIII<C;IBIEIE@CH5BCCCEIIICIHEDDCCIIIIIIHCI9IAIEIHBIHDBIC8IIHIDHIEIIEF<IACHII+IC7IA-IBIIC<IIEII/ID-CD@CICIA
@FT100012261L1C001R00100000695/1
CGAGGAAGAGGCCCGCGACCAGATCGAGACCAACCTGTTCGGCGCCCTGTGGATCACCCAGGCTGCGCTGCCCATCCTGCGTGAGCAGGGGAGCGGTCACATCATCCAGGTGTCGTCGATCGGGGGCATCTCCGCGTTCCCGCGCGTCGG

I look forward to hearing from you.

Best,

Nhu

sheaster · 2024-11-01T15:56:33Z

Hi All,
yesterday, I am pretty sure I was able to get RAT running with a similar bam file by changing "fasta_dict[seq]" to "fasta_dict[seq_id]" in line 1259 of reads.py: outf.write('>{0}{1}\n{2}\n'.format(seq_id, suffix, fasta_dict[seq]))

Cheers,
Shea

lucianhu · 2024-11-02T09:18:08Z

Hello Thauptfeld,

Thank you for considering my feedback. Yesterday, I tried your updated code, but there are still some issues.

[M::mem_process_seqs] Processed 628586 reads in 129.108 CPU sec, 5.694 real sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 /path/to/Phase_1_1.contigs.fa ./Phase_1_1_1.fastq ./Phase_1_1_2.fastq
[main] Real time: 114.846 sec; CPU: 2686.158 sec
[2024-11-02 10:05:31] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-11-02 10:05:39] Read mapping done!

[2024-11-02 10:05:39] contig2classification file supplied. Processing contig classifications.
[2024-11-02 10:05:39] Loading file /path/to/20240422_CAT_nr/tax/nodes.dmp.
[2024-11-02 10:05:41] Processing mapping file(s).

[2024-11-02 10:06:10] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-11-02 10:06:10] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-11-02 10:06:11] Contigs written! Appending forward reads...
Traceback (most recent call last):
  File "/path/to/CAT_pack", line 101, in <module>
    main()
    ~~~~^^
  File "/path/to/CAT_pack", line 85, in main
    reads.run()
    ~~~~~~~~~^^
  File "/path/to/CAT_pack/reads.py", line 435, in run
    make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                uncl_unm_fasta, 'fastq', 'a','_1')
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta
    fasta_dict[seq]))
    ~~~~~~~~~~^^^^^
KeyError: 'FT100012261L1C009R00400078634'

Sincerely,

Nhu

thauptfeld · 2024-11-04T09:41:58Z

Hi Nhu,

the issue with the run seems to be that in the read files, the read id's have a /1 at the end (or a /2 in the reverse file I assume), which is removed during the mapping. So, in the bam file they don't have the /1 at the end. Therefore, RAT is looking for a read id that doesn't exist in the read file. RAT is supposed to be able to handle this for '/1' and '_1', so I'll have to check what is going on.

Will let you know.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Questions Regarding RAT Tool and Contig Processing #133

Questions Regarding RAT Tool and Contig Processing #133

lucianhu commented Oct 11, 2024 •

edited

Loading

sheaster commented Oct 31, 2024

lucianhu commented Oct 31, 2024

sheaster commented Oct 31, 2024

lucianhu commented Oct 31, 2024

sheaster commented Oct 31, 2024

lucianhu commented Oct 31, 2024

thauptfeld commented Nov 1, 2024

lucianhu commented Nov 1, 2024

sheaster commented Nov 1, 2024

lucianhu commented Nov 2, 2024

thauptfeld commented Nov 4, 2024

Questions Regarding RAT Tool and Contig Processing #133

Questions Regarding RAT Tool and Contig Processing #133

Comments

lucianhu commented Oct 11, 2024 • edited Loading

sheaster commented Oct 31, 2024

lucianhu commented Oct 31, 2024

sheaster commented Oct 31, 2024

lucianhu commented Oct 31, 2024

sheaster commented Oct 31, 2024

lucianhu commented Oct 31, 2024

thauptfeld commented Nov 1, 2024

lucianhu commented Nov 1, 2024

sheaster commented Nov 1, 2024

lucianhu commented Nov 2, 2024

thauptfeld commented Nov 4, 2024

lucianhu commented Oct 11, 2024 •

edited

Loading