partition in nanopolish eventalign output #176

Open
baibhav-bioinfo opened this issue Oct 20, 2024 · 24 comments

@baibhav-bioinfo

Hi,
I partitioned the FASTQ file because the run was not finishing on the whole FASTQ in time.

So now I have an eventalign.txt file for each part. I am using m6anet to predict the m6A sites.

What should I do? Should I combine the eventalign.txt files before running dataprep, or can I run dataprep on each part and then combine the JSON files before running inference?

@yuukiiwa
Collaborator

Hi @baibhav-bioinfo,

You can combine the eventalign.txt files before you run m6anet dataprep.

Thanks!

Best wishes,
Yuk Kei

@baibhav-bioinfo
Author

If I combine the eventalign files, the result is 1.7 TB for one sample,
and I have a runtime limit of 24 hours on the HPC.
Will dataprep be able to finish in 24 hours?

If not, then what is the solution?

@baibhav-bioinfo
Author

The dataprep on combined.eventalign.txt did not complete in 24 hours.

I have already run m6anet dataprep and inference on each part of the original FASTQ.
The inference output is a TSV file with 6 columns:
transcript_id, transcript_position, n_reads, probability, Kmer and mod_ratio

Can I just run all parts separately and then merge the TSV outputs?
For each m6A site common to the parts, we could sum n_reads and take the average of the probability and mod_ratio columns.
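For reference, the merge proposed above could be sketched as follows. This is only an illustration of the proposal, not an m6anet feature: the function name `merge_site_tables`, the `kmer_col` parameter and the extra `n_parts` column are hypothetical, the tab delimiter follows the description in this comment, and the averages are weighted by each part's n_reads, which is one possible reading of "take average". An averaged probability is not guaranteed to equal what inference would report on the pooled reads, and if dataprep or inference filters sites by a minimum read count, a site with enough reads overall could be missing from individual parts.

```python
import csv


def merge_site_tables(paths, delimiter="\t", kmer_col="kmer"):
    """Pool per-part site tables by (transcript_id, transcript_position).

    n_reads is summed; probability and mod_ratio are averaged with each
    part weighted by its n_reads. This is a heuristic, not m6anet output.
    """
    pooled = {}
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter=delimiter):
                key = (row["transcript_id"], int(row["transcript_position"]))
                n = int(row["n_reads"])
                site = pooled.setdefault(
                    key,
                    {"n_reads": 0, "prob": 0.0, "ratio": 0.0,
                     "kmer": row.get(kmer_col, ""), "n_parts": 0},
                )
                site["n_reads"] += n
                site["prob"] += n * float(row["probability"])
                site["ratio"] += n * float(row["mod_ratio"])
                site["n_parts"] += 1

    merged = []
    for (transcript_id, position), site in sorted(pooled.items()):
        total = site["n_reads"]
        merged.append({
            "transcript_id": transcript_id,
            "transcript_position": position,
            "n_reads": total,
            # Read-weighted means; NaN if a site somehow has zero reads.
            "probability": site["prob"] / total if total else float("nan"),
            "mod_ratio": site["ratio"] / total if total else float("nan"),
            kmer_col: site["kmer"],
            # Number of parts that reported this site (hypothetical extra column).
            "n_parts": site["n_parts"],
        })
    return merged
```

Keeping `n_parts` makes it easy to see which sites were reported by only some of the parts and may deserve extra caution.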

@yuukiiwa
Collaborator

Hi @baibhav-bioinfo,

This is not ideal. Is it possible to request a longer running time on your HPC?

Also, did you restrict the alignments to primary alignments only when running minimap2? That will help lower the number of sites, which should make the preprocessing faster.

Thanks!

Best wishes,
Yuk Kei

@baibhav-bioinfo
Author

Actually no, there is a time limit of 24 hours on jobs at TACC.

Also, I did run the minimap2 mapping with no secondary alignments.

In this case, what should the solution be?

@baibhav-bioinfo
Author

baibhav-bioinfo commented Oct 22, 2024

Okay, on some nodes I will be able to run for 48 hours,

but even then I don't think it will complete.

Some of my direct RNA-seq samples have 20 million long nanopore reads.

Have you or anyone else faced this issue before, or am I the first one to run into this data size issue?

@yuukiiwa
Collaborator

yuukiiwa commented Oct 24, 2024

Hi @baibhav-bioinfo,

We don't have any machine time limit on our side. We also have PromethION samples that finished running m6anet dataprep within 48 hours.

Thanks!

Best wishes,
Yuk Kei

@baibhav-bioinfo
Author

baibhav-bioinfo commented Oct 24, 2024

Okay.
First of all, thank you so much for your valuable responses; they are helping me a lot in understanding the pipeline.

I ran m6anet dataprep on one of the samples for 48 hours.
There are 4 outputs of dataprep: the data.log, data.json, data.info and eventalign.index files.
The eventalign.index file stopped making any progress just 8 hours into the run. More than 24 hours have passed since then, but the size of eventalign.index has not increased and no other files have been created.

I wonder whether the eventalign.index file is complete, as I cannot see any progress in any output file for the last 24 hours.

@baibhav-bioinfo
Author

baibhav-bioinfo commented Oct 24, 2024

Also, one more query about the nanopore DRS reads; an answer would be very helpful.

I wanted to ask whether the DRS reads (FASTQ) we get after basecalling still contain the poly(A) tails or not.
Someone from the nanopore community suggested that the poly(A) tails are removed during basecalling.
(This question concerns the APA analysis.)

Thank you so much for your time.

@baibhav-bioinfo
Author

The job I ran did not finish in 48 hours either.

Only the eventalign.index file has been created so far and, as I mentioned, it was created in only 8 hours; for the next 40 hours there was no progress in it or in any other file.

I am pasting my sbatch script settings. Kindly suggest any changes which might speed up my dataprep.

#!/bin/bash
#SBATCH --job-name=m6anet_dataprep_c6r1_combined
#SBATCH --output=%x_%j.out # Output filename (%x will be replaced by the job name, %j by the job ID)
#SBATCH --error=%x_%j.err # Error filename
#SBATCH --time=48:00:00 # Set the maximum runtime (e.g., 24 hours)
#SBATCH --ntasks=4 # Number of tasks (typically 1 for Canu)
#SBATCH --cpus-per-task=24 # Number of CPU cores per task
#SBATCH --partition=spr # Specify the partition
#SBATCH -N 1
#SBATCH --mail-user=baibhav.kumar@utrgv.edu # Email notifications
#SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL)

conda activate m6anet_python_3.8

m6anet dataprep --eventalign $SCRATCH/c6_r1.combined.eventalign.tsv --out_dir $SCRATCH/c6_r1.combined_48.m6anet_dataprep_out --n_processes 96

@yuukiiwa
Collaborator

> The eventalign.index file stopped making any progress just 8 hours into the run.
> I wonder whether the eventalign.index file is complete, as I cannot see any progress in any output file for the last 24 hours.

This is normal

@yuukiiwa
Collaborator

> I wanted to ask whether the DRS reads (FASTQ) we get after basecalling still contain the poly(A) tails or not.

The poly(A) tails are still present in the reads in the FASTQ file. They no longer appear from the eventalign file onwards, as the reads are mapped to a reference without poly(A) tails in the alignment step with minimap2.

@yuukiiwa
Collaborator

> The job I ran did not finish in 48 hours either.
> I am pasting my sbatch script settings. Kindly suggest any changes which might speed up my dataprep.

Can you show the ls -lh of the output directory? And head eventalign.txt? Thanks

@baibhav-bioinfo
Author

$head eventalign.index
transcript_id,read_index,pos_start,pos_end
SbiRTX430.K010900.1,30,172,23808
SbiRTX430.K010900.1,7,23808,48666
SbiRTX430.K010900.1,28,48666,71719
SbiRTX430.K010900.1,13,71719,94863
SbiRTX430.K010900.1,18,94863,117353
SbiRTX430.K010900.1,6,117353,141390
SbiRTX430.K010900.1,3,141390,163434
SbiRTX430.K010900.1,19,163434,188326
SbiRTX430.K010900.1,11,188326,209284

$ ls -lh
total 449M
-rw------- 1 baibhav G-825461 449M Oct 23 02:29 eventalign.index
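One way to judge whether the index is complete is to compare the largest pos_end in eventalign.index with the size of the eventalign file. The sketch below assumes that pos_start and pos_end are byte offsets into the eventalign file; that is inferred from the contiguous ranges in the listing above and is not confirmed in this thread. The function name `index_coverage` is hypothetical.

```python
import csv
import os


def index_coverage(index_path, eventalign_path):
    """Return (last_pos_end, file_size, fraction_covered).

    Assumes pos_end is a byte offset into the eventalign file, so a
    fraction close to 1.0 would suggest indexing reached the end of it.
    """
    last_end = 0
    with open(index_path, newline="") as handle:
        for row in csv.DictReader(handle):
            last_end = max(last_end, int(row["pos_end"]))
    size = os.path.getsize(eventalign_path)
    fraction = last_end / size if size else 0.0
    return last_end, size, fraction
```

If the fraction is well below 1.0 while the index has stopped growing, that would point at indexing itself stalling rather than at the later steps.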

@baibhav-bioinfo
Author

Hi @yuukiiwa, I have a request: if you can provide your work email, can I ask my queries there?
Putting so many project details here seems risky to me.

Let me know.
Thank you so much for the help so far.

@yuukiiwa
Collaborator

> If you can provide your work email, can I ask my queries there?

Unfortunately, I cannot reply to any email regarding software usage, and if I happen to receive one, I can only reply here.

@yuukiiwa
Collaborator

> $ ls -lh
> total 449M
> -rw------- 1 baibhav G-825461 449M Oct 23 02:29 eventalign.index

Did you delete the data.* outputs from the output directory? Do you remember whether they were empty?

Can I check whether you aligned to the genome or the transcriptome (transcript cDNA)? m6Anet only supports transcriptome alignment.

Thanks!

Best wishes,
Yuk Kei

@baibhav-bioinfo
Author

Thanks. No, I did not delete them.

> Unfortunately, I cannot reply to any email regarding software usage, and if I happen to receive one, I can only reply here.

No issues, I will ask my queries here only.

@baibhav-bioinfo
Author

> Did you delete the data.* outputs from the output directory? Do you remember whether they have nothing in them?

No. As I said, no outputs other than the eventalign.index file were created in the output folder.

@baibhav-bioinfo
Author

No, I deleted nothing from the output directory. Only one file was created.

I just realised that I concatenated the per-part TSV files as they were, so the final file contains the header many times (each part has its own header).
Can this be the reason the process does not complete?

I have removed the headers, except the top one, and will run the process again. Let's see if it runs this time.
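A minimal sketch of concatenating the parts while keeping only the first header is shown below. It assumes every part starts with the same single header line; the function name `concat_eventalign` is hypothetical. It does not touch the read_index column, and whether indices from separately processed parts can collide is not addressed in this thread and is worth checking.

```python
import os
import shutil


def concat_eventalign(part_paths, out_path):
    """Concatenate eventalign parts, keeping only the first header line."""
    header = None
    parts_written = 0
    with open(out_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as part:
                first = part.readline()
                if not first:
                    continue  # skip empty parts
                this_header = first.rstrip(b"\r\n")
                if header is None:
                    header = this_header
                    out.write(header + b"\n")
                elif this_header != header:
                    raise ValueError(f"{path}: header differs from the first part")
                # Stream the rest of the part in 1 MiB chunks.
                shutil.copyfileobj(part, out, 1024 * 1024)
                # If the part has a body without a trailing newline, add one
                # so that the next part does not continue on the same line.
                if part.tell() > len(first):
                    part.seek(-1, os.SEEK_END)
                    if part.read(1) != b"\n":
                        out.write(b"\n")
                parts_written += 1
    return parts_written
```

Working in binary mode avoids decoding terabytes of text, and the header comparison fails early if a part came from a run with different output columns.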

@baibhav-bioinfo
Author

baibhav-bioinfo commented Nov 1, 2024

Hello, as I mentioned, I removed the redundant headers from the combined eventalign.txt and ran dataprep again. Now all 4 files have been generated in the output folder, but there is nothing in them.

Just the eventalign.index file was built in 8 hours; after that, nothing was written to any of the four files for the next 40 hours.
So, again, we made some progress, but the command still did not complete in 48 hours.

Is there anything more we can change to make it finish within the 48-hour limit?

ls -lh
total 451M
-rw------- 1 baibhav G-825461 52 Oct 30 19:35 data.info
-rw------- 1 baibhav G-825461 0 Oct 30 19:35 data.json
-rw------- 1 baibhav G-825461 0 Oct 30 19:35 data.log
-rw------- 1 baibhav G-825461 451M Oct 30 19:35 eventalign.index

As you can see, the files were generated within 8 hours of running, and then nothing happened for the next 40 hours (except that the data.info file contains the header "transcript_id,transcript_position,start,end,n_reads").

The following is the command for the minimap2 mapping to the transcriptome:

minimap2 -ax map-ont -uf -t 64 --secondary=no SbicolorRTx430_552_v2.1.transcript.fa c6_r1.fastq.gz > c6_r1.sam

Kindly let us know.
Thank you so much for the help so far.

@baibhav-bioinfo
Author

I wonder whether the files are filled all at once when the whole job finishes,
or whether they are filled in real time.
Can you recall any case where this happened?
Thanks

@baibhav-bioinfo
Author

I did a test run by taking 1 million lines from each of the eventalign.txt parts I have.
I merged them into one file (1.45 GB) and ran the dataprep step, and it completed in 10 minutes,

which means it is working.

My original merged file is 1700 GB and it is not finishing in 48 hours.

So what if I run dataprep on each eventalign part and merge the outputs as I described earlier, adding up the read counts and averaging the probabilities for each unique transcript?
You mentioned earlier that this is not ideal, but I am not left with many options here.
Can you suggest what the likely issues with this approach are?
If I am very cautious while doing it, will I be able to merge the m6Anet outputs successfully?

@baibhav-bioinfo
Author

Hi,
Thank you so much for the help so far.
The dataprep was not able to complete within the 48-hour time limit, so I ran it on the parts, and now I have the dataprep results for each part.

Earlier I asked whether running "m6anet inference" on each part and then merging to get the final result for the combined sample is feasible or not.

You replied that it is not ideal. I wanted to know whether there is any conceptual flaw in it.
Actually, I am not left with any option other than that.

Any help would be much appreciated.
