Put FastQ File Splitting in a Dedicated Process #164

ieres-amgen · 2023-05-17T21:35:12Z

Description of feature

Hi Nicolas, thanks once again for all your hard work on this pipeline.

Just putting in a formal request, as I described on Slack, for the FASTQ file splitting to be moved into a dedicated task process (rather than handled exclusively by the native Nextflow engine). There are a number of HPCC setups where the Nextflow engine runs on a device with limited bandwidth compared to the machines that are spun up per task; this can lead to the native splitFastq running very slowly (e.g. splitting a pair of 250 GB FASTQs into 20M-read chunks taking >48 hours). I believe nf-core already has a seqkit module for read splitting that could make it relatively straightforward to put the read splitting in a dedicated task process, so it can be done on a pre-specified device with better I/O throughput.
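As a rough sketch of what such a dedicated task might run, the snippet below builds a `seqkit split2` command line for a read pair. The file names, output directory, and thread count are hypothetical placeholders, and how the pipeline would actually wire this into a process is not shown here.

```python
import shlex


def seqkit_split2_command(read1, read2, reads_per_chunk, out_dir, threads=4):
    """Build the argument list for splitting a FASTQ pair with seqkit split2."""
    return [
        "seqkit", "split2",
        "--read1", str(read1),               # first mate of the pair
        "--read2", str(read2),               # second mate of the pair
        "--by-size", str(reads_per_chunk),   # number of reads per output chunk
        "--out-dir", str(out_dir),           # where the chunks are written
        "--threads", str(threads),           # threads available to the task
    ]


# Hypothetical inputs; 20M reads per chunk as in the example above.
cmd = seqkit_split2_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", 20_000_000, "chunks"
)
print(shlex.join(cmd))
# To run it inside a task: subprocess.run(cmd, check=True)
```

Because the command runs inside its own task, it would execute on whichever node the scheduler assigns rather than on the machine hosting the workflow engine.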

Thanks for considering!!
-Ittai

askol-lurie · 2023-05-18T13:45:05Z

I'd like to ditto that request. 99% of the time taken up by the pipeline when I run it is taken up in this step.

ieres-amgen · 2023-06-02T18:54:09Z

To give some further context, when given very large FASTQ files, the .splitFastq implementation currently used in INPUT_CHECK will:

1. Read and decompress every imported FASTQ.gz
2. Parse the FASTQ reads and split them into chunks of reads
3. Recompress the chunks
4. Write the chunks to scratch storage

Given that these steps currently happen on the device where the engine itself is running, and given the very large size of FASTQs for this type of data, this ends up taking a long time. Increasing the workflow engine instance's resources doesn't help much, since the steps are not parallelized, so additional CPUs make little difference.
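To make the cost concrete, the steps listed above can be sketched as a single sequential loop. This is only an illustration in Python, not the Nextflow engine's actual implementation; the function name and chunk file naming are made up, and it assumes plain 4-line FASTQ records.

```python
import gzip
from pathlib import Path


def split_fastq_gz(src, out_dir, reads_per_chunk):
    """Split a gzipped FASTQ into gzipped chunks of at most reads_per_chunk reads."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * reads_per_chunk  # assumes 4 lines per read
    chunk_paths = []
    chunk = None
    try:
        # Read and decompress the input as one stream.
        with gzip.open(src, "rt") as reads:
            for line_number, line in enumerate(reads):
                # Start a new chunk every reads_per_chunk reads.
                if line_number % lines_per_chunk == 0:
                    if chunk is not None:
                        chunk.close()
                    path = out_dir / f"chunk_{len(chunk_paths) + 1:04d}.fastq.gz"
                    chunk = gzip.open(path, "wt")
                    chunk_paths.append(path)
                # Recompress and write the line into the current chunk.
                chunk.write(line)
    finally:
        if chunk is not None:
            chunk.close()
    return chunk_paths
```

Every byte passes through one decompress and one recompress stream in a single thread, which is consistent with extra CPUs on the engine host not helping.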

Again, I very much appreciate the time and energy you've put into this pipeline, and I'm hopeful it can be applied to larger-scale data more easily with this change. Thanks!

nservant · 2024-01-26T15:12:06Z

I agree that this could be a very nice feature indeed. Actually, it might be useful for several nf-core pipelines. Will try to discuss this point with the nf-core core members ;)

Krithika-Bhuvan · 2024-04-30T14:53:49Z

Hi @ieres-amgen - can you explain a little more about how you currently implement this .splitFastq?
I'm new to this workflow and tried setting "--split-fastq TRUE" in my YAML file, but the workflow didn't move forward for me.

ieres-amgen · 2024-05-01T16:17:00Z

@Krithika-Bhuvan, this thread is about changing the general way split_fastq is implemented, not troubleshooting its current functionality.

Based on what you wrote, my guess is that you're running into issues because you're passing the wrong name for the parameter (it's "split_fastq", not "split-fastq"), but I can't say for certain without more information about the errors you see. If you encounter further issues, I highly recommend heading over to the nf-core Slack and seeking guidance there; that is the best place for troubleshooting.

Krithika-Bhuvan · 2024-05-02T14:34:57Z

Thank you for the explanation, @ieres-amgen. Is there any way to check whether the splitting process is working? I could not locate any log files related to it during my test, so I can't tell if it is working or not (I used the right tags in the YAML file). Any suggestions on where to look would be helpful. Thank you!

ieres-amgen · 2024-05-03T15:53:07Z

No. To my knowledge, part of the disadvantage of not having this in a dedicated process is that there is no way to check on progress.

Krithika-Bhuvan · 2024-05-03T16:14:32Z

I'm new to the pipeline, so I've been wondering if I was doing something wrong. Thank you for confirming this! It's just a waiting game now.

ieres-amgen added the "enhancement" label (New feature or request) · May 17, 2023
