Dynamic resource estimation #168
I came across exactly this issue when writing a new pipeline for alignment-free transcript quantification and differential expression analysis (pipeline_transcriptdiffexpression.py), as Sleuth's memory usage depends on the number of transcripts, samples and bootstraps. The pipeline therefore determines the Sleuth memory requirement without any user input. My preference would be to get the pipeline to determine the memory requirement in these cases; I don't think it's reasonable to expect the user to know how much memory each script may take.

Tom
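As a sketch of the style of estimate Tom describes (the per-value cost and fixed overhead below are made-up illustrative constants, not the figures pipeline_transcriptdiffexpression actually uses):

```python
def estimate_sleuth_memory(n_transcripts, n_samples, n_bootstraps,
                           bytes_per_value=8, overhead_gb=2.0):
    """Illustrative estimate only: assume Sleuth holds roughly one value per
    transcript x sample x bootstrap, plus a fixed overhead. The constants are
    placeholders, not the ones pipeline_transcriptdiffexpression uses."""
    n_values = n_transcripts * n_samples * n_bootstraps
    return overhead_gb + n_values * bytes_per_value / 1e9


# e.g. 200k transcripts, 12 samples, 100 bootstraps -> ~4 GB with these constants
job_memory = "%iG" % round(estimate_sleuth_memory(200000, 12, 100))
```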
Yes, that would be a good thing to do. Given that most memory requirements grow linearly with the number of reads, read length, etc., it could be supported by a simple function such as: job_memory = estimate_memory_requirement_from_file_size(filename, min_memory, max_memory, min_size, max_size), where min_memory is the memory to use if the file size is min_size or smaller, max_memory is the memory to use if the file is max_size or larger, and values in between are interpolated.
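A minimal sketch of what such a helper could look like, assuming file sizes in bytes and memory expressed in gigabytes (the function name follows the suggestion above; the unit handling, the filename and the thresholds in the usage line are assumptions for illustration):

```python
import os


def estimate_memory_requirement_from_file_size(
        filename, min_memory, max_memory, min_size, max_size):
    """Linearly interpolate a memory requirement from an input file's size.

    Files of min_size bytes or smaller get min_memory (GB), files of
    max_size bytes or larger get max_memory, and anything in between is
    interpolated linearly.
    """
    size = os.path.getsize(filename)
    if size <= min_size:
        return min_memory
    if size >= max_size:
        return max_memory
    fraction = (size - min_size) / float(max_size - min_size)
    return min_memory + fraction * (max_memory - min_memory)


# Hypothetical usage: 2G for files up to 1 GB, 16G for files of 10 GB or more.
job_memory = "%iG" % round(
    estimate_memory_requirement_from_file_size(
        "sample.fastq.gz", 2, 16, 1e9, 1e10))
```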
Currently we have several places where the amount of memory required by a job depends on the size of the input files, which in turn depends on the project. Whenever somebody runs a pipeline where the currently assigned memory is insufficient, they increase the memory requirement for that job in the pipeline.
The result is that the resource requirements for any given task end up set for the largest input files that have ever been processed, even if those are much larger than the norm. Further, this is a ratchet-type process: the resource requirements of our tasks are only ever going to go up.
I can think of two solutions to this. The first is to encode more of the resource requirements in the ini file. The second would be to implement some sort of dynamic resource requirement determination for certain tasks.
For example:
The memory requirement of fastqs2fastqs when reconciling pairs is highly dependent on the size of the input fastq files. This can be run by pipeline_readqc, but it would need to be set to a very high memory requirement so that large files could be handled (for 30M-read files the requirement is well in excess of 10GB). Instead, I propose that the pipeline task measures the size of the input files and sets the memory requirement appropriately. This would require some experiments to determine the shape of the memory-vs-size curve, but shouldn't be too difficult.
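A sketch of what that could look like inside a task, assuming the older CGATPipelines convention where P.run() picks up statement and job_memory from the caller's locals. The 2 GB floor and 0.4 GB-per-GB slope are placeholder numbers standing in for whatever the proposed profiling of fastqs2fastqs shows, and the file pattern and command line are illustrative rather than copied from pipeline_readqc:

```python
import os

from ruffus import transform, suffix

import CGATPipelines.Pipeline as P


@transform("*.fastq.1.gz", suffix(".fastq.1.gz"), ".reconciled.fastq.1.gz")
def reconcilePairs(infile, outfile):
    '''Reconcile read pairs, scaling the memory request to the input size.'''

    # Placeholder model: 2 GB floor plus 0.4 GB per GB of compressed input.
    # The real slope and intercept would come from profiling fastqs2fastqs
    # across a range of file sizes, as proposed above.
    input_gb = os.path.getsize(infile) / 1e9
    job_memory = "%iG" % max(2, int(round(2 + 0.4 * input_gb)))

    # Illustrative command only - the exact fastqs2fastqs invocation is
    # whatever pipeline_readqc already uses.
    statement = "python fastqs2fastqs.py --method=reconcile ... > %(outfile)s"

    P.run()
```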
Is this something we should do more of? Thoughts?