Dynamic resource estimation #168
I came across exactly this issue when writing a new pipeline for alignment-free transcript quantification and differential expression analysis (pipeline_transcriptdiffexpression.py), as Sleuth's memory usage depends on the number of transcripts, samples and bootstraps. The pipeline therefore determines the Sleuth memory requirement without any user input. My preference would be to get the pipeline to determine the memory requirement in these cases; I don't think it's reasonable to expect the user to know how much memory each script may take.

Tom
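As a sketch of the style of estimate Tom describes (the per-value cost and fixed overhead below are made-up illustrative constants, not the figures pipeline_transcriptdiffexpression actually uses):

```python
def estimate_sleuth_memory(n_transcripts, n_samples, n_bootstraps,
                           bytes_per_value=8, overhead_gb=2.0):
    """Illustrative estimate only: assume Sleuth holds roughly one value per
    transcript x sample x bootstrap, plus a fixed overhead. The constants are
    placeholders, not the ones pipeline_transcriptdiffexpression uses."""
    n_values = n_transcripts * n_samples * n_bootstraps
    return overhead_gb + n_values * bytes_per_value / 1e9


# e.g. 200k transcripts, 12 samples, 100 bootstraps -> ~4 GB with these constants
job_memory = "%iG" % round(estimate_sleuth_memory(200000, 12, 100))
```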
Yes, that would be a good thing to do. Given that most memory requirements grow linearly with the number of reads, read length, etc., it could be supported by a simple function such as: job_memory = estimate_memory_requirement_from_file_size(filename, min_memory, max_memory, min_size, max_size), where min_memory is the memory to use if the file size is min_size or smaller, max_memory is the memory to use if the file is max_size or larger, and values in between are interpolated.
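A minimal sketch of what such a helper could look like, assuming file sizes in bytes and memory expressed in gigabytes (the function name follows the suggestion above; the unit handling, the filename and the thresholds in the usage line are assumptions for illustration):

```python
import os


def estimate_memory_requirement_from_file_size(
        filename, min_memory, max_memory, min_size, max_size):
    """Linearly interpolate a memory requirement from an input file's size.

    Files of min_size bytes or smaller get min_memory (GB), files of
    max_size bytes or larger get max_memory, and anything in between is
    interpolated linearly.
    """
    size = os.path.getsize(filename)
    if size <= min_size:
        return min_memory
    if size >= max_size:
        return max_memory
    fraction = (size - min_size) / float(max_size - min_size)
    return min_memory + fraction * (max_memory - min_memory)


# Hypothetical usage: 2G for files up to 1 GB, 16G for files of 10 GB or more.
job_memory = "%iG" % round(
    estimate_memory_requirement_from_file_size(
        "sample.fastq.gz", 2, 16, 1e9, 1e10))
```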
Currently we have several places where the amount of memory required by a job depends on the size of the input files, which in turn depends on the project. Whenever somebody runs a pipeline where the currently assigned memory is insufficient, they increase the memory requirement for that job in the pipeline.
The result is that the resource requirements for any given task end up set for the largest input files that have ever been processed, even if those are much larger than the norm. Further, this is a ratchet-type process: the resource requirements of our tasks are only ever going to go up.
I can think of two solutions to this. The first is to encode more of the resource requirements in the ini file. The second would be to implement some sort of dynamic resource requirement determination for certain tasks.
For example:
The memory requirement of fastqs2fastqs when reconciling pairs is highly dependent on the size of the input fastq files. This can be run by pipeline_readqc, but it would need to be set to a very high memory requirement so that large files could be handled (for 30M-read files the requirement is well in excess of 10GB). Instead, I propose that the pipeline task measures the size of the input files and sets the memory requirement appropriately. This would require some experiments to determine the shape of the memory-vs-size curve, but shouldn't be too difficult.
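A sketch of what that could look like inside a task, assuming the older CGATPipelines convention where P.run() picks up statement and job_memory from the caller's locals. The 2 GB floor and 0.4 GB-per-GB slope are placeholder numbers standing in for whatever the proposed profiling of fastqs2fastqs shows, and the file pattern and command line are illustrative rather than copied from pipeline_readqc:

```python
import os

from ruffus import transform, suffix

import CGATPipelines.Pipeline as P


@transform("*.fastq.1.gz", suffix(".fastq.1.gz"), ".reconciled.fastq.1.gz")
def reconcilePairs(infile, outfile):
    '''Reconcile read pairs, scaling the memory request to the input size.'''

    # Placeholder model: 2 GB floor plus 0.4 GB per GB of compressed input.
    # The real slope and intercept would come from profiling fastqs2fastqs
    # across a range of file sizes, as proposed above.
    input_gb = os.path.getsize(infile) / 1e9
    job_memory = "%iG" % max(2, int(round(2 + 0.4 * input_gb)))

    # Illustrative command only - the exact fastqs2fastqs invocation is
    # whatever pipeline_readqc already uses.
    statement = "python fastqs2fastqs.py --method=reconcile ... > %(outfile)s"

    P.run()
```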
Is this something we should do more of? Thoughts?