
Dynamic resource estimation #168

Open
IanSudbery opened this issue Jan 25, 2016 · 2 comments

@IanSudbery
Member

Currently we have several places where the amount of memory required by a job is dependent on the size of the input files, which in turn are dependent on the project. Whenever somebody runs a pipeline where the currently assigned memory is insufficient, they increase the memory requirements for that job in the pipeline.

The result is that the resource requirements for any given task are set for the largest input files that have ever been processed, even if those are much larger than the norm. Further, this is a ratchet-type process: the resource requirements of our tasks are only ever going to go up.

I can think of two solutions to this. The first is to encode more of the resource requirements in the ini file. The second would be to implement some sort of dynamic resource requirement determination for certain tasks.

For example:

The memory requirement of fastqs2fastqs when reconciling pairs is highly dependent on the size of the input fastq files. This can be run by pipeline_readqc, but it would need to be set to a very high memory requirement so that large files could be handled (for 30M read files the requirement is well in excess of 10GB). Instead, I propose that the pipeline task measures the size of the input file, and sets the memory requirement appropriately. This would require some experiments to determine the shape of the memory vs size curve, but shouldn't be too difficult.
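Something like the following sketch is what I have in mind. It is only illustrative: the size thresholds and memory values below are placeholders, and the real numbers would come from measuring the memory vs size curve.

```python
import os

# Sketch only: choose job_memory for the reconcile step from the size of the
# input fastq. The thresholds and memory values are illustrative placeholders.
def reconcile_memory(infile):
    size_gb = os.path.getsize(infile) / 1e9
    if size_gb < 1:
        return "4G"
    elif size_gb < 5:
        return "8G"
    return "16G"

# in the pipeline task, before building the statement:
# job_memory = reconcile_memory(infile)
# ... build statement for fastqs2fastqs --method=reconcile ...
# P.run()
```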

Is this something we should do more of? Thoughts?

@TomSmithCGAT
Member

I came across exactly this issue when writing a new pipeline for alignment-free transcript quantification and differential expression analysis (pipeline_transcriptdiffexpression.py), as Sleuth's memory usage depends on the number of transcripts, samples and bootstraps. The pipeline therefore determines the Sleuth memory usage without any user input.

My preference would be to get the pipeline to determine the memory requirement in these cases. I don't think it's reasonable to expect the user to know how much memory each script may take.
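For reference, the kind of calculation I mean looks roughly like the sketch below. The constants are made up for illustration and are not the values used in pipeline_transcriptdiffexpression.py; the point is that memory is modelled as linear in the size of the bootstrap matrix.

```python
import math

# Illustrative only: the base cost and per-value cost are placeholders.
# Sleuth keeps one bootstrap value per transcript per sample, so memory is
# modelled as linear in n_transcripts * n_samples * n_bootstraps.
def estimate_sleuth_memory(n_transcripts, n_samples, n_bootstraps,
                           base_gb=2.0, gb_per_billion_values=8.0):
    values = n_transcripts * n_samples * n_bootstraps
    gb = base_gb + gb_per_billion_values * values / 1e9
    return "%iG" % int(math.ceil(gb))
```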

Tom



@AndreasHeger
Member

Yes, that would be a good thing to do. Given that most of the memory requirements grow linearly with the number of reads, read length, etc., maybe it can be supported by a simple function such as:

job_memory = estimate_memory_requirement_from_file_size(filename, min_memory, max_memory, min_size, max_size)

where min_memory is the memory to use if the file size is min_size or smaller, max_memory is the memory to use if the file is max_size or larger, and the value is interpolated linearly in between.
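A sketch of such a helper, assuming file sizes in bytes and memory given and returned in gigabytes (the units and return format are guesses, not an existing API):

```python
import os

# Sketch of the proposed helper: clamp memory at min_memory/max_memory and
# interpolate linearly between (min_size, min_memory) and (max_size, max_memory).
def estimate_memory_requirement_from_file_size(filename,
                                               min_memory, max_memory,
                                               min_size, max_size):
    size = os.path.getsize(filename)
    if size <= min_size:
        gb = min_memory
    elif size >= max_size:
        gb = max_memory
    else:
        fraction = float(size - min_size) / (max_size - min_size)
        gb = min_memory + fraction * (max_memory - min_memory)
    return "%.1fG" % gb

# usage:
# job_memory = estimate_memory_requirement_from_file_size(
#     infile, min_memory=2, max_memory=16, min_size=1e8, max_size=1e10)
```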
