Slowness with large input #991

Closed
gaow opened this issue Jul 5, 2018 · 64 comments

gaow commented Jul 5, 2018

This topic has been discussed before, but perhaps not in the same context. I've got a couple of workflow steps like this:

[step]
input: '/path/to/a/single/file.gz', for_each = 'chroms', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'))
[another_step]
input: glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'), group_by = 1, concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/SuSiE_CS_*/*.rds'))
R: expand = "${ }"

I run them as two separate, sequential SoS commands:

sos run step
sos run another_step

As you can see, the first step takes a single file, file.gz, pairs it with different chroms, and creates many small .rds files as dynamic output. The actual number of output files at the end of the pipeline is:

>>> len(glob.glob('chr*/*.rds'))
43601

Now when I run the 2nd step, it gets stuck in the single SoS process that prepares for the run. It has been 10 minutes (I started writing this post 5 minutes ago) and it is still working on it ... not yet analyzing the data.

~43K files does not sound like a big deal, right? But this is indeed the first time I have used the dynamic output of a previous step as the input of the next, in separate commands. I am wondering what is going on in this context, and whether we can do something about it.

gaow commented Jul 5, 2018

BTW I killed it after 30min. I guess the following can reproduce it?

cd test; for i in {1..40000}; do touch $i.txt; done

(takes 1 minute on my computer).

and

[1]
input: glob.glob('*.txt'), group_by=1, concurrent=True
output: "/tmp/1.rds"
run:
  touch /tmp/1.rds

I guess the reason is that none of those input files has a signature created beforehand? I'm otherwise sure we've dealt with larger numbers of files than this before! For now I'll perhaps manually generate some PBS files and send them to the cluster to get my jobs done, but I guess it would be important to be able to scale this.

If file signatures are indeed the issue, then this may be an argument for using timestamps as signatures?
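
To see that trade-off concretely, here is a minimal sketch (not SoS code) of the two signature styles: a content-based signature has to read every byte of every file, while a timestamp-based one needs only a single stat() call per file.

# Sketch of the two signature styles being contrasted (not SoS code).
import hashlib
import os

def content_signature(path, chunk=1 << 20):
    # Content hash: reads the whole file, so cost grows with total data size.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            md5.update(block)
    return md5.hexdigest()

def stat_signature(path):
    # Timestamp/size signature: one system call per file, no content read.
    st = os.stat(path)
    return (st.st_mtime, st.st_size)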

BoPeng commented Jul 5, 2018

The first thing I notice is that glob.glob('*.txt') is not needed. It actually slows SoS down quite a bit, because SoS then has to process 40000 input files instead of a single pattern (*.txt).

gaow commented Jul 5, 2018

I do not get it ... *.txt will also be expanded to 40K files that SoS needs to take care of, right? In fact I just tried *.txt. Perhaps it is too soon to say, but it has been hanging here for 3 minutes ... Using -v4 I see both syntaxes hang at the same spot.

I had to use glob because my actual pattern was chr*/*.rds. That would need glob, right?

BoPeng commented Jul 5, 2018

I just noticed that SoS reaches the processing step a lot sooner if you use *.txt directly, because SoS expands *.txt itself and knows the results are all files. SoS needs to check each of the 40000 input files if you expand them beforehand.

But the processing step is also slow, so this might not change anything significantly. Still checking.
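
For a rough sense of scale (this is only an illustration of per-file overhead, not of SoS internals; the actual bottleneck turned out to be the DAG check discussed below), one can time a single directory scan against stat-ing each of the 40000 files individually:

# Rough illustration only (not SoS internals): compare one directory scan
# against one os.stat() call per file, in a directory with ~40000 files.
import glob
import os
import time

t0 = time.time()
files = glob.glob('*.txt')        # single directory scan
t1 = time.time()
for f in files:
    os.stat(f)                    # one system call per file
t2 = time.time()

print(f'glob of {len(files)} files: {t1 - t0:.2f}s')
print(f'per-file stat calls:       {t2 - t1:.2f}s')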

BoPeng commented Jul 5, 2018

The biggest problem is the checking of the input files against the DAG. Let me know if the patch helps.

gaow commented Jul 5, 2018

It certainly has helped. The 40K processing now takes a minute!

BoPeng closed this as completed Jul 5, 2018

gaow commented Jul 5, 2018

Well, on the cluster it still takes some time to preprocess, about 5 minutes:

DEBUG: _input: /home/gaow/GIT/LargeFiles/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF/chr16/chr16_16180848_16184262.rds
TRACE: Set _index to 43599 of type <class 'int'>
TRACE: Set _output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Set step_output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Lock acquired for output files []
DEBUG: Signature mismatch: Missing signature file /scratch/midway2/gaow/GIT/mvarbvs/.sos/.runtime/bcc70b81b18575ce.exe_info
TRACE: Set _runtime to {} of type <class 'dict'>
TRACE: Lock released for output files []
...

Using -v4 I see these messages 43K times. 5 minutes is not that bad; I'm just reporting it here to see if there are other improvements that can be made.

BoPeng commented Jul 6, 2018

I know there are a lot of debug/trace messages, but at least these messages are now helpful when things go wrong (like this ticket). I would say 5 minutes is not too bad for 40K substeps, so let us leave further optimization for later.

gaow commented Jul 6, 2018

I may have spoken too soon. It took < 5 minutes to go through those lines for 43K files, which indeed is not bad. But then it got stuck on this message for > 30 minutes:

DEBUG: _input: /home/gaow/GIT/LargeFiles/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF/chr16/chr16_87742072_87742871.rds
TRACE: Set _index to 43600 of type <class 'int'>
TRACE: Set _output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Set step_output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Lock acquired for output files []
DEBUG: Signature mismatch: Missing signature file /scratch/midway2/gaow/GIT/mvarbvs/.sos/.runtime/eb866d1e1f86bab5.exe_info
TRACE: Set _runtime to {} of type <class 'dict'>
TRACE: Lock released for output files []
DEBUG: Send ['M601_725c599d26bb5ebe']

(see the last line). In the meantime, nothing was sent to the job scheduler. Is it preparing for something else? At this point no new trace or debug information is displayed.

BoPeng commented Jul 6, 2018

This message indicates a task being sent to a workflow worker... frankly, I do not know what is going on. I would suggest that you reduce the number of files (e.g. input: '1*.txt') and see if you can reproduce it reliably.

gaow commented Jul 6, 2018

I can confirm that this hang can be reproduced regardless of how many tasks there are. I left in only 3 tasks but it is still hanging. I then ran rm -rf ~/.sos, which has already taken 30 minutes and is still running. I think we've run into similar issues before, and I think this will help for now. Will report back otherwise.

BoPeng commented Jul 6, 2018

So removal of ~/.sos allows the jobs to run?

gaow commented Jul 6, 2018

I hope that is the case. The rm command is still running -- 1 hour has passed!!

To give you an idea of what's going on now:

[MW] date && ls | wc -l
Thu Jul  5 20:59:18 CDT 2018
107798
[MW] date && ls | wc -l
Thu Jul  5 20:59:24 CDT 2018
107167
[MW] date && ls | wc -l
Thu Jul  5 20:59:34 CDT 2018
105271
[MW] date && ls | wc -l
Thu Jul  5 20:59:38 CDT 2018
104789

and files removed per second:

> n = c(107798,107167,105271,104789) 
> tm = c(18,24,34,38)
> d_n = -(n[2:4] - n[1:3]) 
> d_t = tm[2:4] - tm[1:3]  
> d_n/d_t
[1] 105.1667 189.6000 120.5000

So let's say 120 files are removed per second; over 3600 seconds that is 432,000 files in my ~/.sos. I think they should all be removed in maybe another 20 minutes, and then I'll try to run it again.

BoPeng commented Jul 6, 2018

I once read that modern file systems are already "database-like", so there is no real need to manage files ourselves in databases to achieve better performance, but I guess I was wrong here. When you have some time, perhaps you can send me a list of the files under .sos with type, date, and size, and we can think of ways to improve the situation.

gaow commented Jul 6, 2018

Sure ... Now my ~/.sos folder is empty. I'm running a test of 1200 files. It indeed started right away, but then it failed:

TRACE: Set _index to 1207 of type <class 'int'>
TRACE: Set _output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Set step_output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Lock acquired for output files []
DEBUG: Signature mismatch: Missing signature file /scratch/midway2/gaow/GIT/mvarbvs/.sos/.runtime/fc933018fca75c75.exe_info
WARNING: Result for M1000_d11a627b71de72ed is not received
TRACE: Set _runtime to {} of type <class 'dict'>
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/__main__.py", line 372, in cmd_run
    executor.run(args.__targets__, mode='dryrun' if args.dryrun else 'run')
Traceback (most recent call last):
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/__main__.py", line 372, in cmd_run
    executor.run(args.__targets__, mode='dryrun' if args.dryrun else 'run')
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 1381, in run
    raise e
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 1240, in run
    ', '.join([f'{y} job{"s" if y > 1 else ""} {x}' for x, y in status if y > 0]))
RuntimeError: 1 job failed
ERROR: 1 job failed

I'll have to dig into what's going on ... but I need to get some analysis results, so I'm moving the 43K jobs to my 40-thread desktop, which should complete the analysis in 6 hours anyway. I'll test again on the cluster later and report back.

BoPeng commented Jul 6, 2018

BTW, there was a Nextflow tweet saying XXX million tasks have been completed by Nextflow. Not sure how they came up with this number, but SoS is quickly catching up thanks to your small tasks. 😄

gaow commented Jul 10, 2018

> Not sure how it came up with this number but SoS is quickly catching up with your small tasks.

Well, I've been analyzing some data since last Thursday, and my local .sos folder already had about a million files, which I just removed.

gaow commented Jul 16, 2018

I've got frustrated feedback from others using SoS to analyze many small tasks. Also, as my own analysis scales up (77K analysis units for 40 phenotypes), it seems I have to remove ~/.sos/tasks on a daily basis. The cluster system is not happy with too many files in the home directory. The problem is also related to having to restart the submission a few times, which seems to rapidly grow the number of accumulated task files.

Having a global ~/.sos/tasks is also problematic because when I remove ~/.sos/tasks I lose the task files for other projects, too.

My questions are:

  1. How many files does SoS write to ~/.sos/tasks per task, or per output?
  2. Can we make sure that re-starting the same script does not generate additional task files? That is, try to keep it strictly one file per output, so that a minor change to the script will not lead to duplicated task files.
  3. Can we have an option to use a local .sos instead of ~/.sos? I understand our design goal, but in my context it brings more trouble than convenience.

I think as long as we can make it strictly one file per output (or one file per output set), things will be more under control. I understand you might not have these large-scale test applications, but I guess we can at least start checking the code and think about where we can reduce the number of files stored.

gaow reopened this Jul 16, 2018

gaow commented Jul 16, 2018

BTW I now end up having to rely on my single desktop to analyze these data. The local .sos folder is more manageable and easier to delete on a local SSD.

BoPeng commented Jul 16, 2018

Given that we now have a cache for all task files, there is one easy thing we can do: remove all related files for tasks in "terminal" states. That is, we can remove files for "completed" tasks, since their status will no longer change. There are some subtleties here because the cache does not save all the information (e.g. the actual result) and we need to account for resubmission of the same task, but I guess the general idea is sound.

gaow commented Jul 16, 2018

That sounds like a reasonable general idea.

How about using a local .sos for tasks? At least users will feel it is cleaner that way, and things stay under control per project.

BoPeng commented Jul 17, 2018

That will not work because tasks on the server do not have the concept of a local .sos. Even if there were a project directory, it would make it impossible to do sos status -q host since the task files would be scattered around.

gaow commented Jul 17, 2018

Okay, that makes sense ... but is there a way to delete only project-specific task signatures? I am thinking about making directories under ~/.sos/tasks that can somehow be identified by project and removed altogether when necessary. Potentially that would also make it faster to look up tasks if there were a concept of project for tasks on the server?

BoPeng commented Jul 17, 2018 via email

BoPeng commented Jul 17, 2018

Just did a quick summary of my task files on the cluster (a script that reproduces this kind of tally is sketched after the list):

  • 436 tasks: 339 completed, 96 failed, 1 running
  • 2315 files in total
  • 140 .job_id files for tasks sent to PBS (the others were executed directly on the head node)
  • 436 .task files, one for each task
  • 436 .err and 288 .out files for completed tasks; 140 of the .out and 91 of the .err files are empty
  • 435 .res files for completed tasks
  • 436 .pulse files for resource monitoring, etc.
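
A short script along these lines (a sketch, not part of SoS) can reproduce this kind of tally on another machine by counting the files under ~/.sos/tasks per extension and summing their sizes:

# Sketch (not part of SoS): tally files under ~/.sos/tasks by extension.
from collections import Counter
from pathlib import Path

task_dir = Path.home() / '.sos' / 'tasks'
files = [p for p in task_dir.iterdir() if p.is_file()]

by_ext = Counter(p.suffix for p in files)
total_mb = sum(p.stat().st_size for p in files) / 1e6

print(f'{len(files)} files, {total_mb:.1f} MB total')
for ext, count in by_ext.most_common():
    print(f'{count:6d}  {ext or "(no extension)"}')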

gaow commented Jul 17, 2018

Well, this is because you've got 436 tasks, while in my case each analysis is 77K tasks, and each task takes from 5 to 40 minutes of computation depending on how well the algorithm converges ... so I did not consolidate multiple tasks into one job. And I've got ~40 such analyses to run. (My analyses involve computation for various molecular QTL analyses on different cell types.)

BoPeng commented Jul 17, 2018

I am simply trying to see what files can be removed. Note that sos status -v3 pulls results from .res, .out, .err, and .pulse files even for completed tasks, so my previous proposal will not work if most tasks are in completed status.

gaow commented Jul 20, 2018

Great! I just tested this version. As suggested, I started from scratch -- removed the ~/.sos folder -- and then reran a test job, but got the following:

INFO: f47f9ce7d12987cb restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/f47f9ce7d12987cb.task'
INFO: 3540012a53cdcbae restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/3540012a53cdcbae.task'
INFO: fdbe4f180cbfae6e restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/fdbe4f180cbfae6e.task'
INFO: f05d69f454651dbd restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/f05d69f454651dbd.task'
INFO: 754c86330b87236c restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/754c86330b87236c.task'

Do you need an MWE to reproduce this?

BoPeng commented Jul 20, 2018

I cannot identify the problem when checking the source code.

Previously, I used host/xxx.task to save host-specific task files to differentiate tasks sent to different remote hosts. Right now xxx.task is reused in the sense that task engines rewrite existing task files before they submit them to different remote hosts.

Problems will arise if you run

sos execute task_id -q hostA
sos execute task_id -q hostB

at the same time, but I suppose this is a rare enough condition that we can handle it later.

BoPeng commented Jul 20, 2018

I removed the alias handling yesterday, so I suspect that you are not running the latest trunk, which has just passed Travis and AppVeyor. Please update your local copy, remove site-packages if needed, and try again.

gaow commented Jul 20, 2018

Okay here is a simple test that reproduces the behavior:

[1]
params1 = [x for x in range(100)]
params2 = [x for x in range(10)]
output_files = [f'{x}.txt' for x in range(1000)]
input: for_each = ['params1', 'params2']
output: f'~/tmp/{output_files[_index]}'
task: trunk_workers = 1, queue = 'midway2', walltime = '5m', trunk_size = 1, mem = '2G', cores = 1, tags = f'test_{_output:bn}', workdir = './', concurrent = True
bash: expand = True
    touch {_output} 

Here is my config file:

hosts:
  localhost:
    address: localhost
    paths:
      home: /home/{user_name}
  midway2:
    address: localhost
    description: UChicago RCC cluster Midway 2
    job_template: '#!/bin/bash
      #SBATCH --time={walltime}
      #{partition}
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node=1
      #SBATCH --cpus-per-task={cores}
      #SBATCH --mem-per-cpu={mem//10**9}G
      #SBATCH --job-name={job_name}
      #SBATCH --output={cur_dir}/.sos/{job_name}.out
      #SBATCH --error={cur_dir}/.sos/{job_name}.err
      cd {cur_dir}
      sos execute {task} -v {verbosity} -s {sig_mode}'
    kill_cmd: scancel {job_id}
    max_cores: 40
    max_mem: 64G
    max_running_jobs: 40
    max_walltime: '36:00:00'
    partition: SBATCH --partition=broadwl
    paths:
      home: /home/gaow
    queue_type: pbs
    status_check_interval: 20
    status_cmd: squeue --job {job_id}
    submit_cmd: sbatch {job_file}
    submit_cmd_output: Submitted batch job {job_id}
localhost: localhost

I'm on the latest trunk. I believe at some point yesterday it worked, but not this version.

BoPeng commented Jul 20, 2018

Did you update sos on midway?

gaow commented Jul 20, 2018

I'm running directly on that machine, as indicated by address: localhost. I'm not running from my local computer.

BoPeng commented Jul 20, 2018

I see, this is a local PBS queue; that might be the reason.

BoPeng commented Jul 20, 2018

OK, I just pushed a change to sos-pbs. I am not terribly comfortable merging everything, but let us first see if everything works and whether there is a proper fix for #999.

gaow commented Jul 20, 2018

Okay it seems to be running again.

I'm wondering, for large variables such as output_files in my previous example, should we save a signature of them instead? Because currently:

[MW] du -h tasks/
48M	tasks/
[MW] ll tasks/*.task | wc -l
1500

1500 tasks leave behind 48 MB of data. Does that seem like an issue to you?
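
A minimal sketch of that idea (not SoS code; the helper name is hypothetical): store a short, stable digest of a large task variable instead of embedding its full value in every task file.

# Sketch of the idea only (not SoS code).
import hashlib
import pickle

output_files = [f'{x}.txt' for x in range(1000)]   # the large variable from the example above

def variable_signature(obj) -> str:
    """Short, stable digest of a picklable object."""
    return hashlib.md5(pickle.dumps(obj, protocol=4)).hexdigest()

print(variable_signature(output_files))            # 32 hex characters
print(len(pickle.dumps(output_files, protocol=4)), 'bytes if stored verbatim')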

gaow commented Jul 20, 2018

> I am not terribly comfortable at merging everything

But this should put us in a good position to adopt a more efficient database solution when it comes to that point?

BoPeng commented Jul 20, 2018

I don't know... Travis passed but AppVeyor actually failed, so I am not done with the current change yet.

BoPeng commented Jul 22, 2018

The new_format branch uses a completely new format for .task files so that it can better handle task status queries and the embedding of other files such as .res and .out. It also absorbs file signatures, so the execution of tasks on remote hosts will not leave any signature files there. This branch passes all tests but should still be considered experimental. In particular, this format is NOT compatible with existing task files, so you have to remove all existing task files to use it.

gaow commented Jul 23, 2018

Great! I'm curious about what happens under the hood -- what file format is it and how is it optimized to respond efficiently to sos status commands? Is it now one signature per output file?

BoPeng commented Jul 23, 2018

https://github.com/vatlab/SoS/blob/new_format/src/sos/tasks.py#L137

There is a fixed-size header for task files. SoS can read and write important information in the header without loading the entire file, e.g. get the status by reading a few bytes from the file (although I am changing the 10-byte string to an integer soon).

The header also stores the "size" of the different blocks, so task parameters, stdout, and stderr can be retrieved separately. stdout and stderr are compressed to save disk space.

With this change, task status is written directly to the task file, so there is less need to "guess" the status of a task under different situations; information such as the time of each status change is also recorded, so it is possible to learn the history of a task.
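
To make that concrete, here is a simplified sketch of the general technique (not the actual layout in tasks.py): a fixed-size header packed with struct that records a status code and the sizes of the variable-length blocks, so the status can be read back without loading the whole file.

# Simplified illustration of a fixed-size header (not the real SoS format).
import struct

HEADER_FMT = '!I Q Q Q'    # status code, params size, stdout size, stderr size
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def write_task(path, status, params, stdout, stderr):
    header = struct.pack(HEADER_FMT, status, len(params), len(stdout), len(stderr))
    with open(path, 'wb') as f:
        f.write(header + params + stdout + stderr)

def read_status(path):
    # Reads only the first HEADER_SIZE bytes -- the rest of the file is untouched.
    with open(path, 'rb') as f:
        status, _, _, _ = struct.unpack(HEADER_FMT, f.read(HEADER_SIZE))
    return status

write_task('demo.task', 2, b'{"cores": 1}', b'some stdout\n', b'')
print(read_status('demo.task'))    # -> 2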

gaow commented Jul 23, 2018

Okay!

A related note: do we want to keep all the generated script files? Those can accumulate fast over time, too. I was initially under the impression that we keep them in the system tmp folder and only copy them over to the local .sos when an exception is captured, but that does not look like the case. I think I still see all the intermediate scripts generated and not removed afterwards.

BoPeng commented Jul 23, 2018

I am generally happy with the new format and will merge it into the trunk when the relevant changes in sos-notebook are made. I should perhaps add a warning for old task files, because the new code does not read them at all and it is very difficult to make it backward compatible due to the completely new way of storing and retrieving information.

To your question: yes, the scripts are supposed to be saved only when an error happens, but I think docker at least is an exception because docker cannot read /tmp. Not sure about other cases, though.

BoPeng commented Jul 26, 2018

It was a lot of work to overhaul the task handling mechanism, but it is done now. I want to repeat that the new version is incompatible with the old versions, so you will have to upgrade everything (sos, sos-notebook, sos-pbs) and remove all existing task files in order to use it. In particular, the new version will error out on old task files, and the old version will hang forever on new task files.

BoPeng closed this as completed Jul 30, 2018