Slowness with large input #991

Closed
gaow opened this issue Jul 5, 2018 · 64 comments

gaow commented Jul 5, 2018

This topic has been discussed before, but perhaps not in the same context. I've got a couple of workflow steps like this:

[step]
input: '/path/to/a/single/file.gz', for_each = 'chroms', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'))
[another_step]
input: glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'), group_by = 1, concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/SuSiE_CS_*/*.rds'))
R: expand = "${ }"

I run them as two separate, sequential SoS commands:

sos run step
sos run another_step

As you can see, the first step takes a single file, file.gz, pairs it with different chroms, and creates many small .rds files as dynamic output. The actual number of output files at the end of the pipeline is:

>>> len(glob.glob('chr*/*.rds'))
43601

Now when I run the 2nd step, it gets stuck in the single SoS process that prepares for the run. It has been 10 minutes (I started writing this post 5 minutes ago) and it is still working on it ... not yet analyzing the data.

~43K files does not sound like a big deal, right? But this is indeed the first time I have used the dynamic output of a previous step as the input of the next, in separate commands. I am wondering what is going on in this context, and whether we can do something about it.

gaow commented Jul 5, 2018

BTW I killed it after 30min. I guess the following can reproduce it?

cd test; for i in {1..40000}; do touch $i.txt; done

(takes 1 minute on my computer).

and

[1]
input: glob.glob('*.txt'), group_by=1, concurrent=True
output: "/tmp/1.rds"
run:
  touch /tmp/1.rds

I guess the reason is that none of those input files has a signature created beforehand? I'm otherwise sure we've dealt with larger numbers of files than this before! For now I'll perhaps manually generate some PBS files and send them to the cluster to get my jobs done, but I guess it would be important to be able to scale this.

If file signatures are indeed the issue, then this may be an argument for using timestamps as signatures?
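
To see that trade-off concretely, here is a minimal sketch (not SoS code) of the two signature styles: a content-based signature has to read every byte of every file, while a timestamp-based one needs only a single stat() call per file.

# Sketch of the two signature styles being contrasted (not SoS code).
import hashlib
import os

def content_signature(path, chunk=1 << 20):
    # Content hash: reads the whole file, so cost grows with total data size.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            md5.update(block)
    return md5.hexdigest()

def stat_signature(path):
    # Timestamp/size signature: one system call per file, no content read.
    st = os.stat(path)
    return (st.st_mtime, st.st_size)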

BoPeng commented Jul 5, 2018

The first thing I notice is that glob.glob('*.txt') is not needed. It actually slows SoS down quite a bit, because SoS then has to process 40000 input files instead of a single pattern (*.txt).

gaow commented Jul 5, 2018

I do not get it ... *.txt will also be expanded to 40K files that SoS needs to take care of, right? In fact I just tried *.txt. Perhaps it is too soon to say, but it has been hanging here for 3 minutes ... Using -v4 I see both syntaxes hang at the same spot.

I had to use glob because my actual pattern was chr*/*.rds. That would need glob, right?

BoPeng commented Jul 5, 2018

I just noticed that SoS reaches the processing step a lot sooner if you use *.txt directly, because SoS expands *.txt itself and knows the results are all files. SoS needs to check each of the 40000 input files if you expand them beforehand.

But the processing step is also slow, so this might not change anything significantly. Still checking.
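
For a rough sense of scale (this is only an illustration of per-file overhead, not of SoS internals; the actual bottleneck turned out to be the DAG check discussed below), one can time a single directory scan against stat-ing each of the 40000 files individually:

# Rough illustration only (not SoS internals): compare one directory scan
# against one os.stat() call per file, in a directory with ~40000 files.
import glob
import os
import time

t0 = time.time()
files = glob.glob('*.txt')        # single directory scan
t1 = time.time()
for f in files:
    os.stat(f)                    # one system call per file
t2 = time.time()

print(f'glob of {len(files)} files: {t1 - t0:.2f}s')
print(f'per-file stat calls:       {t2 - t1:.2f}s')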

BoPeng commented Jul 5, 2018

The biggest problem is the checking of the input files against the DAG. Let me know if the patch helps.

gaow commented Jul 5, 2018

It certainly has helped. The 40K processing now takes a minute!

BoPeng closed this as completed Jul 5, 2018

gaow commented Jul 5, 2018

Well, on the cluster it still takes some time to preprocess, about 5 minutes:

DEBUG: _input: /home/gaow/GIT/LargeFiles/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF/chr16/chr16_16180848_16184262.rds
TRACE: Set _index to 43599 of type <class 'int'>
TRACE: Set _output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Set step_output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Lock acquired for output files []
DEBUG: Signature mismatch: Missing signature file /scratch/midway2/gaow/GIT/mvarbvs/.sos/.runtime/bcc70b81b18575ce.exe_info
TRACE: Set _runtime to {} of type <class 'dict'>
TRACE: Lock released for output files []
...

Using -v4 I see these messages 43K times. 5 minutes is not that bad; I'm just reporting it here to see if there are other improvements that can be made.

BoPeng commented Jul 6, 2018

I know there are a lot of debug/trace messages, but at least these messages are now helpful when things go wrong (like this ticket). I would say 5 minutes is not too bad for 40K substeps, so let us leave further optimization for later.

gaow commented Jul 6, 2018

I may have spoken too soon. It took < 5 minutes to go through those lines for 43K files, which indeed is not bad. But then it got stuck on this message for > 30 minutes:

DEBUG: _input: /home/gaow/GIT/LargeFiles/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF/chr16/chr16_87742072_87742871.rds
TRACE: Set _index to 43600 of type <class 'int'>
TRACE: Set _output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Set step_output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Lock acquired for output files []
DEBUG: Signature mismatch: Missing signature file /scratch/midway2/gaow/GIT/mvarbvs/.sos/.runtime/eb866d1e1f86bab5.exe_info
TRACE: Set _runtime to {} of type <class 'dict'>
TRACE: Lock released for output files []
DEBUG: Send ['M601_725c599d26bb5ebe']

(see the last line). In the meantime, nothing was sent to the job scheduler. Is it preparing for something else? At this point no new trace or debug information is displayed.

BoPeng commented Jul 6, 2018

This message indicates a task being sent to a workflow worker... frankly, I do not know what is going on. I would suggest that you reduce the number of files (e.g. input: '1*.txt') and see if you can reproduce it reliably.

gaow commented Jul 6, 2018

I can confirm that this hang can be reproduced regardless of how many tasks there are. I left in only 3 tasks but it is still hanging. I then ran rm -rf ~/.sos, which has already taken 30 minutes and is still running. I think we've run into similar issues before, and I think this will help for now. Will report back otherwise.

BoPeng commented Jul 6, 2018

So removal of ~/.sos allows the jobs to run?

gaow commented Jul 6, 2018

I hope that is the case. The rm command is still running -- 1 hour has passed!!

To give you an idea of what's going on now:

[MW] date && ls | wc -l
Thu Jul  5 20:59:18 CDT 2018
107798
[MW] date && ls | wc -l
Thu Jul  5 20:59:24 CDT 2018
107167
[MW] date && ls | wc -l
Thu Jul  5 20:59:34 CDT 2018
105271
[MW] date && ls | wc -l
Thu Jul  5 20:59:38 CDT 2018
104789

and files removed per second:

> n = c(107798,107167,105271,104789) 
> tm = c(18,24,34,38)
> d_n = -(n[2:4] - n[1:3]) 
> d_t = tm[2:4] - tm[1:3]  
> d_n/d_t
[1] 105.1667 189.6000 120.5000

So let's say 120 files are removed per second; over 3600 seconds that is 432,000 files in my ~/.sos. I think they should all be removed in maybe another 20 minutes, and then I'll try to run it again.

BoPeng commented Jul 6, 2018

I once read that modern file systems are already "database-like", so there is no real need to manage files ourselves in databases to achieve better performance, but I guess I was wrong here. When you have some time, perhaps you can send me a list of the files under .sos with type, date, and size, and we can think of ways to improve the situation.

gaow commented Jul 6, 2018

Sure ... Now my ~/.sos folder is empty. I'm running a test of 1200 files. It indeed started right away, but then it failed:

TRACE: Set _index to 1207 of type <class 'int'>
TRACE: Set _output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Set step_output to TBD of type <class 'sos.targets.sos_targets'>
TRACE: Lock acquired for output files []
DEBUG: Signature mismatch: Missing signature file /scratch/midway2/gaow/GIT/mvarbvs/.sos/.runtime/fc933018fca75c75.exe_info
WARNING: Result for M1000_d11a627b71de72ed is not received
TRACE: Set _runtime to {} of type <class 'dict'>
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/__main__.py", line 372, in cmd_run
    executor.run(args.__targets__, mode='dryrun' if args.dryrun else 'run')
Traceback (most recent call last):
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/__main__.py", line 372, in cmd_run
    executor.run(args.__targets__, mode='dryrun' if args.dryrun else 'run')
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 1381, in run
    raise e
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 1240, in run
    ', '.join([f'{y} job{"s" if y > 1 else ""} {x}' for x, y in status if y > 0]))
RuntimeError: 1 job failed
ERROR: 1 job failed

I'll have to dig into what's going on ... but I need to get some analysis results, so I'm moving the 43K jobs to my 40-thread desktop, which should complete the analysis in 6 hours anyway. I'll test again on the cluster later and report back.

BoPeng commented Jul 6, 2018

BTW, there was a Nextflow tweet saying XXX million tasks have been completed by Nextflow. Not sure how they came up with this number, but SoS is quickly catching up thanks to your small tasks. 😄

gaow commented Jul 10, 2018

> Not sure how it came up with this number but SoS is quickly catching up with your small tasks.

Well, I've been analyzing some data since last Thursday, and my local .sos folder already had about a million files, which I just removed.

gaow commented Jul 16, 2018

I've got frustrated feedback from others using SoS to analyze many small tasks. Also, as my own analysis scales up (77K analysis units for 40 phenotypes), it seems I have to remove ~/.sos/tasks on a daily basis. The cluster system is not happy with too many files in the home directory. The problem is also related to having to restart the submission a few times, which seems to rapidly grow the number of accumulated task files.

Having a global ~/.sos/tasks is also problematic because when I remove ~/.sos/tasks I lose the task files for other projects, too.

My questions are:

  1. How many files does SoS write to ~/.sos/tasks per task, or per output?
  2. Can we make sure that re-starting the same script does not generate additional task files? That is, try to keep it strictly one file per output, so that a minor change to the script will not lead to duplicated task files.
  3. Can we have an option to use a local .sos instead of ~/.sos? I understand our design goal, but in my context it brings more trouble than convenience.

I think as long as we can make it strictly one file per output (or one file per output set), things will be more under control. I understand you might not have these large-scale test applications, but I guess we can at least start checking the code and think about where we can reduce the number of files stored.

gaow reopened this Jul 16, 2018

gaow commented Jul 16, 2018

BTW I now end up having to rely on my single desktop to analyze these data. The local .sos folder is more manageable and easier to delete on a local SSD.

BoPeng commented Jul 16, 2018

Given that we now have a cache for all task files, there is one easy thing we can do: remove all related files for tasks in "terminal" states. That is, we can remove files for "completed" tasks, since their status will no longer change. There are some subtleties here because the cache does not save all the information (e.g. the actual result) and we need to account for resubmission of the same task, but I guess the general idea is sound.

gaow commented Jul 16, 2018

That sounds like a reasonable general idea.

How about using a local .sos for tasks? At least users will feel it is cleaner that way, and things stay under control per project.

BoPeng commented Jul 17, 2018

That will not work because tasks on the server do not have the concept of a local .sos. Even if there were a project directory, it would make it impossible to do sos status -q host since the task files would be scattered around.

gaow commented Jul 17, 2018

Okay, that makes sense ... but is there a way to delete only project-specific task signatures? I am thinking about making directories under ~/.sos/tasks that can somehow be identified by project and removed altogether when necessary. Potentially that would also make it faster to look up tasks if there were a concept of project for tasks on the server?

BoPeng commented Jul 17, 2018 via email

BoPeng commented Jul 17, 2018

Just did a quick summary of my task files on the cluster (a script that reproduces this kind of tally is sketched after the list):

  • 436 tasks: 339 completed, 96 failed, 1 running
  • 2315 files in total
  • 140 .job_id files for tasks sent to PBS (the others were executed directly on the head node)
  • 436 .task files, one for each task
  • 436 .err and 288 .out files for completed tasks; 140 of the .out and 91 of the .err files are empty
  • 435 .res files for completed tasks
  • 436 .pulse files for resource monitoring, etc.
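
A short script along these lines (a sketch, not part of SoS) can reproduce this kind of tally on another machine by counting the files under ~/.sos/tasks per extension and summing their sizes:

# Sketch (not part of SoS): tally files under ~/.sos/tasks by extension.
from collections import Counter
from pathlib import Path

task_dir = Path.home() / '.sos' / 'tasks'
files = [p for p in task_dir.iterdir() if p.is_file()]

by_ext = Counter(p.suffix for p in files)
total_mb = sum(p.stat().st_size for p in files) / 1e6

print(f'{len(files)} files, {total_mb:.1f} MB total')
for ext, count in by_ext.most_common():
    print(f'{count:6d}  {ext or "(no extension)"}')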

gaow commented Jul 17, 2018

Well, this is because you've got 436 tasks, while in my case each analysis is 77K tasks, and each task takes from 5 to 40 minutes of computation depending on how well the algorithm converges ... so I did not consolidate multiple tasks into one job. And I've got ~40 such analyses to run. (My analyses involve computation for various molecular QTL analyses on different cell types.)

BoPeng commented Jul 17, 2018

I am simply trying to see what files can be removed. Note that sos status -v3 pulls results from .res, .out, .err, and .pulse files even for completed tasks, so my previous proposal will not work if most tasks are in completed status.

gaow commented Jul 20, 2018

Great! I just tested this version. As suggested, I started from scratch -- removed the ~/.sos folder -- and then reran a test job, but got the following:

INFO: f47f9ce7d12987cb restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/f47f9ce7d12987cb.task'
INFO: 3540012a53cdcbae restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/3540012a53cdcbae.task'
INFO: fdbe4f180cbfae6e restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/fdbe4f180cbfae6e.task'
INFO: f05d69f454651dbd restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/f05d69f454651dbd.task'
INFO: 754c86330b87236c restart from status pending
ERROR: [Errno 2] No such file or directory: '/home/gaow/.sos/tasks/midway2/754c86330b87236c.task'

Do you need an MWE to reproduce this?

BoPeng commented Jul 20, 2018

I cannot identify the problem when checking the source code.

Previously, I used host/xxx.task to save host-specific task files to differentiate tasks sent to different remote hosts. Right now xxx.task is reused in the sense that task engines rewrite existing task files before they submit them to different remote hosts.

Problems will arise if you run

sos execute task_id -q hostA
sos execute task_id -q hostB

at the same time, but I suppose this is a rare enough condition that we can handle it later.

BoPeng commented Jul 20, 2018

I removed the alias handling yesterday, so I suspect that you are not running the latest trunk, which has just passed Travis and AppVeyor. Please update your local copy, remove site-packages if needed, and try again.

gaow commented Jul 20, 2018

Okay here is a simple test that reproduces the behavior:

[1]
params1 = [x for x in range(100)]
params2 = [x for x in range(10)]
output_files = [f'{x}.txt' for x in range(1000)]
input: for_each = ['params1', 'params2']
output: f'~/tmp/{output_files[_index]}'
task: trunk_workers = 1, queue = 'midway2', walltime = '5m', trunk_size = 1, mem = '2G', cores = 1, tags = f'test_{_output:bn}', workdir = './', concurrent = True
bash: expand = True
    touch {_output} 

Here is my config file:

hosts:
  localhost:
    address: localhost
    paths:
      home: /home/{user_name}
  midway2:
    address: localhost
    description: UChicago RCC cluster Midway 2
    job_template: '#!/bin/bash
      #SBATCH --time={walltime}
      #{partition}
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node=1
      #SBATCH --cpus-per-task={cores}
      #SBATCH --mem-per-cpu={mem//10**9}G
      #SBATCH --job-name={job_name}
      #SBATCH --output={cur_dir}/.sos/{job_name}.out
      #SBATCH --error={cur_dir}/.sos/{job_name}.err
      cd {cur_dir}
      sos execute {task} -v {verbosity} -s {sig_mode}'
    kill_cmd: scancel {job_id}
    max_cores: 40
    max_mem: 64G
    max_running_jobs: 40
    max_walltime: '36:00:00'
    partition: SBATCH --partition=broadwl
    paths:
      home: /home/gaow
    queue_type: pbs
    status_check_interval: 20
    status_cmd: squeue --job {job_id}
    submit_cmd: sbatch {job_file}
    submit_cmd_output: Submitted batch job {job_id}
localhost: localhost

I'm on the latest trunk. I believe at some point yesterday it worked, but not this version.

BoPeng commented Jul 20, 2018

Did you update sos on midway?

gaow commented Jul 20, 2018

I'm running directly on that machine, as indicated by address: localhost. I'm not running from my local computer.

BoPeng commented Jul 20, 2018

I see, this is a local PBS queue; that might be the reason.

BoPeng commented Jul 20, 2018

OK, I just pushed a change to sos-pbs. I am not terribly comfortable merging everything, but let us first see if everything works and whether there is a proper fix for #999.

gaow commented Jul 20, 2018

Okay it seems to be running again.

I'm wondering, for large variables such as output_files in my previous example, should we save a signature of them instead? Because currently:

[MW] du -h tasks/
48M	tasks/
[MW] ll tasks/*.task | wc -l
1500

1500 tasks leave behind 48 MB of data. Does that seem like an issue to you?
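
A minimal sketch of that idea (not SoS code; the helper name is hypothetical): store a short, stable digest of a large task variable instead of embedding its full value in every task file.

# Sketch of the idea only (not SoS code).
import hashlib
import pickle

output_files = [f'{x}.txt' for x in range(1000)]   # the large variable from the example above

def variable_signature(obj) -> str:
    """Short, stable digest of a picklable object."""
    return hashlib.md5(pickle.dumps(obj, protocol=4)).hexdigest()

print(variable_signature(output_files))            # 32 hex characters
print(len(pickle.dumps(output_files, protocol=4)), 'bytes if stored verbatim')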

gaow commented Jul 20, 2018

> I am not terribly comfortable at merging everything

But this should put us in a good position to adopt a more efficient database solution when it comes to that point?

BoPeng commented Jul 20, 2018

I don't know... Travis passed but AppVeyor actually failed, so I am not done with the current change yet.

BoPeng commented Jul 22, 2018

The new_format branch uses a completely new format for .task files so that it can better handle task status queries and the embedding of other files such as .res and .out. It also absorbs file signatures, so the execution of tasks on remote hosts will not leave any signature files there. This branch passes all tests but should still be considered experimental. In particular, this format is NOT compatible with existing task files, so you have to remove all existing task files to use it.

gaow commented Jul 23, 2018

Great! I'm curious about what happens under the hood -- what file format is it and how is it optimized to respond efficiently to sos status commands? Is it now one signature per output file?

BoPeng commented Jul 23, 2018

https://github.com/vatlab/SoS/blob/new_format/src/sos/tasks.py#L137

There is a fixed-size header for task files. SoS can read and write important information in the header without loading the entire file, e.g. get the status by reading a few bytes from the file (although I am changing the 10-byte string to an integer soon).

The header also stores the "size" of the different blocks, so task parameters, stdout, and stderr can be retrieved separately. stdout and stderr are compressed to save disk space.

With this change, task status is written directly to the task file, so there is less need to "guess" the status of a task under different situations; information such as the time of each status change is also recorded, so it is possible to learn the history of a task.
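
To make that concrete, here is a simplified sketch of the general technique (not the actual layout in tasks.py): a fixed-size header packed with struct that records a status code and the sizes of the variable-length blocks, so the status can be read back without loading the whole file.

# Simplified illustration of a fixed-size header (not the real SoS format).
import struct

HEADER_FMT = '!I Q Q Q'    # status code, params size, stdout size, stderr size
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def write_task(path, status, params, stdout, stderr):
    header = struct.pack(HEADER_FMT, status, len(params), len(stdout), len(stderr))
    with open(path, 'wb') as f:
        f.write(header + params + stdout + stderr)

def read_status(path):
    # Reads only the first HEADER_SIZE bytes -- the rest of the file is untouched.
    with open(path, 'rb') as f:
        status, _, _, _ = struct.unpack(HEADER_FMT, f.read(HEADER_SIZE))
    return status

write_task('demo.task', 2, b'{"cores": 1}', b'some stdout\n', b'')
print(read_status('demo.task'))    # -> 2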

gaow commented Jul 23, 2018

Okay!

A related note: do we want to keep all the generated script files? Those can accumulate fast over time, too. I was initially under the impression that we keep them in the system tmp folder and only copy them over to the local .sos when an exception is captured, but that does not look like the case. I think I still see all the intermediate scripts generated and not removed afterwards.

BoPeng commented Jul 23, 2018

I am generally happy with the new format and will merge it into the trunk when the relevant changes in sos-notebook are made. I should perhaps add a warning for old task files, because the new code does not read them at all and it is very difficult to make it backward compatible due to the completely new way of storing and retrieving information.

To your question: yes, the scripts are supposed to be saved only when an error happens, but I think docker at least is an exception because docker cannot read /tmp. Not sure about other cases, though.

BoPeng commented Jul 26, 2018

It was a lot of work to overhaul the task handling mechanism, but it is done now. I want to repeat that the new version is incompatible with the old versions, so you will have to upgrade everything (sos, sos-notebook, sos-pbs) and remove all existing task files in order to use it. In particular, the new version will error out on old task files, and the old version will hang forever on new task files.

BoPeng closed this as completed Jul 30, 2018