Slowness with large input #991
BTW, I killed it after 30 min. I guess the following can reproduce it (takes about 1 minute on my computer).
I guess the reason is not any of those. If file signatures are indeed the issue, then this may be an advantage of using time stamps as signatures?
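To make the trade-off concrete, here is a minimal, generic sketch (not SoS's actual implementation; the function names and the saved-signature format are made up for illustration) contrasting a cheap time-stamp signature with a content-hash signature:

```python
import hashlib
import os

def mtime_signature(path):
    """Cheap signature: file size + modification time, one stat() call."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime)

def md5_signature(path, chunk_size=1 << 20):
    """Expensive signature: hash of the full content, reads every byte."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def unchanged(path, saved_sig, use_mtime=True):
    """Compare a file against a previously saved signature."""
    sig = mtime_signature(path) if use_mtime else md5_signature(path)
    return sig == saved_sig
```

With tens of thousands of input files, the difference between one `stat()` per file and a full read per file adds up quickly, which is presumably why time stamps look attractive here.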
The first thing I notice is that ...
I do not get it ... I had to use ...
I just noticed that sos reaches the processing step a lot sooner if you use ... But the processing step is also slow, so this might not change anything significantly. Still checking.
The biggest problem is the checking of the input files against the DAG. Let me know if the patch helps.
It certainly has helped. The 40K processing now takes a minute!
Well, on the cluster it still takes some time to preprocess it, about 5 min, using ...
I know there are a lot of debug trace messages, but at least now these messages are helpful when things go wrong (like this ticket). I would say 5 min is not too bad for 40K substeps, so let us leave further optimization for later.
I might have said it too soon. It took < 5 min to go through those lines for 43K files, which indeed is not bad. But then it got stuck on this message for > 30 min: ... (see the last line). In the meantime, nothing got sent to the job scheduler. Is it preparing for something else? At this point there is no new trace or debug information displayed.
This message sends a task to a workflow worker... frankly, I do not know what is going on. I would suggest that you reduce the number of files (e.g. ...).
I can confirm that this hang can be reproduced regardless of how many tasks there are. I left in only 3 tasks but it is still hanging. I then ran ...
So removal of ... ?
I hope it is the case. Currently the ... To give you an idea of what's going on now, here are the remaining file counts at a few time points, and the files removed per second:

```r
# remaining file counts at four successive checks
n = c(107798, 107167, 105271, 104789)
# time of each check, in seconds
tm = c(18, 24, 34, 38)
# files removed and seconds elapsed between consecutive checks
d_n = -(n[2:4] - n[1:3])
d_t = tm[2:4] - tm[1:3]
d_n/d_t
# [1] 105.1667 189.6000 120.5000
```

So let's say 120 files are removed per second; for 3600 seconds that is 432,000 files in my ...
I once read that modern file systems are already "database-like" so there is no real need to manage files ourselves in databases to achieve better performance, but I guess I am wrong here. When you have some time, perhaps you can send me a list of the files under .sos with type, date and size, and think of ways to improve the situation.
Sure ... Now my ... Will have to dig into what's going on, but I'll need to get some analysis results, so I'm moving the 43K jobs to my 40-thread desktop, which should complete all analyses in 6 hrs anyway. I'll test again on the cluster later and report back.
BTW, there was a nextflow tweet saying XXX million tasks have been completed by nextflow. Not sure how it came up with this number, but SoS is quickly catching up with your small tasks. 😄
Well, I've been analyzing some data since last Thursday and my local ...
I've got frustrating feedback from others using SoS to analyze many small tasks. Also, as my own analyses scale up (77K analysis units for 40 phenotypes), it seems I have to remove ... Having a global ... My questions are: ...
I think as long as we can make it strictly per file per output (or per file-set per output), things will be more under control. I understand you might not have these large-scale test applications, but I guess we can at least start checking the code and think about at which steps we can reduce the number of files stored?
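To illustrate the "per file per output" idea, here is a hypothetical sketch (not how SoS currently stores signatures; the sidecar-file naming is made up): each output carries its own signature file next to it, so deleting a project's outputs also deletes their signatures, with nothing left behind in a global store.

```python
import json
import os

def sig_path(output):
    """Hypothetical sidecar signature stored next to the output it describes."""
    return output + '.sig'

def write_signature(output, inputs):
    """Record the size/mtime of every input that produced this output."""
    sig = {p: [os.path.getsize(p), os.path.getmtime(p)] for p in inputs}
    with open(sig_path(output), 'w') as f:
        json.dump(sig, f)

def is_up_to_date(output, inputs):
    """The output is current if it exists and all recorded input signatures match."""
    if not (os.path.exists(output) and os.path.exists(sig_path(output))):
        return False
    with open(sig_path(output)) as f:
        saved = json.load(f)
    return all(p in saved and saved[p] == [os.path.getsize(p), os.path.getmtime(p)]
               for p in inputs)
```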
BTW, I now end up having to rely on my single desktop to analyze these data. The local ...
Given that we now have a ...
That sounds like a reasonable general idea. How about using local ...?
That will not work because tasks on the server do not have the concept of local ...
Okay, that makes sense ... but is there a way to only delete project-specific task signatures? I am thinking about making directories under ~/.sos/tasks that somehow can be identified by project and be removed altogether when necessary. Potentially that will also make it faster to look up tasks if there is the concept of a project for tasks on the server?
Task tags are designed for that.
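For reference, SoS task statements accept a `tags` option (by default tasks are tagged with the step name and workflow ID), so a project-specific tag can group tasks without changing the directory layout under ~/.sos/tasks. A minimal sketch follows, with a hypothetical step name, hypothetical inputs, and a placeholder R function:

```sos
[fit]
# hypothetical per-unit inputs; one external task per file
input: [f'data/unit_{i}.rds' for i in range(1000)], group_by = 1
task: tags = 'qtl_project', walltime = '30m', mem = '2G'
R: expand = True
    # fit_model() is a placeholder for the real analysis
    fit_model('{_input}')
```

Tasks tagged `qtl_project` could then be listed or purged as a group; the exact command-line options for filtering by tag are worth checking with `sos status -h` and `sos purge -h`.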
Just did a quick summary of my task files on the cluster: ...
Well, this is because you've got 436 tasks, while in my case each analysis is 77K tasks, and each task ranges from 5 min to 40 min of computation depending on how well the algorithm converges ... so I did not consolidate multiple tasks into one job. And I've got ~40 such analyses to run. (My analyses involve computation for various molecular QTL analyses on different cell types.)
I am simply trying to see what files can be removed. Note that ...
Great! I just tested this version. As suggested, I started from scratch -- removed the ... Do you need an MWE to reproduce?
I cannot identify the problem when checking the source code. Previously, I used ... Problems will arise if you run ... at the same time, but I suppose this is a very rare condition that we can handle later.
I removed the ...
Okay, here is a simple test that reproduces the behavior: ...
Here is my config file: ... I'm on the latest trunk. I believe at some point yesterday it worked, but not with this version.
Did you update sos on ...?
I'm running directly on that machine, as indicated by ...
I see, this is a local PBS queue; that might be the reason.
Ok, just pushed a change to ...
Okay, it seems to be running again. I'm wondering about large variables such as ... 1500 tasks leave behind 48 MB of data. Does that sound like an issue to you?
But this should put us in a good position to use an efficient database solution when it comes to that point?
Do not know... Travis passed but AppVeyor actually failed, so I am not done with the current change yet.
The ...
Great! I'm curious about what happens under the hood -- what file format is it, and how is it optimized to respond efficiently to ...?
https://github.com/vatlab/SoS/blob/new_format/src/sos/tasks.py#L137 There is a fixed-size header for task files. sos can read/write important information in the header without loading the entire file, e.g. get the status by reading a few bytes from the file (although I am changing the 10-byte string to an integer soon). The header also has the "size" of the different blocks, so that task parameters and ... can be retrieved individually. With this change, task status is written directly to the task file and there is less need to "guess" the status of the task under different situations, and information such as the time of each status change is also recorded, so it is possible to learn the history of the task.
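To illustrate the idea (the field layout below is invented for the example; the real header is defined in tasks.py linked above), a fixed-size header lets you read or update the task status with a small read/seek instead of loading the whole task file:

```python
import struct

# Hypothetical header layout, for illustration only:
# magic (4s), format version (i), status code (i),
# size of params block (i), size of result block (i)
HEADER = struct.Struct('<4s i i i i')

def read_status(task_file):
    """Read the status code from the fixed-size header without touching
    the (potentially large) parameter/result blocks that follow it."""
    with open(task_file, 'rb') as f:
        magic, version, status, params_size, result_size = HEADER.unpack(
            f.read(HEADER.size))
    return status

def update_status(task_file, new_status):
    """Overwrite only the 4-byte status field, leaving the rest intact."""
    with open(task_file, 'r+b') as f:
        f.seek(8)  # offset of the status field: magic (4) + version (4)
        f.write(struct.pack('<i', new_status))
```

Because the header also records the size of each block, a reader can seek directly to, say, the parameters without deserializing the results.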
Okay! A related note: do we want to keep all the script files generated? That can accumulate fast over time, too. I was initially under the impression that we keep them in the system ...
I am generally happy with the new format and will merge it into the trunk when the relevant changes in sos-notebook are made. I should perhaps add a warning for old task files, because the new code does not read them at all and it is very difficult to make it backward compatible due to the completely new way of storing and retrieving information. For your question, yes, the scripts are supposed to be saved only when an error happens, but I think at least docker is an exception because docker cannot read ...
It was a lot of work to overhaul the task handling mechanism, but it is done now. I want to repeat that the new version is incompatible with the old versions, so you will have to upgrade everything (sos, sos-notebook, sos-pbs) and remove all existing task files in order to use them. In particular, the new version will error out with old task files and the old version will hang forever on new task files.
This topic has been discussed before, but perhaps not in the same context. I've got a couple of workflow steps like this: ...
I run it in 2 separate, sequential SoS commands: ...
You see, the first step takes a single file `file.gz`, pairs it with different `chroms`, and then creates many small `rds` files as dynamic output. The actual output at the end of the pipeline is ~43K files. Now when I run the 2nd step, it gets stuck at the single SoS process that prepares for the run, for 10 minutes (I started writing this post 5 min ago), and it is still working on it ... not yet analyzing the data.
~43K files does not sound like a big deal, right? But this is indeed the first time I use the dynamic output of a previous step as the input of the next, in separate commands. I am wondering what is going on in this context, and whether we can do something about it.
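For concreteness, here is a rough sketch of the kind of two-step pattern described above. It is a minimal illustration, not the actual workflow: the step names, the `chroms` list, the file paths, and the shell commands are all made up, and the `for_each`/`dynamic()` usage reflects my reading of the SoS documentation.

```sos
[global]
# hypothetical list of chromosomes to pair with the single input file
chroms = [f'chr{i}' for i in range(1, 23)]

[extract]
# one compressed input, looped over chromosomes; each substep writes
# many small .rds files, which are declared as dynamic output
input: 'file.gz', for_each = 'chroms'
output: dynamic('cache/*.rds')
bash: expand = True
    extract_chrom {_input} {_chroms} cache/

[analyze]
# run later as a separate command; the ~43K .rds files produced
# above are picked up here as dynamic input, one per substep
input: dynamic('cache/*.rds'), group_by = 1
bash: expand = True
    analyze {_input}
```

The two invocations would then be something like `sos run wf.sos extract` followed later by `sos run wf.sos analyze` (script name assumed), which is the "separate commands" situation where the hang appears.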