Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal queues: task in multiple queues behaviour #2701

Open
matthewrmshin opened this issue Jun 22, 2018 · 18 comments
Open

Internal queues: task in multiple queues behaviour #2701

matthewrmshin opened this issue Jun 22, 2018 · 18 comments
Milestone

Comments

@matthewrmshin
Copy link
Contributor

From @paulcresswell via email:

We've noticed that if a task appears in two queues (other than the default queue), it adopts the queue limit of the first one it hits (the suite logs give warnings about ignoring subsequent queues). Is this really desirable? I would have thought that if a user defined two queues, both being picked up through family inheritance, a task should adhere to the smaller of the two values for safety.

See also #1196 #2616.

@matthewrmshin matthewrmshin added this to the some-day milestone Jun 22, 2018
@hjoliver
Copy link
Member

hjoliver commented Jul 5, 2018

a task should adhere to the smaller of the two values for safety.

Should we implement this as described, or the more usual override behaviour: if a task is a member of two queues, the second assignment overrides the first?

@challurip
Copy link
Contributor

Should I change the code for the task to be queued to smaller value or the later assignments?

@hjoliver
Copy link
Member

hjoliver commented Sep 6, 2018

I vote for override semantics, not smaller value. Let's see if @matthewrmshin agrees ...

@matthewrmshin
Copy link
Contributor Author

How do we do override if a task is in two families which are in different queues?

@hjoliver
Copy link
Member

hjoliver commented Sep 6, 2018

Family names can be used as shorthand to assign member tasks to queues, but other than that I don't think families are relevant here. So it is the order of queue assignment that matters.

Below, task x is is being assigned to both queues qA and qB (and it doesn't particularly matter if it got there as written or via family names). Currently x gets assigned to qA, and the second assignment is ignored with a warning. For this issue, if order matters x should end up in qB (2nd assignment overrides 1st); or if "safety" matters, it should end up in qA (lower limit).

[scheduling]
   [[queues]]
      [[[qA]]]
          members = f1, f2, f3, x
          limit = 1
     [[[qB]]]
          members = a1, a2, x
          limit = 2

I can understand the safety argument, but it seems rather unorthodox to make decisions like that for users. Normally it is down to you, if you want to assign a new value to something, the new value should stick.

Does that make sense?

@hjoliver hjoliver modified the milestones: some-day, later Sep 6, 2018
@matthewrmshin
Copy link
Contributor Author

@paulcresswell do you want to comment on this?

@paulcresswell
Copy link

Yes, thanks. The problem with the "usual" override behaviour is that a task may need to inherit other families A and B in a particular order, which themselves are members of qA and qB respectively, which may then force the task to be in one queue when the user intended it to be in another. This is a very hard thing to notice when you get it wrong, as the task will still run but may cause system problems unbeknown to the user. Consequently my vote would be to take the smaller value, but I can understand the desire not to have special rules (so it'd need to be clearly documented).

As an alternative suggestion then, could we consider allowing a task to be a member of multiple queues? E.g. in Hilary's example, if qA and qB both currently contain one running task, then task x would not be allowed to start (as this would exceed qA's limit). However, if task x was the only running task in either queue, then task b in queue qB would be allowed to start, but task a in queue qA would not. That would allow the maximum possible throughput without potentially dangerous behaviour.

@hjoliver
Copy link
Member

hjoliver commented Sep 12, 2018

@paulcresswell - I"m having a bit of trouble getting my head around your argument. Are you suggesting that the order of inheritance of families (A, B vs B, A) implies (or should imply) order of queue assignment, when families - rather than individual tasks - are assigned to queues? Note that queue membership is not a property of a family, in the runtime hierarchy - which is why queues are defined under "scheduling" and not "runtime". As I said above, use of a family name in queue definition is merely as convenient shorthand for "all the family member tasks". Using your example (I think):

[scheduling]
  [[queues]]
      [[[qA]]]
          members = A
          limit = 1
      [[[qB]]]
          members = B
          limit = 2
  [[dependencies]]
        graph = foo
[runtime]
  [[A]]
  [[B]]
  [[f1, f2, f3]]
    inherit = A
  [[a1, a2]]
    inherit = B
  [[x]]
    inherit = B, A 

Under queues the family names expand to list of members, so the above is equivalent to this:

   [[[qA]]]
       members = f1, f2, f3, x
       limit = 1
    [[[qB]]]
       members = a1, a2, x
       limit = 2

It makes no difference whether x inherits from A, B or from B, A.

What matters is that x is assigned to qA first, and qB second (simply because qA appears first in the suite definition) so I would argue that x should end up as a member of qB (by "usual" override semantics).

Maybe we just need to document exactly what queue membership assignment by family name means, and warn to be careful about tasks that inherit from multiple families??

@hjoliver
Copy link
Member

hjoliver commented Sep 12, 2018

(I still need to think about your "alternative suggestion" ... is it better than simpler parse-time assignment to the single lower-limit queue?)

@hjoliver
Copy link
Member

@matthewrmshin @paulcresswell - @challurip is interested in working on this, but we need to come to some resolution on the discussion above!

@dwsutherland
Copy link
Member

dwsutherland commented Nov 23, 2018

Perhaps we could allow for tasks to be in multiple queues, although inevitably it ups the complexity..

One solution would be for Cylc to create internal queues (i.e. qC) whose members (moved from qA & qB) are the intersections of and, while running, occupies 1 slot of each of the user defined queues..

The intersection queues go first. Where there are overlaps, the highest intersectionality (number of component queues) gets priority (no preference on equal number (perhaps queue size))..

Just a thought (^.^);

@paulcresswell
Copy link

I misunderstood your earlier explanation Hilary, which is why you couldn't get your head round my reply; sorry. Thinking it over cleanly, my preferred option is the 'multiple queues' method because it avoids unnecessarily holding up tasks that could run, but I appreciate that's a much more complex thing to implement. Ultimately I think I and anyone else could cope with any of the methods described once they were done and documented; by rearranging the order the queues are defined in if necessary (smallest last!). The only thing I'd be averse to is any solution that meant the order of family inheritance controlled which queue something ended up in.

(I do wonder how many users don't appreciate the difference between scheduling overrides and runtime order of inheritance though - I certainly didn't when I asked this question. It's not something most of us come across as frequently, I think.)

@matthewrmshin
Copy link
Contributor Author

As discussed off line... The purpose of a queue is to limit access to a given resource. If we have queues to control access to multiple resources, we have an issue when tasks are in both queues, and I have no good answer on how to solve the potential race condition.

@TomekTrzeciak
Copy link
Contributor

I think allowing one task in multiple queues only make things overly complicated. I wonder if the root cause of this issue isn't simply because of conflating unrelated settings into one family, which then forces unwanted queue membership. Let's say you have a setup like this:

[scheduling]
  [[queues]]
    [[[QA]]]
      limit = X
      members = FAM_QA
    [[[QB]]]
      limit = Y
      members = FAM_QB
[runtime]
  [[FAM_QA]]
  [[FAM_QB]]

In the above, any task inheriting from FAM_QX will end up queued in QX. Since the 'queue families' do not carry any settings, there is no need for a task to inherit from more than one such family. While that could technically still happen, it should be treated as a suite design flaw. That is, my preferred solution would be to get an error instead of a warning when this happens, as it is a sign that the family inheritance in the suite is not quite right. This, combined with a paragraph or two of advice in CUG on best practice with regard to queues and inheritance would be enough IMO. Or perhaps even more general advice on correct design of inheritance trees, as there are also other pitfalls[*].

[*] One such pitfall is this:

[runtime]
  [[FAM]]
  [[foo]]
    inherit = FAM
  [[bar]]
    inherit = None, FAM

FAM will show up in GUI as a level of task grouping, but selecting this group and, say, succeeding it will also succeed the bar task due to inheritance, even though this task was not explicitly selected. To avoid this, if a given family is first parent to some task, then it should not be used as a secondary parent to another task (I would even argue that this should fail validation).

@matthewrmshin
Copy link
Contributor Author

I forgot to record this latest off line conversation at the work shop. Consider 2 queues with a common task tX:

Q1: [t1, t2, tX, t3]
Q2: [t4, tX, t6, t7]

We'll allow other tasks to jump queue until the common task is at the head of the queue, so we may have:

Q1: [t1, t2, tX, t3]
Q2: [tX, t6, t7]
# ...
Q1: [t2, tX, t3]
Q2: [tX, t7]
# ...
Q1: [tX, t3]
Q2: [tX, t7]
# ... no more queue jumping until tX is submitted

This logic works as long as we don't have something ridiculous like this:

Q1: [tX, tY, tZ]
Q2: [tY, tZ, tX]
Q3: [tZ, tX, tY]

@TomekTrzeciak
Copy link
Contributor

This logic works as long as we don't have something ridiculous like this:

Q1: [tX, tY, tZ]
Q2: [tY, tZ, tX]
Q3: [tZ, tX, tY]

Assuming each task is added to queues atomically, the relative order of tasks should be preserved under the current FIFO semantics. By if queues are ever extended to allow setting per queue priorities and reordering, such a situation is certainly possible.

@ColemanTom
Copy link
Contributor

Just to throw another idea into the mix, and another use case as way of examples.

Say in my suite I want to limit the number of jobs sending data to external locations

  [[queues]]
    [[[SEND_DATA]]]
      limit = 20
      members = SEND_DATA

But, say also that some servers are particularly slow and struggle when more than a couple of processes are sending them data at a time, so we want to limit them.

    [[[SLOW_SERVER]]]
      limit = 2
      members = SLOW_SERVER_TASKS

The tasks sending data to the SLOW_SERVER should still be limited by the SEND_DATA queue, as they are still also sending data externally. But, they should also be limited by the SLOW_SERVER queue.

Whilst I do generally agree with @TomekTrzeciak in #2701 (comment) I do not think the use case above is covered by that discussion point. This is where a multiple limits would be useful, in my opinion.

Perhaps another way of looking at this would be for queues to be sub-queues within a parent queue? Child queues would accept jobs, and then submit them add them into the parent queue, clearing out when they clear out of the parent queue. This approach might be easier to track mentally.

    [[[SLOW_SERVER]]]
      inherit = SEND_DATA
      limit = 2
      members = SLOW_SERVER_TASKS

But in this case, if a task was by chance in both queues, it would end up in the lowest child queue, and only indirectly be in the parent queue. Task gets added to SLOW_SERVER. That then adds it up the chain into the SEND_DATA queue. When SEND_DATA actually runs it, it clears it out of SEND_DATA and then out of SLOW_SERVER.

@hjoliver
Copy link
Member

(For the record, I had a working implementation of overlapping queues, with no race conditions, in #4035 - but backed out due to performance concerns plus the need to convince others that overlapping queues are a good thing 😁 ) We hope to investigate other queuing implementations again soon, however).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants