time: avoid traversing entries in the time wheel twice #6584
tokio/src/runtime/time/mod.rs
Outdated
// Gets the number of entries in this slot, which will be
// the maximum number of times we traverse.
// Limiting the maximum number of iterations is very important,
// because it's possible that those entries might need to be
// reinserted into the same slot.
//
// This happens on the highest level, when an entry is inserted
// more than MAX_DURATION into the future. When this happens, we wrap
// around, and process some entries a multiple of MAX_DURATION before
// they actually need to be dropped down a level. We then reinsert them
// back into the same position; we must make sure we don't then process
// those entries again or we'll end up in an infinite loop.
//
// This can also happen when other threads concurrently add entries to
// this currently traversing slot.
let count = lock.get_entries_count(&expiration);
for _ in 0..count {
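For illustration only, here is a minimal sketch of the bounded-traversal idea the comment above describes. It is not the PR's code: `Entry`, `process_slot`, and the use of `VecDeque` in place of Tokio's intrusive linked list are all assumptions made to keep the example self-contained.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a timer entry: just a deadline in ticks.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Entry {
    deadline: u64,
}

/// Pops at most `slot.len()` entries from the back of the slot. Entries that
/// are not yet due are pushed back to the front of the same slot, so without
/// the snapshot of the length the loop would keep seeing them forever.
fn process_slot(slot: &mut VecDeque<Entry>, now: u64) -> Vec<Entry> {
    let count = slot.len(); // snapshot taken before traversing
    let mut fired = Vec::new();
    for _ in 0..count {
        let Some(entry) = slot.pop_back() else { break };
        if entry.deadline <= now {
            fired.push(entry);
        } else {
            slot.push_front(entry); // reinserted into the same slot
        }
    }
    fired
}

fn main() {
    let mut slot: VecDeque<Entry> = vec![5u64, 100, 7, 200]
        .into_iter()
        .map(|deadline| Entry { deadline })
        .collect();
    let fired = process_slot(&mut slot, 10);
    println!("fired: {fired:?}, remaining: {slot:?}");
}
```

Because the bound is taken once up front, each entry present at the start is visited exactly once, whether it fires or is reinserted.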
The way we usually handle this kind of thing is to move all entries in the list to another list (using the guarded linked list), and then we wake entries from the guarded list in batches. Can we not do the same here?
I hope I understand you correctly.
What I want to address is to avoid moving all entries in the list to another list at once.
I currently see `GuardedLinkedList` used in broadcast and notify, and it seems that it must remove all entries from the original linked list at once by calling `std::mem::take`.
With `std::mem::take`, we cannot achieve the goal of traversing the linked list only once. Once a TimerHandle is removed from the original linked list, other threads need to know that it is no longer there (entries can be dropped concurrently), and this synchronization must go through the lock. After modifying the state of the TimerHandle, we cannot wake it immediately, because the lock is still held and we need to avoid deadlock.
In short,
- Because all entries are moved at once through `std::mem::take`, we must traverse the entire list and mark them as no longer in the original linked list.
- Waking is not allowed while holding the lock, so we must temporarily store these entries in the pending linked list.
The master version is:
- Get the lock
- Remove all entries from the slot linked list at once by calling `std::mem::take`.
- Traverse the slot linked list, mark all entries as `STATE_PENDING_FIRE`.
- Release the lock
- After that, other threads can know their `TimerHandle`s are stored in the pending linked list instead of the original one.
- Traverse the pending linked list in batch:
  - Get the lock
  - Traverse the pending linked list, mark some entries as `STATE_DEREGISTERED` in batch, and get their waker (32 in batch).
  - Release the lock
  - After that, other threads can know their `TimerHandle`s are not in the pending linked list or the slot linked list.
  - Wake them up.
  - Get the lock again ...
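The two-pass flow listed above could be sketched roughly as follows. This is an illustration under assumptions, not Tokio's implementation: `Wheel`, `fire_two_pass`, and `CountingWaker` are hypothetical names, plain `Vec`s stand in for the intrusive lists, and the per-entry state marking is reduced to a comment.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// Counts how many times it was woken; stands in for a real task.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

#[derive(Default)]
struct Wheel {
    slot: Vec<Waker>,    // entries registered in one slot
    pending: Vec<Waker>, // entries waiting to be fired
}

const BATCH: usize = 32;

fn fire_two_pass(wheel: &Mutex<Wheel>) {
    // Pass 1: under the lock, move every entry out of the slot at once.
    // (The real code would also mark each entry as pending here.)
    {
        let mut lock = wheel.lock().unwrap();
        let taken = std::mem::take(&mut lock.slot);
        lock.pending.extend(taken);
    }
    // Pass 2: walk `pending` again in batches, waking only while unlocked.
    loop {
        let batch: Vec<Waker> = {
            let mut lock = wheel.lock().unwrap();
            let n = lock.pending.len().min(BATCH);
            let batch = lock.pending.drain(..n).collect();
            batch
        };
        if batch.is_empty() {
            break;
        }
        for waker in batch {
            waker.wake();
        }
    }
}

fn main() {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let wheel = Mutex::new(Wheel::default());
    for _ in 0..100 {
        wheel.lock().unwrap().slot.push(Waker::from(counter.clone()));
    }
    fire_two_pass(&wheel);
    println!("woken: {}", counter.0.load(Ordering::SeqCst));
}
```

Every entry is touched twice here, once when it is moved and once when it is woken, which is the double traversal this PR sets out to remove.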
The current version is:
Traverse the slot linked list in batch:
- Get the lock
- Traverse the slot linked list, mark some entries as `STATE_FIRING` and then `STATE_DEREGISTERED` in batch, and get their waker (32 in batch).
- Release the lock
- After that, other threads can know their `TimerHandle`s are not in the slot linked list.
- Wake them up.
- Get the lock again ...
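A rough sketch of this single-pass variant, again with hypothetical names (`fire_single_pass`, `CountingWaker`) and a `VecDeque` standing in for the slot list. It leaves out the state transitions and the reinsertion case and only shows the lock/unlock rhythm around each batch.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// Counts how many times it was woken; stands in for a real task.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

const BATCH: usize = 32;

/// Pops entries straight from the slot in batches and wakes each batch with
/// the lock released. Returns how many times the lock was acquired.
fn fire_single_pass(slot: &Mutex<VecDeque<Waker>>) -> usize {
    let mut batch: Vec<Waker> = Vec::with_capacity(BATCH);
    let mut acquisitions = 0;
    loop {
        {
            let mut lock = slot.lock().unwrap();
            acquisitions += 1;
            while batch.len() < BATCH {
                match lock.pop_back() {
                    Some(waker) => batch.push(waker),
                    None => break,
                }
            }
        } // the lock is released here, before any waker runs
        if batch.is_empty() {
            return acquisitions;
        }
        for waker in batch.drain(..) {
            waker.wake();
        }
    }
}

fn main() {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let slot = Mutex::new(VecDeque::new());
    for _ in 0..100 {
        slot.lock().unwrap().push_front(Waker::from(counter.clone()));
    }
    let acquisitions = fire_single_pass(&slot);
    println!("woken: {}, lock acquisitions: {acquisitions}", counter.0.load(Ordering::SeqCst));
}
```

Each entry is visited once, at the moment it is popped, so there is no intermediate list to walk a second time.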
tokio/src/runtime/time/mod.rs
Outdated
Err(expiration_tick) => {
    lock.add_entry(entry, expiration.clone(), expiration_tick);
}
There must be some operation somewhere that moves timers into lower levels. Is this it?
Yes. This is exactly the code block that does this. The position of the `TimerEntry` is recalculated here.
Drive-by comments. I have no idea if the maintainers will accept this idea. On the surface, this is what it looks like is happening with these changes.
A timer wheel's slot's entries are now counted. This increases the timer wheel sizes in memory but is meant to allow cycling through them once.
When running through a timer wheel slot list, the original count is used to try to know when to stop popping from the back of the list, because logic is now added that pushes back to the front. But because the lock has to be released and then reacquired when the local stack-frame slice of temporary expired events has become full, another thread may have come in and added an entry (that conceivably has a 'now' deadline) so it seems this new entry will not be seen by the loop. This is going to be a very rare event, as it also must coincide with the local stack-frame's slice having filled up, so it is conceivable a solution is to punt out of the looping at that point and restart at the top of the function, getting the new total size of the list, starting with an empty stack frame slice too, and just starting again to look for entries that expired 'now'.
It is nice that the poll function logic is simplified. (If that's actually correct - I haven't grokked why that would be the case but I'm probably not seeing the whole picture.)
The There is a new use of
The idea is commendable. It seems in the common case, the work the runtime has to do is being streamlined. But there's at least the one corner case I mentioned above and the logic is spread out over four files so someone with a clearer idea of all the workings would have to look at this and say that the timer wheel logic was still tight because the corner cases seem too hard to test for exhaustively and even fuzz testing isn't going to be a guarantee the whole thing still hangs together.
But also, is the added size of the timer wheels worth it? The sharding and the lazy initialization changes that went in recently already made the timer wheel much more efficient for busy systems.
Thanks for your comments.
Yes. The current approach comes at the cost of increased space usage. :)
Because of we
Yes, the
Yes, we should add SAFETY comments to the new function I rename it
Yes. The benefit here depends on how the timer is used. But if there is a chance to traverse the list only once instead of twice, it is definitely a performance win.
The benchmark code is as follows:
#[tokio::main]
async fn main() {
    let now = tokio::time::Instant::now();
    let five_seconds_later = now + tokio::time::Duration::from_secs(5);
    // Spawn 1M tasks that all sleep until the same deadline,
    // so that they all land in the timer wheel and fire together.
    let mut handles = vec![];
    for _ in 0..1024 * 1024 {
        handles.push(tokio::spawn(async move {
            tokio::time::sleep_until(five_seconds_later).await;
        }));
    }
    // Wait for every task to be woken and to finish.
    for h in handles {
        let _ = h.await.unwrap();
    }
    println!("{:?}", now.elapsed());
}
The benchmark results of the above are as follows:
On Linux with the This shows there is a time reduction of |
Why are times 5.8 and 5.7 in one part and 0.8 and 0.7 in another?
Here, I want to measure the time taken for time wheel related operations (which may mainly include traversing the time wheel and waking up and executing tasks).
As FrankReh worried, introducing counters for all levels leads to additional memory consumption. This memory problem is exacerbated because the timer wheel is sharded by worker threads. I decided to use a counter only for the highest level. Now, if our time wheel shard size is 8, the total memory increment of the timer wheels would be 3968 bytes. As far as I know, 8 worker threads take at least 16 MB (2 MB * 8), so this extra 4 KB of memory might be acceptable.
@wathenjiang Apologies if I have it backwards or if I read the new code incorrectly. Don't you mean to provide counters only for the shortest intervals? I think that would mean level 0, not level 5. But I could easily be missing something.
Also, I do like the idea of treating the most frequently accessed level differently from the rest. It made me wonder if the whole sharding idea could have been done differently, and only shard the first one, two or three levels. But that would come at the cost of more code that needs to be maintained. And I'm not a maintainer. Hardly even a contributor these days. But I hold timer wheel ideas very dear and I haven't been asked not to throw in my 2 cents now and then.
Also, it would be nice if you got some indications from @Darksonn if you might even be on the right track or if you should just stop. Best I can tell from the comment left by @Darksonn above, your fundamental idea hasn't been given much of a go-ahead yet and you shouldn't take the wrong impression by my comments that your efforts will be rewarded.
I appreciate your comments. If I have been unclear, let me make it clear: we should only deal with the highest level's infinite loop issue. The idea of the counter is to avoid infinite loops; if there are other solutions, I'm open to them. Not using counters for the lower levels is implementation specific, not design specific. I'd love to see the memory savings. On the other hand, if the memory usage is not affected much, we can also use the previous version. Yes, feedback is important.
Thank you @FrankReh for prodding me. I will try to summarize my thoughts on this PR.
The idea of using a counter to avoid the infinite loop makes me a bit worried, and I can't really tell whether it's correct. It seems to me that the counter is being used to solve a problem that we have encountered many times in Tokio. Previous times we encountered this problem, we solved it using
Let me try to summarize my understanding of the PR. Currently we move items to the However, as far as I understand, the items that are moved correspond exactly to those that are stored in a specific bucket, correct? Could we move all of those items into However, in this comment you object that when we move all of the items into the However, the documentation for tokio/tokio/src/util/linked_list.rs Lines 192 to 200 in 8e15c23
That is to say, you are allowed to call
Thanks! I have never noticed the comment of the I believe this is the best way to address our purpose.
I looked it over and actually didn't have anything to add. The Just wait for @Darksonn to find time. There are a lot of changes so probably takes more than a few minutes to decide if it looks right. But from my perspective .. nicely done.
Overall looks reasonable to me.
// Wake a batch of wakers. To avoid deadlock,
// we must do this with the lock temporarily dropped.
drop(lock);
waker_list.wake_all();
I'm worried about panics here. What happens to entries still in the guarded list?
I think some other examples of this wrap the guarded list in something that removes all entries from the list without waking their wakers.
Yes. We should ensure the list is empty before dropping it, and avoid a double panic, even if a panic occurs. I will wrap it.
Now we use `EntryWaitersList` instead. It addresses the memory-safety concern.
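To make the panic concern concrete, here is a minimal sketch of a wrapper that unlinks any remaining entries on drop without waking them. It is not the PR's `EntryWaitersList`: `UnlinkOnDrop`, `Entry`, and the `linked` flag are hypothetical, and a `VecDeque` of `Arc`s stands in for the guarded intrusive list.

```rust
use std::collections::VecDeque;
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical entry: `linked` records whether it still sits in a list.
struct Entry {
    linked: AtomicBool,
    panics_on_wake: bool,
}

impl Entry {
    fn wake(&self) {
        if self.panics_on_wake {
            panic!("waker panicked");
        }
    }
}

/// Owns the list of entries being fired and empties it when dropped.
struct UnlinkOnDrop {
    list: VecDeque<Arc<Entry>>,
}

impl UnlinkOnDrop {
    fn pop_back(&mut self) -> Option<Arc<Entry>> {
        let entry = self.list.pop_back()?;
        entry.linked.store(false, Ordering::Release);
        Some(entry)
    }
}

impl Drop for UnlinkOnDrop {
    fn drop(&mut self) {
        // Runs on normal exit and while unwinding: unlink the rest, wake nothing.
        while self.pop_back().is_some() {}
    }
}

fn fire_all(entries: Vec<Arc<Entry>>) {
    let mut guard = UnlinkOnDrop { list: entries.into() };
    while let Some(entry) = guard.pop_back() {
        entry.wake(); // may panic; `guard` is still dropped during unwinding
    }
}

fn main() {
    let entries: Vec<Arc<Entry>> = (0..4)
        .map(|i| Arc::new(Entry { linked: AtomicBool::new(true), panics_on_wake: i == 2 }))
        .collect();
    let result = panic::catch_unwind(AssertUnwindSafe(|| fire_all(entries.clone())));
    let all_unlinked = entries.iter().all(|e| !e.linked.load(Ordering::Acquire));
    println!("panicked: {}, all unlinked: {all_unlinked}", result.is_err());
}
```

Since the drop handler never calls user code, it cannot panic a second time while the first panic is unwinding.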
tokio/src/runtime/time/mod.rs
Outdated
let unguarded_list = lock.take_entries(&expiration);
// It is critical for `GuardedLinkedList` safety that the guard node is
// pinned in memory and is not dropped until the guarded list is dropped.
let guard = TimerShared::new(id);
pin!(guard);
let guard_ptr = NonNull::from(guard.as_ref().get_ref());
// GuardedLinkedList ensures that the concurrent drop of Entry in this slot is safe.
let mut guarded_list = unguarded_list.into_guarded(guard_ptr);
Can we move the call to `lock.take_entries` so that we immediately call `into_guarded`?
I added the method `get_waiters_list` to replace it. Is this design reasonable enough for us?
tokio/src/runtime/time/entry.rs
Outdated
// This entry is in the `guarded_list`, so it can not be removed
// from the entry list with the same `level` and `slot`.
// Because its state is STATE_DEREGISTERED, it has been fired.
if cur_state == STATE_DEREGISTERED {
    break Err(cur_state);
}
I don't understand this comment. How do you know that it's in the guarded list? And why is that a problem?
Only an Entry that has been `pop_back`ed from the `guarded_list` can have `mark_firing` called on it, so we can be sure of this.
If we fail to mark an Entry as `STATE_FIRING`, then we have a problem.
The entry has already been fired, so we must not fire it again.
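As a rough illustration of that check, here is a sketch of a compare-and-swap loop that refuses to move an already-fired entry into the firing state. The constant values, `StateCell`, and `fire` are assumptions for the example; only the `STATE_DEREGISTERED` early exit mirrors the hunk above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Assumed encoding for illustration: two sentinel values at the top of the
// u64 range, everything below them is an ordinary deadline.
const STATE_DEREGISTERED: u64 = u64::MAX;
const STATE_FIRING: u64 = u64::MAX - 1;

struct StateCell(AtomicU64);

impl StateCell {
    /// Tries to move the entry to `STATE_FIRING`. Returns `Err` with the
    /// observed state if the entry has already been fired.
    fn mark_firing(&self) -> Result<(), u64> {
        let mut cur_state = self.0.load(Ordering::Relaxed);
        loop {
            if cur_state == STATE_DEREGISTERED {
                break Err(cur_state);
            }
            match self.0.compare_exchange_weak(
                cur_state,
                STATE_FIRING,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => break Ok(()),
                // Another thread changed the state; retry with the new value.
                Err(actual) => cur_state = actual,
            }
        }
    }

    /// Completes firing: the entry is now deregistered.
    fn fire(&self) {
        self.0.store(STATE_DEREGISTERED, Ordering::Release);
    }
}

fn main() {
    let cell = StateCell(AtomicU64::new(42)); // 42 is an arbitrary deadline
    println!("first attempt: {:?}", cell.mark_firing());
    cell.fire();
    println!("second attempt: {:?}", cell.mark_firing());
}
```

The second attempt returns an error, so a caller that sees `Err` can skip the entry instead of waking it twice.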
The code seems reasonable, I just found the comment confusing. Why is it important that it can't be removed from the entry list?
// This entry is in the `guarded_list`, so it can not be removed
// from the entry list with the same `level` and `slot`.
Sorry, it took me some time to remember why I wrote the comment above.
What I mean is that if the entry list with the same `level` and `slot` is empty, we cannot remove the entry from the entry list, since the Entry is in the guarded list instead.
The comment is very confusing, so I have decided to delete it.
This comment is actually my explanation of why the state value is `STATE_DEREGISTERED` here.
The entry cannot be removed from the list but can still be fired, which is how we encounter it here.
Please see the `clear_entry` method of the `Handle`.
Or we can make this more clear by adding some comments.
Thank you.
This change has been reverted in #6715.
Motivation
Currently we use a pending linked list to temporarily store the timer entries to be fired.
If a large number of timers need to be woken at once, this causes cache misses.
Solution
To synchronize state between threads, we must hold the wheel's lock whenever we modify the state of entries in the wheel. However, while we hold the lock, we are not allowed to wake the wakers of entries, in order to avoid deadlock.
Furthermore, to avoid the infinite loop issue, we need to take all of the entries off the slot before processing any of them. Throughout this process, we hold the lock.
These are the main reasons the `pending` linked list is currently used.
Apart from supporting random deletion, the `slot` linked list only offers FIFO `insert_head` and `pop_back`. Given this, if a counter is introduced for each slot, the maximum number of `pop_back`s can be determined before traversing the slot, which solves the infinite loop issue.
Furthermore, by solving the infinite loop issue, we no longer need to take all of the entries off the slot before processing any of them. Finally, in this PR, we removed the `pending` linked list.
Expected benefits are as follows