[r2r] Fix a race condition in AbortableQueue #1528

Merged
merged 5 commits into from
Nov 10, 2022

sergeyboyko0791

Thanks to @borngraced for finding and reporting this bug.
There was a race condition in AbortableQueue, so the following code led to this panic:

'index out of bounds: the len is 0 but the index is 1'

spawner.spawn(futures::future::ready(()));
abortable_system.abort_all();

// This sleep allows the `select(abortable_fut.boxed(), wait_till_abort.boxed()).await` future to be polled.
block_on(Timer::sleep(0.01));

spawner.spawn(futures::future::ready(()));

@sergeyboyko0791 sergeyboyko0791 changed the title Fix a race condition in AbortableQueue [r2r] Fix a race condition in AbortableQueue Nov 2, 2022
shamardy previously approved these changes Nov 2, 2022
Collaborator

@shamardy shamardy left a comment

🔥

Member

@borngraced borngraced left a comment

Looks nice! One suggestion.

@@ -153,32 +160,28 @@ impl QueueInner {
/// Inserts the given future `handle`.
fn insert_handle(&mut self, handle: oneshot::Sender<()>) -> FutureId {
match self.finished_futures.pop() {
Some(finished_id) => {
// We can reuse the given `finished_id`.
Ok(finished_id) => {
self.abort_handlers[finished_id] = handle;
Member

@borngraced borngraced Nov 3, 2022

What about doing a check here before indexing?

Author

Good note! If there is a wrong FutureId in finished_futures, we can get 2 problems:

  1. If finished_id < abort_handlers.len(), we'll reset a valid future handle at abort_handlers[finished_id], and it will be aborted - that can lead to a very complicated debugging process;
  2. If finished_id >= abort_handlers.len(), we'll get a panic.

If case 2 happens, it means we are in an invalid state, and some other wrong finished_id can appear later and lead to case 1. I think it's worth panicking so that the developers can fix it quickly.
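
The two cases above can be reproduced with a plain `Vec`. This is only a minimal sketch: `reuse_unchecked` is a hypothetical helper, and the handle type is left generic instead of the `oneshot::Sender<()>` used in the diff.

```rust
type FutureId = usize;

/// Hypothetical helper mirroring the unchecked `abort_handlers[finished_id] = handle`.
/// It returns the handle that was displaced so the effect of case 1 is visible.
fn reuse_unchecked<H>(abort_handlers: &mut Vec<H>, finished_id: FutureId, handle: H) -> H {
    // Case 2: if `finished_id >= abort_handlers.len()`, this index expression panics.
    // Case 1: otherwise the handle stored in the slot is silently replaced.
    std::mem::replace(&mut abort_handlers[finished_id], handle)
}
```

With `vec!["live-0", "live-1"]` and a wrong `finished_id` of 1, the call returns `"live-1"`: a handle that was still in use has been displaced without any error. With an empty vector the same call panics on the index.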

Member

Alright then. Thanks

Author

It's still an open topic, so if there is a way to improve it, I'll be glad to read 🙂

Member

Right, I think we can handle case 2: if it occurs, we should do nothing and log something like "current index doesn't belong to any future" instead of panicking. WDYT?
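
A checked variant along these lines could look roughly like the sketch below. It is not the code that was merged: the struct is a simplified stand-in that reuses the field names from the diff, the handle type is generic, `eprintln!` stands in for whatever logging macro the project uses, and falling back to a fresh slot after logging is an assumption.

```rust
type FutureId = usize;

/// Simplified stand-in for the queue state; field names follow the diff.
struct QueueInner<H> {
    abort_handlers: Vec<H>,
    finished_futures: Vec<FutureId>,
}

impl<H> QueueInner<H> {
    /// Inserts `handle`, reusing a finished ID only if it points at an existing slot.
    fn insert_handle(&mut self, handle: H) -> FutureId {
        if let Some(finished_id) = self.finished_futures.pop() {
            if let Some(slot) = self.abort_handlers.get_mut(finished_id) {
                // The ID is in bounds, so the freed slot can be reused.
                *slot = handle;
                return finished_id;
            }
            // Out-of-bounds ID: log it instead of panicking on the index.
            eprintln!("future ID {} doesn't belong to any future", finished_id);
        }
        // No usable finished ID: allocate a new slot at the end.
        self.abort_handlers.push(handle);
        self.abort_handlers.len() - 1
    }
}
```

Note that `get_mut` only guards against case 2. A wrong ID that happens to be in bounds (case 1) still overwrites a live handle, so the check does not replace keeping `finished_futures` consistent.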

Author

That's interesting! Will consider implementing it tomorrow :) Thank you!

Author

I've added the checking, thank you for the note!

self.abort_handlers[finished_id] = handle;
// The freed future ID.
finished_id
},
None => {
// There are no finished future IDs.
Err(_) => {
self.abort_handlers.push(handle);
Member

@borngraced borngraced Nov 3, 2022

Same here (line 165).

Member

@onur-ozkan onur-ozkan left a comment

Thanks for the fix! Just a question.

@@ -1,6 +1,7 @@
use crate::executor::abortable_system::{AbortableSystem, InnerShared, InnerWeak, SystemInner};
use crate::executor::spawner::{SpawnAbortable, SpawnFuture};
use crate::executor::{spawn, AbortSettings, Timer};
use crossbeam::queue::SegQueue;
Member

SegQueue is a linked-list implementation with dynamic allocation, which has a significant performance impact. Is it a must-have for this implementation?

Author

SegQueue is definitely not the best solution, but in comparison with Arc<PaMutex<Vec<FutureId>>>, SegQueue is not that bad.
I ran a few tests with a fixed-size ArrayQueue and PaMutex<Vec<FutureId>>, and here are the results:

  1. ITEMS_NUMBER = 5_000_000:

push_values_lock_free_dynamic took 337.420958ms: items=5000000
pop_values_lock_free_dynamic took 371.537791ms: items=5000000, missing=732
====
vvvv
pop_values_lock_free_array took 328.117666ms: items=5000000, missing=15717
push_values_lock_free_array took 328.12ms: items=5000000
====
vvvv
push_values_mutex took 3.156110208s: items=5000000
pop_values_mutex took 3.154702583s: items=5000000, missing=24710287
====

  2. ITEMS_NUMBER = 500_000:

push_values_lock_free_dynamic took 37.359958ms: items=500000
pop_values_lock_free_dynamic took 41.183875ms: items=500000, missing=0
====
push_values_lock_free_array took 37.030666ms: items=500000
pop_values_lock_free_array took 37.027375ms: items=500000, missing=4524
====
push_values_mutex took 368.443375ms: items=500000
pop_values_mutex took 369.029791ms: items=500000, missing=3175914

The code is available here.

Unfortunately, we can't estimate the total number of futures in advance, and the difference between ArrayQueue and SegQueue is not large enough to justify guessing a capacity.
I also agree with you that it's not an optimal solution: when we spawn a future, we need to lock the PaMutex<Inner> mutex and then try to pop a FutureId from the Inner::finished_futures lock-free collection.

But I couldn't find a better solution to avoid using the second thread-synchronization primitive.

Author

Fixed by #1528 (comment)

Member

Fixed by #1528 (comment)

Awesome :)

@sergeyboyko0791 sergeyboyko0791 changed the title [r2r] Fix a race condition in AbortableQueue [wip] Fix a race condition in AbortableQueue Nov 8, 2022
* This allows us to avoid using `SegQueue`
@sergeyboyko0791
Author

While trying to find another way to avoid SegQueue, I realized that we don't need to allow spawning futures once an abortable system has been aborted. Allowing it led to stuck futures if they were spawned after the stop or disable_coin RPCs.

I decided to refactor the abortable system so that it can no longer be used once AbortableSystem::abort_all has been called.
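
The idea can be sketched as a two-state enum, where aborting consumes the `Ready` state for good. This is a simplified model rather than the refactored mm2 code: `QueueState`, `register` and the return types are hypothetical, and "aborting" a future is modelled simply as dropping its handle.

```rust
type FutureId = usize;

/// Hypothetical error for this sketch: the system was already aborted.
#[derive(Debug, PartialEq)]
struct AbortedError;

/// Either still accepting futures, or permanently aborted.
enum QueueState<H> {
    Ready { abort_handlers: Vec<H> },
    Aborted,
}

impl<H> QueueState<H> {
    fn new() -> Self {
        QueueState::Ready { abort_handlers: Vec::new() }
    }

    /// Registers an abort handle; refused once the system has been aborted.
    fn register(&mut self, handle: H) -> Result<FutureId, AbortedError> {
        match self {
            QueueState::Ready { abort_handlers } => {
                abort_handlers.push(handle);
                Ok(abort_handlers.len() - 1)
            }
            QueueState::Aborted => Err(AbortedError),
        }
    }

    /// Switches to `Aborted` and drops every handle; returns how many were dropped.
    fn abort_all(&mut self) -> Result<usize, AbortedError> {
        match std::mem::replace(self, QueueState::Aborted) {
            // The handlers are dropped at the end of this arm.
            QueueState::Ready { abort_handlers } => Ok(abort_handlers.len()),
            QueueState::Aborted => Err(AbortedError),
        }
    }
}
```

Because there is no transition back from `Aborted`, a future spawned after `abort_all` is rejected up front instead of being registered in a system that will never abort it. It also removes the need to recycle IDs after an abort.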

@artemii235 @ozkanonur please review the changes.

Member

@onur-ozkan onur-ozkan left a comment

2nd review iteration

Comment on lines 94 to 104
impl From<EnableSlpError> for InitTokensAsMmCoinsError {
fn from(e: EnableSlpError) -> Self {
match e {
EnableSlpError::GetBalanceError(balance_err) => {
InitTokensAsMmCoinsError::CouldNotFetchBalance(balance_err.to_string())
},
EnableSlpError::UnexpectedDerivationMethod(internal) | EnableSlpError::Internal(internal) => {
InitTokensAsMmCoinsError::Internal(internal)
},
}
}
Member

Isn't this code block too specific for the generic platform_coin_with_tokens? I would put it under the slp modules.

Author

Good note! Moved to bch_with_tokens_activation.rs

match self.futures.entry(future_id) {
let futures = match self {
SimpleMapInnerState::Ready { futures, .. } => futures,
SimpleMapInnerState::Aborted => return false,
Member

What do you think about refactoring the result type into a meaningful enum? There are multiple spots where false can be returned; with an enum we can state the reason more specifically. Just a suggestion, not a blocker.

Author

@sergeyboyko0791 sergeyboyko0791 Nov 9, 2022

You're totally right! Done.
Returning false on SimpleMapInnerState::Aborted actually led to unnecessary spawning in lp_ordermatch.rs; now it should be fine.
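
To illustrate why a bare `false` was ambiguous, here is a rough sketch of a map operation that separates "key already present" from "system aborted". `MapState` and `insert_if_vacant` are hypothetical names for this example; only the `Ready { futures }` / `Aborted` shape and the `AbortedError` name are taken from the diff and the commit message.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::hash::Hash;

/// Error meaning the map's abortable system has already been aborted.
#[derive(Debug, PartialEq)]
struct AbortedError;

/// Simplified stand-in for the map state shown in the diff.
enum MapState<K, H> {
    Ready { futures: HashMap<K, H> },
    Aborted,
}

impl<K: Hash + Eq, H> MapState<K, H> {
    /// `Ok(true)`: inserted. `Ok(false)`: the key is already taken.
    /// `Err(AbortedError)`: nothing may be inserted any more.
    fn insert_if_vacant(&mut self, future_id: K, handle: H) -> Result<bool, AbortedError> {
        let futures = match self {
            MapState::Ready { futures } => futures,
            MapState::Aborted => return Err(AbortedError),
        };
        match futures.entry(future_id) {
            Entry::Occupied(_) => Ok(false),
            Entry::Vacant(slot) => {
                slot.insert(handle);
                Ok(true)
            }
        }
    }
}
```

A caller can now skip spawning entirely on `Err(AbortedError)`, while `Ok(false)` keeps its original meaning that a future with the same ID is already tracked.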

* Move `impl From<EnableSlpError> for InitTokensAsMmCoinsError` to bch_with_tokens_activation.rs
* Return `AbortedError` on `AbortableSimpleMap` operations
@sergeyboyko0791 sergeyboyko0791 changed the title [wip] Fix a race condition in AbortableQueue [r2r] Fix a race condition in AbortableQueue Nov 9, 2022
onur-ozkan previously approved these changes Nov 9, 2022
Member

@onur-ozkan onur-ozkan left a comment

amazing 🔥

# Conflicts:
#	mm2src/coins/lightning.rs
#	mm2src/coins/lightning/ln_errors.rs
Member

@artemii235 artemii235 left a comment

🔥

@artemii235 artemii235 merged commit 2c1524d into dev Nov 10, 2022
@artemii235 artemii235 deleted the fix-abortable-queue branch November 10, 2022 09:04
5 participants