
Use a single dynamic allocation (making RingBuffer a DST) #75

Draft: mgeier wants to merge 16 commits into main

Conversation

@mgeier (Owner) commented Dec 16, 2021

I'm not sure yet if that's a good idea ...

Most of the single-threaded benchmarks improve with this, but it doesn't seem to have an effect on the (more relevant) multi-threaded benchmarks.

This needs an MSRV bump.

Closes #67.
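For illustration, one possible single-allocation layout (just a sketch, not the actual diff; the field names follow the discussion below):

use core::mem::MaybeUninit;
use core::sync::atomic::{AtomicBool, AtomicUsize};

// Sketch: the index fields, the abandonment flag and the element storage all
// share one heap allocation, which makes `RingBuffer<T>` a dynamically sized
// type (the unsized slice of slots has to be the last field).
struct RingBuffer<T> {
    head: AtomicUsize,
    tail: AtomicUsize,
    is_abandoned: AtomicBool,
    slots: [MaybeUninit<T>],
}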

src/lib.rs Outdated
    // to `is_abandoned`. This is accomplished with Acquire/Release.
    if buffer
        .as_ref()
        .is_abandoned
@alcore commented Mar 5, 2022

I believe you could simply use a counter here. fetch_add/fetch_sub can be faster than CAS and the order of the drops shouldn't be relevant at all, unless I'm missing something (didn't review the entire lib). Essentially you just want to deallocate on the second increment and no further increments can happen anyway.

Might even get away with relaxed ordering, because the drop calls happen before the increments -- on the second increment both the producer and consumer no longer use the backing memory, so even if one of the threads gets interrupted during its drop, it knows the other thread is no longer using the memory anyway and it's safe to deallocate.

@mgeier (Owner, Author)

Thanks for this hint!

I wasn't aware that fetch_sub() is faster than CAS.
I'm open to changing this.

But I still think that Relaxed is not enough here.

I'm thinking about this scenario:

  • the Producer writes some data into the ring buffer
  • the Producer is destroyed
  • the Consumer is destroyed

If we were to use Relaxed, the Producer writing some data would not be sequenced before the destruction of the Producer, right?
Therefore, the write could be moved after the Producer decrementing the counter and in the other thread, the Consumer could decrement the counter (and free the memory), leading to the Producer's thread writing to freed memory.

So I think we need a Release+Acquire combo here. I've suggested this in this PR for compare_exchange(), but we could of course also do it with fetch_sub(Release) and load(Acquire).
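A sketch of what I mean by the fetch_sub() variant (assuming a refs: AtomicUsize counter that starts at 2 and the drop_slow() helper that appears later in this thread; this is not actual code from this PR):

unsafe fn abandon<T>(buffer: NonNull<RingBuffer<T>>) {
    // "Store" part: Release, so that previous writes to the ring buffer
    // (e.g. the Producer's push()) happen before the decrement.
    if buffer.as_ref().refs.fetch_sub(1, Ordering::Release) == 1 {
        // We are the second one to abandon; synchronize with the other
        // side's Release before the destructor reads `head` and `tail`.
        let _ = buffer.as_ref().refs.load(Ordering::Acquire);
        drop_slow(buffer);
    }
}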

I was looking at Rust's Arc destructor for this, but it's quite a confusing story ...

The Rust implementation uses an Acquire fence, but mentions that it should probably be replaced by an Acquire load():
https://github.com/rust-lang/rust/blob/379e94f5a4aebe7dc2d8742653ca244d92b06f3d/library/alloc/src/sync.rs#L1650-L1689

The PR mentioned there (rust-lang/rust#41714) notes that Gecko uses an Acquire load().

However, the funny thing is that in the meantime, Gecko has replaced this Acquire load() with an Acquire fence: https://hg.mozilla.org/mozilla-central/rev/d65d668da45fc2e4bb4fbaf74ee3c25ce7f3cd8a

And even funnier, now the Gecko people argue that that's what Rust uses: https://bugzilla.mozilla.org/show_bug.cgi?id=1397052#c13

So we have come full circle!

They argue that the fence is cheaper (https://bugzilla.mozilla.org/show_bug.cgi?id=1397052#c16), while previously the Rust PR argued the opposite. I don't know which is true.
This is supposed to argue that a fence is better (or that load and fence are basically the same?): rust-lang/rust#41714 (comment). I don't fully understand it TBH.

At a later point it seems to have turned out that the fence caused a false positive in the Thread Sanitizer, so this might be something to consider as well.
See also rust-lang/rust#65097.

This might also be related: servo/servo#26227


Here I found an example in some library, using fetch_sub(Release) + load(Acquire):

https://github.com/thomcc/arcstr/blob/70ba2fac19d3efe8d2e0231daaf74f9987c04b8a/src/arc_str.rs#L696-L723

Here's a related issue: thomcc/arcstr#16


Sorry for the long comment, but there is a lot of confusing information out there ...

@alcore commented Mar 5, 2022

My reasoning here was the comment about Mutexes in std's source for Arc. It can't possibly generically rely on synchronization external to itself. But I was overzealous - and misread the order of synchronization in push/pop (and the interactions with local heads/tails), i.e. I was mistaken.

So yes, while astronomically unlikely, I think the scenario you mentioned could happen.

The point about fetch_sub still seems valid to me though. Be it either like Arc (with an atomic::fence), or an AcqRel should work too.

A third option comes to my mind: retain the counter increment as Relaxed, but promote the Relaxed loads in the actual drop code to Acquire loads (not sure if it's necessary on both head and tail, or one should suffice to synchronize). This should not penalize the general case where the drops happen uncontended but will still synchronize if necessary and prevent any memory shenanigans. This works, because push/pop store head/tail respectively with a Release, so that synchronization would pair up. Loads in general also have the lowest latency of all the atomic ops, so that would be the best case scenario, I think.
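Roughly what I have in mind (just a sketch with made-up names; whether the ordering is sufficient is of course the open question):

unsafe fn abandon<T>(buffer: NonNull<RingBuffer<T>>) {
    // The counter itself stays Relaxed: only atomicity, no synchronization.
    if buffer.as_ref().refs.fetch_sub(1, Ordering::Relaxed) == 1 {
        // ... and inside the destructor, the loads of `head`/`tail` would be
        // promoted from Relaxed to Acquire, pairing with the Release stores
        // that push()/pop() already do.
        drop_slow(buffer);
    }
}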

That said - this is extreme microoptimization territory. Unlike Arc, which might need to run that drop glue an infinite number of times, we only run it twice at most, so if unsure, pick the safer option. Edit: Also, which one is actually 'faster' heavily depends on the platform (and a myriad of runtime factors). I should've stressed the can in "can be faster".

Sidenote: The circle is even fuller than you might think - I started out with the original SPSC implementation that you forked - realized that I could drop the Arcs for something simpler and smaller, read through exactly the same references that you just linked, implemented my own with said fetch_add, and then decided to come back here to basically say: hey, this could work in your case as well :)

@mgeier (Owner, Author)

So yes, while astronomically unlikely, I think the scenario you mentioned could happen.

Yeah, I think that's how we should approach atomics. It doesn't matter how likely it is, we should always consider the worst case.

The point about fetch_sub still seems valid to me though. Be it either like Arc (with an atomic::fence), or an AcqRel should work too.

I've re-based this PR and added a commit: 0b6127e

Here I'm using fetch_or() instead of fetch_sub(), because we still don't really need a counter, right?

I would like to avoid the fence because I don't really understand how it works. But I guess there is nothing to be gained by using a fence in this situation anyway.

A third option comes to my mind: retain the counter increment as Relaxed, but promote the Relaxed loads in the actual drop code to Acquire loads (not sure if it's necessary on both head and tail, or one should suffice to synchronize).

I guess that would depend on whether the destructor runs on the consumer or the producer thread.
But I believe it wouldn't be correct anyway, see below.

This should not penalize the general case where the drops happen uncontended but will still synchronize if necessary and prevent any memory shenanigans.

Yeah, this sounds like a good idea, but I think it isn't correct.

This works, because push/pop store head/tail respectively with a Release, so that synchronization would pair up.

I don't think it would work correctly.

Let's consider this scenario: consumer pops, consumer is dropped, producer is dropped. This translates to the following sequence of operations:

  • C: store head in pop() (this is already using Release)
  • C: check (and update!) is_abandoned
  • P: check is_abandoned
  • P: load head in destructor

I think in this case the threads are synchronized by means of is_abandoned, head cannot really be used for synchronization here.

If the consumer would update is_abandoned using Relaxed, this would mean that storing head could be moved after that! The Release of head.store() wouldn't avoid that.
That means (AFAICT) that writing is_abandoned has to be Release.

OTOH, if the producer would read is_abandoned using Relaxed, loading head could be moved before that.
That means (AFAICT) that reading is_abandoned has to be Acquire.

If either of them were Relaxed, the final value of head could be wrong (even if storing and loading head is using Release and Acquire), which I guess might lead to a double free (and maybe other problems?).

So I think the fetch_or() needs Ordering::AcqRel and the loads in the destructor can remain Relaxed (even though we wouldn't mind making them stricter).
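In code, roughly (a sketch, not the exact commit):

unsafe fn abandon<T>(buffer: NonNull<RingBuffer<T>>) {
    // AcqRel: the "store" part releases our earlier writes to the ring
    // buffer, the "load" part acquires the other side's writes before we
    // run the destructor (whose own loads can then stay Relaxed).
    if buffer.as_ref().is_abandoned.fetch_or(true, Ordering::AcqRel) {
        drop_slow(buffer);
    }
}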

That said - this is extreme microoptimization territory. Unlike Arc, which might need to run that drop glue an infinite number of times, we only run it twice at most,

Exactly. And the second time is slow anyway, so it is really about the first time only.

so if unsure, pick the safer option.

I agree. But I still would like to try to make the Ordering as relaxed as possible if we can convince ourselves that it is still correct.

Edit: Also, which one is actually 'faster' heavily depends on the platform (and a myriad of runtime factors). I should've stressed the can in "can be faster".

OK, I don't know anything about that. Apart from performance, I think that fetch_or() looks simpler than compare_exchange().
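For comparison, the two variants side by side (just a sketch, with rb standing for buffer.as_ref()):

// compare_exchange(): only the first thread to abandon succeeds;
// the second one gets Err(true) and drops the buffer.
if rb.is_abandoned
    .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
    .is_err()
{
    drop_slow(buffer);
}

// fetch_or(): unconditionally sets the flag and returns the previous value;
// `true` means the other side had already abandoned the buffer.
if rb.is_abandoned.fetch_or(true, Ordering::AcqRel) {
    drop_slow(buffer);
}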

Sidenote: The circle is even fuller than you might think - I started out with the original SPSC implementation that you forked - realized that I could drop the Arcs for something simpler and smaller, read through exactly the same references that you just linked, implemented my own with said fetch_add, and then decided to come back here to basically say: hey, this could work in your case as well :)

Cool, thanks for stopping by!

Is your code available somewhere?

I would also appreciate any further help from you!
For example: did you run any benchmarks? For this PR, the benchmarks mostly improve, but not all of them. And I don't know if that's because the benchmarks should be improved or if there is still some missing optimization somewhere ...

@mgeier (Owner, Author) commented Mar 6, 2022

I forgot the actual optimization ... I guess we could use this:

unsafe fn abandon<T>(buffer: NonNull<RingBuffer<T>>) {
    // The "store" part of `fetch_or()` uses `Release` so that any previous
    // writes to the ring buffer happen before the flag is set.
    if buffer
        .as_ref()
        .is_abandoned
        .fetch_or(true, Ordering::Release)
    {
        // The other side has already abandoned the RingBuffer. The dummy
        // `Acquire` load pairs with its `Release` above, so that reading
        // `head` and `tail` in the destructor happens after this point.
        let _ = buffer.as_ref().is_abandoned.load(Ordering::Acquire);
        drop_slow(buffer);
    } else {
        // We are the first to abandon the RingBuffer; nothing to do yet.
    }
}

Again, I don't know how to correctly use a fence here.

Either way, would it be worth it?

@mgeier (Owner, Author)

OK, I've read https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence#Atomic-fence_synchronization again, and I think this would be a correct use of the fence:

unsafe fn abandon<T>(buffer: NonNull<RingBuffer<T>>) {
    // The "store" part of `fetch_or()` has to use `Release` to make sure that any previous writes
    // to the ring buffer happen before it (in the thread that abandons first).
    if buffer
        .as_ref()
        .is_abandoned
        .fetch_or(true, Ordering::Release)
    {
        // The flag was already set, i.e. the other thread has already abandoned the RingBuffer
        // and it can be dropped now.

        // However, since the load of `is_abandoned` was `Relaxed`,
        // we have to use an `Acquire` fence here to make sure that reading `head` and `tail`
        // in the destructor happens after this point.
        fence(Ordering::Acquire);
        drop_slow(buffer);
    } else {
        // The flag wasn't set before, so we are the first to abandon the RingBuffer
        // and it should not be dropped yet.
    }
}

I still have to check how ThreadSanitizer behaves with this.

@alcore commented Mar 6, 2022

I think in this case the threads are synchronized by means of is_abandoned, head cannot really be used for synchronization here.

When is_abandoned is a Relaxed increasing/decreasing counter, there is no actual synchronization going on in there. Relaxed is not for synchronization to begin with - it only ensures atomicity of the operation. Which was my proposal: Use relaxed for the counter, but synchronize on head and tail.

If the consumer would update is_abandoned using Relaxed, this would mean that storing head could be moved after that! The Release of head.store() wouldn't avoid that.

Correct, but... isn't that ok, actually?

In the scenario you described, if the pop in C's thread happens after its increment of the counter, then P's thread - if it is the one that actually ends up on the second iteration of the counter - will pick up the write to head in the actual drop glue and won't proceed with freeing until that write Release happens.

At least that's how I understand Rel/Acq semantics - but I might be entirely wrong.

The order of the counter still doesn't matter - regardless of which happens before, the drop won't be executed twice, given only one possible value triggers it. Which thread then actually drops it does not seem relevant to me, so long as all writes to head and tail happen before and none of them can happen after.

Note that if despite the intuitive order in your scenario (C pops, C drops, P drops -> queue drops) the C thread instead of the P thread actually ends up being the second to increase the counter and take care of the drop, then ordering is even less relevant - it can't be concurrent with itself. Same thing the other way round.

Is your code available somewhere?

Unfortunately not (yet, anyway). But the logic is basically 1:1 the same and the initial purpose too (audio buffers).

My case isn't generic, though. I have a bounded pool with 8 buffers (4096 x i16 each). Decoding thread - gets from pool, writes to buffer, pushes buffer into queue. RT playback thread pops from queue, copies the contents to OS buffer, drops them - and their drop glue returns them to the pool. RT thread unparks decoding thread when queue is nearing depletion. Rinse, repeat.

There is so little contention there, that I actually removed the cache line padding on head/tail in order to get some constructive sharing on the queue slots instead of trying to avoid false sharing on the indices.

My 'queue' merely holds smart pointers (Buffer is a smart pointer to memory allocated in the Pool), 16 bytes each on 64-bit, so the cache line padding was hurtful to performance, because more memory needed to be read on each push/pop to access the actual queue slot. Given that head/tail are also locally cached, removing the padding had a positive impact, but I have not performed benchmarks for any generic case under high contention, so I wouldn't draw too much from that. As said, my case is rather simple.

In my case, then, because those are smart pointers, I don't need to Acquire head in drop (which is why I was unsure whether Acquire on both tail and head is necessary), because I don't care about the queue at that stage anymore, and the memory my smart pointer points to is still valid regardless. However, in your case, both would need to be Acquire I think, since T could be a Vec - or any other type holding its own allocation - and not synchronizing the pop could therefore deallocate it while it's handed out, because the queue might still believe that it owns it.
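To make that concrete, a generic destructor would look roughly like this (a sketch; slot_ptr() and increment() are hypothetical helpers, not your actual code):

unsafe fn drop_slow<T>(buffer: NonNull<RingBuffer<T>>) {
    let rb = buffer.as_ref();
    // In the generic case these loads would need Acquire (or equivalent
    // synchronization via the abandonment flag): if T owns an allocation,
    // a stale index could drop an element twice or leak one.
    let head = rb.head.load(Ordering::Acquire);
    let tail = rb.tail.load(Ordering::Acquire);
    let mut pos = head;
    while pos != tail {
        // Drop every element that is still sitting in the queue.
        core::ptr::drop_in_place(rb.slot_ptr(pos));
        pos = rb.increment(pos);
    }
    // ... then free the single allocation ...
}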

@mgeier (Owner, Author)

I think in this case the threads are synchronized by means of is_abandoned, head cannot really be used for synchronization here.

When is_abandoned is a Relaxed increasing/decreasing counter, there is no actual synchronization going on in there. Relaxed is not for synchronization to begin with - it only ensures atomicity of the operation.

Yeah, but I think we need synchronization.

Which was my proposal: Use relaxed for the counter, but synchronize on head and tail.

I don't think that'll work, see below.

If the consumer would update is_abandoned using Relaxed, this would mean that storing head could be moved after that! The Release of head.store() wouldn't avoid that.

Correct, but... isn't that ok, actually?

I might be totally wrong, but if I'm right, this is very much not OK.

The consumer thread could do this:

  • update is_abandoned using Relaxed
  • go to sleep for a long time
  • much later, wake up and write to head using Release

The producer thread might do this:

  • check is_abandoned using Relaxed and realize that the consumer has been dropped
  • run the destructor
  • read head using Acquire, but it hasn't been written by the consumer thread, which is still sleeping!
  • free the wrong number of items

That's bad, isn't it?
Or am I missing something?

A similar problem could happen in the producer when reading head is moved before the is_abandoned check:

  • read head using Acquire
  • sleep for a long time, while the consumer can still write to head (which will never be observed by the producer thread) and then be dropped
  • wake up
  • check is_abandoned using Relaxed and realize that the consumer has been dropped
  • run the destructor with a totally wrong value of head

In the scenario you described, if the pop in C's thread happens after its increment of the counter, then P's thread - if it is the one that actually ends up on the second iteration of the counter - will pick up the write to head in the actual drop glue and won't proceed with freeing until that write Release happens.

I don't think it works that way.

What you are describing sounds like waiting for a mutex (or a semaphore or something like that). The "won't proceed with freeing" would imply that the thread is going to sleep or something?

At least that's how I understand Rel/Acq semantics - but I might be entirely wrong.

That's also how I understand it, but with an important difference: the Acquire operation is not waiting for anything. It's not a blocking operation.

If an Acquire operation happens before the corresponding Release operation, I guess that means that there is a bug somewhere, but there is no way that the thread could wait for the Release operation.

The order of the counter still doesn't matter - regardless of which happens before, the drop won't be executed twice, given only one possible value triggers it.

I agree that it only happens once, but the dropping still has to be synchronized.

Which thread then actually drops it does not seem relevant to me, so long as all writes to head and tail happen before and none of them can happen after.

Exactly.
But I think we can only accomplish this by using is_abandoned with Acquire/Release.

Note that if despite the intuitive order in your scenario (C pops, C drops, P drops -> queue drops) the C thread instead of the P thread actually ends up being the second to increase the counter and take care of the drop, then ordering is even less relevant - it can't be concurrent with itself. Same thing the other way round.

I agree. That's the simple case and I think it doesn't need any special treatment.

@alcore commented Mar 7, 2022

But I think we can only accomplish this by using is_abandoned with Acquire/Release.

I'm aware there's no 'waiting'. I was just thinking in terms of a concurrent race - not in terms of interrupts/pre-emption/sleeps, i.e. that due to the memory barrier, the correct order of the operations would remain preserved.

Although strictly speaking there's a bit of 'waiting' going on when 2 or more CPUs need to synchronize.

I'll try a different angle:

Note opinions like: https://www.reddit.com/r/rust/comments/e14gl6/can_two_orderingacquirerelease_pairs_cause/:

(...) So, when the documentation says "all previous operations" and "all subsequent operations", it means not just operations touching this specific atomic, but all operations touching data. (...)

For...

read head using Acquire, but it hasn't been written by the consumer thread, which is still sleeping!

... to happen:

  1. The compiler would've needed to re-order the store instruction after the drop glue call (which intuitively seems incorrect to begin with).

On an instruction level, one of the branches would contain an Acquire load on the value that the compiler is trying to store (and maybe re-order more efficiently). But the other one doesn't contain such a load.

a) If the compiler re-ordered the store after the branch, it would need to do this for both cases. I.e. insert the same instruction(s) twice, where it could've done it once. I'm fairly sure compilers don't do that.
b) In one of the branches there's an established happens-before relationship between the release store and acquire load. That relationship is supposed to guarantee the memory ordering of all the operations that happened before, and the release store happens before the acquire load.

Our branch in the drop glue is between those two - and it happens between that store and load.

  2. The hardware could theoretically reorder it too. However:

https://en.cppreference.com/w/cpp/atomic/memory_order

(...) atomic operations can be given an additional std::memory_order argument to specify the exact constraints, beyond atomicity, that the compiler and processor must enforce for that operation (...)

Acquire:

(...) no reads or writes in the current thread can be reordered before this load (...)

Release:

(...) no reads or writes in the current thread can be reordered after this store (...)

Our branch is between that 'after' and 'before'. And this includes our relaxed loads.

Further:

(...) All writes in the current thread are visible in other threads that acquire the same atomic variable (...).

It does not say "all writes to the same atomic variable". It says all writes.

But we're primarily concerned about the happens-before causality, not the visibility.

Basically my argument turned around in that: yes, you'd be right that if an interrupt happened and the store was re-ordered after the relaxed load, it'd be bad. I'm now, however, arguing that we were both wrong to begin with in assuming it could get re-ordered like that.

Think about it:

unsafe fn abandon<T>(buffer: NonNull<RingBuffer<T>>) {
    // The "store" part of `fetch_or()` has to use `Release` to make sure that any previous writes
    // to the ring buffer happen before it (in the thread that abandons first).
    if buffer
        .as_ref()
        .is_abandoned
        .fetch_or(true, Ordering::Release)
    {
        // The flag was already set, i.e. the other thread has already abandoned the RingBuffer
        // and it can be dropped now.

        // However, since the load of `is_abandoned` was `Relaxed`,
        // we have to use an `Acquire` fence here to make sure that reading `head` and `tail`
        // in the destructor happens after this point.
        fence(Ordering::Acquire);
        drop_slow(buffer);
    } else {
        // The flag wasn't set before, so we are the first to abandon the RingBuffer
        // and it should not be dropped yet.
    }
}

You're not touching head nor tail here either. In other words - if we were right and that Release store could be reordered, what's to prevent the compiler or CPU from putting it just before the 'drop_slow' in here and cause the very same issue? The critical section (between the release fetch_or and acquire fence) is even shorter here.

Same issue then even applies to the very initial compare exchange that you had in the first place.

Which in turn implies atomics are actually useless for synchronization.

And of course they are not.

Which is why I'm saying our assumption about the reordering appears wrong to me (now). But they're only useful for synchronization when we apply the mental model that the ordering of the operations between the Acq/Rel fences is preserved and that it applies to all loads/stores that happen before.

Also, since we started with Arc, the comments in its drop glue:

// > It is important to enforce any possible access to the object in one
// > thread (through an existing reference) to happen before deleting
// > the object in a different thread. This is achieved by a "release"
// > operation after dropping a reference (any access to the object
// > through this reference must obviously happened before), and an
// > "acquire" operation before deleting the object.

In Arc's case, the "object" is the reference counter. In our case that would be the head and tail, not anything else.

Edit: I.e. Arc can't synchronize on anything else, because the atomic counter is the one value that all threads possibly access and during a drop - if the only thing it did was a Release store (fetch_sub) of the counter, then it would not prevent a release store (fetch_add) in another thread from happening after. That's what the Acquire fence is for - it establishes the ordering where otherwise there might possibly only be releases (and relaxed increments in Arc::clone(), for that matter). In our case, we can have that Acquire in the actual drop.

If the relaxed increment in Arc::clone() could be reordered to happen after the Release in Arc::drop() (before the Acquire and the actual drop), then the situation would be identical to ours. I trust that this code - having a legacy in Boost and in common use - is sound, and therefore it stands to reason that our assumption about the possibility of it being reordered this way was wrong.

Although I do think that the comments in Arc::clone() are at the very least confusing (they refer to relaxed being fine due to knowledge about the reference - but that seems wrong, because it involves only 2 threads, not N threads, where the actual crucial part to relaxed being fine there is the drop glue, on which N threads synchronize - and it makes no mention of that).

The docs further state:

    // In particular, while the contents of an Arc are usually immutable, it's
    // possible to have interior writes to something like a Mutex<T>. Since a
    // Mutex is not acquired when it is deleted, we can't rely on its
    // synchronization logic to make writes in thread A visible to a destructor
    // running in thread B.

In our case we would acquire the "object" (head/tail) "when it is deleted", even if a merely relaxed counter inc/dec happens before.

The counter is not used for synchronization. It's merely used for branching - it's atomic for the atomic access, not synchronization, because the order of the increments is irrelevant so long as the order of the writes/loads to head/tail is preserved. That's what I meant when I wrote:

When is_abandoned is a Relaxed increasing/decreasing counter, there is no actual synchronization going on in there. Relaxed is not for synchronization to begin with - it only ensures atomicity of the operation.

Yeah, but I think we need synchronization.

And I never said we don't. But I said that we would in fact have that synchronization even if the counter was relaxed, so long as the loads of head and tail in the actual drop glue were ordered as Acquire.

For the record: I'm not actually strongly arguing for nor against something. This is more of a learning experience/exercise/puzzle to me and I hope you don't mind the walls of text. Also: I don't think I can come up with any further logic to make my argument - I'm spent now.

@mgeier (Owner, Author)

Basically my argument turned around in that: yes, you'd be right that if an interrupt happened and the store was re-ordered after the relaxed load, it'd be bad. I'm now, however, arguing that we were both wrong to begin with in assuming it could get re-ordered like that.

It is very much a possibility that we are both wrong.
I hope once our discussion is resolved we will agree and we will be both right!

Think about it:

unsafe fn abandon<T>(buffer: NonNull<RingBuffer<T>>) {
    // The "store" part of `fetch_or()` has to use `Release` to make sure that any previous writes
    // to the ring buffer happen before it (in the thread that abandons first).
    if buffer
        .as_ref()
        .is_abandoned
        .fetch_or(true, Ordering::Release)
    {
        // The flag was already set, i.e. the other thread has already abandoned the RingBuffer
        // and it can be dropped now.

        // However, since the load of `is_abandoned` was `Relaxed`,
        // we have to use an `Acquire` fence here to make sure that reading `head` and `tail`
        // in the destructor happens after this point.
        fence(Ordering::Acquire);
        drop_slow(buffer);
    } else {
        // The flag wasn't set before, so we are the first to abandon the RingBuffer
        // and it should not be dropped yet.
    }
}

You're not touching head nor tail here either. In other words - if we were right and that Release store could be reordered, what's to prevent the compiler or CPU from putting it just before the 'drop_slow' in here and cause the very same issue?

I'm not sure if I understand the question.

In this code example, the load and the store happen atomically in fetch_or().
They cannot be reordered relative to each other.

Let's look at this situation, where they are not contained in the same atomic operation:

    if buffer
        .as_ref()
        .is_abandoned
        .fetch_or(true, Ordering::Release)
    {
        let _ = buffer.as_ref().is_abandoned.load(Ordering::Acquire);
        drop_slow(buffer);
    }

According to the Release/Acquire rules we were talking about, the operations would be allowed to be reordered, but those are not the only rules!

There are still good old data dependencies!

The compiler/CPU is not allowed to re-order operations within a thread if one is writing to a memory location followed by another one reading from the same location.

A simplified code example would look like this:

a = x;
y = a;

Those are not allowed to be switched, regardless of whether a is atomic or not.

But I might still not fully understand your question, because in the thread that drops the ring buffer, the store operation is entirely irrelevant, because is_abandoned is unchanged, it just stays true.

Same issue then even applies to the very initial compare exchange that you had in the first place.

I don't follow. I don't see a problem there.

I think my original compare_exchange() as well as the fetch_or() in 0b6127e are correct, they are just not optimal.

Which in turn implies atomics are actually useless for synchronization.

And of course they are not.

Indeed.

Which is why I'm saying our assumption about the reordering appears wrong to me (now).

If we consider data dependencies (and release sequences) on top of the sequenced-before and sequenced-after rules, I think we have the right assumptions.

For the record: I'm not actually strongly arguing for nor against something. This is more of a learning experience/excercise/puzzle to me

It's the same for me, and I've already learned a lot from our interaction!

and I hope you don't mind the walls of text.

Not at all.

I've skipped the head/tail situation for now, but I can of course come back to it later, if necessary.

But I think first we should agree on our assumptions ...

Also: I don't think I can come up with any further logic to make my argument - I'm spent now.

I'm not in a hurry, but I would really like it if we could come to a consensus, which I guess would make both of us a bit more comfortable with atomics.

@mgeier (Owner, Author) commented Apr 18, 2022

I tried using fence(), but as suggested elsewhere, it is not supported by ThreadSanitizer.

So instead I added a dummy load(Acquire), as mentioned in #75 (comment): 32ccd67
