Use atomic operations and a read lock instead of a write lock #945
Conversation
LGTM
Nice catch noticing that this can all be done with atomic operations, so the write lock isn't necessary. I'm still surprised by how large the difference is that you measured.
I haven't looked at this in depth yet, but FWIW I'm pretty sure you can't just do atomic writes to integers while reading them non-atomically elsewhere (even when holding the read lock, which in this case isn't relevant).
We may want to pursue tracking partition/lastUpdate in a separate structure that is entirely based on atomics; just thinking out loud.
It seems that it would be easy enough to create an AtomicUint32 (and the whole family) to encapsulate these actions. The trick is that MetricDefinition would need to use it all over...
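For illustration, a minimal sketch of what such a wrapper could look like (the type and method names are hypothetical, not from this PR; the idea is that MetricDefinition fields would be declared as AtomicUint32 so every access goes through sync/atomic):

```go
package idx

import "sync/atomic"

// AtomicUint32 wraps a uint32 so that every read and write goes through
// sync/atomic, making it impossible to accidentally mix atomic and
// non-atomic access to the same field.
type AtomicUint32 struct {
	v uint32
}

// Load returns the current value with an atomic read.
func (a *AtomicUint32) Load() uint32 {
	return atomic.LoadUint32(&a.v)
}

// Store sets the value with an atomic write.
func (a *AtomicUint32) Store(val uint32) {
	atomic.StoreUint32(&a.v, val)
}

// Swap stores val and returns the previous value.
func (a *AtomicUint32) Swap(val uint32) uint32 {
	return atomic.SwapUint32(&a.v, val)
}
```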
Related: the in-memory index (including its types and MetricDefinition) is due for a make-over anyway.
Force-pushed from 3f91288 to 426815f
So, we were running the "unsafe" version of this (mixing atomic/non-atomic accesses) for over a month and didn't see issues, but I recently rolled out a new build without this branch because I didn't want race conditions in there and I wasn't ready to commit to this change. However, without this change I saw about a 30% slowdown in backfill processing. So, I decided to just bite the bullet and look through all the
Opened #969 for the failed test.
Force-pushed from 0c15f36 to f4b7793
-	if existing.LastUpdate < int64(point.Time) {
-		existing.LastUpdate = int64(point.Time)
+	if atomic.LoadInt64(&existing.LastUpdate) < int64(point.Time) {
+		atomic.SwapInt64(&existing.LastUpdate, int64(point.Time))
Note that this is racy.
Let's say existing.LastUpdate is very old (30 days ago).
Then a point comes in for 29 days ago, and concurrently another one for a day ago via a different kafka partition, and then no more points.
In that case, we can have concurrent Update calls, resulting in the LastUpdate field being updated to 29 days ago, but never to a day ago.
Note that for any given kafka partition, carbon stream or prometheus POST we never have overlap in our Update calls.
So in practice this doesn't seem like an issue, but perhaps we should document it under "tradeoffs and extremely rare edge cases" or something.
Or we can solve it by either:
- doing CompareAndSwap in a loop until we're able to swap in the value we wanted to swap (see the sketch below)
- confirming the swapped-out value (the return value of SwapInt64) is smaller than what we swapped in; if not, put the old value back, check that we didn't swap out an even higher value (placed by a concurrent Update call), etc.
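A sketch of the first option, assuming a hypothetical helper (setIfNewer is not code from this PR): raise LastUpdate monotonically with a CompareAndSwap loop, so a newer timestamp stored by a concurrent Update is never overwritten by an older one.

```go
package idx

import "sync/atomic"

// setIfNewer atomically raises *lastUpdate to ts, but never lowers it.
func setIfNewer(lastUpdate *int64, ts int64) {
	for {
		cur := atomic.LoadInt64(lastUpdate)
		if ts <= cur {
			return // an equal or newer timestamp is already stored
		}
		if atomic.CompareAndSwapInt64(lastUpdate, cur, ts) {
			return // our newer timestamp won
		}
		// cur changed between the load and the CAS; reload and retry
	}
}
```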
This operation is racy at that point even with a write lock. It depends on the order in which individual threads hit that lock call, which (with the hypothetical assumption that data can come in for the same series from different threads) can be out of order relative to the kafka ingest step.
I'm hesitant to add anything overly complex to Update for no real-world benefit, but I'll defer to your preference.
I don't understand this. If you use a write lock, you can lock, check the value, update if we have a larger one, then unlock. This works regardless of the order between two concurrent update operations (the most recent will always survive).
I think my proposal above will also solve it, and at almost no additional cost.
I was extending it to the Partition update as well. Perhaps Partition should only be updated when we have a newer timestamp as well?
The partition property is really only there to shard the index in cassandra, so nodes on startup know which metrics they're responsible for.
I'm not sure if we even properly support live partition changes of metrics (i.e. whether after the change we properly display both the old and new data).
Under concurrent updates it's probably OK for the partition to be ambiguous ("under transition"), but once update operations are serialized, the later one should probably always win, even when data is sent out of order. I think MT's behavior in these scenarios is so undefined that probably either way works.
Yes, this PR definitely got us back to where we wanted to be. Ingest rate is not very consistent, but we average ~90k dp/s/core (on 8 cores; we see spikes up to 1M dp/s but average ~700k). Without this change, we were barely breaking 400k dp/s. We have been running this in production for about 2 months now with no noticeable issues. I really look forward to seeing if it benefits your speeds as well (the trade-off, I suppose, is greater CPU usage during ingest).
I have a new branch:
Then I filled up kafka with some MetricData and tested ingestion with each version (twice).
https://snapshot.raintank.io/dashboard/snapshot/hMzSp4LGcBaJ5iKDrvrWMueMWzkxTUL6?orgId=2 CPU difference looks fine (tiny; if anything, proportional to the increased ingest, but perhaps even less). Sound good @shanson7?
Yeah, looks great to me! I'm excited to see if you see a difference in ingest speed as pronounced as we did.
This PR was an attempt to reduce the exclusive lock section in the ingest path.
The idea is that the map isn't truly being modified, so we don't need to hold a write lock. The behavioral change is that if the same point is Updated by two threads, the partition/LastUpdate is not guaranteed to match. In practice, I believe that LastUpdate should pretty much be near realtime (and is mostly heuristic anyway). The partition shouldn't change frequently anyway and should be eventually consistent.
Similar changes could be made to AddOrUpdate (optimistically acquiring a read lock), but I wasn't sure how many calls to AddOrUpdate actually resulted in a write.
In our setup, we saw a 30%-40% bump in our backlog processing from this change.
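To illustrate the pattern the description refers to, here is a minimal, simplified sketch (not the PR's actual code; MemoryIdx, defsById and the field layout are placeholders): the read lock plus atomic field updates cover the common case of an existing series, and the write lock is only taken when a new series must be inserted into the map.

```go
package idx

import (
	"sync"
	"sync/atomic"
)

type MetricDefinition struct {
	LastUpdate int64
	Partition  int32
}

type MemoryIdx struct {
	sync.RWMutex
	defsById map[string]*MetricDefinition
}

func (m *MemoryIdx) AddOrUpdate(id string, ts int64, partition int32) {
	// Optimistic path: most points hit an existing series, so a read
	// lock plus atomic field updates is enough.
	m.RLock()
	def, ok := m.defsById[id]
	m.RUnlock()
	if ok {
		if atomic.LoadInt64(&def.LastUpdate) < ts {
			atomic.SwapInt64(&def.LastUpdate, ts)
		}
		atomic.StoreInt32(&def.Partition, partition)
		return
	}

	// Slow path: the series is new, so the map itself changes and we
	// need the write lock. Re-check under the lock in case another
	// goroutine inserted the series in the meantime.
	m.Lock()
	defer m.Unlock()
	if def, ok := m.defsById[id]; ok {
		if atomic.LoadInt64(&def.LastUpdate) < ts {
			atomic.SwapInt64(&def.LastUpdate, ts)
		}
		return
	}
	m.defsById[id] = &MetricDefinition{LastUpdate: ts, Partition: partition}
}
```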