ConcurrentModificationException FileManager::saveToFile #3752
It looks like the list is mutated on the user thread while the FileManager's save-file task iterates over it for serialisation; the offending code updates the data structure while iterating it.
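As a minimal, self-contained illustration of the failure mode (hypothetical code, not the Bisq source), here is the iterate-while-mutating pattern that makes a plain ArrayList throw:

```java
import java.util.ArrayList;
import java.util.List;

public class RaceDemo {
    public static void main(String[] args) throws InterruptedException {
        List<String> ballots = new ArrayList<>();
        for (int i = 0; i < 100_000; i++)
            ballots.add("ballot-" + i);

        // Stand-in for the 'save-file-task-X' writer thread iterating for persistence.
        Thread writer = new Thread(() -> {
            long n = 0;
            for (String b : ballots)   // usually dies with ConcurrentModificationException
                n += b.length();
            System.out.println(n);
        });
        writer.start();

        // Stand-in for the UserThread mutating the same list mid-iteration.
        for (int i = 0; i < 1_000; i++)
            ballots.add("late-" + i);
        writer.join();
    }
}
```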
Confirmed the bug on v1.2.3 to show it wasn't a regression. The following patch causes the bug to happen:
I also found a comment in AddressEntryList suggesting this issue was noticed before. But the way it works around the problem isn't thread-safe either, because the ArrayList copy constructor is not thread-safe. It is actually worse than the iteration case, because a modify-while-copying race won't even throw an exception; see the sketch below.
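A sketch of why the copy-constructor workaround fails silently (hypothetical names, not the AddressEntryList code): the copy constructor reads the source list's size and backing array non-atomically, so a concurrent mutator can produce a torn snapshot with no exception at all.

```java
import java.util.ArrayList;
import java.util.List;

public class UnsafeCopyDemo {
    public static void main(String[] args) throws InterruptedException {
        List<Integer> entries = new ArrayList<>();

        Thread mutator = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++)
                entries.add(i);
        });
        mutator.start();

        // new ArrayList<>(c) copies via c.toArray(); against a concurrently
        // growing list the snapshot may contain nulls or miss elements,
        // and it usually completes without throwing anything.
        List<Integer> snapshot = new ArrayList<>(entries);
        mutator.join();
        System.out.println("copied " + snapshot.size() + " of " + entries.size());
    }
}
```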
I went through and audited all the users to figure out the best way to solve this in the short and long term. Here are the current users of the FileManager and the data structures they use when iterating for persistence, categorised as Likely Unsafe, Likely Safe, or Unknown.
DaoStateStore is stored at each snapshot (every 20 blocks) as well as at sync (resync), so I assume it has the same issue. What would you recommend to fix the issues? Using concurrent lists/maps?
I spent the afternoon running some tests and profiling the impact of a few ways to handle this. I wanted to share the options I came up with to get some feedback from others who may have encountered this in the past or have thought about it for more than a few hours.

Use concurrent data structures

Replace all data structures with their concurrent options: ArrayList -> CopyOnWriteArrayList and HashMap -> ConcurrentHashMap. This will fix the immediate problem by ensuring that a change in the underlying data structure during iteration won't cause an exception. But there are two issues to consider. First, the concurrent collections only make iteration safe; they don't guarantee that the writer thread sees a consistent snapshot of the whole object at save time. Second, the real problem is that the FileManager is running arbitrary methods (toProtoMessage) on its writer thread against objects that are mutated on the UserThread.

Create protobuf objects on the UserThread instead of the FileManager writer thread

This seems like the cleanest design moving forward. Current and future persistable objects would be correct by default, because the protobuf snapshot is taken on the same thread that performs the mutations (a sketch of this option follows this comment). But the overhead for building the protobuf message on each save is pretty significant. On average, it takes around 1500ms for DaoStateStore, and that work would now block the UserThread.

Hybrid solution

There may be a hybrid solution where we don't generate protobuf messages on the UserThread for the really expensive objects. Additionally, it might make sense to just whitelist those expensive persistent objects, roll our own synchronization for them, and find a way for "most" users to not care. That would guarantee correctness and allow profiling to determine where to optimize, instead of bugs.
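A rough sketch of the second option, with hypothetical names (UserThreadSnapshotSaver, queueing via plain executors standing in for Bisq's UserThread and the FileManager writer thread): the snapshot is built on the mutation-owning thread, and only the immutable protobuf message crosses the thread boundary.

```java
import com.google.protobuf.Message;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class UserThreadSnapshotSaver {
    interface PersistableEnvelope { Message toProtoMessage(); }

    private final ExecutorService userThreadExecutor = Executors.newSingleThreadExecutor();
    private final ExecutorService writerExecutor = Executors.newSingleThreadExecutor();

    void save(PersistableEnvelope envelope) {
        userThreadExecutor.execute(() -> {
            // Snapshot on the thread that owns all mutations, so no locks
            // are needed on the live data structures...
            Message snapshot = envelope.toProtoMessage();
            // ...then hand the immutable message to the writer thread.
            writerExecutor.execute(() -> writeToDisk(snapshot));
        });
    }

    private void writeToDisk(Message snapshot) {
        // Placeholder for FileManager's write-to-temp-file-and-rename logic.
    }
}
```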
Thanks for looking into it. It is a tricky topic.... If mapping to protobuf happens on the UserThread, the expensive objects could block the UI, so I think it's probably best to first analyse which objects cost a lot of CPU time for the protobuffer conversion (toProtoMessage). To let the storage objects decide how the threading is handled would be good. For the large objects (like the DaoState) a dedicated strategy is likely needed. If this all turns out to be a larger and difficult project, we should also consider whether a database solution would be better, but that is probably also a bigger project...
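A simple way to collect the per-store numbers suggested here, assuming access to each store's toProtoMessage (ProtoCostProbe and timeConversion are illustrative names; getSerializedSize is the standard protobuf API):

```java
import com.google.protobuf.Message;
import java.util.function.Supplier;

// Illustrative micro-measurement of protobuf conversion cost per store.
class ProtoCostProbe {
    static Message timeConversion(String storeName, Supplier<Message> toProto) {
        long start = System.nanoTime();
        Message msg = toProto.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(storeName + ".toProtoMessage() took " + elapsedMs
                + " ms, " + msg.getSerializedSize() + " bytes");
        return msg;
    }
}

// Usage (hypothetical): ProtoCostProbe.timeConversion("DaoStateStore", daoStateStore::toProtoMessage);
```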
I got a little too deep into the perf work here and wanted to share a summary of the information for the person that picks this up.

I think guaranteeing correctness outweighs performance at a high level. The current code "works" because there isn't a lot of state change once Bisq is up and running, and even if there is corruption, it is likely overwritten quickly enough by the next save. All of this persistence is just a network-bandwidth optimization, so in the worst case, refetching everything from the seed nodes is an option.

But fixing the correctness while keeping it performant is not a small project. The persisted objects have multiple fields and mutable getters/setters, which makes it extremely difficult to understand the calling contexts for each object and how to efficiently synchronize them. There are plenty of ways this can be done at the design level, but translating that into actual PRs will require quite a bit of work. With that said, I have a few small diffs and traces I want to share to help future readers understand the performance.

Make it correct

This patch just moves the toProtoMessage call onto the UserThread:
[BUGFIX]_Call_toProtoMessage_on_UserThread.txt

Make it fast (sort of...)

If the snapshot is taken on the UserThread for correctness, the expensive stores dominate the cost, so the interesting questions are about DaoStateStore:

Why is DaoStateStore slow? Millions of "fast" calls. Having to serialize every transaction and every block is expensive.

Can we cache Block/Tx toProtoMessage? No. It turns out that caching those objects not only increases the memory footprint, but assembling the parent message from the cached pieces is still just as expensive.

Can we cache serialized DaoState for faster iterative saves?

Can we use concurrency? Yes. Using a cloned snapshot so the serialisation and write can run on the save-file-task thread keeps the UserThread responsive (see the sketch after this comment).

So overall, these are the "best" options that I found in my investigation that don't require an extended dev effort. Performance is tricky, and there isn't enough testing of the persistable objects for me to feel confident making changes without a lot of manual-testing overhead or a high risk of introducing bugs. See #3773 for my comments on similar perf optimizations in the same layer.

I am also just documenting this here, as opposed to opening PRs, because I don't think I am the right person to be responsible for this change and its performance impacts moving forward. I'm happy to help investigate, but someone with much more context on the system should be responsible for the maintenance of it.
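A sketch of the clone-then-serialise idea for the DaoStateStore-style case, under assumed names (DaoStateLike, getClone, SnapshotPersister are placeholders): the user thread produces a private deep copy quickly, and the expensive toProtoMessage plus the disk write run on a background thread against that copy.

```java
import com.google.protobuf.Message;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SnapshotPersister {
    interface DaoStateLike {
        DaoStateLike getClone();     // comparatively cheap deep copy, done on the user thread
        Message toProtoMessage();    // expensive, but safe to run on a private copy
    }

    private final ExecutorService saveFileTask = Executors.newSingleThreadExecutor();

    // Must be called from the user thread so the clone sees a consistent state.
    void persist(DaoStateLike liveState) {
        DaoStateLike snapshot = liveState.getClone();
        saveFileTask.execute(() -> writeToDisk(snapshot.toProtoMessage()));
    }

    private void writeToDisk(Message message) {
        // Placeholder for the atomic file write.
    }
}
```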
Thanks @julianknutsen for the work on that! I think performance optimisation in the DaoState is required anyway (and I saw some work has started there).

For the nr. 2 and 3 performance-bottleneck candidates (trade stats, account age witness), I think we should look into a solution that splits up the historical data and keeps only the most recent data in a dynamic data structure, which would become much smaller (I estimate a few % of the current data size). As that data is "append only" and carries much less complexity than the DAO domain, I think it might be a good candidate for improvement.

SequenceNumberMap uses a max size; we might consider lowering that so it decreases the object size and with it the CPU cost of persisting. Tuning the persistence intervals can also help here.

Preferences should not be written to disk often; if it is, we should find out why and fix it if it's not required. Here too we can increase the persistence interval.
Add toProtoMessageSynchronized() default method to PersistableEnvelope, which performs (blocking) protobuf serialisation in the user thread, regardless of the calling thread. This should prevent data races like the ConcurrentModificationException observed in bisq-network#3752, under the reasonable assumption that shared persistable objects are only mutated in the user thread.

Also add a ThreadedPersistableEnvelope sub-interface overriding the default method above, to let objects which are expensive to serialise (like DaoStateStore) be selectively serialised in the 'save-file-task-X' thread as before, but directly synchronised with each mutating op. As most objects are cheap to serialise, this avoids a noticeable perf drop without having to track down every mutating method for each store.

In all cases but one, classes implementing ThreadedPersistableEnvelope are stores like TradeStatistics2Store, with a single ConcurrentHashMap field. These require no further serialisation, since the map entries are immutable, so the only mutating operations are map.put(..) calls, which are already synchronised with map reads. (Even if map.values().stream() sees updates at different keys happen out-of-order, it should be benign.)

The remaining case is DaoStateStore, which is only ever reset or modified via a single persist(..) call with a cloned DaoState instance and hash chain from DaoStateSnapshotService, so there is no aliasing risk from the various DAO state mutations done in DaoStateService and elsewhere.

This should fix bisq-network#3752.
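A minimal sketch of the interface pair this commit message describes, assuming Bisq's existing static UserThread.execute(Runnable) helper; the exact signatures in the real patch may differ.

```java
import com.google.protobuf.Message;
import java.util.concurrent.CompletableFuture;

interface PersistableEnvelope {
    Message toProtoMessage();

    // Serialise in the user thread regardless of the calling thread, blocking
    // the caller until the snapshot is taken. Assumes the caller is the
    // save-file-task thread; calling this from the user thread itself would
    // deadlock in this naive sketch.
    default Message toProtoMessageSynchronized() {
        CompletableFuture<Message> future = new CompletableFuture<>();
        UserThread.execute(() -> future.complete(toProtoMessage()));
        return future.join();
    }
}

// Expensive stores (e.g. DaoStateStore) opt back into serialising on the
// 'save-file-task-X' thread, synchronising directly with their mutating ops.
interface ThreadedPersistableEnvelope extends PersistableEnvelope {
    @Override
    default Message toProtoMessageSynchronized() {
        return toProtoMessage();
    }
}
```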
Description
BallotList is not thread-safe, but is used by the FileManager to persist to disk. All users that extend PersistableList can have the same race condition.

Regression: No
Version
v1.2.4
v1.2.3 as well
Steps to reproduce
I have only seen it once when trying to reproduce:
./bisq-desktop
See the comments for a patch that can be applied to v1.2.3 to force the bug.
Expected behaviour
All data structures used with FileManager are thread-safe
Actual behaviour
Some data structures used with FileManager are not thread-safe
Screenshots
Device or machine
Ubuntu 19.10
Additional info
bisq.log