irmin-pack: investigate performance regression with GC #1959
When the worker runs, it creates a new prefix file (sparse file) and a new suffix file. The new suffix file is created by copying directly from the current suffix file (from some chosen offset to the end): the worker asks for the length of the suffix and then copies the bytes from that offset to the end of the current suffix into the new suffix. But during this copy the suffix is "live" and being written to by the main process. If the copy takes a long time (and it does with Tezos), the current suffix may grow substantially. So the worker should repeatedly copy from the current suffix, until either very few bytes (if any) are left to copy, or some small finite limit on the number of copies is reached - say 7 iterations to be safe; we don't want an infinite loop where the worker and the main process continue in parallel forever.

Now, after the worker terminates the current suffix is still live, hence potentially updated, so the new suffix (that was just created by the worker) is then out of date. So the main process must also copy any extra data at the end of the current suffix to the new suffix before switching over.

In the current code, the worker copies data from the current suffix to the new suffix in this file: https://github.com/mirage/irmin/blob/main/src/irmin-pack/unix/gc.ml. The copy happens at step 7, "transfer to the next suffix", but from the code it looks like the copy is performed only once. This means that, when the worker terminates and hands back to the main process, the main process might have to copy a lot more data than if the "step 7 transfer" had been repeated. This is likely one source of the slowdown. (I expect there are other sources as well, which I will come back to tomorrow.)
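A minimal sketch of the repeated-copy idea described above, assuming hypothetical helpers `suffix_length` and `copy_range` (not irmin's actual API): the worker keeps copying newly appended bytes until almost nothing is left, or a small iteration cap is hit, leaving only the last few bytes for the main process.

```ocaml
(* Sketch of the worker-side repeated copy. [suffix_length ()] is assumed to
   return the current length of the live suffix, and [copy_range ~src_off ~len]
   to copy [len] bytes at [src_off] from the live suffix into the new suffix;
   neither is irmin's real API. *)
let copy_until_caught_up ~start_off ~suffix_length ~copy_range =
  let max_iterations = 7 in          (* avoid racing the writer forever *)
  let small_enough = 4096L in        (* hand over once few bytes remain *)
  let rec loop copied_up_to iter =
    let remaining = Int64.sub (suffix_length ()) copied_up_to in
    if remaining <= small_enough || iter >= max_iterations then
      copied_up_to   (* the main process copies the remaining "newies" *)
    else begin
      copy_range ~src_off:copied_up_to ~len:remaining;
      loop (Int64.add copied_up_to remaining) (iter + 1)
    end
  in
  loop start_off 0
```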
This is the main unknown for the release - we need to at least understand what is happening. @tomjridge's current assumption is that it is related to copying the suffix in the main process, which takes time (as it's a lot of data). One possible fix would be to make the child process do more work: we might need a loop in the child to copy data if there were concurrent writes in the parent (i.e. if the max offset has changed since the copy started).
Another possible cause of slowdown: when the main process switches to the new prefix (sparse file), it has to load a mapping file. Previously this was loaded very quickly, but the new code may be slower.
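One way the mapping load could be kept cheap at switch time - purely a sketch under the assumption that the mapping is a flat array of fixed-size 64-bit entries, which may not match irmin's actual on-disk format - is to memory-map the file rather than read and decode it eagerly:

```ocaml
(* Sketch: memory-map a mapping file as an int64 bigarray, so switching to the
   new prefix only pays for the pages actually touched. Assumes a flat array
   of 64-bit entries; irmin's real mapping format may differ. *)
let map_mapping_file path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  Fun.protect
    ~finally:(fun () -> Unix.close fd)
    (fun () ->
       Unix.map_file fd Bigarray.int64 Bigarray.c_layout false [| -1 |]
       |> Bigarray.array1_of_genarray)
```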
Another possibility: suppose that at the end of a commit we issue an fsync (I don't know whether we do - likely we don't, but let's assume we do), that the filesystem executes an fsync by flushing all filesystem buffers (not just those related to the fsync'ed file - this is quite common in filesystem implementations), and that a large copy was concurrently in progress (the GC copying the suffix file to make the new suffix file); then the fsync would take much longer than it would with no GC running, and so the overall time for the commit would be longer. I don't think this possibility is likely; I'm just recording it here so I don't forget to investigate it.
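To check this hypothesis, one could simply time the commit-side fsync with and without a concurrent GC copy; a rough sketch (assuming `fd` is the descriptor the commit path writes to):

```ocaml
(* Sketch: time an fsync so its latency can be compared with and without a
   concurrent GC copy running. [fd] is assumed to be the descriptor the
   commit path writes to. *)
let timed_fsync fd =
  let t0 = Unix.gettimeofday () in
  Unix.fsync fd;
  let dt = Unix.gettimeofday () -. t0 in
  Printf.eprintf "fsync took %.3f ms\n%!" (dt *. 1000.);
  dt
```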
A zoom in on the transfer of remaining newies by the main process (this is without #1986): in this benchmark at least, not doing a transfer loop in the child is not costly - the main process spends less than 100 ms on the transfer. It still might be useful to have the transfer loop in the child though, as there is a correlation between the newies left and the time to transfer.
Now that the mapping doesn't need to be read at finalise time, unlink dominates the finalisation time.
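If unlink really dominates, one possible mitigation - sketched here only; whether it fits irmin's file lifecycle is an open question - is to push the unlinks onto a background thread so finalisation returns sooner:

```ocaml
(* Sketch: unlink the old suffix/prefix files on a detached thread (requires
   the threads library) so finalisation does not wait on the filesystem.
   Whether this is safe for irmin's lifecycle is an assumption, not a claim. *)
let unlink_in_background paths =
  let worker () =
    List.iter (fun p -> try Unix.unlink p with Unix.Unix_error _ -> ()) paths
  in
  ignore (Thread.create worker ())
```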
I ran a benchmark with the latest changes. Here are the latest results from different runs (including one without gc for a baseline comparison):
Could you also compare with 3.3.2?
3.3.2 vs 3.4 (what is on ...):
I ran a fresh 3.2.2 benchmark to compare with 3.4's new IO. Full summary: see the collapsed sections below for a few excerpted areas of the summary and light commentary. I think the original motivation for opening this issue has been mostly addressed. We still have room for performance improvement, but let's follow up with specific issues.
Commit times: the tail latency is noticeably higher for 3.4, but on average and at best, 3.4 appears to be an improvement.
Memory: we are using more (for the main process - still need to evaluate the child process properly) but are still within a good range. A notable stat: `minflt` increased significantly for the gc run vs no gc and 3.2.2.
Disk: there is a significant reduction in disk bytes read and a slight increase in disk bytes written (although again this does not take into account the gc process).
Running the `lib_context` benchmarks on the trace with the first 2 months of Hangzhou (so what we use for the release benchmarks). The GC is called every cycle (8100 blocks); it simulates the tezos scenario for a bootstrap. For this run, the GC was called 17 times. The logs also report two times for the GC - the start_gc and the finalise_gc. It is the finalise_gc that is the culprit:
This is probably due to transferring the latest newies, but I'm not sure. However, if the GC is run at a slower pace (not for bootstrapping), it's possible that this finalise_gc does not take so long; either way, I think this is something we need to look into.