fix results from race condition in MD5 list generation #1471
Conversation
That's an interesting test case. I'll try to reproduce it. In the meantime, does this happen before P99 completes? If so, I'll try to see what is taking memory there. Otherwise, the bug might be elsewhere. What I mean: assume some Sxx module uses a lot of memory to process busybox and that is causing what you experienced. It's possible that without the patch that Sxx module does not process busybox because of the duplicate MD5 (as occurred with S09 in issue #1464, for example, but in another module). With the patch, busybox and its MD5 are there only once, and that Sxx module now processes it and triggers the memory-hungry processing.
Yes, it is during the P99 run. Memory starts filling up very soon, and after a while the system hangs and starts killing processes.
Identified cause: since a separate process is spawned for each file to process (as per the existing code), thousands of processes are created on large firmware images, and with this PR most of them end up waiting for their turn on the lock.

Compound effect: the larger the firmware, the larger the "done MD5s" temp file grows and the longer the critical section takes to execute (not by much, but still...). This makes the other queued processes wait longer and more prone to pile up.

Fix in latest commit:
Overall, for this PR at this point, I noticed a performance degradation compared to before the PR (for example, on a firmware where P99 reports 36000 files to process, it went from 7 to 17 minutes on my system), but at least the result is correct. The rest of the analysis takes several hours, so 10 more minutes is not that much, but it is still not welcome. If performance really is an issue, there might be options, although I can't tell how much (if at all) they would improve performance:
Remarks

On filtering PIDs: I extracted the filtering part from …

For the main loop in this PR, I implemented the same pattern as in other modules, which looks increasingly common across them. I mean something like:

for lBINARY in "${ALL_BINS_ARR[@]}" ; do
  # Process each file in its own background process.
  process_that_file_in_a_new_process "${lBINARY}" &
  local lTMP_PID="$!"
  store_kill_pids "${lTMP_PID}"
  lTASK_PIDS_ARR+=( "${lTMP_PID}" )
  # Throttle: once too many tasks are in flight, drop finished PIDs from the array
  # and, if still above the limit, block until enough of them have exited.
  if [[ "${#lTASK_PIDS_ARR[@]}" -gt "${MAX_MOD_THREADS}" ]]; then
    filter_out_dead_pids lTASK_PIDS_ARR
    if [[ "${#lTASK_PIDS_ARR[@]}" -gt "${MAX_MOD_THREADS}" ]]; then
      max_pids_protection $(( "${MAX_MOD_THREADS}"*FUDGE_FACTOR )) "${lTASK_PIDS_ARR[@]}"
    fi
  fi
done
wait_for_pid "${lTASK_PIDS_ARR[@]}"

Some modules use only …
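As an aside: the body of such a PID-filtering helper is not shown in this thread. A minimal sketch of what it could look like, assuming it takes the array name by reference and keeps only the PIDs that are still alive (an illustration, not the actual EMBA implementation):

filter_out_dead_pids() {
  # The PID array is passed by name (bash 4.3+ nameref).
  local -n lPID_ARR_REF="${1}"
  local lALIVE_PIDS_ARR=()
  local lPID=""
  for lPID in "${lPID_ARR_REF[@]}"; do
    # kill -0 sends no signal; it only checks whether the PID still exists.
    if kill -0 "${lPID}" 2>/dev/null; then
      lALIVE_PIDS_ARR+=( "${lPID}" )
    fi
  done
  lPID_ARR_REF=( "${lALIVE_PIDS_ARR[@]}" )
}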
Regarding the longer runtime, I would give this a quick test:

In my testing scenario this looks quite fast:
I have also seen that field 8 is sometimes corrupt, and we need to ensure the CSV file remains intact in the P99 module. The file output needs to be filtered for ';' characters here: emba/helpers/helpers_emba_prepare.sh, line 209 in b471564

I included a quick filter in the call in PR #1473
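For illustration only, such a filter boils down to stripping the delimiter from the field value before the CSV line is written; the variable names and column layout below are assumptions, not the actual change from #1473:

# ';' is the CSV delimiter, so it must not appear inside a field value.
lFILE_OUT="$(file -b "${lBINARY}" 2>/dev/null | tr -d ';')"
echo "${lBINARY};${lMD5SUM};${lFILE_OUT}" >> "${P99_CSV_LOG}"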
I tried with/without …

Also, without … Therefore, I reverted all changes and added a post-processing … Although …
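The details of that post-processing step are not visible in the comment above; purely as an illustration of one way to deduplicate a CSV by its MD5 field after the fact (the delimiter, column number, and file name are assumptions):

# Keep only the first occurrence of each MD5; column 2 and the ';' delimiter are assumed.
awk -F';' '!seen[$2]++' "${P99_CSV_LOG}" > "${P99_CSV_LOG}.dedup" && mv "${P99_CSV_LOG}.dedup" "${P99_CSV_LOG}"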
Thank you for pointing out this issue and fixing it.

I totally agree. A lot of the code and ideas have grown massively and were never refactored. I hope you are interested in bringing up the refactoring of the process handling again in a dedicated PR. Thank you for your effort. Great work!
Fixes issue #1464
See issue #1464
A critical section is implemented through flock and prevents the race condition. MD5 sums in p99_md5sum_done.tmp and P99_CSV_LOG are now unique.

No.
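For readers unfamiliar with the pattern, a check-and-append guarded by flock generally looks like the following. This is a minimal sketch, not the PR's actual code; the lock path, temp file location, and helper name are assumptions:

if (
  # Exclusive lock on fd 200; concurrent workers block here until the lock is free.
  flock -x 200
  # Critical section: the check and the append must happen atomically across processes.
  grep -qx "${lMD5SUM}" "${TMP_DIR}/p99_md5sum_done.tmp" 2>/dev/null && exit 1
  echo "${lMD5SUM}" >> "${TMP_DIR}/p99_md5sum_done.tmp"
) 200>"${TMP_DIR}/p99_md5sum.lock"; then
  # The MD5 was new, so this file gets processed exactly once (hypothetical helper name).
  process_that_file_in_a_new_process "${lBINARY}"
fi

Keeping only the lookup and the append inside the lock keeps the critical section short, which matters once thousands of workers contend for it, as discussed earlier in this thread.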
Before the fix, with the test "firmware" from issue #1464, duplicate MD5 entries would occur once every few runs. After the fix they do not seem to occur at all, but I don't know how to devise a test that demonstrates a race condition can't occur at that point.
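A race can't be proven absent, but a probabilistic stress test can at least make regressions visible: hammer the locked check-and-append from many concurrent workers and assert that no value was recorded twice. A rough, self-contained sketch under the same naming assumptions as above:

#!/bin/bash
# Stress a flock-guarded check-and-append with many concurrent workers,
# then verify that no value ended up in the done-list twice.
TMP_DIR="$(mktemp -d)"
DONE_FILE="${TMP_DIR}/p99_md5sum_done.tmp"
LOCK_FILE="${TMP_DIR}/p99_md5sum.lock"

add_md5_once() {
  local lMD5SUM="${1}"
  (
    flock -x 200
    grep -qx "${lMD5SUM}" "${DONE_FILE}" 2>/dev/null && exit 1
    echo "${lMD5SUM}" >> "${DONE_FILE}"
  ) 200>"${LOCK_FILE}"
}

# 50 rounds of workers all racing to register the same 20 values.
for _ in $(seq 1 50); do
  for lNUM in $(seq 1 20); do
    add_md5_once "md5_${lNUM}" &
  done
done
wait

# Fail if any value was recorded more than once.
if [[ -n "$(sort "${DONE_FILE}" | uniq -d)" ]]; then
  echo "FAIL: duplicate entries found"
else
  echo "OK: all entries unique"
fi

Running the unlocked variant through such a harness should surface duplicates within a few runs, matching the once-every-few-runs behaviour described above.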