
fix results from race condition in MD5 list generation #1471

Merged
merged 5 commits into e-m-b-a:master on Feb 20, 2025

Conversation


@gluesmith2021 gluesmith2021 commented Feb 17, 2025

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Fixes issue #1464

  • What is the current behavior? (You can also link to an open issue here)

See issue #1464

  • What is the new behavior (if this is a feature change)? If possible add a screenshot.

A critical section, implemented with flock, prevents the race condition. MD5 sums in p99_md5sum_done.tmp and P99_CSV_LOG are now unique (a minimal sketch of the flock pattern is shown at the end of this description).

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No.

  • Other information:

Before the fix, with the test "firmware" from issue #1464, duplicate MD5 entries would occur once every few runs. After the fix they do not seem to occur at all, but I don't know how to devise a test that demonstrates a race condition can't occur at that point.
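For reference, a minimal sketch of the kind of flock-guarded critical section described above (function and lock-file names are illustrative, not the exact code in this PR):

    # Illustrative sketch only: record an MD5 in the shared "done" list exactly
    # once, guarded by an exclusive flock so concurrent P99 worker processes
    # cannot race on the check-then-append.
    record_md5_once() {
      local lMD5SUM="${1:-}"
      local lDONE_FILE="${2:-}"    # e.g. p99_md5sum_done.tmp

      (
        # fd 200 holds the lock for the lifetime of this subshell
        flock -x 200
        if grep -q "^${lMD5SUM}$" "${lDONE_FILE}" 2>/dev/null; then
          exit 1                   # already recorded by another worker
        fi
        echo "${lMD5SUM}" >> "${lDONE_FILE}"
      ) 200>"${lDONE_FILE}.lock"
    }

A worker only writes its CSV line when record_md5_once succeeds, which is what keeps the MD5 sums unique in both files.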

@m-1-k-3 m-1-k-3 added bug Something isn't working EMBA labels Feb 18, 2025
@m-1-k-3 m-1-k-3 self-requested a review February 18, 2025 11:08
Member

m-1-k-3 commented Feb 18, 2025

I tested these modifications but my testing system is now running out of memory.

(screenshot: system memory filling up)

Without these mods no memory issue happens.
For my testing scenario I just have a bigger firmware with ~9000 files and I created multiple copies of busybox with different names and the same MD5 hash.

@gluesmith2021
Author

I tested these modifications but my testing system is now running out of memory.

(screenshot: system memory filling up)

Without these mods no memory issue happens. For my testing scenario I just have a bigger firmware with ~9000 files and I created multiple copies of busybox with different names and the same MD5 hash.

That's an interesting test case. I'll try to reproduce that.

In the meantime, does this happen before P99 completes? If so, I'll try to see what's taking memory there.

Otherwise, the bug might be elsewhere. What I mean: assume some Sxx module uses a lot of memory to process busybox, and that is what caused what you experienced. It's possible that, without the patch, that Sxx module did not process busybox at all because of the duplicate MD5 (as happened with S09 in issue #1464, for example, though here it would be another module). With the patch, busybox and its MD5 appear only once, so that Sxx module now processes it and triggers the memory-hungry processing.

Member

m-1-k-3 commented Feb 18, 2025

That's an interesting test case. I'll try to reproduce that.

In the meantime, does this happen before P99 completes? If so, I'll try to see what's taking memory there.

Yes, it is during the P99 run. Memory starts filling up very soon, and after a bit of time the system hangs and starts killing processes.

Author

gluesmith2021 commented Feb 18, 2025

Identified cause: given that a separate process is spawned for each file to process (as per the existing code), thousands of processes are created on large firmware images, and with this PR most of them end up waiting for their turn on flock. Each process takes a few MB of reserved memory at that point, and together they can exhaust memory.

Compound effect: the larger the firmware, the larger the "done MD5s" temp file grows and the longer the critical section takes to execute (not by much, but still...). This makes queued processes wait longer and makes them more prone to pile up.

Fix in latest commit:

  • Use max_pids_protection to limit the number of concurrent processes. This effectively fixes the memory issue.
  • Filter the PID list for performance, as with S09 in Effective (and fast) S09 threading #1462. This does not improve performance compared to before the PR, but it prevents the degradation caused by now using max_pids_protection on long PID lists.

Overall, for this PR at this point, I noticed a performance degradation compared to before the PR (for example, on a firmware where P99 reports 36000 files to process, it went from 7 to 17 minutes on my system), but at least the result is correct. The rest of the analysis takes several hours, so 10 more minutes is not that much, but it is still not very welcome.

If performance really is an issue, there might be options, although I can't tell how much (if at all) they would improve performance:

  • Remove the lock and perform post-filtering as first suggested in issue Race condition in P99 / helpers_emba_prepare.sh #1464
    • the slight refactor of max_pids_protection in this PR could still have benefits (see Remarks below)
  • Instead of a file with "done MD5s" and a grep, this could be a folder with a bunch of 0-byte marker files, one per MD5. Maybe checking whether a file exists is faster than a grep (assuming the OS keeps those "hot" parts of the file system in memory); a rough sketch follows this list.
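A rough sketch of that last idea (hypothetical helper, not implemented in this PR; lMD5_DIR stands in for whatever temp location the module uses):

    # Hypothetical alternative: one empty marker file per MD5 instead of
    # grepping a growing "done MD5s" text file. Atomicity comes from
    # noclobber-style file creation rather than a lock.
    lMD5_DIR="${TMP_DIR}/p99_md5_markers"
    mkdir -p "${lMD5_DIR}"

    md5_seen_before() {
      local lMD5SUM="${1:-}"
      # with noclobber (set -C) the redirection fails if the file already
      # exists, so creating the marker doubles as an atomic "already seen?" test
      if ( set -C; : > "${lMD5_DIR}/${lMD5SUM}" ) 2>/dev/null; then
        return 1    # marker newly created -> MD5 not seen before
      fi
      return 0      # creation failed -> MD5 already recorded
    }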

Remarks

On filtering PIDs: I extracted the filtering part from max_pids_protection to make it available separately. I tried to move the debug prints along in a sensible way. Also, the code added to S09 in #1462 could now reuse the new filter_out_dead_pids instead of recover_wait_pids, but given that #1462 isn't a parent of this PR, I didn't want to complicate merges. This can be done later (it is in fact code duplication, not a behavior issue).

For the main loop in this PR, I implemented the same pattern as in other modules, which looks increasingly common across the codebase. I mean something like:

    for lBINARY in "${ALL_BINS_ARR[@]}" ; do
      process_that_file_in_a_new_process "${lBINARY}" &
      local lTMP_PID="$!"
      store_kill_pids "${lTMP_PID}"
      lTASK_PIDS_ARR+=( "${lTMP_PID}" )

      # throttle only once the PID list grows beyond the thread limit
      if [[ "${#lTASK_PIDS_ARR[@]}" -gt "${MAX_MOD_THREADS}" ]]; then
        filter_out_dead_pids lTASK_PIDS_ARR
        if [[ "${#lTASK_PIDS_ARR[@]}" -gt "${MAX_MOD_THREADS}" ]]; then
          max_pids_protection $(( "${MAX_MOD_THREADS}"*FUDGE_FACTOR )) "${lTASK_PIDS_ARR[@]}"
        fi
      fi
    done
    wait_for_pid "${lTASK_PIDS_ARR[@]}"

Some modules use only max_pids_protection and might benefit from filter_out_dead_pids (as discussed in #1462). Also, maybe this whole pattern could be isolated in a helper function and reused in many places. Again, this is code duplication, not a behavior issue, so that's an optional task for the backlog I guess.
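One possible shape for such a helper, assuming the PID array is passed by name (all names below are illustrative, not existing EMBA helpers):

    # Illustrative sketch: throttle a caller-owned PID array in one place.
    # Requires bash 4.3+ namerefs; the caller's array must not be named lPIDS_REF.
    throttle_task_pids() {
      local -n lPIDS_REF="${1}"                  # PID array passed by name
      local lMAX="${2:-${MAX_MOD_THREADS}}"
      local lALIVE_ARR=()
      local lPID=""

      [[ "${#lPIDS_REF[@]}" -le "${lMAX}" ]] && return 0

      # drop PIDs that have already exited
      for lPID in "${lPIDS_REF[@]}"; do
        kill -0 "${lPID}" 2>/dev/null && lALIVE_ARR+=( "${lPID}" )
      done
      lPIDS_REF=( "${lALIVE_ARR[@]}" )

      # still too many live workers -> fall back to the blocking protection
      if [[ "${#lPIDS_REF[@]}" -gt "${lMAX}" ]]; then
        max_pids_protection "${lMAX}" "${lPIDS_REF[@]}"
      fi
    }

The loop body above would then shrink to a single throttle_task_pids lTASK_PIDS_ARR call.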

Member

m-1-k-3 commented Feb 19, 2025

Regarding the longer runtime, I would give this a quick test:

sort -u -t';' -k8,8 ~/firmware-analysis/emba_logs_test/csv_logs/p99_prepare_analyzer.csv

On my testing scenario this looks quite fast:

└─$ wc -l /home/m1k3/firmware-analysis/emba_logs_test1/csv_logs/p99_prepare_analyzer.csv                           
8522 /home/m1k3/firmware-analysis/emba_logs_test1/csv_logs/p99_prepare_analyzer.csv
                                                                                                                                                                                                                                            
┌──(m1k3㉿kali-emba)-[~/firmware-analysis/emba_forked]
└─$ time sort -u -t';' -k8,8 /home/m1k3/firmware-analysis/emba_logs_test1/csv_logs/p99_prepare_analyzer.csv | wc -l
8219

real    0.03s
user    0.02s
sys     0.01s
cpu     93%

Member

m-1-k-3 commented Feb 19, 2025

I have also seen that field 8 is sometimes corrupt, and we need to ensure the CSV file remains intact in the P99 module. The file output needs to be filtered for ';' characters here:

write_csv_log "${lBINARY}" "${lD_CLASS}" "${lD_DATA}" "${lD_MACHINE}" "${lD_FLAGS_CNT}" "${lD_ARCH_GUESSED}" "${D_FILE_OUTPUT}" "${lMD5SUM}" &

I included a quick filter in the call in PR #1473.
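One way such a filter could look (illustrative sketch; the actual change is in #1473):

    # Illustrative only: strip the CSV separator from the `file` output so
    # field 8 (the MD5 sum) keeps its column and stays greppable/sortable.
    D_FILE_OUTPUT="${D_FILE_OUTPUT//;/}"
    write_csv_log "${lBINARY}" "${lD_CLASS}" "${lD_DATA}" "${lD_MACHINE}" "${lD_FLAGS_CNT}" "${lD_ARCH_GUESSED}" "${D_FILE_OUTPUT}" "${lMD5SUM}" &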

@gluesmith2021
Author

I tried with/without flock and with/without limiting the number of PIDs. I couldn't get reliable benchmarks (one run can take 8 minutes for P99 and a second identical run 11 minutes), but in all cases:

  • Using flock is slower than a simple post-processing sort.
  • The "limit concurrent PIDs" logic itself (required when using flock, but not the other way around) accounts for about half of the performance loss.

Also, without flock, the PID-limiting logic doesn't do much. Most spawned processes in that loop are very short-lived and do not pile up when the PID count is left unguarded (my 32-core CPU was mostly hovering around 50%). At least, not in any case I could reproduce (nor, apparently, for other people, or someone would have created an issue about it).

Therefore, I reverted all changes and added a post-processing sort instead.
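Roughly, the post-processing step amounts to something like this (field number and separator as discussed above; the exact code in the commit may differ):

    # Deduplicate the P99 CSV on the MD5 column (field 8, ';'-separated) once
    # all worker processes have finished, instead of locking during the run.
    sort -u -t ';' -k8,8 "${P99_CSV_LOG}" -o "${P99_CSV_LOG}"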

Although the max_pids_protection refactoring was "nice", it is not necessary for this fix and has been removed as well. It could be useful later on to tidy up code (such as not having the same "filter dead PIDs" logic in S09).

@gluesmith2021 gluesmith2021 changed the title fix race condition in MD5 list generation fix results from race condition in MD5 list generation Feb 19, 2025
Member

@m-1-k-3 m-1-k-3 left a comment


Thank you for pointing out this issue and fixing it

Member

m-1-k-3 commented Feb 20, 2025

Although the max_pids_protection refactoring was "nice", it is not necessary for this fix and has been removed as well. It could be useful later on to tidy up code (such as not having the same "filter dead PIDs" logic in S09).

I totally agree. A lot of code and ideas have grown massively and were never refactored. I hope you are interested in bringing up the refactoring of the process handling again in a dedicated PR.

Thank you for your effort. Great work!

@m-1-k-3 m-1-k-3 merged commit 732574a into e-m-b-a:master Feb 20, 2025
15 checks passed
@gluesmith2021 gluesmith2021 deleted the fix_p99_md5_race_condition branch February 20, 2025 15:15