
fix results from race condition in MD5 list generation #1471

Merged
merged 5 commits into e-m-b-a:master on Feb 20, 2025

Conversation


@gluesmith2021 gluesmith2021 commented Feb 17, 2025

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Fixes issue #1464

  • What is the current behavior? (You can also link to an open issue here)

See issue #1464

  • What is the new behavior (if this is a feature change)? If possible add a screenshot.

A critical section, implemented with flock, prevents the race condition. MD5 sums in p99_md5sum_done.tmp and P99_CSV_LOG are now unique (a minimal sketch of the flock pattern is shown at the end of this description).

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No.

  • Other information:

Before the fix, with the test "firmware" from issue #1464, duplicate MD5 entries would occur once every few runs. After the fix they do not seem to occur at all, but I don't know how to devise a test that demonstrates a race condition can't occur at that point.
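For reference, a minimal sketch of the kind of flock-guarded critical section described above (function and lock-file names are illustrative, not the exact code in this PR):

    # Illustrative sketch only: record an MD5 in the shared "done" list exactly
    # once, guarded by an exclusive flock so concurrent P99 worker processes
    # cannot race on the check-then-append.
    record_md5_once() {
      local lMD5SUM="${1:-}"
      local lDONE_FILE="${2:-}"    # e.g. p99_md5sum_done.tmp

      (
        # fd 200 holds the lock for the lifetime of this subshell
        flock -x 200
        if grep -q "^${lMD5SUM}$" "${lDONE_FILE}" 2>/dev/null; then
          exit 1                   # already recorded by another worker
        fi
        echo "${lMD5SUM}" >> "${lDONE_FILE}"
      ) 200>"${lDONE_FILE}.lock"
    }

A worker only writes its CSV line when record_md5_once succeeds, which is what keeps the MD5 sums unique in both files.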

@m-1-k-3 m-1-k-3 added bug Something isn't working EMBA labels Feb 18, 2025
@m-1-k-3 m-1-k-3 self-requested a review February 18, 2025 11:08
Member

m-1-k-3 commented Feb 18, 2025

I tested these modifications but my testing system is now running out of memory.

(screenshot: system memory filling up)

Without these mods no memory issue happens.
For my testing scenario I just have a bigger firmware with ~9000 files and I created multiple copies of busybox with different names and the same MD5 hash.

@gluesmith2021
Author

I tested these modifications but my testing system is now running out of memory.

(screenshot: system memory filling up)

Without these mods no memory issue happens. For my testing scenario I just have a bigger firmware with ~9000 files and I created multiple copies of busybox with different names and the same MD5 hash.

That's an interesting test case. I'll try to reproduce that.

In the meantime, does this happen before P99 completes? If so, I'll try to see what's taking memory there.

Otherwise, the bug might be elsewhere. What I mean: assume some Sxx module uses a lot of memory to process busybox, and that is what caused what you experienced. It's possible that, without the patch, that Sxx module did not process busybox at all because of the duplicate MD5 (as happened with S09 in issue #1464, for example, though here it would be another module). With the patch, busybox and its MD5 appear only once, so that Sxx module now processes it and triggers the memory-hungry processing.

Member

m-1-k-3 commented Feb 18, 2025

That's an interesting test case. I'll try to reproduce that.

In the meantime, does this happen before P99 completes? If so, I'll try to see what's taking memory there.

Yes, it is during the P99 run. Memory starts filling up very soon, and after a bit of time the system hangs and starts killing processes.

Author

gluesmith2021 commented Feb 18, 2025

Identified cause: given that a separate process is spawned for each file to process (as per the existing code), thousands of processes are created on large firmware images, and with this PR most of them end up waiting for their turn on flock. Each process takes a few MB of reserved memory at that point, and together they can exhaust memory.

Compound effect: the larger the firmware, the larger the "done MD5s" temp file grows and the longer the critical section takes to execute (not by much, but still...). This makes queued processes wait longer and makes them more prone to pile up.

Fix in latest commit:

  • Use max_pids_protection to limit the number of concurrent processes. This effectively fixes the memory issue.
  • Filter the PID list for performance, as with S09 in Effective (and fast) S09 threading #1462. This does not improve performance compared to before the PR, but it prevents the degradation caused by now using max_pids_protection on long PID lists.

Overall, for this PR at this point, I noticed a performance degradation compared to before the PR (for example, on a firmware where P99 reports 36000 files to process, it went from 7 to 17 minutes on my system), but at least the result is correct. The rest of the analysis takes several hours, so 10 more minutes is not that much, but it is still not very welcome.

If performance really is an issue, there might be options, although I can't tell how much (if at all) they would improve performance:

  • Remove the lock and perform post-filtering as first suggested in issue Race condition in P99 / helpers_emba_prepare.sh #1464
    • the slight refactor of max_pids_protection in this PR could still have benefits (see Remarks below)
  • Instead of a file with "done MD5s" and a grep, this could be a folder with a bunch of 0-byte marker files, one per MD5. Maybe checking whether a file exists is faster than a grep (assuming the OS keeps those "hot" parts of the file system in memory); a rough sketch follows this list.
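A rough sketch of that last idea (hypothetical helper, not implemented in this PR; lMD5_DIR stands in for whatever temp location the module uses):

    # Hypothetical alternative: one empty marker file per MD5 instead of
    # grepping a growing "done MD5s" text file. Atomicity comes from
    # noclobber-style file creation rather than a lock.
    lMD5_DIR="${TMP_DIR}/p99_md5_markers"
    mkdir -p "${lMD5_DIR}"

    md5_seen_before() {
      local lMD5SUM="${1:-}"
      # with noclobber (set -C) the redirection fails if the file already
      # exists, so creating the marker doubles as an atomic "already seen?" test
      if ( set -C; : > "${lMD5_DIR}/${lMD5SUM}" ) 2>/dev/null; then
        return 1    # marker newly created -> MD5 not seen before
      fi
      return 0      # creation failed -> MD5 already recorded
    }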

Remarks

On filtering PIDs: I extracted the filtering part from max_pids_protection to make it available separately. I tried to move the debug prints along in a sensible way. Also, the code added to S09 in #1462 could now reuse the new filter_out_dead_pids instead of recover_wait_pids, but given that #1462 isn't a parent of this PR, I didn't want to complicate merges. This can be done later (it is in fact code duplication, not a behavior issue).

For the main loop in this PR, I implemented the same pattern as in other modules, which looks increasingly common across the codebase. I mean something like:

    for lBINARY in "${ALL_BINS_ARR[@]}" ; do
      process_that_file_in_a_new_process "${lBINARY}" &
      local lTMP_PID="$!"
      store_kill_pids "${lTMP_PID}"
      lTASK_PIDS_ARR+=( "${lTMP_PID}" )

      # throttle only once the PID list grows beyond the thread limit
      if [[ "${#lTASK_PIDS_ARR[@]}" -gt "${MAX_MOD_THREADS}" ]]; then
        filter_out_dead_pids lTASK_PIDS_ARR
        if [[ "${#lTASK_PIDS_ARR[@]}" -gt "${MAX_MOD_THREADS}" ]]; then
          max_pids_protection $(( "${MAX_MOD_THREADS}"*FUDGE_FACTOR )) "${lTASK_PIDS_ARR[@]}"
        fi
      fi
    done
    wait_for_pid "${lTASK_PIDS_ARR[@]}"

Some modules use only max_pids_protection and might benefit from filter_out_dead_pids (as discussed in #1462). Also, maybe this whole pattern could be isolated in a helper function and reused in many places. Again, this is code duplication, not a behavior issue, so that's an optional task for the backlog I guess.
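One possible shape for such a helper, assuming the PID array is passed by name (all names below are illustrative, not existing EMBA helpers):

    # Illustrative sketch: throttle a caller-owned PID array in one place.
    # Requires bash 4.3+ namerefs; the caller's array must not be named lPIDS_REF.
    throttle_task_pids() {
      local -n lPIDS_REF="${1}"                  # PID array passed by name
      local lMAX="${2:-${MAX_MOD_THREADS}}"
      local lALIVE_ARR=()
      local lPID=""

      [[ "${#lPIDS_REF[@]}" -le "${lMAX}" ]] && return 0

      # drop PIDs that have already exited
      for lPID in "${lPIDS_REF[@]}"; do
        kill -0 "${lPID}" 2>/dev/null && lALIVE_ARR+=( "${lPID}" )
      done
      lPIDS_REF=( "${lALIVE_ARR[@]}" )

      # still too many live workers -> fall back to the blocking protection
      if [[ "${#lPIDS_REF[@]}" -gt "${lMAX}" ]]; then
        max_pids_protection "${lMAX}" "${lPIDS_REF[@]}"
      fi
    }

The loop body above would then shrink to a single throttle_task_pids lTASK_PIDS_ARR call.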

Member

m-1-k-3 commented Feb 19, 2025

Regarding the longer runtime, I would give this a quick test:

sort -u -t';' -k8,8 ~/firmware-analysis/emba_logs_test/csv_logs/p99_prepare_analyzer.csv

On my testing scenario this looks quite fast:

└─$ wc -l /home/m1k3/firmware-analysis/emba_logs_test1/csv_logs/p99_prepare_analyzer.csv                           
8522 /home/m1k3/firmware-analysis/emba_logs_test1/csv_logs/p99_prepare_analyzer.csv
                                                                                                                                                                                                                                            
┌──(m1k3㉿kali-emba)-[~/firmware-analysis/emba_forked]
└─$ time sort -u -t';' -k8,8 /home/m1k3/firmware-analysis/emba_logs_test1/csv_logs/p99_prepare_analyzer.csv | wc -l
8219

real    0.03s
user    0.02s
sys     0.01s
cpu     93%

Member

m-1-k-3 commented Feb 19, 2025

I have also seen that field 8 is sometimes corrupt, and we need to ensure the CSV file remains intact in the P99 module. The file output needs to be filtered for ';' characters here:

write_csv_log "${lBINARY}" "${lD_CLASS}" "${lD_DATA}" "${lD_MACHINE}" "${lD_FLAGS_CNT}" "${lD_ARCH_GUESSED}" "${D_FILE_OUTPUT}" "${lMD5SUM}" &

I included a quick filter in the call in PR #1473.
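One way such a filter could look (illustrative sketch; the actual change is in #1473):

    # Illustrative only: strip the CSV separator from the `file` output so
    # field 8 (the MD5 sum) keeps its column and stays greppable/sortable.
    D_FILE_OUTPUT="${D_FILE_OUTPUT//;/}"
    write_csv_log "${lBINARY}" "${lD_CLASS}" "${lD_DATA}" "${lD_MACHINE}" "${lD_FLAGS_CNT}" "${lD_ARCH_GUESSED}" "${D_FILE_OUTPUT}" "${lMD5SUM}" &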

@gluesmith2021
Author

I tried with/without flock and with/without limiting the number of PIDs. I couldn't get reliable benchmarks (one run can take 8 minutes for P99 and a second identical run 11 minutes), but in all cases:

  • Using flock is slower than a simple post-processing sort.
  • The "limit concurrent PIDs" logic itself (required when using flock, but not the other way around) accounts for about half of the performance loss.

Also, without flock, the PID-limiting logic doesn't do much. Most spawned processes in that loop are very short-lived and do not pile up when the PID count is left unguarded (my 32-core CPU was mostly hovering around 50%). At least, not in any case I could reproduce (nor, apparently, for other people, or someone would have created an issue about it).

Therefore, I reverted all changes and added a post-processing sort instead.
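Roughly, the post-processing step amounts to something like this (field number and separator as discussed above; the exact code in the commit may differ):

    # Deduplicate the P99 CSV on the MD5 column (field 8, ';'-separated) once
    # all worker processes have finished, instead of locking during the run.
    sort -u -t ';' -k8,8 "${P99_CSV_LOG}" -o "${P99_CSV_LOG}"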

Although the max_pids_protection refactoring was "nice", it is not necessary for this fix and has been removed as well. It could be useful later on to tidy up code (such as not having the same "filter dead PIDs" logic in S09).

@gluesmith2021 gluesmith2021 changed the title fix race condition in MD5 list generation fix results from race condition in MD5 list generation Feb 19, 2025
Member

@m-1-k-3 m-1-k-3 left a comment


Thank you for pointing out this issue and fixing it

Member

m-1-k-3 commented Feb 20, 2025

Although the max_pids_protection refactoring was "nice", it is not necessary for this fix and has been removed as well. It could be useful later on to tidy up code (such as not having the same "filter dead PIDs" logic in S09).

I totally agree. A lot of code and ideas have grown massively and were never refactored. I hope you are interested in bringing up the refactoring of the process handling again in a dedicated PR.

Thank you for your effort. Great work!

@m-1-k-3 m-1-k-3 merged commit 732574a into e-m-b-a:master Feb 20, 2025
15 checks passed
@gluesmith2021 gluesmith2021 deleted the fix_p99_md5_race_condition branch February 20, 2025 15:15