Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage watchdog #1056

Merged
merged 15 commits into from
Oct 12, 2023
Merged

Storage watchdog #1056

merged 15 commits into from
Oct 12, 2023

Conversation

vringar
Copy link
Contributor

@vringar vringar commented Oct 11, 2023

Original Implementation done by @gridl0ck.
With modifications by @vringar.

@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Attention: 57 lines in your changes are missing coverage. Please review.

Comparison is base (761e46d) 46.20% compared to head (5909090) 45.08%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1056      +/-   ##
==========================================
- Coverage   46.20%   45.08%   -1.13%     
==========================================
  Files          34       35       +1     
  Lines        3398     3476      +78     
==========================================
- Hits         1570     1567       -3     
- Misses       1828     1909      +81     
Files Coverage Δ
openwpm/config.py 94.69% <100.00%> (+0.16%) ⬆️
openwpm/deploy_browsers/deploy_firefox.py 24.09% <0.00%> (-0.61%) ⬇️
openwpm/browser_manager.py 48.50% <40.00%> (-1.25%) ⬇️
openwpm/task_manager.py 71.24% <60.00%> (-1.20%) ⬇️
openwpm/utilities/storage_watchdog.py 23.43% <23.43%> (ø)

... and 4 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@vringar
Copy link
Contributor Author

vringar commented Oct 11, 2023

Hey @gridl0ck,
When checking out the code locally I noticed a couple of things that made me reconsider some of the choices you had made.

  1. The observer from the watchdog was not being used. It was only created and stopped

  2. Starting a thread and immedetly rejoining it doesn't allow for more concurrency or parallelism. So this work can also just be done in the main process in the execute_command_sequence thread

I also accidentally pushed these changes directly to master and then had to force push over them, since I wasn't sure that all tests were passing. This is why your PR got closed.

@gridl0ck
Copy link
Contributor

Oh @vringar I completely missed that. That is most definitely left over from an early design I had intended to use but I have since moved away from using it. Do you need me to remove it and push those changes?

@vringar
Copy link
Contributor Author

vringar commented Oct 11, 2023

@gridl0ck I hit send too early and still need to update the previous message with the rest of my feedback
I was hoping, it be able to fix my mistake before you saw it 😅

Do you need me to remove it and push those changes?

As I have rewritten a large part of your original implementation and have a couple of open questions, I'd rather have you as a reviewer than a contributor.

@gridl0ck
Copy link
Contributor

As I have rewritten a large part of your original implementation and have a couple of open questions, I'd rather have you as a reviewer than a contributor.

Dang ok. Let me know what, if anything, I need to do to get this added because I do think it is a helpful addition.

@vringar
Copy link
Contributor Author

vringar commented Oct 11, 2023

My primary question right now is:
What made you decide to force the checks after every command sequence?

The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.

Is this Scenario unacceptable to you?:

  1. Profile<Max_Size
  2. CS1 runs to completion
  3. Profile > Max_Size but the Watchdog hasn't noticed that yet
  4. CS2 starts running
  5. Watchdog notices and sets reset=True
  6. CS2 completes
  7. Browser gets restarted

Please note that I'm not disagreeing with doing the checks synchronously after the CS. I might even pull out the memory_watchdog check to the same location, because it makes it easier to reason about what can cause a browser to reset. I'm just genuinely curious.

@gridl0ck
Copy link
Contributor

gridl0ck commented Oct 11, 2023

My primary question right now is: What made you decide to force the checks after every command sequence?

When I created this for my capstone, the amount of data generated by each crawl varied per website so I needed to check the size of the folder. As to why its at the end of the CS, I ran into problems with the StorageController not saving the data to the database before the watchdog got to it (or thats how I interpreted the problem at the time).

The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.

Is this Scenario unacceptable to you:

  1. Profile<Max_Size
  2. CS1 runs to completion
  3. Profile > Max_Size but the Watchdog hasn't noticed that yet
  4. CS2 starts running
  5. Watchdog notices and sets reset=True
  6. CS2 completes
  7. Browser gets restarted

This goes back what I was saying earlier about the StorageController not saving the data because I did have this running asynchronously at first (trying to queue up restarts for when its most convenient) but I didn't know how to communicate that internally so running after each CS ensured that the data from each CS was stored and then if the resulting crawl pushed the profile directory over the threshold, then the browser would be restarted.

I originally had a function that would go in and simply wipe all non-essential files (It wouldnt touch configuration files or anything but it was a very hacky way of cleaning that ended up slowing down everything as time went on) but realized that having the browser just restart after a threshold reached cleared those files and did the necessary setup for each browser because you built that functionality in.

The StorageWatchdog essentially just monitors the size of the browser_profile directories in each of the BrowserManager threads and uses your built-in reset functionality in moderation. Before, when you set the reset flag, you would get a browser restart after every CS, which slowed down our crawls and part of our project was crawling a certain number of websites in a timely manner so this was inconvenient. With the StorageWatchdog, you can let the crawls run with little impact to speed because the browsers arent being reset after each CS, but you can also work with limited space.

@vringar vringar enabled auto-merge (squash) October 12, 2023 19:28
@vringar
Copy link
Contributor Author

vringar commented Oct 12, 2023

Okay, so I wanted to write tests to ensure this functionality keeps working, but seeing as our other two watchdogs also don't have any test and I can't think of a good way to test it (as restarts are supposed to be transparent/invisible to the user anyway) I'll just set this to automerge and way for the tests to pass.

I'll create a new release with this feature in the next couple of days.
Thank you for your contribution @gridl0ck !

@vringar vringar merged commit c27643a into master Oct 12, 2023
@vringar vringar deleted the storage-watchdog branch October 12, 2023 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants