Storage watchdog #1056

vringar · 2023-10-11T22:05:02Z

Original Implementation done by @gridl0ck.
With modifications by @vringar.


codecov · 2023-10-11T22:08:10Z

Codecov Report

Attention: 57 lines in your changes are missing coverage. Please review.

Comparison is base (761e46d) 46.20% compared to head (5909090) 45.08%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1056      +/-   ##
==========================================
- Coverage   46.20%   45.08%   -1.13%     
==========================================
  Files          34       35       +1     
  Lines        3398     3476      +78     
==========================================
- Hits         1570     1567       -3     
- Misses       1828     1909      +81

| Files | Coverage Δ | |
|---|---|---|
| openwpm/config.py | `94.69% <100.00%> (+0.16%)` | ⬆️ |
| openwpm/deploy_browsers/deploy_firefox.py | `24.09% <0.00%> (-0.61%)` | ⬇️ |
| openwpm/browser_manager.py | `48.50% <40.00%> (-1.25%)` | ⬇️ |
| openwpm/task_manager.py | `71.24% <60.00%> (-1.20%)` | ⬇️ |
| openwpm/utilities/storage_watchdog.py | `23.43% <23.43%> (ø)` | |

... and 4 files with indirect coverage changes


vringar · 2023-10-11T22:12:07Z

Hey @gridl0ck,
When checking out the code locally I noticed a couple of things that made me reconsider some of the choices you had made.

The observer from the watchdog was not being used. It was only created and stopped.
Starting a thread and immediately joining it doesn't allow for more concurrency or parallelism, so this work can just as well be done in the main process, in the execute_command_sequence thread.
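
To illustrate the second point, here is a minimal sketch (not the code from this PR; `check_profile_size` is a hypothetical stand-in for the watchdog's work). Because the caller blocks on `join()` right after `start()`, the threaded version gains nothing over calling the function directly:

```python
import threading
import time


def check_profile_size(results: list) -> None:
    # Hypothetical stand-in for the watchdog's work.
    time.sleep(0.05)
    results.append("checked")


def run_in_joined_thread() -> list:
    results: list = []
    worker = threading.Thread(target=check_profile_size, args=(results,))
    worker.start()
    worker.join()  # The caller blocks here until the worker has finished.
    return results


def run_inline() -> list:
    results: list = []
    check_profile_size(results)  # Same work, same blocking, no extra thread.
    return results
```

Both functions return the same result and block the caller for the same amount of time.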

I also accidentally pushed these changes directly to master and then had to force push over them, since I wasn't sure that all tests were passing. This is why your PR got closed.

gridl0ck · 2023-10-11T22:16:07Z

Oh @vringar, I completely missed that. That is most definitely left over from an early design that I have since moved away from. Do you need me to remove it and push those changes?

vringar · 2023-10-11T22:28:09Z

@gridl0ck I hit send too early and still need to update the previous message with the rest of my feedback.
I was hoping to be able to fix my mistake before you saw it 😅

> Do you need me to remove it and push those changes?

As I have rewritten a large part of your original implementation and have a couple of open questions, I'd rather have you as a reviewer than a contributor.

gridl0ck · 2023-10-11T22:31:38Z

> As I have rewritten a large part of your original implementation and have a couple of open questions, I'd rather have you as a reviewer than a contributor.

Dang ok. Let me know what, if anything, I need to do to get this added because I do think it is a helpful addition.

vringar · 2023-10-11T22:40:42Z

My primary question right now is:
What made you decide to force the checks after every command sequence?

The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.

Is this scenario unacceptable to you?

1. Profile < Max_Size
2. CS1 runs to completion
3. Profile > Max_Size but the Watchdog hasn't noticed that yet
4. CS2 starts running
5. Watchdog notices and sets reset=True
6. CS2 completes
7. Browser gets restarted

Please note that I'm not disagreeing with doing the checks synchronously after the CS. I might even pull out the memory_watchdog check to the same location, because it makes it easier to reason about what can cause a browser to reset. I'm just genuinely curious.
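
The asynchronous scenario described above can be replayed as a small deterministic simulation. This is only a sketch under assumptions: `BrowserState`, `watchdog_poll`, `finish_command_sequence` and `MAX_SIZE` are hypothetical names, not OpenWPM's API, and the watchdog poll is called by hand instead of running in its own thread.

```python
from dataclasses import dataclass, field

MAX_SIZE = 100  # Hypothetical threshold in arbitrary units.


@dataclass
class BrowserState:
    profile_size: int = 0
    restart_required: bool = False
    events: list = field(default_factory=list)


def watchdog_poll(state: BrowserState) -> None:
    # Asynchronous watchdog: it only sets the flag and never restarts anything.
    if state.profile_size > MAX_SIZE:
        state.restart_required = True
        state.events.append("watchdog sets flag")


def finish_command_sequence(state: BrowserState, name: str, growth: int) -> None:
    # The profile grows while the command sequence runs.
    state.profile_size += growth
    state.events.append(f"{name} completes")
    # The flag is only inspected once the command sequence is done.
    if state.restart_required:
        state.profile_size = 0
        state.restart_required = False
        state.events.append("browser restarted")


def replay_scenario() -> list:
    state = BrowserState(profile_size=90)
    finish_command_sequence(state, "CS1", growth=20)  # 110 > MAX_SIZE, unnoticed
    state.events.append("CS2 starts")
    watchdog_poll(state)  # The watchdog happens to check while CS2 is running.
    finish_command_sequence(state, "CS2", growth=5)
    return state.events
```

Calling `replay_scenario()` yields the events in the order listed in the scenario: the restart only happens after CS2, one command sequence later than a synchronous check after CS1 would have triggered it.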

gridl0ck · 2023-10-11T22:54:54Z

> My primary question right now is: What made you decide to force the checks after every command sequence?

When I created this for my capstone, the amount of data generated by each crawl varied per website, so I needed to check the size of the folder. As to why it's at the end of the CS: I ran into problems with the StorageController not saving the data to the database before the watchdog got to it (or that's how I interpreted the problem at the time).

> The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.
>
> Is this scenario unacceptable to you?
>
> 1. Profile < Max_Size
> 2. CS1 runs to completion
> 3. Profile > Max_Size but the Watchdog hasn't noticed that yet
> 4. CS2 starts running
> 5. Watchdog notices and sets reset=True
> 6. CS2 completes
> 7. Browser gets restarted

This goes back to what I was saying earlier about the StorageController not saving the data. I did have this running asynchronously at first (trying to queue up restarts for when it's most convenient), but I didn't know how to communicate that internally. Running the check after each CS ensured that the data from each CS was stored; then, if the resulting crawl pushed the profile directory over the threshold, the browser would be restarted.

I originally had a function that would go in and simply wipe all non-essential files. It wouldn't touch configuration files or anything, but it was a very hacky way of cleaning that ended up slowing everything down as time went on. I then realized that having the browser restart once a threshold was reached cleared those files and did the necessary setup for each browser, because you had built that functionality in.

The StorageWatchdog essentially just monitors the size of the browser_profile directories in each of the BrowserManager threads and uses your built-in reset functionality in moderation. Before, when you set the reset flag, you would get a browser restart after every CS, which slowed down our crawls; part of our project was crawling a certain number of websites in a timely manner, so this was inconvenient. With the StorageWatchdog, you can let the crawls run with little impact on speed because the browsers aren't being reset after each CS, but you can also work with limited space.
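
As a rough illustration of the synchronous approach described above (not the merged implementation), a size check that runs after each command sequence could look like the sketch below. The function names and the `maximum_profile_size` parameter are assumptions made for this example.

```python
import os
from pathlib import Path


def directory_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += (Path(root) / name).stat().st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total


def needs_restart(profile_dir: Path, maximum_profile_size: int) -> bool:
    # Hypothetical check, run once after each command sequence has completed,
    # so the data of that sequence is already stored before any restart.
    return directory_size_bytes(profile_dir) > maximum_profile_size
```

With a check like this, the browser is only restarted when the profile directory actually exceeds the threshold, instead of after every CS.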

vringar · 2023-10-12T19:31:55Z

Okay, so I wanted to write tests to ensure this functionality keeps working, but seeing as our other two watchdogs also don't have any tests and I can't think of a good way to test it (as restarts are supposed to be transparent/invisible to the user anyway), I'll just set this to automerge and wait for the tests to pass.

I'll create a new release with this feature in the next couple of days.
Thank you for your contribution @gridl0ck !

gridl0ck and others added 11 commits June 15, 2023 15:23

OpenWPM StorageWatchdog complete

59ad8df

Revised tmp_profile_dir member to use the tempfile.gettempdir functio…

ad7aa3f

…n for increased compatibility

Merge branch 'master' of https://github.com/openwpm/OpenWPM into open…

11a1ec6

…wpm-master

Merge branch 'openwpm-master'

4b9ca70

needed

Restored version changes added watchdog 3.0.0 as a requirement.

cedd0b2

Merge branch 'master' into master

528602f

Implemented changes as requested, with significant alteration to the …

58b28c2

…StorageWatchdog backend.

Implemented changes as requested, with significant alteration to the …

22bd2f2

…StorageWatchdog backend.

Merge branch 'master' into master

6e223c6

Merge branch 'master' into master

213a4c9

refactor(storage-watchdog): adjust storage watchdog implementation

a11420b

fix(dependencies): remove watchdog

6158ca5

docs(storage-watchdog): align wording

dba19b7

vringar added 2 commits October 12, 2023 21:17

fix(mypy): refactor type annotation on test

41f92a0

refactor(demo.py): show off maximum profile size

5909090

vringar enabled auto-merge (squash) October 12, 2023 19:28

vringar merged commit c27643a into master Oct 12, 2023

vringar deleted the storage-watchdog branch October 12, 2023 19:44
