Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XPI files not deleted from /tmp #1090

Open
ndanner-wesleyancs opened this issue Jun 7, 2024 · 16 comments
Open

XPI files not deleted from /tmp #1090

ndanner-wesleyancs opened this issue Jun 7, 2024 · 16 comments

Comments

@ndanner-wesleyancs
Copy link

Preface: I don't think this is an OpenWPM bug, but I'm wondering if anybody has suggestions on how to work around it.

Platform: OpenWPM v0.28, Selenium 4.21.0, Geckodriver 0.34.0.

When the extension is added to a deployed Firefox instance in deploy_firefox.py, line 161, a copy of the XPI file is made in /tmp, presumably by either Selenium or geckodriver. The problem is that it is not deleted.

That's a problem, since when I was crawling a large number of sites, I ended up with enough copies of the XPI file in /tmp that it filled up.

As I said, I don't think this is the fault of OpenWPM. My understanding is that Selenium's WebDriver.quit() is supposed to delete temporary files. It does appear to delete profile directories, just not XPI files.

Has anyone else seen this? And if so, any suggested workarounds?

@vringar
Copy link
Contributor

vringar commented Jun 8, 2024

Hey, this sounds like smt that should be done in the shutdown code of the browser manager. Please file a pull request
Please be aware however, that the Firefox version used by OpenWPM is severely out of date and before running any large scale crawls you should probably take the steps described under docs/Release-Checklist to look closer to normal web traffic

@ndanner-wesleyancs
Copy link
Author

I haven't been able to find anything in the Selenium docs that would let me get the name of the temporary file that is created by install_addon in order to delete it, though I won't say that I've done a truly exhaustive search.

This issue seems likely related to SeleniumHQ/selenium#10841, and in the comments therein, one of the posters relates it to mozilla/geckodriver#299.

@ndanner-wesleyancs
Copy link
Author

Thanks for pointing out the issue with the FF version; I'll be sure to take that into account.

@ndanner-wesleyancs
Copy link
Author

Just to be sure that this isn't caused by something I'm doing in my crawl script, here is what I think is a pretty minimal crawl script that demonstrates the undesired behavior:

"""
MWE to demonstrate that a copy of the OpenWPM extension XPI file is left behind
in /tmp each time a fresh browser is invoked.

N. Danner
"""

import pathlib

from openwpm.command_sequence import CommandSequence
from openwpm.commands.browser_commands import GetCommand,ScreenshotFullPageCommand
from openwpm.config import BrowserParams, ManagerParams
from openwpm.storage.sql_provider import SQLiteStorageProvider
from openwpm.storage.leveldb import LevelDbProvider
from openwpm.task_manager import TaskManager

NUM_BROWSERS = 1

manager_params = ManagerParams(num_browsers=NUM_BROWSERS)
manager_params.data_directory = pathlib.PosixPath('/tmp/mwe-crawl')
manager_params.log_path = manager_params.data_directory.joinpath("openwpm.log")

browser_params = [
    BrowserParams() for _ in range(NUM_BROWSERS)
]

sqlite_db_provider = \
    SQLiteStorageProvider(
        manager_params.data_directory.joinpath("crawl-data.sqlite"))


with TaskManager(
    manager_params,
    browser_params,
    sqlite_db_provider,
    None
) as manager:
    pass

After running it, there is an XPI file leftover in /tmp/. Here is the result of running the script, with some of the OpenWPM logging elided:

OpenWPM$ ls -l /tmp/addon*
ls: cannot access '/tmp/addon*': No such file or directory
OpenWPM$ python3 mwe.py 
browser_manager      - INFO     - BROWSER 4022973883: Launching browser...
storage_controller   - INFO     - Initializing new handler for TaskManager
storage_controller   - INFO     - Initializing new handler for Browser-4022973883
storage_controller   - INFO     - Initializing new handler for StorageControllerHandle
storage_controller   - INFO     - Awaiting all tasks for visit_id -1
task_manager         - INFO     - 

OpenWPM Version: b'v0.28.0'
Firefox Version: b'126.0.1'

...

========== Input profile tar files ==========
  No profile tar files specified

========== Output (archive) profile dirs ==========
  No profile archive directories specified


storage_controller   - INFO     - Terminating handler for StorageControllerHandle, because the underlying socket closed
storage_controller   - INFO     - Terminating handler for Browser-4022973883, because the underlying socket closed
storage_controller   - INFO     - Terminating handler for TaskManager, because the underlying socket closed
storage_controller   - INFO     - Received shutdown signal!
storage_controller   - INFO     - Closing Server
storage_controller   - INFO     - Closed Server
storage_controller   - INFO     - Cancelling status_queue_update
storage_controller   - INFO     - Cancelled status_queue_update
storage_controller   - INFO     - Cancelling timeout_check
storage_controller   - INFO     - Cancelled timeout_check
storage_controller   - INFO     - Starting wait_closed
storage_controller   - INFO     - Completed wait_closed
storage_controller   - INFO     - Entering self.shutdown
storage_controller   - INFO     - structured_storage is shut down
Shutdown took 11.190859317779541 seconds
OpenWPM$ ls -l /tmp/addon*
-rw-rw---- 1 ndanger ndanger 130892 Jun 10 12:05 /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi
OpenWPM$ diff -s /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi Extension/openwpm.xpi 
Files /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi and Extension/openwpm.xpi are identical

@ndanner-wesleyancs
Copy link
Author

So right now my workaround, which seems to do the job, is to:

  • Make a new temporary directory and save its name in browser_params;
  • Set TMPDIR in the environment before creating the geckodriver service, because it looks like geckodriver uses TMPDIR for its temporary directory.
  • Delete the temporary directory when the browser is shut down.

@ndanner-wesleyancs
Copy link
Author

Right now I create the temporary directory inBrowserManagerHandle.launch_browser_manager and delete it in BrowserManagerHandle.shutdown_browser. If that doesn't sound right, let me know.

@vringar
Copy link
Contributor

vringar commented Jun 23, 2024

Hey,
sorry for the delayed response.
I think you should handle this entirely within the BrowserManager.
Create and save the temp dir before Geckodriver gets started:

try:
# Start Xvfb (if necessary), webdriver, and browser
driver, browser_profile_path, display = deploy_firefox.deploy_firefox(
self.status_queue,
self.browser_params,
self.manager_params,
self.crash_recovery,
)
extension_socket: Optional[ClientSocket] = None

and then do the deletion in the finally block
finally:
if display is not None:
display.stop()
return

@ndanner-wesleyancs
Copy link
Author

OK, I will do as you suggest. One thing that I also have to do is to modify TMPDIR in the environment before invoking geckodriver. My inclination is to do that as close to the invocation as possible, which means in deploy_firefox.deploy_firefox. Is there an obvious problem with that?

@vringar
Copy link
Contributor

vringar commented Jun 26, 2024

There is no problem with that except that you will need to pass the name of the temp dir back up to the BrowserManager store that information until it is time to clean up

@ndanner-wesleyancs
Copy link
Author

Yes, I am already doing that.

@ndanner-wesleyancs
Copy link
Author

Also, why not do the configuration in BrowserManagerHandle? I'm really just doing some "pattern matching," and it looks like BrowserManagerHandle.launch_browser_manager is responsible for setting up the profile directory, which seems like a comparable task.

Similarly, BrowserManagerHandle.shutdown_browser is responsible for deleting the current profile directory, and that also seems analogous.

@palmeida
Copy link
Contributor

palmeida commented Jul 8, 2024

Hey, this sounds like smt that should be done in the shutdown code of the browser manager. Please file a pull request Please be aware however, that the Firefox version used by OpenWPM is severely out of date and before running any large scale crawls you should probably take the steps described under docs/Release-Checklist to look closer to normal web traffic

I was puzzled by this comment. Aren't there updates to the Firefox version in OpenWPM commits? I thought I'd seen them before. When you say severely out of date is that because there haven't been updates in a while (say, a few months)?

@vringar
Copy link
Contributor

vringar commented Jul 8, 2024

@palmeida as you can see at the bottom of this page about 70% of users upgrade within 4 weeks of a new firefox release happening. I can't find statistics about the breakdown of n-1 to n-4 but you are not representing the typical user if you are using a version of the browser that is 4 releases out of date (currently 123 vs 127)

When you say severely out of date is that because there haven't been updates in a while (say, a few months)?

By my definition not updating within 2 weeks of a new Firefox release happening is being out of date. But doing that every four weeks on a project that is a hobby for me during a time where I'm very busy is just not sustainable.
The update process is trivial and well-documented so if anyone in the research community wants to step in and do this, they are very welcome and I'm willing to help with any issues that come up. (But this conversation should probably happen in a GH discussion or in the matrix channel)

@palmeida
Copy link
Contributor

palmeida commented Jul 8, 2024

@vringar Thank you for the information, I wasn't aware of that statistic. Manually updating is absolutely fine, I just hadn't seen the need before.

@vringar
Copy link
Contributor

vringar commented Jul 8, 2024

@palmeida if you have done this, could you please also open a PR, so that the commons remain maintained.

@palmeida
Copy link
Contributor

palmeida commented Jul 8, 2024

@vringar I haven't done this, but maybe we can discuss it in a GH discussion, as you suggested. I might be able to maintain it. I'm bit hesitant because I've offered to help before and ended up not having time, but if you share the update process I can look into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants