Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor commands to actually follow the Command pattern #743

Closed
vringar opened this issue Sep 11, 2020 · 19 comments · Fixed by #750
Closed

Refactor commands to actually follow the Command pattern #743

vringar opened this issue Sep 11, 2020 · 19 comments · Fixed by #750
Assignees
Labels
enhancement Not a bug or a feature request

Comments

@vringar
Copy link
Contributor

vringar commented Sep 11, 2020

The command pattern describes how to encapsulate behaviour and the necessary state for this behaviour in a class, that is opaque to the caller.
My suggestion is as follows:
https://github.com/mozilla/OpenWPM/blob/1ad7d5a41f6c0e87173bcfa8838e894d373b5b48/automation/BrowserManager.py#L502-L504
gets replaced by

 command.execute(driver, browser_settings, browser_params, manager_params, extension_socket)

This is enabled by removing the split between declaration of actions and their implementation.
For example the GetCommand turns into:

class GetCommand(BaseCommand):
    def __init__(self, url, sleep):
        self.url = url
        self.sleep = sleep

    def __repr__(self):
        return "GetCommand({},{})".format(self.url, self.sleep)

    def execute(webdriver, browser_settings, browser_params, manager_params, extension_socket: clientsocket):
    """
    goes to <url> using the given <webdriver> instance
    """

    tab_restart_browser(webdriver)

    if extension_socket is not None:
        extension_socket.send(this.visit_id)

    # Execute a get through selenium
    try:
        webdriver.get(this.url)
    except TimeoutException:
        pass

    # Sleep after get returns
    time.sleep(this.sleep)

    # Close modal dialog if exists
    try:
        WebDriverWait(webdriver, .5).until(EC.alert_is_present())
        alert = webdriver.switch_to_alert()
        alert.dismiss()
        time.sleep(1)
    except (TimeoutException, WebDriverException):
        pass

    close_other_windows(webdriver)

    if browser_params['bot_mitigation']:
        bot_mitigation(webdriver)

We could then also shrink the CommandSequence by removing
https://github.com/mozilla/OpenWPM/blob/1ad7d5a41f6c0e87173bcfa8838e894d373b5b48/automation/CommandSequence.py#L64-L69
and replacing it with a generalized add_command(self, command).
We'd also add a validate method, to check that we contain a get or browse before every other command.
This method would be called when a CommandSequence gets submitted to the TaskManager, where we would also add the IntitializeCommand and the FinalizeCommand.

This would simplify adding a new command to:

  1. Create Class MyCommand that derives from BaseCommand.
  2. Implement execute() (and maybe __repr__ and eventually name())
  3. Add command_sequence.add_command(MyCommand())

And we could remove the RunCustomFunctionCommand as it would be easier to just tell the user to create a new class instead.

@vringar vringar added discussion enhancement Not a bug or a feature request good-first-bug Bugs that are good for a first-time committer to tackle labels Sep 11, 2020
@vringar
Copy link
Contributor Author

vringar commented Sep 11, 2020

This issue is pretty well documented and I'm willing to mentor anybody that shows interest in working on this.

@englehardt
Copy link
Collaborator

This sounds great to me. So GetCommand would be fully defined in browser_commands.py? Perhaps we can do some renaming of the Commands directory files, such that a caller could do something like from commands.browser import GetCommand, and similarly for the current profile commands.

Allowing us to replace custom functions with commands that follow the same pattern is also a huge win.

@cyruskarsan
Copy link
Contributor

cyruskarsan commented Sep 13, 2020

Hi, I am interested in resolving this issue as my first bug. Where/how should I begin?

@vringar
Copy link
Contributor Author

vringar commented Sep 13, 2020

@cyruskarsan Thanks for showing interest in this issue.
This is a pretty major rewrite, so please expect that this will take some time.
Assuming you've never worked with OpenWPM before, I'd suggest first just running demo.py and then having a look at the following to pieces of code.

Follow the code paths down from the second bit of code to see everything that you'll need to change.

I'd suggest not concerning yourself too much with how the CommandSequence gets from here
https://github.com/mozilla/OpenWPM/blob/1ad7d5a41f6c0e87173bcfa8838e894d373b5b48/automation/TaskManager.py#L572
to the BrowserManager because that path is quite convoluted and doesn't need to be changed as part of this issue.

@cyruskarsan
Copy link
Contributor

@vringar So I took a look at demo.py and got a high level understanding of what you guys are doing. From a high level, it appears that you have modified firefox to log a customized set of variables in an effort to search for leaked information/bad practices and run multiple browsers in parallel. I ran demo.py locally using python3 on Ubuntu 20.04. Do I need to run it using docker or is docker for people not using Ubuntu?

To view the sqlite data, is there a certain software you use? Should I be worried about that?

From my understanding, you would like make it easier for researchers to create new commands and you have asked to replicate the functionality of get in command_sequence.py in browser_manager.py by creating a getCommand class?

To clarify, this code you wrote could be implemented to replace the get command, correct?

class GetCommand(BaseCommand):
    def __init__(self, url, sleep):
        self.url = url
        self.sleep = sleep

    def __repr__(self):
        return "GetCommand({},{})".format(self.url, self.sleep)

    def execute(webdriver, browser_settings, browser_params, manager_params, extension_socket: clientsocket):
    """
    goes to <url> using the given <webdriver> instance
    """

    tab_restart_browser(webdriver)

    if extension_socket is not None:
        extension_socket.send(this.visit_id)

    # Execute a get through selenium
    try:
        webdriver.get(this.url)
    except TimeoutException:
        pass

    # Sleep after get returns
    time.sleep(this.sleep)

    # Close modal dialog if exists
    try:
        WebDriverWait(webdriver, .5).until(EC.alert_is_present())
        alert = webdriver.switch_to_alert()
        alert.dismiss()
        time.sleep(1)
    except (TimeoutException, WebDriverException):
        pass

    close_other_windows(webdriver)

    if browser_params['bot_mitigation']:
        bot_mitigation(webdriver)

@vringar
Copy link
Contributor Author

vringar commented Sep 14, 2020

@cyruskarsan Conceptually correct. The Open Web Privacy Measurements project aims to enable researchers to collect data on website behaviour, to enable privacy researchers to focus on their research and not waste time with building yet another crawler. It does so by installing an extension Firefox that can be configured to collect a variety of data points and stream them back to the platform (aka the python code). The platform also allows you to run multiple FF instances in parallel and supports different storage backends.

If you can run the code natively then I'd encourage you to do so. We mostly use Docker for cloud crawls, where we deploy on a GCP Kubernetes cluster.

You can use any SQL client you like for interacting with the Database. I use DBeaver, but mostly out of habit. I don't see a reason to be concerned about a database. (I'm sorry if I'm misunderstanding your question here)

My intention was to consolidate all the separate pieces of implementations (the type, the construction point and the behaviour) into a single place and not have them be spread out through the entire code base, but the fact that this eases the development of new commands for researchers is a very welcome benefit.

To illustrate my point on how disjointed our current architecture is, I wrote up all the places that are relevant to a singular command.

The functionality of the GetCommand (which already exists as a type) is currently implemented here:
https://github.com/mozilla/OpenWPM/blob/1ad7d5a41f6c0e87173bcfa8838e894d373b5b48/automation/Commands/browser_commands.py#L108-L140

The current flow of commands is something like this:

So to add a single parameter to a GetCommand you'd have to touch four files:

  • CommandSequence.py
  • command_executor.py
  • Types.py
  • browser_commands.py

And after my proposed changes you should one have to modify browser_commands.py whereas Types.py and command_executor.py shouldn't exist anymore.

@vringar
Copy link
Contributor Author

vringar commented Sep 14, 2020

Would you be more comfortable with me creating the required changes to BrowserManager.py and CommandSequence.py and make the changes to GetCommand and you'd then replicate the changes made to all other commands?

@cyruskarsan
Copy link
Contributor

@vringar Thanks for your succinct explanation. I think I understand the issue now. In its current implementation, the execution of tasks (such as the GetCommand) is disjoint and fragmented across different files. My task is to consolidate the implementation of these commands into the browser_commands.py file.

To begin, I will take you up on your offer to do the first changes to GetCommand.

With regard to my SQL question, I was just asking to determine if the program was actually running correctly locally or if I was just seeing terminal output. I will look into DBBeaver. Thank you

@vringar
Copy link
Contributor Author

vringar commented Sep 17, 2020

Hey @cyruskarsan,
I've created the branch CommandRefactoring, feel free to create multiple PRs against this branch. I'll merge anything that passes our current tests.
I haven't changed the CommandSequence yet, but I might get on that eventually.

@vringar vringar removed good-first-bug Bugs that are good for a first-time committer to tackle discussion labels Sep 28, 2020
@vringar
Copy link
Contributor Author

vringar commented Sep 30, 2020

@cyruskarsan I've finally brought the branch into a state where all tests are passing.
Does this example help you or do you need any other assistance?

@cyruskarsan
Copy link
Contributor

@vringar I made an attempt at refactoring save_screenshot but when I attempt to push to the branch, git gives me this error:
remote: Permission to mozilla/OpenWPM.git denied to cyruskarsan. fatal: unable to access 'https://github.com/mozilla/OpenWPM.git/': The requested URL returned error: 403

How can I push my changes to the branch?

@vringar
Copy link
Contributor Author

vringar commented Oct 4, 2020

@cyruskarsan You don't have push access to this repository, so you'll have to create a Pull Request (PR) instead.
What you'll need to do instead is:

  1. Fork this repository to have a private copy of it on GitHub. You can do this by using the fork button in the top right corner of the screen.
  2. Add this fork as a remote to your repository. For that you'll need the URL from your repository, so after you cloned it go to the green download code button it your repository and copy the link.
    Then add it as a remote by typing
    git remote add private <URL goes here>
  3. Now push to your private fork by writing
    git push private CommandRefactoring
  4. You should now see your code in your private copy in the branch CommandRefactoring
  5. You can now create a PR from your branch CommandRefactoring to the Mozilla branch CommandRefactoring and then I can accept and merge your changes.

Once you have made further changes you'll only need to repeat the steps 3-5.

@cyruskarsan
Copy link
Contributor

@vringar Does stitch_screenshot_parts(visit_id, browser_id, manager_params) need to be refactored?
I don't see it in command_executor.py nor types.py

@vringar
Copy link
Contributor Author

vringar commented Oct 16, 2020

This method seems to only be called from screenshot_full_page so I'd suggest making it a method on the ScreenshotFullPageCommand.

@vringar
Copy link
Contributor Author

vringar commented Oct 29, 2020

Copying @cyruskarsan's question from #770 here so it doesn't get lost:

Are DumpProfCommand , RunCustomFunctionCommand, and ShutdownCommand the only commands left to refactor? If so, these didn't appear in browser_commands.py so would I just add the execute function to each of these classes in browser_commands.py?

  • DumpProfCommand exists here and can be turned into a command there
  • We want to get rid of RunCustomFunctionCommand and instead create an append_command method on CommandSequenceso that users can more easily write Commands on their own
  • ShutdownCommand currently doesn't have a function beyond being a signal here, so maybe we should rename it to ShutdownSignal and move it into the BrowserManager.py

@vringar
Copy link
Contributor Author

vringar commented Oct 29, 2020

@cyruskarsan I've just added append_command myself. So consider running git pull origin CommandRefactoring before continuing your work.
Once you have removed RunCustomFunctionCommand some tests will need updating. Do you want to do that or do you want leave that to me?

@cyruskarsan
Copy link
Contributor

@vringar I've never created pytests before but I would be interested in learning.

I removed the RunCustomFunctionCommand from the code and noticed that append_command is not in Command_Sequence.py. Does it need to be there like the other commands or no?

@vringar
Copy link
Contributor Author

vringar commented Oct 31, 2020

@cyruskarsan Did you run git pull origin CommandRefactoring?
Then you should hopefully see append_command

@cyruskarsan
Copy link
Contributor

@vringar You are right, it was there. I just didn't see it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Not a bug or a feature request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants