Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing data when using browser restarts #892

Closed
bkrumnow opened this issue Apr 17, 2021 · 4 comments · Fixed by #902
Closed

Missing data when using browser restarts #892

bkrumnow opened this issue Apr 17, 2021 · 4 comments · Fixed by #902

Comments

@bkrumnow
Copy link

bkrumnow commented Apr 17, 2021

Hey everybody,

I stumbled upon an inconsistency in the data collection produced by OpenWPM. In brief, I found that sometimes it takes a browser a bit longer to start all instruments, which can be long enough that a crawl starts before the recording has begun.

The setup:
I conducted an experiment where multiple crawlers visit the same site in parallel (nearly synchronised). Each browser restarts after visiting a single site thanks to the reset option of the command sequence.

cs = CommandSequence(
  site,
  site_rank=rank,
  reset=True,
  callback=callback,
)

I used multiple instruments: HTTP, JS cookies, navigations, javascript, and the call stack instrument.
My tests include Ubuntu with OpenWPM v0.13.0 as well as OpenWPM v0.14.0 (Firefox 86.0.1 on both).

The problem:
I saw that multiple sites in my data set had a deviating number of requests. For example, the number of requests for a site can look like this:

Screenshot 2021-04-17 at 14 18 32

While this is usually not surprising, there is quite a number of sites with a varying number of first-party requests, which raised my attention. I tracked down which are missing and found that requests at the beginning of a visit are often not recorded. For the example above, one crawler did not record the very first two requests.

Screenshot 2021-04-15 at 14 32 54

I checked how many times the URL of the first recorded request differs for crawls of the same site. In an experiment where I scanned 100 sites, I found that the results on 88 sites are affected (8 crawlers per site). However, when this happens and which browser is affected is inconsistent.

What is causing this?
The current logging makes it a bit difficult to spot such kind of problems. I added logging to the extension to see when an instrument is done loading.

https://github.com/mozilla/OpenWPM/blob/625d81460b25d68846de36ddeeab23c9731bbda6/openwpm/Extension/firefox/feature.js/index.js#L56-L93

Side note: OpenWPM-0.14.0 does not show any logging of the extension, but the logging in the extension is still the same. Was that done on purpose?

The log for a site looks like this:

2021-04-17 11:22:05,609 - MainProcess[Thread-186]- BrowserManager       - DEBUG   : BROWSER 1: Browser spawn sucessful!
2021-04-17 11:22:05,668 - MainProcess[Thread-186]- BrowserManager       - INFO    : BROWSER 1: EXECUTING COMMAND: IntitializeCommand()
2021-04-17 11:22:05,671 - MainProcess[Thread-186]- BrowserManager       - INFO    : BROWSER 1: EXECUTING COMMAND: GetCommand(http://studydemic.com,30)
2021-04-17 11:22:05,685 - MainProcess[Thread-1  ]- FirefoxExtension     - DEBUG   : Setting visit_id the legacy way
2021-04-17 11:22:05,710 - MainProcess[Thread-1  ]- FirefoxExtension     - DEBUG   : Setting visit_id the legacy way
2021-04-17 11:22:05,750 - MainProcess[Thread-1  ]- FirefoxExtension     - DEBUG   : Browser 1: Javascript instrumentation loaded -521
2021-04-17 11:22:05,751 - MainProcess[Thread-1  ]- FirefoxExtension     - DEBUG   : Browser 1: Starting HTTP instrumentation -521
2021-04-17 11:22:05,753 - MainProcess[Thread-1  ]- FirefoxExtension     - DEBUG   : Browser 1: HTTP instrumentation loaded -526
2021-04-17 11:22:05,753 - MainProcess[Thread-1  ]- FirefoxExtension     - DEBUG   : Browser 1: Starting Callstack instrumentation -526
2021-04-17 11:22:05,754 - MainProcess[Thread-1  ]- FirefoxExtension     - DEBUG   : Browser 1: Callstack instrumentation loaded -526

Here, the JavaScript and HTTP instrument started recording after the Get-Command was executed. In my setup, every browser restarts after finishing a visit. That means each visit leads to a new chance that a browser starts before the instruments are fully up.

I found two causes for the slow initialisation. Often the JavaScript extension needs 500ms up to over 2 seconds. Also, the loading of the config file from the local file system can take up to 3 seconds. Note that this is exceptional: On 800 conducted crawls, three browser starts showed such a large delay.

Final thoughts:
It would be great if someone can verify this problem to exclude my local setup and hardware as a factor. However, I have already tested this on two different machines.

I guess ensuring that all instruments are up is a bit tricky and may takes time to implement. Whoever needs a quick (and a bit dirty) fix, I experimented with a small timeout of 3 seconds before a browser fires the get command. This resolves most of the cases for me. Out of my 88 cases before, I had three left.

@vringar
Copy link
Contributor

vringar commented Apr 19, 2021

Hey, thanks for the awesome bug report.

Side note: OpenWPM-0.14.0 does not show any logging of the extension, but the logging in the extension is still the same. Was that done on purpose?

No, it was not. And I don't know what could be causing this.

I guess ensuring that all instruments are up is a bit tricky and may takes time to implement.

After looking around in the code, I don't think it should be that hard. Sockets work both ways, so we should be able to expand the ClientSocket to be able to receive data.
Then we could wait for the ClientSocket to receive a "Started Up" message here:
https://github.com/mozilla/OpenWPM/blob/625d81460b25d68846de36ddeeab23c9731bbda6/openwpm/browser_manager.py#L662-L663

which we'd send after this code has run and before calling main:
https://github.com/mozilla/OpenWPM/blob/625d81460b25d68846de36ddeeab23c9731bbda6/openwpm/Extension/firefox/feature.js/index.js#L56-L93

Then we'd have certainty that the instrumentation is started up before we start sending commands.

Whoever needs a quick (and a bit dirty) fix, I experimented with a small timeout of 3 seconds before a browser fires the get command.

I'd really appreciate a PR for this quick fix until I can land the more comprehensive workaround.

@vringar
Copy link
Contributor

vringar commented Apr 19, 2021

This could be considered part of #135

@bkrumnow
Copy link
Author

bkrumnow commented Apr 20, 2021

Short update: I also found missing data when not using the reset option. It's not a surprise to see it happening after a browser restarted (probably due to errors from the previous visit?).

I'd really appreciate a PR for this quick fix until I can land the more comprehensive workaround.

I see if I can find some time for that tonight.

@vringar
Copy link
Contributor

vringar commented Apr 26, 2021

We can just write a file to the profile dir after startup is complete and poll on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants