-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missing data when using browser restarts #892
Comments
Hey, thanks for the awesome bug report.
No, it was not. And I don't know what could be causing this.
After looking around in the code, I don't think it should be that hard. Sockets work both ways, so we should be able to expand the which we'd send after this code has run and before calling Then we'd have certainty that the instrumentation is started up before we start sending commands.
I'd really appreciate a PR for this quick fix until I can land the more comprehensive workaround. |
This could be considered part of #135 |
Short update: I also found missing data when not using the reset option. It's not a surprise to see it happening after a browser restarted (probably due to errors from the previous visit?).
I see if I can find some time for that tonight. |
We can just write a file to the profile dir after startup is complete and poll on that. |
Hey everybody,
I stumbled upon an inconsistency in the data collection produced by OpenWPM. In brief, I found that sometimes it takes a browser a bit longer to start all instruments, which can be long enough that a crawl starts before the recording has begun.
The setup:
I conducted an experiment where multiple crawlers visit the same site in parallel (nearly synchronised). Each browser restarts after visiting a single site thanks to the reset option of the command sequence.
I used multiple instruments: HTTP, JS cookies, navigations, javascript, and the call stack instrument.
My tests include Ubuntu with OpenWPM v0.13.0 as well as OpenWPM v0.14.0 (Firefox 86.0.1 on both).
The problem:
I saw that multiple sites in my data set had a deviating number of requests. For example, the number of requests for a site can look like this:
While this is usually not surprising, there is quite a number of sites with a varying number of first-party requests, which raised my attention. I tracked down which are missing and found that requests at the beginning of a visit are often not recorded. For the example above, one crawler did not record the very first two requests.
I checked how many times the URL of the first recorded request differs for crawls of the same site. In an experiment where I scanned 100 sites, I found that the results on 88 sites are affected (8 crawlers per site). However, when this happens and which browser is affected is inconsistent.
What is causing this?
The current logging makes it a bit difficult to spot such kind of problems. I added logging to the extension to see when an instrument is done loading.
https://github.com/mozilla/OpenWPM/blob/625d81460b25d68846de36ddeeab23c9731bbda6/openwpm/Extension/firefox/feature.js/index.js#L56-L93
Side note: OpenWPM-0.14.0 does not show any logging of the extension, but the logging in the extension is still the same. Was that done on purpose?
The log for a site looks like this:
Here, the JavaScript and HTTP instrument started recording after the Get-Command was executed. In my setup, every browser restarts after finishing a visit. That means each visit leads to a new chance that a browser starts before the instruments are fully up.
I found two causes for the slow initialisation. Often the JavaScript extension needs 500ms up to over 2 seconds. Also, the loading of the config file from the local file system can take up to 3 seconds. Note that this is exceptional: On 800 conducted crawls, three browser starts showed such a large delay.
Final thoughts:
It would be great if someone can verify this problem to exclude my local setup and hardware as a factor. However, I have already tested this on two different machines.
I guess ensuring that all instruments are up is a bit tricky and may takes time to implement. Whoever needs a quick (and a bit dirty) fix, I experimented with a small timeout of 3 seconds before a browser fires the get command. This resolves most of the cases for me. Out of my 88 cases before, I had three left.
The text was updated successfully, but these errors were encountered: