Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Visit_id is not set at the beginning of some page visits. #123

Closed
dreisman opened this issue Dec 15, 2016 · 6 comments
Closed

Visit_id is not set at the beginning of some page visits. #123

dreisman opened this issue Dec 15, 2016 · 6 comments
Labels
bug extension Relates to the WebExtension written in TS and JS needs-investigation

Comments

@dreisman
Copy link
Contributor

dreisman commented Dec 15, 2016

If a socket closes improperly, then calls to sock.send() will fail, and the error won't get caught. This will cause the TaskManager to die without failing gracefully (for instance, the TaskManager won't be able to save the browser profile).

An example of this in the wild. The first stack trace was caused by an error in the DataAggregator, and the second is the actual socket error. It is unknown if the two are related.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/multiprocess/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/multiprocess/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/[REMOVED]/OpenWPM/automation/DataAggregator/DataAggregator.py", line 56, in DataAggregator
    process_query(query, curr, logger)
  File "/home/ubuntu/[REMOVED]/OpenWPM/automation/DataAggregator/DataAggregator.py", line 89, in process_query
    curr.execute(statement,args)
IntegrityError: NOT NULL constraint failed: http_requests.visit_id
Traceback (most recent call last):
  File "/home/ubuntu/[REMOVED]/census_crawl_id_detection.py", line 70, in <module>
    manager.execute_command_sequence(command_sequence)
  File "/home/ubuntu/[REMOVED]/OpenWPM/automation/TaskManager.py", line 528, in execute_command_sequence
    self._distribute_command(command_sequence, index)
  File "/home/ubuntu/[REMOVED]/OpenWPM/automation/TaskManager.py", line 367, in _distribute_command
    thread = self._start_thread(browser, command_sequence)
  File "/home/ubuntu/[REMOVED]/OpenWPM/automation/TaskManager.py", line 424, in _start_thread
    (self.next_visit_id, browser.crawl_id, command_sequence.url)))
  File "/home/ubuntu/[REMOVED]OpenWPM/automation/SocketInterface.py", line 133, in send
    sent = self.sock.send(msg[totalsent:])
socket.error: [Errno 32] Broken pipe
@englehardt
Copy link
Collaborator

This doesn't seem to be an issue with the socket (or the DataAggregator). The original error was IntegrityError: NOT NULL constraint failed: http_requests.visit_id which causes the DataAggregator to crash. Once that fails, any socket that tries to send information to the aggregator will throw a Broken pipe exception. Both of these are expected (and desired for critical errors like a null visit id).

So the question is, under what condition are we sending a null visit_id?

@englehardt
Copy link
Collaborator

Let's use the string "null" when the visit_id is null and see where in the new crawls it's set to null. I imagine this will largely be non-essential requests that we'd want to drop anyway (i.e. noise from the browser that we should disable). @dreisman, will you be able to commit this?

@englehardt
Copy link
Collaborator

This was implemented in 0ec97d1. Let's check the latest crawl data to see which urls have this issue.

@dreisman dreisman changed the title Exceptions when calling sock.send() can cause TaskManager to completely die, with no graceful recovery Visit_id is not set at the beginning of some page visits. Jun 19, 2017
@dreisman
Copy link
Contributor Author

In converting the March 2017 25k id detection crawl, I found exactly one http_requests row that had visit_id set to -1. It was the first request of a page load (visit_id=6052, top_url='http://supersport.com'). We believe this is the result of a race condition in the extension, where the visit_id is not set before it starts getting the first page when the browser restarts.

We also are afraid that there may be a case where the visit_id is set to the an old visit id for a new site_visit, and we aren't catching that.

Attempt to measure:

From the 2017-03 id detection, it looks like there are about 200 sites out of the 25,000 for which this MAY have happened. These are 200 sites for which we do not see a row where top_url == url in http_requests.

Ideally we may want the browser to block until the extension confirms it got a new visit_id, before visiting the new page. See #135.

In the meantime, it be best to drop these rows if it's happening on a small enough scale. The problem seems to only affect the first page load request.

@englehardt
Copy link
Collaborator

Proposed fix is in #135.

@vringar
Copy link
Contributor

vringar commented May 7, 2020

While this error report is still valid and unfixed the fix for it is begin discussed in another issue so this one doesn't provide any additional value. Especially seeing how the observable behaviour is completely different by now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug extension Relates to the WebExtension written in TS and JS needs-investigation
Development

No branches or pull requests

3 participants