
Consider supporting retry mechanism #40

Open
mingyc opened this issue Sep 30, 2022 · 5 comments
Labels
api Issue with API specs

Comments

@mingyc (Collaborator) commented Sep 30, 2022

Offline discussion suggests that the API should support some retry mechanism; otherwise PendingBeacon is too unreliable.

  1. bool field option {retry: true}: this option lets the browser decide how to retry the failed beacon request.

  2. number field option {retries: 10}: this option lets the user specify how many times the browser should retry the failed request. However, it might be very difficult to implement: by default, beacons are sent out on page discard, and there really isn't much time left for a page to retry that many times, unless we are willing to let the browser queue a failed request for longer (a privacy concern). See the sketch after this list.
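A minimal sketch of the two proposed shapes, assuming the PendingGetBeacon constructor from the explainer; neither the retry nor the retries option exists in any spec:

```js
// Hypothetical option shapes from the two proposals above (not specced).

// Option 1: a bool that lets the browser decide how to retry.
const b1 = new PendingGetBeacon('/analytics', { retry: true });

// Option 2: a number that bounds how many times the browser may retry.
const b2 = new PendingGetBeacon('/analytics', { retries: 10 });
```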

Note to @fergald: Looking at the spec of the Reporting API, it doesn't really support a retry mechanism:

> We don’t specify any retry mechanism here for failed reports. We may want to add one here, or provide some indication that the delivery failed.

@horo-t commented Oct 13, 2023

FYI: Chromium has a retry mechanism for the Reporting API. See this CL and this max_report_attempts flag.

@fergald (Collaborator) commented Oct 13, 2023

Retries are difficult. We are restricted in when we are allowed to start fetches. If there are no instances of the origin alive or just recently closed, it could violate user expectations to retry a network request created by that origin. If background-sync is enabled for the origin, Chrome's privacy team is OK with it, but in other browsers we can only initiate a request while the origin has a live instance or just as one is closing.

It is not possible to retry a fetch initiated from an unload handler, so this is equally (un)reliable.

I think the TL;DR of my argument is that if you show me some existing code that does a delayed fetch, I can show you how to add a fetchLater that will make it more reliable and never less reliable.


For sites that have not already implemented retries using fetch, I think replacing those fetches with fetchLater results in an increase in reliability.

If someone cares about maximizing reliability, presumably they are already

  1. doing everything via a service worker and maybe even storage, or
  2. manually managing timeouts and doing fetches with retry logic, while also paying attention to visibilitychange events.

We can ignore 1, since these people don't need fetchLater.

For 2, with fetchLater available, for every piece of data they want to send after a timeout or visibilitychange event, they would create a fetchLater request that they later cancel. The fetchLater request would only ever actually fetch in a crash.

So for 2, this gives a strict increase in reliability.
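A minimal sketch of that pattern, assuming the fetchLater() shape from the explainer; sendWithRetries() is a hypothetical stand-in for the site's existing timeout/visibilitychange-driven retry logic:

```js
// Register a deferred request as a crash-safety net, deliver the data
// through the app's own retry logic, then cancel the deferred request
// once delivery is confirmed.
const payload = JSON.stringify({ event: 'pageview' }); // illustrative data
const controller = new AbortController();

// Safety net: only actually sent if the page is torn down (e.g. a crash)
// before we cancel it.
fetchLater('/report', {
  method: 'POST',
  body: payload,
  signal: controller.signal,
});

// Normal path: sendWithRetries() is hypothetical; it stands in for the
// existing fetch-with-retry code the site already maintains.
sendWithRetries('/report', payload).then(() => {
  controller.abort(); // delivered, so the deferred request never fires
});
```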


Adding automatic retries would allow those in 2 to remove retry logic. That seems nice but is not a requirement.

@mreinstein commented Jan 8, 2024

> If you show me some existing code that does a delayed fetch, I can show you how to add a fetchLater that will make it more reliable and never less reliable.

That is accurate, but it's also a false dichotomy; delayed fetch vs. fetchLater are not the only two strategies out there. Another option is spamming events UDP-style to maximize delivery reliability. That's the one I've actually seen used, because delayed send (on the web) doesn't really work given all the problems with page lifecycle combined with network failures.

The problem is that network failures do happen, even in non-crash scenarios, especially in mobile contexts. Losing data sent via PendingBeacon (or the new fetchLater API) due to network failures isn't always recoverable.

In my use case, I need to report how long someone stays on a given page. The solution I've inherited is that the page sends a "heartbeat" gif request every X seconds. On the analytics side, I look at the last one delivered and compare that to the first. It is extremely reliable because it works similarly to UDP (we send so many .gif requests that most end up delivered).
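A minimal sketch of that heartbeat pattern; the endpoint name, session-id scheme, and 5-second interval are all illustrative:

```js
// Fire-and-forget heartbeat: each beat is a tiny GET for a tracking gif.
// Losing any single beat is tolerable because dwell time is computed
// server-side as (timestamp of last beat seen) - (timestamp of first).
const sessionId = crypto.randomUUID();
setInterval(() => {
  new Image().src = `/heartbeat.gif?sid=${sessionId}&t=${Date.now()}`;
}, 5000);
```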

Obviously this sucks for performance reasons. I have (not exaggerating) a few billion of these requests per year. I would love to stop doing this, but every attempt to replace it with beacon transport fails:

  • The page lifecycle events are not reliable, even with clever, well-designed polyfills like https://github.com/GoogleChromeLabs/page-lifecycle.
  • The 64 KB per-origin limit means I can't just replace each http .gif request with a beacon, especially when there are several 3rd-party scripts running on a page and sharing that 64 KB quota.
  • I can't build my own retry logic because my integration is a 3rd-party script; I can't declare service workers or put a lot of data into local storage on these domains. I also can't depend on a user returning to a given domain; some people may only visit once and never return.

I've run analytics for gif-based and beacon-based transports side by side, and while it varies from day to day, the beacon message counts always come in lower than their .gif counterparts, by anywhere from 1–10% when summing up billions of messages. It's a big enough data loss that I can't get this solution past the data/management people.

> If there are no instances of the origin alive or just recently closed, it could violate user expectations to retry a network request created by that origin

I can appreciate that sentiment, but in the case of people building telemetry/analytics at large scale, especially as a 3rd-party vendor on a web page, I don't think PendingBeacon or fetchLater can work without the browser making its best effort to ensure these things send, and that includes some kind of retry mechanism for network failures.

@mreinstein commented

> If there are no instances of the origin alive or just recently closed, it could violate user expectations to retry a network request created by that origin

I think it's also important to add that the current constraints of the new fetchLater API already violate this expectation:

[Screenshot of the "What's not supported" section of Chromium's fetch-later experiment doc]

(from https://chromium.googlesource.com/chromium/src/+/main/docs/experiments/fetch-later.md#what_s-not-supported )

@mingyc (Collaborator, Author) commented Jan 19, 2024

> > If there are no instances of the origin alive or just recently closed, it could violate user expectations to retry a network request created by that origin
>
> I think it's also important to add that the current constraints of the new fetchLater API already violate this expectation:

There has been a long discussion around this topic (user expectation), and the current decision (#30 (comment)) is that the fetchLater API should not send out any requests if no other same-origin page is open (in other tabs/iframes, etc.).

The Chromium implementation for the OT (origin trial) is even stricter: the browser flushes out all pending fetchLater requests for a document when it enters BFCache.

Hence, the fact that a request is not observable in DevTools (after the initiator document is closed) should not be a privacy issue.
