Use a headless browser to allow content to be rendered dynamically #138

Closed
Ndpnt opened this issue Sep 21, 2020 · 4 comments
Labels
RFC Request for comments

Comments

@Ndpnt
Member

Ndpnt commented Sep 21, 2020

For example, Vimeo renders its content with a script.
See https://github.com/ambanum/CGUs-versions/blob/master/Vimeo/Privacy%20Policy.md

@LVerneyPEReN
Contributor

Note that using a headless browser is probably more user-friendly, but will require quite a lot of extra processing time to fetch the documents.

In this particular case, the source code of https://vimeo.com/privacy contains a #vimeo_onetrust_page_src element with a data-url that points to the JSON endpoint https://appds8093.blob.core.windows.net/c7e704ed-33be-4f93-85f9-373c45916aeb/privacy-notices/a4ee449f-2400-479a-98b7-ba27050904b4.json. That JSON, in turn, has a policyUrl key pointing to https://appds8093.blob.core.windows.net/c7e704ed-33be-4f93-85f9-373c45916aeb/privacy-notices/a4ee449f-2400-479a-98b7-ba27050904b4-en-us.json, which holds the actual privacy policy content.
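To make that chain concrete, here is a minimal sketch in Python of following it without a headless browser. It assumes the element carries its endpoint in a data-url attribute and that the first JSON document exposes policyUrl at its top level. The names resolve_policy, DataUrlFinder and fetch_json are hypothetical and only serve the illustration. The shape of the final JSON is not described here, so the sketch returns it as decoded, without interpreting it.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class DataUrlFinder(HTMLParser):
    """Record the data-url attribute of the first element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.data_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.data_url is None and attributes.get("id") == self.element_id:
            self.data_url = attributes.get("data-url")


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def resolve_policy(page_html, element_id="vimeo_onetrust_page_src", fetch=fetch_json):
    """Follow page -> data-url JSON -> policyUrl JSON and return the last document."""
    finder = DataUrlFinder(element_id)
    finder.feed(page_html)
    if finder.data_url is None:
        raise LookupError(f"no #{element_id} element with a data-url attribute")
    index = fetch(finder.data_url)      # first hop: JSON holding the policyUrl key
    return fetch(index["policyUrl"])    # second hop: JSON holding the policy itself
```

Every service would need its own variant of this logic, which is the extra processing a headless browser would avoid.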

In my experience, there is a balance to be found between adding too much processing logic and having a user-friendly way of fetching policies (at the expense of performance). Considering the current code base, I'd favor the latter.

@LVerneyPEReN
Contributor

FYI, we noticed that Wish has the same requirement, see https://github.com/TomHouriezDGE/CGUs/blob/f9818cfa6b218ff4cc6862b170a4a3285d05b1a8/services/Wish.json (requires a headless browser).

@MattiSG
Member

MattiSG commented Oct 16, 2020

After discussing with @Ndpnt @LucasVerneyDGE @clementbiron @TomHouriezDGE, we agreed that:

  • Using a headless browser should be opt-in, at the document level.
  • The headless browser is only an alternative way to fetch the content, not to manipulate the DOM. The existing pipeline should be preserved and used for all subsequent handling (select / remove / filter). This could be done by loading the page in the headless browser and serialising the resulting DOM.
  • The key should be executeClientScripts: true. It should be optional, default to false, and be written after the fetch property and before the select property in the document declaration.
  • For implementation, any headless browser can be used. Puppeteer was preferred by @TomHouriezDGE. More abstract layers such as Selenium or Playwright are needlessly complex for the given use case.
  • For stability, it is probably best to wait, with a timeout, until all the elements we have selectors for (at least select, and if possible remove as well) are present before serialising the DOM. Automatic retries could be added.
  • Documentation (in CONTRIBUTING.md) should be updated to include this new option.
  • We expect tests to be provided and are ready to help with designing them if needed 🙂
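
The opt-in and the wait-with-timeout points above can be sketched as follows, in Python for brevity. This is an illustration under assumptions, not the implementation: it assumes select and remove hold either a single CSS selector string or a list of them, and the names selectors_to_await, wait_for_selectors, fetch_document, fetch_static, fetch_rendered and is_present are all hypothetical. Only executeClientScripts, fetch, select and remove come from the points above.

```python
import time


def selectors_to_await(declaration):
    """Collect the selectors from `select` and `remove`, in that order."""
    selectors = []
    for key in ("select", "remove"):
        value = declaration.get(key, [])
        # Assumption: a value is either one selector string or a list of them.
        selectors.extend([value] if isinstance(value, str) else value)
    return selectors


def wait_for_selectors(is_present, selectors, timeout=5.0, interval=0.1):
    """Poll until every selector matches; return False once the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if all(is_present(selector) for selector in selectors):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def fetch_document(declaration, fetch_static, fetch_rendered):
    """Use the headless browser only when the declaration opts in."""
    # The key is optional and defaults to false.
    if declaration.get("executeClientScripts", False):
        return fetch_rendered(declaration["fetch"], selectors_to_await(declaration))
    return fetch_static(declaration["fetch"])
```

Both fetchers return serialised HTML, so the existing select / remove / filter pipeline runs unchanged on the result. With Puppeteer, the polling loop would likely be replaced by page.waitForSelector with its timeout option, followed by page.content() to serialise the DOM.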

Great, can't wait to land this feature! 😃

@MattiSG MattiSG added the RFC Request for comments label Oct 16, 2020
@Ndpnt
Member Author

Ndpnt commented Oct 28, 2020

Implemented in #183

@MattiSG MattiSG closed this as completed Nov 9, 2020