planet_n_spider v0.0.4 / dependency updates and CI/CD migration to Poetry v2.0 #125

Merged
merged 8 commits into develop from planet_n_cookiebanner
Jan 7, 2025

@Criamos Criamos commented Jan 7, 2025

This PR includes the following changes:

  • planet_n_spider v0.0.4: hide cookie-banners of the "Simple Cookie Control" WordPress Plugin
  • QoL / logging:
    • replaced the built-in Python logger with loguru in most non-spider classes (e.g., utility modules like the es_connector and the license_mapper) to make crawler logs more readable and helpful, especially for log messages that don't originate from Scrapy
    • spiders were left untouched and can keep using the built-in scrapy.Spider.logger (see Scrapy Docs: Logging from Spiders)
  • Poetry v2.0 Migration
    • Poetry v2.0.0 was released two days ago (January 5, 2025), which required updating our pyproject.toml file (see: pyproject.toml specification)
    • please make sure to update your Poetry installation to v2.0 the next time you open the project in your IDE! (you might need to restart your IDE afterwards)
      • (friendly reminder: this project requires Python 3.13)
  • fix: updated the browserless API Endpoint (see: converter/web_tools.py) for the most recent browserless-image in combination with playwright v1.49.1

- feat: hide the cookie banner in thumbnails (by forging the response to the cookie banner)
  - Planet-N uses the "Simple Cookie Control" WordPress Plugin (see: https://wordpress.com/plugins/simple-cookie-control), which sets a cookie (valid for 6 months) after the dialog has been accepted or declined
- see: https://github.com/browserless/browserless/pkgs/container/chrome/327417313?tag=v2.24.1
  - changelog: https://github.com/browserless/browserless/blob/main/CHANGELOG.md#v2241
- change: to keep the docker-compose.yml in sync with the Kubernetes Helm Chart for Staging and Prod, switching back to "headless/chrome" (instead of the recently used "headless/chromium" image)
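
A rough sketch of the cookie-forging idea from the "feat" commit above, assuming the cookie is injected through Playwright before the page is rendered. The cookie name and value are placeholders (the real ones set by the plugin are not stated here), `build_consent_cookie` is a hypothetical helper, and 183 days is only an approximation of the 6-month lifetime.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholders: the actual cookie name/value used by the plugin are NOT taken from this PR.
CONSENT_COOKIE_NAME = "cookie_consent_placeholder"
CONSENT_COOKIE_VALUE = "declined"


def build_consent_cookie(page_url: str, now: Optional[datetime] = None) -> dict:
    """Build a cookie dict in the shape that Playwright's BrowserContext.add_cookies() accepts."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=183)  # roughly 6 months
    return {
        "name": CONSENT_COOKIE_NAME,
        "value": CONSENT_COOKIE_VALUE,
        "url": page_url,  # Playwright derives domain and path from the URL
        "expires": expires.timestamp(),  # Unix time in seconds
    }
```

With a Playwright browser context at hand, the cookie would be set before navigating, e.g. `context.add_cookies([build_consent_cookie(url)])`, so that the banner is not rendered when the thumbnail screenshot is taken.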
@Criamos Criamos added enhancement New feature or request dependencies Pull requests that update a dependency file labels Jan 7, 2025
@Criamos Criamos self-assigned this Jan 7, 2025
@Criamos Criamos force-pushed the planet_n_cookiebanner branch from c6e3e46 to f0f01bc Compare January 7, 2025 13:42
- this fixes the playwright wheel warning that first occurred with v1.49.0 (see: https://github.com/microsoft/playwright-python/releases/tag/v1.49.1)
- the recent browserless update apparently changed the API endpoint from "ws://localhost:3000" to "ws://localhost:3000/chrome/playwright", which can be accessed via playwright's "BrowserType.connect()"-method
  - the "connect_over_cdp()"-method seems no longer necessary
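
A minimal sketch of the endpoint change described above, assuming a local browserless container on port 3000 and Playwright's sync API (the project itself may use the async API). `build_ws_endpoint` and `fetch_title` are hypothetical helpers, not functions from converter/web_tools.py.

```python
from urllib.parse import urlsplit, urlunsplit

# Path taken from the commit message above.
PLAYWRIGHT_PATH = "/chrome/playwright"


def build_ws_endpoint(base_url: str = "ws://localhost:3000") -> str:
    """Append the Playwright path to the browserless base URL, keeping any query string."""
    parts = urlsplit(base_url)
    return urlunsplit((parts.scheme, parts.netloc, PLAYWRIGHT_PATH, parts.query, ""))


def fetch_title(page_url: str, base_url: str = "ws://localhost:3000") -> str:
    """Connect via BrowserType.connect() (instead of connect_over_cdp()) and read the page title."""
    # Imported lazily so that the endpoint helper works without playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect(build_ws_endpoint(base_url))
        try:
            page = browser.new_page()
            page.goto(page_url)
            return page.title()
        finally:
            browser.close()
```

Calling `fetch_title("https://example.org")` requires a running browserless container; `build_ws_endpoint()` on its own only assembles the URL.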

for future reference:
- the docs at https://docs.browserless.io/#connect-using-puppeteer-or-playwright appear to be outdated and don't reflect these recent changes
  - use "http://localhost:3000/docs" to look up the most-current documentation for the browserless playwright integration instead
@Criamos Criamos marked this pull request as ready for review January 7, 2025 16:50
@Criamos Criamos merged commit 2a88cf7 into develop Jan 7, 2025
3 checks passed
@Criamos Criamos deleted the planet_n_cookiebanner branch January 7, 2025 16:51