AdaptivePlaywrightCrawler issues found during migration to Python #2798

Open
Pijukatel opened this issue Jan 8, 2025 · 0 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

This is a collection of issues and possible improvements discovered while re-implementing AdaptivePlaywrightCrawler in Python.

  1. Ensure isolation of contexts for static / client only browsing.
    Example situation: The rendering type predictor decides that both crawling methods should be used, so the user handler runs twice. The user can modify the context in the handler, for example "user_data". As a result, the second run of the handler can end up working on an already modified context.

  2. The default result comparator checks only dataset changes. Maybe also compare the added links. This has to be done carefully, though, because some sites generate additional URL options (such as extra query parameters) when crawled with a browser. Example of the "same" link:

Static: https://sdk.apify.com/docs/guides/getting-started
Browser: https://sdk.apify.com/docs/guides/getting-started?__hsfp=1136113150&__hssc=7591405.1.1735494277124&__hstc=7591405.e2b9302ed00c5bfaee3a870166792181.1735494277124.1735494277124.1735494277124.1
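One possible way to compare added links despite such differences is to normalize URLs before comparing them. The sketch below is only an illustration using the standard library, not the crawler's actual comparator: `normalize_url`, `same_links` and the `TRACKING_PREFIXES` list are hypothetical names, and which query parameters count as noise is an assumption that would need to be configurable.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters starting with these prefixes are tracking noise
# added by the browser session and do not identify a different page.
TRACKING_PREFIXES = ("__hs", "utm_")


def normalize_url(url: str) -> str:
    """Return the URL with assumed tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def same_links(static_links: list[str], browser_links: list[str]) -> bool:
    """Compare two link collections as sets of normalized URLs."""
    return {normalize_url(u) for u in static_links} == {
        normalize_url(u) for u in browser_links
    }
```

With this normalization, the two example links above would be treated as the same link, while links that differ in a meaningful parameter (for example `page=2`) would still be reported as different.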

  3. Document the possible edge case of undesired mutation of global state.
    When static crawling fails, browser crawling is used as a backup. If the context.use_state method was already used during static crawling, the global state may already have been modified by the failed static run.
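The isolation concerns in items 1 and 3 could be sketched roughly as follows. This is a minimal illustration and not the crawlee implementation: `FakeContext`, `run_isolated` and `crawl_with_fallback` are hypothetical names, and the plain `state` dict is only a stand-in for whatever `context.use_state` returns. The idea is that each handler run gets its own deep copy, and changes are committed only from a run that succeeded.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class FakeContext:
    """Hypothetical stand-in for a crawling context (not the real class)."""

    user_data: dict[str, Any] = field(default_factory=dict)
    # Assumption: stands in for the global state exposed via use_state.
    state: dict[str, Any] = field(default_factory=dict)


Handler = Callable[[FakeContext], None]


def run_isolated(handler: Handler, context: FakeContext) -> Optional[FakeContext]:
    """Run the handler on a deep copy; return the copy, or None on failure."""
    sandbox = copy.deepcopy(context)
    try:
        handler(sandbox)
    except Exception:
        return None  # Mutations made before the failure are discarded.
    return sandbox


def crawl_with_fallback(
    static_handler: Handler, browser_handler: Handler, context: FakeContext
) -> bool:
    """Try the static run first, then the browser run; commit only a success."""
    result = run_isolated(static_handler, context)
    if result is None:
        result = run_isolated(browser_handler, context)
    if result is None:
        return False
    context.user_data = result.user_data
    context.state = result.state
    return True
```

In this sketch a handler that runs twice sees the same unmodified input both times, and a static run that fails after touching the state leaves the original state untouched. Deep copying may be too expensive or impossible for some context members, so a real solution would likely need a more selective approach.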

TBD ... more will be added during migration

@Pijukatel Pijukatel added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 8, 2025
@Pijukatel Pijukatel added this to the 4.0 milestone Jan 8, 2025