Optimize scraping with threads #192

Merged

Crinibus
Owner

@Crinibus Crinibus commented Nov 8, 2022

Update so that different domains can be scraped at the same time, while the products within each domain are scraped sequentially.

E.g. if there are four products to be scraped from two different domains (two per domain), then one product from each domain is scraped at the same time, and after the request_delay the next product from each domain is scraped at the same time.

Function "get_products_grouped_by_domain":
- Create Scraper instances and group them in a dictionary where the key is the domain name and the value is the Scraper instances that scrape products from that domain
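
A minimal sketch of the grouping described above, not the code from this PR. The `Scraper` class here is a hypothetical stand-in with only a `url` attribute, and `domain_of` and `group_scrapers_by_domain` are illustrative names; deriving the domain from the URL with `urllib.parse` is an assumption about how the domain could be determined.

```python
from collections import defaultdict
from urllib.parse import urlparse


class Scraper:
    """Hypothetical stand-in for the project's Scraper class."""

    def __init__(self, url: str) -> None:
        self.url = url


def domain_of(url: str) -> str:
    """Illustrative helper: return the host of a URL without a leading 'www.'."""
    netloc = urlparse(url).netloc
    return netloc[4:] if netloc.startswith("www.") else netloc


def group_scrapers_by_domain(urls):
    """Create one Scraper per URL and group them in a dict keyed by domain."""
    grouped = defaultdict(list)
    for url in urls:
        grouped[domain_of(url)].append(Scraper(url))
    # Return a plain dict so a missing domain raises KeyError instead of
    # silently creating an empty group.
    return dict(grouped)
```

Keeping insertion order inside each group means the products of a domain are later scraped in the order they were listed.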
Import functions:
- start_threads_sequentially
- get_products_df_grouped_by_domains
- get_products_grouped_by_domain
- group products df by domain
- create threads for the grouped products and group the threads by domain
- make master threads to run scraper threads for each domain sequentially
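
The last two steps can be sketched roughly as follows, using only the standard `threading` and `time` modules. This is an illustration under assumptions, not the merged implementation: `run_threads_one_by_one` and `scrape_domains_concurrently` are hypothetical names, and `threads_by_domain` is assumed to be a dict mapping a domain name to a list of unstarted scraper threads.

```python
import threading
import time


def run_threads_one_by_one(threads, request_delay):
    """Run the threads of a single domain sequentially.

    Each thread is started only after the previous one has finished,
    with a pause of request_delay seconds between two requests.
    """
    for index, thread in enumerate(threads):
        if index > 0:
            time.sleep(request_delay)
        thread.start()
        thread.join()


def scrape_domains_concurrently(threads_by_domain, request_delay):
    """Start one master thread per domain and wait for all of them.

    Domains run in parallel; within a domain the scraper threads run in order.
    """
    masters = [
        threading.Thread(target=run_threads_one_by_one, args=(threads, request_delay))
        for threads in threads_by_domain.values()
    ]
    for master in masters:
        master.start()
    for master in masters:
        master.join()
```

With this layout the total run time is bounded by the slowest domain rather than by the sum over all products, while each individual domain still sees at most one request at a time.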
@Crinibus Crinibus marked this pull request as ready for review January 30, 2023 18:45
@Crinibus Crinibus merged commit 7fd8570 into master Jan 30, 2023
@Crinibus Crinibus deleted the update-scrape-with-threads-group-products-df-by-domains branch January 30, 2023 18:45