50x errors and open files limit #20

Open
nikolaevigor opened this issue Feb 4, 2019 · 2 comments

@nikolaevigor

Hi,

I am using aquarium to scrape some data from websites. My configuration is:

  • 8 CPU/40GB RAM GCP instance
  • 8 splashes; 5000 MB maxrss limit; 5 slots

For several lists of sites I am experiencing issues. The Scrapy logs show the following:

<container_name> | 2019-02-04 13:23:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET <site_name> via http://172.22.0.1:8050/render.json> (failed 1 times): 503 Service Unavailable

for the first 20-30 URLs; then Scrapy successfully scrapes about 3 to 5 URLs, and then there are again 20 to 30 503 errors. There are also 502 and 504 errors, but in smaller numbers.
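
For reference, the spider goes through scrapy-splash and points at the aquarium HAProxy endpoint from the log above; the relevant settings look roughly like this (standard scrapy-splash wiring as described in its README, nothing custom):

```python
# settings.py -- standard scrapy-splash wiring, pointed at the aquarium
# HAProxy endpoint seen in the retry log above.
SPLASH_URL = 'http://172.22.0.1:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```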

At the same time I see the following logs from aquarium:

splash0_1 | 2019-02-04 13:23:50.346828 [-] Open files limit: 1048576
splash0_1 | 2019-02-04 13:23:50.346965 [-] Can't bump open files limit

Also, I don't know if it's important, but the user that starts the Docker process has soft and hard open files limits of 1024 and 4096 respectively.
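
For reference, the effective limits can be checked from inside a container with the standard library resource module (a generic check, not something aquarium provides):

```python
# Print the soft/hard open-files limits the current process inherited;
# run inside the Splash container (or on the host) to compare the two.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files limit: soft={soft}, hard={hard}")
```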

At the end of the crawl the stats are as follows:

{'categories': {'gallery': 3426, 'story': 11157},
 'downloader/exception_count': 3749,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 18,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 125,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3606,
 'downloader/request_bytes': 73857486,
 'downloader/request_count': 71893,
 'downloader/request_method_count/GET': 655,
 'downloader/request_method_count/POST': 71238,
 'downloader/response_bytes': 31062040791,
 'downloader/response_count': 68144,
 'downloader/response_status_count/200': 15258,
 'downloader/response_status_count/404': 3,
 'downloader/response_status_count/502': 8621,
 'downloader/response_status_count/503': 43785,
 'downloader/response_status_count/504': 477,
 'dupefilter/filtered': 666,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 2, 2, 14, 27, 19, 916564),
 'httperror/response_ignored_count': 16396,
 'httperror/response_ignored_status_count/404': 3,
 'httperror/response_ignored_status_count/502': 2399,
 'httperror/response_ignored_status_count/503': 13881,
 'httperror/response_ignored_status_count/504': 113,
 'item_scraped_count': 14583,
 'log_count/DEBUG': 102871,
 'log_count/ERROR': 685,
 'log_count/INFO': 17535,
 'log_count/WARNING': 148,
 'memusage/max': 192880640,
 'memusage/startup': 47513600,
 'request_depth_max': 2,
 'response_received_count': 31654,
 'retry/count': 39577,
 'retry/max_reached': 17055,
 'retry/reason_count/502 Bad Gateway': 6222,
 'retry/reason_count/503 Service Unavailable': 29904,
 'retry/reason_count/504 Gateway Time-out': 364,
 'retry/reason_count/twisted.internet.error.TimeoutError': 16,
 'retry/reason_count/twisted.web._newclient.ResponseFailed': 99,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 2972,
 'scheduler/dequeued': 103554,
 'scheduler/dequeued/memory': 103554,
 'scheduler/enqueued': 103554,
 'scheduler/enqueued/memory': 103554,
 'spider_exceptions/Exception': 23,
 'splash/render.json/request_count': 31661,
 'splash/render.json/response_count/200': 14606,
 'splash/render.json/response_count/502': 8621,
 'splash/render.json/response_count/503': 43785,
 'splash/render.json/response_count/504': 477,
 'start_time': datetime.datetime(2019, 2, 1, 19, 34, 43, 765605)}

At the same time, other sites have been scraped successfully with the same setup.

Also, after a successful crawl there are only around 100k files in the output folder, so even if Scrapy does not close all of the files it opens, I don't see why the 1 million open files limit would need to be bumped.

What could be the issue?

@lopuhin commented Feb 4, 2019

I think the "Can't bump open files limit" message in the logs is harmless and not related to the 503s.
If the 503s are coming from the site itself rather than from Splash (I presume this is the case?), then it could be due to scraping too aggressively.
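
If that turns out to be the case, lowering concurrency and enabling Scrapy's built-in AutoThrottle usually helps; a settings sketch (the numbers are only a starting point and depend on the target site, aquarium itself does not require them):

```python
# settings.py -- a throttling sketch using Scrapy's built-in AutoThrottle;
# the concrete values are illustrative, tune them per site.
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 1.0

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```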

@nikolaevigor (Author)

I have concurrent_requests set to 16. Could 8*16 = 128 concurrent requests count as "too aggressive", keeping in mind that aquarium uses Tor (I suppose to avoid emitting all requests from one IP)?

Either way, I reduced concurrent requests from 16 to 8 and increased splash_wait from 10 to 50. At first sight this reduced the number of 50x errors, but there are still some runs of 20-30 50x responses, just less often. I will keep an eye on the final results after the crawl finishes.
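
For reference, the change amounts to roughly the following (assuming splash_wait feeds the wait argument of render.json; the exact wiring in my project may differ):

```python
# A sketch of the reduced-concurrency / longer-wait setup, assuming
# splash_wait maps to the `wait` argument of Splash's render.json endpoint.
import scrapy
from scrapy_splash import SplashRequest


class SiteSpider(scrapy.Spider):
    name = 'site'
    custom_settings = {'CONCURRENT_REQUESTS': 8}  # was 16
    start_urls = ['http://example.com']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                callback=self.parse,
                endpoint='render.json',
                # was 10; a large wait may also require raising the `timeout`
                # arg and Splash's --max-timeout, since wait must stay below it.
                args={'wait': 50},
            )

    def parse(self, response):
        pass  # actual item extraction goes here
```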

Another question: if I see this in the logs:

2019-02-04 20:41:10 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET <url> via http://172.22.0.1:8050/render.json> (failed 3 times): 503 Service Unavailable

Does it mean that the resource returned a 503, or is it a status code from Splash? In the aquarium logs I see that containers are restarting pretty often. Could it be that the request is routed to a container that is in the process of restarting?
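
One rough way I can think of to tell the two apart is to look at the body of the 503: HAProxy's stock error page contains "No server is available to handle this request", while a 503 from the target site (or from Splash itself) comes back with a different body. A heuristic sketch, not something aquarium or scrapy-splash exposes:

```python
# Heuristic: classify 503s by their body to see whether HAProxy (i.e. no
# Splash backend was ready, e.g. a restarting container) or the site sent them.
import scrapy
from scrapy_splash import SplashRequest


class Debug503Spider(scrapy.Spider):
    name = 'debug_503'
    # Let 503s reach the callback instead of being dropped by HttpErrorMiddleware.
    handle_httpstatus_list = [503]

    def start_requests(self):
        # dont_retry makes RetryMiddleware hand the 503 straight to the spider.
        yield SplashRequest(
            'http://example.com',  # placeholder
            callback=self.parse,
            endpoint='render.json',
            meta={'dont_retry': True},
        )

    def parse(self, response):
        if response.status == 503:
            if b'No server is available' in response.body:
                self.logger.warning('503 from HAProxy: %s', response.url)
            else:
                self.logger.warning('503 from Splash or the target site: %s', response.url)
            return
        # ... normal parsing here ...
```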
