50x errors and open files limit #20
> I think that […]
I have concurrent_requests set to 16. Could 8*16 = 128 concurrent requests count as "too aggressively", keeping in mind that aquarium uses Tor (I suppose to avoid emitting all requests from a single IP)? Either way, I reduced concurrent requests from 16 to 8 and increased splash_wait from 10 to 50. At first sight this reduced the number of 50x errors, but I still get bursts of 20-30 50x responses, just less often. I will keep an eye on the final results once the crawl finishes.
Another question: if I see this in the logs:
Does it mean that the resource itself returned 503, or is it a status code from Splash? In the aquarium logs I see that containers are restarting pretty often, so could it be that the request is routed to a container that is in the process of restarting?
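For context, here is a minimal sketch of where these knobs live on the Scrapy side; the values are just the ones mentioned in this thread, and SPLASH_URL is an assumption taken from the render.json URL in the log below:

```python
# settings.py -- sketch of the settings discussed above (values from this
# thread; SPLASH_URL is an assumption taken from the retry log below).
CONCURRENT_REQUESTS = 8                  # reduced from 16
SPLASH_URL = "http://172.22.0.1:8050"    # aquarium endpoint seen in the log

# "splash_wait" presumably ends up as Splash's 'wait' argument; with
# scrapy-splash that is usually passed per request rather than globally,
# e.g. SplashRequest(url, callback, args={"wait": 50}).
```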
Hi,
I am using aquarium to scrape some data from websites. My configuration is:
For several of the site lists I am experiencing some issues. The Scrapy logs show the following:
<container_name> | 2019-02-04 13:23:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET <site_name> via http://172.22.0.1:8050/render.json> (failed 1 times): 503 Service Unavailable
for the first 20-30 URLs; then Scrapy scrapes successfully for about 3 to 5 URLs, and then there are again 20 to 30 503 errors. There are also 502 and 504 errors, but in smaller numbers.
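Those retry lines look like what scrapy-splash produces when requests go through the /render.json endpoint; a minimal sketch of such a spider (the spider name, URL, and wait value are made up for illustration):

```python
import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate how requests end up as
    # "GET <site> via http://<splash-host>:8050/render.json" in the retry log.
    name = "example"
    start_urls = ["https://example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # endpoint="render.json" matches the endpoint in the log;
            # "wait" is how long Splash waits before returning the page.
            yield SplashRequest(url, self.parse, endpoint="render.json",
                                args={"wait": 10})

    def parse(self, response):
        # A 503 here is the status returned for the /render.json call, so it
        # can come from an overloaded or restarting Splash instance rather
        # than from the target site itself.
        self.logger.info("Got %s for %s", response.status, response.url)
```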
At the same time I see the following logs from aquarium:
splash0_1 | 2019-02-04 13:23:50.346828 [-] Open files limit: 1048576
splash0_1 | 2019-02-04 13:23:50.346965 [-] Can't bump open files limit
Also, I don't know if it's important, but the user that starts the Docker process has soft and hard open files limits of 1024 and 4096 respectively.
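In case it's related: the per-container limit can also be set explicitly in the Compose file; a sketch assuming an aquarium-style docker-compose.yml with a splash0 service (the service name and numbers are illustrative, not taken from this issue):

```yaml
# docker-compose.yml (sketch) -- raise the open-files limit for one Splash
# service; service name and values are illustrative.
services:
  splash0:
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
```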
At the end of the scrape the results are as follows:
At the same time, other sites have been scraped successfully with the same setup.
Also, on a successful scrape there are only around 100k files in the output folder, so even if Scrapy does not close all of the files it opens, I don't see why the 1 million open files limit would need to be bumped.
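One way to check whether a Splash container actually gets anywhere near that limit is to count its open descriptors; a sketch, with the container name being an assumption:

```sh
# Count open file descriptors of PID 1 inside a Splash container
# (replace the container name with the real one from `docker ps`).
docker exec <project>_splash0_1 sh -c 'ls /proc/1/fd | wc -l'
```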
What could be the issue?