Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NUTCH-3072 Fetcher to stop QueueFeeder if aborting with "hung threads" #832

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sebastian-nagel
Copy link
Contributor

  • Fetcher signals QueueFeeder (via FetchItemQueues) to stop queuing and waits 1.5 seconds for this to happen
  • QueueFeeder (if still alive) discards remaining items with QueuingStatus "HIT_BY_TIMEOUT"
  • add number of items skipped by QueueFeeder to "FetcherStatus" counters
  • mark variables defined in FetchItemQueues but accessed from QueueFeeder thread as "volatile"
  • minor code cleanups

The solution was tested the following way:

  • the local Apache httpd was configured to respond with "HTTP 429 Too many requests" on a specific path
  • a seed list containing URLs with mostly URLs on this path on localhost was created and converted into a segment using the "freegen" tool
  • Fetcher was run with a single fetcher thread, a short queue and a high timeout divisor (= short timeout):
    nutch fetch -Dfetcher.queue.depth.multiplier=2 -Dfetcher.threads.timeout.divisor=20 path/to/segment -threads 1
    

Because most of the fetches failed with HTTP status code 429, the queue was throttled using the configured exponential backoff. When the delay was larger then the timeout, Fetcher stopped the QueueFeeder and finished. Below the log snippets from the hadoop.log.

2024-10-24 14:20:10,021 INFO o.a.n.f.Fetcher [main] Fetcher: starting
...
2024-10-24 14:20:10,934 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: threads: 1
2024-10-24 14:20:10,954 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: time-out divisor: 20
...
2024-10-24 14:20:11,008 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 51 Using queue mode : byHost
...
2024-10-24 14:20:11,009 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetching http://localhost/nutch/test-exception/19.html (queue crawl delay=3000ms)
...
2024-10-24 14:20:11,590 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/robots.txt - http/1.1 301 Moved Permanently
...
2024-10-24 14:20:11,594 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/robots.txt -> http://localhost/nutch/rd1-robots.txt
2024-10-24 14:20:11,595 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/rd1-robots.txt - http/1.1 302 Found
...
2024-10-24 14:20:11,596 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/rd1-robots.txt -> http://localhost/nutch/rd2-robots.txt
2024-10-24 14:20:11,597 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/rd2-robots.txt - http/1.1 302 Found
...
2024-10-24 14:20:11,597 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/rd2-robots.txt -> http://localhost/nutch/target-robots.txt
2024-10-24 14:20:11,601 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/target-robots.txt - http/1.1 200 OK
...
2024-10-24 14:20:11,602 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for http://localhost/nutch/test-exception/19.html with status code 200
2024-10-24 14:20:11,606 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/19.html - http/1.1 429 Too Many Requests
...
2024-10-24 14:20:11,708 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/19.html failed with: Http code=429, url=http://localhost/nutch/test-exception/19.html
2024-10-24 14:20:11,708 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 3000 ms after 1 exceptions in queue
...
2024-10-24 14:20:17,711 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetching http://localhost/nutch/test-exception/ (queue crawl delay=3000ms)
2024-10-24 14:20:17,714 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/ - http/1.1 429 Too Many Requests
...
2024-10-24 14:20:17,718 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/ failed with: Http code=429, url=http://localhost/nutch/test-exception/
2024-10-24 14:20:17,718 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 6000 ms after 2 exceptions in queue
...
2024-10-24 14:21:11,787 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/3.html - http/1.1 429 Too Many Requests
2024-10-24 14:21:11,797 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/3.html failed with: Http code=429, url=http://localhost/nutch/test-exception/3.html
2024-10-24 14:21:11,798 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 48000 ms after 5 exceptions in queue
...
2024-10-24 14:21:42,095 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1
2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Timeout reached with no new requests since 30000 seconds.
2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 1 hung threads (queue feeder still alive).
2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Thread #0 hung while processing http://localhost/nutch/test-exception/3.html
2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Stack of thread #0:
java.base@11.0.24/java.lang.Thread.sleep(Native Method)
app//org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:278)

2024-10-24 14:21:42,096 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Signaled QueueFeeder to stop, waiting 1.5 seconds before exiting.
2024-10-24 14:21:42,981 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder stopping, timeout reached.
2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder finished: total 8 records
2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder queuing status:
2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] 	8	SUCCESSFULLY_QUEUED
2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 	0	ERROR_CREATE_FETCH_ITEM
2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 	0	ABOVE_EXCEPTION_THRESHOLD
2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 	0	HIT_BY_TIMELIMIT
2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 	16	HIT_BY_TIMEOUT
2024-10-24 14:21:43,597 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 2 queued fetch items in 1 queues.
2024-10-24 14:21:43,597 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] * queue: localhost >> dropping!
2024-10-24 14:21:43,598 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] Emptied all queues: 1 queues with 2 items
2024-10-24 14:21:43,639 WARN o.a.h.m.i.MetricsSystemImpl [pool-6-thread-1] JobTracker metrics system already initialized!
2024-10-24 14:21:43,813 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 has no more work available
2024-10-24 14:21:43,814 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 -finishing thread FetcherThread, activeThreads=0
2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main]  map 100% reduce 100%
2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main] Job job_local288913864_0001 completed successfully
2024-10-24 14:21:43,857 INFO o.a.h.m.Job [main] Counters: 35
    ...
	FetcherStatus
		bytes_downloaded=120
		exception=5
		hitByTimeout=16
		hungThreads=1
		success=1
    ...
2024-10-24 14:21:43,858 INFO o.a.n.f.Fetcher [main] Fetcher: finished, elapsed: 93836 ms

- Fetcher signals QueueFeeder (via FetchItemQueues) to stop queuing
  and waits 1.5 seconds for this to happen
- QueueFeeder (if still alive) discards remaining items
  with QueuingStatus "HIT_BY_TIMEOUT"
- add number of items skipped by QueueFeeder to "FetcherStatus"
  counters
- mark variables defined in FetchItemQueues but accessed from
  QueueFeeder thread as "volatile"
- minor code cleanups
@sebastian-nagel
Copy link
Contributor Author

Update: this PR has been successfully tested in production as part of the November crawl of Common Crawl. For the source code (it's a fork of Nutch with some modifications), see https://github.com/commoncrawl/nutch/. If there are no objections, I'll merge this PR in a couple of days. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant