
Anansi sometimes fails to follow 303 redirects #68

Open

townxelliot opened this issue Jan 3, 2017 · 19 comments

@townxelliot

If Anansi is asked to crawl a URI which returns a 303 redirect to a representation of the resource, the redirect is sometimes ignored. On other occasions the redirect target is queued and the resource is fetched. I couldn't discern any pattern to this behaviour.

The environment where I noted this behaviour is the Acropolis docker stack; the behaviour can be reproduced as follows:

  1. Clone and run the Acropolis docker stack, so you have a running Anansi. See https://github.com/bbcarchdev/acropolis for instructions. The short version is docker-compose build inside a clone of the Acropolis project.
  2. Seed Anansi with a non-information resource URI which redirects to an information resource URI, e.g. http://dbpedia.org/resource/Dracula. The command for doing this with the docker stack is:
    docker exec acropolis_anansi_1 /usr/bin/crawler-add -f http://dbpedia.org/resource/Dracula
    
    You may need to repeat this command several times, as Anansi will sometimes follow the redirect correctly. Repeat until Anansi doesn't follow the redirect, effectively ignoring the resource you seeded it with. There is no information in the log about what has happened: the resource URI appears to be getting queued but is never processed.
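
One way to see what actually happened to the seeded URI is to query the crawl database directly. A hedged example follows: the table and column names are taken from the debug SQL that appears later in this thread, but how you reach the database inside the docker stack is not covered here.

SELECT "uri", "state", "status", "updated"
  FROM "crawl_resource"
 WHERE "uri" IN ('http://dbpedia.org/resource/Dracula',
                 'http://dbpedia.org/data/Dracula.xml');

In the failing case described above, the row for the redirect target either never appears or is never processed.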

rjpwork commented Jan 3, 2017

What are we expecting to happen in the "success" case? E.g. on that attempt, it fetched Dracula.xml but then rejected it for licensing reasons.

anansi_1         | crawld[1]: LOD: examining <http://dbpedia.org/data/Dracula.xml> for licensing triples
anansi_1         | crawld[1]: %ANANSI-I-3100: REJECTED <http://dbpedia.org/data/Dracula.xml>: (LOD: no suitable licensing triple)
anansi_1         | crawld[1]: RDF: filter declined further processing of this resource


rjpwork commented Jan 3, 2017

On the second attempt it didn't try to fetch Dracula.xml, but that's possibly down to there being an entry for it in crawl_resource marked REJECTED.

Indeed, deleting that record makes it fetch Dracula.xml again by following the redirect, but it is then still rejected via the LOD filter.

@townxelliot

I'll provide some more thorough steps to reproduce the issue tomorrow. When I was trying this before Christmas, I either got a silent failure or successful processing of the resource (I didn't get a rejection due to licensing, as far as I remember).


rjpwork commented Jan 3, 2017

Ta. It's difficult to see what the intended result should be when it's logging that it's skipped the redirect at the same time as processing the result of the damned redirect.


rjpwork commented Jan 5, 2017

Do we want to special-case 303 as a follow, then? The code currently blocks them in a few places with this conditional:

if(status > 300 && status < 304) { return /* variants on SKIPPED */; }


nevali commented Jan 5, 2017

301, 302 and 303 should all be treated equivalently (although in future it would be a nice-to-have to set a different TTL depending upon which kind of redirect it was); from Anansi's point of view there's no reason why 303 ought to be special-cased. If there's a problem with 303 redirect handling, it'll apply to 301s and 302s too.

The intended result should be:

If status is (301, 302, 303) then

  • examine the Location header and queue the URL it contains
  • end processing of this resource, setting the status to SKIPPED (logging the message as we go)

Given a database empty besides the redirecting resource, this ought to result in the redirect target being queued, fetched, and processed almost immediately after the skip has occurred (allowing for root minimum fetch times). Note that the skip is logged at a different level than accepted/rejected because redirects just clutter the logs, so it may not be immediately obvious that this is what has occurred.
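
A minimal, self-contained sketch of that intended flow follows. The names here (handle_status, OUTCOME_*) are illustrative stand-ins rather than Anansi's actual API; only the behaviour described above (queue the Location target, then mark the original SKIPPED) is taken from this thread.

#include <stdio.h>
#include <string.h>

/* Illustrative stand-ins for Anansi's crawl-outcome constants */
typedef enum { OUTCOME_PROCESS, OUTCOME_SKIP } outcome_t;

/* If the response is a 301/302/303 and the Location differs from the
 * original URI, queue the target (here we just print it) and skip the
 * original resource; otherwise let processing continue. */
static outcome_t handle_status(const char *uri, int status, const char *location)
{
    if(status == 301 || status == 302 || status == 303)
    {
        if(location && strcmp(location, uri))
        {
            printf("queueing redirect target <%s>\n", location);
        }
        printf("SKIPPED (redirect) <%s> (HTTP status %d)\n", uri, status);
        return OUTCOME_SKIP;
    }
    return OUTCOME_PROCESS;
}

int main(void)
{
    handle_status("http://dbpedia.org/resource/Dracula", 303,
        "http://dbpedia.org/data/Dracula.xml");
    return 0;
}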


rjpwork commented Jan 6, 2017

Given a single URI which returns a 303 pointing to a valid RDF file (with nothing else to crawl), we expect:

  1. An entry in the database for the original URI, marked SKIPPED
  2. An entry in the database for the redirected URI, marked ACCEPTED

Is that right?


nevali commented Jan 6, 2017

Given a single URI which returns a 303, we expect:

  1. an entry in the database for the original URI, marked SKIPPED
  2. an entry in the database for the redirect target URI, marked NEW, unless the redirect is to the original URI.

Tests like this should generally be executed in one-shot mode (-t).
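
For example, assuming a build tree like the one shown in the later transcripts, a one-shot run might look like this (the URI is just a placeholder for whichever redirecting resource you are testing):

crawler/crawld -t http://dbpedia.org/resource/Dracula

The two expected rows can then be checked in crawl_resource, as sketched earlier in this thread.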

At this point we know what the two URIs are (and whether they differ), and whether the target URI was successfully added.

Then, on a subsequent fetch, assuming the redirect target was indeed NEW and the resource is acceptable, we'd expect it to be marked ACCEPTED. As we have no particular need to test the basic operation of the database-based queue at this point (which would mean accounting for root TTLs and generally make things more complicated), you can further test that by executing another one-shot crawl of the target URI and retrieving the status from the database.

However, all this latter part really tests is whether the fetch of the target URI is somehow dependent upon the fetch of the original URI in any fashion other than fetching one triggering the presence of the other in the queue; as that's not something we have any particular need to test for, I wouldn't bother with that, either.

Therefore we can do a one-shot test with the conditions at the top of this comment.

How's that?


rjpwork commented Jan 12, 2017

Currently it is not performing step 2 for http://sws.geonames.org/2643743/, which is a 303 to http://sws.geonames.org/2643743/about.rdf. I can't see any obvious reason for this in the logs, though.


nevali commented Jan 26, 2017

My logs suggest the early bail-out from the curl receiver callback is throwing a spanner in the works and the request ends up being treated as a failure instead:

crawld: [Debug] Fetching http://sws.geonames.org/2643743/
* About to connect() to sws.geonames.org port 80 (#0)
*   Trying 188.40.33.19...
* connected
* Connected to sws.geonames.org (188.40.33.19) port 80 (#0)
> GET /2643743/ HTTP/1.1
Host: sws.geonames.org
Accept: application/rdf+xml;q=1.0, text/rdf;q=0.6, application/n-triples;q=1.0, text/plain;q=0.1, text/turtle;q=1.0, application/x-turtle;q=1.0, application/turtle;q=1.0, text/n3;q=0.3, text/rdf+n3;q=0.3, application/rdf+n3;q=0.3, application/x-trig;q=1.0, text/x-nquads;q=1.0, */*;q=0.1
User-Agent: Mozilla/5.0 (compatible; Anansi; libcrawl; +https://bbcarchdev.github.io/res/)

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 303 See Other
< Date: Thu, 26 Jan 2017 10:32:20 GMT
< Server: Apache/2.4.6 (Linux/SUSE)
< Location: http://sws.geonames.org/2643743/about.rdf
< Content-Length: 333
< Content-Type: text/html; charset=iso-8859-1
< 
* Failed writing body (0 != 333)
* Closing connection #0
* Failed writing received data to disk/application
crawld: [Debug] DB: UPDATE "crawl_resource" SET "updated" = '2017-01-26 10:32:15', "last_modified" = '2017-01-26 10:32:15', "status" = 303, "crawl_instance" = NULL, "state" = 'FAILED' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "next_fetch" = '2017-01-27 10:32:15' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663' AND "next_fetch" < '2017-01-27 10:32:15'
crawld: [Debug] DB: UPDATE "crawl_root" SET "last_updated" = '2017-01-26 10:32:15' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: UPDATE "crawl_root" SET "earliest_update" = '2017-01-26 10:32:15' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493' AND "earliest_update" < '2017-01-26 10:32:15'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "error_count" = 0, "soft_error_count" = 0 WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: Notice: %ANANSI-N-2034: crawl thread terminating [dev] crawler 1/1 (thread 1/1)
crawld: Notice: %ANANSI-N-2026: terminating crawl threads...
crawld: %ANANSI-I-2027: waiting for crawl threads to terminate...
crawld: Notice: %ANANSI-N-2028: all crawl threads have stopped
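
For reference, the "Failed writing body" lines above are standard libcurl behaviour when a write callback returns fewer bytes than it was handed: the transfer is aborted with CURLE_WRITE_ERROR even though the 303 status and Location header have already been received. A standalone sketch of that pattern (not Anansi's actual callback) follows:

#include <stdio.h>
#include <curl/curl.h>

/* Write callback that bails out immediately: returning 0 instead of
 * size * nmemb tells libcurl to abort the transfer. */
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void) ptr;
    (void) size;
    (void) nmemb;
    (void) userdata;
    return 0;
}

int main(void)
{
    CURL *ch;
    CURLcode res;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    ch = curl_easy_init();
    if(!ch)
    {
        return 1;
    }
    curl_easy_setopt(ch, CURLOPT_URL, "http://sws.geonames.org/2643743/");
    curl_easy_setopt(ch, CURLOPT_WRITEFUNCTION, write_cb);
    res = curl_easy_perform(ch);
    if(res == CURLE_WRITE_ERROR)
    {
        /* The 303 and its Location header were received, but the handle
         * reports a write error because of the early bail-out. */
        printf("transfer aborted by the write callback\n");
    }
    curl_easy_cleanup(ch);
    curl_global_cleanup();
    return 0;
}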


rjpwork commented Jan 26, 2017

If you change the if(*status >= 300 && *status < 400) line in crawler/policy.c to allow 303 statuses, then it works fine for that URI - it gets added as SKIPPED and the .../about.rdf gets added to the queue and then eventually ACCEPTED.

Which implies it isn't anything to do with early bail-outs.
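
For reference, that change is roughly of this shape (illustrative only; the real test lives in crawler/policy.c and operates on Anansi's own status pointer):

/* Treat 3xx responses as blocked, but let 303 through so that the
 * redirect target can still be queued and the original marked SKIPPED. */
static int block_redirect_status(int status)
{
    return (status >= 300 && status < 400 && status != 303);
}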


nevali commented Jan 26, 2017

Early bail-outs happen because of that code (that's what—as the preceding comment indicates—the checkpoint callback is for)...

You seem to be saying that by changing the code to not return COS_SKIPPED, the result is that it gets set as SKIPPED in the queue, which is obviously undesirable, but it's hard to see what the effect of that is without the output of crawld -t URI with that change in place.


nevali commented Jan 26, 2017

From the code, though, it should result in COS_SKIPPED being returned due to the content-type not matching the whitelist (as Geonames returns a text/html response), rather than the response code, but I can't see an obvious reason why returning COS_SKIPPED due to one rather than the other should make a difference.
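
For what it's worth, the content-type path being described would look something like this in isolation (a toy sketch with a hypothetical whitelist; the real list is part of Anansi's configuration, and text/html, as returned by Geonames, would not match):

#include <string.h>

/* Return nonzero if the Content-Type (ignoring parameters such as the
 * charset) matches one of the whitelisted RDF types; anything else would
 * be skipped regardless of its status code. */
static int type_is_whitelisted(const char *content_type)
{
    static const char *whitelist[] = {
        "application/rdf+xml", "text/turtle", "application/n-triples", NULL
    };
    size_t i, len;

    for(i = 0; whitelist[i]; i++)
    {
        len = strlen(whitelist[i]);
        if(!strncmp(content_type, whitelist[i], len) &&
           (content_type[len] == '\0' || content_type[len] == ';'))
        {
            return 1;
        }
    }
    return 0;
}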


nevali commented Jan 31, 2017

workaround in e7d47c7

nevali referenced this issue Jan 31, 2017
Hacky way to provoke `anansi` into following 303s. Experimentation has shown this to work for e.g. `geonames.org` and `richardlight.org.uk`.

nevali commented Jan 31, 2017

okay, a few minutes of tracing shows what's going on, at least partially.

the policy handler returning COS_SKIPPED results in the crawl object being marked for rollback; because of that, the processor_unchanged_handler_() ends up being invoked, when it should actually be processor_handler_() to handle the redirect properly.

I wonder if the correct fix is actually to distinguish between "skipped this resource, roll back" and "skipped this resource, but further processing should be allowed to occur"
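
A sketch of what that distinction might look like (purely hypothetical; only COS_SKIPPED itself is confirmed by this thread):

/* Hypothetical split of the single "skipped" outcome into two dispositions */
typedef enum
{
    SKIP_AND_ROLL_BACK,   /* discard the object entirely; invoke the
                             "unchanged" path and do no further work */
    SKIP_AND_CONTINUE     /* mark the resource SKIPPED, but still run the
                             processor so that side effects such as queueing
                             the Location target can happen */
} skip_disposition_t;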


nevali commented Jan 31, 2017

okay, introducing that looks like it'll DTRT (first pass without the resource in the DB, second with):

[111003 mcrobm02@vm-10-100-0-17 git:develop anansi]$ ./libtool --mode=execute gdb --args crawler/crawld -t http://sws.geonames.org/2643743/
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/mcrobm02/res/acropolis/anansi/crawler/.libs/lt-crawld...done.
(gdb) r
Starting program: /home/mcrobm02/res/acropolis/anansi/crawler/.libs/lt-crawld -t http://sws.geonames.org/2643743/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
crawld: [Debug] loading configuration file '/opt/res/etc/crawl.conf'
crawld: [Debug] policy:schemes:whitelist => 'http'
crawld: [Debug] policy:schemes:whitelist => 'https'
crawld: [Debug] libcluster: environment name now set to 'dev'
crawld: Notice: %ANANSI-N-2010: test mode enabled: ignoring cluster configuration
crawld: [Debug] libcluster: number of workers in this cluster member set to 1
crawld: [Debug] libcluster: static: this member's base index set to 0
crawld: [Debug] libcluster: static: total cluster worker count set to 0
crawld: [Debug] libcluster: static: successfully joined the cluster
crawld: [Debug] libcluster: re-balanced; this instance has base index 0 (1 workers) from a total of 1
crawld: Notice: %ANANSI-N-2011: cluster has re-balanced: instance 3487f8f04e6c4e689f5fcf3ad0f4ab10 single-thread index 1 from cluster anansi/dev of 1 threads
crawld: [Debug] creating 1 crawler threads
crawld: [Debug] launching thread 1
crawld: [Debug] S3: initialising cache at <s3://anansi-dev>
[New Thread 0x7fffed5dd700 (LWP 2192)]
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: CREATE TABLE IF NOT EXISTS "_version" ("ident" VARCHAR(64) NOT NULL, "version" INTEGER NOT NULL, "updated" TIMESTAMP NOT NULL, "comment" TEXT DEFAULT NULL, PRIMARY KEY ("ident"))
crawld: [Debug] DB: SELECT "version" FROM "_version" WHERE "ident" = 'com.github.nevali.crawl.db'
crawld: [Debug] DB: ROLLBACK
crawld: Notice: %ANANSI-N-5003: using test URI <http://sws.geonames.org/2643743/>
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_resource" WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "next_fetch" = NOW(), "state" = 'FORCE' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: COMMIT
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_root" WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: ROLLBACK
crawld: Notice: %ANANSI-N-2029: new crawl thread ready [dev] crawler 1/1 (thread 1/1)
crawld: [Debug] thread 1 has started
crawld: [Debug] main thread is now sleeping
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> GET /anansi-dev/2a6e9d9958f8fddc6957dd7147601663.json HTTP/1.1
Host: s.ch.internal
Accept: */*
Date: Tue, 31 Jan 2017 11:10:11 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:tCyIfaI3Kj22id+NAoVwh1TZGvQ=

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:10:17 GMT
< Server: Apache/2.4.7 (Ubuntu)
< Accept-Ranges: bytes
< Last-Modified: Tue, 31 Jan 2017 11:08:03 GMT
< ETag: "a5c112b665d581854a263087eb66dbfd"
< Content-Length: 518
< Content-Type: application/json
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
crawld: [Debug] Fetching http://sws.geonames.org/2643743/
* About to connect() to sws.geonames.org port 80 (#0)
*   Trying 188.40.33.19...
* connected
* Connected to sws.geonames.org (188.40.33.19) port 80 (#0)
> GET /2643743/ HTTP/1.1
Host: sws.geonames.org
Accept: application/rdf+xml;q=1.0, text/rdf;q=0.6, application/n-triples;q=1.0, text/plain;q=0.1, text/turtle;q=1.0, application/x-turtle;q=1.0, application/turtle;q=1.0, text/n3;q=0.3, text/rdf+n3;q=0.3, application/rdf+n3;q=0.3, application/x-trig;q=1.0, text/x-nquads;q=1.0, */*;q=0.1
User-Agent: Mozilla/5.0 (compatible; Anansi; libcrawl; +https://bbcarchdev.github.io/res/)

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 303 See Other
< Date: Tue, 31 Jan 2017 11:10:17 GMT
< Server: Apache/2.4.6 (Linux/SUSE)
< Location: http://sws.geonames.org/2643743/about.rdf
< Content-Length: 333
< Content-Type: text/html; charset=iso-8859-1
<
* Connection #0 to host sws.geonames.org left intact
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> PUT /anansi-dev/2a6e9d9958f8fddc6957dd7147601663.json HTTP/1.1
Host: s.ch.internal
Accept: */*
Content-Type: application/json
Date: Tue, 31 Jan 2017 11:10:11 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:diD9enzLZ8asQ0pawdABuJK0BGE=
Content-Length: 518

* upload completely sent off: 518 out of 518 bytes
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:10:17 GMT
< Server: Apache/2.4.7 (Ubuntu)
< ETag: "029c13b8f0ef09b1a5072d4999dcfaaa"
< Accept-Ranges: bytes
< Content-Length: 0
< Vary: Accept-Encoding
< Content-Type: application/xml
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
crawld: [Debug] Uploading payload to </2a6e9d9958f8fddc6957dd7147601663>
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> PUT /anansi-dev/2a6e9d9958f8fddc6957dd7147601663 HTTP/1.1
Host: s.ch.internal
Accept: */*
Expect: 100-continue
Content-Type: text/html; charset=iso-8859-1
Content-Location: http://sws.geonames.org/2643743/
Date: Tue, 31 Jan 2017 11:10:12 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:KFNa5VNBhSKVnXlc70vVg3u+e0Q=
Content-Length: 333

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 100 Continue
* additional stuff not fine transfer.c:1037: 0 0
* We are completely uploaded and fine
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:10:17 GMT
< Server: Apache/2.4.7 (Ubuntu)
< ETag: "bff2d24ad039d9e79755e6946ec4bbfc"
< Accept-Ranges: bytes
< Content-Length: 0
< Vary: Accept-Encoding
< Content-Type: application/xml
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
* Closing connection #0
crawld: [Debug] processor_handler: URI is 'http://sws.geonames.org/2643743/', Content-Type is 'text/html; charset=iso-8859-1', status is 303
crawld: [Debug] processor_handler: following 303 redirect to <http://sws.geonames.org/2643743/about.rdf>
crawld: [Debug] Adding URI <http://sws.geonames.org/2643743/about.rdf> to crawler queue
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_resource" WHERE "hash" = '26c08f9008b36968e406fe7cd49abe55'
crawld: [Debug] DB: ROLLBACK
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_root" WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: ROLLBACK
crawld: [Debug] processor_handler: object has been updated
crawld: %ANANSI-I-3000: SKIPPED (redirect) <http://sws.geonames.org/2643743/> (RDF: HTTP status 303)
crawld: [Debug] DB: UPDATE "crawl_resource" SET "updated" = '2017-01-31 11:10:11', "last_modified" = '2017-01-31 11:10:11', "status" = 303, "crawl_instance" = NULL, "state" = 'SKIPPED' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "next_fetch" = '2017-02-07 11:10:12' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663' AND "next_fetch" < '2017-02-07 11:10:12'
crawld: [Debug] DB: UPDATE "crawl_root" SET "last_updated" = '2017-01-31 11:10:12' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: UPDATE "crawl_root" SET "earliest_update" = '2017-01-31 11:10:12' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493' AND "earliest_update" < '2017-01-31 11:10:12'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "error_count" = 0, "soft_error_count" = 0 WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: Notice: %ANANSI-N-2034: crawl thread terminating [dev] crawler 1/1 (thread 1/1)
[Thread 0x7fffed5dd700 (LWP 2192) exited]
crawld: Notice: %ANANSI-N-2026: terminating crawl threads...
crawld: %ANANSI-I-2027: waiting for crawl threads to terminate...
crawld: Notice: %ANANSI-N-2028: all crawl threads have stopped
[Inferior 1 (process 2188) exited normally]
(gdb) r
Starting program: /home/mcrobm02/res/acropolis/anansi/crawler/.libs/lt-crawld -t http://sws.geonames.org/2643743/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
crawld: [Debug] loading configuration file '/opt/res/etc/crawl.conf'
crawld: [Debug] policy:schemes:whitelist => 'http'
crawld: [Debug] policy:schemes:whitelist => 'https'
crawld: [Debug] libcluster: environment name now set to 'dev'
crawld: Notice: %ANANSI-N-2010: test mode enabled: ignoring cluster configuration
crawld: [Debug] libcluster: number of workers in this cluster member set to 1
crawld: [Debug] libcluster: static: this member's base index set to 0
crawld: [Debug] libcluster: static: total cluster worker count set to 0
crawld: [Debug] libcluster: static: successfully joined the cluster
crawld: [Debug] libcluster: re-balanced; this instance has base index 0 (1 workers) from a total of 1
crawld: Notice: %ANANSI-N-2011: cluster has re-balanced: instance 9c39c6d998d1470ead694b172f989cb2 single-thread index 1 from cluster anansi/dev of 1 threads
crawld: [Debug] creating 1 crawler threads
crawld: [Debug] launching thread 1
crawld: [Debug] S3: initialising cache at <s3://anansi-dev>
[New Thread 0x7fffed5dd700 (LWP 2197)]
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: CREATE TABLE IF NOT EXISTS "_version" ("ident" VARCHAR(64) NOT NULL, "version" INTEGER NOT NULL, "updated" TIMESTAMP NOT NULL, "comment" TEXT DEFAULT NULL, PRIMARY KEY ("ident"))
crawld: [Debug] DB: SELECT "version" FROM "_version" WHERE "ident" = 'com.github.nevali.crawl.db'
crawld: [Debug] DB: ROLLBACK
crawld: Notice: %ANANSI-N-5003: using test URI <http://sws.geonames.org/2643743/>
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_resource" WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "next_fetch" = NOW(), "state" = 'FORCE' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: COMMIT
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_root" WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: ROLLBACK
crawld: Notice: %ANANSI-N-2029: new crawl thread ready [dev] crawler 1/1 (thread 1/1)
crawld: [Debug] thread 1 has started
crawld: [Debug] main thread is now sleeping
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> GET /anansi-dev/2a6e9d9958f8fddc6957dd7147601663.json HTTP/1.1
Host: s.ch.internal
Accept: */*
Date: Tue, 31 Jan 2017 11:10:38 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:9HM0UZd7IF0StitSUdy+re9GCbI=

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:10:44 GMT
< Server: Apache/2.4.7 (Ubuntu)
< Accept-Ranges: bytes
< Last-Modified: Tue, 31 Jan 2017 11:10:17 GMT
< ETag: "029c13b8f0ef09b1a5072d4999dcfaaa"
< Content-Length: 518
< Content-Type: application/json
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
crawld: [Debug] Fetching http://sws.geonames.org/2643743/
* About to connect() to sws.geonames.org port 80 (#0)
*   Trying 188.40.33.19...
* connected
* Connected to sws.geonames.org (188.40.33.19) port 80 (#0)
> GET /2643743/ HTTP/1.1
Host: sws.geonames.org
Accept: application/rdf+xml;q=1.0, text/rdf;q=0.6, application/n-triples;q=1.0, text/plain;q=0.1, text/turtle;q=1.0, application/x-turtle;q=1.0, application/turtle;q=1.0, text/n3;q=0.3, text/rdf+n3;q=0.3, application/rdf+n3;q=0.3, application/x-trig;q=1.0, text/x-nquads;q=1.0, */*;q=0.1
User-Agent: Mozilla/5.0 (compatible; Anansi; libcrawl; +https://bbcarchdev.github.io/res/)

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 303 See Other
< Date: Tue, 31 Jan 2017 11:10:44 GMT
< Server: Apache/2.4.6 (Linux/SUSE)
< Location: http://sws.geonames.org/2643743/about.rdf
< Content-Length: 333
< Content-Type: text/html; charset=iso-8859-1
<
* Connection #0 to host sws.geonames.org left intact
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> PUT /anansi-dev/2a6e9d9958f8fddc6957dd7147601663.json HTTP/1.1
Host: s.ch.internal
Accept: */*
Content-Type: application/json
Date: Tue, 31 Jan 2017 11:10:38 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:m/AAKq8vByZoji9xY/poK60YPkU=
Content-Length: 518

* upload completely sent off: 518 out of 518 bytes
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:10:44 GMT
< Server: Apache/2.4.7 (Ubuntu)
< ETag: "44c13f0287657e6c5c705aa35e1f6642"
< Accept-Ranges: bytes
< Content-Length: 0
< Vary: Accept-Encoding
< Content-Type: application/xml
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
crawld: [Debug] Uploading payload to </2a6e9d9958f8fddc6957dd7147601663>
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> PUT /anansi-dev/2a6e9d9958f8fddc6957dd7147601663 HTTP/1.1
Host: s.ch.internal
Accept: */*
Expect: 100-continue
Content-Type: text/html; charset=iso-8859-1
Content-Location: http://sws.geonames.org/2643743/
Date: Tue, 31 Jan 2017 11:10:38 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:v1nDzLAdEyQR65uoeVX2RJY+VQY=
Content-Length: 333

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 100 Continue
* additional stuff not fine transfer.c:1037: 0 0
* We are completely uploaded and fine
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:10:44 GMT
< Server: Apache/2.4.7 (Ubuntu)
< ETag: "bff2d24ad039d9e79755e6946ec4bbfc"
< Accept-Ranges: bytes
< Content-Length: 0
< Vary: Accept-Encoding
< Content-Type: application/xml
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
* Closing connection #0
crawld: [Debug] processor_handler: URI is 'http://sws.geonames.org/2643743/', Content-Type is 'text/html; charset=iso-8859-1', status is 303
crawld: [Debug] processor_handler: following 303 redirect to <http://sws.geonames.org/2643743/about.rdf>
crawld: [Debug] Adding URI <http://sws.geonames.org/2643743/about.rdf> to crawler queue
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_resource" WHERE "hash" = '26c08f9008b36968e406fe7cd49abe55'
crawld: [Debug] DB: ROLLBACK
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_root" WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: ROLLBACK
crawld: [Debug] processor_handler: object has been updated
crawld: %ANANSI-I-3000: SKIPPED (redirect) <http://sws.geonames.org/2643743/> (RDF: HTTP status 303)
crawld: [Debug] DB: UPDATE "crawl_resource" SET "updated" = '2017-01-31 11:10:38', "last_modified" = '2017-01-31 11:10:38', "status" = 303, "crawl_instance" = NULL, "state" = 'SKIPPED' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "next_fetch" = '2017-02-07 11:10:39' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663' AND "next_fetch" < '2017-02-07 11:10:39'
crawld: [Debug] DB: UPDATE "crawl_root" SET "last_updated" = '2017-01-31 11:10:39' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: UPDATE "crawl_root" SET "earliest_update" = '2017-01-31 11:10:39' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493' AND "earliest_update" < '2017-01-31 11:10:39'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "error_count" = 0, "soft_error_count" = 0 WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: Notice: %ANANSI-N-2034: crawl thread terminating [dev] crawler 1/1 (thread 1/1)
[Thread 0x7fffed5dd700 (LWP 2197) exited]
crawld: Notice: %ANANSI-N-2026: terminating crawl threads...
crawld: %ANANSI-I-2027: waiting for crawl threads to terminate...
crawld: Notice: %ANANSI-N-2028: all crawl threads have stopped
[Inferior 1 (process 2196) exited normally]
(gdb)


nevali commented Jan 31, 2017

And for completeness, a run with an empty database:

Starting program: /home/mcrobm02/res/acropolis/anansi/crawler/.libs/lt-crawld -t http://sws.geonames.org/2643743/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
crawld: [Debug] loading configuration file '/opt/res/etc/crawl.conf'
crawld: [Debug] policy:schemes:whitelist => 'http'
crawld: [Debug] policy:schemes:whitelist => 'https'
crawld: [Debug] libcluster: environment name now set to 'dev'
crawld: Notice: %ANANSI-N-2010: test mode enabled: ignoring cluster configuration
crawld: [Debug] libcluster: number of workers in this cluster member set to 1
crawld: [Debug] libcluster: static: this member's base index set to 0
crawld: [Debug] libcluster: static: total cluster worker count set to 0
crawld: [Debug] libcluster: static: successfully joined the cluster
crawld: [Debug] libcluster: re-balanced; this instance has base index 0 (1 workers) from a total of 1
crawld: Notice: %ANANSI-N-2011: cluster has re-balanced: instance c4b479b20b8b470a8e9240fa34f1dabc single-thread index 1 from cluster anansi/dev of 1 threads
crawld: [Debug] creating 1 crawler threads
crawld: [Debug] launching thread 1
crawld: [Debug] S3: initialising cache at <s3://anansi-dev>
[New Thread 0x7fffed5dd700 (LWP 2663)]
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: CREATE TABLE IF NOT EXISTS "_version" ("ident" VARCHAR(64) NOT NULL, "version" INTEGER NOT NULL, "updated" TIMESTAMP NOT NULL, "comment" TEXT DEFAULT NULL, PRIMARY KEY ("ident"))
crawld: [Debug] DB: SELECT "version" FROM "_version" WHERE "ident" = 'com.github.nevali.crawl.db'
crawld: [Debug] DB: ROLLBACK
crawld: Notice: %ANANSI-N-5003: using test URI <http://sws.geonames.org/2643743/>
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_resource" WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: INSERT INTO "crawl_resource" ("hash", "shorthash", "tinyhash", "crawl_bucket", "cache_bucket", "root", "uri", "added", "next_fetch", "state") VALUES ('2a6e9d9958f8fddc6957dd7147601663', 711892377, 153, 0, 0, '6dccbc6dbe6d1921c6f914bb75af3493', 'http://sws.geonames.org/2643743/', NOW(), NOW(), 'NEW')
crawld: [Debug] DB: COMMIT
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_root" WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: INSERT INTO "crawl_root" ("hash", "uri", "added", "earliest_update", "rate") VALUES ('6dccbc6dbe6d1921c6f914bb75af3493', 'http://sws.geonames.org/', NOW(), NOW(), 1000)
crawld: [Debug] DB: COMMIT
crawld: Notice: %ANANSI-N-2029: new crawl thread ready [dev] crawler 1/1 (thread 1/1)
crawld: [Debug] thread 1 has started
crawld: [Debug] main thread is now sleeping
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> GET /anansi-dev/2a6e9d9958f8fddc6957dd7147601663.json HTTP/1.1
Host: s.ch.internal
Accept: */*
Date: Tue, 31 Jan 2017 11:18:31 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:UA82dI7mmdEnqG/pWY26yDeG1PI=

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:18:37 GMT
< Server: Apache/2.4.7 (Ubuntu)
< Accept-Ranges: bytes
< Last-Modified: Tue, 31 Jan 2017 11:10:44 GMT
< ETag: "44c13f0287657e6c5c705aa35e1f6642"
< Content-Length: 518
< Content-Type: application/json
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
crawld: [Debug] Fetching http://sws.geonames.org/2643743/
* About to connect() to sws.geonames.org port 80 (#0)
*   Trying 188.40.33.19...
* connected
* Connected to sws.geonames.org (188.40.33.19) port 80 (#0)
> GET /2643743/ HTTP/1.1
Host: sws.geonames.org
Accept: application/rdf+xml;q=1.0, text/rdf;q=0.6, application/n-triples;q=1.0, text/plain;q=0.1, text/turtle;q=1.0, application/x-turtle;q=1.0, application/turtle;q=1.0, text/n3;q=0.3, text/rdf+n3;q=0.3, application/rdf+n3;q=0.3, application/x-trig;q=1.0, text/x-nquads;q=1.0, */*;q=0.1
User-Agent: Mozilla/5.0 (compatible; Anansi; libcrawl; +https://bbcarchdev.github.io/res/)

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 303 See Other
< Date: Tue, 31 Jan 2017 11:18:37 GMT
< Server: Apache/2.4.6 (Linux/SUSE)
< Location: http://sws.geonames.org/2643743/about.rdf
< Content-Length: 333
< Content-Type: text/html; charset=iso-8859-1
<
* Connection #0 to host sws.geonames.org left intact
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> PUT /anansi-dev/2a6e9d9958f8fddc6957dd7147601663.json HTTP/1.1
Host: s.ch.internal
Accept: */*
Content-Type: application/json
Date: Tue, 31 Jan 2017 11:18:32 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:AnF9TEXX6Az2xl6AUpIfDuAY/eU=
Content-Length: 518

* upload completely sent off: 518 out of 518 bytes
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:18:37 GMT
< Server: Apache/2.4.7 (Ubuntu)
< ETag: "a858c02eb1c6a25f02188cfea8d3b1d5"
< Accept-Ranges: bytes
< Content-Length: 0
< Vary: Accept-Encoding
< Content-Type: application/xml
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
crawld: [Debug] Uploading payload to </2a6e9d9958f8fddc6957dd7147601663>
* About to connect() to s.ch.internal port 80 (#0)
*   Trying 10.10.10.109...
* connected
* Connected to s.ch.internal (10.10.10.109) port 80 (#0)
> PUT /anansi-dev/2a6e9d9958f8fddc6957dd7147601663 HTTP/1.1
Host: s.ch.internal
Accept: */*
Expect: 100-continue
Content-Type: text/html; charset=iso-8859-1
Content-Location: http://sws.geonames.org/2643743/
Date: Tue, 31 Jan 2017 11:18:32 GMT
Authorization: AWS 715MF2410G60NFMNFTTZ:UkJX/J6StKBaB/MAhUL2SYM8SQc=
Content-Length: 333

* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 100 Continue
* additional stuff not fine transfer.c:1037: 0 0
* We are completely uploaded and fine
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Date: Tue, 31 Jan 2017 11:18:37 GMT
< Server: Apache/2.4.7 (Ubuntu)
< ETag: "bff2d24ad039d9e79755e6946ec4bbfc"
< Accept-Ranges: bytes
< Content-Length: 0
< Vary: Accept-Encoding
< Content-Type: application/xml
<
* Connection #0 to host s.ch.internal left intact
* Closing connection #0
* Closing connection #0
crawld: [Debug] processor_handler: URI is 'http://sws.geonames.org/2643743/', Content-Type is 'text/html; charset=iso-8859-1', status is 303
crawld: [Debug] processor_handler: following 303 redirect to <http://sws.geonames.org/2643743/about.rdf>
crawld: [Debug] Adding URI <http://sws.geonames.org/2643743/about.rdf> to crawler queue
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_resource" WHERE "hash" = '26c08f9008b36968e406fe7cd49abe55'
crawld: [Debug] DB: INSERT INTO "crawl_resource" ("hash", "shorthash", "tinyhash", "crawl_bucket", "cache_bucket", "root", "uri", "added", "next_fetch", "state") VALUES ('26c08f9008b36968e406fe7cd49abe55', 650153872, 144, 0, 0, '6dccbc6dbe6d1921c6f914bb75af3493', 'http://sws.geonames.org/2643743/about.rdf', NOW(), NOW(), 'NEW')
crawld: [Debug] DB: COMMIT
crawld: [Debug] DB: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
crawld: [Debug] DB: SELECT * FROM "crawl_root" WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: ROLLBACK
crawld: [Debug] processor_handler: object has been updated
crawld: %ANANSI-I-3000: SKIPPED (redirect) <http://sws.geonames.org/2643743/> (RDF: HTTP status 303)
crawld: [Debug] DB: UPDATE "crawl_resource" SET "updated" = '2017-01-31 11:18:31', "last_modified" = '2017-01-31 11:18:31', "status" = 303, "crawl_instance" = NULL, "state" = 'SKIPPED' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "next_fetch" = '2017-02-07 11:18:32' WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663' AND "next_fetch" < '2017-02-07 11:18:32'
crawld: [Debug] DB: UPDATE "crawl_root" SET "last_updated" = '2017-01-31 11:18:32' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493'
crawld: [Debug] DB: UPDATE "crawl_root" SET "earliest_update" = '2017-01-31 11:18:32' WHERE "hash" = '6dccbc6dbe6d1921c6f914bb75af3493' AND "earliest_update" < '2017-01-31 11:18:32'
crawld: [Debug] DB: UPDATE "crawl_resource" SET "error_count" = 0, "soft_error_count" = 0 WHERE "hash" = '2a6e9d9958f8fddc6957dd7147601663'
crawld: Notice: %ANANSI-N-2034: crawl thread terminating [dev] crawler 1/1 (thread 1/1)
[Thread 0x7fffed5dd700 (LWP 2663) exited]
crawld: Notice: %ANANSI-N-2026: terminating crawl threads...
crawld: %ANANSI-I-2027: waiting for crawl threads to terminate...
crawld: Notice: %ANANSI-N-2028: all crawl threads have stopped
[Inferior 1 (process 2659) exited normally]

nevali added a commit that referenced this issue Jan 31, 2017

nevali commented Jan 31, 2017

Proposed fix in 6ab134a

nevali added a commit that referenced this issue Jan 31, 2017

nevali commented Jan 31, 2017

FWIW, the reason for the ugliness in this is that COS_SKIPPED didn't originally exist at all; it was introduced to differentiate between outright rejection because something is wrong with the resource being retrieved, and simply not needing to bother with fully retrieving it and processing its payload.

Arguably the actual redirect handling code should form part of the policy handler rather than the processor (keeping that in one place), allowing the processor to deal with crawl states alone instead of response statuses as well.
