Anansi sometimes fails to follow 303 redirects #68
What are we expecting to happen in the "success" case? eg. on that attempt, it fetched
|
On the second attempt, it didn't attempt to fetch Indeed, deleting that record makes it fetch |
I'll provide some more thorough steps to reproduce the issue tomorrow. When I was trying this before Christmas, I either got a silent failure or successful processing of the resource (I don't get a rejection due to licence as far as I remember). |
Ta. It's difficult to see what the intended result should be when it's logging that it's skipped the redirect at the same time as processing the result of the damned redirect. |
Do we want to special case 303 as a follow then? Since the code currently blocks them in a few places with this conditional:
|
301, 302 and 303 should all be treated equivalently (although in future it would be a nice-to-have to set a different TTL depending upon which kind of redirect it was); from Anansi's point of view there's no reason why 303 ought to be special-cased. If there's a problem with 303 redirect handling, then it'll apply to 301s and 302s too. The intended result should be: If status is (301, 302, 303) then
Given a database empty besides the redirecting resource, this ought to result in the redirect target being queued, fetched, and processed almost immediately after the skip has occurred (allowing for root minimum fetch times). Note that the skip is logged at a different level than accepted/rejected because redirects just clutter the logs, so it may not be immediately obvious that this is what has occurred. |
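To make the intended behaviour concrete, here is a rough sketch in Python (not Anansi's actual C code). It assumes a simple in-memory queue. `handle_response`, `REDIRECT_STATUSES` and the state strings are hypothetical names used only for illustration. All three redirect statuses take the same path: the redirecting resource is skipped and the target is queued.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical: the statuses that should all be handled identically.
REDIRECT_STATUSES = {301, 302, 303}

def handle_response(uri, status, location, queue, seen):
    """Skip redirecting resources and queue their target; accept anything else.

    Illustrative only: real policy checks (licence, content type, ...) are omitted.
    """
    if status in REDIRECT_STATUSES:
        if location:
            # Location may be relative, so resolve it against the request URI.
            target = urljoin(uri, location)
            if target not in seen:
                seen.add(target)
                queue.append(target)
        return "SKIPPED"
    return "ACCEPTED"

queue = deque()
seen = {"http://example.org/id/thing"}
state = handle_response(
    "http://example.org/id/thing", 303, "/doc/thing.rdf", queue, seen
)
```

With an otherwise empty queue, the only effect of fetching the redirecting URI is that its target is now waiting to be fetched, which is what the logs ought to show right after the skip.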
Given a single URI which provides a 303 return that points to a valid RDF file (with nothing else to crawl), we expect
Is that right? |
Given a single URI which provides a 303 return, we expect
Tests like this should generally be executed in one-shot mode ( At this point we know what the two URIs are (and whether they differ), and whether the target URI was successfully added. Then, on a subsequent fetch, assuming the redirect target was indeed However, all this latter part is really testing is "is the fetch of the target URI somehow dependent upon the fetch of the original URI in any fashion other than fetching one triggering the existence in the queue of the other" — as that's not something we have any particular need to test for, I wouldn't bother with that, either. Therefore we can do a one-shot test with the conditions at the top of this comment. How's that? |
Currently it is not performing step 2 for |
My logs suggest the early bail-out from the curl receiver callback is throwing a spanner in the works and the request ends up being treated as a failure instead:
|
If you change the Which implies it isn't anything to do with early bail-outs. |
Early bail-outs happen because of that code (that's what—as the preceding comment indicates—the checkpoint callback is for)... You seem to be saying that by changing the code to not return |
From the code, though, it should result in |
workaround in e7d47c7 |
Hacky way to provoke `anansi` into following 303s. Experimentation has shown this to work for eg. `geonames.org` and `richardlight.org.uk`.
okay, a few minutes of tracing shows what's going on, at least partially. the policy handler returning I wonder if the correct fix is actually to distinguish between "skipped this resource, roll back" and "skipped this resource, but further processing should be allowed to occur" |
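A minimal sketch of that distinction, again in Python rather than Anansi's C, and with every name (`PolicyResult`, `SKIP_ROLLBACK`, `SKIP_CONTINUE`, `after_policy`) invented for illustration rather than taken from the codebase:

```python
from enum import Enum

class PolicyResult(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    # Skipped, and nothing further should happen with this fetch.
    SKIP_ROLLBACK = "skip-rollback"
    # Skipped, but later stages (e.g. redirect handling) may still run.
    SKIP_CONTINUE = "skip-continue"

def after_policy(result):
    """Return (roll_back, run_processor) for a given policy result."""
    if result is PolicyResult.ACCEPT:
        return (False, True)
    if result is PolicyResult.SKIP_CONTINUE:
        # The payload is not stored, but the processor still gets to
        # look at the response and queue the redirect target.
        return (False, True)
    # REJECT and SKIP_ROLLBACK both stop here and undo the transaction.
    return (True, False)
```

The point of the extra state is that a redirect would map to the "continue" variant, so the skip no longer prevents the target from being queued.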
okay, introducing that looks like it'll DTRT (first pass without the resource in the DB, second with):
|
And for completeness, a run with an empty database:
|
Proposed fix in 6ab134a |
FWIW, the reason for the ugliness in this is because Arguably the actual redirect handling code should form part of the policy handler rather than the processor (keeping that in one place), allowing the processor to deal with crawl states alone instead of response statuses as well. |
If Anansi is asked to crawl a URI which returns a 303 redirect to a representation of the resource, the redirect is sometimes ignored. On other occasions, the redirect is queued and the resource is fetched. I couldn't discern any pattern to this behaviour.
The environment where I noted this behaviour is the Acropolis docker stack, and it can be reproduced as follows:
docker-compose build
inside a clone of the Anansi project.