Some revisit/redirect requests appear to 'hang' #53
On the back end, this log line fires immediately:
but then nothing else AFAICT. |
Seems
|
The revisit record looks like this:
So, it's an http -> https redirect that's been deduplicated. The nearest match that's not a revisit is:
<result>
<compressedoffset>863304405</compressedoffset>
<mimetype>application/http</mimetype>
<file>/heritrix/output/warcs/quarterly/20191001020435/BL-20191003115004411-01312-62~ukwa-h3-pulse-quarterly~8443.warc.gz</file>
<redirecturl>-</redirecturl>
<urlkey>scot,gov)/topics/marine/seamanagement</urlkey>
<digest>3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ</digest>
<httpresponsecode>302</httpresponsecode>
<robotflags>-</robotflags>
<url>https://www.gov.scot/Topics/marine/seamanagement/</url>
<capturedate>20191003115927</capturedate>
</result>
which looks fine to me. I think self-redirects should be ignored, but it looks like that's not happening here. |
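To make the self-redirect idea concrete, here is a minimal sketch of how an http -> https redirect can collapse onto the same index key. The `simple_urlkey` and `is_self_redirect` helpers are hypothetical and only approximate the canonicalisation visible in the `<urlkey>` above (scheme, `www` prefix, case and trailing slash dropped); stripping a numbered `www2`-style prefix is an assumption, and this is not the code pywb or the CDX server actually runs.

```python
import re
from urllib.parse import urlsplit


def simple_urlkey(url: str) -> str:
    """Approximate a SURT-style key: reversed host labels, then ')' and the path.

    Simplified for illustration: ignores ports and query strings.
    """
    parts = urlsplit(url.lower())
    # Assumption: a leading "www." or "www<digits>." is dropped from the host.
    host = re.sub(r"^www\d*\.", "", parts.hostname or "")
    path = parts.path.rstrip("/") or "/"
    return ",".join(reversed(host.split("."))) + ")" + path


def is_self_redirect(source_url: str, target_url: str) -> bool:
    """True when a redirect target canonicalises to the same key as its source."""
    return simple_urlkey(source_url) == simple_urlkey(target_url)


key = simple_urlkey("https://www.gov.scot/Topics/marine/seamanagement/")
same = is_self_redirect(
    "http://www.gov.scot/Topics/marine/seamanagement",
    "https://www.gov.scot/Topics/marine/seamanagement/",
)
```

Under these assumptions the http and https forms share one key, which is why a replay tool has to recognise and skip such a redirect rather than follow it back to itself.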
Logging three 'hang' requests as they talk to CDX and HDFS:
Which does indeed correspond to the redirect above. |
Hm, set up a local test. When set up as a collection and indexed by Using the
Using the
|
Thanks for the detailed notes! The skipping of the self-redirect should already be happening... However, what appears to be missing is a special case for the empty digest. I'll try adding a fix for that and see if it helps things. |
- if a revisit record has empty hash, don't attempt to lookup an original, simply use with empty payload
Not entirely sure if that was the issue, but updated pywb with a fix and ukwa-pywb to point to the latest |
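For readers following along, the digest in the CDX result quoted earlier (`3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ`) is the base32 SHA-1 of zero bytes, i.e. an empty payload. A rough sketch of the special case described in the commit message follows. The `needs_original_lookup` helper and the `mime`/`digest` field names are assumptions for illustration, and treating both a missing digest and the empty-payload digest as "empty hash" is one reading of that message; this is not the actual pywb change.

```python
import base64
import hashlib


def sha1_base32(payload: bytes) -> str:
    """Base32-encoded SHA-1, the form used for the digest in the CDX result."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


# Digest of a zero-length payload.
EMPTY_PAYLOAD_DIGEST = sha1_base32(b"")


def needs_original_lookup(record: dict) -> bool:
    """Decide whether a revisit record requires fetching the original capture.

    Hypothetical record shape: {"mime": ..., "digest": ...}.
    """
    if record.get("mime") != "warc/revisit":
        return False  # not a revisit, nothing to resolve
    # Tolerate an optional "sha1:" prefix on the digest.
    digest = record.get("digest", "").split(":", 1)[-1]
    # Missing or empty-payload digest: serve with an empty payload instead.
    return digest not in ("", "-", EMPTY_PAYLOAD_DIGEST)
```

With this check, the deduplicated 302 above would be served directly with an empty body instead of triggering a second lookup.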
Note the |
I set up a local test using some minimal WARCs. When set up as a collection and indexed by In one example, the CDX records look like this:
Using the
Using the
What may be further confusing the issue is the site is hosting different content at |
For the As a result, it was attempting to make an id_/ request to the cdx server, which obviously fails: Fortunately, there is a simple/elegant fix: If the cdx response includes the full cdx line, warc, offset, length, then it'll try to load those instead: If not, it'll try the id_/ route. The integration tests is set to |
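The fallback described here can be sketched roughly as below. The `choose_load_strategy` function and the `filename`/`offset`/`length` keys are illustrative assumptions about the CDX record shape, not pywb's actual loader.

```python
from typing import Optional, Tuple


def choose_load_strategy(cdx: dict) -> Tuple[str, Optional[tuple]]:
    """Pick how to load the original record for a revisit.

    Returns ("direct", (filename, offset, length)) when the CDX line carries
    everything needed to read the WARC record itself, otherwise ("id_", None)
    to signal a fallback to an id_/ request.
    """
    try:
        filename = cdx["filename"]
        offset = int(cdx["offset"])
        length = int(cdx["length"])
    except (KeyError, TypeError, ValueError):
        # A field is missing or not numeric (e.g. "-"): fall back.
        return ("id_", None)
    if not filename or filename == "-":
        return ("id_", None)
    return ("direct", (filename, offset, length))
```

An index that omits the record length would always take the `id_` branch in this sketch.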
The The two redirects are:
This can actually be disabled by setting I'll see if the 1) redirect can be eliminated also. |
|
To try and iron out some unnecessary variations, our CDX server has been upgraded to OutbackCDX 0.7.0 and we have switched to But.... we're still seeing odd behaviour. This example just hangs: I can't get to the CDX right now, but you can get the CDXJ like this: I'm struggling to see why this should be causing any kind of issue. EDIT forgot to add, if you go straight to the latest timestamp that is not a |
It must have something to do with loading multiple records as is needed for revisits. All the cdxj entries have Would be interesting to get stats on the record load |
Ah, I'll see if I can monitor what it's doing from the storage end. The indexer we have currently doesn't output the record length, just because that's slightly tricky to implement within the map-reduce framework we are using. Hence, I've been putting it off. Looks like I'll have to work out how to do it. |
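As an aside on why the missing record length matters: without it, a reader has to decompress from the offset until the gzip member ends, rather than issuing a bounded read. A minimal stdlib sketch of that follows, assuming one gzip member per WARC record; `read_gzip_member` is a hypothetical helper, not the loader pywb uses.

```python
import gzip
import io
import zlib


def read_gzip_member(stream, offset: int, chunk_size: int = 8192):
    """Decompress the single gzip member starting at `offset`.

    Returns (payload, compressed_length). The length is only known once the
    end of the member has been reached.
    """
    stream.seek(offset)
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip framing
    parts = []
    consumed = 0
    while not decomp.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # truncated input
        parts.append(decomp.decompress(chunk))
        consumed += len(chunk)
    # Bytes read past the end of the member belong to the next record.
    compressed_length = consumed - len(decomp.unused_data)
    return b"".join(parts), compressed_length


first = gzip.compress(b"first record")
second = gzip.compress(b"second record")
warc_like = io.BytesIO(first + second)
payload, length = read_gzip_member(warc_like, len(first))
```

Note that the loop reads past the end of the member in whole chunks, so over a network backend an unbounded read like this can pull far more data than the record itself.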
- blockrecordloader: ensure record stream is closed after parsing one record, may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo
- bump version to rc6
Looking further, it seems like partial reads can definitely pose issues: https://requests.readthedocs.io/en/master/user/advanced/#body-content-workflow |
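The linked requests documentation notes that with `stream=True` the connection is not released back to the pool until the body is fully consumed or the response is closed. A minimal sketch of the defensive pattern follows, using an in-memory stream in place of a real HTTP response; `ClosingReader` and `read_prefix` are hypothetical names and not pywb's own wrapper.

```python
import io


class ClosingReader:
    """Wrap a stream so it is always closed, even after a partial read."""

    def __init__(self, stream):
        self.stream = stream

    def read(self, size: int = -1) -> bytes:
        return self.stream.read(size)

    def close(self) -> None:
        if not self.stream.closed:
            self.stream.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions


def read_prefix(stream, n: int) -> bytes:
    """Read only the first n bytes, then release the underlying stream."""
    with ClosingReader(stream) as reader:
        return reader.read(n)


raw = io.BytesIO(b"HTTP/1.1 302 Found\r\n\r\n")
prefix = read_prefix(raw, 8)
```

The point of the pattern is that the close happens on exit regardless of how much of the body was consumed.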
Well, potentially missing close/release_conn.. not sure if it will make a difference, but try running with this image: |
So, when requesting... ...these are the CDX and HDFS requests that get logged,
Looking at the sockets it does seem there's a lingering connection that takes a while to drop, which seems about consistent with reading the WARC whole. But even then, it should come back eventually. This is the CDX it sees on the first call:
This is the CDX it sees on the second call:
So, it seems to pull the first result (the The WARC record itself looks fine:
|
Okay, as a workaround, I've reconfigured the index to drop revisits completely and things are much happier.
|
* fixes for RC6:
  - blockrecordloader: ensure record stream is closed after parsing one record
  - wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
  - simplify no_except_close may help with ukwa/ukwa-pywb#53
  - iframe: add allow fullscreen, autoplay
  - wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
  - rules: add rule for vimeo
* cdx formatting: fix output=text to return plain text / non-cdxj output
* auto fetch fix:
  - update to latest wombat to fix auto-fetch in rewriting mode
  - fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
  - don't use global to allow repeated checks
* rewriter html check: peek 1024 bytes to determine if page is html instead of 128
* fix jinja2 dependency for py2
- warcserver: when parsing headers to check for redirect, reserialized headers may be of different length than original, causing warcserver->app response to hang; now adjusting the content-length on the warc record and also not including a fixed length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7
* misc fixes for 2.4.0rc7:
  - warcserver: when parsing headers to check for redirect, reserialized headers may be of different length than original, causing warcserver->app response to hang; now adjusting the content-length on the warc record and also not including a fixed length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
  - undo change in path resolvers to use os.path.join, just concatenate full_path + filename
  - rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
  - bump version to rc7
* ci: attempt to fix travis build for 27, 35
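The hang described in these commit messages can be illustrated with a small sketch: if headers are parsed and written back out, whitespace normalisation alone can change their byte length, so a Content-Length computed from the original block promises bytes that never arrive. The `reserialize_headers` and `adjusted_content_length` helpers below are hypothetical and do not reproduce the warcserver code.

```python
def reserialize_headers(raw: bytes) -> bytes:
    """Parse an HTTP header block and write it back in normalised form."""
    lines = raw.split(b"\r\n")
    status_line = lines[0]
    out = [status_line]
    for line in lines[1:]:
        if not line:
            continue  # skip the blank terminator lines
        name, _, value = line.partition(b":")
        out.append(name.strip() + b": " + value.strip())
    return b"\r\n".join(out) + b"\r\n\r\n"


def adjusted_content_length(old_length: int, raw_headers: bytes, new_headers: bytes) -> int:
    """Correct a record length that covers headers plus payload."""
    return old_length - len(raw_headers) + len(new_headers)


# Extra spaces after "Location:" disappear on reserialization.
raw = b"HTTP/1.1 302 Found\r\nLocation:   https://www.gov.scot/\r\n\r\n"
new = reserialize_headers(raw)
new_length = adjusted_content_length(len(raw), raw, new)
```

In this example the reserialized block is two bytes shorter, so a receiver told to expect the original length would wait for two bytes that are never sent.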
Great! -- Having (temporarily?) rolled 2.4.0-beta out as BETA Wayback, the links that hung before now work fine! |
We have problems with some URLs on a particular site, where some requests appear to work and others hang.
This hangs:
But
http://access.n45.wa.bl.uk:8280/archive/20191024125646mp_/http://www2.gov.scot/Topics/marine/seamanagement
works. Looking at the CDX:
So, it looks like our revisit records are causing some kind of problem?