Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some revisit/redirect requests appear to 'hang' #53

Closed
anjackson opened this issue Dec 20, 2019 · 20 comments · Fixed by webrecorder/pywb#534
Closed

Some revisit/redirect requests appear to 'hang' #53

anjackson opened this issue Dec 20, 2019 · 20 comments · Fixed by webrecorder/pywb#534
Assignees

Comments

@anjackson
Copy link
Contributor

anjackson commented Dec 20, 2019

We have problems with some URLs on a particular site, where some requests appear to work and others hang.

This hangs:

#  curl -v -x ''  http://access.n45.wa.bl.uk:8280/archive/20191024125648mp_/http://www2.gov.scot/Topics/marine/seamanagement
* About to connect() to access.n45.wa.bl.uk port 8280 (#0)
*   Trying 192.168.45.25...
* Connected to access.n45.wa.bl.uk (192.168.45.25) port 8280 (#0)
> GET /archive/20191024125648mp_/http://www2.gov.scot/Topics/marine/seamanagement HTTP/1.1
> User-Agent: curl/7.29.0
> Host: access.n45.wa.bl.uk:8280
> Accept: */*
>

But http://access.n45.wa.bl.uk:8280/archive/20191024125646mp_/http://www2.gov.scot/Topics/marine/seamanagement works.

Looking at the CDX:

...
<result>
<compressedoffset>173790774</compressedoffset>
<mimetype>text/html</mimetype>
<file>
/heritrix/output/frequent-npld/20191010210458/warcs/BL-NPLD-20191024103625885-00622-76~npld-heritrix3-worker-1~8443.warc.gz
</file>
<redirecturl>-</redirecturl>
<urlkey>scot,gov)/topics/marine/seamanagement</urlkey>
<digest>6AZTIB2Q5GWAB47AIIHXHFVTZKOD2UBU</digest>
<httpresponsecode>200</httpresponsecode>
<robotflags>-</robotflags>
<url>https://www2.gov.scot/Topics/marine/seamanagement</url>
<capturedate>20191024125646</capturedate>
</result>
<result>
<compressedoffset>188646081</compressedoffset>
<mimetype>warc/revisit</mimetype>
<file>
/heritrix/output/frequent-npld/20191010210458/warcs/BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz
</file>
<redirecturl>-</redirecturl>
<urlkey>scot,gov)/topics/marine/seamanagement</urlkey>
<digest>3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ</digest>
<httpresponsecode>0</httpresponsecode>
<robotflags>-</robotflags>
<url>http://www2.gov.scot/Topics/marine/seamanagement</url>
<capturedate>20191024125648</capturedate>
</result>
</results>
</wayback>

So, it looks like our revisit records are causing some kind of problem?

@anjackson
Copy link
Contributor Author

On the back end, this log line fires immediately:

127.0.0.1 - - [2019-12-20 18:01:56] "POST /archive/resource/postreq?url=http%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125648&matchType=exact HTTP/1.1" 200 2652 0.156263

but then nothing else AFIACT.

@anjackson
Copy link
Contributor Author

Seems cdx-indexer sees the same thing:

scot,gov)/topics/marine/seamanagement 20191024125648 http://www2.gov.scot/Topics/marine/seamanagement warc/revisit - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 425 188646081 BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz

@anjackson
Copy link
Contributor Author

anjackson commented Dec 20, 2019

The revisit records looks like this:

[root@ingest ~]# tail -c +188646082 BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz | gunzip | head -50
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www2.gov.scot/Topics/marine/seamanagement
WARC-Date: 2019-10-24T12:56:48Z
WARC-IP-Address: 134.19.161.249
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Refers-To-Date: 2019-03-29T23:43:44Z
WARC-Refers-To-Target-URI: http://www2.gov.scot/Topics/marine/seamanagement
WARC-Record-ID: <urn:uuid:2af268a9-9a6b-43c0-85f9-cecc2699a7ca>
Content-Type: application/http; msgtype=response
Content-Length: 136

HTTP/1.0 302 Found
Location: https://www2.gov.scot/Topics/marine/seamanagement
Server: BigIP
Connection: close
Content-Length: 0



So, it's a http -> https redirect that's been deduplicated.

The nearest match that's not a revisit is:

<result>
<compressedoffset>863304405</compressedoffset>
<mimetype>application/http</mimetype>
<file>
/heritrix/output/warcs/quarterly/20191001020435/BL-20191003115004411-01312-62~ukwa-h3-pulse-quarterly~8443.warc.gz
</file>
<redirecturl>-</redirecturl>
<urlkey>scot,gov)/topics/marine/seamanagement</urlkey>
<digest>3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ</digest>
<httpresponsecode>302</httpresponsecode>
<robotflags>-</robotflags>
<url>https://www.gov.scot/Topics/marine/seamanagement/</url>
<capturedate>20191003115927</capturedate>
</result>
# tail -c +863304406 BL-20191003115004411-01312-62~ukwa-h3-pulse-quarterly~8443.warc.gz | gunzip | head -50
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.gov.scot/Topics/marine/seamanagement/
WARC-Date: 2019-10-03T11:59:27Z
WARC-IP-Address: 54.229.133.187
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:91dd76dc-7089-4441-a3e2-80ed50a59763>
Content-Type: application/http; msgtype=response
Content-Length: 1402

HTTP/1.1 302
Date: Thu, 03 Oct 2019 11:59:27 GMT
Content-Length: 0
Connection: close
Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' data: https://cdn.syndication.twimg.com/ https://platform.twitter.com https://www.googletagmanager.com https://tagmanager.google.com/ https://www.google-analytics.com https://translate.google.com https://translate.googleapis.com https://www.youtube.com/iframe_api https://s.ytimg.com/; img-src 'self' data: https://cdn.prgloo.com/ https://www.google.com https://www.google-analytics.com/ https://www.googletagmanager.com/ https://www.gstatic.com https://translate.google.com https://translate.googleapis.com https://*.staticflickr.com/ https://*.ytimg.com/ https://*.twimg.com/ https://*.twitter.com/; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com https://maxcdn.bootstrapcdn.com https://platform.twitter.com https://ton.twimg.com/ https://translate.google.com https://translate.googleapis.com https://tagmanager.google.com/; font-src 'self' data: https://fonts.gstatic.com https://maxcdn.bootstrapcdn.com; frame-src https://platform.twitter.com https://syndication.twitter.com https://www.googletagmanager.com https://www.youtube.com https://youtu.be https://www.youtube-nocookie.com; object-src 'self'; report-uri /service/csp;
Location: https://www2.gov.scot/Topics/marine/seamanagement
X-Frame-Options: SAMEORIGIN



which looks fine to me.

I think self-redirects should be ignored, but it looks like that's not happening here.

@anjackson
Copy link
Contributor Author

Logging three -hang- requests as they talks to CDX and HDFS:

192.168.45.25 - - [20/Dec/2019:23:52:45 +0000] "GET /data-heritrix?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 47043 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:52:45 +0000] "GET /data-heritrix?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 47043 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:52:45 +0000] "GET /webhdfs/v1/heritrix/output/frequent-npld/20191010210458/warcs/BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=188646081&user.name=access HTTP/1.1" 200 318599 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:52:45 +0000] "GET /webhdfs/v1/heritrix/output/warcs/quarterly/20190401020202/BL-20190402155542852-01287-62~ukwa-h3-pulse-quarterly~8443.warc.gz?op=OPEN&offset=767694787&user.name=access HTTP/1.1" 200 98134 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:02 +0000] "GET /data-heritrix?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 47079 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:02 +0000] "GET /data-heritrix?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 47057 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:02 +0000] "GET /webhdfs/v1/heritrix/output/frequent-npld/20191010210458/warcs/BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=188646081&user.name=access HTTP/1.1" 200 403327 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:02 +0000] "GET /webhdfs/v1/heritrix/output/warcs/quarterly/20190401020202/BL-20190402155542852-01287-62~ukwa-h3-pulse-quarterly~8443.warc.gz?op=OPEN&offset=767694787&user.name=access HTTP/1.1" 200 218030 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:13 +0000] "GET /data-heritrix?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 47069 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:13 +0000] "GET /data-heritrix?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 47072 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:13 +0000] "GET /webhdfs/v1/heritrix/output/frequent-npld/20191010210458/warcs/BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=188646081&user.name=access HTTP/1.1" 200 375367 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [20/Dec/2019:23:53:13 +0000] "GET /webhdfs/v1/heritrix/output/warcs/quarterly/20190401020202/BL-20190402155542852-01287-62~ukwa-h3-pulse-quarterly~8443.warc.gz?op=OPEN&offset=767694787&user.name=access HTTP/1.1" 200 89942 "-" "python-requests/2.22.0" "-"

Which does indeed correspond to the redirect above.

@anjackson anjackson changed the title Some requests appear to 'hang' Some revisit/redirect requests appear to 'hang' Dec 22, 2019
@anjackson
Copy link
Contributor Author

Hm, set up a local test. When set up set up as a collection and indexed by pywb, the WARCs worked fine. Using OutbackCDX does not behave the same.

Using the xmlquery+http it seems to work but is slow.

pywb_1         | [pid: 24|app: 0|req: 1/1] 172.27.0.1 () {38 vars in 2571 bytes} [Sun Dec 22 09:36:19 2019] GET /qa-access/20191024125648/http://www2.gov.scot/Topics/marine/seamanagement => generated 3008 bytes in 22 msecs (HTTP/1.1 200) 4 headers in 592 bytes (3 switches on core 399)
pywb_1         | [pid: 21|app: 0|req: 1/2] 172.27.0.1 () {36 vars in 2378 bytes} [Sun Dec 22 09:36:19 2019] GET /static/wb_frame.js => generated 8514 bytes in 13 msecs (HTTP/1.1 200) 2 headers in 79 bytes (3 switches on core 399)
pywb_1         | [pid: 23|app: 0|req: 1/3] 172.27.0.1 () {36 vars in 2390 bytes} [Sun Dec 22 09:36:19 2019] GET /static/default_banner.js => generated 9796 bytes in 13 msecs (HTTP/1.1 200) 2 headers in 79 bytes (3 switches on core 399)
pywb_1         | [pid: 24|app: 0|req: 2/4] 172.27.0.1 () {36 vars in 2407 bytes} [Sun Dec 22 09:36:19 2019] GET /static/default_banner.css => generated 3783 bytes in 22 msecs (HTTP/1.1 200) 2 headers in 65 bytes (191 switches on core 399)
pywb_1         | [pid: 20|app: 0|req: 1/6] 172.27.0.1 () {36 vars in 2405 bytes} [Sun Dec 22 09:36:19 2019] GET /static/ukwa-2018-w-sml.png => generated 5678 bytes in 17 msecs (HTTP/1.1 200) 2 headers in 66 bytes (26 switches on core 399)
pywb_1         | [pid: 19|app: 0|req: 1/6] 172.27.0.1 () {36 vars in 2391 bytes} [Sun Dec 22 09:36:19 2019] GET /static/calendar.svg => generated 467 bytes in 13 msecs (HTTP/1.1 200) 2 headers in 69 bytes (11 switches on core 399)
pywb_1         | /usr/local/lib/python3.7/site-packages/pywb-2.4.0rc0-py3.7.egg/pywb/warcserver/index/indexsource.py:264: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
pywb_1         | 127.0.0.1 - - [2019-12-22 09:36:20] "POST /qa-access/resource/postreq?url=http%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125648&matchType=exact HTTP/1.1" 200 18548 0.562804
pywb_1         | [pid: 24|app: 0|req: 3/7] 172.27.0.1 () {36 vars in 2558 bytes} [Sun Dec 22 09:36:19 2019] GET /qa-access/20191024125648mp_/http://www2.gov.scot/Topics/marine/seamanagement => generated 0 bytes in 588 msecs (HTTP/1.1 307) 3 headers in 240 bytes (2 switches on core 399)
pywb_1         | /usr/local/lib/python3.7/site-packages/pywb-2.4.0rc0-py3.7.egg/pywb/warcserver/index/indexsource.py:264: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
pywb_1         | 127.0.0.1 - - [2019-12-22 09:36:20] "POST /qa-access/resource/postreq?url=https%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125646&matchType=exact HTTP/1.1" 200 18413 0.040935
pywb_1         | [pid: 24|app: 0|req: 4/8] 172.27.0.1 () {36 vars in 2586 bytes} [Sun Dec 22 09:36:20 2019] GET /qa-access/20191024125646mp_/https://www2.gov.scot/Topics/marine/seamanagement => generated 23476 bytes in 70 msecs (HTTP/1.1 200) 17 headers in 1835 bytes (3 switches on core 399)
pywb_1         | [pid: 22|app: 0|req: 1/9] 172.27.0.1 () {34 vars in 2389 bytes} [Sun Dec 22 09:36:20 2019] GET /static/default_banner.js => generated 9796 bytes in 4 msecs (HTTP/1.1 200) 2 headers in 79 bytes (3 switches on core 399)
pywb_1         | [pid: 21|app: 0|req: 2/10] 172.27.0.1 () {34 vars in 2373 bytes} [Sun Dec 22 09:36:20 2019] GET /static/wombat.js => generated 112806 bytes in 4 msecs (HTTP/1.1 200) 2 headers in 81 bytes (3 switches on core 399)
pywb_1         | [pid: 20|app: 0|req: 2/11] 172.27.0.1 () {34 vars in 2371 bytes} [Sun Dec 22 09:36:20 2019] GET /static/fixup.js => generated 739 bytes in 2 msecs (HTTP/1.1 200) 2 headers in 78 bytes (23 switches on core 399)
pywb_1         | [pid: 22|app: 0|req: 2/12] 172.27.0.1 () {34 vars in 2406 bytes} [Sun Dec 22 09:36:20 2019] GET /static/default_banner.css => generated 3783 bytes in 5 msecs (HTTP/1.1 200) 2 headers in 65 bytes (191 switches on core 399)
pywb_1         | [pid: 21|app: 0|req: 3/13] 172.27.0.1 () {34 vars in 2387 bytes} [Sun Dec 22 09:36:20 2019] GET /static/transclusions.js => generated 1972 bytes in 29 msecs (HTTP/1.1 200) 2 headers in 79 bytes (84 switches on core 399)
pywb_1         | 127.0.0.1 - - [2019-12-22 09:36:20] "POST /qa-access/resource/postreq?url=https%3A%2F%2Fwww2.gov.scot%2FSG%2Fcss%2Fbetabanner.css&closest=20191024125646&matchType=exact HTTP/1.1" 404 432 0.076436
pywb_1         | 127.0.0.1 - - [2019-12-22 09:36:20] "POST /qa-access/resource/postreq?url=https%3A%2F%2Fwww2.gov.scot%2FSG%2FContent%2FsgNotification.css&closest=20191024125646&matchType=exact HTTP/1.1" 404 448 0.070863
pywb_1         | 127.0.0.1 - - [2019-12-22 09:36:20] "POST /qa-access/resource/postreq?url=https%3A%2F%2Fwww2.gov.scot%2FSG%2FContent%2FvendorLibsBundle.min.css&closest=20191024125646&matchType=exact HTTP/1.1" 404 460 0.112325
pywb_1         | 127.0.0.1 - - [2019-12-22 09:36:20] "POST /qa-access/resource/postreq?url=https%3A%2F%2Fwww2.gov.scot%2Fjs%2Fpublic.js&closest=20191024125646&matchType=exact HTTP/1.1" 404 414 0.104713
...

Using the cdx+http mode it just says Not Found? e.g. opening http://localhost:8081/qa-access/20191024125648/http://www2.gov.scot/Topics/marine/seamanagement

pywb_1         | [pid: 14|app: -1|req: -1/182]  () {0 vars in 0 bytes} [Sun Dec 22 09:31:21 2019]   => generated 0 bytes in 0 msecs ( 0) 0 headers in 0 bytes (1 switches on core 398)
pywb_1         | [pid: 14|app: 0|req: 34/183] 172.27.0.1 () {38 vars in 2597 bytes} [Sun Dec 22 09:31:23 2019] GET /qa-access/20191024125648/http://www2.gov.scot/Topics/marine/seamanagement => generated 3008 bytes in 1 msecs (HTTP/1.1 200) 4 headers in 592 bytes (3 switches on core 398)
pywb_1         | [pid: 21|app: 0|req: 19/184] 172.27.0.1 () {36 vars in 2416 bytes} [Sun Dec 22 09:31:23 2019] GET /static/default_banner.js => generated 9796 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 79 bytes (3 switches on core 399)
pywb_1         | [pid: 22|app: 0|req: 20/185] 172.27.0.1 () {36 vars in 2404 bytes} [Sun Dec 22 09:31:23 2019] GET /static/wb_frame.js => generated 8514 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 79 bytes (3 switches on core 398)
pywb_1         | [pid: 14|app: 0|req: 35/186] 172.27.0.1 () {36 vars in 2433 bytes} [Sun Dec 22 09:31:23 2019] GET /static/default_banner.css => generated 3783 bytes in 64 msecs (HTTP/1.1 200) 2 headers in 65 bytes (191 switches on core 398)
pywb_1         | [pid: 22|app: 0|req: 21/187] 172.27.0.1 () {36 vars in 2431 bytes} [Sun Dec 22 09:31:23 2019] GET /static/ukwa-2018-w-sml.png => generated 5678 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 66 bytes (26 switches on core 398)
pywb_1         | [pid: 14|app: 0|req: 36/188] 172.27.0.1 () {36 vars in 2417 bytes} [Sun Dec 22 09:31:23 2019] GET /static/calendar.svg => generated 467 bytes in 7 msecs (HTTP/1.1 200) 2 headers in 69 bytes (11 switches on core 398)
warc-server_1  | 172.27.0.4 - - [22/Dec/2019:09:31:23 +0000] "GET /webhdfs/v1/by-filename/BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=188646081&length=425&user.name=access HTTP/1.1" 200 425 "-" "python-requests/2.22.0"
warc-server_1  | 172.27.0.4 - - [22/Dec/2019:09:31:23 +0000] "GET /webhdfs/v1/by-filename/BL-20191003115004411-01312-62~ukwa-h3-pulse-quarterly~8443.warc.gz?op=OPEN&offset=863304405&length=700&user.name=access HTTP/1.1" 200 700 "-" "python-requests/2.22.0"
pywb_1         | 127.0.0.1 - - [2019-12-22 09:31:23] "POST /qa-access/resource/postreq?url=http%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125648&matchType=exact HTTP/1.1" 200 1642 0.140938
pywb_1         | [pid: 23|app: 0|req: 10/189] 172.27.0.1 () {36 vars in 2584 bytes} [Sun Dec 22 09:31:23 2019] GET /qa-access/20191024125648mp_/http://www2.gov.scot/Topics/marine/seamanagement => generated 10 bytes in 155 msecs (HTTP/1.1 404) 10 headers in 935 bytes (3 switches on core 399)

@ikreymer
Copy link
Contributor

Thanks for the detailed notes! The skipping of the self-redirect should already be happening...
Previously, we had an issue with cdx status code being 0 due to <httpresponsecode>0</httpresponsecode> but i think that is not accounted for.

However, what appears to be missing is a special case for the empty digest sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ. I think pywb will attempt to match a non-revisit record with this digest, which could trigger additional loads. Instead, it should of course return an empty payload..

I'll try adding a fix for that and see if it helps things

ikreymer added a commit to webrecorder/pywb that referenced this issue Jan 11, 2020
- if a revisit record has empty hash, don't attempt to lookup an original, simply use with empty payload
ikreymer added a commit to webrecorder/pywb that referenced this issue Jan 11, 2020
- if a revisit record has empty hash, don't attempt to lookup an original, simply use with empty payload
@ikreymer
Copy link
Contributor

Not entirely sure if that was the issue, but updated pywb with a fix and ukwa-pywb to point to the latest webrecorder/pywb:2.4.0-rc4 image. Hopefully that will fix it!

@anjackson
Copy link
Contributor Author

Note the pywb release tag was 2.4.0rc4 (no hyphen). I've updated the reference in ukwa-pywb.

@anjackson
Copy link
Contributor Author

I set up a local test using some minimal WARCs. When set up set up as a collection and indexed by pywb, the WARCs worked fine, but OutbackCDX does not behave the same, and the JSON and XML APIs do not behave the same.

Once example, the CDX records look like this:

scot,gov)/topics/marine/seamanagement 20191003115920 http://www.gov.scot/Topics/marine/seamanagement text/html 301 QH732FYSV7UM34JYWVYMB7EZGR2CYM6B - - 454 862873909 BL-20191003115004411-01312-62~ukwa-h3-pulse-quarterly~8443.warc.gz
scot,gov)/topics/marine/seamanagement 20191003115927 https://www.gov.scot/Topics/marine/seamanagement/ unk 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 700 863304405 BL-20191003115004411-01312-62~ukwa-h3-pulse-quarterly~8443.warc.gz
scot,gov)/topics/marine/seamanagement 20191024125646 https://www2.gov.scot/Topics/marine/seamanagement text/html 200 6AZTIB2Q5GWAB47AIIHXHFVTZKOD2UBU - - 5577 173790774 BL-NPLD-20191024103625885-00622-76~npld-heritrix3-worker-1~8443.warc.gz
scot,gov)/topics/marine/seamanagement 20191024125648 http://www2.gov.scot/Topics/marine/seamanagement warc/revisit 0 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 425 188646081 BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz

Using the cdx+http mode it just says Not Found?

ywb_1         | 2020-01-13 19:19:06,067: [DEBUG]: http://cdx:8080 "GET /tc?url=http%3A//www2.gov.scot/Topics/marine/seamanagement&closest=20191024125648&sort=closest&limit=10 HTTP/1.1" 200 None
pywb_1         | 2020-01-13 19:19:06,108: [DEBUG]: http://warc-server:8000 "GET /webhdfs/v1/by-filename/BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=188646081&length=425&user.name=access HTTP/1.1" 200 None
pywb_1         | Traceback (most recent call last):
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/handlers.py", line 137, in __call__
pywb_1         |     out_headers, resp = loader(cdx, params)
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/resource/responseloader.py", line 37, in __call__
pywb_1         |     entry = self.load_resource(cdx, params)
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/resource/responseloader.py", line 216, in load_resource
pywb_1         |     http_headers.get_header('Location'))
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/resource/responseloader.py", line 154, in raise_on_self_redirect
pywb_1         |     raise LiveResourceException(msg)
pywb_1         | pywb.utils.wbexception.LiveResourceException: Self Redirect www2.gov.scot/topics/marine/seamanagement -> www2.gov.scot/topics/marine/seamanagement
pywb_1         | 2020-01-13 19:19:06,113: [DEBUG]: http://cdx:8080 "GET /20191024125648id_/http://www2.gov.scot/Topics/marine/seamanagement HTTP/1.1" 404 10
pywb_1         | 127.0.0.1 - - [2020-01-13 19:19:06] "POST /qa-access/resource/postreq?url=http%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125648&matchType=exact HTTP/1.1" 200 1642 0.057308
pywb_1         | 2020-01-13 19:19:06,115: [DEBUG]: http://localhost:40839 "POST /qa-access/resource/postreq?url=http%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125648&matchType=exact HTTP/1.1" 200 616

Using the xmlquery+http it seems to work but is slow.

ywb_1         | 2020-01-13 19:21:10,917: [INFO]: Proxy enabled for collection "qa-access"
pywb_1         | 2020-01-13 19:21:11,090: [INFO]: Starting Gevent Server on 8080
pywb_1         | 2020-01-13 19:21:11,090: [DEBUG]: starting server on 42981
pywb_1         | 2020-01-13 19:21:11,091: [DEBUG]: starting server on 8080
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:53] "GET /qa-access/20191024125648/http://www2.gov.scot/Topics/marine/seamanagement HTTP/1.1" 200 3637 0.011839
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:53] "GET /static/wb_frame.js HTTP/1.1" 200 8630 0.005952
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:53] "GET /static/default_banner.js HTTP/1.1" 200 10126 0.001017
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:53] "GET /static/default_banner.css HTTP/1.1" 200 3885 0.000781
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:53] "GET /static/ukwa-2018-w-sml.png HTTP/1.1" 200 5781 0.001736
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:53] "GET /static/calendar.svg HTTP/1.1" 200 573 0.001103
pywb_1         | 2020-01-13 19:21:53,618: [DEBUG]: Starting new HTTP connection (1): localhost:42981
pywb_1         | 2020-01-13 19:21:53,624: [DEBUG]: Running query: http://cdx:8080/tc?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement
pywb_1         | 2020-01-13 19:21:53,629: [DEBUG]: Starting new HTTP connection (1): cdx:8080
pywb_1         | 2020-01-13 19:21:53,922: [DEBUG]: http://cdx:8080 "GET /tc?q=type%3Aurlquery+url%3Ahttp%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 None
pywb_1         | /usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/index/indexsource.py:268: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
pywb_1         |   if not items:
pywb_1         | 2020-01-13 19:21:53,994: [DEBUG]: Starting new HTTP connection (1): warc-server:8000
pywb_1         | 2020-01-13 19:21:54,008: [DEBUG]: http://warc-server:8000 "GET /webhdfs/v1/by-filename/BL-NPLD-20191024102939401-00620-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=188646081&user.name=access HTTP/1.1" 200 None
pywb_1         | Traceback (most recent call last):
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/handlers.py", line 137, in __call__
pywb_1         |     out_headers, resp = loader(cdx, params)
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/resource/responseloader.py", line 37, in __call__
pywb_1         |     entry = self.load_resource(cdx, params)
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/resource/responseloader.py", line 216, in load_resource
pywb_1         |     http_headers.get_header('Location'))
pywb_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.4.0rc4-py3.7.egg/pywb/warcserver/resource/responseloader.py", line 154, in raise_on_self_redirect
pywb_1         |     raise LiveResourceException(msg)
pywb_1         | pywb.utils.wbexception.LiveResourceException: Self Redirect www2.gov.scot/topics/marine/seamanagement -> www2.gov.scot/topics/marine/seamanagement
pywb_1         | 2020-01-13 19:21:54,013: [DEBUG]: Starting new HTTP connection (2): warc-server:8000
pywb_1         | 2020-01-13 19:21:54,063: [DEBUG]: http://warc-server:8000 "GET /webhdfs/v1/by-filename/BL-NPLD-20191024103625885-00622-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=173790774&user.name=access HTTP/1.1" 200 None
pywb_1         | 127.0.0.1 - - [2020-01-13 19:21:54] "POST /qa-access/resource/postreq?url=http%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125648&matchType=exact HTTP/1.1" 200 18548 0.444980
pywb_1         | 2020-01-13 19:21:54,068: [DEBUG]: http://localhost:42981 "POST /qa-access/resource/postreq?url=http%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125648&matchType=exact HTTP/1.1" 200 17669
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:54] "GET /qa-access/20191024125648mp_/http://www2.gov.scot/Topics/marine/seamanagement HTTP/1.1" 307 277 0.460181
pywb_1         | 2020-01-13 19:21:54,083: [DEBUG]: Starting new HTTP connection (1): localhost:42981
pywb_1         | 2020-01-13 19:21:54,086: [DEBUG]: Running query: http://cdx:8080/tc?q=type%3Aurlquery+url%3Ahttps%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement
pywb_1         | 2020-01-13 19:21:54,093: [DEBUG]: http://cdx:8080 "GET /tc?q=type%3Aurlquery+url%3Ahttps%253A%252F%252Fwww2.gov.scot%252FTopics%252Fmarine%252Fseamanagement HTTP/1.1" 200 None
pywb_1         | 2020-01-13 19:21:54,097: [DEBUG]: Starting new HTTP connection (3): warc-server:8000
pywb_1         | 2020-01-13 19:21:54,110: [DEBUG]: http://warc-server:8000 "GET /webhdfs/v1/by-filename/BL-NPLD-20191024103625885-00622-76~npld-heritrix3-worker-1~8443.warc.gz?op=OPEN&offset=173790774&user.name=access HTTP/1.1" 200 None
pywb_1         | 127.0.0.1 - - [2020-01-13 19:21:54] "POST /qa-access/resource/postreq?url=https%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125646&matchType=exact HTTP/1.1" 200 18413 0.026522
pywb_1         | 2020-01-13 19:21:54,112: [DEBUG]: http://localhost:42981 "POST /qa-access/resource/postreq?url=https%3A%2F%2Fwww2.gov.scot%2FTopics%2Fmarine%2Fseamanagement&closest=20191024125646&matchType=exact HTTP/1.1" 200 17669
pywb_1         | 172.27.0.1 - - [2020-01-13 19:21:54] "GET /qa-access/20191024125646mp_/https://www2.gov.scot/Topics/marine/seamanagement HTTP/1.1" 200 25352 0.063614

What may be further confusing the issue is the site is hosting different content at www. and www2.

@ikreymer
Copy link
Contributor

For the cdx+http config, the issue was actually something unrelated -- a bug in the RemoteIndexSource, which was designed for an 'external' cdx server that also provides the raw content via id_/, eg. web.archive.org/cdx and web.archive.org/web..

As a result, it was attempting to make an id_/ request to the cdx server, which obviously fails:
2020-01-13 19:19:06,113: [DEBUG]: http://cdx:8080 "GET /20191024125648id_/http://www2.gov.scot/Topics/marine/seamanagement

Fortunately, there is a simple/elegant fix:

If the cdx response includes the full cdx line, warc, offset, length, then it'll try to load those instead: If not, it'll try the id_/ route.

The integration tests is set to ukwa/tinycdxserver:ukwa-production which uses an old version of OutbackCDX which does not include the warc info, but testing the fix with latest nlagovau/outbackcdx has cdx+http endpoint works well.

@ikreymer
Copy link
Contributor

The www2. vs www. is indeed causing an extra redirect, so revisit record results in two redirects, vs direct 200, thus ends up being slower.

The two redirects are:

  1. First redirect is www. -> www2. This seems like a bug as it should be eliminated in the check.

  2. Second redirect is for timestamp 20191024125648 -> 20191024125646.

This can actually be disabled by setting redirect_to_exact: false in pywb and that is the default setting, most to cut down on redirect behavior. Instead of a redirect, the timestamp could remain as is (to avoid time drift) or changed with replaceState (this option needs to be added).

I'll see if the 1) redirect can be eliminated also.

@ikreymer
Copy link
Contributor

webrecorder/pywb:2.4.0-rc5 should have a fix that avoids the www. -> www2. redirect.
That should address one of the perf issues, setting redirect_to_exact: false should further improve things.

@anjackson
Copy link
Contributor Author

anjackson commented Jan 21, 2020

To try and iron out some unnecessary variations, our CDX server has been upgraded to OutbackCDX 0.7.0 and we have switched to cdx+http rather than using the XML interface.

But.... we're still seeing odd behaviour. This example just hangs:

https://beta.webarchive.org.uk/wayback/archive/20190906060549cs_/http://www2.gov.scot/Styles/css/S154-140-159.css

I can't get to the CDX right now, but you can get the CDXJ like this:

https://beta.webarchive.org.uk/wayback/archive/timemap/cdxj/https://www2.gov.scot/Styles/css/S154-140-159.css

I'm struggling to see why this should be causing any kind of issue.

EDIT forgot to add, if you go straight to the latest timestamp that is not a warc/revisit then it works fine:

https://beta.webarchive.org.uk/wayback/archive/20190901161913cs_/http://www2.gov.scot/Styles/css/S154-140-159.css

@ikreymer
Copy link
Contributor

It must have something to do with loading multiple records as is needed for revisists..

All the cdxj entries have "length": "0", my guess is that this is causing the backend to load the entire WARC from the offset, perhaps causing the slowdown? Is it possible to reindex with length? If not, perhaps its possible to close the connection if that's the issue.

Would be interesting to get stats on the record load

@anjackson
Copy link
Contributor Author

Ah, I'll see if I can monitor what it's doing from the storage end.

The indexer we have currently doesn't output the record length, just because that's slightly tricky to implement within the map-reduce framework we are using. Hence, I've been putting it off. Looks like I'll have to work out how to do it.

ikreymer added a commit to webrecorder/pywb that referenced this issue Jan 22, 2020
- blockrecordloader: ensure record stream is closed after parsing one record, may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo
- bump version to rc6
@ikreymer
Copy link
Contributor

Looking further, it seems like partial reads can definitely pose issues: https://requests.readthedocs.io/en/master/user/advanced/#body-content-workflow
I found a missing close that can be added as suggested, trying that out in RC6 release to see if it helps.

@ikreymer
Copy link
Contributor

Well, potentially missing close/release_conn.. not sure if it will make a difference, but try running with this image: webrecorder/pywb:2.4.0-rc6-test. It should be better about releasing connections..

anjackson added a commit that referenced this issue Jan 23, 2020
@anjackson
Copy link
Contributor Author

So, when requesting...

https://beta.webarchive.org.uk/wayback/archive/20190906060549mp_/https://www2.gov.scot/Styles/css/S154-140-159.css

...these are the CDX and HDFS requests that get logged,

192.168.45.25 - - [23/Jan/2020:12:26:15 +0000] "GET /data-heritrix?url=https%3A//www2.gov.scot/Styles/css/S154-140-159.css&closest=20190906060549&sort=closest&limit=10 HTTP/1.1" 200 2779 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [23/Jan/2020:12:26:15 +0000] "GET /data-heritrix?url=https%3A//www2.gov.scot/Styles/css/S154-140-159.css&closest=20190829224752&sort=closest&limit=10 HTTP/1.1" 200 2771 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [23/Jan/2020:12:26:15 +0000] "GET /webhdfs/v1/heritrix/output/warcs/weekly/20190905104159/BL-20190906060329783-00019-63~ukwa-h3-pulse-weekly~8443.warc.gz?op=OPEN&offset=64732233&user.name=access HTTP/1.1" 200 369142 "-" "python-requests/2.22.0" "-"
192.168.45.25 - - [23/Jan/2020:12:26:15 +0000] "GET /webhdfs/v1/heritrix/output/warcs/monthly/20190901130333/BL-20190901161643432-00077-62~ukwa-h3-pulse-monthly~8443.warc.gz?op=OPEN&offset=249525127&user.name=access HTTP/1.1" 200 222382 "-" "python-requests/2.22.0" "-"

Looking at the sockets it does seem there's a lingering connection that takes a while to drop, which seems about consistent with reading the WARC whole. But even then, it should come back eventually.

This is the CDX it sees on the first call: /data-heritrix?url=https%3A//www2.gov.scot/Styles/css/S154-140-159.css&closest=20190906060549&sort=closest&limit=10:

scot,gov)/styles/css/s154-140-159.css 20190906060549 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 64732233 /heritrix/output/warcs/weekly/20190905104159/BL-20190906060329783-00019-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190909171446 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 628669033 /heritrix/output/warcs/weekly/20190909080236/BL-20190909164421343-00009-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190901161913 https://www2.gov.scot/Styles/css/S154-140-159.css text/css 200 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 249525127 /heritrix/output/warcs/monthly/20190901130333/BL-20190901161643432-00077-62~ukwa-h3-pulse-monthly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190829224752 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 1005125301 /heritrix/output/warcs/weekly/20190829111613/BL-20190829223006394-00017-62~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190916185754 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 486531076 /heritrix/output/warcs/weekly/20190916080226/BL-20190916183129444-00013-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190923192758 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 262992500 /heritrix/output/warcs/weekly/20190923080231/BL-20190923191018716-00014-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190813052856 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 763228040 /heritrix/output/warcs/weekly/20190812080202/BL-20190813051304533-00018-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20191002210658 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 26314008 /heritrix/output/warcs/weekly/20191002152237/BL-20191002210524398-00019-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20191008000412 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 15380496 /heritrix/output/warcs/weekly/20191007082201/BL-20191008000322910-00015-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190801162036 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 840454981 /heritrix/output/warcs/monthly/20190801130919/BL-20190801160625325-00025-63~ukwa-h3-pulse-monthly~8443.warc.gz

This is the CDX it sees on the second call: /data-heritrix?url=https%3A//www2.gov.scot/Styles/css/S154-140-159.css&closest=20190829224752&sort=closest&limit=10:

scot,gov)/styles/css/s154-140-159.css 20190829224752 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 1005125301 /heritrix/output/warcs/weekly/20190829111613/BL-20190829223006394-00017-62~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190901161913 https://www2.gov.scot/Styles/css/S154-140-159.css text/css 200 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 249525127 /heritrix/output/warcs/monthly/20190901130333/BL-20190901161643432-00077-62~ukwa-h3-pulse-monthly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190906060549 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 64732233 /heritrix/output/warcs/weekly/20190905104159/BL-20190906060329783-00019-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190909171446 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 628669033 /heritrix/output/warcs/weekly/20190909080236/BL-20190909164421343-00009-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190813052856 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 763228040 /heritrix/output/warcs/weekly/20190812080202/BL-20190813051304533-00018-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190916185754 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 486531076 /heritrix/output/warcs/weekly/20190916080226/BL-20190916183129444-00013-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190923192758 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 262992500 /heritrix/output/warcs/weekly/20190923080231/BL-20190923191018716-00014-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190801162036 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 840454981 /heritrix/output/warcs/monthly/20190801130919/BL-20190801160625325-00025-63~ukwa-h3-pulse-monthly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20190730025401 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 1070016 /heritrix/output/warcs/weekly/20190729080238/BL-20190730025357934-00018-63~ukwa-h3-pulse-weekly~8443.warc.gz
scot,gov)/styles/css/s154-140-159.css 20191002210658 https://www2.gov.scot/Styles/css/S154-140-159.css warc/revisit 0 HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ - - 0 26314008 /heritrix/output/warcs/weekly/20191002152237/BL-20191002210524398-00019-63~ukwa-h3-pulse-weekly~8443.warc.gz

So, it seems to pull the first result (the warc/revisit) from the first CDX call, and then it does pull the actual all-good text/css non-revisit record. It's not clear why it calls the CDX twice, and why it uses 20190829224752 as the target timestamp (which appears to be the timestamp of the fourth result from the first CDX call?!

The WARC record itself looks fine:

# curl "http://hdfsapi/webhdfs/v1/heritrix/output/warcs/monthly/20190901130333/BL-20190901161643432-00077-62~ukwa-h3-pulse-monthly~8443.warc.gz?op=OPEN&offset=249525127&user.name=access" |gunzip | head -20
WARC-Type: response
WARC-Target-URI: https://www2.gov.scot/Styles/css/S154-140-159.css
WARC-Date: 2019-09-01T16:19:13Z
WARC-IP-Address: 134.19.161.249
WARC-Payload-Digest: sha1:HXYLQDAFKQIJ6URMQI5MSSHOKGX7MYRJ
WARC-Record-ID: <urn:uuid:b2c9722b-7dce-4bb1-9ad9-072eba29d33d>
Content-Type: application/http; msgtype=response
Content-Length: 21825

HTTP/1.1 200 OK
Content-Type: text/css
Last-Modified: Fri, 18 Dec 2015 19:27:29 GMT
...

@anjackson
Copy link
Contributor Author

Okay, as a workaround, I've reconfigured the index to drop revisits completely and things are much happer.

        index:
            type: cdx
            api_url: 'http://cdx.api.wa.bl.uk/data-heritrix?url={url}&closest={closest}&sort=closest&filter=!mimetype:warc/revisit'
            replay_url: https://www.webarchive.org.uk/wayback/archive/{timestamp}id_/{url}

ikreymer added a commit to webrecorder/pywb that referenced this issue Feb 21, 2020
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
ikreymer added a commit to webrecorder/pywb that referenced this issue Apr 30, 2020
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7
@ikreymer ikreymer mentioned this issue Apr 30, 2020
8 tasks
ikreymer added a commit to webrecorder/pywb that referenced this issue May 1, 2020
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7

* ci: attempt to fix travis build for 27, 35
@anjackson
Copy link
Contributor Author

Great! -- Having (temporarily?) rolled 2.4.0-beta out as BETA Wayback, the links that hung before now work fine!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants