Inconsistent playback in Archive Web.Page and pywb #88

peterchanws · 2022-07-06T16:38:12Z

I archived https://www.instagram.com/clubcardinal/ using Archive Web.Page and can see the archived posts while logged out of Instagram.
When I loaded the file in pywb, I was asked to log in to see the posts.

https://was-pywb-qa.stanford.edu/was/20210909173940/https://www.instagram.com/clubcardinal/

wacz
https://argo-qa.stanford.edu/view/druid:bk846bw0404

Another example for the same issue
https://was-pywb-qa.stanford.edu/was/20210901172027/https://www.instagram.com/stanford_flip/

wacz
https://argo-qa.stanford.edu/view/druid:vg019xs5734

lwrubel · 2022-07-07T18:23:41Z

This needs some research into whether this is a capture / live leak / replay issue or something about our indexing and accessioning. This issue includes doing that investigation.

lwrubel · 2022-07-07T19:50:25Z

I was able to replicate this, using https://replayweb.page with the wacz to view the Club Cardinal posts.

lwrubel · 2022-07-11T14:05:29Z

Areas to investigate, in conversation with @edsu:

wombat.js version differences between replayweb.page and pywb
cdx differences between index in WACZ and how cdxj-indexer indexes the warcs

peterchanws · 2022-07-13T15:35:02Z

Moved this example from #98.

https://wayback.archive-it.org/8751/20220613070217/https://www.stanfordeugenics.com/
In the above archived site, I can click through "What is Eugenics?", "David Starr Jordan", "Other Stanford Eugenicists", and "More".

However, these links don't work in our pywb environment:
https://swap.stanford.edu/was/20220613070141/https://www.stanfordeugenics.com/

edsu · 2022-07-14T19:54:45Z

I'm bringing this comment over from #98 since it relates to differences in wacz/archivewebpage and pywb replay.

I downloaded the WACZ from https://argo-stage.stanford.edu/view/druid:cm752zx1126 and can confirm that the CDX file that comes in the WACZ is different from the CDX file that cdxj-indexer generates. I still haven't dug into the details to characterize what the differences are, but the line counts already disagree (the bundled CDX has 433 lines and the generated one has 427).
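One way to characterize the differences beyond line counts is to key both indexes by capture and diff them. The sketch below is only an illustration: it assumes both files are plain-text CDXJ (one `urlkey timestamp {json}` entry per line), and the function names are made up for this example rather than taken from cdxj-indexer or pywb.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields)."""
    urlkey, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(raw_json)


def index_by_capture(lines):
    """Map (urlkey, timestamp) -> JSON fields, skipping blank lines.

    Entries that share a urlkey and timestamp collapse into one,
    so this is a coarse comparison, not an exact line-by-line diff.
    """
    entries = {}
    for line in lines:
        if line.strip():
            urlkey, timestamp, fields = parse_cdxj_line(line)
            entries[(urlkey, timestamp)] = fields
    return entries


def diff_cdxj(bundled_lines, generated_lines):
    """Return captures only in the bundled index, only in the
    generated index, and present in both with differing fields."""
    bundled = index_by_capture(bundled_lines)
    generated = index_by_capture(generated_lines)
    only_bundled = sorted(set(bundled) - set(generated))
    only_generated = sorted(set(generated) - set(bundled))
    changed = sorted(
        key for key in set(bundled) & set(generated)
        if bundled[key] != generated[key]
    )
    return only_bundled, only_generated, changed
```

Each argument can be an open file, for example the index extracted from the WACZ and the cdxj-indexer output, since iterating a text file yields its lines.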

To further complicate matters, when I used the CDX supplied with the WACZ in a test pywb collection, it threw this error when attempting to load the content:

{'args': {'coll': 'test', 'type': 'replay', 'metadata': {}}, 'error': '{"message": "Internal Error: \'int\' object has no attribute \'startswith\'"}'}
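One plausible reading of this error, which the thread does not confirm, is that some JSON field in the bundled CDXJ is a number where the reader calls a string method on it. The sketch below reproduces that kind of failure and shows a possible workaround that rewrites integer values as strings. The `is_redirect` helper and the choice of the `status` field are hypothetical and are not taken from pywb's code.

```python
import json


def is_redirect(status):
    """Hypothetical consumer that assumes `status` is a string."""
    return status.startswith("3")


def stringify_int_fields(cdxj_line):
    """Return the CDXJ line with integer JSON values turned into strings.

    Booleans and other value types are left untouched.
    """
    urlkey, timestamp, raw_json = cdxj_line.rstrip("\n").split(" ", 2)
    fields = json.loads(raw_json)
    normalized = {
        key: str(value)
        if isinstance(value, int) and not isinstance(value, bool)
        else value
        for key, value in fields.items()
    }
    return f"{urlkey} {timestamp} {json.dumps(normalized)}"
```

Calling `is_redirect(301)` raises `AttributeError: 'int' object has no attribute 'startswith'`, which matches the message in the error above, while `is_redirect("301")` works after normalization.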

edsu · 2022-07-15T01:01:11Z

I created a small fix for pywb to allow it to use the CDXJ files that are supplied with the WACZ. I wanted to see if pywb could display the content better using the WACZ's CDXJ file instead of the one we are generating with cdxj-indexer.

Unfortunately the resulting view looks the same: the icon images along the top are missing, and when you click on an image post you get prompted to log in.

So the problem appears to be something other than the CDXJ index itself, and lies elsewhere in replay.

In chatting with @ikreymer today, he suggested that an inconsistency between pywb's and wabac.js's implementations of "fuzzy matching" might be the root cause of this discrepancy in playback. It could be useful to compare HTTP requests and responses to see if any differences provide a clue as to what might need adjusting in pywb.
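To make the idea concrete, here is a minimal sketch of what fuzzy matching does in general: when a requested URL has no exact capture, volatile query parameters are ignored and the remainder is compared. The parameter names in `IGNORED_PARAMS` and both function names are assumptions for illustration only; the actual rules in pywb and wabac.js are more involved and are not reproduced here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical cache-busting parameters; real rule sets differ per site.
IGNORED_PARAMS = frozenset({"_", "cb", "t"})


def fuzzy_key(url, ignored=IGNORED_PARAMS):
    """Canonicalize a URL by dropping ignored params and sorting the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def find_capture(requested_url, captured_urls, ignored=IGNORED_PARAMS):
    """Return an exact match if present, else the first fuzzy match, else None."""
    if requested_url in captured_urls:
        return requested_url
    wanted = fuzzy_key(requested_url, ignored)
    for captured in captured_urls:
        if fuzzy_key(captured, ignored) == wanted:
            return captured
    return None
```

If two replay systems ignore different parameter sets, the same request can resolve to a capture in one and to nothing in the other, which is the kind of divergence that comparing requests and responses could expose.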

Ilya did also mention that the long term plan for pywb v3 is to use the ReplayWebPage component itself for playback in pywb, rather than relying on parallel implementations of things like URL rewriting and fuzzy matching. This would bring them into better alignment in terms of playback, and should (theoretically) fix this particular playback problem since the WACZ file seems to display fine in ReplayWebPage.

peterchanws · 2022-07-15T01:09:47Z

Since we can still render Instagram accounts using ReplayWeb.page (https://searchworks.stanford.edu/view/yr018yr3132), I will wait until pywb can render Instagram properly before moving them to SDR.

edsu · 2022-07-18T17:10:09Z

@ikreymer noticed a truncated JavaScript resource during replay, and remembered that a previous version of ArchiveWeb.Page had a bug in how the WARC-Record's Content-Length is recorded. Ordinarily this is not a problem for ReplayWeb.Page and ArchiveWeb.page since they do not use the Content-Length during replay, but pywb does.
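The effect of a wrong Content-Length can be shown with a synthetic record. This sketch is not the parser of either tool: it builds a simplified, uncompressed WARC-like record (real records carry more headers and are usually gzip-compressed) and compares a reader that trusts the declared length with one that reads up to the record's trailing delimiter. All function names are made up for this example.

```python
def build_record(payload, declared_length):
    """Build a simplified WARC-like record with a chosen Content-Length."""
    headers = (
        "WARC/1.1\r\n"
        "WARC-Type: resource\r\n"
        f"Content-Length: {declared_length}\r\n"
        "\r\n"
    ).encode("ascii")
    # A record block is followed by two CRLFs.
    return headers + payload + b"\r\n\r\n"


def read_trusting_length(record):
    """Return the block as a reader that relies on Content-Length sees it."""
    head, _, rest = record.partition(b"\r\n\r\n")
    length = None
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    return rest[:length]


def read_to_record_end(record):
    """Return the block by ignoring Content-Length (single-record buffer only)."""
    _, _, rest = record.partition(b"\r\n\r\n")
    return rest[:-4] if rest.endswith(b"\r\n\r\n") else rest
```

With a declared length shorter than the payload, the first reader returns a truncated script while the second returns it whole, which is consistent with a resource that looks fine in one player and truncated in another.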

To test, I recreated a WACZ for https://www.instagram.com/clubcardinal/ using the latest ArchiveWeb.page, then performed a one-time registration for the WACZ:

https://argo-qa.stanford.edu/view/druid:dw163st3321

and you can see it playing back here:

https://was-pywb-qa.stanford.edu/was/20220718153218/https://www.instagram.com/clubcardinal/

The good news is that the images all appear to render now. But if you click on one of them you are still prompted to log in, instead of seeing the detail. I suspect that a fuzzy matching rule may be the root cause. So I think it's still worth investigating that.

edsu · 2022-07-18T18:47:22Z

The above issue with Wix menu items on https://www.stanfordeugenics.com/ not playing back correctly in pywb isn't really related to this Instagram playback issue. But I did get a chance to archive the site with ArchiveWeb.page and register the resulting WACZ. You can see that it plays back with the menus working here:

https://was-pywb-stage.stanford.edu/was/20220718182905/https://www.stanfordeugenics.com/

So the problem here is that the Archive-It crawl was incomplete.

peterchanws · 2022-07-25T19:56:34Z

Thanks, Edu, for pointing out the incomplete capture of https://www.stanfordeugenics.com/ in AIT and also for capturing the site using ArchiveWeb.page. I have replaced the old production capture (made with AIT) with the WACZ file (made with ArchiveWeb.page).

edsu · 2022-08-15T14:24:57Z

This morning I performed a new capture of https://www.instagram.com/clubcardinal/ with the ArchiveWeb.page extension v0.8.1, then loaded the WACZ into QA and created a seed for it.

Now you can click the images to get the details/comments! I think this was an improvement in the image srcset detection behavior.

peterchanws closed this as completed Jul 6, 2022

peterchanws reopened this Jul 6, 2022

lwrubel added the web archiving 2022 web archiving work cycle label Jul 7, 2022

edsu mentioned this issue Jul 12, 2022

Links works in Archive-IT not working in pywb #98

Closed

edsu added the replay label Jul 12, 2022

edsu self-assigned this Jul 13, 2022

edsu closed this as completed Aug 15, 2022
