Inconsistent playback in Archive Web.Page and pywb #88

Closed
peterchanws opened this issue Jul 6, 2022 · 11 comments
Assignees
Labels
replay web archiving 2022 web archiving work cycle

Comments

@peterchanws
Collaborator

I archived https://www.instagram.com/clubcardinal/ using Archive Web.Page and could see the archived posts while logged out of Instagram.
When I loaded the file in pywb, I was asked to log in to see the posts.

https://was-pywb-qa.stanford.edu/was/20210909173940/https://www.instagram.com/clubcardinal/

wacz
https://argo-qa.stanford.edu/view/druid:bk846bw0404

Another example for the same issue
https://was-pywb-qa.stanford.edu/was/20210901172027/https://www.instagram.com/stanford_flip/

wacz
https://argo-qa.stanford.edu/view/druid:vg019xs5734

@peterchanws peterchanws reopened this Jul 6, 2022
@lwrubel lwrubel added the web archiving 2022 web archiving work cycle label Jul 7, 2022
@lwrubel
Contributor

lwrubel commented Jul 7, 2022

This needs some research into whether this is a capture / live leak / replay issue or something about our indexing and accessioning. This issue includes doing that investigation.

@lwrubel
Contributor

lwrubel commented Jul 7, 2022

I was able to replicate this, using https://replayweb.page with the wacz to view the Club Cardinal posts.

@lwrubel
Contributor

lwrubel commented Jul 11, 2022

Areas to investigate, in conversation with @edsu:

  • wombat.js version differences between replayweb.page and pywb
  • cdx differences between index in WACZ and how cdxj-indexer indexes the warcs
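A minimal sketch of how the second item could be checked, assuming both indexes are plain CDXJ text where each line is `urlkey timestamp {json}`. The function names `parse_cdxj` and `diff_cdxj` are hypothetical helpers for illustration and are not part of cdxj-indexer or pywb.

```python
import json


def parse_cdxj(text):
    """Parse CDXJ text into a dict keyed by (urlkey, timestamp).

    Assumes each data line is 'urlkey timestamp {json}'. If the same
    key/timestamp pair occurs twice, the later line wins.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and header lines (which start with '!').
        if not line or line.startswith("!"):
            continue
        urlkey, timestamp, raw_json = line.split(" ", 2)
        entries[(urlkey, timestamp)] = json.loads(raw_json)
    return entries


def diff_cdxj(bundled_text, generated_text):
    """Return (only_in_bundled, only_in_generated, changed_entries)."""
    bundled = parse_cdxj(bundled_text)
    generated = parse_cdxj(generated_text)
    only_bundled = sorted(set(bundled) - set(generated))
    only_generated = sorted(set(generated) - set(bundled))
    # Entries present in both indexes whose JSON fields differ.
    changed = {
        key: (bundled[key], generated[key])
        for key in set(bundled) & set(generated)
        if bundled[key] != generated[key]
    }
    return only_bundled, only_generated, changed
```

Running `diff_cdxj` on the index bundled in the WACZ and the one produced by cdxj-indexer would list captures that appear in only one index, plus captures whose fields (offsets, lengths, status, and so on) disagree.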

@peterchanws
Collaborator Author

Moved this example from #98.

https://wayback.archive-it.org/8751/20220613070217/https://www.stanfordeugenics.com/
In the above archived site, I can click through the menu items "What is Eugenics?", "David Starr Jordan", "Other Stanford Eugenicists", and "More".

However, these links don't work in our pywb environment:
https://swap.stanford.edu/was/20220613070141/https://www.stanfordeugenics.com/

@edsu
Contributor

edsu commented Jul 14, 2022

I'm bringing this comment over from #98 since it relates to differences in wacz/archivewebpage and pywb replay.


I downloaded the WACZ from https://argo-stage.stanford.edu/view/druid:cm752zx1126 and can confirm that the CDX file that comes in the WACZ is significantly different from the CDX file that cdxj-indexer generates. I still haven't dug into the details to characterize what the differences are, but it is clear from the number of lines in the file (bundled CDX has 433 lines and generated has 427).

To further complicate matters, when I used the CDX supplied with the WACZ in a test pywb collection, it threw this error when attempting to load the content:

{'args': {'coll': 'test', 'type': 'replay', 'metadata': {}}, 'error': '{"message": "Internal Error: \'int\' object has no attribute \'startswith\'"}'}

[Screenshot: Screen Shot 2022-07-12 at 9 21 42 PM]
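For illustration only, here is one way an error of this shape can arise in Python. It assumes, without confirmation from this thread, that a JSON field such as `status` is a number in one index and a string in the other. The functions `status_is_redirect` and `status_is_redirect_safe` are hypothetical and are not pywb code.

```python
import json

# Two CDXJ lines that differ only in the JSON type of "status" (assumed example data).
line_str = 'com,example)/ 20220101000000 {"status": "200"}'
line_int = 'com,example)/ 20220101000000 {"status": 200}'


def status_is_redirect(cdxj_line):
    """Assumes status is a string; raises AttributeError if it is an int."""
    fields = json.loads(cdxj_line.split(" ", 2)[2])
    return fields["status"].startswith("3")


def status_is_redirect_safe(cdxj_line):
    """Coerces status to a string first, so both JSON types are accepted."""
    fields = json.loads(cdxj_line.split(" ", 2)[2])
    return str(fields["status"]).startswith("3")
```

Calling `status_is_redirect(line_int)` raises `AttributeError: 'int' object has no attribute 'startswith'`, while `status_is_redirect_safe` handles both lines.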

@edsu
Contributor

edsu commented Jul 15, 2022

I created a small fix for pywb to allow it to use the CDXJ files that are supplied with the WACZ. I wanted to see if pywb could display the content better using the WACZ's CDXJ file instead of the one we are generating with cdxj-indexer.

Unfortunately the resulting view looks the same: missing icon images along the top, and when you click on an image post you get prompted to log in:

[Screenshot: Screen Shot 2022-07-14 at 8 42 18 PM]

So the problem appears to be something other than the CDXJ index itself, and lies elsewhere in replay.

In chatting with @ikreymer today, he suggested that an inconsistency between pywb's and wabac.js's implementations of "fuzzy matching" might be the root cause of this discrepancy in playback. It could be useful to compare HTTP requests and responses to see if any differences provide a clue as to what might need adjusting in pywb.
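As a rough illustration of the idea behind fuzzy matching (not pywb's or wabac.js's actual rules), the sketch below matches a requested URL to a captured one after dropping query parameters that are assumed to be volatile. The parameter names in `VOLATILE_PARAMS` and the functions `fuzzy_key` and `find_capture` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of cache-busting parameters to ignore when matching.
VOLATILE_PARAMS = {"_", "cb", "timestamp"}


def fuzzy_key(url, ignore=VOLATILE_PARAMS):
    """Normalize a URL: drop ignored params, sort the rest, drop the fragment."""
    parts = urlsplit(url)
    query = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignore
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )


def find_capture(requested_url, captured_urls):
    """Return an exact match if present, else the first fuzzy match, else None."""
    if requested_url in captured_urls:
        return requested_url
    wanted = fuzzy_key(requested_url)
    for captured in captured_urls:
        if fuzzy_key(captured) == wanted:
            return captured
    return None
```

If two replay systems ignore different parameters, the same request can resolve to a capture in one and miss in the other, which is the kind of divergence worth looking for when comparing requests and responses.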

Ilya did also mention that the long term plan for pywb v3 is to use the ReplayWebPage component itself for playback in pywb, rather than relying on parallel implementations of things like URL rewriting and fuzzy matching. This would bring them into better alignment in terms of playback, and should (theoretically) fix this particular playback problem since the WACZ file seems to display fine in ReplayWebPage.

@peterchanws
Collaborator Author

Since we can still render Instagram accounts using ReplayWeb.page (https://searchworks.stanford.edu/view/yr018yr3132), I will wait until pywb can render Instagram properly before moving them to SDR.

@edsu
Contributor

edsu commented Jul 18, 2022

@ikreymer noticed a truncated JavaScript resource during replay, and remembered that a previous version of ArchiveWeb.Page had a bug in how the WARC-Record's Content-Length is recorded. Ordinarily this is not a problem for ReplayWeb.Page and ArchiveWeb.page since they do not use the Content-Length during replay, but pywb does.
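A minimal sketch of how such a mismatch could be detected, assuming a single uncompressed WARC record held in memory that is terminated by the two CRLF pairs required after the record block. The function `check_content_length` is a hypothetical helper and does not use pywb or warcio.

```python
def check_content_length(record_bytes):
    """Return (declared, actual) block lengths for one uncompressed WARC record.

    Assumes the record is: version line and WARC headers, a blank line,
    the block, then a trailing CRLF CRLF.
    """
    head, sep, rest = record_bytes.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/block separator found")

    declared = None
    # Skip the first line (the WARC version) and scan the named headers.
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    if declared is None:
        raise ValueError("record has no Content-Length header")

    # Remove the record terminator before measuring the block.
    block = rest[:-4] if rest.endswith(b"\r\n\r\n") else rest
    return declared, len(block)
```

A replay tool that trusts a declared length smaller than the actual block would stop reading early, which is consistent with the truncated JavaScript resource described above.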

To test this, I recreated a WACZ for https://www.instagram.com/clubcardinal/ using the latest ArchiveWeb.page and performed a one-time registration for the WACZ:

https://argo-qa.stanford.edu/view/druid:dw163st3321

and you can see it playing back here:

https://was-pywb-qa.stanford.edu/was/20220718153218/https://www.instagram.com/clubcardinal/

The good news is that the images all appear to render now. But if you click on one of them you are still prompted to log in, instead of seeing the detail. I suspect that a fuzzy matching rule may be the root cause. So I think it's still worth investigating that.

@edsu
Contributor

edsu commented Jul 18, 2022

The above issue with Wix menu items on https://www.stanfordeugenics.com/ not playing back correctly in pywb isn't really related to this Instagram playback issue. But I did get a chance to archive the site with ArchiveWeb.page and register the resulting WACZ. You can see that it plays back with the menus working here:

https://was-pywb-stage.stanford.edu/was/20220718182905/https://www.stanfordeugenics.com/

So the problem here is that the Archive-It crawl was incomplete.

@peterchanws
Collaborator Author

Thanks, Ed, for pointing out the incomplete capture of https://www.stanfordeugenics.com/ in AIT and for capturing the site using ArchiveWeb.page. I have replaced the old production capture (made with AIT) with the WACZ file (made with ArchiveWeb.page).

@edsu
Contributor

edsu commented Aug 15, 2022

This morning I performed a new capture of https://www.instagram.com/clubcardinal/ with the ArchiveWeb.page extension v0.8.1, then loaded the WACZ into QA and created a seed for it:

Now you can click the images to get the details/comments! I think this was an improvement in the image srcset detection behavior.

[Screenshot: Screen Shot 2022-08-15 at 10 10 56 AM]

@edsu edsu closed this as completed Aug 15, 2022

3 participants