Inconsistent playback in Archive Web.Page and pywb #88
Comments
This needs research into whether the cause is a capture, live-leak, or replay problem, or something in our indexing and accessioning. That investigation is part of this issue.
I was able to replicate this, using https://replayweb.page with the WACZ to view the Club Cardinal posts.
Areas to investigate, in conversation with @edsu:
Moved this example from #98.
https://wayback.archive-it.org/8751/20220613070217/https://www.stanfordeugenics.com/
However, these links don't work in our pywb environment:
I'm bringing this comment over from #98 since it relates to differences between WACZ/ArchiveWeb.page replay and pywb replay.

I downloaded the WACZ from https://argo-stage.stanford.edu/view/druid:cm752zx1126 and can confirm that the CDX file that comes in the WACZ is significantly different from the CDX file that

To further complicate matters, when I used the CDX supplied with the WACZ in a test pywb collection, it threw this error when attempting to load the content:

{'args': {'coll': 'test', 'type': 'replay', 'metadata': {}}, 'error': '{"message": "Internal Error: \'int\' object has no attribute \'startswith\'"}'}
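To make "significantly different" concrete, here is a rough sketch of how the two indexes could be compared. It assumes both files are plain CDXJ with one record per line in the form `urlkey timestamp {json}`. The function names are made up for illustration and are not part of pywb or cdxj-indexer. The `non_string_fields` check rests on a guess that the `'int' object has no attribute 'startswith'` error comes from a numeric JSON value where a string was expected; the thread does not say which field is involved.

```python
import json
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (urlkey, timestamp)


def load_cdxj(lines: Iterable[str]) -> Dict[Key, dict]:
    """Parse CDXJ lines of the form 'urlkey timestamp {json}' into a dict."""
    index: Dict[Key, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first two spaces only; the JSON block may contain spaces.
        urlkey, timestamp, raw_json = line.split(" ", 2)
        index[(urlkey, timestamp)] = json.loads(raw_json)
    return index


def compare_indexes(a: Dict[Key, dict], b: Dict[Key, dict]) -> Tuple[List[Key], List[Key]]:
    """Return the records present only in a and the records present only in b."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    return only_a, only_b


def non_string_fields(index: Dict[Key, dict]) -> Dict[Key, dict]:
    """Report JSON fields whose values are not strings (assumed error source)."""
    report: Dict[Key, dict] = {}
    for key, fields in index.items():
        odd = {name: value for name, value in fields.items() if not isinstance(value, str)}
        if odd:
            report[key] = odd
    return report
```

Reading each file with `open(path)` and passing the file object to `load_cdxj` should be enough to list records that appear in only one index, and to see whether the WACZ-supplied index carries numeric values that the cdxj-indexer output stores as strings.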
I created a small fix for pywb to allow it to use the CDXJ files that are supplied with the WACZ. I wanted to see if pywb could display the content better using the WACZ's CDXJ file instead of the one we are generating with cdxj-indexer. Unfortunately the resulting view looks the same: the icon images along the top are missing, and when you click on an image post you are prompted to log in:

So the problem appears to be something other than the CDXJ index itself, and lies elsewhere in replay.

In chatting with @ikreymer today, he suggested that an inconsistency between pywb's and wabac.js's implementations of "fuzzy matching" might be the root cause of this discrepancy in playback. It could be useful to compare HTTP requests and responses to see if any differences provide a clue as to what might need adjusting in pywb.

Ilya also mentioned that the long-term plan for pywb v3 is to use the ReplayWebPage component itself for playback in pywb, rather than relying on parallel implementations of things like URL rewriting and fuzzy matching. This would bring the two into better alignment in terms of playback, and should (theoretically) fix this particular playback problem, since the WACZ file seems to display fine in ReplayWebPage.
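For anyone following along who hasn't run into fuzzy matching before, the general idea can be sketched roughly as below. This is only an illustration and is not how pywb or wabac.js actually implement it: the ignored parameter names (`_`, `cb`, `t`) and the function names are invented for the example. The point is that a replay-time request can differ from the captured URL only in a volatile query parameter, and two implementations with different normalization rules will disagree about whether the capture matches.

```python
from typing import Dict, FrozenSet, Iterable, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical cache-busting parameters; real rule sets are site specific.
IGNORED_PARAMS: FrozenSet[str] = frozenset({"_", "cb", "t"})


def fuzzy_key(url: str, ignored: FrozenSet[str] = IGNORED_PARAMS) -> str:
    """Normalize a URL: drop ignored params, sort the rest, lowercase the host."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    )
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))


def build_index(captured: Iterable[str]) -> Dict[str, str]:
    """Map each fuzzy key to the first captured URL that produced it."""
    index: Dict[str, str] = {}
    for url in captured:
        index.setdefault(fuzzy_key(url), url)
    return index


def lookup(requested: str, captured: Iterable[str]) -> Optional[str]:
    """Try an exact match first, then fall back to the fuzzy key."""
    captured = list(captured)
    if requested in captured:
        return requested
    return build_index(captured).get(fuzzy_key(requested))
```

Logging the request URLs that each replay system fails to resolve, and running them through something like `fuzzy_key` with each system's rules, might show which parameters one side ignores and the other does not.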
Since we can still render Instagram accounts using replay.page (https://searchworks.stanford.edu/view/yr018yr3132), I will wait until pywb can render Instagram properly before moving them to SDR.
@ikreymer noticed a truncated JavaScript resource during replay, and remembered that a previous version of ArchiveWeb.page had a bug in how the WARC-Record's

To test, I recreated a WACZ for https://www.instagram.com/clubcardinal/ using the latest ArchiveWeb.page.

I performed a one-time registration for the WACZ: https://argo-qa.stanford.edu/view/druid:dw163st3321 and you can see it playing back here:
https://was-pywb-qa.stanford.edu/was/20220718153218/https://www.instagram.com/clubcardinal/

The good news is that the images all appear to render now. But if you click on one of them you are still prompted to log in, instead of seeing the detail. I suspect that a fuzzy matching rule may be the root cause, so I think it's still worth investigating that.
The above issue with Wix menu items on https://www.stanfordeugenics.com/ not playing back correctly in pywb isn't really related to this Instagram playback issue. But I did get a chance to archive the site with ArchiveWeb.page and register the resulting WACZ. You can see that it plays back with the menus working here:
https://was-pywb-stage.stanford.edu/was/20220718182905/https://www.stanfordeugenics.com/
So the problem here is that the Archive-It crawl was incomplete.
Thanks, Edu, for pointing out the incomplete capture of https://www.stanfordeugenics.com/ in AIT and for also capturing the site using ArchiveWeb.page. I have replaced the old production capture (made with AIT) with the WACZ file (made with ArchiveWeb.page).
This morning I performed a new capture of https://www.instagram.com/clubcardinal/ with the ArchiveWeb.page extension v0.8.1, loaded the WACZ into QA, and created a seed for it:
Now you can click the images to get the details/comments! I think this was an improvement in the image |
I archived https://www.instagram.com/clubcardinal/ using ArchiveWeb.page and can see the archived posts while logged out of Instagram.
When I loaded the file in pywb, I was asked to log in to see the posts.
https://was-pywb-qa.stanford.edu/was/20210909173940/https://www.instagram.com/clubcardinal/
wacz
https://argo-qa.stanford.edu/view/druid:bk846bw0404
Another example for the same issue
https://was-pywb-qa.stanford.edu/was/20210901172027/https://www.instagram.com/stanford_flip/
wacz
https://argo-qa.stanford.edu/view/druid:vg019xs5734