Fuzzy Matching Improvements / POST requests #80
Comments
A quick update: with the latest commit, this version now works with the latest Vimeo videos. YouTube will still require POST request handling, as mentioned above. The simplest solution is to add that, since that is what was done in pywb and wabac.js, and ZIM replay requires a modified system as discussed above. This involves looking at the WARC request record: if it is a POST request and the content type is either JSON or form encoding, the POST data is added to the URL as a query. Then the special rule is applied to add a fuzzy matching redirect. |
Thanks for all the details. I'm trying to understand exactly what each option would imply in terms of changes and maintenance. Having prefix search in ZIM (in readers, actually; libzim does provide this feature) would surely save duplication and possible bugs, but it might mean changing every ZIM reader in an out-of-spec way… I will try to look at the code to understand the other option better. |
The immediate solution for YouTube is to add the POST request mapping. Thinking about it more, there is no way around that, even with prefix support. The latest conversion function now handles both form and JSON data. Without adding the POST data, we would end up with duplicate URLs, so we should probably implement something like that function in warc2zim. |
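A minimal sketch of what such a POST-to-query conversion could look like in Python. This is not the function referenced in this thread, nor the pywb or wabac.js implementation: the helper name post_to_query_url is hypothetical, and the __post_json_data parameter is borrowed from the example in this issue purely for illustration.

```python
import json
from urllib.parse import parse_qsl, urlencode


def post_to_query_url(url, method, content_type, body):
    """Return url with the POST data appended as query parameters.

    Hypothetical helper for illustration; the __post_json_data name
    follows the example in this issue, not any library's actual API.
    """
    if method.upper() != "POST":
        return url
    # Drop header parameters such as "; charset=utf-8".
    mime = (content_type or "").split(";")[0].strip().lower()
    text = body.decode("utf-8", errors="replace")
    if mime == "application/x-www-form-urlencoded":
        # Re-encode the form pairs so they are valid in a query string.
        extra = urlencode(parse_qsl(text, keep_blank_values=True))
    elif mime == "application/json":
        try:
            data = json.loads(text)
        except ValueError:
            return url  # Unparseable JSON: leave the URL untouched.
        # Sorted keys and compact separators give a stable string.
        stable = json.dumps(data, sort_keys=True, separators=(",", ":"))
        extra = urlencode({"__post_json_data": stable})
    else:
        return url  # Other content types are not mapped in this sketch.
    if not extra:
        return url
    if "?" not in url:
        sep = "?"
    elif url.endswith(("?", "&")):
        sep = ""
    else:
        sep = "&"
    return url + sep + extra
```

With this mapping, two POST requests to the same URL that differ only in their body end up as two distinct URLs, which is what avoids the duplicates.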
OK, thanks, it's a bit clearer now. I discussed this with @kelson42 and we confirm it's not possible to provide a prefix search API at the moment, as this is too big of a concept change for the format/reader. So we'll go with your other option. Let's discuss on Slack if/how we can split the workload and maybe refactor those pieces so that it's easier to maintain. We should anyway have a better understanding of the replayer parts at play. We've kinda neglected it since it was maintained in wabac.js. |
@ikreymer what's the status of the video-replay-fixes branch? Should we merge that in? |
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
Hi @ikreymer any update on this? |
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
This is being addressed by #83 |
warc2zim now has a set of fuzzy matching rules (https://github.com/openzim/warc2zim/blob/master/src/warc2zim/main.py#L75)
which are a subset of the larger ruleset in wabac.js (https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js#L8)
(pywb also has rules in python that are mostly aligned with the wabac.js rules https://github.com/webrecorder/pywb/blob/master/pywb/rules.yaml)
Having this many different rule sets is definitely a maintenance concern, so perhaps warc2zim should at least try to use the wabac.js rules, since wabac.js is used for replay. These rules could easily be exposed as a JSON file that is loaded similarly to sw.js.
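As a rough sketch of how a shared rule file could be consumed on the Python side: the JSON layout below (a list of "match"/"replace" pairs) is a hypothetical format, not what wabac.js currently exports, and it assumes the patterns are restricted to regex syntax that JavaScript and Python both accept.

```python
import json
import re

# Hypothetical rule file content; in practice this would be read from disk.
# The single rule drops a trailing numeric cache-busting "_" parameter.
RULES_JSON = r"""[{"match": "[?&]_=\\d+$", "replace": ""}]"""


def load_rules(text):
    """Compile each rule's pattern once, keeping its replacement string."""
    return [(re.compile(rule["match"]), rule["replace"])
            for rule in json.loads(text)]


def apply_rules(url, rules):
    """Return the URL rewritten by the first rule that matches it."""
    for pattern, replacement in rules:
        new_url, count = pattern.subn(replacement, url)
        if count:
            return new_url
    return url  # No rule matched: the URL is used as-is.
```

For example, apply_rules("https://example.com/?A=B&_=1234", load_rules(RULES_JSON)) yields "https://example.com/?A=B".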
Ideally, warc2zim would not need any rules and wabac.js could just read from the ZIM using its existing rules, but this is not possible for two reasons:
First, wabac.js normally relies on prefix search: if there is a capture of https://example.com/?A=B&_=1234 and there is a request for https://example.com/?A=B&_=1235, it can do a prefix search for https://example.com/? and find the best match. This approach allows for finding the best-matching URL from multiple possible captures.
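The prefix lookup described above can be sketched over a sorted list of captured URLs. This is only an illustration, not how wabac.js implements it; in particular, scoring candidates by the length of their common leading run with the requested URL is an assumption made here.

```python
from bisect import bisect_left
from os.path import commonprefix


def prefix_matches(sorted_urls, prefix):
    """Return every URL in the sorted list that starts with prefix."""
    # In a sorted list, all entries sharing a prefix are contiguous and
    # begin at the insertion point of the prefix itself.
    start = bisect_left(sorted_urls, prefix)
    matches = []
    for url in sorted_urls[start:]:
        if not url.startswith(prefix):
            break
        matches.append(url)
    return matches


def best_match(sorted_urls, requested, prefix):
    """Pick the candidate sharing the longest leading run with the request.

    The scoring rule is an assumption for illustration only.
    """
    candidates = prefix_matches(sorted_urls, prefix)
    if not candidates:
        return None
    return max(candidates, key=lambda u: len(commonprefix([u, requested])))
```

Because several captures can share the prefix, the lookup can choose among them, which is the flexibility the redirect-based scheme gives up.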
Since prefix querying is not possible when loading from ZIM, the alternative is a custom canonicalization option, which wabac.js also supports. We create a 'fake' redirect in the ZIM, e.g. https://example.fuzzy.replayweb.page/?A=B, which redirects to https://example.com/?A=B&_=1234. Then, when wabac.js encounters https://example.com/?A=B&_=1235, it also maps it to https://example.fuzzy.replayweb.page/?A=B, and so is able to do the lookup. This does work, but it is less flexible than the prefix search, as there is only one possible match.
Second, wabac.js can take the POST data, especially if it is form-encoded (query string) or JSON, and add it as part of the URL query. For example, let's say the URL is the same across requests and they can only be distinguished by the POST data, which contains {"videoid": "A"}. A combined URL after reading the request and response can then be https://example.com/?_=1234&__post_json_data={"videoid": "A"}, and the previous prefix search for https://example.com/? can find the best match.
For ZIMs, we'll need to do more work, though. The POST request must now also be parsed and a 'fake' redirect URL generated, probably something like https://example.fuzzy.replayweb.page/?__post_json_data={"videoid": "A"}.
This is doable and can be added, but I just wanted to raise awareness, as this means creating (and continuing to maintain) a slightly different fuzzy matching scheme for ZIMs than exists for WARCs in wabac.js. The only possible alternatives, it seems, would be to allow for:
https://example.com?
This issue is now coming up with YouTube, as YouTube is making POST requests to the same URL where the only difference is in the POST data (mentioned in webrecorder/browsertrix-crawler#4). The existing POST handling + prefix system means that replayweb.page is able to replay this new YouTube player in WARCs + WACZ, but not ZIMs.
Let me know if this makes sense, or I can elaborate further.