-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The website with cookie banner doesn't work #3
Comments
Nice catch. I don't know if it will be an easy fix, though. |
Hi @jk-gan, thanks for bringing this up! I think the underlying issue here is that the main article text is too short to fit the readability algorithm's idea of what should count as "main text". Since the text of the cookie policy is longer, that's what gets chosen instead. With a longer article, e.g. this one, the issue goes away. One obvious step to take is to add As you note, Firefox's Reader Mode does include the article text in its result. Since readability.rs and Firefox Reader Mode are descendants of the same algorithm originally, it might be possible to tweak the criteria such that it works properly in this case. I do notice, however, that Firefox also includes the cookie policy in its Reader Mode view, which I would consider undesirable, so I don't think we want to copy Firefox exactly here. I'll look into it, but this is a part of the algorithm that I haven't really messed with before so I can't say how simple or complex it might turn out to be. |
Ok, I've been digging into this and I think I've discovered at least part of the root cause. It looks like the algorithm is correctly identifying the "main content" section, but is lumping the body text together with the "Other News" links, which results in it getting removed. That is, the readability algorithm removes any elements that a) have a "class score" less than 25 and b) have a "link density" (number of text characters that are in links compared to total number of characters) greater than 0.2. Because, again, the text here is so short, the link density of the containing Ideally it would identify from the start that the "Other News" links are not part of the main text, and so it wouldn't pick an element that contains them as its basis. Unfortunately the I'm not really sure of the best way to approach solving this. The only change I've tried that actually works is to get rid of the If anyone has alternative suggestions, I'm all ears. |
Perhaps instead of using 0.2 as a hardcoded cutoff for the link intensity it could use the link intensity of all elements for comparison. So the algorithm would become: calculate all link intensities of all elements and remove the elements with link intensity bigger than 80% of all other elements. |
The readable only able to capture the cookie banner text from this website. You can check at here. The website looks fine in the Firefox Reader View though.
The text was updated successfully, but these errors were encountered: