Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The website with cookie banner doesn't work #3

Open
jk-gan opened this issue Mar 11, 2023 · 4 comments
Open

The website with cookie banner doesn't work #3

jk-gan opened this issue Mar 11, 2023 · 4 comments

Comments

@jk-gan
Copy link

jk-gan commented Mar 11, 2023

The readable only able to capture the cookie banner text from this website. You can check at here. The website looks fine in the Firefox Reader View though.

@mre
Copy link

mre commented Mar 11, 2023

Nice catch. I don't know if it will be an easy fix, though.
I'll move this to our extraction crate repo.
@jfmontanaro fyi

@mre mre transferred this issue from readable-app/readable Mar 11, 2023
@jfmontanaro
Copy link

Hi @jk-gan, thanks for bringing this up!

I think the underlying issue here is that the main article text is too short to fit the readability algorithm's idea of what should count as "main text". Since the text of the cookie policy is longer, that's what gets chosen instead. With a longer article, e.g. this one, the issue goes away.

One obvious step to take is to add cookie to the UNLIKELY_CANDIDATES regex, as anything with cookie in the id or class attribute is unlikely to be main article text. This does work to strip the cookie policy out of the result, but unfortunately in the case in question it just comes back blank. That is, it still isn't finding the article text.

As you note, Firefox's Reader Mode does include the article text in its result. Since readability.rs and Firefox Reader Mode are descendants of the same algorithm originally, it might be possible to tweak the criteria such that it works properly in this case. I do notice, however, that Firefox also includes the cookie policy in its Reader Mode view, which I would consider undesirable, so I don't think we want to copy Firefox exactly here.

I'll look into it, but this is a part of the algorithm that I haven't really messed with before so I can't say how simple or complex it might turn out to be.

@jfmontanaro
Copy link

Ok, I've been digging into this and I think I've discovered at least part of the root cause. It looks like the algorithm is correctly identifying the "main content" section, but is lumping the body text together with the "Other News" links, which results in it getting removed.

That is, the readability algorithm removes any elements that a) have a "class score" less than 25 and b) have a "link density" (number of text characters that are in links compared to total number of characters) greater than 0.2. Because, again, the text here is so short, the link density of the containing div ends up being about 0.45, which means it gets removed during the scoring step.

Ideally it would identify from the start that the "Other News" links are not part of the main text, and so it wouldn't pick an element that contains them as its basis. Unfortunately the <main id="main-content"> element scores higher than the <section> element, probably because of the id attribute, so it gets picked.

I'm not really sure of the best way to approach solving this. The only change I've tried that actually works is to get rid of the class_score < 25 && link_density > 0.2 condition for dumping an element, and just fall back on the link_density > 0.5 condition that's evaluated for everything. I'm worried this would reduce quality overall, though.

If anyone has alternative suggestions, I'm all ears.

@mre
Copy link

mre commented Mar 13, 2023

Perhaps instead of using 0.2 as a hardcoded cutoff for the link intensity it could use the link intensity of all elements for comparison. So the algorithm would become: calculate all link intensities of all elements and remove the elements with link intensity bigger than 80% of all other elements.
Not sure if that makes sense, but maybe something you could play around with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants