Parts of article block are sometimes not being extracted #622

Open
naktinis opened this issue Jun 17, 2024 · 1 comment
Labels
feedback Feedback from users requested

Comments

@naktinis
Contributor

I first noticed this when extracting text from the 1Password documentation: code blocks were not being extracted.

I then reduced it to a minimal reproducible example.

Version tested: 1.10.0.

HTML:

<!DOCTYPE html>
<html>
  <body>
    <article>
      <p>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
        tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
        veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
        commodo consequat. Duis aute irure dolor in reprehenderit in voluptate
        velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
        cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id
        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
        tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
        veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea.
      </p>

      <div>
        This is a very important line that is part of the article.
      </div>

      <p>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
        tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
        veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
        commodo consequat. Duis aute irure dolor in reprehenderit in voluptate
        velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
        cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id.
        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.
      </p>
    </article>
  </body>
</html>

Extract call:

from trafilatura import extract
extract(h, include_formatting=True, favor_recall=True)  # h holds the HTML above as a string

Output (does not include "This is a very important..."):

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea.\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.

Is there a reason why the text within a <div> block is being ignored, and is there any way to change this behavior? Ideally, it wouldn't even be necessary to favor recall.
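For longer pages it can help to check this kind of omission programmatically rather than by eye. The following is a minimal sketch using only the standard library; `BlockTextCollector`, `missing_blocks`, and the `BLOCK_TAGS` set are hypothetical names chosen for illustration and are not part of the library under discussion. It collects the text of each block element in the source HTML and reports the blocks whose text does not appear in the extracted output.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as "blocks" for the purpose of the check.
BLOCK_TAGS = {"p", "div", "pre", "blockquote", "li"}


class BlockTextCollector(HTMLParser):
    """Collect the whitespace-normalized text of each block element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open blocks as (tag, [text chunks])
        self.blocks = []  # finished (tag, text) pairs, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append((tag, []))

    def handle_data(self, data):
        # Text is attributed to the innermost open block, if any.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            _, chunks = self.stack.pop()
            text = " ".join("".join(chunks).split())
            if text:
                self.blocks.append((tag, text))


def missing_blocks(html, extracted):
    """Return (tag, text) for every block whose text is absent from `extracted`."""
    collector = BlockTextCollector()
    collector.feed(html)
    flat = " ".join(extracted.split())
    return [(tag, text) for tag, text in collector.blocks if text not in flat]


sample = "<article><p>First paragraph.</p><div>Important line.</div></article>"
print(missing_blocks(sample, "First paragraph."))
```

With the HTML from the report as `html` and the extractor's result as `extracted`, the `<div>` line would be expected to show up in the returned list. Note that formatting marks added by `include_formatting=True` could cause false positives on pages with inline markup, since the comparison is a plain substring match.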

adbar added the feedback label (Feedback from users requested) on Jun 18, 2024
@adbar
Owner

adbar commented Jun 18, 2024

@naktinis Yes, there is a reason: text within a div (and nothing else) is generally undesirable. It is always a tradeoff between precision and recall.

The easiest way I see is to add "div" manually to settings.TAG_CATALOG and re-install the package locally; the change should then be propagated to the extractors. Does that solve your problem?
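If patching and re-installing the package is not practical, one possible alternative is to pre-process the HTML so that text-only `<div>` elements become `<p>` elements before extraction. This is not the approach suggested above, and it has not been verified against the extractor; it is a minimal standard-library sketch, with `promote_text_divs` and `TEXT_ONLY_DIV` as hypothetical names. It uses a regular expression, so it only handles the narrow case of a `<div>` that contains text and no child tags.

```python
import re

# Matches a <div> whose content has no "<" at all (so no child elements)
# and at least one non-whitespace character.
TEXT_ONLY_DIV = re.compile(
    r"<div(\s[^>]*)?>([^<]*\S[^<]*)</div\s*>",
    re.IGNORECASE,
)


def promote_text_divs(html):
    """Rewrite text-only <div> elements as <p>, keeping their attributes."""
    def to_paragraph(match):
        attrs = match.group(1) or ""
        return f"<p{attrs}>{match.group(2)}</p>"

    return TEXT_ONLY_DIV.sub(to_paragraph, html)


html = "<article><div>Important line.</div><div><span>x</span></div></article>"
print(promote_text_divs(html))
```

The rewritten string could then be passed to the extract call in place of the original HTML. Divs that wrap other elements are left untouched, which keeps the change limited to the case described in this issue.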
