I first noticed this when trying to extract text from the 1Password documentation, where code blocks were not being extracted.
I then reduced it to a minimal reproducible example.
Version tested: 1.10.0.
HTML:
<!DOCTYPE html><html><body><article><p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea.
</p><div>
This is a very important line that is part of the article.
</div><p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.
</p></article></body></html>
Output (does not include "This is a very important..."):
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea.\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod.
Is there a reason why the text within a <div> block is being ignored, and is there any way to change this behavior? Ideally, it would work without having to favor recall.
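For anyone wanting to check other pages for the same symptom, here is a rough sketch that lists the source text chunks missing from an extracted output. It uses only the Python standard library and does not call Trafilatura itself; `TextCollector`, `missing_chunks`, and the shortened sample strings are made up for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect whitespace-normalized text chunks, one per text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collapse line breaks and runs of spaces so chunks compare cleanly.
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def missing_chunks(html, extracted):
    """Return source text chunks that do not appear in the extracted output."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    flat = " ".join(extracted.split())
    return [chunk for chunk in collector.chunks if chunk not in flat]


# Shortened stand-ins for the HTML and output shown above (hypothetical data).
html = (
    "<html><body><article><p>First paragraph.</p>"
    "<div>Bare div text.</div>"
    "<p>Second paragraph.</p></article></body></html>"
)
extracted = "First paragraph.\n\nSecond paragraph."

print(missing_chunks(html, extracted))
```

Passing the real page source and the real extractor output in place of the sample strings should point at the dropped `<div>` line in the same way.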
@naktinis Yes, there is a reason: text that sits directly within a div (and in no other element) is generally undesirable. It is always a tradeoff between precision and recall.
The easiest way I see is to add "div" manually to settings.TAG_CATALOG and reinstall the package locally; the change should then propagate to the extractors. Does that solve your problem?
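To illustrate the tradeoff, here is a toy model of a tag catalog, written with the standard library only. It is not how Trafilatura is implemented: `TOY_TAG_CATALOG`, `ToyExtractor`, and `toy_extract` are hypothetical names, and the rule "keep text only when its direct parent tag is in the catalog" is a simplifying assumption.

```python
from html.parser import HTMLParser

# Hypothetical catalog for this toy model; not Trafilatura's actual TAG_CATALOG.
TOY_TAG_CATALOG = frozenset({"p", "pre", "code", "blockquote"})
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = frozenset({"br", "hr", "img", "input", "link", "meta"})


class ToyExtractor(HTMLParser):
    """Keep a text node only if its direct parent tag is in the catalog."""

    def __init__(self, catalog):
        super().__init__()
        self.catalog = catalog
        self.stack = []   # currently open tags, innermost last
        self.blocks = []  # text chunks that were kept

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack and self.stack[-1] in self.catalog:
            self.blocks.append(text)


def toy_extract(html, catalog=TOY_TAG_CATALOG):
    parser = ToyExtractor(catalog)
    parser.feed(html)
    parser.close()
    return parser.blocks


page = (
    "<html><body><div>Accept all cookies</div>"
    "<article><p>Intro.</p><div>Important line.</div><p>Outro.</p></article>"
    "</body></html>"
)

# Default catalog: the bare div text is dropped, wanted or not.
print(toy_extract(page))
# With "div" added: the important line is recovered, and so is the cookie notice.
print(toy_extract(page, TOY_TAG_CATALOG | {"div"}))
```

In this toy setup, adding "div" raises recall (the important line is kept) at the cost of precision (the cookie notice is kept too), which is the kind of tradeoff described above.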