Improper parsing of text on Reuters - two paragraphs in the result instead of one #931

Open
ivanlabsii opened this issue Dec 22, 2024 · 0 comments

@ivanlabsii

This is the sample page:
https://www.reuters.com/legal/qualcomm-saw-nuvia-buy-chance-save-14-billion-year-arm-fees-ceo-tells-jury-2024-12-18/

It has paragraphs like this:
<div data-testid="paragraph-0" class="text__text__1FZLe text__dark-grey__3Ml43 text__regular__2N1Xr text__small__1kGq2 body__full_width__ekUdw body__small_body__2vQyf article-body__paragraph__2-BtD">WILMINGTON, Delaware, Dec 18 (Reuters) - Internal Qualcomm <a data-testid="Link" target="_blank" href="https://www.reuters.com/markets/companies/QCOM.O" referrerpolicy="no-referrer-when-downgrade" rel="noopener" class="text__text__1FZLe text__inherit-color__3208F text__inherit-font__1Y8w3 text__inherit-size__1DZJi link__link__3Ji6W link__underline_default__2prE_ link__with-icon__3x3oD">(QCOM.O)<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 32 32" aria-hidden="true" focusable="false" role="presentation" data-testid="NewTabSymbol" class="link__new-tab-symbol__3T19s"><path fill="#666" d="M15 6c.6 0 1 .4 1 1s-.4 1-1 1H9c-.6 0-1 .4-1 1v14a1.1 1.1 0 0 0 .2.6c.1.2.4.4.8.4h14c.6 0 1-.4 1-1v-6c0-.6.4-1 1-1s1 .4 1 1v6a3 3 0 0 1-3 3H9a3 3 0 0 1-2.2-.9A3 3 0 0 1 6 23V9a3 3 0 0 1 3-3Zm10 0 .4.1.5.5.1.4v6c0 .6-.4 1-1 1s-1-.4-1-1V9.4l-5.3 5.3a1 1 0 0 1-.7.3 1 1 0 0 1-.7-.3 1 1 0 0 1 0-1.4L22.6 8H19c-.6 0-1-.4-1-1s.4-1 1-1Z"></path></svg><span style="border: 0px; clip: rect(0px, 0px, 0px, 0px); clip-path: inset(50%); height: 1px; margin: -1px; overflow: hidden; padding: 0px; position: absolute; width: 1px; white-space: nowrap;">, opens new tab</span></a> documents showed the chip firm estimated it could eventually save as much as $1.4 billion a year on payments to Arm <a data-testid="Link" target="_blank" href="https://www.reuters.com/markets/companies/O9Ty.F" referrerpolicy="no-referrer-when-downgrade" rel="noopener" class="text__text__1FZLe text__inherit-color__3208F text__inherit-font__1Y8w3 text__inherit-size__1DZJi link__link__3Ji6W link__underline_default__2prE_ link__with-icon__3x3oD">(O9Ty.F)<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 32 32" aria-hidden="true" focusable="false" role="presentation" data-testid="NewTabSymbol" 
class="link__new-tab-symbol__3T19s"><path fill="#666" d="M15 6c.6 0 1 .4 1 1s-.4 1-1 1H9c-.6 0-1 .4-1 1v14a1.1 1.1 0 0 0 .2.6c.1.2.4.4.8.4h14c.6 0 1-.4 1-1v-6c0-.6.4-1 1-1s1 .4 1 1v6a3 3 0 0 1-3 3H9a3 3 0 0 1-2.2-.9A3 3 0 0 1 6 23V9a3 3 0 0 1 3-3Zm10 0 .4.1.5.5.1.4v6c0 .6-.4 1-1 1s-1-.4-1-1V9.4l-5.3 5.3a1 1 0 0 1-.7.3 1 1 0 0 1-.7-.3 1 1 0 0 1 0-1.4L22.6 8H19c-.6 0-1-.4-1-1s.4-1 1-1Z"></path></svg><span style="border: 0px; clip: rect(0px, 0px, 0px, 0px); clip-path: inset(50%); height: 1px; margin: -1px; overflow: hidden; padding: 0px; position: absolute; width: 1px; white-space: nowrap;">, opens new tab</span></a>, by purchasing a little-known startup in 2021, according to evidence shown at a trial on Wednesday.</div>

This paragraph produces the following output:
<p>WILMINGTON, Delaware, Dec 18 (Reuters) - Internal Qualcomm </p><a data-testid=\"Link\" target=\"_blank\" href=\"https://www.reuters.com/markets/companies/QCOM.O\" referrerpolicy=\"no-referrer-when-downgrade\" rel=\"noopener\">(QCOM.O)<span>, opens new tab</span></a><p> documents showed the chip firm estimated it could eventually save as much as $1.4 billion a year on payments to Arm </p><a data-testid=\"Link\" target=\"_blank\" href=\"https://www.reuters.com/markets/companies/O9Ty.F\" referrerpolicy=\"no-referrer-when-downgrade\" rel=\"noopener\">(O9Ty.F)<span>, opens new tab</span></a><p>, by purchasing a little-known startup in 2021, according to evidence shown at a trial on Wednesday.</p>

As such, the output splits the text into multiple <p> elements, with the links left outside them, even though the page contains just one paragraph.
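
A minimal sketch of how the split can be checked, in case it helps with reproducing. It is not part of the library; the helper name `top_level_tags` is hypothetical, and the two fragments are shortened stand-ins for the source and output above (attributes and most of the text dropped). It only lists which elements sit at the top level of a fragment, using Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class TopLevelTags(HTMLParser):
    """Collect the names of elements that sit at the top level of a fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.top_level = []

    def handle_starttag(self, tag, attrs):
        # Only record tags that are not nested inside another element.
        if self.depth == 0:
            self.top_level.append(tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def top_level_tags(fragment):
    """Hypothetical helper: return the top-level tag names of an HTML fragment."""
    parser = TopLevelTags()
    parser.feed(fragment)
    parser.close()
    return parser.top_level


# Shortened stand-in for the source paragraph (assumption: attributes trimmed).
source = (
    '<div data-testid="paragraph-0">Internal Qualcomm '
    '<a href="https://www.reuters.com/markets/companies/QCOM.O">(QCOM.O)'
    '<span>, opens new tab</span></a> documents showed the chip firm '
    '<a href="https://www.reuters.com/markets/companies/O9Ty.F">(O9Ty.F)'
    '<span>, opens new tab</span></a>, by purchasing a startup.</div>'
)

# Shortened stand-in for the parser output (same structure as reported above).
output = (
    '<p>Internal Qualcomm </p>'
    '<a href="https://www.reuters.com/markets/companies/QCOM.O">(QCOM.O)'
    '<span>, opens new tab</span></a>'
    '<p> documents showed the chip firm </p>'
    '<a href="https://www.reuters.com/markets/companies/O9Ty.F">(O9Ty.F)'
    '<span>, opens new tab</span></a>'
    '<p>, by purchasing a startup.</p>'
)

print(top_level_tags(source))
print(top_level_tags(output))
```

In the source the links are nested inside a single block element, while in the output they appear as siblings of the <p> elements, which is what breaks the paragraph apart.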
