
Sometimes, html tags remain on the string #627

Open
masylum opened this issue Jun 23, 2024 · 2 comments
Labels
bug (Something isn't working), feedback (Feedback from users requested)

Comments


masylum commented Jun 23, 2024

I'm using the following call:

from trafilatura import extract
output = extract(
  input,
  include_comments=False,
  include_tables=False,
  no_fallback=True,
)

INPUT:

<p>It&rsquo;s 2AM. You&rsquo;re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the <code>SplineReticulatorBlocked</code> alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.</p>

OUTPUT:

<p>It&rsquo;s 2AM. You&rsquo;re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the <code>SplineReticulatorBlocked</code> alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.</p>

For some other strings, the HTML is removed correctly:

INPUT:

<p>Much of my current job is maintaining and enhancing control planes for <a href="https://www.heroku.com/managed-data-services">Heroku&rsquo;s managed data services</a>. This post explores three patterns used to reduce operational burden and increase system safety and resiliency: <strong>state machines</strong> (and associated state-transition tables), <strong>transducers</strong> and <strong>re-entrant and idempotent</strong> operations.</p>

OUTPUT:

Much of my current job is maintaining and enhancing control planes for Heroku’s managed data services. This post explores three patterns used to reduce operational burden and increase system safety and resiliency: state machines (and associated state-transition tables), transducers and re-entrant and idempotent operations.
adbar added the bug label on Jun 24, 2024
Owner

adbar commented Jun 24, 2024

Thanks for the detailed description. It does seem to be a bug.

Owner

adbar commented Jul 17, 2024

@masylum I need more context to reproduce the bug. The following HTML is not enough: with it, the output contains no HTML code.

Could you please try to make the bug reproducible?

<html>
<body>
<article>
<p>It&rsquo;s 2AM. You&rsquo;re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the <code>SplineReticulatorBlocked</code> alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.</p>
</article>
</body>
</html>
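
To help narrow this down, here is a minimal sketch of a check for leftover markup in extracted text. It is not part of trafilatura: the names `leftover_markup` and `check_extraction` are hypothetical, and the regular expressions are a rough heuristic, not a full HTML parser. The extraction function is passed in as an argument, so the same check can be run against `extract` with different options.

```python
import re
from typing import Callable, List, Optional

# Heuristic: opening/closing tags such as <p>, </code>, <a href="...">.
TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")
# Heuristic: character references such as &rsquo; or &#8217;.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def leftover_markup(text: Optional[str]) -> List[str]:
    """Return the tags, then the character references, found in the text."""
    if not text:
        # Extraction functions may return None or an empty string.
        return []
    return TAG_RE.findall(text) + ENTITY_RE.findall(text)


def check_extraction(
    extract_fn: Callable[[str], Optional[str]], html: str
) -> List[str]:
    """Run a (hypothetical) extraction callable and list what markup remains."""
    return leftover_markup(extract_fn(html))


# Possible usage, assuming trafilatura is installed:
#
#   from trafilatura import extract
#   remaining = check_extraction(
#       lambda html: extract(html, include_comments=False,
#                            include_tables=False, no_fallback=True),
#       html_string,
#   )
#   print(remaining)  # an empty list means no tags or entities were detected
```

Running this on the exact string passed to `extract` (the full page or the fragment) and posting both that string and the resulting list would make the report easier to reproduce.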

adbar added the feedback label on Jul 17, 2024

2 participants