Faulty extraction for very short documents #660

Open
Psynbiotik opened this issue Jul 26, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@Psynbiotik

This example shows that extracted text is duplicated and that words are squished together even though they are separate in the HTML.

python:

from trafilatura import extract

html_string = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
    <section>
        <p>First</p>
        This gets Squished
        <div>
            <h4>There should be a space</h4>
            <p>Another sentence</p>
            This also gets Squished
        </div>
        <div>
            <h4>Where is the space</h4>
            <p>This sentence has to be long enough.</p>
        </div>
    </section>
</main>
</body>
</html>
"""

print(extract(html_string))

This produces the following output:
'First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough.'

Note that "First" appears twice even though it occurs only once in the HTML, and the same happens with some other sentences. In addition, several words are squished together, with the whitespace between them removed.
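The duplication can be checked mechanically rather than by eye. The following is a minimal sketch using only the standard library; `repeated_lines` is a hypothetical helper, not part of Trafilatura. It compares whole lines, so it reports "First" but not sentences that recur only as part of a longer, squished line.

```python
from collections import Counter


def repeated_lines(text):
    """Return the non-empty lines that occur more than once, in first-seen order."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    counts = Counter(lines)
    repeated = []
    for line in lines:
        if counts[line] > 1 and line not in repeated:
            repeated.append(line)
    return repeated


# Output copied from the report above.
output = """First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough."""

print(repeated_lines(output))  # ['First']
```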

adbar changed the title from "Data is duplicated and spaces removed between words, simple example provided" to "Faulty extraction for very short documents" on Jul 26, 2024
adbar added the "enhancement" (New feature or request) label on Jul 26, 2024
@adbar
Owner

adbar commented Jul 26, 2024

Hi @Psynbiotik, thanks for the detailed bug report.

Trafilatura is geared towards real-world cases and synthetic examples do not always work well. The problem here is that the text is near the limit of what counts as potentially successful extraction (250 chars), so another algorithm is triggered to retrieve additional content, which causes the issue.

Lowering the length threshold in the settings or removing a character from your example already fixes the problem.
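For reference, lowering the threshold might look like the sketch below. It assumes that the relevant option is `MIN_EXTRACTED_SIZE` in the `DEFAULT` section of Trafilatura's settings (250 by default) and that `extract` accepts a `config` argument; the helper name `extract_with_threshold` and the value 100 are purely illustrative, and option names may differ between versions.

```python
from copy import deepcopy


def extract_with_threshold(html, min_extracted_size=100):
    """Run extract() with a lower minimum-size threshold (illustrative sketch)."""
    # Imported here so the helper can be defined even where trafilatura
    # is not installed; calling it does require the library.
    from trafilatura import extract
    from trafilatura.settings import DEFAULT_CONFIG

    # Work on a copy so the library-wide defaults stay untouched.
    config = deepcopy(DEFAULT_CONFIG)
    # Assumption: MIN_EXTRACTED_SIZE is the 250-character limit mentioned above.
    # Config values are stored as strings.
    config["DEFAULT"]["MIN_EXTRACTED_SIZE"] = str(min_extracted_size)
    return extract(html, config=config)
```

Calling `extract_with_threshold(html_string)` on the example from the report is one way to check whether the duplicated lines disappear once the threshold sits below the length of the extracted text.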

Real-world pages tend to be bloated rather than too short, and the probability of finding a page like this in the wild is extremely low, so this is not a concern. However, attempting to fix this particular case decreases accuracy, for reasons I do not fully understand.

It is still a problem, though, so we can leave the thread open until it is solved.

@Psynbiotik
Author

This is actually a reduced version of a real webpage. However, in the real webpage only the squishing of words together occurs, not the doubling issue.

@adbar
Owner

adbar commented Jul 26, 2024

Then it would be interesting to isolate the problem so that I can reproduce it. In your example, the two problems are linked to one another.

@Psynbiotik
Author

Psynbiotik commented Jul 26, 2024

Here is an example that only includes the word squishing issue, where whitespace between words is sometimes removed:

    def test_white_space_issue():
        from trafilatura import extract

        html_string = """<!DOCTYPE html>
        <html lang="en-us">
        <body>
        <main>
            <section>
                <p>First</p>
                This gets Squished
                <div>
                    <h4>There should be a space</h4>
                    <p>Another sentence</p>
                    This also gets Squished
                </div>
                <div>
                    <h4>Where is the space</h4>
                    <p>This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first. This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first </p>
                </div>
            </section>
        </main>
        </body>
        </html>
        """
        result = extract(html_string)
        assert "SquishedThere" not in result
        assert "SquishedWhere" not in result
