Faulty extraction for very short documents #660

Psynbiotik · 2024-07-26T09:25:35Z

This example shows that data is duplicated and words are squished together even though they are distinct in the HTML.

    from trafilatura import extract

    html_string = """<!DOCTYPE html>
    <html lang="en-us">
    <body>
    <main>
        <section>
            <p>First</p>
            This gets Squished
            <div>
                <h4>There should be a space</h4>
                <p>Another sentence</p>
                This also gets Squished
            </div>
            <div>
                <h4>Where is the space</h4>
                <p>This sentence has to be long enough.</p>
            </div>
        </section>
    </main>
    </body>
    </html>
    """

    print(extract(html_string))

This results in the following output:
'First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough.'

You can see that "First" appears twice even though it occurs only once in the HTML, and the same goes for some other sentences. Several words also get squished together, with the space between them removed.
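To make the duplication easier to see, a quick check can count how often each sentence from the HTML occurs in the reported output. This is only a minimal sketch using the standard library: the helper name `count_occurrences` is made up for illustration, and the strings are copied from the example above.

```python
# Sentences as they appear in the HTML of the report.
source_sentences = [
    "First",
    "This gets Squished",
    "There should be a space",
    "Another sentence",
    "This also gets Squished",
    "Where is the space",
    "This sentence has to be long enough.",
]

# Output of extract() as reported above.
reported_output = """First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough."""


def count_occurrences(sentences, text):
    """Map each sentence to its number of non-overlapping occurrences in text."""
    return {sentence: text.count(sentence) for sentence in sentences}


counts = count_occurrences(source_sentences, reported_output)

# Sentences that occur more often in the output than in the HTML (once each).
duplicated = [sentence for sentence, n in counts.items() if n > 1]

for sentence, n in counts.items():
    print(f"{n}x  {sentence}")
```

In the reported output, every sentence except the two `<h4>` headings shows up twice, so the repeated part seems to contain the paragraphs and the loose text but not the headings.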

adbar · 2024-07-26T12:05:51Z

Hi @Psynbiotik, thanks for the detailed bug report.

Trafilatura is geared towards real-world cases and synthetic examples do not always work well. The problem here is that the text is near the limit of what counts as potentially successful extraction (250 chars), so another algorithm is triggered to retrieve additional content, which causes the issue.

Lowering the length threshold in the settings or removing a character from your example already fixes the problem.

Real-world pages tend to be bloated rather than too short. The probability of finding a page like this in the wild is extremely low, so this is not a major concern. Attempting to fix this particular case, however, decreases accuracy for reasons I do not exactly understand.

It is still a problem though, we can leave the thread open until it is solved.

Psynbiotik · 2024-07-26T12:08:33Z

This is actually a reduced version of a real webpage. However, in the real webpage only the squishing of words together occurs, not the doubling issue.

adbar · 2024-07-26T12:58:01Z

Then it would be interesting to isolate the problem so that I can reproduce it. In your example, both problems are linked to one another.

Psynbiotik · 2024-07-26T13:53:31Z

Here is an example that only includes the word squishing issue, where whitespace between words is sometimes removed:

    def test_white_space_issue():
        from trafilatura import extract

        html_string = """<!DOCTYPE html>
        <html lang="en-us">
        <body>
        <main>
            <section>
                <p>First</p>
                This gets Squished
                <div>
                    <h4>There should be a space</h4>
                    <p>Another sentence</p>
                    This also gets Squished
                </div>
                <div>
                    <h4>Where is the space</h4>
                    <p>This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first. This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first </p>
                </div>
            </section>
        </main>
        </body>
        </html>
        """
        result = extract(html_string)
        assert "SquishedThere" not in result
        assert "SquishedWhere" not in result

adbar changed the title ~~Data is duplicated and spaces removed between words, simple example provided~~ Faulty extraction for very short documents Jul 26, 2024

adbar added the enhancement New feature or request label Jul 26, 2024

adbar mentioned this issue Jul 26, 2024

Investigate spacing in element tails #661

Open

tylerjthomas9 mentioned this issue Aug 8, 2024

Fix test_basic_article_trafilatura test failure huggingface/datatrove#264

Merged
