Faulty extraction for very short documents #660
Comments
Hi @Psynbiotik, thanks for the detailed bug report. Trafilatura is geared towards real-world cases, and synthetic examples do not always work well. The problem here is that the text is near the limit of what counts as a potentially successful extraction (250 characters), so another algorithm is triggered to retrieve additional content, which causes the issue. Lowering the length threshold in the settings or removing a character from your example already fixes the problem. Real-world pages tend to be bloated rather than too short, and the probability of finding a page like this in the wild is extremely low, so this is not a concern. However, attempting to fix this particular case decreases accuracy for reasons I do not fully understand. It is still a problem, though, so we can leave the thread open until it is solved.
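To make the threshold behaviour described above easier to picture, here is a minimal toy model. It is not Trafilatura's actual code: the function `choose_extraction`, the `min_size` parameter, and the rule "run the fallback when the text is shorter than the threshold" are all assumptions made for illustration. Only the value 250 comes from the comment above.

```python
# Toy model of a length-threshold fallback (hypothetical, not Trafilatura's code).
MIN_SIZE = 250  # threshold in characters, as mentioned in the comment above


def choose_extraction(primary_text, run_fallback, min_size=MIN_SIZE):
    """Return (text, used_fallback).

    Assumption: a result shorter than min_size is treated as a possibly
    failed extraction, so a second algorithm is asked for content instead.
    """
    if len(primary_text) >= min_size:
        return primary_text, False
    return run_fallback(), True


short_text = "x" * 240  # just under the default threshold

# With the default threshold, the fallback path is taken.
default_result = choose_extraction(short_text, lambda: "fallback output")

# With a lowered threshold, the primary result is accepted as-is.
lowered_result = choose_extraction(short_text, lambda: "fallback output", min_size=200)
```

In this sketch, lowering `min_size` keeps a short document on the primary path, which mirrors the workaround of lowering the length threshold in the settings.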
This is actually a reduced version of a real webpage. However, on the real webpage only the squishing of words occurs, not the duplication issue.
Then it would be interesting to isolate the problem so that I can reproduce it. In your example, the two problems are linked to one another.
Here is an example that only includes the word squishing issue, where whitespace between words is sometimes removed:
This example shows that data is duplicated and words are squished together even though they are distinct in the HTML.
python:
This results in the following output:
'First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough.'
You can see that "First" appears twice even though it occurs only once in the HTML, and the same goes for some other sentences. Several words also get squished together, with the space between them removed.
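Both symptoms can be checked mechanically on the output quoted above. The following is a rough sketch using only the standard library. The helper names `duplicated_lines` and `squished_tokens` are made up for this example, and the lowercase-then-uppercase pattern is only a heuristic for a missing space, so it would also flag legitimate camel-case words.

```python
import re
from collections import Counter

# The extraction result quoted in this report.
OUTPUT = """First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough."""


def duplicated_lines(text):
    """Return the non-empty lines that occur more than once, in first-seen order."""
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return [line for line, n in counts.items() if n > 1]


def squished_tokens(text):
    """Return tokens containing a lowercase letter directly followed by an
    uppercase one, a heuristic hint that a space was dropped between words."""
    return re.findall(r"\b\w*[a-z][A-Z]\w*\b", text)


dupes = duplicated_lines(OUTPUT)
squished = squished_tokens(OUTPUT)
print(dupes)
print(squished)
```

The duplicate check only catches whole-line repeats such as "First". Sentences that reappear fused onto another line show up in the squished-token list instead.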