find_date doesn't extract %D %b %Y formatted dates in free text #67

Closed
k-sareen opened this issue Oct 23, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@k-sareen

k-sareen commented Oct 23, 2022

For the following MWE:

from htmldate import find_date

print(find_date("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))

htmldate outputs 2022-01-01 instead of the expected 2022-10-19.

I've traced the execution of the above call and I believe it is the search_page function that has the bug. It doesn't seem to catch the above date pattern as a valid date and only grabs onto the 2022 part of the date string (which autocompletes the rest to 1st Jan).

I haven't found time to understand why the bug happens in detail so I don't have a solution right now. I'll try and see if I can fix the bug and will make a PR if I can.

adbar changed the title from "find_date returns the wrong date if date is formatted %D %b %Y" to "find_date doesn't extract %D %b %Y formatted dates in free text" on Oct 24, 2022
adbar added the "enhancement" (New feature or request) label on Oct 24, 2022
@adbar
Owner

adbar commented Oct 24, 2022

Hi @k-sareen, thanks for your feedback.

It's not a bug: htmldate cannot extract dates from free text. In this case it looks simple, but try this on a 10,000-character string where you don't know where the date is. For this reason, the package targets metadata or HTML fields and uses free text only as a last resort.
All it can do here is return the year as an approximation. But your example shows that it may be useful to look around the year info and maybe pass this string to the pipeline. I'm going to think about it.
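
To make the "look around the year" idea concrete, here is a minimal standard-library sketch. It is not htmldate's implementation: `date_around_year`, the `window` parameter, and both regular expressions are hypothetical names chosen for illustration, and only English month abbreviations in day-month-year order are handled.

```python
import re
from datetime import date

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")
# A plausible four-digit year (assumption: 1900-2099 only).
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
# A day number and a month name sitting right before the year.
DAY_MONTH = re.compile(r"\b(\d{1,2})\s+([A-Za-z]{3})[a-z]*\.?,?\s*$")


def date_around_year(text, window=20):
    """Return an ISO date built from the text just before a year, if any.

    Falls back to January 1st of the first year seen, mimicking the
    year-only approximation described in this thread.
    """
    fallback = None
    for match in YEAR.finditer(text):
        year = int(match.group(0))
        if fallback is None:
            fallback = f"{year}-01-01"
        # Only inspect a short window before the year, not the whole page.
        context = text[max(0, match.start() - window):match.start()]
        found = DAY_MONTH.search(context)
        if found is None:
            continue
        abbreviation = found.group(2).lower()
        if abbreviation not in MONTHS:
            continue
        try:
            full = date(year, MONTHS.index(abbreviation) + 1, int(found.group(1)))
        except ValueError:
            continue  # impossible date such as "31 Feb"
        return full.isoformat()
    return fallback


print(date_around_year("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))
```

Restricting the search to a small window keeps the cost low on long pages, but the sketch still cannot tell whether the date it finds is the publication date.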

@adbar
Owner

adbar commented Oct 24, 2022

Note about a quick fix: the issue can be resolved as follows, but the code gets slower and would have to be tested carefully. It can lead to false positives by extracting any date mentioned in the text without disambiguation or further clue about its relevance:

Changes in core.py:

  • imports: from .extractors import regex_parse
  • beginning of search_page() function:
    dateobject = regex_parse(htmlstring)
    if (
        date_validator(dateobject, outputformat, earliest=min_date, latest=max_date)
        is True
    ):
        try:
            LOGGER.debug("custom parse result: %s", dateobject)
            return dateobject.strftime(outputformat)  # type: ignore
        except ValueError as err:
            LOGGER.error("value error during conversion: %s %s", dateobject, err)
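
The false-positive risk mentioned above can be illustrated without htmldate. The following is a rough standard-library sketch, not the library's code: `candidate_dates` and `first_in_range` are hypothetical helpers, and the bounds check only loosely stands in for what `date_validator` does with `earliest` and `latest`.

```python
import re
from datetime import date

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")
# Day, month name, four-digit year, e.g. "19 Oct 2022" or "4 July, 1990".
PATTERN = re.compile(r"\b(\d{1,2})\s+([A-Za-z]{3})[a-z]*\.?,?\s+(\d{4})\b")


def candidate_dates(text):
    """Collect every day-month-year date mentioned in the text, in order."""
    found = []
    for day, month, year in PATTERN.findall(text):
        key = month.lower()
        if key not in MONTHS:
            continue
        try:
            found.append(date(int(year), MONTHS.index(key) + 1, int(day)))
        except ValueError:
            pass  # impossible date such as "31 Feb 2022"
    return found


def first_in_range(text, earliest, latest):
    """Return the first candidate inside the bounds as an ISO string."""
    for candidate in candidate_dates(text):
        if earliest <= candidate <= latest:
            return candidate.isoformat()
    return None


text = "The festival first ran on 4 Jul 1990. Posted 19 Oct 2022."
# Tight bounds skip the historical date; loose bounds let it win.
print(first_in_range(text, date(1995, 1, 1), date(2022, 12, 31)))
print(first_in_range(text, date(1980, 1, 1), date(2022, 12, 31)))
```

With the looser bounds the first date mentioned in the prose is returned even though it is not the publication date, which is the kind of ambiguity a full-text search has to deal with.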
    

@k-sareen
Author

Ah right. I apologize, I seem to have misunderstood what kinds of dates htmldate can handle. Thank you for your quick response. Would you have a recommendation for a library that can handle dates in free-form text? Unfortunately I can't control what kind of date format I receive from articles (why can't everyone just use ISO 😢).
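
For the specific string in the MWE, which follows the RFC 2822 date layout, the Python standard library may already be enough. The sketch below assumes the page text consists of nothing but the timestamp once the tags are stripped; `rfc2822_date` is a hypothetical helper, and this does not cover dates embedded in longer prose.

```python
import re
from email.utils import parsedate_to_datetime

TAGS = re.compile(r"<[^>]+>")


def rfc2822_date(html):
    """Strip tags and parse the remaining text as an RFC 2822 date."""
    text = TAGS.sub(" ", html).strip()
    try:
        return parsedate_to_datetime(text).date().isoformat()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


print(rfc2822_date("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))
```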

@adbar
Owner

adbar commented Oct 24, 2022

No problem. I could add this functionality to the library, but I need some time to test it. Just out of curiosity: which languages are you interested in?

@k-sareen
Author

I'm working with English text/articles only. Though I think you're right that this is a bit of a slippery slope as it may potentially catch dates that are mentioned in the prose but are not the actual article date. I think it might be best to keep your library simple and I'll try and get around this edge case myself. Thank you again for your great work and for your insight!

P.S. Should I close the issue?

@adbar
Owner

adbar commented Oct 24, 2022

Thanks for your feedback. You can leave the issue open; I'll think about it and close it if it goes beyond the scope of the library.

@adbar
Owner

adbar commented Nov 15, 2022

Full text search is now supported and your example above works.

adbar closed this as completed on Nov 15, 2022