Datetime coming from response headers issue #179

Open
SamComber opened this issue Dec 2, 2024 · 1 comment
Labels
question Further information is requested

Comments

@SamComber

I noticed that htmldate utilizes the find_date function, which internally relies on examine_header.

Does it make sense to parse the response header from the server? Do servers typically default this to the current date?

Here’s an example where this date is extracted: '2024-12-02'...

from htmldate import find_date

find_date(
    "https://octopus.energy/blog/agile-octopus-bigger-story/",
    original_date=True,
    extensive_search=True,
)

But the actual publication date is...

image

If I comment out the examine_header lines, the correct date (2022-12-13) is extracted during the "# last resort" step.

adbar added the question label Dec 6, 2024
@adbar
Owner

adbar commented Dec 6, 2024

Hi @SamComber, Htmldate does not use the server response. The header in question is the one in the HTML document, where a meta tag is set to a very recent date, as I just checked: <meta name="created" content="6th Dec 2024 12:01">.
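
To see which meta tags a page actually carries, something like the following standard-library sketch may help. It is not Htmldate's own code: `MetaCollector`, `list_meta_tags`, and the trimmed-down `sample` document are hypothetical names introduced for illustration, and only the "created" tag itself is taken from this thread.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Some sites use property= (e.g. Open Graph) instead of name=.
        key = attr.get("name") or attr.get("property")
        if key and attr.get("content"):
            self.metas.append((key, attr["content"]))


def list_meta_tags(html):
    """Return every (name, content) pair found in the document's meta tags."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.metas


# Hypothetical minimal document; only the "created" tag comes from this thread.
sample = (
    "<html><head>"
    '<meta name="created" content="6th Dec 2024 12:01">'
    "</head><body></body></html>"
)
print(list_meta_tags(sample))  # [('created', '6th Dec 2024 12:01')]
```

Running this on the downloaded HTML of a page shows whether a date-like value comes from the document itself rather than from the HTTP response.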

Sometimes the information on the pages is not reliable and it's hard to discriminate between several fields which are all plausible dates.
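
As a rough illustration of that ambiguity, the sketch below takes the two date strings mentioned in this thread and keeps the earliest one. This is a hypothetical heuristic, not how Htmldate decides; `parse_candidate` and `earliest_date` are made-up helper names, and the two format strings only cover the values seen here.

```python
import re
from datetime import datetime


def parse_candidate(text):
    """Parse '6th Dec 2024 12:01' or '2022-12-13'; return None if neither format fits."""
    # Drop English ordinal suffixes (1st, 2nd, 3rd, 6th) so strptime can read the day.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    # Note: %b matches month abbreviations in the current locale (English assumed).
    for fmt in ("%d %b %Y %H:%M", "%Y-%m-%d"):
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


def earliest_date(candidates):
    """Return the earliest parseable candidate as YYYY-MM-DD, or None."""
    parsed = [d for d in map(parse_candidate, candidates) if d is not None]
    return min(parsed).date().isoformat() if parsed else None


# Both values appear in this thread: the meta tag and the date visible on the page.
candidates = ["6th Dec 2024 12:01", "2022-12-13"]
print(earliest_date(candidates))  # 2022-12-13
```

Picking the earliest value happens to work for this page, but it would fail wherever an unrelated older date appears in the markup, which is why no single rule is reliable.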
