Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: supports delaying url date extraction #66

Merged
merged 3 commits into from
Nov 7, 2022
Merged

Conversation

getorca
Copy link
Contributor

@getorca getorca commented Oct 21, 2022

add a feature to improve precision of dates by delaying the extraction of the URL. see (#55)

adds the boolean parameter url_delayed to the find_date function

This is slightly hackey, but is a quick fix. A better longer term solution will be allowing the extractors to be defined in order.

@adbar
Copy link
Owner

adbar commented Oct 21, 2022

Hi @getorca, could you please format the code with black?

@getorca
Copy link
Contributor Author

getorca commented Oct 21, 2022

Hi @getorca, could you please format the code with black?

should be good to go @adbar

@adbar
Copy link
Owner

adbar commented Oct 21, 2022

Thanks @getorca, it looks good, I'll give it some thought and integrate the PR next week.

@adbar
Copy link
Owner

adbar commented Oct 24, 2022

Hi @getorca, I just made sure the changes are easier to understand.

I also realized that the deferred URL extraction could be moved further down in the code but I have nothing to benchmark it on, do you think it would be beneficial or do we first leave the code as it is?

@getorca
Copy link
Contributor Author

getorca commented Oct 24, 2022

Hi @getorca, I just made sure the changes are easier to understand.

I also realized that the deferred URL extraction could be moved further down in the code but I have nothing to benchmark it on, do you think it would be beneficial or do we first leave the code as it is?

Yes, it could, I moved it down as far as I'm familiar with more precise dates being extracted from.

@getorca
Copy link
Contributor Author

getorca commented Oct 24, 2022

@adbar, I also need to benchmark this when used in trafiltura, because when I pulled it into my project, It was about 30% slower than goose3. But I'm running in parallel, so I'm not sure if it's related to that, the possible me leaks in trafiltura, or a difference in extractions slowing down some of my other pipeline steps. Or the date change slowed it that much, wrote a new bench marking library over the weekend. Still need to add a func to let the extractions run in parallel to see if it is something else causing the slowdown. I'll let you know more later.

@adbar
Copy link
Owner

adbar commented Oct 24, 2022

OK, so I'm leaving the PR open for now?

I usually benchmark Trafilatura without metadata extraction, it could be that portions of the code are slower but typically I'd expect it to extract more metadata than goose3. In any case, date extraction with htmldate is much faster and more accurate on my benchmark, you should be able to reproduce it (see tests/comparison.py).

BTW if you need to profile code I can really recommend pyinstrument (among others).

@adbar
Copy link
Owner

adbar commented Nov 7, 2022

@getorca The PR looks ready to merge, do you confirm?

@getorca
Copy link
Contributor Author

getorca commented Nov 7, 2022

@adbar Yes, absolutely. Thanks.

@adbar adbar merged commit 57695ba into adbar:master Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants