-
-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: supports delaying url date extraction #66
Conversation
Thanks @getorca, it looks good, I'll give it some thought and integrate the PR next week. |
Hi @getorca, I just made sure the changes are easier to understand. I also realized that the deferred URL extraction could be moved further down in the code but I have nothing to benchmark it on, do you think it would be beneficial or do we first leave the code as it is? |
Yes, it could, I moved it down as far as I'm familiar with more precise dates being extracted from. |
@adbar, I also need to benchmark this when used in trafiltura, because when I pulled it into my project, It was about 30% slower than goose3. But I'm running in parallel, so I'm not sure if it's related to that, the possible me leaks in trafiltura, or a difference in extractions slowing down some of my other pipeline steps. Or the date change slowed it that much, wrote a new bench marking library over the weekend. Still need to add a func to let the extractions run in parallel to see if it is something else causing the slowdown. I'll let you know more later. |
OK, so I'm leaving the PR open for now? I usually benchmark Trafilatura without metadata extraction, it could be that portions of the code are slower but typically I'd expect it to extract more metadata than BTW if you need to profile code I can really recommend pyinstrument (among others). |
@getorca The PR looks ready to merge, do you confirm? |
@adbar Yes, absolutely. Thanks. |
add a feature to improve precision of dates by delaying the extraction of the URL. see (#55)
adds the boolean parameter
url_delayed
to thefind_date
functionThis is slightly hackey, but is a quick fix. A better longer term solution will be allowing the extractors to be defined in order.