feature: supports delaying url date extraction #66

getorca · 2022-10-21T02:45:56Z

Adds a feature to improve the precision of extracted dates by delaying date extraction from the URL (see #55).

Adds the boolean parameter url_delayed to the find_date function.

This is slightly hacky, but it is a quick fix. A better long-term solution would be to allow the extractors to be defined in order.
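
The idea can be sketched independently of the library. What follows is a minimal illustration, not htmldate's actual code. `find_date_sketch`, `date_from_url`, `date_from_meta` and both regular expressions are hypothetical names made up for this example. Only the `url_delayed` flag is taken from the description above. With the flag set, the URL is consulted only after the markup has failed to yield a date, so a date found in the markup takes precedence.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical patterns, for illustration only.
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")
META_DATE = re.compile(
    r'<meta[^>]+property="article:published_time"[^>]+content="(\d{4})-(\d{2})-(\d{2})'
)


def _to_date(match: Optional[re.Match]) -> Optional[date]:
    """Convert a (year, month, day) regex match to a date, or None if invalid."""
    if match is None:
        return None
    try:
        return date(*(int(group) for group in match.groups()))
    except ValueError:
        return None


def date_from_url(url: str) -> Optional[date]:
    """Look for a /YYYY/MM/DD/ segment in the URL."""
    return _to_date(URL_DATE.search(url))


def date_from_meta(html: str) -> Optional[date]:
    """Look for a publication date in a meta tag of the markup."""
    return _to_date(META_DATE.search(html))


def find_date_sketch(html: str, url: str, url_delayed: bool = False) -> Optional[date]:
    """Try the URL first by default; with url_delayed=True, try it last."""
    if not url_delayed:
        found = date_from_url(url)
        if found is not None:
            return found
    found = date_from_meta(html)
    if found is not None:
        return found
    # Deferred case: fall back to the URL only when the markup gave nothing.
    return date_from_url(url) if url_delayed else None
```

For a page whose URL and meta tag disagree, the default call returns the URL date, while `url_delayed=True` returns the date from the markup and falls back to the URL only if the markup contains none.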

adbar · 2022-10-21T10:36:11Z

Hi @getorca, could you please format the code with black?

getorca · 2022-10-21T11:11:58Z

> Hi @getorca, could you please format the code with black?

should be good to go @adbar

adbar · 2022-10-21T13:36:08Z

Thanks @getorca, it looks good, I'll give it some thought and integrate the PR next week.

adbar · 2022-10-24T09:36:09Z

Hi @getorca, I just made sure the changes are easier to understand.

I also realized that the deferred URL extraction could be moved further down in the code but I have nothing to benchmark it on, do you think it would be beneficial or do we first leave the code as it is?

getorca · 2022-10-24T15:33:58Z

> Hi @getorca, I just made sure the changes are easier to understand.
>
> I also realized that the deferred URL extraction could be moved further down in the code but I have nothing to benchmark it on, do you think it would be beneficial or do we first leave the code as it is?

Yes, it could. I moved it down as far as the extraction steps that I know yield more precise dates.

getorca · 2022-10-24T15:44:44Z

@adbar, I also need to benchmark this when used in Trafilatura, because when I pulled it into my project it was about 30% slower than goose3. But I'm running in parallel, so I'm not sure whether the slowdown is related to that, to the possible memory leaks in Trafilatura, or to a difference in extractions slowing down some of my other pipeline steps. Or maybe the date change slowed it down that much. I wrote a new benchmarking library over the weekend, but I still need to add a function that lets the extractions run in parallel to see whether something else is causing the slowdown. I'll let you know more later.

adbar · 2022-10-24T17:34:22Z

OK, so I'm leaving the PR open for now?

I usually benchmark Trafilatura without metadata extraction. It could be that portions of the code are slower, but I'd typically expect it to extract more metadata than goose3. In any case, date extraction with htmldate is much faster and more accurate on my benchmark, and you should be able to reproduce it (see tests/comparison.py).

BTW if you need to profile code I can really recommend pyinstrument (among others).
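
As a rough sketch of what profiling a single extraction call might look like: pyinstrument is the tool recommended above, but the example below swaps in the standard library's cProfile so that it runs without extra installs. `profile_call` and `slow_sum` are hypothetical names used only for illustration. In practice the profiled function would be the extraction call under investigation.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_sum(n):
    """Stand-in workload; replace with the call to be profiled."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(slow_sum, 100_000)
    print(report)
```

Sorting by cumulative time puts the functions that account for most of the runtime (including their callees) at the top, which helps separate a slow extractor from slowness elsewhere in a pipeline.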

adbar · 2022-11-07T16:50:34Z

@getorca The PR looks ready to merge, do you confirm?

getorca · 2022-11-07T17:10:39Z

@adbar Yes, absolutely. Thanks.

feature: supports delaying url date extraction

d7de73f

getorca mentioned this pull request Oct 21, 2022

return datetime instead of date #55

Closed

reformatted code.py style with black

784c8ad

rename parameter and add comments

b5486f6

adbar merged commit 57695ba into adbar:master Nov 7, 2022
