hwo411 changed the title from "Remove www from custom extractor domains or treat domain without www as www" to "Urls without www are not handled by extractors where domains have www in url" on May 31, 2023
Hi everyone, we recently discovered an issue in our system: some URLs were parsed without www, and therefore the custom extractor for that source wasn't used. To fix this, we either need to register all custom extractors without www or allow the lookup to search for www extractors in addition to base hostname extractors.
Expected Behavior
The commands
postlight-parser https://www.newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice
and
postlight-parser https://newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice
should produce the same result.
Current Behavior
In case of
postlight-parser https://newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice
the custom extractor is not used, and the body has only 1949 words instead of 3950.
Steps to Reproduce
Run postlight-parser https://newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice and check the content and word_count fields.
Detailed Description
Because the custom extractor is not used, the parser returns an incomplete body.
Possible Solution
Either rename all extractor folders to drop the www. prefix and set their domains without www., or allow getExtractor to also look up extractors under www. + hostname and www. + base hostname.
I'm not sure which option is better for the parser. I'd rather go with the first one, though it might be error-prone; the second one is less error-prone.
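To make the second option more concrete, here is a rough sketch of what the extra lookup could look like. It is a standalone illustration, not the parser's actual getExtractor: the extractors map, genericExtractor, candidateHosts and findExtractor are hypothetical names, and the "last two labels" base-domain rule is a simplifying assumption that would be wrong for suffixes such as co.uk.

```javascript
// Hypothetical registry keyed by domain; it stands in for the real extractor map.
const extractors = {
  'www.newyorker.com': { domain: 'www.newyorker.com' },
};

// Hypothetical fallback used when no custom extractor matches.
const genericExtractor = { domain: '*' };

// Build the list of registry keys to try, most specific first.
function candidateHosts(hostname) {
  // Assumption: the base domain is the last two labels of the hostname.
  const baseDomain = hostname.split('.').slice(-2).join('.');
  const candidates = [hostname, baseDomain];
  if (!hostname.startsWith('www.')) {
    candidates.push(`www.${hostname}`);
  }
  candidates.push(`www.${baseDomain}`);
  // Drop duplicates while keeping the original order.
  return [...new Set(candidates)];
}

// Return the first registered extractor among the candidates, else the fallback.
function findExtractor(url) {
  const { hostname } = new URL(url);
  for (const host of candidateHosts(hostname)) {
    if (Object.prototype.hasOwnProperty.call(extractors, host)) {
      return extractors[host];
    }
  }
  return genericExtractor;
}
```

With this lookup order, both https://newyorker.com/... and https://www.newyorker.com/... would resolve to the extractor registered under www.newyorker.com, without renaming any existing folders.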