Update TechCrunch
#522
Thanks for updating this
@attribute
def title(self) -> Optional[str]:
    return self.precomputed.meta.get("og:title")
I would parse this from the JSON, since the og:title tag value has an extra " | TechCrunch" appended to the headline.
Actually, the same problem also occurs with V1. Also, since (almost) only the selectors have changed, I think a minor parser update (V1_1) would be enough and would spare all the duplicate code.
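A minimal sketch of the minor-version idea, assuming the parsing logic depends only on class-level selectors. The `V1` class, its `describe` method, and the `OLD_*` placeholder strings are hypothetical stand-ins, not the project's actual parser code; selectors are plain strings here rather than compiled selector objects.

```python
class V1:
    """Hypothetical stand-in for the existing parser version."""

    # Placeholders: the real V1 selectors are not shown in this thread.
    _summary_selector = "OLD_SUMMARY_SELECTOR"
    _paragraph_selector = "OLD_PARAGRAPH_SELECTOR"

    def describe(self) -> str:
        # Shared logic that only reads the selectors, so subclasses inherit it unchanged.
        return f"summary={self._summary_selector}; paragraphs={self._paragraph_selector}"


class V1_1(V1):
    """Minor update: override the selectors, inherit everything else."""

    _summary_selector = "div.entry-content > p#speakable-summary"
    _paragraph_selector = "div.entry-content > p:not(#speakable-summary)"


if __name__ == "__main__":
    print(V1().describe())
    print(V1_1().describe())
```

With this shape, only the two selector attributes are duplicated between versions; every method defined on `V1` is reused as is.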
The headline referenced in the LD+JSON suffers from encoding problems, so I left the og:title source as is and removed the trailing " | TechCrunch".
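For illustration, removing the site suffix from the og:title value could look roughly like the sketch below. The helper name `strip_site_suffix` and the `SITE_SUFFIX` constant are hypothetical and not taken from the PR; the sketch assumes the suffix is always the literal string " | TechCrunch" at the end of the title.

```python
from typing import Optional

# Assumed literal suffix appended to the headline in the og:title tag.
SITE_SUFFIX = " | TechCrunch"


def strip_site_suffix(title: Optional[str], suffix: str = SITE_SUFFIX) -> Optional[str]:
    """Return the title without the trailing site suffix, if present."""
    if title is None:
        # The meta tag may be missing entirely.
        return None
    if title.endswith(suffix):
        return title[: -len(suffix)]
    return title


if __name__ == "__main__":
    print(strip_site_suffix("Some headline | TechCrunch"))
```

Checking `endswith` before slicing leaves titles without the suffix untouched, so the helper is safe to apply to every article.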
@@ -14,10 +14,42 @@

class TechCrunchParser(ParserProxy):
    class V2(BaseParser):
        _summary_selector: XPath = CSSSelector("div.entry-content > p#speakable-summary")
        _paragraph_selector: XPath = CSSSelector("div.entry-content > p:not(#speakable-summary)")
Do we also want to parse bullet-point lists? If so, I have found one article that uses them: https://techcrunch.com/2024/05/15/senate-study-proposes-at-least-32b-yearly-for-ai-programs/
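If list items should be included, one option is to walk the direct children of the content container in document order and collect both paragraphs and list items. The sketch below uses only the standard library on a made-up, well-formed sample; `SAMPLE` and `extract_body` are hypothetical, and the real parser would express this through its own selectors instead.

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up sample markup mirroring the structure discussed above.
SAMPLE = """
<div class="entry-content">
  <p id="speakable-summary">Summary paragraph.</p>
  <p>First body paragraph.</p>
  <ul>
    <li>First bullet.</li>
    <li>Second bullet.</li>
  </ul>
  <p>Closing paragraph.</p>
</div>
"""


def extract_body(markup: str) -> List[str]:
    """Collect body paragraphs and list items, skipping the summary paragraph."""
    root = ET.fromstring(markup.strip())
    texts: List[str] = []
    for child in root:  # direct children only, in document order
        if child.tag == "p" and child.get("id") != "speakable-summary":
            texts.append("".join(child.itertext()).strip())
        elif child.tag in ("ul", "ol"):
            for item in child.findall("li"):
                texts.append("".join(item.itertext()).strip())
    return texts


if __name__ == "__main__":
    for line in extract_body(SAMPLE):
        print(line)
```

Iterating over direct children keeps bullets in their original position relative to the surrounding paragraphs, which a separate list-only selector would not guarantee.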
This PR updates TechCrunch to adapt to the newest layout.