Update `TechCrunch` #522

MaxDall · 2024-05-17T07:21:39Z

This PR:

upgrades the TechCrunch parser to adapt to the newest layout
fixes a bug in the extraction for the previous version

addie9800

Thanks for updating this

addie9800 · 2024-05-17T12:32:00Z

src/fundus/publishers/us/techcrunch.py

+
+        @attribute
+        def title(self) -> Optional[str]:
+            return self.precomputed.meta.get("og:title")


I would parse this from the LD+JSON instead, since the `og:title` tag value has an extra `| TechCrunch` appended to the headline.

Actually, the same problem also occurs with V1. Also, since (almost) only the selectors have changed, I think a minor parser update (V1_1) would be enough and would spare all the duplicate code.

The headline referenced in the LD+JSON suffers from encoding problems, so I left it as is and removed the `| TechCrunch` suffix.
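
For illustration, stripping the suffix from the `og:title` value could look roughly like the sketch below. `clean_title` and the regular expression are hypothetical and are not the code from this PR; the sketch only assumes the suffix appears as a trailing `| TechCrunch`.

```python
import re
from typing import Optional

# Assumption: the publisher suffix is a trailing "| TechCrunch",
# possibly surrounded by whitespace.
_TITLE_SUFFIX = re.compile(r"\s*\|\s*TechCrunch\s*$")


def clean_title(raw_title: Optional[str]) -> Optional[str]:
    """Return the title without the trailing '| TechCrunch', or None if missing."""
    if raw_title is None:
        return None
    return _TITLE_SUFFIX.sub("", raw_title).strip()
```

Inside the parser, the attribute would then return something like `clean_title(self.precomputed.meta.get("og:title"))`.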

addie9800 · 2024-05-17T12:36:39Z

src/fundus/publishers/us/techcrunch.py

@@ -14,10 +14,42 @@


 class TechCrunchParser(ParserProxy):
+    class V2(BaseParser):
+        _summary_selector: XPath = CSSSelector("div.entry-content > p#speakable-summary")
+        _paragraph_selector: XPath = CSSSelector("div.entry-content > p:not(#speakable-summary)")


Do we also want to parse bullet-point lists? If so, I have found one usage: https://techcrunch.com/2024/05/15/senate-study-proposes-at-least-32b-yearly-for-ai-programs/

update TechCrunch to version 2 and fix an extraction bug for version 1

7f9e01f

MaxDall requested a review from addie9800 May 17, 2024 07:21

addie9800 requested changes May 17, 2024


adjust title parsing and include li elements

b8fec4a

MaxDall requested a review from addie9800 May 18, 2024 12:50

addie9800 approved these changes May 20, 2024


MaxDall merged commit 10dfea4 into master May 20, 2024
5 checks passed

MaxDall deleted the update-tech-crunch branch May 20, 2024 19:56
