Add cbc as first canadian publisher #583
Merged
8 commits
2cf32ee add cbc as first canadian publisher (addie9800)
1bfaacc redo ld parsing (addie9800)
1f5e047 Merge branch 'master' into add-canadian-news-sources (MaxDall)
7453d21 Merge branch 'fork/krautreporter' into add-canadian-news-sources (MaxDall)
4371d6b rework `CBCNews` parser (MaxDall)
823114a update test case (MaxDall)
dbb3566 Merge branch 'master' into add-canadian-news-sources (MaxDall)
ebb05e3 fix a bug with topic parsing (MaxDall)
@@ -0,0 +1,18 @@
from fundus.publishers.base_objects import Publisher, PublisherGroup
from fundus.publishers.ca.cbc_news import CBCNewsParser
from fundus.scraping.url import NewsMap, RSSFeed, Sitemap

# noinspection PyPep8Naming


class CA(metaclass=PublisherGroup):
    CBCNews = Publisher(
        name="CBC News",
        domain="https://www.cbc.ca/",
        parser=CBCNewsParser,
        sources=[
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-topstories"),
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-world"),
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-canada"),
        ],
    )
@@ -0,0 +1,64 @@
import datetime
import json
import re
from typing import List, Optional

import more_itertools
from lxml.cssselect import CSSSelector
from lxml.etree import XPath
from lxml.html import document_fromstring

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.data import LinkedDataMapping
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    generic_topic_parsing,
)


class CBCNewsParser(ParserProxy):
    class V1(BaseParser):
        _summary_selector = CSSSelector("h2.deck")
        _subheadline_selector = CSSSelector("div.story > h2")
        _paragraph_selector = CSSSelector("div.story > p")

        _author_ld_selector = XPath("//script[@id='initialStateDom']")

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                summary_selector=self._summary_selector,
                subheadline_selector=self._subheadline_selector,
                paragraph_selector=self._paragraph_selector,
            )

        @attribute
        def authors(self) -> List[str]:
            doc = document_fromstring(self.precomputed.html)
            ld_nodes = self._author_ld_selector(doc)
            try:
                author_ld = json.loads(re.sub(r"(window\.__INITIAL_STATE__ = |;$)", "", ld_nodes[0].text_content()))
            except json.JSONDecodeError:
                return []
            if not (details := author_ld.get("detail")):
                return []
            if not (content := details.get("content")):
                return []
            return generic_author_parsing(content.get("authorList"))

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("ReportageNewsArticle")[0].get("datePublished"))

        @attribute
        def title(self) -> Optional[str]:
            if not (title := self.precomputed.meta.get("og:title")):
                return title
            return re.sub(r" \|.*", "", title)

        @attribute
        def topics(self) -> List[str]:
            return generic_topic_parsing(self.precomputed.ld.bf_search("ReportageNewsArticle")[0].get("articleSection"))
There seems to be a lot more information about the topics looking through the HTML using the term
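The `title` attribute in the parser above drops everything from the first ` |` onward in the `og:title` value. A minimal sketch of that step in isolation follows. The helper name `strip_site_suffix` and the sample title are hypothetical and only illustrate the regular expression used in the diff.

```python
import re
from typing import Optional


def strip_site_suffix(title: Optional[str]) -> Optional[str]:
    """Drop a trailing ' | ...' suffix; pass None or '' through unchanged."""
    if not title:
        return title
    # Same pattern as in the diff: cut at the first occurrence of " |".
    return re.sub(r" \|.*", "", title)


# Hypothetical input; the real og:title format is an assumption here.
print(strip_site_suffix("Example headline | CBC News"))
```

Because the pattern matches at the first ` |`, a headline that itself contains ` | ` would also be cut at that point.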
@@ -0,0 +1,75 @@
{
  "V1": {
    "authors": [
      "Yasmine Hassan"
    ],
    "body": {
      "summary": [
        "The appointment came days after Ismail Haniyeh was assassinated in Tehran"
      ],
      "sections": [
        {
          "headline": [],
          "paragraphs": [
            "Hours after Yahya Sinwar was named the new leader of Hamas's political bureau on Tuesday, many in Gaza wondered how the appointment would affect the war and ceasefire negotiations with Israel.",
            "The announcement, posted on Hamas's Telegram channel soon after former leader Ismail Haniyeh was killed in Iran, was seen as a defiant move from the group. Israel has characterized Sinwar as the \"mastermind\" behind the Oct. 7 attacks on southern Israel, which Israeli figures say killed 1,200 and took over 250 hostages into Gaza.",
            "Sinwar, 61, has led Hamas in the Gaza Strip since 2017. But his background leans more in military rather than in politics, and his methods are seen as more extreme than his predecessor's.",
            "That has created questions over how Sinwar will manage negotiations, and how Israel will negotiate with the man who they say orchestrated the attacks — and whom they've vowed to kill.",
            "Many Palestinians interviewed in Gaza expressed similar concern over the promotion, although some welcomed the move. The news comes during a time of tense negotiation to end a war that has devastated the region and killed more than 39,000, according to Palestinian tallies, over the past 10 months."
          ]
        },
        {
          "headline": [
            "Palestinians react"
          ],
          "paragraphs": [
            "Jamil Al Saadouni, 58, told CBC freelance videographer Mohamed El Saife in Khan Younis that Sinwar's appointment was \"an internal decision.\"",
            "He lamented the fact that Palestinian civilians, who are directly impacted by the war in Gaza, were not consulted on the best replacement for Haniyeh.",
            "\"This has nothing to do with other factions or the Palestinian people.\"",
            "Abu Hassan Amer, 44, agreed.",
            "\"Choosing a military leadership during this period can harm the negotiations,\" he told El Saife. \"Because as they say, the non-political gun creates roadblocks.\"",
            "Sinwar is seen as a \"hard-liner\" even within Hamas, said Matthew Levitt, senior fellow at the Washington Institute for Near East Policy, which was founded in 1985 with support from the American Israel Public Affairs Committee, a pro-Israel lobbying organization.",
            "Sinwar served over 20 years in Israeli jails in connection with the killings of two Israeli soldiers and four fellow Palestinians, and was released early in 2011 as part of a prisoner swap. He has been known to hunt down people suspected of collaborating with Israel.",
            "Levitt said that because of his time in jail, Sinwar \"understands Israelis.\"",
            "\"He learned Hebrew, he spoke with his jailers, and that really showed on Oct. 7, when he understood the trauma that the kidnapping and killing of a large number of people would do for the Israelis,\" he said.",
            "By comparison, Haniyeh, who ruled in exile from Qatar, often took a more moderate and pragmatic stance.",
            "\"The killing of Haniyeh already brought negotiations back to the drawing board,\" Lina Khatib, an expert on the conflict at U.K. think-tank Chatham House, told the AP in an interview. \"The next chess move by Hamas makes negotiations even trickier.\"",
            "Haniyeh was killed by an airstrike in Tehran, where he was attending the inauguration of Iran's new president. While Hamas and Iran have blamed Israel for the strike, Israel has not claimed responsibility for it."
          ]
        },
        {
          "headline": [
            "A military man in politics"
          ],
          "paragraphs": [
            "Some in Gaza welcomed the news of Sinwar's promotion, saying they needed someone to defend them.",
            "\"Choosing him from the stance of Palestine is a good choice,\" Abu Anas Al Saud told El Saife. \"We need someone to defend the land that was stolen.\"",
            "But Al Saud is aware of the effect Sinwar may have on ceasefire talks.",
            "\"He's the most wanted man to Israel,\" he said. \"It will not advance negotiations at all.\"",
            "Sinwar only made rare appearances before the war. He hasn't been seen in public since Oct. 7, and is thought to be hiding deep in tunnels beneath the Gaza Strip. Mediators say it takes several days to exchange messages with him, raising questions on how he will now manage Hamas as its international face.",
            "Sinwar \"is someone who grew up within the brigade and the militant terrorist wing of Hamas,\" said Levitt.",
            "However, while Sinwar's promotion might seem like a direct \"challenge to Israel,\" a deal was still possible, Sadeq Abu Amer told the AP. He noted that Sinwar \"might take a step that will surprise everyone.\" Abu Amer is the head of the Palestinian Dialogue Group in Turkey, which says on its site that it aims to \"protect the historical rights of the Palestinian people.\"",
            "And while the assassination of Haniyeh makes a difference \"in the immediate,\" Levitt said, in the long term, both sides are still looking for a deal.",
            "\"The same factors that were driving Hamas towards the deal and separately driving the Israeli prime minister to a deal are still there.\""
          ]
        },
        {
          "headline": [
            "'There is only one place for Yahya Sinwar'"
          ],
          "paragraphs": [
            "On Tuesday, Israel's chief military spokesperson, Rear Admiral Daniel Hagari, said Sinwar's appointment would not stop Israel from pursuing him.",
            "\"There is only one place for Yahya Sinwar, and it is beside Mohammed Deif and the rest of the Oct. 7 terrorists,\" he told the Saudi state-owned Al-Arabiya television. \"That is the only place we're preparing and intending for him.\"",
            "Amer, in Gaza, stressed the importance of diplomacy before military strength, particularly as negotiations continue between both sides.",
            "\"There are rules to resistance, rules to war and rules to peace,\" said Amer. \"[And] we need peace in this current moment.\""
          ]
        }
      ]
    },
    "publishing_date": "2024-08-08 08:00:00+00:00",
    "title": "Palestinians say his appointment could ruin ceasefire talks",
    "topics": [
      "World"
    ]
  }
}
Binary file not shown.
@@ -0,0 +1,6 @@
{
  "CBCNews_2024_08_08.html.gz": {
    "url": "https://www.cbc.ca/news/world/gaza-israel-ceasefire-negotiations-sinwar-1.7287711?cmp=rss",
    "crawl_date": "2024-08-08 23:53:17.604667"
  }
}
Is there a reason we don't use the author given in the LD?
Yes, it isn't accessible properly. This JSON is not wrapped within the usual ld+json type and contains lists, which causes the bf_search to not work properly. I tried fixing it by replacing this line in the function:
new.extend(v for v in node.values() if isinstance(v, dict) or isinstance(v, list))
but that ended up breaking the bf_search completely, so I figured this might be the straightforward option. But since we also need it in the topics, I did now implement a local fix.
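The problem described in this comment is a breadth-first key search that has to cope with lists nested between dictionaries. The following is a minimal sketch of such a search, written from scratch for illustration. It is not fundus's `bf_search`, and the name `bf_search_sketch`, the sentinel and the sample data are all hypothetical.

```python
from collections import deque
from typing import Any

_MISSING = object()  # sentinel, so a stored None still counts as a hit


def bf_search_sketch(root: Any, key: str, default: Any = None) -> Any:
    """Return the value of the first `key` found in breadth-first order."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if isinstance(node, dict):
            value = node.get(key, _MISSING)
            if value is not _MISSING:
                return value
            # Only containers can hold further keys, so scalars are skipped.
            queue.extend(v for v in node.values() if isinstance(v, (dict, list)))
        elif isinstance(node, list):
            # Lists have no keys of their own; enqueue their container items.
            queue.extend(v for v in node if isinstance(v, (dict, list)))
    return default


# Hypothetical payload with a list sitting between two dictionary levels.
data = {"detail": {"content": [{"meta": 1}, {"authorList": [{"name": "A. Writer"}]}]}}
print(bf_search_sketch(data, "authorList"))
```

Treating lists as transparent containers keeps the dictionary lookup logic unchanged, at the cost of counting each list as one extra level of depth.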
I think this should work now with the changes made here #592
Unfortunately even with the changes made there, it still does not work, since the keywords are in a script block that is not classified as ld+json, so it is ignored by the default _base_setup.
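Since the data sits in a plain script block rather than an ld+json one, it has to be located and decoded separately. Below is a rough, standard-library-only sketch of that idea. The script id `initialStateDom` and the `window.__INITIAL_STATE__` prefix come from the diff in this PR, while the helper names, the sample HTML and the `name` field inside `authorList` are assumptions made for illustration.

```python
import json
import re
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class _ScriptById(HTMLParser):
    """Collect the text of the first <script> element with a given id."""

    def __init__(self, script_id: str) -> None:
        super().__init__()
        self._script_id = script_id
        self._inside = False
        self._found = False
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self._found and dict(attrs).get("id") == self._script_id:
            self._inside = True
            self._found = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def load_initial_state(html: str, script_id: str = "initialStateDom") -> Optional[Dict[str, Any]]:
    """Return the decoded state object, or None if it is missing or malformed."""
    parser = _ScriptById(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    # Strip the JavaScript assignment prefix and the trailing semicolon.
    text = re.sub(r"^window\.__INITIAL_STATE__\s*=\s*|;\s*$", "", text)
    try:
        state = json.loads(text)
    except json.JSONDecodeError:
        return None
    return state if isinstance(state, dict) else None


# Hypothetical page; the real payload is far larger and may be shaped differently.
sample = (
    "<html><head>"
    '<script id="initialStateDom">'
    'window.__INITIAL_STATE__ = {"detail": {"content": {"authorList": [{"name": "A. Writer"}]}}};'
    "</script></head><body></body></html>"
)
state = load_initial_state(sample)
authors = (state or {}).get("detail", {}).get("content", {}).get("authorList", [])
print([author["name"] for author in authors])
```

Returning `None` for a missing or undecodable block lets the caller fall back to an empty result, which mirrors the early returns in the `authors` attribute of the diff.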
There is a really clean way to do this, but unfortunately, we need #588 for this. So let's finish this one after.