Add cbc as first canadian publisher #583

Merged 8 commits on Sep 3, 2024
32 changes: 32 additions & 0 deletions docs/supported_publishers.md
@@ -86,6 +86,38 @@
</table>


## CA-Publishers

<table class="publishers ca">
<thead>
<tr>
<th>Class&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</th>
<th>Name&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</th>
<th>URL&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</th>
<th>Missing&#160;Attributes</th>
<th>Additional&#160;Attributes&#160;&#160;&#160;&#160;</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<code>CBCNews</code>
</td>
<td>
<div>CBC News</div>
</td>
<td>
<a href="https://www.cbc.ca/">
<span>www.cbc.ca</span>
</a>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>


## CH-Publishers

<table class="publishers ch">
2 changes: 2 additions & 0 deletions src/fundus/publishers/__init__.py
@@ -4,6 +4,7 @@
from fundus.publishers.at import AT
from fundus.publishers.au import AU
from fundus.publishers.base_objects import Publisher, PublisherGroup
from fundus.publishers.ca import CA
from fundus.publishers.ch import CH
from fundus.publishers.cn import CN
from fundus.publishers.de import DE
@@ -61,3 +62,4 @@ class PublisherCollection(metaclass=PublisherCollectionMeta):
    tr = TR
    my = MY
    no = NO
    ca = CA
18 changes: 18 additions & 0 deletions src/fundus/publishers/ca/__init__.py
@@ -0,0 +1,18 @@
from fundus.publishers.base_objects import Publisher, PublisherGroup
from fundus.publishers.ca.cbc_news import CBCNewsParser
from fundus.scraping.url import NewsMap, RSSFeed, Sitemap

# noinspection PyPep8Naming


class CA(metaclass=PublisherGroup):
    CBCNews = Publisher(
        name="CBC News",
        domain="https://www.cbc.ca/",
        parser=CBCNewsParser,
        sources=[
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-topstories"),
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-world"),
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-canada"),
        ],
    )
64 changes: 64 additions & 0 deletions src/fundus/publishers/ca/cbc_news.py
@@ -0,0 +1,64 @@
import datetime
import json
import re
from typing import List, Optional

import more_itertools
from lxml.cssselect import CSSSelector
from lxml.etree import XPath
from lxml.html import document_fromstring

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.data import LinkedDataMapping
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    generic_topic_parsing,
)


class CBCNewsParser(ParserProxy):
    class V1(BaseParser):
        _summary_selector = CSSSelector("h2.deck")
        _subheadline_selector = CSSSelector("div.story > h2")
        _paragraph_selector = CSSSelector("div.story > p")

        _author_ld_selector = XPath("//script[@id='initialStateDom']")

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                summary_selector=self._summary_selector,
                subheadline_selector=self._subheadline_selector,
                paragraph_selector=self._paragraph_selector,
            )

        @attribute
        def authors(self) -> List[str]:
            doc = document_fromstring(self.precomputed.html)
            ld_nodes = self._author_ld_selector(doc)
            try:
                author_ld = json.loads(re.sub(r"(window\.__INITIAL_STATE__ = |;$)", "", ld_nodes[0].text_content()))
            except json.JSONDecodeError:
                return []
            if not (details := author_ld.get("detail")):
                return []
            if not (content := details.get("content")):
                return []
            return generic_author_parsing(content.get("authorList"))
Collaborator:

Is there a reason we don't use the author given in the LD?

Collaborator (author):

Yes, it isn't properly accessible. This JSON is not wrapped in the usual ld+json script type and contains lists, which causes bf_search to not work properly. I tried fixing it by replacing this line in the function, `new.extend(v for v in node.values() if isinstance(v, dict) or isinstance(v, list))`, but that ended up breaking bf_search completely, so I figured this might be the straightforward option. But since we also need it for the topics, I have now implemented a local fix.
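
For readers following the discussion, here is a minimal sketch of a breadth-first key search that descends into both dicts and lists. It is not the fundus implementation of bf_search; the function name `bf_search_sketch` and the sample data are hypothetical and only illustrate the idea under discussion.

```python
from collections import deque
from typing import Any

# Sentinel so that a stored value of None is still reported as a hit.
_MISSING = object()


def bf_search_sketch(data: Any, key: str, default: Any = None) -> Any:
    """Return the value of the shallowest occurrence of `key` in nested dicts and lists."""
    queue = deque([data])
    while queue:
        node = queue.popleft()
        if isinstance(node, dict):
            value = node.get(key, _MISSING)
            if value is not _MISSING:
                return value
            # Only containers can hold further keys, so skip scalar values.
            queue.extend(v for v in node.values() if isinstance(v, (dict, list)))
        elif isinstance(node, list):
            queue.extend(v for v in node if isinstance(v, (dict, list)))
    return default


# Hypothetical data shaped loosely like the state object discussed above.
state = {
    "detail": {"content": {"authorList": [{"name": "Jane Doe"}]}},
    "other": [{"keywords": ["world", "politics"]}],
}
authors = bf_search_sketch(state, "authorList")
keywords = bf_search_sketch(state, "keywords")
```

Using a queue instead of recursion keeps the traversal level by level, so a key near the root wins over a deeper key with the same name.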

Collaborator:

I think this should work now with the changes made in #592.

Collaborator (author):

Unfortunately, even with the changes made there, it still does not work: the keywords are in a script block that is not classified as ld+json, so it is ignored by the default _base_setup.
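
To illustrate the kind of script block in question, the following sketch parses a `window.__INITIAL_STATE__ = {...};` assignment into a dict using only the standard library. The helper name `parse_initial_state` and the sample payload are assumptions made for illustration; the parser in this PR does the equivalent with `re.sub` and `json.loads` on the node selected by `//script[@id='initialStateDom']`.

```python
import json
import re
from typing import Any, Dict, Optional

# Strips the leading JavaScript assignment and the trailing semicolon.
_STATE_WRAPPER = re.compile(r"^\s*window\.__INITIAL_STATE__\s*=\s*|;\s*$")


def parse_initial_state(script_text: str) -> Optional[Dict[str, Any]]:
    """Parse the JSON payload of an inline state script, or return None on failure."""
    payload = _STATE_WRAPPER.sub("", script_text)
    try:
        state = json.loads(payload)
    except json.JSONDecodeError:
        return None
    # The payload is expected to be a JSON object; anything else is treated as a miss.
    return state if isinstance(state, dict) else None


# Hypothetical script content, shaped like the keys the parser above reads.
sample = 'window.__INITIAL_STATE__ = {"detail": {"content": {"authorList": [{"name": "Jane Doe"}]}}};'
state = parse_initial_state(sample)
author_list = state["detail"]["content"]["authorList"] if state else []
```

Returning None instead of raising lets the caller fall back to an empty author or topic list when the script is absent or malformed.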

Collaborator:

There is a really clean way to do this, but unfortunately we need #588 for it. So let's finish this one afterwards.


        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("ReportageNewsArticle")[0].get("datePublished"))

        @attribute
        def title(self) -> Optional[str]:
            if not (title := self.precomputed.meta.get("og:title")):
                return title
            return re.sub(r" \|.*", "", title)

        @attribute
        def topics(self) -> List[str]:
            return generic_topic_parsing(self.precomputed.ld.bf_search("ReportageNewsArticle")[0].get("articleSection"))
Collaborator:

There seems to be a lot more information about the topics when you search the HTML for the term keyword.
Take this article, for example.

75 changes: 75 additions & 0 deletions tests/resources/parser/test_data/ca/CBCNews.json
@@ -0,0 +1,75 @@
{
"V1": {
"authors": [
"Yasmine Hassan"
],
"body": {
"summary": [
"The appointment came days after Ismail Haniyeh was assassinated in Tehran"
],
"sections": [
{
"headline": [],
"paragraphs": [
"Hours after Yahya Sinwar was named the new leader of Hamas's political bureau on Tuesday, many in Gaza wondered how the appointment would affect the war and ceasefire negotiations with Israel.",
"The announcement, posted on Hamas's Telegram channel soon after former leader Ismail Haniyeh was killed in Iran, was seen as a defiant move from the group. Israel has characterized Sinwar as the \"mastermind\" behind the Oct. 7 attacks on southern Israel, which Israeli figures say killed 1,200 and took over 250 hostages into Gaza.",
"Sinwar, 61, has led Hamas in the Gaza Strip since 2017. But his background leans more in military rather than in politics, and his methods are seen as more extreme than his predecessor's.",
"That has created questions over how Sinwar will manage negotiations, and how Israel will negotiate with the man who they say orchestrated the attacks — and whom they've vowed to kill.",
"Many Palestinians interviewed in Gaza expressed similar concern over the promotion, although some welcomed the move. The news comes during a time of tense negotiation to end a war that has devastated the region and killed more than 39,000, according to Palestinian tallies, over the past 10 months."
]
},
{
"headline": [
"Palestinians react"
],
"paragraphs": [
"Jamil Al Saadouni, 58, told CBC freelance videographer Mohamed El Saife in Khan Younis that Sinwar's appointment was \"an internal decision.\"",
"He lamented the fact that Palestinian civilians, who are directly impacted by the war in Gaza, were not consulted on the best replacement for Haniyeh.",
"\"This has nothing to do with other factions or the Palestinian people.\"",
"Abu Hassan Amer, 44, agreed.",
"\"Choosing a military leadership during this period can harm the negotiations,\" he told El Saife. \"Because as they say, the non-political gun creates roadblocks.\"",
"Sinwar is seen as a \"hard-liner\" even within Hamas, said Matthew Levitt, senior fellow at the Washington Institute for Near East Policy, which was founded in 1985 with support from the American Israel Public Affairs Committee, a pro-Israel lobbying organization.",
"Sinwar served over 20 years in Israeli jails in connection with the killings of two Israeli soldiers and four fellow Palestinians, and was released early in 2011 as part of a prisoner swap. He has been known to hunt down people suspected of collaborating with Israel.",
"Levitt said that because of his time in jail, Sinwar \"understands Israelis.\"",
"\"He learned Hebrew, he spoke with his jailers, and that really showed on Oct. 7, when he understood the trauma that the kidnapping and killing of a large number of people would do for the Israelis,\" he said.",
"By comparison, Haniyeh, who ruled in exile from Qatar, often took a more moderate and pragmatic stance.",
"\"The killing of Haniyeh already brought negotiations back to the drawing board,\" Lina Khatib, an expert on the conflict at U.K. think-tank Chatham House, told the AP in an interview. \"The next chess move by Hamas makes negotiations even trickier.\"",
"Haniyeh was killed by an airstrike in Tehran, where he was attending the inauguration of Iran's new president. While Hamas and Iran have blamed Israel for the strike, Israel has not claimed responsibility for it."
]
},
{
"headline": [
"A military man in politics"
],
"paragraphs": [
"Some in Gaza welcomed the news of Sinwar's promotion, saying they needed someone to defend them.",
"\"Choosing him from the stance of Palestine is a good choice,\" Abu Anas Al Saud told El Saife. \"We need someone to defend the land that was stolen.\"",
"But Al Saud is aware of the effect Sinwar may have on ceasefire talks.",
"\"He's the most wanted man to Israel,\" he said. \"It will not advance negotiations at all.\"",
"Sinwar only made rare appearances before the war. He hasn't been seen in public since Oct. 7, and is thought to be hiding deep in tunnels beneath the Gaza Strip. Mediators say it takes several days to exchange messages with him, raising questions on how he will now manage Hamas as its international face.",
"Sinwar \"is someone who grew up within the brigade and the militant terrorist wing of Hamas,\" said Levitt.",
"However, while Sinwar's promotion might seem like a direct \"challenge to Israel,\" a deal was still possible, Sadeq Abu Amer told the AP. He noted that Sinwar \"might take a step that will surprise everyone.\" Abu Amer is the head of the Palestinian Dialogue Group in Turkey, which says on its site that it aims to \"protect the historical rights of the Palestinian people.\"",
"And while the assassination of Haniyeh makes a difference \"in the immediate,\" Levitt said, in the long term, both sides are still looking for a deal.",
"\"The same factors that were driving Hamas towards the deal and separately driving the Israeli prime minister to a deal are still there.\""
]
},
{
"headline": [
"'There is only one place for Yahya Sinwar'"
],
"paragraphs": [
"On Tuesday, Israel's chief military spokesperson, Rear Admiral Daniel Hagari, said Sinwar's appointment would not stop Israel from pursuing him.",
"\"There is only one place for Yahya Sinwar, and it is beside Mohammed Deif and the rest of the Oct. 7 terrorists,\" he told the Saudi state-owned Al-Arabiya television. \"That is the only place we're preparing and intending for him.\"",
"Amer, in Gaza, stressed the importance of diplomacy before military strength, particularly as negotiations continue between both sides.",
"\"There are rules to resistance, rules to war and rules to peace,\" said Amer. \"[And] we need peace in this current moment.\""
]
}
]
},
"publishing_date": "2024-08-08 08:00:00+00:00",
"title": "Palestinians say his appointment could ruin ceasefire talks",
"topics": [
"World"
]
}
}
Binary file not shown.
6 changes: 6 additions & 0 deletions tests/resources/parser/test_data/ca/meta.info
@@ -0,0 +1,6 @@
{
"CBCNews_2024_08_08.html.gz": {
"url": "https://www.cbc.ca/news/world/gaza-israel-ceasefire-negotiations-sinwar-1.7287711?cmp=rss",
"crawl_date": "2024-08-08 23:53:17.604667"
}
}