Add The Globe and Mail #587

Merged
merged 4 commits into from
Sep 3, 2024
15 changes: 15 additions & 0 deletions docs/supported_publishers.md
@@ -129,6 +129,21 @@
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>
<code>TheGlobeAndMail</code>
</td>
<td>
<div>The Globe and Mail</div>
</td>
<td>
<a href="https://www.theglobeandmail.com">
<span>www.theglobeandmail.com</span>
</a>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>

10 changes: 10 additions & 0 deletions src/fundus/publishers/ca/__init__.py
@@ -1,5 +1,6 @@
from fundus.publishers.base_objects import Publisher, PublisherGroup
from fundus.publishers.ca.cbc_news import CBCNewsParser
from fundus.publishers.ca.globe_and_mail import TheGlobeAndMailParser
from fundus.publishers.ca.national_post import NationalPostParser
from fundus.scraping.url import NewsMap, RSSFeed, Sitemap

@@ -17,6 +18,15 @@ class CA(metaclass=PublisherGroup):
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-canada"),
        ],
    )
    TheGlobeAndMail = Publisher(
        name="The Globe and Mail",
        domain="https://www.theglobeandmail.com",
        parser=TheGlobeAndMailParser,
        sources=[
            NewsMap("https://www.theglobeandmail.com/arc/outboundfeeds/news-sitemap-index/?outputType=xml"),
            NewsMap("https://www.theglobeandmail.com/arc/outboundfeeds/sitemap-index/?outputType=xml"),
        ],
    )

    NationalPost = Publisher(
        name="National Post",
49 changes: 49 additions & 0 deletions src/fundus/publishers/ca/globe_and_mail.py
@@ -0,0 +1,49 @@
import datetime
from typing import List, Optional

from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    generic_topic_parsing,
)


class TheGlobeAndMailParser(ParserProxy):
    class V1(BaseParser):
        _subheadline_selector = CSSSelector("article > h4")
        _paragraph_selector = CSSSelector("article > p")

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                subheadline_selector=self._subheadline_selector,
                paragraph_selector=self._paragraph_selector,
            )

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.ld.bf_search("author"))

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("datePublished"))

        @attribute
        def title(self) -> Optional[str]:
            return self.precomputed.meta.get("og:title")

        @attribute
        def topics(self) -> List[str]:
            topic_list = [topic.lower() for topic in generic_topic_parsing(self.precomputed.meta.get("keywords"))]
            topic_set = set(topic_list)
            topic_duplicates = list(topic_list)
            for element in topic_set:
                topic_duplicates.remove(element)
            for duplicate in topic_duplicates:
                topic_list.remove(duplicate)
            return [topic.title() for topic in topic_list if "news" not in topic]
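
The `topics` attribute above lowercases the keywords, drops repeated entries, filters out anything containing "news", and title-cases what remains. The same idea can be sketched as a standalone function. `clean_topics` is a hypothetical helper for illustration, not part of fundus or of this PR, and it differs in one detail: it keeps the first occurrence of a repeated topic, whereas the merged code keeps the last.

```python
from typing import Iterable, List


def clean_topics(raw_topics: Iterable[str]) -> List[str]:
    """Lowercase, de-duplicate, drop entries containing 'news', then title-case."""
    # dict.fromkeys preserves insertion order, so the first occurrence of
    # each lowercased topic determines its position in the result.
    unique = dict.fromkeys(topic.lower() for topic in raw_topics)
    return [topic.title() for topic in unique if "news" not in topic]


# Hypothetical keyword list; only "NoaStack" is suggested by the test data below.
print(clean_topics(["NoaStack", "Business", "noastack", "World News"]))
# ['Noastack', 'Business']
```

Whether first-wins or last-wins ordering matters depends on how the keywords are ordered in the page metadata; for a single repeated keyword the two approaches give the same set of topics.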
67 changes: 67 additions & 0 deletions tests/resources/parser/test_data/ca/TheGlobeAndMail.json
@@ -0,0 +1,67 @@
{
"V1": {
"authors": [
"Chris Wilson-Smith"
],
"body": {
"summary": [],
"sections": [
{
"headline": [],
"paragraphs": [
"If one analyst catches Nvidia chief executive Jensen Huang so much as sneezing tomorrow, I’d stay away from market news for a few days. Below, as investors sharpen their knives in case the AI giant reports merely terrific earnings, we look at why Nvidia’s long-term growth could be damaging in unexpected ways.",
"Earnings up: Royal Bank of Canada has surpassed analysts’ estimates for quarterly profit as it set aside a smaller than expected sum to protect itself against losses on bad loans.",
"And up again: National Bank of Canada also has posted profits that exceeded analysts expectations, a week ahead of a key vote on its proposed $5-billion takeover of rival Canadian Western Bank.",
"Stare down: Buyers smell blood in the water as distressed commercial properties are put up for sale. But so far, sellers of that troubled real estate are refusing to accept rock-bottom values.",
"Chair down: The inaugural chair of the organization in charge of overseeing Canada’s adoption of international sustainability reporting standards has stepped down, prompting a search for a replacement at a key time in its duties.",
"Deep down: A court battle pitting the two brothers behind Dye & Durham Ltd. against one another has exposed broad discontent among institutional shareholders toward the real-estate software company dating to well before activists launched campaigns against it this year."
]
},
{
"headline": [
"Nvidia’s new vertical: Nation building"
],
"paragraphs": [
"Nvidia’s chief executive is known for a few sayings and stylistic choices: We are at the beginning of “a new industrial revolution” powered by artificial intelligence. His company’s relatively affordable products are “democratizing” access to its computational powers. He has a cool leather jacket.",
"Of late, he seems focused on another vision for the future: “Sovereign AI.”",
"That would be the idea that each nation produces artificial intelligence using its own infrastructure, data, work force and business networks.",
"Canada is among the subscribers to this idea, and the argument is similar to the one made by manufacturers of, say, electric vehicles: If we don’t protect our industry, if we don’t develop and innovate, we are then beholden to the whims of industry giants in other countries. We lose homegrown winners and jobs, and possibly expose ourselves to security threats.",
"It’s a great idea for Jensen Huang, who gets to sell billions of dollars worth of chips to governments. But it’s a little more complicated – and possibly even more dangerous – than he makes it out to be.",
"The biggest problem might have been best illustrated by Huang himself. At the World Governments Summit in Dubai this February, he reminded an audience of leaders across industry and politics that investment in AI infrastructure is essential.",
"He then told these leaders, who were gathered in a country that criminalizes being gay, what he would do if he were a leader of a developing nation: “The first thing that I would do, of course, is I would codify the language, the data of your culture into your own large language model.”",
"Did leaders of authoritarian nations lean forward in their seats?",
"That possibility is one of many concerns held by critics of sovereign AI. They argue embracing the concept, especially with the support of a global AI leader, could legitimize and accelerate state programs that codify belief systems, language preferences, behaviours. And if every nation becomes responsible for its own AI innovations, some of those breakthroughs could be left trapped behind geographic borders. That’s not to mention the risk of fuelling an already dominant Nvidia into a force that could squeeze out competition completely.",
"How does all this square with “democratizing” access to artificial intelligence? And more to Huang’s point: are there lines he wouldn’t feel comfortable seeing crossed?",
"In July, the Digital Forensic Research Lab outlined ways authoritarian governments that embrace sovereign AI could use it to further erode human rights.",
"At a more basic level, the report says, state-backed data initiatives for sovereign AI are likely to hurt marginalized populations, given governments’ views on national identity tend to be rooted in more deeply held – if not completely fixed – ideas. The report points to China, which has already succeeded in censoring models that threaten Beijing’s messaging. But the warning applies to any nation embracing the concept.",
"Canada, which has an “AI Sovereign Compute Strategy” as part of a broader set of measures, seems attuned to these risks. As part of its efforts to spur the development of Canadian-owned and located AI infrastructure, it launched consultations with businesses, developers, researchers and Indigenous groups that end on Sept. 6.",
"We’ll be curious to see what these consultations find, and how they will reflect Canada’s “culture.” (Not that the country has ever struggled to define what that is, of course.)",
"Huang’s own remarks suggest his strategy is to form the building blocks of the next industrial revolution, then leave it to his client countries to decide how to use them.",
"Today, even the slightest hint of weakness in Nvidia’s forecast could make for volatile trading over the coming weeks – but most analysts don’t see much of a threat to the AI giant. Longer-term, though, Nvidia’s expansion might attract more scrutiny to its growing role – whether it acknowledges it or not – as a nation builder with no apparent code of its own.",
"Reliance on the low-wage stream of the temporary foreign work program has shot up since 2022. The federal government agreed to ease access to the program in response to calls from restaurant owners and other employers who said they were struggling to find staff after months of pandemic restrictions. Ottawa announced this week plans to cut the low-wage stream back to prepandemic levels amid criticism of its growing use by Canadian employers.",
"Today: Nvidia and CrowdStrike report after close, assuming no faulty software updates. Investors will be chewing over reports from RBC and National Bank of Canada as they wait for ...",
"Tomorrow: ... Canadian Imperial Bank of Commerce earnings, which will be the last of the Big Six this quarter. Other earnings include Dell Technologies Inc., Dollar General Corp., and Lululemon Athletica Inc.",
"Friday: Canadian Western Bank reports as it awaits approval to be purchased by National Bank. Canada reports monthly GDP growth, and the U.S. releases two indicators of consumer spending and price growth.",
"Long arms of the claw: Inadequate federal enforcement of the lobster fishery in southwestern Nova Scotia is emboldening organized crime that is “terrorizing” the local community."
]
},
{
"headline": [
"Morning markets"
],
"paragraphs": [
"Global markets held steady as investors stayed on the sidelines ahead of Nvidia’s earnings release after the closing bell. Wall Street futures and TSX futures were little changed.",
"Overseas, the pan-European STOXX 600 was up 0.49 per cent in morning trading. Britain’s FTSE 100 slipped 0.14 per cent, Germany’s DAX rose 0.78 per cent and France’s CAC 40 gained 0.49 per cent.",
"In Asia, Japan’s Nikkei closed 0.22 per cent higher, while Hong Kong’s Hang Seng dropped 1.02 per cent.",
"The Canadian dollar traded at 74.27 U.S. cents."
]
}
]
},
"publishing_date": "2024-08-28 11:26:02.614000+00:00",
"title": "Business Brief: Nvidia’s into nation building. We cool with that?",
"topics": [
"Noastack"
]
}
}
Binary file not shown.
4 changes: 4 additions & 0 deletions tests/resources/parser/test_data/ca/meta.info
@@ -6,5 +6,9 @@
"NationalPost_2024_08_28.html.gz": {
"url": "https://nationalpost.com/news/canada/kamala-harris-childhood-montreal-canada",
"crawl_date": "2024-08-28 13:13:43.905282"
},
"TheGlobeAndMail_2024_08_28.html.gz": {
"url": "https://www.theglobeandmail.com/business/article-business-brief-nvidias-into-nation-building-we-cool-with-that/",
"crawl_date": "2024-08-28 13:26:27.319831"
}
}
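
Each `meta.info` entry maps a compressed HTML fixture to the URL it was crawled from and the time of the crawl. A minimal sketch of reading such an entry with only the standard library follows. `load_meta` is a hypothetical helper, not a fundus function, and the inline JSON mirrors only the entry added in this diff.

```python
import json
from datetime import datetime
from typing import Dict, Tuple

# Inline copy of the entry added above; the real file contains more entries.
META_INFO = """
{
  "TheGlobeAndMail_2024_08_28.html.gz": {
    "url": "https://www.theglobeandmail.com/business/article-business-brief-nvidias-into-nation-building-we-cool-with-that/",
    "crawl_date": "2024-08-28 13:26:27.319831"
  }
}
"""


def load_meta(raw: str) -> Dict[str, Tuple[str, datetime]]:
    """Map each fixture file name to its (url, crawl_date) pair."""
    entries = {}
    for file_name, info in json.loads(raw).items():
        # The crawl_date strings are ISO-like, so fromisoformat can parse them.
        entries[file_name] = (info["url"], datetime.fromisoformat(info["crawl_date"]))
    return entries


meta = load_meta(META_INFO)
url, crawled = meta["TheGlobeAndMail_2024_08_28.html.gz"]
print(crawled.date(), url)
```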