Add NationalPost #584

Merged
merged 4 commits into from
Sep 3, 2024
Changes from 1 commit
15 changes: 15 additions & 0 deletions docs/supported_publishers.md
@@ -114,6 +114,21 @@
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>
<code>NationalPost</code>
</td>
<td>
<div>National Post</div>
</td>
<td>
<a href="https://nationalpost.com">
<span>nationalpost.com</span>
</a>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>

12 changes: 12 additions & 0 deletions src/fundus/publishers/ca/__init__.py
@@ -1,5 +1,6 @@
from fundus.publishers.base_objects import Publisher, PublisherGroup
from fundus.publishers.ca.cbc_news import CBCNewsParser
from fundus.publishers.ca.national_post import NationalPostParser
from fundus.scraping.url import NewsMap, RSSFeed, Sitemap

# noinspection PyPep8Naming
@@ -16,3 +17,14 @@ class CA(metaclass=PublisherGroup):
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-canada"),
        ],
    )

    NationalPost = Publisher(
        name="National Post",
        domain="https://nationalpost.com",
        parser=NationalPostParser,
        sources=[
            NewsMap("https://nationalpost.com/sitemap-news.xml"),
            Sitemap("https://nationalpost.com/sitemap.xml"),
            RSSFeed("https://nationalpost.com/feed"),
        ],
    )
2 changes: 0 additions & 2 deletions src/fundus/publishers/ca/cbc_news.py
@@ -3,13 +3,11 @@
import re
from typing import List, Optional

import more_itertools
from lxml.cssselect import CSSSelector
from lxml.etree import XPath
from lxml.html import document_fromstring

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.data import LinkedDataMapping
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
62 changes: 62 additions & 0 deletions src/fundus/publishers/ca/national_post.py
@@ -0,0 +1,62 @@
import datetime
import json
import re
from typing import List, Optional

from lxml.cssselect import CSSSelector
from lxml.etree import XPath
from lxml.html import document_fromstring

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    generic_topic_parsing,
)


class NationalPostParser(ParserProxy):
    class V1(BaseParser):
        _summary_selector = CSSSelector("article p.article-subtitle")
        _subheadline_selector = XPath(
            "//section[@class='article-content__content-group article-content__content-group--story']/p/strong"
        )
        _paragraph_selector = XPath(
            "//section[@class='article-content__content-group article-content__content-group--story']/p[text()]"
        )

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                summary_selector=self._summary_selector,
                subheadline_selector=self._subheadline_selector,
                paragraph_selector=self._paragraph_selector,
            )

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.ld.bf_search("author"))

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("datePublished"))

        @attribute
        def title(self) -> Optional[str]:
            return self.precomputed.meta.get("og:title")

        @attribute
        def topics(self) -> List[str]:
            preliminary_topics = self.precomputed.ld.bf_search("keywords")
            filter_list = ["Curated", "News", "Newsroom daily", "story", "Canada", "World"]
            filtered_topics = [
                topic
                for topic in preliminary_topics
                if "NLP Entity Tokens" not in topic
                and "NLP Category" not in topic
                and topic not in filter_list
                and not re.search(r"[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}", topic)
            ]
Collaborator

Maybe some sort of regex filter would be nice here. Actually, I think you could just use the regular regex_filter.

Collaborator Author

I agree. Using regex_filter seems a bit cursed typing-wise, but it works just fine.

            return generic_topic_parsing(filtered_topics)
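
The regex-based filtering suggested in the review comments on `topics` could look roughly like the sketch below. It is only an illustration using the standard `re` module: `filter_topics` and `_EXCLUDED_TOPIC` are hypothetical names, and it does not use fundus's `regex_filter`, whose signature is not shown in this diff. The pattern folds the substring checks, the exact-match `filter_list` and the UUID check from the list comprehension into one expression.

```python
import re
from typing import List

# Hypothetical helper, not part of fundus: one pattern covering every exclusion
# that the list comprehension in `topics` applies.
_EXCLUDED_TOPIC = re.compile(
    r"NLP Entity Tokens"  # substring match, as in the original check
    r"|NLP Category"  # substring match, as in the original check
    r"|^(?:Curated|News|Newsroom daily|story|Canada|World)$"  # exact matches from filter_list
    r"|[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}"  # UUID-like tokens
)


def filter_topics(topics: List[str]) -> List[str]:
    """Return the topics that match none of the exclusion patterns."""
    return [topic for topic in topics if not _EXCLUDED_TOPIC.search(topic)]


print(filter_topics(["NP Comment", "News", "NLP Category: health"]))  # ['NP Comment']
```

Anchoring the `filter_list` alternatives with `^...$` keeps the exact-match behaviour of `topic not in filter_list`, so a topic such as "World News" would still pass through.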
38 changes: 38 additions & 0 deletions tests/resources/parser/test_data/ca/NationalPost.json
@@ -0,0 +1,38 @@
{
"V1": {
"authors": [
"Special to National Post"
],
"body": {
"summary": [
"A draft policy by the CMA essentially calls for an end to private workplace coverage for virtual health care"
],
"sections": [
{
Collaborator

I think it would be good to add a test case including at least one subheadline.

"headline": [],
"paragraphs": [
"Under the proposed policy, the CMA would advise that virtual care, such as secure messaging by physicians and services by nurse practitioners, be considered an insured benefit under provincial and territorial health care plans; it would also ask that the government reinterpret the Canada Health Act to eliminate “Duplicative insurance models whereby people can purchase private insurance to access medically necessary services” already covered by public insurance, echoing the federal government’s recently stated intentions.",
"These measures could jeopardize existing workplace virtual care coverage and other methods Canadians use to access care currently not supported by the public health system. If adopted, these changes will result in undue complexity for patients and further burden a system that is operating well beyond its capacity.",
"In the past, workplace benefits focused on paramedical services like physiotherapy and massage therapy. However, that was before our health-care system plunged into crisis. Despite Canada spending a higher share of GDP on health care than the majority of its high-income peers in the Organisation for Economic Co-operation and Development (OECD), it has a critical and worsening shortage of access to doctors, leading to deteriorating health outcomes and higher costs.",
"Some employers have, therefore, extended a lifeline to their employees by offering supplemental access to primary virtual care.",
"Instead of helping Canadians and its member physicians navigate these challenging times with innovative and effective ways to deploy the available resources, the CMA is proposing to advocate for an end to health-care access provided by privately funded virtual care. This is one of the few parts of our system that is working well at improving wait times, with costs that are largely underwritten by employers and insurers.",
"The CMA is essentially calling to restrict patients’ access to this type of care at a time when an estimated 6.5 million Canadians — per a recent OurCare study that was recognized by the CMA’s own medical journal — cannot access a family doctor. Additionally, millions more who have a family doctor wait days, if not weeks, to see their physician when they are in need. In fact, only 35 per cent of OurCare respondents could access same- or next-day appointments with their family doctor.",
"As a doctor and member of the CMA, I believe this policy proposal represents a massive betrayal to patients, yanking desperately needed health-care access from millions of hardworking Canadians. A large majority of Canadians disagree with the government’s position — and the CMA’s tentative position. According to Global News and Ipsos, more than 60 per cent of Canadians are in favour of allowing private health care for those who can afford it.",
"The draft policy also represents a betrayal of the CMA’s membership: in the summer of 2023, the association’s own polling found that 56 per cent of Canadian doctors support the right of patients to access private care when the public system cannot deliver timely access to a physician.",
"In its haste to support a risky and ideological policy over the will of Canadians at large, the CMA has ignored the interests and wishes of patients, Canadian businesses, institutions and associations that have spoken out on behalf of Canadians. These organizations have rightly pointed out that the association’s draft plan would eliminate access to care for a large percentage of our population and strain an already overburdened health-care system.",
"To be clear, I am not criticizing my fellow doctors — only the CMA, which has chosen to misrepresent the majority of us. The many doctors I speak to see tremendous value in the role private care now plays in alleviating pressure on a stressed system. In fact, the CMA appears to sense that its new policy is causing tension. A week after its release, the CMA leadership emphasized the “draft” nature of their policy in an op-ed in The Hill Times, and suggested their intention to undergo consultations with CMA members and stakeholders. This is an incredulous direction to take given the disregard CMA leaders have already demonstrated for their members’ opinions during the creation of this “draft.”",
"In order to regain its credibility, the CMA, a supposedly member-driven organization, needs to explain why, on this issue, it chose to ignore the majority opinion of its own members and instead align its draft policy with a status quo its members reject.",
"We are at a pivotal time in Canada’s health-care system. The majority of Canadians, and the majority of its physicians, have expressed a desire to blaze a new and different path in health care. Rather than doubling down on our current failing system with its unique prohibition on private health care, the CMA must show leadership and represent the will of our citizens and health practitioners.",
"By ignoring the reality of the heartbreaking state of health care and widespread wishes for change, the CMA — not private care — is suddenly a real threat to the future of Canada’s health-care system.",
"National Post"
]
}
]
},
"publishing_date": "2024-08-12 10:00:41+00:00",
"title": "Brett Belchetz: The Canadian Medical Association is the real threat to health-care access",
"topics": [
"NP Comment"
]
}
}
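
The review comment asking for a test case with at least one subheadline could be checked mechanically. The sketch below is a guess at such a check, not part of the fundus test suite: `has_subheadline` is a hypothetical helper, and the inline sample only mirrors the shape of the test data above (a `body` with `sections`, each carrying a `headline` list).

```python
import json
from typing import Any, Dict


def has_subheadline(version_data: Dict[str, Any]) -> bool:
    """Return True if any body section carries a non-empty headline list."""
    sections = version_data.get("body", {}).get("sections", [])
    return any(section.get("headline") for section in sections)


# Reduced sample in the same shape as the test data above (assumed, shortened).
sample = json.loads(
    '{"V1": {"body": {"summary": [], "sections": [{"headline": [], "paragraphs": ["..."]}]}}}'
)
print(has_subheadline(sample["V1"]))  # False
```

For test data shaped like the file above, where the only section has an empty `headline`, the check returns `False`, which matches the reviewer's observation.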
Binary file not shown.
4 changes: 4 additions & 0 deletions tests/resources/parser/test_data/ca/meta.info
@@ -2,5 +2,9 @@
"CBCNews_2024_08_08.html.gz": {
"url": "https://www.cbc.ca/news/world/gaza-israel-ceasefire-negotiations-sinwar-1.7287711?cmp=rss",
"crawl_date": "2024-08-08 23:53:17.604667"
},
"NationalPost_2024_08_12.html.gz": {
"url": "https://nationalpost.com/opinion/brett-belchetz-the-canadian-medical-association-is-the-real-threat-to-health-care-access",
"crawl_date": "2024-08-12 12:55:45.037006"
}
}