Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cbc as first canadian publisher #583

Merged
merged 8 commits into from
Sep 3, 2024
Merged

Conversation

addie9800
Copy link
Collaborator

No description provided.

Copy link
Collaborator

@MaxDall MaxDall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding 👍


@attribute
def topics(self) -> List[str]:
return generic_topic_parsing(self.precomputed.ld.bf_search("ReportageNewsArticle")[0].get("articleSection"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a lot more information about the topics looking through the HTML using the term keyword.
Take this article for example.

Comment on lines 40 to 50
doc = document_fromstring(self.precomputed.html)
ld_nodes = self._author_ld_selector(doc)
try:
author_ld = json.loads(re.sub(r"(window\.__INITIAL_STATE__ = |;$)", "", ld_nodes[0].text_content()))
except json.JSONDecodeError:
return []
if not (details := author_ld.get("detail")):
return []
if not (content := details.get("content")):
return []
return generic_author_parsing(content.get("authorList"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we don't use the author given in the LD.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it isn't accessible properly. This JSON is not wrapped within the usual ld json type and contains lists, which causes the bf_search to not work properly. I tried fixing it by replacing new.extend(v for v in node.values() if isinstance(v, dict) or isinstance(v, list)) this line in the function, but that ended up breaking the bf_search completely, so I figured this might be the straightforward option. But since we also need it in the topics, I did now implement a local fix.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work now with the changes made here #592

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately even with the changes made there, it still does not work, since the keywords are in a script block that is not classified as ld+json, so it is ignored by the default _base_setup.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a really clean way to do this, but unfortunately, we need #588 for this. So let's finish this one after.

@addie9800 addie9800 requested a review from MaxDall August 27, 2024 22:04
@MaxDall MaxDall added the DON'T MERGE Current PR is based on another PR. When used, indicate in title with [Based on #...] label Aug 28, 2024
@MaxDall MaxDall removed the DON'T MERGE Current PR is based on another PR. When used, indicate in title with [Based on #...] label Sep 3, 2024
Copy link
Collaborator

@MaxDall MaxDall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

@addie9800 addie9800 merged commit 131ae12 into master Sep 3, 2024
5 checks passed
@addie9800 addie9800 deleted the add-canadian-news-sources branch September 3, 2024 11:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants