Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for "Washington Post" #467

Merged
merged 4 commits into from
May 6, 2024
Merged

Conversation

areinicke
Copy link

I have added support for the US-Publisher "Washington Post" (https://www.washingtonpost.com/)

I have ran the tests as instructed and no errors were produced.

Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good. Thanks for adding this 👍

src/fundus/publishers/us/__init__.py Show resolved Hide resolved
src/fundus/publishers/us/washington_post.py Show resolved Hide resolved
src/fundus/publishers/us/washington_post.py Outdated Show resolved Hide resolved
@addie9800
Copy link
Collaborator

You could consider also adding a function for topics, because within the json, theres a tag called keywords which would provide the necessary data

Everything you have implemented so far looks good. Now what still remains open is a function for the topics. If you add that you also need to run python -m scripts.generate_parser_test_files -p WashingtonPost -oj to update the test cases. (My guess is that this is why the tests are failing atm as well). After all of that make sure to also run black . to do any necessary reformatting.

@areinicke
Copy link
Author

areinicke commented Apr 29, 2024

Unfortunately, I am unsure on how to specifically extract the values of the "keywords" tag with the methods Fundus provides or without causing the topics method to be huge. I have tried several options but was unsuccessful so far. An alternative would be to just extract the "article:section" value from the meta section. However, this would be extremely broad and only return one topic per article, which is not ideal.

Additionally, adding the additional RSS Feeds you provided seems to have caused the main page of the Washington Post ( https://www.washingtonpost.com/ ) to be considered as an article as well. When this occurs, no article text or publishing date is returned obviously. Fundus will say "--missing plaintext--"

In the meantime, I have fixed the tests. They should run fine now.

@addie9800
Copy link
Collaborator

You are right, for some reason the RSS Feeds sometimes don't contain the actual link to the article and just lead to the homepage. I don't know why. For this we have the url_filter attribute and since it was just something small, I added it to the PR.
I'm sorry regarding the keywords, because I also couldn't find what I found the last time and it does really not make sense adding them. Sorry about that :)

@MaxDall MaxDall merged commit 1996937 into flairNLP:master May 6, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants