Add Support for "Washington Post" #467

areinicke · 2024-04-26T15:53:00Z

I have added support for the US publisher "Washington Post" (https://www.washingtonpost.com/).

I have run the tests as instructed and no errors were produced.

addie9800

This looks really good. Thanks for adding this 👍

src/fundus/publishers/us/__init__.py

src/fundus/publishers/us/washington_post.py

addie9800 · 2024-04-29T09:59:19Z

You could also consider adding a function for topics: within the JSON, there's a tag called keywords that would provide the necessary data.

Everything you have implemented so far looks good. What still remains open is a function for the topics. If you add it, you also need to run python -m scripts.generate_parser_test_files -p WashingtonPost -oj to update the test cases. (My guess is that this is also why the tests are failing at the moment.) After all of that, make sure to run black . to do any necessary reformatting.
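As a rough illustration of the topics suggestion above, here is a minimal sketch that reads a keywords entry from a JSON-LD blob using only the standard library. It deliberately does not use the Fundus parser API; the function name topics_from_ld, the sample payload, and the assumption that the page's JSON actually carries a keywords key are all hypothetical.

```python
import json
from typing import Any, List


def topics_from_ld(ld_json: str) -> List[str]:
    """Return the `keywords` entry of a JSON-LD blob as a clean list of topics.

    Hypothetical helper: assumes a `keywords` key exists in the JSON.
    """
    data: Any = json.loads(ld_json)
    # JSON-LD may be a single object or a list of objects; normalise to a list.
    records = data if isinstance(data, list) else [data]
    for record in records:
        if not isinstance(record, dict):
            continue
        keywords = record.get("keywords")
        if isinstance(keywords, str):
            # Some sites ship keywords as one comma-separated string.
            keywords = keywords.split(",")
        if isinstance(keywords, list):
            topics = [str(k).strip() for k in keywords]
            # Drop empty entries left over from trailing commas.
            return [t for t in topics if t]
    return []


# Made-up sample payload, not taken from a real article.
sample = '{"@type": "NewsArticle", "keywords": "politics, elections , "}'
print(topics_from_ld(sample))
```

Inside an actual Fundus parser the same normalisation would live in the topics attribute function, and the test files would then need to be regenerated with the command quoted above.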

areinicke · 2024-04-29T15:14:48Z

Unfortunately, I am unsure how to extract the values of the "keywords" tag, either with the methods Fundus provides or without making the topics method huge. I have tried several options, so far without success. An alternative would be to extract just the "article:section" value from the meta section. However, this would be extremely broad and return only one topic per article, which is not ideal.

Additionally, adding the extra RSS feeds you provided seems to have caused the main page of the Washington Post (https://www.washingtonpost.com/) to be treated as an article as well. When this occurs, no article text or publishing date is returned, and Fundus reports "--missing plaintext--".
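For reference, the "article:section" fallback mentioned above could look roughly like the following standard-library sketch. The class and function names are hypothetical and the HTML snippet is made up; it only shows why this route yields at most one topic per article.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class MetaSectionParser(HTMLParser):
    """Collects the content of the first <meta property="article:section"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.section: Optional[str] = None

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # Only look at <meta> tags, and stop once a section has been found.
        if tag != "meta" or self.section is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "article:section":
            self.section = attributes.get("content")


def section_topics(html: str) -> List[str]:
    """Return the article section as a one-element topic list, or an empty list."""
    parser = MetaSectionParser()
    parser.feed(html)
    return [parser.section] if parser.section else []


# Made-up HTML fragment for illustration only.
page = '<html><head><meta property="article:section" content="Politics"></head></html>'
print(section_topics(page))
```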

In the meantime, I have fixed the tests. They should run fine now.

addie9800 · 2024-05-02T10:11:07Z

You are right: for some reason the RSS feeds sometimes don't contain the actual link to the article and just lead to the homepage. I don't know why. For this we have the url_filter attribute, and since it was just a small change, I added it to the PR.
I'm sorry about the keywords: I couldn't find again what I found last time, so it really does not make sense to add them. Sorry about that :)
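To illustrate the homepage problem discussed above, here is a minimal sketch of a URL predicate that flags links with no path beyond "/". It is not the filter that was actually added to the PR; the name is_homepage is hypothetical, and the convention that returning True means "discard this URL" is an assumption about how such a filter would be wired into url_filter.

```python
from urllib.parse import urlparse


def is_homepage(url: str) -> bool:
    """Return True if the URL has no path beyond "/".

    Assumed convention: True means the URL should be skipped.
    """
    path = urlparse(url).path
    # "" and "/" both denote the site root.
    return path.strip("/") == ""


# The article URL below is a made-up example.
for candidate in (
    "https://www.washingtonpost.com/",
    "https://www.washingtonpost.com/politics/2024/04/26/example/",
):
    print(candidate, is_homepage(candidate))
```

A path-based check like this avoids hard-coding the domain, so it keeps working if a feed links to the root with or without a trailing slash.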

Add Support for Washington Post

29cd4a7

addie9800 requested changes Apr 27, 2024

src/fundus/publishers/us/__init__.py (resolved)

src/fundus/publishers/us/washington_post.py (resolved)

src/fundus/publishers/us/washington_post.py (outdated, resolved)

Fixes + added subheadlines + added additional sources

700c249

Fixed tests

3417745

Add URL Filter + formatting

4b9f56f

addie9800 approved these changes May 2, 2024

MaxDall merged commit 1996937 into flairNLP:master May 6, 2024
5 checks passed
