Add Support for "Washington Post" #467

Merged 4 commits on May 6, 2024
17 changes: 17 additions & 0 deletions docs/supported_publishers.md
@@ -786,5 +786,22 @@
</td>
<td>&#160;</td>
</tr>
<tr>
<td>
<code>WashingtonPost</code>
</td>
<td>
<div>Washington Post</div>
</td>
<td>
<a href="https://www.washingtonpost.com/">
<span>www.washingtonpost.com</span>
</a>
</td>
<td>
<code>topics</code>
</td>
<td>&#160;</td>
</tr>
</tbody>
</table>
15 changes: 15 additions & 0 deletions src/fundus/publishers/us/__init__.py
@@ -14,6 +14,7 @@
from .the_intercept import TheInterceptParser
from .the_nation_parser import TheNationParser
from .the_new_yorker import TheNewYorkerParser
from .washington_post import WashingtonPostParser
from .washington_times_parser import WashingtonTimesParser
from .world_truth import WorldTruthParser

@@ -117,6 +118,20 @@ class US(PublisherEnum):
        parser=WashingtonTimesParser,
    )

    WashingtonPost = PublisherSpec(
        name="Washington Post",
        domain="https://www.washingtonpost.com/",
        sources=[
            Sitemap("https://www.washingtonpost.com/sitemaps/sitemap.xml.gz"),
            NewsMap("https://www.washingtonpost.com/sitemaps/news-sitemap.xml.gz"),
            RSSFeed("https://feeds.washingtonpost.com/rss/world"),
            RSSFeed("https://feeds.washingtonpost.com/rss/national"),
        ],
        parser=WashingtonPostParser,
        # URL filter: ignore incomplete URLs that end at the bare domain, with or without a trailing slash
        url_filter=regex_filter(r"washingtonpost.com(\/)?$"),
    )

    TheNewYorker = PublisherSpec(
        name="The New Yorker",
        domain="https://www.newyorker.com/",
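The effect of the `url_filter` in the `WashingtonPost` spec can be illustrated with the standard `re` module alone. This is only a sketch: `regex_filter_sketch` is a hypothetical stand-in, and it assumes that a filter returning `True` means the URL is skipped, which this diff does not show.

```python
import re
from typing import Callable


def regex_filter_sketch(pattern: str) -> Callable[[str], bool]:
    """Hypothetical stand-in for a regex-based URL filter (not the fundus implementation)."""
    compiled = re.compile(pattern)

    def url_filter(url: str) -> bool:
        # Assumption: True means "skip this URL".
        return bool(compiled.search(url))

    return url_filter


# Same regular expression as in the publisher spec.
skip = regex_filter_sketch(r"washingtonpost.com(\/)?$")

homepage = "https://www.washingtonpost.com/"
article = "https://www.washingtonpost.com/weather/2024/04/26/storms-tornadoes-iowa-kansas-nebraska-oklahoma/"

print(skip(homepage))  # bare domain: matched by the pattern
print(skip(article))   # article URL: not matched
```

Because the pattern is anchored with `$`, only URLs that end right after the domain match; URLs with a path after the domain do not.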
39 changes: 39 additions & 0 deletions src/fundus/publishers/us/washington_post.py
@@ -0,0 +1,39 @@
import datetime
from typing import List, Optional

from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
)


class WashingtonPostParser(ParserProxy):
    class V1(BaseParser):
        _paragraph_selector = CSSSelector("div[data-qa='article-body'] > p, div[class='story relative'] > p")
        _summary_selector = CSSSelector("h2[data-qa='subheadline']")
        _subheadline_selector = CSSSelector("div[data-qa='article-body'] > h3[data-qa='article-header']> div")

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                paragraph_selector=self._paragraph_selector,
                summary_selector=self._summary_selector,
                subheadline_selector=self._subheadline_selector,
            )

        @attribute
        def title(self) -> Optional[str]:
            return self.precomputed.meta.get("og:title")

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.ld.bf_search("author"))

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.meta.get("article:published_time"))
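
The child combinator (`>`) in the paragraph selector restricts matches to `<p>` elements that are direct children of the article body. A rough standard-library illustration follows, using an XPath expression in place of the CSS selector. The HTML snippet is invented for this sketch and is not taken from the Washington Post page, and `xml.etree.ElementTree` requires well-formed markup, unlike the lxml-based parser above.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet that mimics the attributes the selectors rely on.
SNIPPET = """
<html><body>
  <h2 data-qa="subheadline">Summary line</h2>
  <div data-qa="article-body">
    <p>First paragraph.</p>
    <section><p>Nested, not a direct child.</p></section>
    <p>Second paragraph.</p>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SNIPPET)

# XPath counterpart of "div[data-qa='article-body'] > p": direct children only.
paragraphs = [p.text for p in root.findall(".//div[@data-qa='article-body']/p")]

# XPath counterpart of "h2[data-qa='subheadline']".
summary = [h.text for h in root.findall(".//h2[@data-qa='subheadline']")]

print(paragraphs)
print(summary)
```

The nested paragraph inside `<section>` is left out, which is the same effect the `>` combinator has in the CSS selectors.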
58 changes: 58 additions & 0 deletions tests/resources/parser/test_data/us/WashingtonPost.json
@@ -0,0 +1,58 @@
{
"V1": {
"authors": [
"Matthew Cappucci"
],
"body": {
"summary": [
"Strong tornadoes are possible across the central states both Friday and Saturday"
],
"sections": [
{
"headline": [],
"paragraphs": [
"Friday marks day two of a four-day-long barrage of severe storms set to threaten the central states, with the potential for strong tornadoes across portions of the Plains and Corn Belt.",
"Rotating thunderstorms or “supercells” are expected from Nebraska and Iowa as far south as the Ozarks of eastern Oklahoma and northwestern Arkansas, and an even more widespread tornado threat may materialize into Saturday from Texas to the Great Lakes.",
"Cities that could be affected by violent storms Friday and/or Saturday include Dallas, Oklahoma City, Tulsa, Wichita, Kansas City, Mo., Omaha, Des Moines and Milwaukee.",
"Severe thunderstorms, with a few embedded tornadoes, were rolling through eastern Oklahoma Friday morning. Those were the leftovers of Thursday’s storms. While hail was widespread across northwestern Kansas on Thursday, tornadoes largely failed to materialize, but that will probably change on Friday.",
"Friday will likely feature an arcing band of rotating thunderstorms that will swing from Omaha to Des Moines and could extend as far south as Kansas City. All three major cities are encapsulated in a Level 3 out of 5 risk of severe weather drawn up by the National Weather Service’s Storm Prediction Center. The agency warns that “all hazards will be possible, including tornadoes with some potentially strong, very large hail over two inches in diameter, and wind damage.”",
"On Saturday, storms will be more widespread, with at least some risk of severe weather from Northern Michigan to the Texas-Mexico border. A tornado is possible anywhere within that zone, but the risk is highest between roughly Kansas and Dallas, including Oklahoma City and Wichita, where a few intense tornadoes could form.",
"Thursday’s storms erupted as predicted in western Kansas, but an outflow boundary — or the leading edge of cool-air exhaust exiting earlier storms to the northeast — “undercut” and weakened most of the storms. Subsequently, they struggled to rotate. While golf ball- to tennis ball-sized hail was common, tornadoes were not. For the most part, only a few weak “landspout” tornadoes were observed in northeastern Colorado.",
"The active pattern doesn’t look to ease until the middle of next week, and even that’s not a guarantee. Small disturbances in the jet stream could trigger additional rounds of storms, albeit more localized, over the Plains next week."
]
},
{
"headline": [
"Friday’s storms"
],
"paragraphs": [
"Zone 1 — Corn Belt and Missouri Valley",
"A Level 3 out of 5 enhanced risk of severe weather covers northeastern Nebraska, southwestern Iowa, northeastern Kansas and northwestern Missouri. Omaha and Lincoln in Nebraska and Kansas City and St. Joseph in Missouri are within this zone. That’s where the greatest potential for a strong tornado or two exists.",
"Storms will fire during the afternoon on the leading edge of a dry slot, or a wedge of cool, dry air filtering in on the backside of a low pressure system in Nebraska. Ahead of that low, warm, humid air will waft north. The insurgence of dry air will kick that moisture upward into storms. Given proximity to the surface low, meanwhile, the atmosphere will be replete with spin.",
"It’s expected that a band of rotating storms will form around midafternoon near or west of Omaha, then travel east along Interstate 80 toward Des Moines before weakening. Thunderstorm coverage will decrease as one heads south, but a few rotating south with an attendant tornado risk are possible all the way south to Kansas City.",
"Storms will weaken by about 10 p.m. in eastern areas as they outrun the instability, or storm fuel, that gave rise to them.",
"Zone 2 — South of Kansas City to around Dallas",
"In this zone, storms probably won’t be as widespread because the rising air required to incite storms will be more concentrated to the north. However, a cold front trailing through the area could touch off scattered storms, which could be severe.",
"There are some signs that storms could be more numerous in Texas, including around Dallas, than originally anticipated. While spin won’t be overly impressive, plentiful storm fuel could foster clusters of storms capable of producing large hail and damaging winds. Those storms could fire as early as 1 p.m. Central time."
]
},
{
"headline": [
"Looking ahead"
],
"paragraphs": [
"A second storm system is already brewing in the wake of the first, and some meteorologists expect it to be stronger. A low pressure system will garner strength in eastern Colorado and will eject onto the Plains on Saturday.",
"Southerly winds ahead of it will rapidly scoop warm and moist air northward. That will help the atmosphere to reload, with a return of thunderstorm fuel in some cases just 12 to 18 hours after Friday’s storms depart.",
"The low will also drag a dryline eastward where storms are expected to erupt in the afternoon. Storms that sprout will probably grow tall enough to begin to rotate because of a strong jet stream roaring overhead.",
"A limiting factor may be morning storms. If they prove widespread, they could gobble up some of the storm fuel that otherwise would have been utilized by stronger afternoon storms. Likewise, morning storms would cut back on sunshine. Sunshine is integral to heating the ground and helping cook up more storm fuel.",
"Nonetheless, the Storm Prediction Center has outlined a Level 3 out of 5 risk of severe weather from Des Moines all the way south to Highway 287 northwest of Dallas on Saturday. Kansas City, Wichita, Tulsa and Oklahoma City are in that zone.",
"“The greatest threat is currently anticipated across parts of the central and southern Plains, where very large hail, damaging winds, and a few strong tornadoes will be possible,” wrote the center. “A larger area of potential threat will extend from south-central Texas north-northeastward into the Great Lakes.”",
"More storms are expected Sunday, but the timing and placement remain uncertain and depend heavily on how Saturday’s storms evolve. For now, expect scattered severe storms with the potential for damaging winds and hail from the Corn Belt to East Texas."
]
}
]
},
"publishing_date": "2024-04-26 15:24:04.336000+00:00",
"title": "Strong tornadoes possible in severe storm outbreak in central states"
}
}
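
The `publishing_date` in the expected data is stored in the string form of a timezone-aware `datetime`. The sketch below parses an abbreviated copy of the JSON above (the body is omitted) and shows that the value round-trips through the standard library. How the fundus test suite actually loads and compares this file is not shown in this diff, so treat this purely as an illustration of the data format.

```python
import datetime
import json

# Abbreviated copy of the expected data above; the article body is left out.
RAW = """
{
  "V1": {
    "authors": ["Matthew Cappucci"],
    "publishing_date": "2024-04-26 15:24:04.336000+00:00",
    "title": "Strong tornadoes possible in severe storm outbreak in central states"
  }
}
"""

expected = json.loads(RAW)["V1"]

# The stored value is what str() yields for an aware datetime, so it can be parsed back.
published = datetime.datetime.fromisoformat(expected["publishing_date"])

print(published.utcoffset())                        # offset carried by the stored value
print(str(published) == expected["publishing_date"])  # round-trip check
```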
Binary file not shown.
4 changes: 4 additions & 0 deletions tests/resources/parser/test_data/us/meta.info
@@ -63,6 +63,10 @@
"url": "https://www.newyorker.com/news/the-control-of-nature/a-heat-shield-for-the-most-important-ice-on-earth",
"crawl_date": "2023-05-16 14:55:06.350230"
},
"WashingtonPost_2024_04_26.html.gz": {
"url": "https://www.washingtonpost.com/weather/2024/04/26/storms-tornadoes-iowa-kansas-nebraska-oklahoma/",
"crawl_date": "2024-04-26 17:38:15.805977"
},
"WashingtonTimes_2023_04_28.html.gz": {
"url": "https://www.washingtontimes.com/news/2023/apr/28/indiana-governor-endorses-revised-gop-state-budget/?utm_source=RSS_Feed&utm_medium=RSS",
"crawl_date": "2023-04-28 20:33:11.404979"
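
The keys in `meta.info` appear to combine the publisher name with the crawl date. The sketch below reproduces that pattern; the naming rule is inferred from the entries visible in this hunk rather than confirmed by the project, and `resource_name` is a hypothetical helper.

```python
import datetime


def resource_name(publisher: str, crawl_date: str) -> str:
    """Hypothetical helper: build '<Publisher>_<YYYY_MM_DD>.html.gz' from a crawl timestamp."""
    crawled = datetime.datetime.fromisoformat(crawl_date)
    return f"{publisher}_{crawled:%Y_%m_%d}.html.gz"


# Values copied from the meta.info entries in this hunk.
print(resource_name("WashingtonPost", "2024-04-26 17:38:15.805977"))
print(resource_name("WashingtonTimes", "2023-04-28 20:33:11.404979"))
```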