-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added US publisher Rolling Stone #453
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding 👍 Looks really good!
|
||
class RollingStoneParser(ParserProxy): | ||
class V1(BaseParser): | ||
_paragraph_selector = CSSSelector("div.a-content > * > p.paragraph") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the lists have a slightly different structure than normal articles. The CSSSelector is fixed and works for both now
https://github.com/brandjakHU/fundus/blob/e8918df4fca60325ae08b4cfb56b09559be83efc/src/fundus/publishers/us/rolling_stone.py#L18
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing this. Unfortunately, they still don't work for the articles I posted, further, the new selector selects a lot of empty paragraphs. I would suggest to use div.a-content p.paragraph
which parses both correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, i changed it to div.a-content p.paragraph
. I originally used only p
because i found one list article which uses <p>
nodes without the paragraph
-class (https://www.rollingstone.com/music/music-lists/best-beyonce-songs-1378620/find-your-way-back-2019-1380991/). However, all other articles i looked at use class=paragraph
, so maybe it is better to select like this and avoid empty paragraphs?
class V1(BaseParser): | ||
_paragraph_selector = CSSSelector("div.a-content > * > p.paragraph") | ||
_summary_selector = CSSSelector("article > header > div.article-excerpt") | ||
_subheadline_selector = CSSSelector("article > header > div.article-kicker") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find an example article for the subheadline selector, could you provide one in this comment?
Also, the subheadlines included in this article are not parsed properly:
https://www.rollingstone.com/politics/politics-news/donald-trump-indictments-lawsuits-legal-cases-explained-1234767770/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I interpreted subheadline as the smaller headline on top of the title, in the case of the upper article it would be "Defendant Don". The parsing is now fixed
https://github.com/brandjakHU/fundus/blob/e8918df4fca60325ae08b4cfb56b09559be83efc/src/fundus/publishers/us/rolling_stone.py#L20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Fundus, subheadlines
are headlines that separate paragraphs. In the example article i posted the subheadlines would be Paying Off a Porn Star
, Election Interference and Jan. 6
, ...
I would suggest div.a-content h2.heading, div.a-content h2.c-gallery-vertical-featured-image__title
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. I added div.a-content h2.heading
. For the other subheadlines, the class h2.c-gallery-vertical-featured-image__title
could not be parsed because it is not in the original sourcecode (there, the content is only h2
without any classes, maybe the class is added with a script before being shown in the browser). Instead, i selected via the identifier that is used by all of these lists with div.a-content div#pmc-gallery-vertical h2
No description provided.