Scraper post-process script for RSSGuard ( https://github.com/martinrotter/rssguard )
Arguments - each is a CSS selector ( https://www.w3schools.com/cssref/css_selectors.asp ):
- item
- item title (optional - otherwise the link's text is used as the title)
- item description (optional - otherwise all the text inside the item is used as the description)
- item link (optional - otherwise the 1st link found inside the item is used (or the item itself if it's a link))
- item title 2nd part (optional (or auto-enabled if the static main title / multilink option is used), otherwise just the title is used, e.g. the title is "Batman" and the 2nd part is "chapter 94")
- item date (optional, otherwise everything would be dated "just now") - aim this selector either at text nodes (e.g. `span`) or at elements (`a`, `img`) with a `title` or `alt` attribute containing the date (e.g. "New!" flashing image badges that show the date when hovered over)
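Conceptually, these arguments map onto BeautifulSoup's CSS selector API. A minimal sketch of how the first four could be applied (the HTML, class names and variable names here are illustrative, not the script's actual code):

```python
from bs4 import BeautifulSoup

# Illustrative HTML; the class names are made up for the example.
html = """
<div class="item">
  <h2 class="title">Batman</h2>
  <span class="desc">chapter 94 is out</span>
  <a href="/batman/94">read</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

for item in soup.select("div.item"):        # argument 1: item
    title = item.select_one("h2.title")     # argument 2: item title
    desc = item.select_one("span.desc")     # argument 3: item description
    link = item.select_one("a") or item     # argument 4: 1st found link by default
    print(title.get_text(), "->", link["href"])
```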
- for 1) item: `@` at the start - enables searching for multiple links inside the found item, e.g. one `div` item with multiple `a` links inside it that you want turned into separate feed items
- for everything after 1) item: `~` as the whole argument - lets the script decide what to do (the default action), e.g. use the 1st link found inside the item, use the whole text inside the item as the description, etc. (not actually an option, but rather a placeholder format for the argument line), e.g. `python css2rss.py div.itemclass ~ span.description` (here the link's inner text (2nd argument) will be used as the title by the default action, but the description is still looked for (3rd argument))
- for 2) title, 5) item title 2nd part and 3) item description: `!` at the start - makes it a static specified value (whatever follows the `!`), e.g. `"!my title"`; if you make the 1st part of the title static, the 2nd-part title addon gets auto-enabled and uses the text inside the found link as the 2nd part (unless you manually specify what to use as the 5th argument)
- for 2) title and 5) item title 2nd part: `$` at the start - evaluates a Python expression instead of using a CSS selector; it uses the found item link as the starting point and takes the text from it via `eval("tLink." + your_inputted_argument).text` - see https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for things you can do with it, e.g. go one level up (to the parent element) or over to the next element, or select things CSS selectors can't (such as bare text nodes), see the example below
- for 6) date: `?` at the start - tells the parser to expect the American date format "Month/Day/Year"
- 1) item is searched for in the whole document, and the rest is searched inside the found item node (but you can make the item selector point right at the `a` hyperlink - it will then be used as the link by default)
- enclose arguments containing spaces in double quotation marks `"`, e.g. `python css2rss.py div.class "div.subclass > h1.title" span.description` (btw, you can also enclose arguments without any spaces in quotation marks if you'd like). Warning: starting from RSSGuard v4.5.2, which supports single quotation marks `'` as well, you have to either use single quotation marks `'` to enclose arguments and pass them as-is, or escape backslashes and double quotation marks with backslashes, e.g. `python css2rss.py "\\:argument starting with\\:"` or `python css2rss.py '\:argument starting with\:'`
- if no item is found, a feed item is generated containing the HTML dump of the whole page so you can see what went wrong (e.g. a Cloudflare block page)
- content you need to log in to see is available - the scraper uses RSSGuard's cookies, so if you log into a website using RSSGuard's built-in browser, the scraper can access that content as well and scrape it into the feed
- no JavaScript runs on scraped pages, so sites that populate their content with JavaScript can't be scraped; instead their initial version (what you'd see via right click -> view page source) gets scraped. You can try to get the needed content from other pages of the site, e.g. the main page, a releases page or even the search page - one of them may be static and not built with JavaScript
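The `$` eval option described above boils down to evaluating a BeautifulSoup expression rooted at the found link tag. A rough sketch, assuming the script's internal variable is named `tLink` as the snippet in the option description suggests:

```python
from bs4 import BeautifulSoup

html = '<p>chapter 94 <a href="/x">read</a></p>'
tLink = BeautifulSoup(html, "html.parser").find("a")  # the found item link

# "$contents[0]" as an argument would select the bare text node inside <a>,
# something a CSS selector cannot address:
user_arg = "contents[0]"
title_node = eval("tLink." + user_arg)

# "$parent" would go one level up, to the enclosing <p> element:
parent = eval("tLink." + "parent")
print(str(title_node), "|", parent.get_text())
```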
1.1. Have Python 3 or newer ( https://www.python.org/downloads/ ) installed (and added to PATH during install)
1.2. Have BeautifulSoup ( https://www.crummy.com/software/BeautifulSoup/ ) installed (Win+R -> cmd -> `pip install beautifulsoup4`)
1.3. (optional) If you'd like to parse dates for articles, have Maya ( https://github.com/timofurrer/maya/ ) installed (right-click the Start menu -> run PowerShell as administrator -> `pip install maya`)
2. Put css2rss.py into your `data4` folder (so you can call the script with just `python css2rss.py`; otherwise you'd need to specify the full path to the `.py` file)
- a simple link makeover into an RSS feed (right-click a link -> inspect element -> use its CSS selector):
url: https://www.foxnews.com/media
script: python css2rss.py ".title > a"
(an `a` link right inside an element with the `title` class)
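For context, a post-process script like this is expected to print a feed to stdout for RSSGuard to parse. A hand-rolled sketch of what a single generated item might look like, assuming an RSS 2.0 shape (not the script's exact output):

```python
# Hypothetical one-item RSS 2.0 document, similar in spirit to what
# css2rss.py would print to stdout (the exact markup may differ).
title = "Sample headline"
link = "https://www.foxnews.com/media/sample"
rss = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<rss version="2.0"><channel>'
    "<title>scraped feed</title>"
    f"<item><title>{title}</title><link>{link}</link></item>"
    "</channel></rss>"
)
print(rss)
```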
- the reason for implementing static titles
url: https://kumascans.com/manga/sokushi-cheat-ga-saikyou-sugite-isekai-no-yatsura-ga-marude-aite-ni-naranai-n-desu-ga/
script: python css2rss.py ".eph-num > a" "!Sokushi Cheat" ".chapterdate" ~ ".chapternum"
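The static-title mechanic here can be sketched as a simple concatenation (the exact separator the script uses is an assumption):

```python
# "!Sokushi Cheat" fixes the 1st title part; the 2nd part then comes from
# the element matched by the 5th argument (".chapternum" in this example).
static_part = "Sokushi Cheat"   # from the "!..." argument, "!" stripped
second_part = "Chapter 94"      # illustrative text from ".chapternum"
feed_title = f"{static_part} {second_part}"  # assumed space separator
print(feed_title)
```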
- the reason for implementing searching multiple links inside one item
url: https://www.asurascans.com/
script: python css2rss.py "@.uta" "h4" img "li > a" "li > a"
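The `@` multilink mode can be approximated like this: one matched item containing several links, each becoming its own feed entry (the HTML and names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<div class="uta">
  <h4>Some Series</h4>
  <ul>
    <li><a href="/ch/1">Chapter 1</a></li>
    <li><a href="/ch/2">Chapter 2</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
item = soup.select_one("div.uta")   # "@.uta": one item is found...
links = item.select("li > a")       # ...but every matched link is a feed entry
entries = [f"{item.h4.get_text()} - {a.get_text()}" for a in links]
print(entries)
```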
- the reason for implementing eval expressions for titles (since CSS selectors can't select text nodes outside any tags)
url: https://reaperscans.com/
script: python css2rss.py "@div.space-y-4:first-of-type div.relative.bg-white" "p.font-medium" "img" "a.border" "$contents[0]"
url: https://reader.kireicake.com/
script: python css2rss.py @.group a[href*='/series/'] .meta_r ".element > .title a" ".element > .title a"
- example of parsing dates for articles; here the date selector uses OR and looks for either an `a` element (the "New!" badge) with the date inside its tooltip (`title` or `alt`) OR a `span` element without any child nodes (both elements have the class `.post-on`):
url: https://drakescans.com/
script: python css2rss.py "@.page-item-detail" ".post-title a" "img" "span.chapter > a" ~ ".post-on > a,.post-on:not(:has(*))"
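Why the `?` (American format) flag exists: a purely numeric date is ambiguous, as this stdlib-only illustration shows (Maya does the actual parsing in the script; the fallback order without the flag is an assumption here):

```python
from datetime import datetime

raw = "03/04/2024"  # is this March 4th or April 3rd?
american = datetime.strptime(raw, "%m/%d/%Y")  # with "?": Month/Day/Year
other = datetime.strptime(raw, "%d/%m/%Y")     # the other common reading
print(american.date(), "vs", other.date())
```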
- the workaround to scrape sites that serve their content via JavaScript (the workaround is to find a static page - right-click -> view page source - and check whether your text is already there; if it is, the page is static and not injected later via JS):
url: https://manhuaus.com/?s=Wo+Wei+Xie+Di&post_type=wp-manga&post_type=wp-manga
script: python css2rss.py ".latest-chap a" "!I'm an Evil God"