hacker_news_scraper

A Python 3 script that scrapes the Hacker News feed and filters the posts by any combination of:

- number of points
- number of comments
- a keywords list that excludes posts {dead | flagged | youtube | wikipedia | ...}
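
The filters listed above could be combined roughly as follows. This is only a minimal sketch, not the code in hn.py: the `Post` class, the `keep_post()` function, the threshold parameters, and the example keywords are all hypothetical names chosen for illustration, and it assumes a post is dropped when any keyword appears in its title or URL.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical container for one scraped Hacker News entry."""
    title: str
    url: str
    points: int
    comments: int


# Assumed example keywords; the real list lives in the script itself.
EXCLUDE_KEYWORDS = ["[dead]", "[flagged]", "youtube", "wikipedia"]


def keep_post(post, min_points=0, min_comments=0, exclude=EXCLUDE_KEYWORDS):
    """Return True if the post passes every enabled filter.

    A threshold of 0 effectively disables that filter, so points,
    comments, and keywords can be used alone or together.
    """
    text = f"{post.title} {post.url}".lower()
    # Keyword filter: case-insensitive substring match on title and URL.
    if any(keyword.lower() in text for keyword in exclude):
        return False
    # Threshold filters on points and comments.
    return post.points >= min_points and post.comments >= min_comments


posts = [
    Post("Show HN: a small tool", "https://example.com/tool", 150, 40),
    Post("A video", "https://youtube.com/watch?v=1", 500, 200),
    Post("A quiet post", "https://example.org/quiet", 3, 0),
]
kept = [p.title for p in posts if keep_post(p, min_points=100, min_comments=10)]
```

With these sample values only the first post is kept: the second matches an excluded keyword and the third falls below both thresholds.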

Run via ~/.bashrc alias or crontab (see notes near top of script).

Sample output: hn.txt

Updates

I added a script, hn-regex_test.py, for testing regular expressions against the "hn.txt" output file:
- hn.txt output (raw, before postprocessing): hn.2020.05.03.raw.txt
- hn.txt output (after postprocessing): hn.2020.05.03.postprocessed.txt
I added a dictionary and a method, multiple_replace(), to "hn.py" for postprocessing various annoyances, e.g., the BeautifulSoup "smart quotes" that end up in the "hn.txt" output file.
I scheduled the following in /etc/crontab, which lets me read the output (and save daily snapshots of it) in my mail client (Claws Mail, where the URLs are active):

# At 6:05 am [https://crontab.guru/#5_6_*_*_*]:
5    6    *    *    *    victoria    nice -n 19    mutt -e "set content_type=text/text" -s 'HackerNews' mail@VictoriasJourney.com -i /mnt/Vancouver/programming/python/scripts/output/hn.txt

mutt arguments:

- -e : run a configuration command (here, setting the content type)
- -s : subject
- -i : include file as message body
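
The dictionary-driven postprocessing mentioned in the Updates could look something like the sketch below. It is an illustration under stated assumptions, not the code from hn.py: the contents of `REPLACEMENTS` and the exact signature of `multiple_replace()` are guesses, and a single compiled regular expression is just one common way to apply many substitutions in one pass.

```python
import re

# Assumed replacement table: typographic characters mapped to plain ASCII.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
}


def multiple_replace(text, replacements=REPLACEMENTS):
    """Apply every replacement in the dictionary in a single pass."""
    if not replacements:
        return text
    # Longest keys first, so a longer key wins over a shorter prefix of it.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


cleaned = multiple_replace("\u201cIt\u2019s a test\u201d")
```

Doing all substitutions in one pass avoids the problem of chained `str.replace()` calls, where the output of one replacement can accidentally be rewritten by a later one.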

victoriastuart/hacker_news_scraper

Folders and files

Latest commit

History

Repository files navigation

hacker_news_scraper

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages