This project is a crawler and website builder. It crawls Hacker News and stores information about each story in an SQL table (PostgreSQL, for that matter). The point of the crawler is to display stories that were flagged, hence its name.
Here is the generated page: red flagged hacker news
The HN moderators are doing a good job, mostly - as you can see.
Let's wait and see what the reaction on HN proper will be... Here is the HN submission, and another one here; let's see if it ends up being flagged too (that would make it recursive ;-)
There is still a point to using this stuff: if you want to create your own database of HN posts, here is the usage.
This program requires PostgreSQL and the psycopg2 package.
I installed this stuff as follows:
brew install postgresql
pip3 install psycopg2-binary --force-reinstall --no-cache-dir
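To sanity-check the install, this should print the psycopg2 version without complaining:
import psycopg2
print(psycopg2.__version__)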
To start the PostgreSQL server locally:
brew services start postgresql
To create the db:
createdb rf-hn
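If you'd rather do it from Python, something like this should work too (note the double quotes around rf-hn, which Postgres needs because of the hyphen):
import psycopg2

conn = psycopg2.connect(dbname="postgres")  # the default db; add user/host if your setup needs it
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute('CREATE DATABASE "rf-hn"')
conn.close()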
To create the db table, run the crawler with the -i option; the crawl subcommand runs the crawler.
(there are additional options to change the db host, db name, etc).
./crawl.py crawl -i
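For a rough idea of what the crawl subcommand does, here is a minimal sketch of crawling a range of entry ids: fetch each item from the official HN Firebase API and insert it into Postgres. The stories table, its columns, and the use of the API rather than scraping the item pages are my assumptions for illustration, not necessarily how crawl.py actually does it.
import json
import urllib.request

import psycopg2

API = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def crawl_range(from_entry, to_entry, db="rf-hn", user="michaelmo", host="localhost"):
    # Assumed schema: stories(entry_id unique, title, url, author, posted, dead)
    conn = psycopg2.connect(dbname=db, user=user, host=host)
    with conn, conn.cursor() as cur:  # commits on clean exit
        for entry_id in range(from_entry, to_entry, -1):  # highest id first
            with urllib.request.urlopen(API.format(entry_id)) as resp:
                item = json.load(resp)
            if not item:  # the API returns null for nonexistent ids
                continue
            cur.execute(
                """INSERT INTO stories (entry_id, title, url, author, posted, dead)
                   VALUES (%s, %s, %s, %s, to_timestamp(%s), %s)
                   ON CONFLICT (entry_id) DO NOTHING""",
                (entry_id, item.get("title"), item.get("url"),
                 item.get("by"), item.get("time"), item.get("dead", False)),
            )
    conn.close()

crawl_range(40000000, 39990000)  # made-up ids, just for illustration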
Run the page generator (after the crawler):
./crawl.py format
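In the same hedged spirit, a sketch of what format might do: select the flagged rows and render a static page. The stories table, the dead column, and the output filename are the same illustrative assumptions as above.
import html

import psycopg2

def format_page(db="rf-hn", user="michaelmo", host="localhost", out="index.html"):
    conn = psycopg2.connect(dbname=db, user=user, host=host)
    with conn, conn.cursor() as cur:
        # Assumed: a boolean dead column marks flagged/killed stories
        cur.execute("SELECT entry_id, title, url FROM stories "
                    "WHERE dead ORDER BY entry_id DESC")
        rows = cur.fetchall()
    conn.close()
    with open(out, "w") as f:
        f.write("<html><body><h1>hacker news - red flag edition</h1><ul>\n")
        for entry_id, title, url in rows:
            # Fall back to the HN item page when a story has no external url
            link = url or f"https://news.ycombinator.com/item?id={entry_id}"
            f.write(f'<li><a href="{html.escape(link)}">'
                    f'{html.escape(title or "(untitled)")}</a></li>\n')
        f.write("</ul></body></html>\n")

format_page()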
Here is the help blurb:
usage: crawl.py [-h] [--verbose] [--db DB] [--user USER] [--host HOST]
                [--prompt]
                {crawl,oldcrawl,format,db} ...

Scanner for the 'hacker news - red flag edition' project

positional arguments:
  {crawl,oldcrawl,format,db}
    crawl               crawl hn (new crawler, crawls a range of entry ids)
    oldcrawl            crawl hn (old crawler, crawl the front page, then
                        crawl the next page, etc)
    format              format the site
    db                  db commands

optional arguments:
  -h, --help            show this help message and exit
  --verbose, -v         trace all commands, verbose output (default: False)
  --db DB, -b DB        set postgres db name (for db connect) (default: rf-hn)
  --user USER, -u USER  set postgres db user (for db connect) (default:
                        michaelmo)
  --host HOST, -n HOST  set postgres host (for db connect) (default:
                        localhost)
  --prompt, -p          prompts for the db password (default: False)
Help for the crawl subcommand:
usage: crawl.py crawl [-h] [--from FROM_ENTRY] [--to TO_ENTRY] [--init]

optional arguments:
  -h, --help            show this help message and exit
  --from FROM_ENTRY, -f FROM_ENTRY
                        set highest entry id to start crawl (default: find the
                        highest and start with it)
  --to TO_ENTRY, -t TO_ENTRY
                        set lowest entry id to stop the crawl at
  --init, -i            first run, create db table
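For example, to crawl a specific range of entry ids (the ids below are made up):
./crawl.py crawl --from 40000000 --to 39990000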
Help for the oldcrawl subcommand:
usage: crawl.py oldcrawl [-h] [--init] [--maxpage MAXPAGE] [--tab TAB]

optional arguments:
  -h, --help            show this help message and exit
  --init, -i            first run, create db table
  --maxpage MAXPAGE, -m MAXPAGE
                        maximum number of pages to crawl
  --tab TAB, -t TAB     tab to crawl (0 - newest, 1 - new, 2 - ask, 3 - show)
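For example, to crawl up to 10 pages of the newest tab:
./crawl.py oldcrawl --maxpage 10 --tab 0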
Help for the format subcommand:
usage: crawl.py format [-h] [--format]

optional arguments:
  -h, --help            show this help message and exit
  --format, -f          format the page from db content
Help for the db subcommand:
usage: crawl.py db [-h] [--min-entryid] [--max-entryid]

optional arguments:
  -h, --help            show this help message and exit
  --min-entryid, -m     show entry_id of the oldest entry
  --max-entryid, -x     show entry_id of the newest entry
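For example, to see the range of entry ids currently in the db:
./crawl.py db --min-entryid
./crawl.py db --max-entryid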