usage: internet_archive.py [-h] [-c CONFIG] [-d DIR] [--overwritten]
                           [-s] [--compress] [--selenium]
                           input

Homepages scraper

positional arguments:
  input

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Configuration file
  -d DIR, --dir DIR     Output directory for HTML files
  --overwritten         Overwrite if the HTML file already exists
  -s, --statistics      Run the script to count the number of snapshots
  --compress            Compress downloaded HTML files
  --selenium            Use Selenium to download dynamic HTML content
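The command-line interface above can be mirrored with a short argparse sketch. This is a hypothetical reconstruction of the parser, not the script's actual source; option names and help strings follow the usage dump:

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of the internet_archive.py CLI shown above.
    parser = argparse.ArgumentParser(description="Homepages scraper")
    parser.add_argument("input")
    parser.add_argument("-c", "--config", help="Configuration file")
    parser.add_argument("-d", "--dir", help="Output directory for HTML files")
    parser.add_argument("--overwritten", action="store_true",
                        help="Overwrite if the HTML file already exists")
    parser.add_argument("-s", "--statistics", action="store_true",
                        help="Run the script to count the number of snapshots")
    parser.add_argument("--compress", action="store_true",
                        help="Compress downloaded HTML files")
    parser.add_argument("--selenium", action="store_true",
                        help="Use Selenium to download dynamic HTML content")
    return parser

# Example invocation matching the pipeline below (file names illustrative).
args = build_parser().parse_args(["homepage.csv", "-d", "internet_archive", "--compress"])
print(args.input, args.dir, args.compress)
```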
Uses the same input file as the homepage scraping step.
usage: process_ia_homepage.py [-h] [-o OUTPUT] [--with-header] [--with-text]
                              [--unique]
                              directory

Parse Homepage and Download Article

positional arguments:
  directory             Scraped homepages directory

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file name
  --with-header         Write a header row as the first row of the output
  --with-text           Download the article text
  --unique              Keep only unique article links
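The effect of --unique can be illustrated with a small sketch. This assumes deduplication is keyed on the article URL and keeps the first occurrence; the rows are illustrative, not real output:

```python
# Sample parsed link rows with one duplicate URL (illustrative data).
rows = [
    {"url": "https://example.com/a", "link_text": "A"},
    {"url": "https://example.com/b", "link_text": "B"},
    {"url": "https://example.com/a", "link_text": "A again"},
]

# Assumed --unique behavior: keep only the first row for each URL.
seen = set()
unique_rows = []
for row in rows:
    if row["url"] not in seen:
        seen.add(row["url"])
        unique_rows.append(row)

print(len(unique_rows))  # → 2
```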
usage: process_ia_top10.py [-h] [-o OUTPUT] [--with-header] [--with-text]
                           directory

Parse Homepage and Download Article

positional arguments:
  directory             Scraped homepages directory

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file name
  --with-header         Write a header row as the first row of the output
  --with-text           Download the article text
"date","time","src","order","url","link_text","homepage_keywords","path","title","text","top_image","authors","summary","keywords"
  date               Date
  time               Time
  src                News source
  order              Link's order on the page
  url                Link's URL
  link_text          Link's text
  homepage_keywords  Keywords of the page
  path               Path to the downloaded article
  title              Title of the article
  text               Text of the article
  top_image          Top image of the article
  authors            Authors of the article
  summary            Summary of the article
  keywords           Keywords of the article
Note that the columns from path through keywords contain no data if --with-text is not specified.
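The output can be read with the standard csv module. Below, a one-row sample is parsed as it would look without --with-text, so path through keywords come back empty; the row values are illustrative:

```python
import csv
import io

header = ("date,time,src,order,url,link_text,homepage_keywords,"
          "path,title,text,top_image,authors,summary,keywords")
# Illustrative row produced without --with-text: the last seven columns are empty.
sample = header + '\n"2012-11-01","12:00","nytimes","1","https://example.com/a","A","politics",,,,,,,\n'

reader = csv.DictReader(io.StringIO(sample))
row = next(reader)
print(row["src"], repr(row["text"]))  # → nytimes ''
```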
- 2012
homepage.csv ⇨ internet_archive.py ⇨ internet_archive (HTML files) ⇨ process_ia_homepage.py ⇨ ia-homepage-html-2012.tar.gz + ia-output-homepage-2012-text.csv.gz
- 2016
homepage.csv ⇨ internet_archive.py ⇨ internet_archive (HTML files) ⇨ process_ia_homepage.py ⇨ ia-output-homepage-2016-text.csv.gz
politics_homepage.csv ⇨ internet_archive.py ⇨ ia-politics-html.tar.gz (HTML files from 2012 to 2016/08/15)
⇨ process_ia_homepage.py ⇨ ia-output-homepage-2016-text.csv.gz
homepage_top10.csv ⇨ internet_archive.py ⇨ ia-top10-html.tar.gz (HTML files)
⇨ process_ia_top10.py ⇨ ia-news-top10-html.tar.gz (HTML files) + ia-output-top10-text-all.csv
politics_homepage_top10.csv ⇨ internet_archive.py ⇨ ia-top10-html.tar.gz (HTML files)
⇨ process_ia_top10.py ⇨ ia-politics-top10-html.tar.gz + ia-output-politics-top10-text-all.csv
nyt_ia_jsonp-topnews.csv ⇨ internet_archive.py ⇨ HTML files
⇨ process_ia_jsonp_topnews.py ⇨ HTML files + output_nyt_ia_jsonp_topnews.csv
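The .tar.gz and .csv.gz artifacts in the pipelines above can be produced with the standard library alone. A minimal sketch, with illustrative file names rather than the scripts' actual code:

```python
import gzip
import os
import tarfile
import tempfile

tmp = tempfile.mkdtemp()
html_path = os.path.join(tmp, "snapshot.html")
with open(html_path, "w") as f:
    f.write("<html><body>homepage</body></html>")

# Bundle scraped HTML files into a .tar.gz, as in ia-homepage-html-2012.tar.gz.
tar_path = os.path.join(tmp, "ia-homepage-html.tar.gz")
with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(html_path, arcname="snapshot.html")

# Gzip a CSV output, as in ia-output-homepage-2012-text.csv.gz.
csv_path = os.path.join(tmp, "output.csv.gz")
with gzip.open(csv_path, "wt") as f:
    f.write("date,time,src\n2012-11-01,12:00,nytimes\n")

with tarfile.open(tar_path) as tar:
    names = tar.getnames()
print(names)  # → ['snapshot.html']
```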