Two types of scraping and parsing jobs:
- Current (continuous, until Election Day 2016)
- Organization
- What is being collected?
- Homepages
- Politics Homepages
- Top 10
- Past (Internet Archive)
The final data is posted here
-
Homepage
- current-homepage-html.tar.gz
- current-output-homepage.csv
-
Politics Homepage
- current-politics-homepage-html.tar.gz
- current-output-politics-homepage.csv
-
Top 10
- current-top10-html.tar.gz (split by news_org/harvested pages from links to top10)
- current-output-top10.csv
Three kinds of things being scraped:
-
html file name = [fox]_ + date_time + ....
-
CSV fields: date, time, src, order, url, link_text, homepage_keywords, path, title, text, top_image, authors, summary, keywords (Please note that we did not scrape and parse the articles, so all fields from path onward will be empty)
-
Sources
-
html file name =
- [fox_politics] + date_time + ....
- For WP last link (the most): wp_themost + date_time
-
CSV fields: date, time, src, order, url, link_text, homepage_keywords, path, title, text, top_image, authors, summary, keywords (Please note that we did not scrape and parse the articles, so all fields from path onward will be empty)
-
Sources
- https://www.washingtonpost.com/politics/
- http://www.nytimes.com/pages/politics/index.html
- http://www.huffingtonpost.com/section/politics
- http://www.foxnews.com/politics.html
- http://www.wsj.com/news/politics
- https://www.yahoo.com/news/politics/
- http://www.usatoday.com/news/politics/
- https://www.washingtonpost.com/pb/themost/
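As a rough illustration of how the homepage and politics-homepage CSVs could be read, here is a minimal Python sketch. It assumes the files carry a header row with the field names listed above; the sample row, its date and time formats, and the helper name `read_links` are hypothetical and only serve the demonstration.

```python
import csv
import io

# Column order as listed above for the homepage and politics-homepage CSVs.
FIELDS = ["date", "time", "src", "order", "url", "link_text",
          "homepage_keywords", "path", "title", "text", "top_image",
          "authors", "summary", "keywords"]

# Fields from "path" onward are expected to be empty in these files,
# because the linked articles were not scraped or parsed.
ARTICLE_FIELDS = FIELDS[FIELDS.index("path"):]


def read_links(handle):
    """Yield one dict per row, keeping only the link-level fields."""
    for row in csv.DictReader(handle):
        yield {k: v for k, v in row.items() if k not in ARTICLE_FIELDS}


# Hypothetical sample data (assumed formats), written with csv.writer so
# the quoting matches what csv.DictReader expects.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerow(["2016-08-12", "09:00", "fox", "1",
                 "http://www.foxnews.com/politics.html",
                 "Example headline", ""] + [""] * len(ARTICLE_FIELDS))
buffer.seek(0)

rows = list(read_links(buffer))
print(rows[0]["src"], rows[0]["order"], rows[0]["link_text"])
```

To read a real file, pass an open handle instead of the in-memory buffer, for example `open("current-output-politics-homepage.csv", newline="", encoding="utf-8")`; the encoding is an assumption.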
-
html file name + folder structure:
- Multiple folders, one for each news org. Each file holds the gzip-compressed original HTML of one harvested page; name of each file = three_letter_src_name_date_time_order.html.gz
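The naming scheme above can be unpacked with a few lines of Python. This is only a sketch: it assumes underscores separate the parts, that the source name is the leading token and the list position the trailing one, and it treats everything in between as an opaque date_time string because the exact timestamp format is not stated here. The file name used in the round trip is hypothetical.

```python
import gzip
import os
import tempfile

SUFFIX = ".html.gz"


def split_name(file_name):
    """Split 'src_date_time_order.html.gz' into (src, timestamp, order)."""
    if not file_name.endswith(SUFFIX):
        raise ValueError(f"unexpected file name: {file_name}")
    stem = file_name[: -len(SUFFIX)]
    src, rest = stem.split("_", 1)          # leading token: source name
    timestamp, order = rest.rsplit("_", 1)  # trailing token: rank on the list
    return src, timestamp, int(order)


def read_html(path):
    """Return the decompressed HTML of one harvested page as text."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return fh.read()


# Round trip on a hypothetical file laid out as <news_org>/<file>.
with tempfile.TemporaryDirectory() as tmp:
    name = "nyt_20160812_0900_3.html.gz"  # assumed date_time format
    path = os.path.join(tmp, "nyt", name)
    os.makedirs(os.path.dirname(path))
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write("<html><title>example</title></html>")
    parsed = split_name(name)
    html = read_html(path)

print(parsed)
```

Source names that themselves contain an underscore would need a different split rule.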
-
Summary of Data Being Collected
- Top 5 overall WSJ
- Top 4 overall USA
- Top 10 'trending' huffpost
- Top 5 politics, overall WP
- Top 10 overall, national, politics for NYT and Fox
- Top 10 overall for yahoo news, AP, reuters, google news
-
CSV fields: date, time, src, url, order on the list, text of the link, title of the article, path to local content file, src_list *
-
Washington Post:
- Top 5 Most Read (left side bar): https://www.washingtonpost.com/politics/
- Top 5 Most Read (left side bar): https://www.washingtonpost.com/regional/
- Top 5 'The Atlantic' from here:
-
NYT:
- July 10-now (Aug 12) got top 10 for national section only.
- going forward: planning to get top 10 from:
-
WSJ
- Has a top 5 with no separate link; the same top 5 appears everywhere (it is not specific to politics on the politics page)
-
Fox
- July 10-now (Aug 12) got top 10 for politics section only
- From now on, get: http://feeds.foxnews.com/foxnews/national http://feeds.foxnews.com/foxnews/most-popular
-
Huffington Post:
- July 10-now got top 10 from http://www.huffingtonpost.com/mapi/v2/us/trending?device=desktop&statsType=rawPageView&statsPlatform=desktop&algo=trending
- Add: http://www.huffingtonpost.com/ (right bar, trending)
-
USA Today:
- Home page, right bar: http://www.usatoday.com/ (Most Popular). Just 4 links, which is what we have been getting and will continue to get
-
Yahoo:
- July 10-now been getting:
- Get 5 from: https://www.yahoo.com/news/
- Get 5 from: http://news.yahoo.com/most-popular/?pt=BureoF4GVB/?format=rss
- Going forward:
- Will keep 5 from: https://www.yahoo.com/news/
- Will change to 10 for http://news.yahoo.com/most-popular/?pt=BureoF4GVB/?format=rss
- Will add Yahoo originals, AP and Reuters (not sure what's gained from https://www.yahoo.com/news/ - does not refer to most viewed/popular)
-
Google News
- July 10-now:
- 5 from: https://news.google.com/
- 5 from: https://news.google.com/?ned=us&topic=po (But 2nd link seems to link to old news.)
- Starting now:
- 10 from news.google.com
- still 5 from other link
-
HTML files:
- 2012: ia-homepage-html-2012.tar.gz
-
CSV files (compressed in gzip)
- 2012: ia-output-homepage-2012-text.csv.gz
- 2016: ia-output-homepage-2016-text.csv.gz
- HTML file: ia-politics-html.tar.gz
- CSV output: ia-output-politics-homepage-2012-2016-notext.csv.gz (without text data)
- HTML files:
- ia-news-top10-html.tar.gz (split by news_org/harvested pages from links to top10)
- ia-politics-top10-html.tar.gz
- ia-top10-html.tar.gz
- CSV files
- ia-output-politics-top10-text-all.csv
- The Top 10 politics news items that we scraped and parsed from the Internet Archive for 2012 and 2016, between Jul 1 and Nov 30.
- ia-output-top10-text-all.csv
- The Top 10 of all news for 2012 and 2016, between Jul 1 and Nov 30. The linked articles have not been scraped and parsed yet: 56k links still need to be scraped/parsed.
The frequency with which the Internet Archive takes snapshots of different websites varies for unknown reasons. Here is the total number of snapshots for each source:
* Source: nyt, 31129 snapshots
* Source: wsj, 15573 snapshots
* Source: fox, 16838 snapshots
* Source: hpmg, 26667 snapshots
* Source: usat, 26545 snapshots
* Source: google, 0 snapshots (Page cannot be displayed due to robots.txt)
* Source: yahoo, 13991 snapshots
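To make the uneven coverage easier to compare, the snapshot counts can be turned into shares of the total. The short Python sketch below uses only the numbers listed above; the variable names are illustrative.

```python
# Snapshot counts per source, copied from the list above.
snapshots = {
    "nyt": 31129,
    "wsj": 15573,
    "fox": 16838,
    "hpmg": 26667,
    "usat": 26545,
    "google": 0,   # blocked by robots.txt
    "yahoo": 13991,
}

total = sum(snapshots.values())
shares = {src: count / total for src, count in snapshots.items()}

# Print sources from most to least covered.
for src, count in sorted(snapshots.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src:7s}{count:6d}  {shares[src]:6.1%}")
```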
- Yahoo:
- Homepage: http://web.archive.org/web/20110701091910/http://news.yahoo.com/
- Top10:
- Google:
- Page cannot be displayed due to robots.txt
- Homepage: http://web.archive.org/web/20110701014447/http://news.google.com/
- USA Today:
- Homepage: http://web.archive.org/web/20120701152440/http://www.usatoday.com/
- Top10:
- WSJ
- Homepage: http://web.archive.org/web/20120701034332/http://online.wsj.com/home-page#
- Politics: https://web.archive.org/web/20120703083921/http://online.wsj.com/public/page/news-politics-campaign.html
- Top10:
- Top10 Politics:
- Fox News
- Homepage: https://web.archive.org/web/20120701050401/http://www.foxnews.com/
- Politics: https://web.archive.org/web/20120701123538/http://www.foxnews.com/politics/index.html
- Top10:
- Top10 Politics:
- HuffPo:
- Homepage: http://web.archive.org/web/20110701031057/http://www.huffingtonpost.com/
- Politics: https://web.archive.org/web/20120711044502/http://www.huffingtonpost.com/politics/
- Top10:
- 2012: http://web.archive.org/web/20110701031057/http://www.huffingtonpost.com/
- 2016: don't see here
- Top10 Politics:
- NYT:
- Homepage: http://web.archive.org/web/20110701014448/http://www.nytimes.com/
- Politics: http://web.archive.org/web/20120701051437/http://www.nytimes.com/pages/politics/index.html
- Top 10:
- Can't use the API, and the Politics most-popular list is missing from the Internet Archive.
- Fetched homepage top10: http://web.archive.org/web/20121022004730/http://www.nytimes.com/most-popular
- WaPo: http://web.archive.org/web/*/washingtonpost.com seems to run into robots.txt
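The snapshot links above all follow the Wayback Machine pattern of a 14-digit timestamp (YYYYMMDDhhmmss) followed by the original URL. A small Python sketch for splitting such a link and checking it against the Jul 1 to Nov 30 window of 2012 and 2016 might look like this; the function names `parse_snapshot` and `in_window` are hypothetical and not taken from the project's code.

```python
import re
from datetime import datetime

# web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_snapshot(url):
    """Return (capture time, original URL) for a Wayback Machine link."""
    match = WAYBACK.match(url)
    if match is None:
        raise ValueError(f"not a timestamped snapshot URL: {url}")
    taken = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return taken, match.group(2)


def in_window(taken, years=(2012, 2016)):
    """True if the capture falls between Jul 1 and Nov 30 of a study year."""
    return taken.year in years and (7, 1) <= (taken.month, taken.day) <= (11, 30)


fox_taken, fox_url = parse_snapshot(
    "https://web.archive.org/web/20120701050401/http://www.foxnews.com/"
)
yahoo_taken, yahoo_url = parse_snapshot(
    "http://web.archive.org/web/20110701091910/http://news.yahoo.com/"
)
print(fox_taken, fox_url, in_window(fox_taken))
print(yahoo_taken, yahoo_url, in_window(yahoo_taken))
```

Wildcard links such as the WaPo one above do not carry a timestamp, so `parse_snapshot` rejects them with a `ValueError`.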