A quickie Python 3.5 script that parses the PDF-listing of RSS feeds that Facebook uses to monitor for breaking news stories to add to its Trending Section.
On May 12, 2016, Gizmodo published an article titled, Facebook Admits Its Trending Section Includes Topics Not Actually Trending on Facebook, which covered the fallout from Gizmodo's previous reporting that Facebook's Trending Section is mostly human-curated. As part of its response, Facebook released a list of 1,000 RSS feeds (as a PDF file) that it says it uses to crawl for interesting news stories that may not have yet percolated through its social shares.
This repo contains code (and the results) to convert that PDF list into a machine-readable CSV (data/rss-urls.csv) and then to fetch each RSS URL. A few of the URLs 404, but programmers who know how to parse XML can make use of the retrieved data to do their own content analysis.
Note: There appears to be only 929 lines in Facebook's list of RSS feeds, according to wc -l data/rss-urls.csv
, not "1,000". And when counting uniques --
csvcut -c3 data/rss-urls.csv | sort | uniq | wc -l
The result is 888 lines.
Each URL is given a country and category. Here's the group count of those fields:
count | country | topic |
---|---|---|
11 | AU | business |
10 | AU | entertainment |
20 | AU | general |
10 | AU | health |
13 | AU | politics |
7 | AU | science |
11 | AU | sports |
7 | AU | tech |
20 | CA | business |
35 | CA | entertainment |
5 | CA | gaming |
30 | CA | general |
19 | CA | health |
15 | CA | politics |
13 | CA | science |
18 | CA | sports |
16 | CA | tech |
17 | GB | business |
39 | GB | entertainment |
5 | GB | gaming |
25 | GB | general |
18 | GB | health |
21 | GB | politics |
13 | GB | science |
17 | GB | sports |
11 | GB | tech |
27 | IN | business |
21 | IN | entertainment |
1 | IN | gaming |
33 | IN | general |
12 | IN | health |
29 | IN | politics |
10 | IN | science |
19 | IN | sports |
13 | IN | tech |
17 | US | business |
35 | US | entertainment |
30 | US | gaming |
39 | US | general |
41 | US | health |
47 | US | politics |
48 | US | science |
14 | US | sports |
66 | US | tech |
The data/feeds/ folder already includes results from a fetch on 2016-05-12, read the directions further below if you want to run it from scratch. The data/feeds/ contains JSON files that include the metadata when requesting a given RSS URL. If successful, the serialized JSON object contains the raw, unparsed XML in a field named response_text
(i.e. I haven't extracted the individual news items from each valid RSS feed).
Here's an example of how http://deadline.com/feed (saved as: data/deadline.com---feed.json) is serialized:
{
"requested_url": "http://deadline.com/feed/",
"fetched_at": "2016-05-12T23:35:52.197688",
"status_code": 200,
"response_text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss version=\"2.0\"\n\txmlns:content=\"http://purl.org/rss/1.0/modules/content/\"\n\txmlns:wfw=\"http://wellformedweb.org/CommentAPI/\"\n\txmlns:dc=\"http://purl.org/dc/elements/1.1/\"\n...</channel>\n</rss>\n",
"headers": {
"Date": "Fri, 13 May 2016 06:33:29 GMT",
"Vary": "Accept-Encoding, Accept-Encoding",
"Last-Modified": "Fri, 13 May 2016 06:30:23 GMT",
"Content-Type": "application/rss+xml; charset=UTF-8",
"Server": "nginx",
"X-nc": "HIT bur 209",
"X-UA-Compatible": "IE=Edge",
"X-hacker": "If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.",
"Connection": "keep-alive",
"Content-Encoding": "gzip",
"X-ac": "4.sjc _bur",
"Transfer-Encoding": "chunked"
},
"response_url": "http://deadline.com/feed/"
}
If you are looking to deserialize the XML, here's one way to do it (using the xmltodict lib):
import json
import xmltodict
jdata = json.load(open('data/feeds/deadline.com_feed.json'))
feed = xmltodict.parse(jdata['response_text'])
print(feed['rss']['channel']['title'])
# Deadline
items = feed['rss']['channel']['item']
len(items)
# 12
print(items[4]['title'])
# The CW Looking To Redevelop Kevin Williamson Paranormal Drama
print(items[4]['link'])
# http://deadline.com/2016/05/kevin-williamson-paranormal-drama-pilot-redeveloped-the-cw-1201755022/
RSS feeds have different formats...which is why I haven't taken the time to write a deserializer myself, but I'm sure someone more familiar with RSS can do it easily.
My scripts have some exotic dependencies even though they do little more than just fetch URLs:
-
Python 3.5 (and its new standard library modules such as pathlib)
-
The deco library -- which has just been updated to 3.5 -- for easy concurrency. You can remove the
@concurrent
and@synchronized
decorators if you don't want the fuss. I had to install the egg straight from Github:pip install -e git+https://github.com/alex-sherman/deco.git#egg=deco
-
The scripts/fetch_pdf.py script requires Poppler to run pdftotext via the shell.
To re-fetch Facebook's PDF and re-parse it into data/rss-urls.csv:
$ python scripts/fetch_pdf.py
The following script will run through each entry in data/rss-urls.csv to fetch each RSS URL and save the response to a corresponding JSON file in data/feeds/:
$ python scripts/fetch_feeds.py
The JSON file for each fetch attempt includes metadata -- such as the headers, HTTP status code, and datetime of the request -- as well as a response_text
that contains the raw text of the server response. The HTTP request will automatically follow redirects, so everything is either a 200
or some kind of error code. However, there is a requested_url
-- which corresponds to the URL that came from Facebook's original document -- and a response_url
, which can be used to compare against requested_url
to see if a redirect occurred. This is a hacky way to deal with some redirects not pointing to actual RSS resources, e.g http://www.nationaljournal.com/?rss=1.
There's a scripts/metrics.py that simply counts up the metadata:
Status code metrics:
792: 200
44: 404
23: ConnectionError
20: 403
3: 400
1: 502
1: InvalidSchema
1: 500
1: 429
Of the 792 requests that were successful, 109 were likely redirects