
spnet Indexing Tutorial

cjlee112 edited this page Aug 23, 2013 · 3 revisions

The Big Picture

SelectedPapers.net seeks to index posts that refer to specific papers. In the case of Google+, it must do this by periodically querying Google+ for posts with the #spnetwork tag. Google+ returns a JSON dictionary representing each post; the spnet/incoming.py module processes each post to see whether it is new or updated (changed data, e.g. new comments) and, if so, stores it in the mongoDB database. The components:

  • spnet/gplus.py performs all interactions with Google+, such as querying for posts (in the language of the Google+ API this is referred to as searching for "activities").
  • spnet/incoming.py processes received posts (each a JSON dictionary) and updates the database accordingly. The incoming module is designed to be generic (i.e. it contains no Google+-specific code). Note that this module also contains no mongoDB-specific code; instead it simply creates instances of spnet/core classes such as Recommendation, Post, and Reply. These classes follow a simple CRUD interface: creating an object inserts a record into the database; requesting an object with a specified ID reads a record from the database; calling its update() method updates that record in the database; and calling its delete() method deletes its record from the database.
  • The actual mongoDB-specific code implementing those operations lives in base classes (e.g. Document, EmbeddedDocument, ArrayDocument, etc.) in spnet/base.py. You won't ordinarily need to deal with raw mongoDB queries or functions.
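The CRUD convention those bullets describe can be sketched as follows. This is a simplified, hypothetical illustration (FakeCollection and its methods are invented here so the pattern runs without a database), not the actual code in spnet/base.py:

```python
class FakeCollection:
    """Stand-in for a mongoDB collection: maps _id -> record dict."""
    def __init__(self):
        self.records = {}
    def insert(self, d):
        self.records[d['_id']] = dict(d)
    def find_one(self, _id):
        return self.records.get(_id)
    def update(self, _id, d):
        self.records[_id].update(d)
    def remove(self, _id):
        del self.records[_id]

class Document:
    """Sketch of the CRUD convention followed by spnet/core classes."""
    collection = None  # each subclass binds this to its own collection
    def __init__(self, docID, docData=None):
        if docData is not None:  # CREATE: constructing with data inserts a record
            docData = dict(docData, _id=docID)
            self.collection.insert(docData)
        else:  # READ: constructing with just an ID fetches the existing record
            docData = self.collection.find_one(docID)
            if docData is None:
                raise KeyError('no such record: %s' % docID)
        self.__dict__.update(docData)
    def update(self, d):  # UPDATE: write changed fields back to the record
        self.collection.update(self._id, d)
        self.__dict__.update(d)
    def delete(self):  # DELETE: remove this object's record
        self.collection.remove(self._id)

class Post(Document):
    collection = FakeCollection()

p = Post('gplus:123', dict(text='hello #spnetwork'))  # create
p.update(dict(text='edited'))                         # update
reread = Post('gplus:123')                            # read back
p.delete()                                            # delete
```

The point of the convention is that callers like incoming.py never touch the database API directly; they only construct, update, and delete these objects.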

Learning about Google+ Polling

When you consider all the above, you'll see that there's no need for indexing to run inside the spnet webserver process. Since the webserver just queries the database for every user request, all that an indexing process needs to do is update the database. That can be run as a separate process. In fact, if you look at the bottom of spnet/gplus.py, you'll see it can be run as a script that will do just that -- query Google+ and save new posts to the database:

if __name__ == '__main__':
    import connect
    dbconn = connect.init_connection() # initialize mongoDB connection for all core classes
    n = 0
    for post in publicAccess.load_recent_spnetwork():
        n += 1 # search method is an iterator, so won't even start working till we consume its results
    print 'received %d new posts' % n
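The comment about consuming the iterator matters because the search method is a generator: no querying happens until something iterates over the results. A minimal illustration of that laziness, with a toy generator standing in for the Google+ query:

```python
calls = []  # records when the generator body actually runs

def fake_search():
    """Toy stand-in for search_activities(): the body does not
    execute until the caller starts consuming results."""
    calls.append('queried')
    for post in ({'id': '1'}, {'id': '2'}):
        yield post

it = fake_search()   # generator created; calls is still empty here
results = list(it)   # consuming it is what triggers the work
```

This is why the script above must loop over load_recent_spnetwork() (even if only to count): without the loop, nothing would be fetched or saved.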

What exactly does this do? Let's follow this back to the load_recent_spnetwork() method:

def load_recent_spnetwork(self, maxDays=10, recentEvents=None, **kwargs):
    'scan recent G+ posts for updates, and save updates to DB'
    postIt = self.search_activities(query='#spnetwork', orderBy='recent')
    return self.find_or_insert_posts(postIt, maxDays=maxDays,
                                     recentEvents=recentEvents, **kwargs)

  • search_activities() queries Google+ and returns an iterator over the results (concretely, it's a generator function that yields JSON dictionaries, each representing a Google+ post).
  • find_or_insert_posts() is just a wrapper for incoming.find_or_insert_posts(). All it does is pass a set of additional, Google+-specific callback arguments that handle generic needs such as extracting a userID from the JSON dictionary, getting the actual text of the post, etc.
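That division of labor is plain callback injection: the generic ingestion function knows nothing about Google+, and all service-specific parsing is supplied as arguments. A hypothetical sketch (these callback names and signatures are invented for illustration; they are not spnet's actual ones):

```python
def generic_ingest(posts, get_post_id, get_user_id, get_content):
    """Generic ingestion loop: all service-specific JSON parsing
    is delegated to the supplied callbacks."""
    out = []
    for d in posts:
        out.append(dict(id=get_post_id(d),
                        user=get_user_id(d),
                        text=get_content(d)))
    return out

# Google+-flavored callbacks matching the JSON layout of a G+ activity
gplus_posts = [{'id': 'abc',
                'actor': {'id': 'u1'},
                'object': {'content': 'hello #spnetwork'}}]
records = generic_ingest(
    gplus_posts,
    get_post_id=lambda d: d['id'],
    get_user_id=lambda d: d['actor']['id'],
    get_content=lambda d: d['object']['content'])
```

Supporting another service would then mean writing a new set of callbacks, with no changes to the generic module.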

To get a feel for this, let's look directly at what Google+ is returning to us. We simply start Python and import the gplus module:

>>> import gplus
>>> l = list(gplus.publicAccess.search_activities(query='#spnetwork', orderBy='recent')) # read iterator results into a list
>>> len(l)
36
>>> [d['verb'] for d in l]
[u'share', u'post', u'share', u'share', u'post', u'share', u'share', u'share', u'post', u'post', u'share', u'share', u'post', u'share', u'post', u'share', u'share', u'post', u'share', u'post', u'share', u'post', u'post', u'share', u'share', u'share', u'share', u'post', u'share', u'share', u'share', u'share', u'share', u'share', u'share', u'share']
>>> l[0].keys()
[u'kind', u'provider', u'title', u'url', u'object', u'updated', u'actor', u'access', u'verb', u'etag', u'published', u'id']
>>> l[0]['actor']
{u'url': u'https://plus.google.com/118044437833307827959', u'image': {u'url': u'https://lh6.googleusercontent.com/-Bknm9zLFNX4/AAAAAAAAAAI/AAAAAAAAACo/ZKAkaQOIJpY/photo.jpg?sz=50'}, u'displayName': u'Topos Quantum Theory', u'id': u'118044437833307827959'}
>>> l[0]['object'].keys()
[u'resharers', u'attachments', u'url', u'actor', u'content', u'plusoners', u'replies', u'id', u'objectType']
>>> l[0]['object']['content']
u'<b>A foundations of mathematics for the 21st century</b><br /><br />It&#39;s here!\xa0\xa0 For decades, mathematicians been dreaming of an approach to math where different proofs that x = y can be seen as different <i>paths</i> in a <i>space</i>. \xa0 It&#39;s finally been made precise, thanks to Vladimir Voevodsky and a gang of mathematicians who holed up for a year at the Institute for Advanced Studies, at Princeton.\xa0<br /><br />I won&#39;t try to explain it, since that&#39;s what the book does.\xa0 I&#39;ll just mention a few of the radical new features:<br /><br />\u2022\xa0 It includes set theory as a special case, but it&#39;s founded on more general things called &#39;types&#39;.\xa0 Types include sets, but also propositions.\xa0 Proving a proposition amounts to constructing an element of a certain type.\xa0 So, proofs are no longer &#39;outside&#39; the mathematics being discussed, they&#39;re inside it just like everything else.<br /><br />\u2022 The logic is - in general - &#39;constructive&#39;, meaning that to prove something exists amounts to giving a procedure for constructing it.\xa0 As a result, the system can be <i>and is being</i> computerized with the help of programs like COQ and AGDA.<br /><br />\u2022 Types can be seen as &#39;spaces&#39;, and their elements as &#39;points&#39;.\xa0 A proof that two elements of a type are equal can be seen as constructing a path between two points.\xa0 Sets are just a special case: the &#39;0-types&#39;, which have no interesting higher-dimensional aspect.\xa0 There are also types that look like spheres and tori!\xa0 Technically speaking, the branch of topology called <i>homotopy theory</i> is now a part of logic!\xa0 That&#39;s why the subject is called <b>homotopy type theory</b>.<br /><br />\u2022 Types can also be seen as <b>infinity-groupoids</b>.\xa0 Very roughly, these are structures with elements, isomorphisms between elements, isomorphisms between isomorphisms, and so on <i>ad infinitum</i>.\xa0 
So, a certain chunk of the important new branch of math called &#39;higher category theory&#39; is now part of logic, too.<br /><br />\u2022 The most special contribution of Voevodsky is the <b>univalence axiom</b>.\xa0 Very <i>very</i> roughly, this expands the concept of &#39;equality&#39; so that it&#39;s just as general as the hitherto more flexible concept of &#39;isomorphism&#39; - or, if you know some more fancy math, &#39;equivalence&#39;.\xa0\xa0 Mathematicians working on homotopy theory and higher category theory have known for decades that equality is too rigid a concept to be right - for certain applications.\xa0 The univalence axiom updates our concept of equality so that it&#39;s good again! \xa0\xa0 (However, nobody knows if this axiom is constructive - this is a big open question.)<br /><br />Since this is all about <i>foundations</i>, and it&#39;s all quite new, please don&#39;t ask me yet what its practical applications are.\xa0 Ask me in a hundred years.\xa0 For now, I can tell you that this is the &#39;upgrade&#39; that the foundations of math has needed ever since the work of Grothendieck.\xa0 It&#39;s truly 21-century math.<br /><br />It&#39;s also a book for the 21st century, because it&#39;s escaped the grip of expensive publishers!\xa0 While it&#39;s 600 pages long, a hardback copy costs less than $27.\xa0 Paperback costs less than $18, and an electronic copy is free!\xa0\xa0<br /><br /> <a class="ot-hashtag" href="https://plus.google.com/s/%23spnetwork">#spnetwork</a>   <a class="ot-hashtag" href="https://plus.google.com/s/%23homotopytheory">#homotopytheory</a>   <a class="ot-hashtag" href="https://plus.google.com/s/%23ncategories">#ncategories</a>   <a class="ot-hashtag" href="https://plus.google.com/s/%23logic">#logic</a>   <a class="ot-hashtag" href="https://plus.google.com/s/%23foundations">#foundations</a>  \xa0'
>>> l[0]['published']
u'2013-08-21T16:28:55.273Z'
>>> l[-1]['published']
u'2013-07-19T14:21:08.403Z'
>>> spn = set([d['id'] for d in l])

A few notes:

  • Google+, like Twitter, only seems to return about a month of recent results. In this case, our earliest result is from mid-July.
  • verb=post means an original post; verb=share means someone re-shared someone else's post. Currently we only index the former, not the latter.
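Both notes translate directly into a filter: keep only original posts (verb equal to 'post') whose published timestamp falls within a maxDays window. A sketch using the timestamp format shown in the transcript above (this helper is illustrative, not spnet's actual filtering code):

```python
from datetime import datetime, timedelta

def recent_original_posts(posts, maxDays=10, now=None):
    """Keep verb == 'post' activities published in the last maxDays days."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=maxDays)
    keep = []
    for d in posts:
        # parse timestamps like u'2013-08-21T16:28:55.273Z'
        published = datetime.strptime(d['published'],
                                      '%Y-%m-%dT%H:%M:%S.%fZ')
        if d['verb'] == 'post' and published >= cutoff:
            keep.append(d)
    return keep

posts = [
    dict(id='a', verb='post',  published='2013-08-21T16:28:55.273Z'),
    dict(id='b', verb='share', published='2013-08-21T16:28:55.273Z'),
    dict(id='c', verb='post',  published='2013-07-19T14:21:08.403Z'),
]
fresh = recent_original_posts(posts, maxDays=10,
                              now=datetime(2013, 8, 23))
# only 'a' survives: 'b' is a re-share, 'c' is older than 10 days
```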

Trying some other search terms

What would we get via other search terms? Google+ search seems pretty flaky: many posts tagged #spnetwork are not found by the above search, with no discernible logic to which ones are found vs. not. We're following up with Google+ to try to get them to fix this. But in the meantime it might be good to broaden our search methods to try to "rescue" some posts that the #spnetwork query fails to find. For example, we could try searching for arxiv:, since any #spnetwork post about an arxiv paper should include that text:

>>> l = list(gplus.publicAccess.search_activities(query='arxiv:', orderBy='recent')) # read iterator results into a list
>>> len(l)
29
>>> a = set([d['id'] for d in l])
>>> len(a - spn)
29

This evidently finds a completely different set of posts than the #spnetwork search does. For that matter, we could search for any post containing arxiv.org, since that's likely to be a URL to an arxiv paper:

>>> l = list(gplus.publicAccess.search_activities(query='arxiv.org', orderBy='recent')) # read iterator results into a list
>>> len(l)
109
>>> a2 = set([d['id'] for d in l])
>>> len(a2 - spn)
106
>>> len(a2 - a)
86

A lot more results, and again mostly disjoint from the previous searches.
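Since each activity carries a stable id, combining several searches is just the set arithmetic used in the transcript: union the id sets from each query, then subtract what's already indexed. A toy illustration of that dedup step (the helper name is invented here):

```python
def merge_new_ids(already_indexed, *searches):
    """Union the post ids from several searches, minus ids we already have."""
    found = set()
    for results in searches:
        found |= set(d['id'] for d in results)  # ids from one search
    return found - already_indexed              # drop already-indexed posts

spn_search = [{'id': 'p1'}, {'id': 'p2'}]
arxiv_search = [{'id': 'p2'}, {'id': 'p3'}]  # overlaps on p2
new = merge_new_ids({'p1'}, spn_search, arxiv_search)
# p1 is already indexed; p2 and p3 are the genuinely new posts
```

The mostly-disjoint overlaps measured above (len(a - spn), len(a2 - a), etc.) are exactly this computation run on real query results.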

>>> print l[0]['object']['content']
<a href="http://arxiv.org/abs/1306.6047" class="ot-anchor" rel="nofollow">http://arxiv.org/abs/1306.6047</a><br /><br />Can&#39;t help but think some of the nostradamus distributor techniques might be applicable here too, they&#39;re still not entirely away from bottlenecking on dispatch (although it sounds like attribute lookup may be a bigger issue currently).    <a href="http://www.emulators.com/docs/nx25_nostradamus.htm" class="ot-anchor" rel="nofollow">http://www.emulators.com/docs/nx25_nostradamus.htm</a>

Looks like meaningful discussion of a paper... so this could be worth indexing.