Simple Feature Tutorial

cjlee112 edited this page Aug 28, 2013 · 8 revisions

Goal: implementing a basic feature

This tutorial introduces the basics of working with spnet data by walking through the implementation of a simple feature. Let's implement a proposed feature from the issue tracker: display of arXiv metadata. This is fairly simple because we're working with existing data (no need to write code to get new kinds of data), so we just need to figure out what paper metadata arXiv has already given us and choose a nice way to present it. For example, we could follow the existing model of how arXiv displays those metadata.

Let's start with an example arXiv paper.

The DOI would enable spnet to connect this arXiv paper to the final published version. The comments and journal reference would be nice to show too.

Looking at internal spnet data

Now let's see what metadata spnet receives from arXiv, by directly viewing the raw data. To get easy command line access to raw data, we can either start an spnet web server:

python -i web.py

Alternatively we can simply connect to the database manually:

python
>>> import connect
>>> dbconn = connect.init_connection()
>>> import core

At this point we can directly query the data we want:

>>> p = core.ArxivPaperData('0807.3498', insertNew='findOrInsert')
  • spnet/core.py defines the core spnet data classes.
  • ArxivPaperData represents the data for an arXiv paper.
  • For all core classes, obtaining the record for a specified object is as easy as providing its ID.
  • The setting insertNew='findOrInsert' tells it to first check your mongoDB for this record, but if not found, then to use the class's external query method to retrieve it from an external source (in this case arXiv.org's API), and to insert the result into mongoDB (ensuring that henceforth this record will definitely be in the mongoDB database).

We can inspect the data either using the standard Python dir() builtin function:

>>> dir(p)
['__class__', '__cmp__', '__delattr__', '__dict__', '__doc__', '__format__',
'__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_dbDocDict', '_dbfield', '_get_doc', '_insert_parent', '_isNewInsert',
'_parent_link', '_query_external', '_set_parent', '_spnet_url_base', 'array_append',
'array_del', 'arxiv_comment', 'arxiv_doi', 'arxiv_journal_ref', 'arxiv_primary_category',
'author', 'authorNames', 'author_detail', 'authors', 'check_required_fields', 'coll',
'delete', 'find', 'find_obj', 'find_or_insert', 'get_abstract', 'get_doctag',
'get_downloader_url', 'get_hashtag', 'get_local_url', 'get_source_url', 'get_spnet_url',
'get_value', 'guidislink', 'id', 'insert', 'link', 'links', 'parent', 'published',
'published_parsed', 'set_attrs', 'summary', 'summary_detail', 'tags', 'title',
'title_detail', 'update', 'updated', 'updated_parsed', 'useObjectId']

As usual, dir() shows a mix of standard methods, data attributes provided by arXiv, and attributes generated by our class.

Alternatively, every core object also gives direct access to the JSON dictionary stored in the mongoDB database, via its _dbDocDict attribute:

>>> p._dbDocDict.keys()
['authorNames', 'updated', u'arxiv_doi', 'updated_parsed', 'published_parsed', 'title',
'authors', 'summary_detail', u'arxiv_journal_ref', 'summary', 'links', 'guidislink',
'title_detail', 'tags', 'link', 'author', 'published', 'author_detail', 'id',
u'arxiv_primary_category', u'arxiv_comment']

Note that core objects automatically mirror these JSON data as object attributes. So we can access them directly as attribute names:

>>> p.arxiv_doi
u'10.3934/jmd.2009.3.159'
>>> p.arxiv_journal_ref
u'J. Mod. Dyn. 3 (2009), no. 2, 159--231'
>>> p.arxiv_comment
u'Errors have been corrected in Section 9 from the prior and published\n  versions of this paper. In particular, the formulas associated to homology\n  classes of curves corresponding to stable periodic billiard paths in obtuse\n  Veech triangles were corrected. See Remark 9.1 of the paper for more\n  information. The main results and the results from other sections are\n  unaffected. 82 pages, 43 figures'
>>> p.arxiv_primary_category
{'term': u'math.DS', 'scheme': u'http://arxiv.org/schemas/atom'}
>>> p.tags
[{'term': u'math.DS', 'scheme': u'http://arxiv.org/schemas/atom', 'label': None}, {'term': u'37D50 (Primary) 37E15, 51M04 (Secondary)', 'scheme': u'http://arxiv.org/schemas/atom', 'label': None}]
>>> p.updated
u'2013-06-04T02:18:08Z'
>>> p.updated_parsed
datetime.datetime(2013, 6, 4, 3, 18, 8)
>>> p.published
u'2008-07-22T15:09:57Z'

Clearly, we can use these data fields to display the same metadata as arXiv.org shows for this paper.
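As a rough picture of the attribute mirroring noted above, here is a minimal sketch. DocMirror is a hypothetical stand-in, not one of the spnet core classes, and the real classes may do this differently.

```python
class DocMirror(object):
    'hypothetical stand-in for a core class: copies dict keys to attributes'
    def __init__(self, doc):
        self._dbDocDict = doc            # keep the raw dictionary around
        for key, value in doc.items():   # mirror each key as an attribute
            setattr(self, key, value)

p = DocMirror({'arxiv_doi': '10.3934/jmd.2009.3.159',
               'arxiv_journal_ref': 'J. Mod. Dyn. 3 (2009), no. 2, 159--231'})
doi = p.arxiv_doi
has_comment = hasattr(p, 'arxiv_comment')  # this toy record has no comment
```

In this sketch, a field that is absent from the dictionary is also absent as an attribute, so code that displays optional fields should test for them first.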

Deciding how to display the metadata

First take a look at the page source for how arXiv displays this paper. Scroll down to the metatable div. As you can see, arXiv uses a simple table and CSS class values to lay out these metadata. Since selectedpapers.net uses arXiv's CSS, we can more or less copy this table format exactly, and use Jinja2 templating to inject the paper's metadata into this format.

Now take a look at our template for displaying a paper. Scroll down to the point just after the class="abstract" blockquote. At this point, we could inject arXiv metadata (e.g. arxiv_doi) as easily as:

{{ paper.arxiv.arxiv_doi }}
  • paper is the core.Paper object representing this paper record.
  • paper.arxiv is the core.ArxivPaperData object representing the arXiv data for this paper. (Of course, a paper not from arXiv would lack this attribute.)

We could just add these changes to the template, and our feature is done.
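Not every arXiv paper has a DOI, comment or journal reference, and the comment shown earlier contains line-wrap whitespace. One possible way to handle both is sketched below. The helper arxiv_metadata_rows(), the FakeArxiv class and the row labels are all hypothetical, chosen only for illustration.

```python
def arxiv_metadata_rows(arxiv):
    'return (label, value) pairs for the optional fields that are present'
    fields = [('Comments', 'arxiv_comment'),
              ('Journal reference', 'arxiv_journal_ref'),
              ('DOI', 'arxiv_doi')]
    rows = []
    for label, attr in fields:
        value = getattr(arxiv, attr, None)  # None if the field is missing
        if value:
            # collapse the "\n  " line wraps into single spaces
            rows.append((label, ' '.join(value.split())))
    return rows

class FakeArxiv(object):
    'hypothetical stand-in for an ArxivPaperData object'
    arxiv_doi = '10.3934/jmd.2009.3.159'
    arxiv_comment = 'See Remark 9.1 of the paper for more\n  information.'

rows = arxiv_metadata_rows(FakeArxiv())
```

A template could then loop over such rows and emit one table row per pair, skipping fields the paper lacks.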

A deeper look: RESTful templating

If you're curious how the template receives its variables (e.g. paper), read on. The spnet server follows the popular REST model, which specifies that a data view page would be a GET HTTP request with a URI of the form https://selectedpapers.net/COLLECTION/ID, where COLLECTION is the name of a particular kind of data, and ID is the unique ID of a particular data object in that collection. E.g. https://selectedpapers.net/arxiv/1308.0729. This is handled by the spnet code as follows:

  • the root of the webserver is a spnet.web.Server object. When a request is received (e.g. /arxiv/1308.0729), the CherryPy dispatcher looks for a corresponding attribute on the server object whose name matches the URI (in this example case, if the server object is server, it would look for its attribute server.arxiv). It then calls that attribute's default method with the remaining elements of the URI and GET/POST keyword arguments: for our example this would be server.arxiv.default('1308.0729', **kwargs).

  • in the spnet code, such collection attributes are generally spnet.rest.Collection instances (typically subclassed to provide appropriate methods for that specific collection). These implement REST in the following simple way: first, based on the HTTP request verb (e.g. GET), the collection calls its associated request method (named by prepending an underscore to the verb, e.g. _GET()), which returns the requested data object (in this example, an spnet.core.Paper object representing this arXiv paper); second, based on the requested response type (typically html), it calls the associated response method (named by joining the verb and the response type, e.g. get_html()), which returns the response string (i.e. HTML).

  • by default, spnet.rest.Collection objects automatically look for templates they can bind as response methods. Specifically, any template file of the form VERB_COLLECTION.TYPE (e.g. get_paper.html) will automatically be bound as a spnet.view.TemplateView callable to Collection objects that serve that collection name. For example, the server.arxiv object specifies that it serves paper objects, so it binds spnet/_templates/get_paper.html as its server.arxiv.get_html() method.
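The dispatch steps above can be sketched in miniature. This is a toy under stated assumptions, not the spnet or CherryPy code: ToyCollection, ToyServer and dispatch() are hypothetical names, and the verb and response type are passed as plain arguments rather than read from an HTTP request.

```python
class ToyCollection(object):
    'hypothetical miniature of the verb / response-type dispatch'
    def __init__(self, docs):
        self.docs = docs

    def _GET(self, doc_id, **kwargs):
        'request method: return the requested data object'
        return self.docs[doc_id]

    def get_html(self, doc, **kwargs):
        'response method: turn the data object into a response string'
        return '<h1>%s</h1>' % doc['title']

    def default(self, doc_id, verb='GET', response_type='html', **kwargs):
        request_method = getattr(self, '_' + verb)          # e.g. _GET
        doc = request_method(doc_id, **kwargs)
        response_method = getattr(
            self, '%s_%s' % (verb.lower(), response_type))  # e.g. get_html
        return response_method(doc, **kwargs)

class ToyServer(object):
    arxiv = ToyCollection({'1308.0729': {'title': 'Example paper'}})

def dispatch(server, path):
    'mimic the lookup: /COLLECTION/ID -> server.COLLECTION.default(ID)'
    collection_name, doc_id = path.strip('/').split('/', 1)
    return getattr(server, collection_name).default(doc_id)

html = dispatch(ToyServer(), '/arxiv/1308.0729')
```

Because the method names are derived from the verb and response type, supporting a new response type in this sketch only requires adding a method with the matching name.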

Finally, let's look at how spnet.view.TemplateView passes data to the template. The code is pretty trivial so let's just look at it:

class TemplateView(object):
    exposed = True
    def __init__(self, template, name=None, **kwargs):
        self.template = template
        self.kwargs = kwargs
        self.name = name

    def __call__(self, doc=None, **kwargs):
        f = self.template.render
        kwargs.update(self.kwargs)
        session = get_session()
        try:
            kwargs.update(session['viewArgs'])
        except KeyError:
            pass
        if doc is not None:
            kwargs[self.name] = doc
        try:
            user = session['person']
        except KeyError:
            user = session['person'] = None
        if user and user.force_reload():
            user = user.__class__(user._id) # reload from DB
            session['person'] = user # save on session
        return f(kwargs=kwargs, hasattr=hasattr, enumerate=enumerate,
                 urlencode=urllib.urlencode, list_people=people_link_list,
                 getattr=getattr, str=str, map=map_helper, user=user,
                 display_datetime=display_datetime, timesort=timesort,
                 recentEvents=recentEventsDeque, len=len,
                 Selection=webui.Selection, **kwargs) # apply template

For our purposes, the only parts of this that matter are:

  • doc is the requested data object, a Python object representing a MongoDB "document". In our example this would be a spnet.core.Paper object. Note this is passed to the template as a keyword argument whose name is simply the "document name" for this collection, i.e. paper.

  • user is the spnet.core.Person object representing the current login. To get this, we just look it up from CherryPy's session dictionary representing the current session. This is passed to the template as the user keyword arg.

  • any keyword arguments from the GET/POST request are also passed to the template via kwargs.

  • a number of other convenience functions are also passed to the template, e.g. so we can use len() in our templates.
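To make the order in which these values are merged explicit, here is a simplified sketch that follows the same sequence of updates as TemplateView.__call__(). The function build_context() is hypothetical; it uses a plain dict for the session, omits the user reload step and the convenience functions, and folds user into the same dictionary for brevity.

```python
def build_context(request_kwargs, view_kwargs, session, name, doc=None):
    'merge template variables in the same order as TemplateView.__call__'
    context = dict(request_kwargs)               # GET/POST arguments
    context.update(view_kwargs)                  # fixed arguments of the view
    context.update(session.get('viewArgs', {}))  # session values, if any
    if doc is not None:
        context[name] = doc                      # e.g. context['paper'] = doc
    context['user'] = session.get('person')      # None if nobody is logged in
    return context

# 'view' is an illustrative argument name, not one taken from spnet
ctx = build_context({'view': 'full', 'paper': 'spoofed'},
                    {'view': 'brief'}, {}, 'paper', doc={'title': 'T'})
```

Later updates win: in this sketch a request argument cannot override the view's own keyword arguments, nor replace the document passed under its collection name.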

Finally, you can see how the actual "application tree" of REST Collection objects is constructed, by looking at spnet/apptree.py and scrolling down to the get_collections() function.

Deeper data model considerations

Returning to our original arXiv - DOI feature idea, it may be worthwhile to think a bit more about how these data should be organized, and where such display features are best implemented.

  • first, let's consider the relationship between the different "paper databases" that selectedpapers.net currently covers: arXiv, DOI, and PubMed. Note that these are not mutually exclusive, either in reality or in how spnet/core models them. In principle, a paper could have an ID in all three of these databases, and in that case we'd probably like to unify all three of those IDs to a single paper record in selectedpapers.net. That way people commenting on the different IDs (arXiv, DOI or PubMed) would all be integrated into a single conversation about that paper, rather than fragmented into different records. This would be very nice, e.g. if a paper is first discussed as an arXiv preprint, and later as a published paper (DOI).
  • Concretely, the spnet data model takes advantage of the flexibility of the mongoDB "NoSQL" database. In a regular SQL database we'd have to store arXiv, DOI and PubMed records as three separate tables. MongoDB allows us to "embed" records within another record. Specifically, we have one table that stores Paper records. Each Paper record can have embedded within it arXiv, DOI and PubMed records (or none of these, if the paper isn't linked to any of those databases). These embedded records can be added at any time; for example, a Paper record could initially be created with an embedded ArxivPaperData record, and later, when the paper is published, a DoiPaperData embedded record could be added.
  • Currently, selectedpapers.net correctly implements that "record unification" for DOI and PubMed, but not for arXiv vs. the other databases (basically because we didn't know how to look up the DOI for an arXiv paper, and vice versa).
  • Now that we have the arxiv_doi metadata, we could implement that unification fairly easily. We can follow the model for how PubmedPaperData and DoiPaperData link to each other (e.g. where the PubMed data provide a DOI), by looking at spnet/core.py.
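The embedding idea can be illustrated with plain Python dictionaries. The key names 'arxiv', 'doi', 'pubmed' and 'id' below are assumptions made for the sake of the example, not necessarily the field names spnet stores in mongoDB.

```python
# A Paper document first created from arXiv data (illustrative field names)
paper = {
    'title': 'Example title',
    'authorNames': ['A. Author'],
    'arxiv': {'id': '0807.3498', 'arxiv_doi': '10.3934/jmd.2009.3.159'},
}

# Later, when the paper is published, a DOI subdocument is embedded
# in the same Paper document instead of creating a second record.
paper['doi'] = {'id': paper['arxiv']['arxiv_doi']}

embedded = sorted(k for k in ('arxiv', 'doi', 'pubmed') if k in paper)
```

Comments attached to this one document would then be shared by the arXiv and DOI identities of the paper.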

Linking from ArxivPaperData to DoiPaperData

If you look at the PubmedPaperData class you'll see code like:

class PubmedPaperData(EmbeddedDocument):
    'store pubmed data for a paper as subdocument of Paper'
    ...
    def _insert_parent(self, d):
        'create Paper document in db for this arxiv.id'
        try: # connect with DOI record
            DOI = d['doi']
            return DoiPaperData(DOI=DOI, insertNew='findOrInsert',
                                getPubmed=False).parent
        except KeyError: # no DOI, so save as usual
            return Paper(docData=dict(title=d['title'],
                                      authorNames=d['authorNames']))
  • the _insert_parent() method for any embedded ("subdocument") class implements insertion of a new "parent" document record (in which this subdocument will be embedded). It must return the object representing that newly inserted record.
  • in the PubmedPaperData case, it checks whether the PubMed data provide a DOI. If so, it finds or creates a DoiPaperData record (with that DOI), and returns its parent document (i.e. the Paper document in which that DoiPaperData subdocument is embedded). As a result, our new PubmedPaperData subdocument will be embedded in that same Paper document (i.e. that Paper document will have both an embedded DoiPaperData and an embedded PubmedPaperData).
  • of course, if there's no DOI, it has no choice but to create a new Paper document filled in with only the minimum required data, title and authorNames.

We can easily apply this model to ArxivPaperData. All we have to do is add the same try...except clause. I.e. change the current ArxivPaperData._insert_parent() code:

class ArxivPaperData(EmbeddedDocument):
    ...
    def _insert_parent(self, d):
        'create Paper document in db for this arxiv.id'
        return Paper(docData=dict(title=d['title'],
                                  authorNames=d['authorNames']))

to:

class ArxivPaperData(EmbeddedDocument):
    ...
    def _insert_parent(self, d):
        'create Paper document in db for this arxiv.id'
        try: # connect with DOI record
            DOI = d['arxiv_doi']
            return DoiPaperData(DOI=DOI, insertNew='findOrInsert').parent
        except KeyError: # no DOI, so save as usual
            return Paper(docData=dict(title=d['title'],
                                      authorNames=d['authorNames']))
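
To see the intended outcome without a database, here is a self-contained toy simulation. All names in it (papers, doi_index, new_paper(), parent_for_doi(), insert_arxiv_parent()) are hypothetical stand-ins, not spnet code. It also uses d.get() instead of try...except, as one possible variant, so that a KeyError raised somewhere else is not mistaken for a missing DOI.

```python
papers = []     # stand-in for the Paper collection
doi_index = {}  # stand-in for finding a Paper by its embedded DOI

def new_paper(title, author_names):
    'insert a new Paper document holding only the minimum required data'
    paper = {'title': title, 'authorNames': author_names}
    papers.append(paper)
    return paper

def parent_for_doi(doi, title, author_names):
    'find the Paper holding this DOI, or insert one'
    if doi not in doi_index:
        paper = new_paper(title, author_names)
        paper['doi'] = {'id': doi}
        doi_index[doi] = paper
    return doi_index[doi]

def insert_arxiv_parent(d):
    'prefer the Paper that already owns the DOI, else create a new one'
    doi = d.get('arxiv_doi')
    if doi:   # connect with DOI record
        parent = parent_for_doi(doi, d['title'], d['authorNames'])
    else:     # no DOI, so save as usual
        parent = new_paper(d['title'], d['authorNames'])
    parent['arxiv'] = d  # embed the arXiv subdocument
    return parent

published = parent_for_doi('10.3934/jmd.2009.3.159', 'T', ['A'])
linked = insert_arxiv_parent({'arxiv_doi': '10.3934/jmd.2009.3.159',
                              'title': 'T', 'authorNames': ['A']})
standalone = insert_arxiv_parent({'title': 'U', 'authorNames': ['B']})
```

In this simulation the arXiv record with a DOI lands in the existing Paper document, while the one without a DOI gets a Paper document of its own.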