Simple Feature Tutorial
This tutorial introduces the basics of working with spnet data by walking through the steps of implementing a simple feature. Let's try to implement a proposed feature from the issue tracker: display of arXiv metadata. This is fairly simple to implement because we're working with existing data (no need to write code to get new kinds of data). We just need to figure out what paper metadata arXiv has already given us and choose a nice way to present it. For example, we could follow the existing model of how arXiv displays those metadata.
Let's start with an example arXiv paper.
- here's how it renders on selectedpapers.net
- here's how it renders on arXiv
The DOI would enable spnet to connect this arXiv paper to the final published version. The comments and journal reference would be nice to show too.
Now let's see what metadata spnet receives from arXiv, by directly viewing the raw data. To get easy command line access to raw data, we can either start an spnet web server:
python -i web.py
Alternatively we can simply connect to the database manually:
python
>>> import connect
>>> dbconn = connect.init_connection()
>>> import core
At this point we can directly query the data we want:
>>> p = core.ArxivPaperData('0807.3498', insertNew='findOrInsert')
- spnet/core.py defines the core spnet data classes.
- ArxivPaperData represents the data for an arXiv paper.
- For all core classes, obtaining the record for a specified object is as easy as providing its ID.
- The setting insertNew='findOrInsert' tells it to first check your mongoDB for this record, but if not found, then to use the class's external query method to retrieve it from an external source (in this case arXiv.org's API), and to insert the result into mongoDB (ensuring that henceforth this record will definitely be in the mongoDB database).
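The find-or-insert behaviour described in the last bullet can be pictured with a minimal sketch. This is not the spnet implementation: a plain dict stands in for the mongoDB collection, and the names local_db, fetch_external and find_or_insert are hypothetical, chosen only for illustration.

```python
# Hypothetical stand-ins: a dict plays the role of the mongoDB collection,
# and fetch_external() plays the role of the class's external query method.
local_db = {}
external_calls = []  # records each simulated external query


def fetch_external(paper_id):
    'pretend to query an external source such as the arXiv.org API'
    external_calls.append(paper_id)
    return {'id': paper_id, 'title': 'Example title for ' + paper_id}


def find_or_insert(paper_id):
    'return the stored record, fetching and storing it first if needed'
    try:
        return local_db[paper_id]  # already in the local database
    except KeyError:
        record = fetch_external(paper_id)  # not found: ask the external source
        local_db[paper_id] = record  # store it so later lookups stay local
        return record


first = find_or_insert('0807.3498')   # triggers one simulated external query
second = find_or_insert('0807.3498')  # served from local_db
```

In this sketch the second call never reaches the external source, which is the guarantee the findOrInsert setting is described as providing.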
We can inspect the data either using the standard Python dir() builtin function:
>>> dir(p)
['__class__', '__cmp__', '__delattr__', '__dict__', '__doc__', '__format__',
'__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_dbDocDict', '_dbfield', '_get_doc', '_insert_parent', '_isNewInsert',
'_parent_link', '_query_external', '_set_parent', '_spnet_url_base', 'array_append',
'array_del', 'arxiv_comment', 'arxiv_doi', 'arxiv_journal_ref', 'arxiv_primary_category',
'author', 'authorNames', 'author_detail', 'authors', 'check_required_fields', 'coll',
'delete', 'find', 'find_obj', 'find_or_insert', 'get_abstract', 'get_doctag',
'get_downloader_url', 'get_hashtag', 'get_local_url', 'get_source_url', 'get_spnet_url',
'get_value', 'guidislink', 'id', 'insert', 'link', 'links', 'parent', 'published',
'published_parsed', 'set_attrs', 'summary', 'summary_detail', 'tags', 'title',
'title_detail', 'update', 'updated', 'updated_parsed', 'useObjectId']
As usual, dir() shows a mix of standard methods, data attributes provided by arXiv, and attributes generated by our class.
Alternatively, every core object also gives direct access to the JSON dictionary stored in the mongoDB database, via its _dbDocDict attribute:
>>> p._dbDocDict.keys()
['authorNames', 'updated', u'arxiv_doi', 'updated_parsed', 'published_parsed', 'title',
'authors', 'summary_detail', u'arxiv_journal_ref', 'summary', 'links', 'guidislink',
'title_detail', 'tags', 'link', 'author', 'published', 'author_detail', 'id',
u'arxiv_primary_category', u'arxiv_comment']
Note that core objects automatically mirror these JSON data as object attributes. So we can access them directly as attribute names:
>>> p.arxiv_doi
u'10.3934/jmd.2009.3.159'
>>> p.arxiv_journal_ref
u'J. Mod. Dyn. 3 (2009), no. 2, 159--231'
>>> p.arxiv_comment
u'Errors have been corrected in Section 9 from the prior and published\n versions of this paper. In particular, the formulas associated to homology\n classes of curves corresponding to stable periodic billiard paths in obtuse\n Veech triangles were corrected. See Remark 9.1 of the paper for more\n information. The main results and the results from other sections are\n unaffected. 82 pages, 43 figures'
>>> p.arxiv_primary_category
{'term': u'math.DS', 'scheme': u'http://arxiv.org/schemas/atom'}
>>> p.tags
[{'term': u'math.DS', 'scheme': u'http://arxiv.org/schemas/atom', 'label': None}, {'term': u'37D50 (Primary) 37E15, 51M04 (Secondary)', 'scheme': u'http://arxiv.org/schemas/atom', 'label': None}]
>>> p.updated
u'2013-06-04T02:18:08Z'
>>> p.updated_parsed
datetime.datetime(2013, 6, 4, 3, 18, 8)
>>> p.published
u'2008-07-22T15:09:57Z'
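One way such attribute mirroring could work is sketched below. This is an assumption for illustration only; the real core classes may well do it differently (for example by copying the values onto the object when it is loaded). MirroredDocument is a hypothetical name.

```python
class MirroredDocument(object):
    'hypothetical stand-in, not the real spnet base class'

    def __init__(self, doc_dict):
        self._dbDocDict = doc_dict  # the stored JSON dictionary

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so real attributes and methods are never shadowed.
        try:
            return self.__dict__['_dbDocDict'][name]
        except KeyError:
            raise AttributeError(name)


doc = MirroredDocument({'arxiv_doi': '10.3934/jmd.2009.3.159'})
doi = doc.arxiv_doi                        # looked up in _dbDocDict
has_comment = hasattr(doc, 'arxiv_comment')  # key absent, so no attribute
```

A field that is missing from the stored dictionary simply behaves like a missing attribute, so hasattr() and getattr() with a default work as usual.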
Clearly, we can use these data fields to display the same metadata as arXiv.org shows for this paper.
First take a look at the page source for how arXiv displays this paper. Scroll down to the metatable div. As you can see, arXiv uses a simple table and CSS class values to lay out these metadata. Since selectedpapers.net uses arXiv's CSS, we can more or less copy this table format exactly, and use Jinja2 templating to inject the paper's metadata into this format.
Now take a look at our template for displaying a paper. Scroll down to the point just after the class="abstract" blockquote. At this point, we could inject arXiv metadata (e.g. arxiv_doi) as easily as:
{{ paper.arxiv.arxiv_doi }}
- paper is the core.Paper object representing this paper record.
- paper.arxiv is the core.ArxivPaperData object representing the arXiv data for this paper. (Of course, a paper not from arXiv would lack this attribute.)
We could just add these changes to the template, and our feature is done.
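Two practical details are worth keeping in mind when displaying these fields: not every arXiv paper has a DOI, journal reference or comment, and the comment text shown earlier contains embedded newlines and runs of spaces. The following is a minimal sketch of preparing the values for display, under the assumption that we do this in a small helper. The names METADATA_FIELDS, metadata_rows and FakeArxivData, and the display labels, are hypothetical; only the attribute names come from the dir(p) output above.

```python
# Display labels are illustrative; the attribute names are the ones
# shown in the dir(p) output earlier in this tutorial.
METADATA_FIELDS = [
    ('Comments', 'arxiv_comment'),
    ('Journal reference', 'arxiv_journal_ref'),
    ('DOI', 'arxiv_doi'),
]


def metadata_rows(arxiv_data):
    'return (label, value) pairs for the fields this paper actually has'
    rows = []
    for label, attr in METADATA_FIELDS:
        value = getattr(arxiv_data, attr, None)  # None if the field is absent
        if value:
            # collapse embedded newlines and repeated spaces into one space
            rows.append((label, ' '.join(value.split())))
    return rows


class FakeArxivData(object):
    'hypothetical stand-in for an ArxivPaperData object'
    arxiv_doi = '10.3934/jmd.2009.3.159'
    arxiv_journal_ref = 'J. Mod. Dyn. 3 (2009), no. 2, 159--231'
    arxiv_comment = 'See Remark 9.1 of the paper for more\n  information. 82 pages, 43 figures'


rows = metadata_rows(FakeArxivData())
```

A template could then loop over such rows and emit one table row per pair, so papers without a given field show nothing for it.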
If you're curious how the template receives its variables (e.g. paper), read on. The spnet server follows the popular REST model, which specifies that a data view page would be a GET HTTP request with a URI of the form https://selectedpapers.net/COLLECTION/ID, where COLLECTION is the name of a particular kind of data, and ID is the unique ID of a particular data object in that collection. E.g. https://selectedpapers.net/arxiv/1308.0729. This is handled by the spnet code as follows:
- the root of the webserver is a spnet.web.Server object. When a request is received (e.g. /arxiv/1308.0729), the CherryPy dispatcher looks for a corresponding attribute on the server object whose name matches the URI (in this example case, if the server object is server, it would look for its attribute server.arxiv). It then calls that attribute's default method with the remaining elements of the URI and GET/POST keyword arguments: for our example this would be server.arxiv.default('1308.0729', **kwargs).
- in the spnet code, such collection attributes are generally spnet.rest.Collection instances (typically subclassed to provide appropriate methods for that specific collection). These implement REST in the following simple way: first, based on the HTTP request verb (e.g. GET) it calls its associated request method (formed by just prepending an underscore to the verb, e.g. _GET()), which returns the requested data object (in this example, an spnet.core.Paper object representing this arXiv paper); second, based on the requested response type (typically html), it calls the associated response method (formed by appending the verb and response type, e.g. get_html()), which returns the response string (i.e. HTML).
- by default, spnet.rest.Collection objects automatically look for templates they can bind as response methods. Specifically, any template file of the form VERB_COLLECTION.TYPE (e.g. get_paper.html) will automatically be bound as a spnet.view.TemplateView callable to Collection objects that serve that collection name. For example, the server.arxiv object specifies that it serves paper objects, so it binds spnet/_templates/get_paper.html as its server.arxiv.get_html() method.
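The two-step dispatch described in these bullets can be sketched in a few lines. This is a simplified illustration, not the spnet.rest.Collection code: in particular, the verb and response type are passed here as keyword arguments (an assumption made to keep the sketch self-contained), whereas a real server would take them from the HTTP request.

```python
class Collection(object):
    'simplified, hypothetical stand-in for spnet.rest.Collection'

    def __init__(self, records):
        self.records = records  # a dict plays the role of the database

    def default(self, doc_id, verb='GET', response_type='html', **kwargs):
        # step 1: the request method, e.g. _GET(), returns the data object
        request_method = getattr(self, '_' + verb)
        doc = request_method(doc_id, **kwargs)
        # step 2: the response method, e.g. get_html(), returns a string
        response_method = getattr(self, verb.lower() + '_' + response_type)
        return response_method(doc, **kwargs)

    def _GET(self, doc_id, **kwargs):
        return self.records[doc_id]

    def get_html(self, doc, **kwargs):
        return '<h1>%s</h1>' % doc['title']


class Server(object):
    'the dispatcher finds collections as attributes of the server object'
    arxiv = Collection({'1308.0729': {'title': 'Example paper'}})


server = Server()
page = server.arxiv.default('1308.0729')  # handles GET /arxiv/1308.0729
```

In spnet the get_html() step would be a bound template rather than a hand-written method, but the lookup by constructed method name is the same idea.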
Finally, let's look at how spnet.view.TemplateView passes data to the template. The code is pretty trivial so let's just look at it:
class TemplateView(object):
    exposed = True
    def __init__(self, template, name=None, **kwargs):
        self.template = template
        self.kwargs = kwargs
        self.name = name
    def __call__(self, doc=None, **kwargs):
        f = self.template.render
        kwargs.update(self.kwargs)
        session = get_session()
        try:
            kwargs.update(session['viewArgs'])
        except KeyError:
            pass
        if doc is not None:
            kwargs[self.name] = doc
        try:
            user = session['person']
        except KeyError:
            user = session['person'] = None
        if user and user.force_reload():
            user = user.__class__(user._id) # reload from DB
            session['person'] = user # save on session
        return f(kwargs=kwargs, hasattr=hasattr, enumerate=enumerate,
                 urlencode=urllib.urlencode, list_people=people_link_list,
                 getattr=getattr, str=str, map=map_helper, user=user,
                 display_datetime=display_datetime, timesort=timesort,
                 recentEvents=recentEventsDeque, len=len,
                 Selection=webui.Selection, **kwargs) # apply template
For our purpose the only parts of this that matter are:
- doc is the requested data object, a Python object representing a MongoDB "document". In our example this would be a spnet.core.Paper object. Note this is passed to the template as a keyword argument whose name is simply the "document name" for this collection, i.e. paper.
- user is the spnet.core.Person object representing the current login. To get this, we just look it up from CherryPy's session dictionary representing the current session. This is passed to the template as the user keyword arg.
- any keyword arguments from the GET/POST request are also passed to the template via kwargs.
- a number of other convenience functions are also passed to the template, e.g. so we can use len() in our templates.
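The order in which TemplateView.__call__ fills kwargs determines which value wins when the same name arrives from more than one place. The sketch below isolates just that part of the code above. The function build_template_args is a hypothetical helper written for illustration, and a plain dict stands in for the CherryPy session.

```python
def build_template_args(name, doc, request_kwargs, view_kwargs, session):
    'mirror the order in which TemplateView.__call__ fills kwargs'
    kwargs = dict(request_kwargs)  # start from the GET/POST arguments
    kwargs.update(view_kwargs)     # arguments fixed when the view was created
    kwargs.update(session.get('viewArgs', {}))  # per-session view settings
    if doc is not None:
        kwargs[name] = doc         # e.g. kwargs['paper'] = the Paper object
    return kwargs


args = build_template_args(
    name='paper',
    doc={'title': 'Example paper'},
    request_kwargs={'view': 'full', 'paper': 'ignored'},
    view_kwargs={},
    session={'viewArgs': {'view': 'compact'}},
)
```

Because the document is assigned last, a request parameter that happens to be called paper cannot replace the data object the template receives.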
Finally, you can see how the actual "application tree" of REST Collection objects is constructed, by looking at spnet/apptree.py and scrolling down to the get_collections() function.
Returning to our original arXiv-DOI feature idea, it may be worthwhile to think a bit more about how these data should be organized, and where such display features are best implemented.
- first, let's consider the relationship between the different "paper databases" that selectedpapers.net currently covers: arXiv, DOI, and PubMed. Note that these are not mutually exclusive, either in reality or in how spnet/core models them. In principle, a paper could have an ID in all three of these databases, and in that case we'd probably like to unify all three of those IDs to a single paper record in selectedpapers.net. That way people commenting on the different IDs (arXiv, DOI or PubMed) would all be integrated into a single conversation about that paper, rather than fragmented into different records. This would be very nice, e.g. if a paper is first discussed as an arXiv preprint, and later as a published paper (DOI).
- Concretely, the spnet data model takes advantage of the flexibility of the mongoDB "NoSQL" database. In a regular SQL database we'd have to store arXiv, DOI and PubMed records as three separate tables. MongoDB allows us to "embed" records within another record. Specifically, we have one collection (the MongoDB analogue of a table) that stores Paper records. Each Paper record can have arXiv, DOI and PubMed records embedded within it (or none of these, if the paper isn't linked to any of those databases). These embedded records can be added at any time; for example, a Paper record could initially be created with an embedded ArxivPaperData record, and later, when the paper is published, a DoiPaperData embedded record could be added.
- Currently, selectedpapers.net correctly implements that "record unification" for DOI and PubMed, but not for arXiv vs. the other databases (basically because we didn't know how to look up the DOI for an arXiv paper, and vice versa).
- Now that we have the arxiv_doi metadata, we could implement that unification fairly easily. We can follow the model for how PubmedPaperData and DoiPaperData link to each other (e.g. where the PubMed data provide a DOI), by looking at spnet/core.py.
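To make the embedding idea concrete, here is a rough picture of what a single unified Paper document might look like as a Python dict. The exact field names and values are assumptions for illustration (only arxiv, as in paper.arxiv, and arxiv_doi appear earlier in this tutorial), and the helper linked_databases is hypothetical.

```python
# Illustrative values only; real documents hold many more fields.
paper_doc = {
    '_id': 'example-paper-id',
    'title': 'Example paper',
    'authorNames': ['A. Author'],
    # embedded subdocument created when the preprint was first seen
    'arxiv': {'id': '0807.3498', 'arxiv_doi': '10.3934/jmd.2009.3.159'},
}

# Later, once the published version is known, a DOI subdocument is embedded
# in the same record instead of starting a second, separate paper record.
# The key name 'doi' is an assumption.
paper_doc['doi'] = {'id': '10.3934/jmd.2009.3.159'}


def linked_databases(doc, names=('arxiv', 'doi', 'pubmed')):
    'list which external databases this paper record is linked to'
    return [name for name in names if name in doc]


linked = linked_databases(paper_doc)
```

Comments attached to this one record are then shared by everyone who reaches the paper through any of its IDs.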
If you look at the PubmedPaperData class you'll see code like:
class PubmedPaperData(EmbeddedDocument):
    'store pubmed data for a paper as subdocument of Paper'
    ...
    def _insert_parent(self, d):
        'create Paper document in db for this arxiv.id'
        try: # connect with DOI record
            DOI = d['doi']
            return DoiPaperData(DOI=DOI, insertNew='findOrInsert',
                                getPubmed=False).parent
        except KeyError: # no DOI, so save as usual
            return Paper(docData=dict(title=d['title'],
                                      authorNames=d['authorNames']))
- the _insert_parent() method for any embedded ("subdocument") class implements insertion of a new "parent" document record (in which this subdocument will be embedded). It must return the object representing that newly inserted record.
- in the PubmedPaperData case, it simply checks whether the PubMed data provide a DOI. If so, it creates a DoiPaperData record (with that DOI) and returns its parent document (i.e. the Paper document in which that DoiPaperData subdocument is embedded). As a result, our new PubmedPaperData subdocument will be embedded in that same Paper document (i.e. that Paper document will have both an embedded DoiPaperData and an embedded PubmedPaperData).
- of course, if there's no DOI, it has no choice but to create a new Paper document filled in with only the minimum required data, title and authorNames.
We can easily apply this model to ArxivPaperData. All we have to do is add the same try...except clause. I.e. change the current ArxivPaperData._insert_parent() code:
class ArxivPaperData(EmbeddedDocument):
    ...
    def _insert_parent(self, d):
        'create Paper document in db for this arxiv.id'
        return Paper(docData=dict(title=d['title'],
                                  authorNames=d['authorNames']))
to:
class ArxivPaperData(EmbeddedDocument):
    ...
    def _insert_parent(self, d):
        'create Paper document in db for this arxiv.id'
        try: # connect with DOI record
            DOI = d['arxiv_doi']
            return DoiPaperData(DOI=DOI, insertNew='findOrInsert').parent
        except KeyError: # no DOI, so save as usual
            return Paper(docData=dict(title=d['title'],
                                      authorNames=d['authorNames']))
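The effect of this change can be tried out without a database using in-memory stand-ins. The sketch below is not spnet code: papers, doi_index, new_paper, parent_for_doi and insert_arxiv_parent are hypothetical names, dicts replace the Paper and DoiPaperData classes, and the key name doi is an assumption.

```python
papers = []     # plays the role of the Paper collection
doi_index = {}  # maps a DOI to the paper dict that embeds it


def new_paper(d):
    'create a paper record holding only the minimum required data'
    paper = {'title': d['title'], 'authorNames': d['authorNames']}
    papers.append(paper)
    return paper


def parent_for_doi(doi, d):
    'find or insert the paper record that embeds this DOI'
    try:
        return doi_index[doi]
    except KeyError:
        paper = new_paper(d)
        paper['doi'] = {'id': doi}
        doi_index[doi] = paper
        return paper


def insert_arxiv_parent(d):
    'same decision as the modified _insert_parent(), on plain dicts'
    try:
        doi = d['arxiv_doi']
    except KeyError:  # no DOI, so create a fresh paper record
        paper = new_paper(d)
    else:             # DOI known: reuse (or create) the DOI paper record
        paper = parent_for_doi(doi, d)
    paper['arxiv'] = d  # embed the arXiv data in its parent
    return paper


published = parent_for_doi('10.3934/jmd.2009.3.159',
                           {'title': 'T', 'authorNames': ['A']})
preprint = insert_arxiv_parent({'id': '0807.3498', 'title': 'T',
                                'authorNames': ['A'],
                                'arxiv_doi': '10.3934/jmd.2009.3.159'})
other = insert_arxiv_parent({'id': '1308.0729', 'title': 'U',
                             'authorNames': ['B']})
```

Here the preprint ends up embedded in the same record as the published DOI, while the paper without a DOI gets its own record. Note that this sketch keeps only the dictionary lookup inside the try block, so that a KeyError raised elsewhere is not mistaken for a missing DOI; that is a choice made for the sketch, not a description of the spnet code.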