Simple Feature Tutorial

Goal: implementing a basic feature

This tutorial introduces the basics of working with spnet data by walking through the implementation of a simple feature. Let's implement a feature proposed on the issue tracker: display of arXiv metadata. This is fairly simple to implement because we'll work with existing data (no need to write code to fetch new kinds of data). We just need to figure out what paper metadata arXiv has already given us, and choose a nice way to present it. For example, we could follow the existing model of how arXiv displays that metadata.

Let's start with an example arXiv paper.

Here's how it renders on selectedpapers.net.
Here's how it renders on arXiv.

The DOI would enable spnet to connect this arXiv paper to the final published version. The comments and journal reference would be nice to show too.

Looking at internal spnet data

Now let's see what metadata spnet receives from arXiv by viewing the raw data directly. To get easy command-line access to the raw data, we can either start an spnet web server with Python's -i flag, which leaves an interactive prompt open:

python -i web.py

Alternatively we can simply connect to the database manually:

python
>>> import connect
>>> dbconn = connect.init_connection()
>>> import core

At this point we can directly query the data we want:

>>> p = core.ArxivPaperData('0807.3498', insertNew='findOrInsert')

spnet/core.py defines the core spnet data classes.
ArxivPaperData represents the data for an arXiv paper.
For all core classes, obtaining the record for a specified object is as easy as providing its ID.
The setting insertNew='findOrInsert' tells it to first check your mongoDB for this record, but if not found, then to use the class's external query method to retrieve it from an external source (in this case arXiv.org's API), and to insert the result into mongoDB (ensuring that henceforth this record will definitely be in the mongoDB database).
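The find-or-insert behavior described above can be sketched in a few lines. This is only an illustration of the pattern, not spnet's code: a plain dict stands in for the mongoDB collection, and fetch_external and fake_arxiv_query are hypothetical placeholders for the class's external query method.

```python
# Illustrative stand-in only: a dict plays the role of the mongoDB collection,
# and fetch_external is a hypothetical placeholder for the arXiv API query.
def find_or_insert(collection, doc_id, fetch_external):
    """Return the stored record for doc_id, fetching and storing it if absent."""
    doc = collection.get(doc_id)
    if doc is None:
        doc = fetch_external(doc_id)  # only reached when the record is missing
        collection[doc_id] = doc      # later lookups are served locally
    return doc


calls = []  # records how often the "external source" is contacted


def fake_arxiv_query(doc_id):
    """Hypothetical external query; returns placeholder data."""
    calls.append(doc_id)
    return {'id': doc_id, 'title': 'placeholder title'}


store = {}
first = find_or_insert(store, '0807.3498', fake_arxiv_query)
second = find_or_insert(store, '0807.3498', fake_arxiv_query)
```

The second call never reaches the external source, which is the point of the setting: once a record has been retrieved, it is served from the local database.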

We can inspect the data using the standard Python dir() builtin function:

>>> dir(p)
['__class__', '__cmp__', '__delattr__', '__dict__', '__doc__', '__format__',
'__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_dbDocDict', '_dbfield', '_get_doc', '_insert_parent', '_isNewInsert',
'_parent_link', '_query_external', '_set_parent', '_spnet_url_base', 'array_append',
'array_del', 'arxiv_comment', 'arxiv_doi', 'arxiv_journal_ref', 'arxiv_primary_category',
'author', 'authorNames', 'author_detail', 'authors', 'check_required_fields', 'coll',
'delete', 'find', 'find_obj', 'find_or_insert', 'get_abstract', 'get_doctag',
'get_downloader_url', 'get_hashtag', 'get_local_url', 'get_source_url', 'get_spnet_url',
'get_value', 'guidislink', 'id', 'insert', 'link', 'links', 'parent', 'published',
'published_parsed', 'set_attrs', 'summary', 'summary_detail', 'tags', 'title',
'title_detail', 'update', 'updated', 'updated_parsed', 'useObjectId']

As usual, dir() shows a mix of standard methods, data attributes provided by arXiv, and attributes generated by our class.
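If the mix is hard to read, one option is to filter the dir() output down to public, non-callable names. The helper below is a sketch, not part of spnet; FakePaperData is a hypothetical stand-in so the example runs without a database. Note that it calls getattr() on every public name, which on a real core object could trigger whatever lookups that attribute performs.

```python
def data_attribute_names(obj):
    """List the public, non-callable attribute names of obj, in dir() order."""
    names = []
    for name in dir(obj):
        if name.startswith('_'):
            continue  # skip private and special names
        if callable(getattr(obj, name)):
            continue  # skip methods
        names.append(name)
    return names


class FakePaperData(object):
    """Hypothetical stand-in, not the real core.ArxivPaperData."""

    def __init__(self):
        self.arxiv_doi = '10.3934/jmd.2009.3.159'
        self.title = 'placeholder title'
        self._dbDocDict = {}

    def get_abstract(self):
        return ''


fields = data_attribute_names(FakePaperData())
```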

Alternatively, every core object also gives direct access to the JSON dictionary stored in the mongoDB database, via its _dbDocDict attribute:

>>> p._dbDocDict.keys()
['authorNames', 'updated', u'arxiv_doi', 'updated_parsed', 'published_parsed', 'title',
'authors', 'summary_detail', u'arxiv_journal_ref', 'summary', 'links', 'guidislink',
'title_detail', 'tags', 'link', 'author', 'published', 'author_detail', 'id',
u'arxiv_primary_category', u'arxiv_comment']
>>> p._dbDocDict['arxiv_doi']
u'10.3934/jmd.2009.3.159'
>>> p._dbDocDict['arxiv_journal_ref']
u'J. Mod. Dyn. 3 (2009), no. 2, 159--231'
>>> p._dbDocDict['arxiv_comment']
u'Errors have been corrected in Section 9 from the prior and published\n  versions of this paper. In particular, the formulas associated to homology\n  classes of curves corresponding to stable periodic billiard paths in obtuse\n  Veech triangles were corrected. See Remark 9.1 of the paper for more\n  information. The main results and the results from other sections are\n  unaffected. 82 pages, 43 figures'
>>> p._dbDocDict['arxiv_primary_category']
{'term': u'math.DS', 'scheme': u'http://arxiv.org/schemas/atom'}
>>> p._dbDocDict['tags']
[{'term': u'math.DS', 'scheme': u'http://arxiv.org/schemas/atom', 'label': None}, {'term': u'37D50 (Primary) 37E15, 51M04 (Secondary)', 'scheme': u'http://arxiv.org/schemas/atom', 'label': None}]
>>> p._dbDocDict['updated']
u'2013-06-04T02:18:08Z'
>>> p._dbDocDict['updated_parsed']
datetime.datetime(2013, 6, 4, 3, 18, 8)
>>> p._dbDocDict['published']
u'2008-07-22T15:09:57Z'

Clearly, we can use these data fields to display the same metadata as arXiv.org shows for this paper.
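As a rough sketch of how these fields might be prepared for display, the function below picks out the optional arXiv fields and tidies them. The function name and output keys are made up for illustration, the https://doi.org/ link form is an assumption about how we would want to link a DOI, and the sample comment is abbreviated from the session above.

```python
def arxiv_display_fields(doc):
    """Collect display-ready metadata from an arXiv record dict.

    Fields missing from doc are simply left out of the result.
    """
    fields = {}
    doi = doc.get('arxiv_doi')
    if doi:
        fields['doi'] = doi
        fields['doi_url'] = 'https://doi.org/' + doi  # assumed link form
    journal_ref = doc.get('arxiv_journal_ref')
    if journal_ref:
        fields['journal_ref'] = journal_ref
    comment = doc.get('arxiv_comment')
    if comment:
        # The stored comment contains newlines and indentation; collapse them.
        fields['comment'] = ' '.join(comment.split())
    category = doc.get('arxiv_primary_category')
    if category:
        fields['primary_category'] = category['term']
    return fields


sample = {
    'arxiv_doi': '10.3934/jmd.2009.3.159',
    'arxiv_journal_ref': 'J. Mod. Dyn. 3 (2009), no. 2, 159--231',
    'arxiv_comment': 'See Remark 9.1 of the paper for more\n'
                     '  information. 82 pages, 43 figures',
    'arxiv_primary_category': {'term': 'math.DS',
                               'scheme': 'http://arxiv.org/schemas/atom'},
}
shown = arxiv_display_fields(sample)
```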

Deciding how to display the metadata

First, take a look at the page source for how arXiv displays this paper. Scroll down to the metatable div. As you can see, arXiv uses a simple table and CSS class values to lay out this metadata. Since selectedpapers.net uses arXiv's CSS, we can more or less copy this table format exactly, and use Jinja2 templating to inject the paper's metadata into it.

Now take a look at our template for displaying a paper. Scroll down to the point just after the class="abstract" blockquote. At this point, we could inject arXiv metadata (e.g. arxiv_doi) as easily as:

{{ paper.arxiv.arxiv_doi }}

paper is the core.Paper object representing this paper record.
paper.arxiv is the core.ArxivPaperData object representing the arXiv data for this paper. (Of course, a paper not from arXiv would lack this attribute.)
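Because paper.arxiv may be absent, and an arXiv record need not carry every optional field, any display code should guard its lookups. Here is a minimal sketch of such a guard in plain Python; arxiv_field, FakePaper and FakeArxivData are hypothetical names used only so the example runs on its own, and the same check could equally be written as a condition in the template.

```python
def arxiv_field(paper, name, default=None):
    """Return paper.arxiv.<name>, or default if either piece is missing."""
    arxiv = getattr(paper, 'arxiv', None)
    if arxiv is None:
        return default  # paper did not come from arXiv
    return getattr(arxiv, name, default)


class FakeArxivData(object):
    """Hypothetical stand-in for core.ArxivPaperData."""
    arxiv_doi = '10.3934/jmd.2009.3.159'


class FakePaper(object):
    """Hypothetical stand-in for core.Paper."""
    pass


from_arxiv = FakePaper()
from_arxiv.arxiv = FakeArxivData()
not_from_arxiv = FakePaper()

doi = arxiv_field(from_arxiv, 'arxiv_doi')
missing = arxiv_field(not_from_arxiv, 'arxiv_doi')
```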

We could just add these changes to the template, and our feature would be done.

Deeper data model considerations

However, in this case it may be worthwhile to think a bit more about how these data should be organized, and where such display features are best implemented.

First, let's consider the relationship between the different "paper databases" that selectedpapers.net currently covers: arXiv, DOI, and PubMed. Note that these are not mutually exclusive, either in reality or in how spnet/core models them. In principle, a paper could have an ID in all three of these databases, and in that case we'd probably like to unify all three of those IDs into a single paper record in selectedpapers.net. That way, comments made under the different IDs (arXiv, DOI or PubMed) would all be integrated into a single conversation about that paper, rather than fragmented across different records. This would be very nice, e.g. if a paper is first discussed as an arXiv preprint, and later as a published paper (DOI).
Currently, selectedpapers.net correctly implements that "record unification" for DOI and PubMed, but not for arXiv vs. the other databases.
Now that we have the arxiv_doi metadata, we could implement that unification fairly easily. We can follow the model for how PubmedPaperData and DoiPaperData link to each other (e.g. where the PubMed data provide a DOI), by looking at spnet/core.py.
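To make the idea concrete, here is a toy sketch of unification keyed on DOI. It is not how spnet/core.py links records; the tuple layout, the function name and the third sample record are all hypothetical. DOIs are compared in lowercase because DOI names are case-insensitive.

```python
def unify_by_doi(records):
    """Group (database, record_id, doi) entries so shared DOIs share a bucket.

    Entries without a DOI stay separate, keyed by (database, record_id).
    """
    papers = {}
    for database, record_id, doi in records:
        key = doi.lower() if doi else (database, record_id)
        papers.setdefault(key, []).append((database, record_id))
    return papers


records = [
    ('arxiv', '0807.3498', '10.3934/jmd.2009.3.159'),
    ('doi', '10.3934/jmd.2009.3.159', '10.3934/jmd.2009.3.159'),
    ('arxiv', 'hypothetical-id-without-doi', None),
]
merged = unify_by_doi(records)
```

Here the arXiv record and the DOI record land in the same bucket, so comments on either ID would attach to one paper, while the record without a DOI remains on its own.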
