Terminology
Working with annotator, particularly with the Pipeline convenience class, requires familiarity with some terminology.
Pipeline - A wrapper around the utils module for working with common annotation-related objects such as file views, metadata tables, and the relationships between them. Ideally, anything you can do with a Pipeline object can also be done using the utils module directly, but as of this writing (January 2017) that's not yet the case.
View - A generic term for tabular data (such as file views, .csv files, etc.). Normally implemented as a pandas DataFrame.
Data view - The view representing our data. Usually derived from a file view.
Meta view - The view representing our metadata.
Active Column - A so-called "column of interest". Oftentimes our views contain tens or hundreds of columns but we are only interested in a few of them. Marking these columns as "active columns" allows us to stay organized by, for example, only printing active columns to the console or only running verification checks on active columns.
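As a rough illustration of the idea, active columns can be thought of as a list of column names used to subset a view. The sketch below uses plain pandas; the DataFrame contents and the active_columns variable are made up for illustration and are not the Pipeline class's own interface.

```python
import pandas as pd

# Hypothetical data view; the column names and values are illustrative only.
data_view = pd.DataFrame({
    "name": ["sample_A1.bam", "sample_B2.bam"],
    "assayTarget": [None, None],
    "fileFormat": ["bam", "bam"],
    "createdBy": ["1234", "5678"],
})

# The "active columns" are simply the subset of columns we currently care about.
active_columns = ["name", "assayTarget"]

# Restrict printing (or any later check) to the active columns.
active_view = data_view[active_columns]
print(active_view)
```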
Key - A preferably unique value shared between the data view and the meta view, used to match files in the data view with their corresponding metadata in the meta view. If we're smart or lucky, the two views already share a key. Usually we are neither, and we must create a new column in the data view whose values are derived (using, e.g., a regular expression) from a pre-existing column in the data view and match the values of an appropriate column in the meta view. For example, you might apply a regular expression to the name column in the data view to match the specimenID column in the meta view.
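Deriving a key with a regular expression might look something like the following pandas sketch. The file names, the specimen ID format, and the pattern itself are assumptions chosen for the example; a real data view will need its own pattern.

```python
import pandas as pd

# Hypothetical views; the file names and ID format are assumptions.
data_view = pd.DataFrame({"name": ["SPEC001_H3K27ac.bam", "SPEC002_input.bam"]})
meta_view = pd.DataFrame({"specimenID": ["SPEC001", "SPEC002"]})

# Derive the key: capture everything before the first underscore in `name`.
data_view["specimenID"] = data_view["name"].str.extract(
    r"^([A-Za-z0-9]+)_", expand=False
)

# Sanity check: list files whose derived key has no match in the meta view.
unmatched = data_view.loc[
    ~data_view["specimenID"].isin(meta_view["specimenID"]), "name"
]
print(data_view)
print("Unmatched files:", list(unmatched))
```

Checking for unmatched keys right after deriving them is a cheap way to catch a pattern that is too strict or too loose before any values are transferred.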
Links - When annotating data contained in a data view, there are usually a few one-to-one relationships between columns in the data view and columns in the meta view. For example, the assayTarget column in the data view might map to the Histone Mark/Input column in the meta view. A single such mapping is called a link, and the full set of mappings is called the links. Usually we want to copy the values from the linked columns in the meta view to their linked columns in the data view, aligned on some key like specimenID. We call this process of copying and aligning values (i.e., a specific type of "merge" or "join") a link transfer, from the metadata to the data.
Link Transfer - A left join of linked columns in the meta view aligned upon a (preferably) unique key value in the data view. Joined columns are then copied over to their respective linked column (the only difference being that the linked column in the data view often has a different name than its corresponding linked column in the meta view). After being copied over to the appropriately named column, the newly joined columns are deleted from the data view.
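The join, copy, and delete steps described above can be sketched in plain pandas as follows. This is a minimal illustration, not the package's own implementation: the link_transfer function, the links dictionary layout (data view column mapped to meta view column), and the sample values are all hypothetical. It assumes the key is unique in the meta view and that linked column names differ between the two views.

```python
import pandas as pd

# Hypothetical views; the values are illustrative only.
data_view = pd.DataFrame({
    "specimenID": ["SPEC001", "SPEC002", "SPEC003"],
    "assayTarget": [None, None, None],
})
meta_view = pd.DataFrame({
    "specimenID": ["SPEC001", "SPEC002"],
    "Histone Mark/Input": ["H3K27ac", "Input"],
})

# Assumed representation of links: data view column -> meta view column.
links = {"assayTarget": "Histone Mark/Input"}


def link_transfer(data, meta, links, key):
    """Left-join linked meta columns onto data, copy them over, then drop them."""
    meta_cols = list(links.values())
    # validate= raises if the key is not unique in the meta view.
    merged = data.merge(
        meta[[key] + meta_cols], on=key, how="left", validate="many_to_one"
    )
    for data_col, meta_col in links.items():
        merged[data_col] = merged[meta_col]  # copy into the linked data column
    return merged.drop(columns=meta_cols)    # delete the temporary joined columns


result = link_transfer(data_view, meta_view, links, key="specimenID")
print(result)
```

Because this is a left join, rows in the data view with no matching key in the meta view (SPEC003 here) are kept and end up with a missing value in the linked column.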
Publish - Once we are finished filling out our data view with annotations, we need to push our changes back to Synapse. This is termed publishing rather than pushing, because there is a validation process run before pushing the changes to Synapse to guard against malformed or missing values in the data view. Checking active columns for missing values or cross-checking column values against a provided schema are two such steps involved in the validation process.
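The two validation steps mentioned above (missing values in active columns, and values checked against a schema) could be approximated along these lines. The validate function and the schema format (column name mapped to a list of allowed values) are assumptions made for this sketch, and the actual upload to Synapse is deliberately left out.

```python
import pandas as pd


def validate(view, active_columns, schema):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for col in active_columns:
        # Check 1: missing values in an active column.
        n_missing = int(view[col].isna().sum())
        if n_missing:
            problems.append(f"{col}: {n_missing} missing value(s)")
        # Check 2: values outside the allowed set, if the schema lists one.
        allowed = schema.get(col)
        if allowed is not None:
            bad = sorted(set(view[col].dropna()) - set(allowed))
            if bad:
                problems.append(f"{col}: values not in schema: {bad}")
    return problems


# Hypothetical data view and schema, for illustration only.
data_view = pd.DataFrame({
    "assayTarget": ["H3K27ac", "Input", None],
    "fileFormat": ["bam", "bamm", "bam"],
})
schema = {"assayTarget": ["H3K27ac", "Input"], "fileFormat": ["bam", "fastq"]}

problems = validate(data_view, ["assayTarget", "fileFormat"], schema)
if problems:
    print("Not publishing; fix these first:")
    for problem in problems:
        print(" -", problem)
```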