Resource References

Status: DRAFT

Summary
Rationale
User Stories
Scope
Design and Implementation Plan
Test/Demo Plan
Unresolved Issues

Summary

Storing data in Opal's database becomes limiting when data get very large (billions of rows, millions of variables) or complex (not a simple two-dimensional dataset, but possibly several related ones, as in a domain-specific database schema). With such big or complex datasets, it makes no sense to copy them into Opal's space. Only the way to access them (authentication may apply) and possibly the way to retrieve their metadata (size, timestamps, description, etc.) should be needed.

When it comes to analysing such large data, most likely in an R/DataSHIELD environment, the same limitations apply because the computation capabilities of the R server may be limited. It should therefore be possible to describe computation resources as well.

To address these needs, the concept of a "reference to a resource" is introduced in this specification. A resource can be a dataset (e.g. a file stored in the local file system or in a file store server, a database table, etc.) or a server with some computation capabilities (e.g. execution of the plink command through ssh, connection to some secure web services, etc.).

There is already a standard for describing a reference to a resource: the Uniform Resource Locator (URL). This standard is flexible and robust enough (the Web is based on it) to define the location of either a data resource or a service resource.

This will make Opal more generic: from a data repository, it becomes a trusted party that connects providers of data/computation resources with users in a controlled R/DataSHIELD environment. As such, it will extend R/DataSHIELD analytic capabilities to large/complex data.

Rationale

The goal is to access big or complex datasets without facing the limitations of copying data into Opal and of forcing a domain-specific data model into a two-dimensional one. The ultimate usage envisioned is the DataSHIELD context, where resources must be fully available on the server side (to conduct analytics) without disclosing private information to the client side.

User Stories

Describe simply who is doing what and how to obtain the result.

#	Who	What	How	Result
1	data manager	registers a new resource	by filling a form in a project	resource is referenced
2	data manager	applies permissions on a resource	by selecting the resource, subject type and name and level of permission	access to the resource is restricted
3	R/DataSHIELD developer	makes use of a new type of resource	by developing some R code that extends the concepts associated with the resource	new types of resources can be used to access data or perform computations
4	developer	defines a new type of resource, with its form	by developing a resource plugin	new types of resources can be created
5	R/DataSHIELD user	assigns a resource to an R/DataSHIELD session	by specifying the project's resource unique name	the resource can be used in the R server to perform analysis

Scope

Opal internal database, web services and GUI.

Design and Implementation Plan

Domain

To facilitate the creation of a resource URL, a form can be provided to capture the different parts of the URL in a meaningful manner. The type of the form depends on the scheme of the URL and/or on the intended usage of the resource.

Examples:

Resource type	URL example
File that can be downloaded	`https://example.org/path/to/the/file.csv`
File that can be read locally	`file:/path/to/the/file.csv` or `file:///path/to/the/file.csv`
MySQL database table	`mysql://example.org/database/table`
MongoDB database collection	`mongodb://example.org:27017/database/collection`
Directory where commands can be executed through SSH	`ssh://example.org:22/path/to/data/directory?exec=plink,ls`
File that can be downloaded through SSH	`scp://example.org:22/path/to/the/file.csv`
Application web services that can be triggered	`https://app.example.org/`
File that can be downloaded from Amazon Web Services S3 file store	`s3://bucket/path/to/the/file/object.vcf`
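
To illustrate how the parts of such URLs can be told apart, here is a minimal base R sketch. The function name `parseResourceURL` and the regular expression are assumptions made for this example and are not part of Opal; user info segments and IPv6 hosts are deliberately not handled.

```r
# Hypothetical helper (not part of Opal): split a resource URL into
# scheme, host, port, path and raw query string.
parseResourceURL <- function(url) {
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*):(//([^/?#:]*)(:([0-9]+))?)?([^?#]*)(\\?([^#]*))?"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) {
    stop("Not a valid resource URL: ", url)
  }
  list(
    scheme = m[2],
    host   = m[4],              # empty string when there is no host
    port   = as.integer(m[6]),  # NA when no port is given
    path   = m[7],
    query  = m[9]               # empty string when there is no query
  )
}

parts <- parseResourceURL("ssh://example.org:22/path/to/data/directory?exec=plink,ls")
parts$scheme  # "ssh"
parts$port    # 22
parts$query   # "exec=plink,ls"
```

A form for a given resource type would collect these same parts from the user and assemble them in the opposite direction.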

Resource Interface

The resource interface is rather minimalistic as it only needs to:

be uniquely identified at R/DataSHIELD assignment time,
be convertible into a URL,
have (optional) credentials when authentication applies.

Property/Method	Description
`String: name`	Unique name in the project where it is defined
`String: format`	An optional data format name that R can use to interpret the data described by the resource, see Coercing a Resource to a R Object
`Credentials: getCredentials()`	Authentication info, see Credentials
`URL: toURL()`	Get resource as a URL

Credentials

Although it is possible to include the user info segment in the URL, in the form of user:password, this is strongly discouraged for obvious security reasons. If authentication is required for accessing a resource, the credentials are stored separately from the URL. A DataSHIELD user with restricted access to the resource is then able to read the URL but not the associated credentials.

The credentials that are supported are defined by the pair:

Property	Description
`String: identity`	can be a user name, an account identifier or a certificate PEM file path; it is optional (because the `secret` may be enough to identify the requester)
`String: secret`	can be a password, a token or a private key PEM file path

Grouping Resources

Because several resources can be involved in an analysis pipeline (e.g. a computation resource and the files that are to be processed), it should be possible to group resources, in order to apply consistent permissions to them and to assign the whole group into R in one request.

Permissions

A DataSHIELD user can have permission to see a dataset's meta-data (the data dictionary) but not the individual-level data.

For a resource, an equivalent permission would be to see the URL of the resource and its meta-data (if any, e.g. file size) but not the associated credentials.

Resource permission	Description
`View meta-data`	Read only permission without access to the resource's credentials (DataSHIELD compliant permission)
`View`	Read only permission without restriction
`Administrate`	Edit and delete permissions
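
The difference between these permission levels can be sketched in R as follows. This is only an illustration of the intended behaviour, assuming a hypothetical `resourceView()` helper; the actual enforcement would happen inside Opal, not in R.

```r
# Hypothetical helper: what a user gets to see of a resource,
# depending on the permission level listed in the table above.
resourceView <- function(res, permission = c("View meta-data", "View", "Administrate")) {
  permission <- match.arg(permission)
  if (permission == "View meta-data") {
    # DataSHIELD compliant: the URL stays readable, the credentials do not
    res$identity <- NULL
    res$secret <- NULL
  }
  res
}

res <- structure(
  list(
    name = "CNSIM1",
    url = "mysql://db.example.org/datashield/table",
    identity = "dbuser",
    secret = "dbpassword"
  ),
  class = "resource"
)

names(resourceView(res, "View meta-data"))  # "name" "url"
names(resourceView(res, "View"))            # "name" "url" "identity" "secret"
```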

To use the files stored in the Opal file system, similar permissions need to be applied: see file names and list directory content, but not download files. Currently a user has either all or no permissions on a directory and its content.

File permission	Description
`View meta-data`	Read only permission without download (DataSHIELD compliant permission)
`View`	Read only permission without restriction
`Administrate`	Edit and delete permissions

Plugins

Opal has a plugin architecture that allows extending some of its features (import/export formats for instance). A new type of plugin can be defined, the Resource Plugin, whose purpose is to declare a new type of resource. Such a plugin will:

provide a form to capture resource parameters in a user-friendly manner
make a URL from these parameters
declare the type of the data being described by the resource (if applicable)
test the accessibility of the resource (if possible): get the meta-data (file size, timestamps etc.) or verify that credentials are valid (database or remote server connection)

Base types that will be implemented:

local files
files stored in Opal's file system
files accessible via HTTPS
files accessible via SCP
database connections (MySQL, MariaDB, PostgreSQL, MongoDB)
SSH connection
Shell connection
Amazon Web Services S3 file store

Resource Factory Service

The Resource Plugin is based on the ResourceFactoryService interface that provides ResourceFactory instances that are builders of Resource objects.

ResourceFactoryService Method	Description
`List<ResourceFactory> getResourceFactories()`	Get the factories of resources (several can be defined in a single plugin)

The ResourceFactory class provides schema-forms so that the UI can display a form for the resource parameters (the elements of the URL) and for the credentials (the concept of schema-form already exists in Opal, see DatasourceService).

ResourceFactory Method	Description
`JSONObject: getParametersSchemaForm()`	Get the "schema form" object to collect the resource parameters
`JSONObject: getCredentialsSchemaForm()`	Get the "schema form" object to collect the resource credentials
`Resource: createResource(name, parameters, credentials)`	Factory method that creates a Resource object, the parameters being the result of the schema-form data capture

Common Resources Plugin

Most common resources will be implemented in a base plugin. The structure of this plugin should be extensible, i.e. adding new resources should not require Java skills or compilation. Resource definitions will therefore be based on conventions. For each resource type there will be a folder containing:

the parameters form (JSON file)
the credentials form (JSON file)
a JavaScript function that converts the collected parameters and credentials into a resource object (JS file, can be executed by Java)
a properties file (with name, title, description and possibly other settings)
an R script to be executed prior to the assignment of a new resource in the R server

These folders are discovered at runtime by the plugin's Java code which will make a ResourceFactory per resource folder.

<resource plugin folder>
├── lib
│   └── resource-plugin.jar
├── resources
│   ├── file
│   │   ├── credentials-form.json
│   │   ├── parameters-form.json
│   │   ├── require.R
│   │   ├── settings.properties
│   │   └── toResource.js
│   ├── mongodb
│   │   ├── credentials-form.json
│   │   ├── parameters-form.json
│   │   ├── require.R
│   │   ├── settings.properties
│   │   └── toResource.js
│   ├── sql
│   │   ├── credentials-form.json
│   │   ├── parameters-form.json
│   │   ├── require.R
│   │   ├── settings.properties
│   │   └── toResource.js
│   └── ssh
│       ├── credentials-form.json
│       ├── parameters-form.json
│       ├── require.R
│       ├── settings.properties
│       └── toResource.js
├── plugin.properties
└── site.properties

Web Services

Resource Plugin Web Services

Web services for the UI to get resource plugins details. These details include everything that is needed to build the UI:

name
title
description
parameters schema-form
credentials schema-form

REST	Description
`GET /resource-plugins`	List the resource plugins
`GET /resource-plugin/{p}`	Get a resource plugin

Project Resources Web Services

Web services to manage resources declared in a project.

REST	Description
`GET /project/{p}/resources`	List the resources of a project
`POST /project/{p}/resources`	Create a new resource in a project
`GET /project/{p}/resource/{res}`	Get a single resource of a project
`PUT /project/{p}/resource/{res}`	Update a resource of a project
`DELETE /project/{p}/resource/{res}`	Delete a resource of a project

Resources R/DataSHIELD Assignment Web Services

Web services to make use of the resources in an R/DataSHIELD environment. The name of the resource is its fully qualified name (i.e. it includes the project name): <project>:<name>.

REST	Description
`PUT /r/session/{s}/symbol/{smbl}/_resource?name={res}`	Assign a resource object to a symbol in the R session
`PUT /datashield/session/{s}/symbol/{smbl}/_resource?name={res}`	Assign a resource object to a symbol in the DataSHIELD session
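
As an illustration, the request path for such an assignment could be built as follows. The helper name `assignResourcePath` and the session, symbol, project and resource values are hypothetical; percent-encoding the fully qualified name in the query string is an assumption of this sketch, not a documented requirement of Opal.

```r
# Hypothetical helper: build the path of the PUT request that assigns
# a project's resource to a symbol in an R or DataSHIELD session.
assignResourcePath <- function(session, symbol, project, name, datashield = TRUE) {
  root <- if (datashield) "datashield" else "r"
  # fully qualified resource name: <project>:<name>
  fullName <- paste0(project, ":", name)
  paste0(
    "/", root, "/session/", session, "/symbol/", symbol,
    "/_resource?name=", utils::URLencode(fullName, reserved = TRUE)
  )
}

# made-up identifiers, for illustration only
path <- assignResourcePath("s1", "res", project = "RSRC", name = "CNSIM1")
```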

UI Mockups

Resources are defined by project.

List Resources

In addition to the "Tables" tab in the project main page, a new "Resources" tab is the place where the project's resources are managed.

Operations can be performed in batch when selecting several resources.

Add/Edit Resource

The choice of resource types is provided by the resource plugins. Each of them has a schema-form that is used for getting the URL parameters (scheme, host, port, path segments and query). This parameters form makes it possible to build a URL in a user-friendly manner and reduces the risk of malformed URLs.

Example of a SQL table data resource:

Example of an SSH computation resource:

R/DataSHIELD

This section describes some R/DataSHIELD use cases and how R code can be structured to formalize the resources and their usage in a flexible way.

Resource Object

When a resource is assigned to an R server session, an object of class resource is created by Opal, possibly with an additional class to inform R about the type of the data made accessible by the resource.

res <- structure(
  list(
    name = "NA19462.final.cram",
    url = "s3://1000genomes/1000G_2504_high_coverage/data/ERR3239775/NA19462.final.cram",
    identity = "awsaccountid",
    secret = "awssecret",
    format = "cram"
  ),
  class = c("resource", "cram")
)

This resource object has all the information for establishing a connection and for making use of the resource.

Coercing a Resource to a R Object

When relevant it may be possible to coerce the resource to a domain specific R object. This can be achieved by using the S3 class system, usually applied through the extension of as.*() functions.

For instance, if the resource represents a SQL table, it should be possible to make a data.frame from it by calling the as.data.frame() function.

# S3 method that coerces a resource to a data.frame
# (the signature must be compatible with the as.data.frame(x, ...) generic)
as.data.frame.resource <- function(x, ...) {
  # ...
}

# resource to SQL table in a MySQL database
res <- structure(
  list(
    name = "CNSIM1",
    url = "mysql://db.example.org/datashield/table",
    identity = "dbuser",
    secret = "dbpassword"
  ),
  class = "resource"
)

# coerce the resource to a SQL table as a data.frame
as.data.frame(res)

Another example could be a resource that describes a file (to be downloaded or read locally), from which a data.frame is made using one of the readers provided by tidyverse.

The data.frame is a generic type. A resource could also be coerced to a domain-specific type, such as a Bioconductor ExpressionSet, with an as.ExpressionSet() function.
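
The file case can be sketched with base R only. This is a minimal illustration under several assumptions: it handles only `file:` URLs with the csv format, it uses `utils::read.csv()` rather than a tidyverse reader, and it expects Unix-like paths.

```r
# Sketch of an S3 method that coerces a local CSV file resource to a data.frame
# (assumption: only file: URLs and the "csv" format are supported here)
as.data.frame.resource <- function(x, ...) {
  if (!grepl("^file:", x$url) || !identical(x$format, "csv")) {
    stop("This sketch only handles local CSV files, not ", x$url)
  }
  # accept both file:/path and file:///path
  path <- sub("^file:(//)?", "", x$url)
  utils::read.csv(path, ...)
}

# write a small CSV file to have something to read
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), path, row.names = FALSE)

res <- structure(
  list(name = "sample", url = paste0("file://", path), format = "csv"),
  class = c("resource", "csv")
)

# dispatches to as.data.frame.resource()
df <- as.data.frame(res)
```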

Accessing a Computation Resource

A resource does not necessarily represent data. It can also represent a shell command or an application with web services.

Shell Command

For example, an omics analysis uses the plink command to delegate data computation. This command can be launched locally if the plink command exists locally and if the input files are also available locally, in which case the plink resource could be declared as:

shell:/work/dir?exec=plink,bcftools,ls

Alternatively, if the plink command is to be issued through ssh, the URL becomes:

ssh://example.org:22/work/dir?exec=plink,bcftools,ls

Whether plink is run locally or remotely affects the way the input files are handled (downloaded or accessed locally).
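
In both URLs the `exec` query parameter lists the commands attached to the resource. Assuming this list is meant to act as a whitelist (an interpretation of this sketch, not something the specification states explicitly), a client could check a command before running it. The helper names `allowedCommands` and `checkCommand` are hypothetical.

```r
# Hypothetical helper: extract the commands listed in the exec query parameter
allowedCommands <- function(url) {
  if (!grepl("?", url, fixed = TRUE)) {
    return(character(0))
  }
  query <- sub("^[^?]*\\?", "", url)
  params <- strsplit(query, "&", fixed = TRUE)[[1]]
  exec <- params[startsWith(params, "exec=")]
  if (length(exec) == 0) {
    return(character(0))
  }
  strsplit(sub("^exec=", "", exec[1]), ",", fixed = TRUE)[[1]]
}

# Hypothetical helper: fail unless the command is listed by the resource URL
checkCommand <- function(url, command) {
  if (!command %in% allowedCommands(url)) {
    stop("Command not allowed by the resource: ", command)
  }
  invisible(TRUE)
}

allowedCommands("shell:/work/dir?exec=plink,bcftools,ls")  # "plink" "bcftools" "ls"
checkCommand("ssh://example.org:22/work/dir?exec=plink,bcftools,ls", "plink")
```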

Application Web Services

In the case of a remote web server exposing web services, a dedicated client must exist in the R server that knows how to authenticate and how to get data or execute tasks. In some situations, the scheme of the URL could be extended to qualify the application. For instance, suppose the URL is the location of a CSV file in an Opal file system:

https://opal-demo.obiba.org/ws/files/data/sample.csv

Then the R object class would be csv, because we want to be able to coerce the CSV file to a data.frame. But how should the credentials be used if we do not know that this remote server is Opal? The solution could be to qualify the type of the application in the scheme, for instance:

opal+https://opal-demo.obiba.org/ws/files/data/sample.csv

Resource Resolvers and Clients

We then need two types of resolvers in R:

data format resolver, for driving the data transformations,
URL scheme resolver, for driving the data retrieval.

The data format resolver will make use of the S3 method dispatch system, like with the as.*() functions.

The scheme resolvers could be managed by a resolver registry. This registry and the resolvers could be implemented as R6 classes and defined in a resourcer package.

Let's call ResourceResolver the R6 base class, whose public functions would be:

ResourceResolver Function	Description
`isFor(x)`	Test whether the resource `x` can be handled by the resolver (function used by the resolver registry to find which one applies)
`newClient(x)`	Create a ResourceClient object from the resource `x`

The ResourceResolver is a factory of ResourceClient:

ResourceClient Function	Description
`getResource()`	Get the resource
`downloadFile(fileext = "")`	From the resource, get the data from the URL into a file with provided file extension
`asDataFrame(...)`	Coerce the resource to a data frame
`getConnection()`	Get the raw connection object that is wrapped. It can be a connection to a file, a database or a remote application
`close()`	Close the connection with the resource

This base class is to be inherited by specialized resource resolver implementations: OpalResolver, for instance, will download a file from the Opal file system, or could directly produce a data frame from a table or from the downloaded file, or could return an opal client object using the opalr package.

The resourcer package will also expose some functions:

resourcer package Function	Description
`newResource(name, url, identity, secret, format)`	Create a `structure` object whose main class attribute is "resource"
`registerResolver(x)`	Register a Resolver object (create the registry object if it does not exist)
`resolveResource(x)`	Find the Resolver object in the registry that is for the resource object `x`
`newResourceClient(x)`	Shortcut function for finding a resource resolver and creating a resource client from a resource object `x`
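
The lookup performed by the registry can be sketched without any dependency. The sketch below uses plain closures and an environment instead of R6 classes to stay self-contained, so it is not the proposed implementation; `newSchemeResolver` and `newDummyClient` are hypothetical names, and the client does nothing but hold the resource.

```r
# Registry of resolvers, private to this sketch
.registry <- new.env()
assign("resolvers", list(), envir = .registry)

# Hypothetical constructor of a resolver that applies to some URL schemes
newSchemeResolver <- function(schemes, newClient) {
  list(
    isFor = function(x) {
      inherits(x, "resource") && sub(":.*$", "", x$url) %in% schemes
    },
    newClient = newClient
  )
}

registerResolver <- function(x) {
  resolvers <- get("resolvers", envir = .registry)
  assign("resolvers", c(resolvers, list(x)), envir = .registry)
  invisible(x)
}

# first registered resolver that applies to the resource, or NULL
resolveResource <- function(x) {
  for (resolver in get("resolvers", envir = .registry)) {
    if (resolver$isFor(x)) {
      return(resolver)
    }
  }
  NULL
}

newResourceClient <- function(x) {
  resolver <- resolveResource(x)
  if (is.null(resolver)) {
    stop("No resource resolver found for ", x$url)
  }
  resolver$newClient(x)
}

# Hypothetical client that only wraps the resource
newDummyClient <- function(x) {
  list(getResource = function() x, close = function() invisible(NULL))
}

registerResolver(newSchemeResolver(c("ssh", "scp"), newDummyClient))
registerResolver(newSchemeResolver("file", newDummyClient))

res <- structure(
  list(name = "supercomp1", url = "ssh://server1.example.org/work/dir?exec=plink,ls"),
  class = "resource"
)
client <- newResourceClient(res)
client$getResource()$name  # "supercomp1"
client$close()
```

Real resolvers would replace the dummy client with one that opens a connection, in the way described for ResourceClient above.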

Example of usage that turns a remote file stored in an application into a data.frame:

# make a data.frame from a resource
# (this section is to be defined in the resourcer package)
as.data.frame.resource <- function(x, row.names = NULL, optional = FALSE, ...) {
  # find scheme resolver
  resolver <- resourcer::resolveResource(x)
  if (is.null(resolver)) {
    stop("No resource resolver found for ", x$url)
  }
  # coerce resource data to a data.frame
  client <- resolver$newClient(x)
  df <- client$asDataFrame(row.names = row.names, optional = optional, ...)
  client$close()
  df
}

# register an opal resolver
# (this section is to be done automatically when the package containing the resolver class is loaded)
resourcer::registerResolver(OpalResolver$new())      # opal+https
# other resolvers
resourcer::registerResolver(SshResolver$new())       # ssh or scp
resourcer::registerResolver(LocalFileResolver$new()) # file
resourcer::registerResolver(SqlTableResolver$new())  # mysql, mariadb or postgresql
resourcer::registerResolver(S3Resolver$new())        # s3
# etc.

# make a csv file resource on an opal server
# (this section is to be done by opal at resource assignment time)
res <- resourcer::newResource(
  name = "CNSIM1",
  url = "opal+https://opal-demo.obiba.org/ws/files/data/CNSIM1.csv",
  secret = "WTxCVpOy3dbNYwQbEAzQrPV2KZQFeQCa",
  format = "csv"
)

# coerce the csv file in the opal server to a data.frame
# (this section is an assignment operation triggered by the R client)
df <- as.data.frame(res)

Another example of usage that explicitly uses the connection object:

# make an application resource on an ssh server
# (this section is to be done by opal at resource assignment time)
res <- resourcer::newResource(
  name = "supercomp1",
  url = "ssh://server1.example.org/work/dir?exec=plink,ls",
  identity = "sshaccountid",
  secret = "sshaccountpwd"
)

# (this section is to be done automatically when the package containing the resolver class is loaded)
resourcer::registerResolver(SshResolver$new())

# get ssh client from resource object
client <- resourcer::newResourceClient(res) # does a ssh::ssh_connect()

# SshClient has some additional functions
files <- client$exec("ls") # exec 'cd /work/dir && ls'

# or use ssh package directly
# (see examples https://ropensci.org/technotes/2018/06/12/ssh-02/)
session <- client$getConnection()
ssh::ssh_exec_internal(session, command = "ls")

# release connection
client$close() # does ssh::ssh_disconnect(session)

Here is a list of R packages that can help in the implementation of some URL scheme resolvers:

URL Scheme	R Package
`s3`	aws.s3
`ssh`	ssh
`scp`	ssh
`shell`	sys
`mysql`	DBI, RMySQL
`mongodb`	nodbi, mongolite
`https`	httr
`opal+https`	opalr
...	...

The httr package has a function for parsing URLs, parse_url(), which can be conveniently used to extract the scheme, host, port, path, query, etc. elements.

Test/Demo Plan

How can the feature be tested or demonstrated? It is important to describe this in enough detail that anyone can perform the demo or test.

Unresolved Issues

Specify what the resource's meta-data would be: how to retrieve them, how to use them in a data catalog.
Specify what the resource's identifiers would be (if it can be coerced to a data.frame).
Explore what an OpenCGA resource would be.
Make a PR for httr::authenticate to be applied to Presto R client (see also this)
