OpenCGA Catalog Data Models

Catalog Data Models Definition

This section describes all the Catalog data models.

For more details about the Java data models, browse the source code of the Java beans.

User
- Project
  - Study
    - File
    - Job
    - VariableSet
      - Variable
    - Sample
      - AnnotationSet
        - Annotation
      - Individual
    - ACL
    - Experiment
    - Dataset
    - Cohort
- Tool
  - Manifest

Common fields

Fields that are common between the data models are:

id: a positive numeric identifier that is unique across the whole Catalog. This id can be used in the API and REST web services.
status: Java bean object containing the current status of the entity (user, project, study...). By default, the accepted values are READY, TRASHED and DELETED, although some entities extend this Java bean and define more.
attributes: this field can be used by applications built on OpenCGA to store custom information; any well-formed JSON object is accepted.
lastActivity: this field reports when the data was last updated, which is useful when refreshing a web client interface.
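
The status rule above (a default set that some entities extend) can be sketched as follows. This is only an illustration, not the OpenCGA implementation: the function names and the entity keys are hypothetical, and the extra values are the ones this page lists for User, Study and File, assumed here to extend the defaults.

```python
# Default statuses named in the "Common fields" list.
DEFAULT_STATUSES = frozenset({"READY", "TRASHED", "DELETED"})

# Extra statuses this page lists for particular entities
# (assumption: they extend the defaults rather than replace them).
EXTRA_STATUSES = {
    "user": frozenset({"BANNED"}),
    "study": frozenset({"ACTIVE"}),
    "file": frozenset({"STAGED", "MISSING"}),
}


def accepted_statuses(entity):
    """Return the status names accepted for an entity kind such as "user" or "file"."""
    return DEFAULT_STATUSES | EXTRA_STATUSES.get(entity, frozenset())


def is_valid_status(entity, status):
    """Check a status name against the accepted set for that entity kind."""
    return status in accepted_statuses(entity)


print(is_valid_status("user", "BANNED"))     # True
print(is_valid_status("project", "BANNED"))  # False
```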

Session

Registers every login and logout made by the user. The sessionId is valid only while the logout field is empty.

🚧 In the future, the session array will be managed by a low-latency daemon running in the background.
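
The validity rule for sessions can be sketched as a small check over a user document. This is a minimal sketch, assuming that "empty" means a missing key, null, or an empty string; the helper names and the second session entry are hypothetical.

```python
def is_session_active(session):
    """A session is treated as active while its logout field is empty."""
    # Assumption: "empty" covers a missing key, None, and the empty string.
    return not session.get("logout")


def active_session_ids(user):
    """Collect the ids of all sessions of a user document that are still open."""
    return [s["id"] for s in user.get("sessions", []) if is_session_active(s)]


user = {
    "id": "jsmith",
    "sessions": [
        # Closed session, copied from the User example on this page.
        {"id": "oQRQcpBRmkCi1rhMJlGi", "ip": "localhost",
         "login": "20160920090458", "logout": "20160920090458"},
        # Hypothetical open session, added for illustration only.
        {"id": "hypotheticalOpenSession", "ip": "localhost",
         "login": "20160920100000", "logout": ""},
    ],
}

print(active_session_ids(user))  # ['hypotheticalOpenSession']
```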

User and Project

This is the root level of the hierarchy. It represents any person registered in the system together with their projects.

Most relevant fields are:

id: Alphanumeric string identifier. This is the only non-numeric id.
status: Additional status besides the default ones:
- BANNED:
account: Accepted account types:
- GUEST: This account type is for guests only; they cannot create their own projects or studies.
- FULL: General-purpose account type that gives complete access to OpenCGA, including the ability to define new projects and studies.
configs: 🚧 In the configs map, users will be able to store their own queries under a name and retrieve them later.

Example:

{
  "id": "jsmith",
  "name": "John Smith",
  "email": "jsmith@do.co",
  "password": "a04f2825f2227f70e41d58b16c890661a803b453",
  "organization": "ACME",
  "account" : {
    "type" : "full",
    "creationDate" : "20160920090458",
    "expirationDate" : "20170920090458",
    "authOrigin" : "internal"
  },
  "status" : {
    "name" : "READY",
    "date" : "20160920090458",
    "message" : ""
  },
  "lastModified" : "20160920090752582",
  "diskUsage" : -1,
  "diskQuota" : 200000,
  "projects" : [
    {
      "id" : 1,
      "name" : "Default",
      "alias" : "default",
      "creationDate" : "20160920090458",
      "description" : "This is my project description.",
      "organization" : "ACME",
      "status" : {
        "name" : "READY",
        "date" : "20160920090458",
        "message" : ""
      },
      "lastModified" : null,
      "diskUsage" : 0,
      "studies" : [ ],
      "dataStores" : { },
      "attributes" : { }
    }
  ],
  "tools" : [ ],
  "sessions" : [
    {
      "id" : "oQRQcpBRmkCi1rhMJlGi",
      "ip" : "localhost",
      "login" : "20160920090458",
      "logout" : "20160920090458"
    }
  ],
  "configs" : { },
  "attributes" : { }
}

* In this example the contents of the studies and tools arrays have been omitted. They are explained below.
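
The date fields in the example above appear to be digit strings in year-month-day-hour-minute-second order, with lastModified carrying three extra digits. The sketch below parses both shapes under that assumption (the format is inferred from the example values, not from a specification, and the extra digits are taken to be milliseconds). The function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_catalog_date(value):
    """Parse a 14-digit timestamp, or a 17-digit one with trailing milliseconds."""
    if len(value) not in (14, 17) or not value.isdigit():
        raise ValueError("unexpected timestamp: %r" % value)
    # The first 14 digits are read as year, month, day, hour, minute, second.
    parsed = datetime.strptime(value[:14], "%Y%m%d%H%M%S")
    if len(value) == 17:
        # Assumption: the last three digits are milliseconds.
        parsed += timedelta(milliseconds=int(value[14:]))
    return parsed


def is_account_expired(account, now):
    """Compare the account expirationDate with a caller-supplied point in time."""
    return parse_catalog_date(account["expirationDate"]) <= now


account = {"creationDate": "20160920090458", "expirationDate": "20170920090458"}
print(parse_catalog_date(account["creationDate"]))
print(is_account_expired(account, datetime(2017, 1, 1)))  # False
```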

Study

The main Catalog object. A study groups files, jobs and samples, among other entities. All the files in a study share the same location, cipher and sharing options (ACLs).

Most relevant fields are:

type: (to cohort?)
- CASE_SET:
- CONTROL_SET:
- CASE_CONTROL:
- PAIRED:
- FAMILY:
- TRIO:
stats: (to cohort?)
status:
- ACTIVE:
diskUsage: Sum of the diskUsage of all files in the study.
cipher: Mechanism used to encrypt all files in the study. Accepted values:
- NONE: Without encryption.
- AES_256: Not implemented yet.
uri: Location of the study. A URI is required instead of a path because the study could be stored on different hosts and file systems.
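
To illustrate why a URI carries more than a path, the sketch below splits a study location into scheme, host and path with Python's standard urllib.parse. The function name is hypothetical, and the URI with a host and port is an invented example; only the hdfs:/// form comes from this page.

```python
from urllib.parse import urlparse


def describe_study_uri(uri):
    """Split a study location into the parts that a plain path cannot express."""
    parsed = urlparse(uri)
    return {
        "scheme": parsed.scheme,  # which file system, e.g. "file" or "hdfs"
        "host": parsed.hostname,  # None when the URI names no host
        "port": parsed.port,      # None when the URI names no port
        "path": parsed.path,
    }


# Location taken from the Study example on this page (no host given).
print(describe_study_uri("hdfs:///data/opencga/catalog2/users/jcoll/projects/14/15/"))

# Hypothetical location on a named host, for illustration only.
print(describe_study_uri("hdfs://namenode.example.org:8020/data/opencga/"))
```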

Example:

{
  "id": 15,
  "name": "Study test 1",
  "alias": "std1",
  "type": "FAMILY",
  "creatorId": "jcoll",
  "creationDate": "20141215182938",
  "description": "",
  "status": "ACTIVE",
  "lastActivity": "20141215182938",
  "diskUsage": 0,
  "cipher": "NONE",
  "acl": [ ],
  "experiments": [ ],
  "files": [ ],
  "jobs": [ ],
  "samples": [ ],
  "uri": "hdfs:///data/opencga/catalog2/users/jcoll/projects/14/15/",
  "datasets": [
    {
      "id": 0,
      "name": "bam_test_files",
      "creationDate": "20141215182938",
      "description": " ... ",
      "files": [ 26, 27, 28, 29, 35, 36, 38],
      "attributes": { }
    }
  ],
  "cohorts": [ ],
  "variableSets": [ ],
  "stats": { },
  "attributes": { }
}

* In this example the contents of the files, jobs and samples arrays have been omitted. They are explained below.

Dataset

Cohort

VariableSet and Variable

File

Most relevant fields are:

type: Accepted values:
- FILE: Any real file stored in the file system.
- FOLDER: File container.
- INDEX: Not a real file. Represents an indexed file in an OpenCGA Storage engine. Removed in v0.6.0.
format:
- PLAIN:
- GZIP:
- BINARY:
- IMAGE:
- EXECUTABLE:
bioformat:
- VARIANT:
- ALIGNMENT:
- SEQUENCE:
- NONE:
status: File status. For more information, go to File life cycle. Accepted values:
- STAGED: The file is being created.
- READY: The file is ready to be used.
- MISSING: The physical file is missing.
- TRASHED: Pending deletion.
- DELETED: The file has been deleted. Deletion is irreversible.
jobId and experimentId: Specifies the source of the file. A file can be generated from a job or an experiment.
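
The jobId and experimentId fields can be read together to find where a file came from. The sketch below assumes that a negative or missing id means "not set", which is inferred from the File example on this page, where both fields are -1. The function name and the returned labels are hypothetical.

```python
def file_source(file_entry):
    """Return a (kind, id) pair describing what produced the file."""
    job_id = file_entry.get("jobId", -1)
    experiment_id = file_entry.get("experimentId", -1)
    # Assumption: ids below zero (or null) mean the field is not set.
    if job_id is not None and job_id >= 0:
        return ("job", job_id)
    if experiment_id is not None and experiment_id >= 0:
        return ("experiment", experiment_id)
    return ("unknown", None)


# Both ids are -1 in the File example, so no source is recorded.
print(file_source({"id": 3, "jobId": -1, "experimentId": -1}))  # ('unknown', None)

# File 658 is listed as an output of job 138 in the Job example.
print(file_source({"id": 658, "jobId": 138, "experimentId": -1}))  # ('job', 138)
```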

Example:

{
  "id" : 3,
  "name" : "chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.AMR.vcf.gz",
  "type" : "FILE",
  "format" : "GZIP",
  "bioformat" : "VARIANT",
  "path" : "data/vcf/chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.AMR.vcf.gz",
  "ownerId" : "jcoll",
  "creationDate" : "20141215162449",
  "description" : " ... ",
  "status" : "READY",
  "diskUsage" : 24276833,
  "experimentId" : -1,
  "sampleIds" : [ ],
  "jobId" : -1,
  "acl" : [ ],
  "stats" : { },
  "attributes" : { }
}
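
The Study section describes diskUsage as the sum of the diskUsage of all files in the study. A minimal sketch of that aggregation follows. The function name is hypothetical, and skipping negative values is an assumption based on the -1 placeholder seen in the User example. The second file entry is invented for illustration.

```python
def study_disk_usage(files):
    """Sum the diskUsage of all file entries in a study."""
    total = 0
    for entry in files:
        usage = entry.get("diskUsage", 0)
        # Assumption: negative values are placeholders for "unknown" and are skipped.
        if usage > 0:
            total += usage
    return total


files = [
    # Size taken from the File example on this page.
    {"id": 3, "type": "FILE", "diskUsage": 24276833},
    # Hypothetical second entry with an unknown size.
    {"id": 4, "type": "FILE", "diskUsage": -1},
]
print(study_disk_usage(files))  # 24276833
```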

Job

Example:

{
  "id" : 138,
  "name" : "Test job",
  "userId" : "jcoll",
  "toolName" : "network-miner",
  "date" : "20141031151537",
  "description" : " ... ",
  "startTime" : 1415632245213,
  "endTime" : 1415632258708,
  "outputError" : "",
  "commandLine" : "/opt/opencga/analysis/network-miner/babelomics/babelomics.sh --tool network-miner  --seedlist 150140.chrom20.ILLUMINA.bwa.CHM1.20131218.bam.bai --significant-value 0.05 --list HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam --list-tags gene --intermediate 1 --outdir /home/cafetero/opencga/catalog/jobs/J_KrOrWfEwkx/ --order ascending --interactome hsa --randoms 500 --components false --group all --o-name result",
  "visits" : -1,
  "status" : "READY",
  "outDirId" : "6",
  "tmpOutDirUri" : "file:///home/cafetero/opencga/catalog/jobs/J_KrOrWfEwkx/",
  "input" : [
    66
  ],
  "tags" : [ ],
  "output" : [
    658,
    659
  ],
  "attributes" : { },
  "executionAttributes" : {
    "type" : "analysis",
    "jobExecutionId" : "268",
    "executionManager" : "SGE",
    "qname" : "normal.q",
    "group" : "cafetero",
    "jobname" : "network-miner_Test_job",
    "end_time" : "Wed Dec 10 11:10:06 2014",
    "jobnumber" : 268,
    "failed" : 0,
    "start_time" : "Wed Dec 10 11:10:06 2014",
    "hostname" : "host001",
    "qsub_time" : "Wed Dec 10 11:10:00 2014",
    "mem" : "0.000",
    "cpu" : "0.049",
    "exit_status" : 0
  }
}
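
The startTime and endTime values in the Job example look like epoch timestamps in milliseconds. Under that assumption (the unit is not stated on this page), the run time of a job can be sketched as below. The function name is hypothetical.

```python
from datetime import timedelta


def job_duration(job):
    """Return the elapsed time between startTime and endTime of a job."""
    # Assumption: both fields are milliseconds since the Unix epoch.
    return timedelta(milliseconds=job["endTime"] - job["startTime"])


# Values copied from the Job example above.
job = {"id": 138, "startTime": 1415632245213, "endTime": 1415632258708}
print(job_duration(job).total_seconds())  # 13.495
```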

Sample

Example:

{
  "id" : 19,
  "name" : "SMP00096",
  "source" : "",
  "individual" : null,
  "description" : " ... ",
  "annotationSets" : [
    {
      "name" : "Basic annotation",
      "variableSetId" : 21,
      "annotations" : [
        { "id" : "NAME",      "value" : "Glennie the platypus" },
        { "id" : "BORN-DATE", "value" : "20071000000000" },
        { "id" : "GENDER",    "value" : "FEMALE" },
        { "id" : "PHEN",      "value" : "CASE" },
        { "id" : "WEIGHT",    "value" : 25.38 }
      ],
      "date" : "20141216135957",
      "attributes" : { }
    }
  ]
}
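
Each annotationSet stores its annotations as a list of id/value pairs and points to a VariableSet through variableSetId. The sketch below shows one way a client might look up a set and index its values by id. The helper names are hypothetical, and the data is an abbreviated copy of the Sample example above.

```python
def find_annotation_set(sample, variable_set_id):
    """Return the first annotation set that refers to the given VariableSet, or None."""
    for annotation_set in sample.get("annotationSets", []):
        if annotation_set.get("variableSetId") == variable_set_id:
            return annotation_set
    return None


def annotations_by_id(annotation_set):
    """Turn the list of id/value pairs into a dictionary keyed by annotation id."""
    return {a["id"]: a["value"] for a in annotation_set["annotations"]}


# Abbreviated copy of the Sample example on this page.
sample = {
    "id": 19,
    "name": "SMP00096",
    "annotationSets": [
        {
            "name": "Basic annotation",
            "variableSetId": 21,
            "annotations": [
                {"id": "GENDER", "value": "FEMALE"},
                {"id": "PHEN", "value": "CASE"},
                {"id": "WEIGHT", "value": 25.38},
            ],
        }
    ],
}

values = annotations_by_id(find_annotation_set(sample, 21))
print(values["PHEN"])  # CASE
```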
