From 937dd471cb3286957b5ddfae6de873609c9df5dd Mon Sep 17 00:00:00 2001 From: Bas Leenknegt Date: Wed, 7 Dec 2022 13:18:19 +0100 Subject: [PATCH] Update docs on custom namespaces and cna long format --- docs/File-Formats.md | 210 +++++++++++++++++++++++++++++-------------- 1 file changed, 142 insertions(+), 68 deletions(-) diff --git a/docs/File-Formats.md b/docs/File-Formats.md index 6413f245f79..37592f25ebc 100644 --- a/docs/File-Formats.md +++ b/docs/File-Formats.md @@ -22,6 +22,7 @@ * [Generic Assay](#generic-assay) * [Arm Level CNA Data](#arm-level-cna-data) * [Resource Data](#resource-data) + * [Custom namespace columns](#custom-namespace-columns) # Introduction @@ -275,12 +276,19 @@ The Clinical Data Dictionary from MSKCC is used to normalize clinical data, and ## Discrete Copy Number Data The discrete copy number data file contain values that would be derived from copy-number analysis algorithms like [GISTIC 2.0](https://www.ncbi.nlm.nih.gov/sites/entrez?term=18077431) or [RAE](https://www.ncbi.nlm.nih.gov/sites/entrez?term=18784837). GISTIC 2.0 can be [installed](https://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=216&p=t) or run online using the GISTIC 2.0 module on [GenePattern](https://cloud.genepattern.org). For some help on using GISTIC 2.0, check the [Data Loading: Tips and Best Practices](Data-Loading-Tips-and-Best-Practices.md) page. When loading case list data, the `_cna` case list is required. See the [case list section](#case-lists). -### Meta file +### Wide vs Long format +For CNA data two formats are supported: the wide, and the long format: +- **Wide format**: a matrix, where each row is a gene, and each column is a sample +- **Long format**: not a matrix, each row is a gene-sample combination; this makes the file longer + +### Wide format + +#### Meta file The meta file is comprised of the following fields: 1. **cancer_study_identifier**: same value as specified in [study meta file](#cancer-study) 2. 
**genetic_alteration_type**: COPY_NUMBER_ALTERATION -3. **datatype**: DISCRETE +3. **datatype**: `DISCRETE` 4. **stable_id**: gistic, cna, cna_rae or cna_consensus 5. **show_profile_in_analysis_tab**: true 6. **profile_name**: A name for the discrete copy number data, e.g., "Putative copy-number alterations from GISTIC" @@ -289,7 +297,7 @@ The meta file is comprised of the following fields: 9. **gene_panel (Optional)**: gene panel stable id 10. **pd_annotations_filename (Optional)**: name of [custom driver annotations file](File-Formats.md#custom-driver-annotations-file) -### Example +##### Example An example metadata file could be named meta_cna.txt and its contents could be: ``` cancer_study_identifier: brca_tcga_pub @@ -303,8 +311,7 @@ data_filename: data_cna.txt pd_annotations_filename: data_cna_pd_annotations.txt ``` -### Data file - +#### Data file For each gene (row) in the data file, the following columns are required in the order specified: One or both of: @@ -321,7 +328,7 @@ For each gene-sample combination, a copy number level is specified: - "1" indicates a low-level gain - "2" is a high-level amplification. -### Example +##### Example An example data file which includes the required column header would look like: ``` Hugo_Symbol Entrez_Gene_Id SAMPLE_ID_1 SAMPLE_ID_2 ... @@ -331,6 +338,53 @@ AGRN 375790 2 0 ... ... ````` +### Long format + +#### Meta file +The meta file of **long format** is comprised of the following fields: + +1. **cancer_study_identifier**: same value as specified in [study meta file](#cancer-study) +2. **genetic_alteration_type**: COPY_NUMBER_ALTERATION +3. **datatype**: `DISCRETE_LONG` + Note: It will end up as datatype `DISCRETE` in the database, because the LONG data format is only relevant while importing. +4. **stable_id**: gistic, cna, cna_rae or cna_consensus +5. **show_profile_in_analysis_tab**: true +6. **profile_name**: A name for the discrete copy number data, e.g., "Putative copy-number alterations from GISTIC" +7. 
**profile_description**: A description of the copy number data, e.g., "Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification." +8. **data_filename**: your datafile +9. **gene_panel (Optional)**: gene panel stable id +10. **namespaces (Optional)**: Comma-delimited list of `namespaces` to import. + +##### Example +An example metadata file could be named meta_cna.txt and its contents could be: +``` +cancer_study_identifier: brca_tcga_pub +genetic_alteration_type: COPY_NUMBER_ALTERATION +datatype: DISCRETE_LONG +stable_id: gistic +show_profile_in_analysis_tab: true +profile_name: Putative copy-number alterations from GISTIC +profile_description: Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification. +data_filename: data_cna.txt +namespaces: MyNamespace,MyNamespace2 +``` + +#### Data file +Each row contains a gene-sample combination. Custom driver annotations are added as columns to the data file, just like custom namespace columns. + +##### Example +An example data file which includes the required column header would look like: +``` +Hugo_Symbol Entrez_Gene_Id Sample_Id Value cbp_driver cbp_driver_annotation cbp_driver_tiers cbp_driver_tiers_annotation MyNamespace.column1 +ACAP3 116983 TCGA-A2-A04U-01 2 Putative_Passenger Test passenger Class 2 Class annotation value1 +... +``` + +#### Adding your own discrete copy number columns +Additional columns can be added to the discrete copy number **long** data file. In this way, the portal will parse and store your own CNA fields in the database. + +See [Custom namespace columns](#custom-namespace-columns) for more information on adding custom columns to data files. + ### Custom driver annotations file Custom driver annotations can be defined for discrete copy number data. 
These annotations can be used to complement or replace default driver annotation resources OncoKB and HotSpots. @@ -718,67 +772,13 @@ You can learn more about configuring these annotations in the [portal.properties ![schreenshot mutation color menu](/images/screenshot-mutation-color-menu.png) ### Adding your own mutation annotation columns -Adding additional mutation annotation columns to the cBioPortal mutation data file rows can also be done. In this way, the portal will parse and store your own MAF fields in the database. For example, mutation data that you find on cBioPortal.org comes from MAF files that have been further enriched with information from [mutationassessor.org](https://mutationassessor.org/), which leads to a "Mutation Assessor" column in the [mutation table](https://www.cbioportal.org/index.do?cancer_study_list=acc_tcga&cancer_study_id=acc_tcga&genetic_profile_ids_PROFILE_MUTATION_EXTENDED=acc_tcga_mutations&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=acc_tcga_sequenced&case_ids=&patient_case_select=sample&gene_set_choice=user-defined-list&gene_list=ZFPM1&clinical_param_selection=null&tab_index=tab_visualize&Action=Submit). - -### Adding mutation annotation columns through namespaces -Additional columns may also be added into the cBioPortal mutation data file and imported through the namespace mechanism. Any columns starting with a prefix specified in the `namespaces` field in the metafile will be imported into the database. Namespace columns should be formatted as the namespace and namespace attribute seperated with a period (e.g ASCN.total_copy_number where ASCN is the namespace and total_copy_number is the attribute). 
- -An example cBioPortal mutation data file with the following **additional** columns: -``` -ASCN.total_copy_number ASCN.clonal MUTATION.name MUTATION.type -``` -imported with the following `namespaces` field in the metafile: -``` -namespaces: ascn -``` -will import the `ASCN.total_copy_number` and `ASCN.clonal` column into the database. `MUTATION.name` and `MUTATION.type` will be ignored because `mutation` is not specified in the `namespaces` field. - -## Representation of namespace columns by mutation API endpoints - -Columns added through namespaces will be returned by mutation API endpoints. Namespace data will be available in -the `namespaceColumn` -of respective JSON representations of mutation records. The `namespaceColumns` property will be a JSON object where -namespace data is keyed by name of the namespace in lowercase. For instance, when namespace `ZYGOSITY` is defined in the -meta file and the data file has column `ZYGOSITY.status` with value `Homozygous` for a mutation row, the API will return -the following JSON record for this mutation (only relevant fields are shown): - -``` -{ - "namespaceColumns": { - "ZYGOSITY": { - "status": "Homozygous" - } - }, -} -``` - -Note: ASCN namespace data is not exported via the `namespaceColumns` field. - -## Representation of namespace columns in the cBioPortal frontend - -Namespace columns will be added as columns to mutation tables in Patient View and Results View. The case of the -namespace in the column header will be as specified in the mutations meta file and the column name will be capitalized. -For instance, this metafile entry: - -```shell -namespaces: Zygosity -``` - -and this column header: - -```shell -ZYGOSITY.status -``` - -will show in the mutation table with column name: - -```shell -Zygosity Status -``` - -Note: namespace columns are recognized by a case-insensitive match of the namespace reported in the mutations meta file -and the first word in the column header. 
+Additional mutation annotation columns can be added to the cBioPortal mutation data file. In this way, the portal will +parse and store your own MAF fields in the database. For example, mutation data that you find on cBioPortal.org comes +from MAF files that have been further enriched with information +from [mutationassessor.org](https://mutationassessor.org/), which leads to a "Mutation Assessor" column in +the [mutation table](https://www.cbioportal.org/index.do?cancer_study_list=acc_tcga&cancer_study_id=acc_tcga&genetic_profile_ids_PROFILE_MUTATION_EXTENDED=acc_tcga_mutations&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=acc_tcga_sequenced&case_ids=&patient_case_select=sample&gene_set_choice=user-defined-list&gene_list=ZFPM1&clinical_param_selection=null&tab_index=tab_visualize&Action=Submit). +See [Custom namespace columns](#custom-namespace-columns) for more information on adding custom columns to data files. ### Allele specific copy number (ASCN) annotations Allele specific copy number (ASCN) annotation is also supported and may be added using namespaces, described [here](#adding-mutation-annotation-columns-through-namespaces). If ASCN data is present in the cBioPortal mutation data file, the deployed cBioPortal instance will display additional columns in the mutation table showing ASCN data. @@ -977,6 +977,12 @@ A structural variant data file is a tab-delimited file with one structural varia For an example see [datahub](https://github.com/cBioPortal/datahub/blob/master/public/msk_impact_2017/data_sv.txt). For an example see [datahub](https://github.com/cBioPortal/datahub/blob/master/public/msk_impact_2017/data_sv.txt). At a minimum `Sample_Id`, either `Site1_Hugo_Symbol`/ `Site1_Entrez_Gene_Id` or `Site2_Hugo_Symbol`/ `Site2_Entrez_Gene_Id` and `SV_Status` are required. 
For the stuctural variant tab visualization (still in development) one needs to provide those field as well as `Site1_Ensembl_Transcript_Id`, `Site2_Ensembl_Transcript_Id`, `Site1_Region` and `Site2_Region`. Some of the other columns are shown at several other pages on the website. The `Class`, `Annotation` and `Event_Info` columns are shown prominently on several locations. **Note**: We strongly recommend all the data providers to submit genomic locations in addition to required fields for future visualization features. +### Adding your own structural variant columns +Additional columns can be added to the structural variant data file. In this way, the portal will +parse and store your own structural variant fields in the database. + +See [Custom namespace columns](#custom-namespace-columns) for more information on adding custom columns to data files. + ## Fusion Data **⚠️ DEPRECATED Use the: [SV format](#structural-variant-data) instead** @@ -1550,10 +1556,13 @@ shown at the left side of the plot. When `value_sort_order` is `ASC` the x-axis values to the left. When `value_sort_order` is `DESC` the x-axis will be in descending order with larger values to the left. ### Note on `generic_entity_meta_properties` -All meta properties must be specified in the `generic_entity_meta_properties` field. Every meta property listed here must appear as a column header in the corresponding data file. It's highly recommend to add `NAME`, `DESCRIPTION` and an optional `URL` to get the best visualization on OncoPrint tab and Plots tab. +All meta properties must be specified in the `generic_entity_meta_properties` field. Every meta property listed here +must appear as a column header in the corresponding data file. It's highly recommended to add `NAME`, `DESCRIPTION` and an +optional `URL` to get the best visualization on OncoPrint tab and Plots tab. 
### Note on `patient_level` -Generic Assay data will be considered `sample_level` data if the `patient_level` property is missing or set to `false`. In addition, the patient or sample identifiers need to be included in the [Clinical Data](#clinical-data) file. +Generic Assay data will be considered `sample_level` data if the `patient_level` property is missing or set to `false`. +In addition, the patient or sample identifiers need to be included in the [Clinical Data](#clinical-data) file. ### Note on `Generic Assay` genetic_alteration_type and datatype All generic assay data is registered to be of the type of `genetic_alteration_type` and data type can choose from `LIMIT-VALUE`, `CATEGORICAL` and `BINARY`. @@ -1676,3 +1685,68 @@ The study resource file should follow this format, it has two **required** colum RESOURCE_ID URL STUDY_SPONSORS https://url-to-study-sponsors + +## Custom namespace columns + +### Adding annotation columns through namespaces +Custom columns can be added to the data files of mutations, structural variants and discrete copy number (long) data. +The columns can be imported through the namespace mechanism into a database table column called `ANNOTATION_JSON`. Any columns starting with a prefix specified in the `namespaces` field in the metafile will be imported into the database. Namespace columns should be formatted as the namespace and namespace attribute separated with a period (e.g. `ASCN.total_copy_number` where `ASCN` is the namespace and `total_copy_number` is the attribute). + +An example cBioPortal mutation data file with the following **additional** columns: +``` +ASCN.total_copy_number ASCN.clonal MUTATION.name MUTATION.type +``` +imported with the following `namespaces` field in the metafile: +``` +namespaces: ascn +``` +will import the `ASCN.total_copy_number` and `ASCN.clonal` columns into the database. `MUTATION.name` and `MUTATION.type` +will be ignored because `mutation` is not specified in the `namespaces` field. 
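The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour (a case-insensitive match on the prefix before the first period), not the importer's actual code; the function name `select_namespace_columns` and the sample header are hypothetical.

```python
def select_namespace_columns(header, namespaces_field):
    """Return the header columns that belong to a declared namespace.

    `namespaces_field` is the comma-delimited value of the `namespaces`
    entry in the meta file. Matching is case-insensitive, as in the
    `namespaces: ascn` example above. (Hypothetical helper, not part of
    the cBioPortal code base.)
    """
    declared = {ns.strip().lower() for ns in namespaces_field.split(",") if ns.strip()}
    selected = []
    for column in header:
        # Split "ASCN.total_copy_number" into namespace and attribute.
        prefix, sep, attribute = column.partition(".")
        if sep and attribute and prefix.lower() in declared:
            selected.append(column)
    return selected


header = [
    "Hugo_Symbol",
    "ASCN.total_copy_number",
    "ASCN.clonal",
    "MUTATION.name",
    "MUTATION.type",
]
imported = select_namespace_columns(header, "ascn")
print(imported)
```

With `namespaces: ascn`, only the two `ASCN.*` columns are selected; the `MUTATION.*` columns are skipped because `mutation` is not declared.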
+ +## Representation of namespace columns by mutation API endpoints + +Columns added through namespaces will be returned by relevant mutation, discrete copy number and structural variant API +endpoints. Namespace data will be available in the `namespaceColumns` property of the respective JSON representations of mutation +records. The `namespaceColumns` property will be a JSON object where namespace data is keyed by the name of the namespace in +lowercase. For instance, when namespace `ZYGOSITY` is defined in the meta file and the data file has column +`ZYGOSITY.status` with value `Homozygous` for a mutation row, the API will return the following JSON record for this +mutation (only relevant fields are shown): + +``` +{ + "namespaceColumns": { + "ZYGOSITY": { + "status": "Homozygous" + } + }, +} +``` + +Note: ASCN namespace data is not exported via the `namespaceColumns` field. + +## Representation of namespace columns in the cBioPortal frontend + +Namespace columns will be added as columns to mutation, structural variant and copy number alteration tables in Patient +View and Results View. The case of the namespace in the column header will be as specified in the meta file +and the column name will be capitalized. + +For instance, this metafile entry: + +```shell +namespaces: Zygosity +``` + +and this column header: + +```shell +ZYGOSITY.status +``` + +will show in the mutation table with column name: + +```shell +Zygosity Status +``` + +Note: namespace columns are recognized by a case-insensitive match of the namespace reported in the meta file +and the first word in the column header.
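
As a rough illustration of the two representations described above, the following Python sketch groups the namespace columns of one data row into a nested object and derives the table header shown in the frontend. Both helpers (`group_namespace_columns`, `display_name`) are hypothetical and are not the portal's implementation. The nested object is keyed by the namespace as written in the data file header, which matches the `ZYGOSITY` example; the exact key casing returned by the API is an assumption here. How the frontend renders attributes containing underscores is not covered by this section, so the sketch only capitalizes the first letter.

```python
def group_namespace_columns(row):
    """Nest "NAMESPACE.attribute" entries of a row as {namespace: {attribute: value}}.

    Keys use the namespace as written in the column header (assumption).
    """
    grouped = {}
    for column, value in row.items():
        namespace, sep, attribute = column.partition(".")
        if sep and attribute:
            grouped.setdefault(namespace, {})[attribute] = value
    return grouped


def display_name(column, declared_namespaces):
    """Build a table header such as "Zygosity Status" for "ZYGOSITY.status".

    The namespace keeps the casing from the meta file; the attribute is
    capitalized. Returns None when the column matches no declared namespace.
    """
    namespace, sep, attribute = column.partition(".")
    if not (sep and attribute):
        return None
    for declared in declared_namespaces:
        if declared.lower() == namespace.lower():
            return f"{declared} {attribute.capitalize()}"
    return None


row = {"Hugo_Symbol": "ZFPM1", "ZYGOSITY.status": "Homozygous"}
record = {"namespaceColumns": group_namespace_columns(row)}
label = display_name("ZYGOSITY.status", ["Zygosity"])
print(record)
print(label)
```

The gene symbol in `row` is placeholder data; only the `ZYGOSITY.status` column and the `Zygosity` meta file entry come from the examples above.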