Update FAQ for Upcoming GDC Data Release #10894

Merged
merged 16 commits into from
Aug 1, 2024
33 changes: 28 additions & 5 deletions docs/user-guide/faq.md
@@ -36,7 +36,8 @@
* [The data today is different than the last time i looked. What happened?](/user-guide/faq.md#the-data-today-is-different-than-the-last-time-i-looked-what-happened)
* [How do I access data from AACR Project GENIE?](/user-guide/faq.md#how-do-i-access-data-from-aacr-project-genie)
* [TCGA](/user-guide/faq.md#tcga)
* [How does TCGA data in cBioPortal compare to TCGA data in Genome Data Commons?](/user-guide/faq.md#how-does-tcga-data-in-cbioportal-compare-to-tcga-data-in-genome-data-commons)
* [How do the TCGA studies sourced from Genomic Data Commons (GDC) compare to the other TCGA datasets? Which one should I use?](#how-do-the-tcga-studies-sourced-from-genomic-data-commons-gdc-compare-to-the-other-tcga-datasets-which-one-should-i-use)
* [How is mutation data loaded for legacy TCGA studies?](#how-is-mutation-data-loaded-for-legacy-tcga-studies)
* [What happened to TCGA Provisional datasets?](/user-guide/faq.md#what-happened-to-tcga-provisional-datasets)
* [What are TCGA Firehose Legacy datasets and how do they compare to the publication-associated datasets and the PanCancer Atlas datasets?](/user-guide/faq.md#what-are-tcga-firehose-legacy-datasets-and-how-do-they-compare-to-the-publication-associated-datasets-and-the-pancancer-atlas-datasets)
* [Where do the thresholded copy number call in TCGA Firehose Legacy data come from?](/user-guide/faq.md#where-do-the-thresholded-copy-number-call-in-tcga-firehose-legacy-data-come-from)
@@ -50,6 +51,8 @@
* [What is the difference between a “splice site” mutation and a “splice region” mutation?](/user-guide/faq.md#what-is-the-difference-between-a-splice-site-mutation-and-a-splice-region-mutation)
* [What do “Amplification”, “Gain”, “Deep Deletion”, “Shallow Deletion” and "-2", "-1", "0", "1", and "2" mean in the copy-number data?](/user-guide/faq.md#what-do-amplification-gain-deep-deletion-shallow-deletion-and--2--1-0-1-and-2-mean-in-the-copy-number-data)
* [What is GISTIC? What is RAE?](/user-guide/faq.md#what-is-gistic-what-is-rae)
* [What is ASCAT and how is it used in cBioPortal?](#what-is-ascat-and-how-is-it-used-in-cbioportal)
* [How is ASCAT copy number data converted to GISTIC?](#how-is-ascat-copy-number-data-converted-to-gistic)
* [RNA](/user-guide/faq.md#rna)
* [Does the portal store raw or probe-level data?](/user-guide/faq.md#does-the-portal-store-raw-or-probe-level-data)
* [What are mRNA and microRNA Z-Scores?](/user-guide/faq.md#what-are-mrna-and-microrna-z-scores)
@@ -126,6 +129,8 @@ You can bookmark your query results and share the URL with collaborators. We sto
The cBioPortal is an exploratory analysis tool for large-scale cancer genomic data sets. It hosts data from large consortium efforts, like [TCGA](https://cancergenome.nih.gov/) and [TARGET](https://ocg.cancer.gov/programs/target), as well as publications from individual labs. You can quickly view genomic alterations across a set of patients or across a set of cancer types, perform survival analysis, and perform group comparisons. If you want to explore specific genes or a pathway of interest in one or more cancer types, the cBioPortal is probably where you want to start.

By contrast, the [Genomic Data Commons (GDC)](https://gdc.cancer.gov/) aims to be the definitive place for full-download and access to all data generated by TCGA and TARGET. If you want to download raw mRNA expression files or full segmented copy number files, the GDC is probably where you want to start.

As of August 2024, the public cBioPortal contains datasets sourced from the GDC through [ISB-CGC BigQuery](https://portal.isb-cgc.org). Currently TCGA and CPTAC are supported, with more programs coming in the future. For an explanation of how these studies differ from their non-GDC counterparts, [see below](#how-do-the-tcga-studies-sourced-from-genomic-data-commons-gdc-compare-to-the-other-tcga-datasets-which-one-should-i-use).
#### Does the cBioPortal provide a Web Service API? R interface? MATLAB interface?
Yes, the cBioPortal provides a [Swagger API](https://www.cbioportal.org/api/swagger-ui.html), and [R/MATLAB interfaces](/web-API-and-Clients.md#r-client).
#### Can I use cBioPortal with my own data?
@@ -169,7 +174,7 @@ Check out the [Data Sets Page](https://www.cbioportal.org/datasets) for the comp
#### Which resources are integrated for variant annotation?
cBioPortal supports the annotation of variants from several different databases. These databases provide information about the recurrence of, or prior knowledge about, specific amino acid changes. For each variant, the number of occurrences of mutations at the same amino acid position present in the COSMIC database are reported. Furthermore, variants are annotated as “hotspots” if the amino acid positions were found to be recurrent linear hotspots, as defined by the Cancer Hotspots method ([cancerhotspots.org](https://www.cancerhotspots.org/)), or three-dimensional hotspots, as defined by 3D Hotspots ([3dhotspots.org](https://www.3dhotspots.org/)). Prior knowledge about variants, including clinical actionability information, is provided from three different sources: OncoKB ([www.oncokb.org](https://www.oncokb.org/)), CIViC ([civicdb.org](https://civicdb.org/)), as well as My Cancer Genome ([mycancergenome.org](https://www.mycancergenome.org/)). For OncoKB, exact levels of clinical actionability are displayed in cBioPortal, as defined by [the OncoKB paper](https://ascopubs.org/doi/full/10.1200/PO.17.00011).
#### What version of the human reference genome is being used in cBioPortal?
The [public cBioPortal](https://www.cbioportal.org) is currently using hg19/GRCh37.
The [public cBioPortal](https://www.cbioportal.org) largely uses hg19/GRCh37. However, there are plans to incorporate more hg38/GRCh38 studies in the future; for example, the newly incorporated NCI-CRDC studies use GRCh38.
#### How does cBioPortal handle duplicate samples or sample IDs across different studies?
The cBioPortal generally assumes that samples or patients that have the same ID are actually the same. This is important for cross-cancer queries, where each sample should only be counted once. If a sample is part of multiple cancer cohorts, its alterations are only counted once in the Mutations tab (it will be listed multiple times in the table, but is only counted once in the lollipop plot). However, other tabs (including OncoPrint and Cancer Types Summary) will count the sample twice - for this reason, we advise against querying multiple studies that contain the same samples (e.g., TCGA PanCancer Atlas and TCGA Firehose Legacy).
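
The difference between the two counting behaviours can be illustrated with a minimal sketch. It is not cBioPortal's implementation: the record layout `(study_id, sample_id, is_altered)`, the function name, and the study and sample IDs are all hypothetical and chosen only for illustration.

```python
def count_altered_samples(records):
    """Return (unique_count, per_study_count) of altered samples.

    `records` is an iterable of (study_id, sample_id, is_altered) tuples.
    This layout is a hypothetical stand-in, not a cBioPortal data structure.
    """
    records = list(records)  # allow generators, since we iterate twice
    # Counting by unique sample ID: a sample shared by two studies counts once.
    unique_count = len({sample for _, sample, altered in records if altered})
    # Counting per study entry: a shared sample is counted once per study.
    per_study_count = sum(1 for _, _, altered in records if altered)
    return unique_count, per_study_count


# SAMPLE-1 appears in both hypothetical studies.
example = [
    ("study_a", "SAMPLE-1", True),
    ("study_b", "SAMPLE-1", True),
    ("study_b", "SAMPLE-2", True),
    ("study_b", "SAMPLE-3", False),
]
unique_count, per_study_count = count_altered_samples(example)
```

Here the unique count corresponds to the "counted once" behaviour described above, and the per-study count to the tabs that count a shared sample twice.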
#### Are there any normal tissue samples available through cBioPortal?
@@ -186,14 +191,18 @@ If you need to reference an old version of a dataset, you can find previous vers
Data from AACR Project GENIE are provided in a [dedicated instance of cBioPortal](https://www.cbioportal.org/genie/). You can also download GENIE data from the [Synapse Platform](https://synapse.org/genie). Note that you will need to register before accessing the data. Additional information about AACR Project GENIE can be found on the [AACR website](https://www.aacr.org/Research/Research/Pages/aacr-project-genie.aspx).

### TCGA
#### How does TCGA data in cBioPortal compare to TCGA data in Genome Data Commons?
We do not currently load the mutation data from the GDC. Instead, we have the original mutation data generated by the individual TCGA sequencing centers. The source of the data is the Broad Firehose (or the publication pages for data that matches a specific manuscript). These data are usually a combination of two mutation callers, but they differ by center (typically a variant caller like MuTect plus an indel caller), and sequencing centers have modified their mutation calling pipelines over time.
#### How do the TCGA studies sourced from Genomic Data Commons (GDC) compare to the other TCGA datasets? Which one should I use?
The TCGA-GDC studies mirror the [Cancer Gateway in the Cloud (ISB-CGC)](https://portal.isb-cgc.org) that is hosted on Google BigQuery, which in turn mirrors GDC. They are updated frequently as new GDC releases come out, as opposed to the older TCGA datasets which are largely frozen. They also use a newer version of the human reference genome, GRCh38 instead of GRCh37.

Because these datasets are intended to be a pure reflection of what is available inside ISB-CGC, there may be gaps in data availability relative to the legacy studies. For example, nearly all samples in the legacy studies have been profiled for mutations, but TCGA-GDC is still missing a significant fraction of mutation data.
#### How is mutation data loaded for legacy TCGA studies?
For non-GDC TCGA studies, we have the original mutation data generated by the individual TCGA sequencing centers. The source of the data is the Broad Firehose (or the publication pages for data that matches a specific manuscript). These data are usually a combination of two mutation callers, but they differ by center (typically a variant caller like MuTect plus an indel caller), and sequencing centers have modified their mutation calling pipelines over time.
#### What happened to TCGA Provisional datasets?
We renamed TCGA Provisional datasets to TCGA Firehose Legacy to better reflect that this data comes from a legacy processing pipeline. The exact same data is now available in TCGA Firehose Legacy studies.
#### What are TCGA Firehose Legacy datasets and how do they compare to the publication-associated datasets and the PanCancer Atlas datasets?
The Firehose Legacy dataset (formerly Provisional datasets) for each TCGA cancer type contains all data available from the Broad Firehose. The publication datasets reflect the data that were used for each of the publications. The samples in a published dataset are usually a subset of the firehose legacy dataset, since manuscripts were often written before TCGA completed their goal of sequencing 500 tumors.

There can be differences between firehose legacy and published data. For example, the mutation data in the publication usually underwent more QC, and false positives might have been removed or, in rare cases, false negatives added. RNA-Seq and copy-number values may also differ slightly, as different versions of analysis pipelines could have been used. Additionally, due to additional curation during the publication process, the clinical data for the publication may be of higher quality or may contain a few more data elements, sometimes derived from the genomic data (e.g., genomic subtypes).
There can be differences between Firehose Legacy and published data. For example, the mutation data in the publication usually underwent more QC, and false positives might have been removed or, in rare cases, false negatives added. RNA-Seq and copy-number values may also differ slightly, as different versions of analysis pipelines could have been used. Additionally, due to additional curation during the publication process, the clinical data for the publication may be of higher quality or may contain a few more data elements, sometimes derived from the genomic data (e.g., genomic subtypes).

The TCGA PanCancer Atlas datasets derive from an effort to unify TCGA data across all tumor types. Publications resulting from this effort can be found at the [TCGA PanCancer Atlas site](https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html). In the cBioPortal, data from the PanCancer Atlas is divided by tumor type, but these studies have uniform clinical elements, consistent processing and normalization of mutations, copy number, mRNA data and are ideally processed for comparative analyses.
#### Where do the thresholded copy number call in TCGA Firehose Legacy data come from?
@@ -229,6 +238,20 @@ Copy number data sets within the portal are often generated by the [GISTIC](http
For TCGA studies, the table in all_thresholded.by_genes.txt (which is the part of the GISTIC output that is used to determine the copy-number status of each gene in each sample in cBioPortal) is obtained by applying both low- and high-level thresholds to the gene copy levels of all the samples. The entries with value +/- 2 exceed the high-level thresholds for amplifications/deep deletions, and those with +/- 1 exceed the low-level thresholds but not the high-level thresholds. The low-level thresholds are just the 'ampthresh' and 'delthresh' noise threshold input values to GISTIC (typically 0.1 or 0.3) and are the same for every sample.

By contrast, the high-level thresholds are calculated on a sample-by-sample basis and are based on the maximum (or minimum) median arm-level amplification (or deletion) copy number found in the sample. The idea, for deletions anyway, is that this level is a good approximation for hemizygous losses given the purity and ploidy of the sample. The actual cutoffs used for each sample can be found in a table in the output file sample_cutoffs.txt. All GISTIC output files for TCGA are available at: gdac.broadinstitute.org.
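
The two-tier thresholding described above can be sketched roughly as follows. This is a simplified illustration, not GISTIC's code: the function and parameter names are hypothetical, the strict comparisons are an assumption about what "exceed" means, and the per-sample high-level cutoffs are assumed to be supplied by the caller (for example, as read from sample_cutoffs.txt).

```python
def threshold_copy_level(level, high_amp, high_del, low_amp=0.1, low_del=0.1):
    """Discretize one gene copy level for one sample into -2..2.

    level:    gene copy level for the sample (same scale as the thresholds)
    high_amp: per-sample high-level amplification cutoff (positive)
    high_del: per-sample high-level deletion cutoff (negative)
    low_amp, low_del: noise thresholds shared by all samples (assumed 0.1)

    Assumes high_amp >= low_amp and high_del <= -low_del.
    """
    if level > high_amp:
        return 2    # exceeds the high-level amplification threshold
    if level > low_amp:
        return 1    # exceeds only the low-level amplification threshold
    if level < high_del:
        return -2   # exceeds the high-level deletion threshold
    if level < -low_del:
        return -1   # exceeds only the low-level deletion threshold
    return 0        # within the noise band


# Hypothetical cutoffs for a single sample.
calls = [
    threshold_copy_level(x, high_amp=0.9, high_del=-1.1)
    for x in (1.2, 0.5, 0.05, -0.4, -1.5)
]
```

Because the high-level cutoffs are passed per call, two samples with the same copy level can receive different discrete values, which matches the sample-by-sample behaviour described above.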
#### What is ASCAT and how is it used in cBioPortal?
[ASCAT (Allele-Specific Copy number Analysis of Tumors)](https://www.pnas.org/doi/full/10.1073/pnas.1009843107) is an algorithm designed to analyze allele-specific copy number variations (CNVs) in tumor DNA. Copy number data from the GDC analysis pipelines is provided in ASCAT format; more detail is available on the [GDC website](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/#ascat-pipelines). cBioPortal does not support ASCAT data directly; it is first converted to GISTIC format as described below.
#### How is ASCAT copy number data converted to GISTIC?
The following conversion thresholds are applied to the total copy number (TCN) from ASCAT:

| ASCAT Value | GISTIC Value | Meaning |
|---|---|---|
| TCN = 0 | -2 | Deep loss |
| TCN = 1 | -1 | Single-copy loss |
| TCN = 2 | 0 | Diploid |
| 2 < TCN < 6 | 1 | Low-level gain |
| 6 ≤ TCN | 2 | Amplification |

The final conversion threshold (6 ≤ TCN) is somewhat flexible and can vary between different studies depending on the data used. We chose 6 and applied it to all GDC studies after seeing that it resulted in the most consistency between TCGA-GDC and PanCancer Atlas.
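
The conversion table can be expressed as a small function. This is a minimal sketch of the mapping only, not the code used to build the GDC studies; the function name and the `amp_threshold` parameter are hypothetical, with the default of 6 taken from the table above.

```python
def ascat_tcn_to_gistic(tcn: int, amp_threshold: int = 6) -> int:
    """Map an ASCAT total copy number (TCN) to a GISTIC-style value.

    amp_threshold is the smallest TCN treated as an amplification.
    It is assumed to be greater than 2.
    """
    if tcn < 0:
        raise ValueError("total copy number cannot be negative")
    if tcn == 0:
        return -2   # deep loss
    if tcn == 1:
        return -1   # single-copy loss
    if tcn == 2:
        return 0    # diploid
    if tcn < amp_threshold:
        return 1    # low-level gain
    return 2        # amplification


gistic_values = [ascat_tcn_to_gistic(t) for t in (0, 1, 2, 3, 5, 6, 10)]
```

Keeping the amplification cutoff as a parameter reflects the note above that this threshold can vary between studies; for example, `ascat_tcn_to_gistic(6, amp_threshold=8)` would classify a TCN of 6 as a low-level gain.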

### RNA
#### Does the portal store raw or probe-level data?