Aurora Status Documentation

Anish Chakka edited this page Sep 10, 2020 · 18 revisions

1. Data

The data transferred can be found in pghbio under the folder:

/pghbio/aurora/data/aurora

Under this folder you can see the following sub-folders.

  1. DNA-Seq
  2. RNA-Seq
  3. Methylation
  4. HE_slides_101719
  5. Metadata
  6. XMLs

Soft links to all the files were created to organize the data by patient. They can be found in:

/pghbio/aurora/data/ByPatient
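A minimal sketch of how a by-patient layout like this could be built with symbolic links. It assumes each file name starts with an aliquot barcode whose first two dash-separated fields identify the patient (for example AUR-AER5). The example files, the temporary directory and the per-assay subfolder are hypothetical, and this is not necessarily the script that was used.

```shell
#!/bin/sh
set -eu

# Throwaway example tree; these file names are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/DNA-Seq" "$workdir/data/RNA-Seq" "$workdir/ByPatient"
touch "$workdir/data/DNA-Seq/AUR-AER5-TTP2-B-2-1-D-A745-37.bam" \
      "$workdir/data/RNA-Seq/AUR-AER5-TTP2-B-2-1-R-A742-41.bam" \
      "$workdir/data/RNA-Seq/AUR-AFE4-TTM2-A-1-1-R-A742-41.bam"

# Link every file into ByPatient/<patient>/<assay>/, where <patient> is the
# first two dash-separated fields of the barcode (e.g. AUR-AER5) and <assay>
# is the name of the directory the file lives in.
find "$workdir/data" -type f | while IFS= read -r f; do
    name=$(basename "$f")
    patient=$(printf '%s\n' "$name" | cut -d- -f1,2)
    assay=$(basename "$(dirname "$f")")
    mkdir -p "$workdir/ByPatient/$patient/$assay"
    ln -s "$f" "$workdir/ByPatient/$patient/$assay/$name"
done

ls "$workdir/ByPatient"
```

Because the links point at absolute paths, the original files stay where they are and only the view by patient is new.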

1.1. Data Changes

Email from Anthony Bryan (09/09/20):

The relabel for AER5 NT1 has been completed at NCH and the updated XML has been uploaded. The new and old barcodes are provided below for your reference.

| Old_Barcode | New_Barcode | Biospecimen Barcode Bottom |
|---|---|---|
| AUR-AER5-NT1-A-2-1-D-A745-37 | AUR-AER5-TTP2-B-2-1-D-A745-37 | 175152452 |
| AUR-AER5-NT1-A-2-1-R-A742-41 | AUR-AER5-TTP2-B-2-1-R-A742-41 | 175152357 |
| AUR-AER5-NT1-A-2-1-D-A738-43 | AUR-AER5-TTP2-B-2-1-D-A738-43 | 175156004 |
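If local file names need to follow the relabel, a "find and replace" on file names could look like the sketch below. The barcode pairs are the ones listed above; the example files and the temporary directory are hypothetical, and nothing here reflects how the rename was or will actually be carried out.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Old and new barcodes from the relabel email, tab-separated.
{
    printf 'AUR-AER5-NT1-A-2-1-D-A745-37\tAUR-AER5-TTP2-B-2-1-D-A745-37\n'
    printf 'AUR-AER5-NT1-A-2-1-R-A742-41\tAUR-AER5-TTP2-B-2-1-R-A742-41\n'
    printf 'AUR-AER5-NT1-A-2-1-D-A738-43\tAUR-AER5-TTP2-B-2-1-D-A738-43\n'
} > "$workdir/barcode_map.tsv"

# Hypothetical files whose names carry an old barcode.
mkdir "$workdir/files"
touch "$workdir/files/AUR-AER5-NT1-A-2-1-D-A745-37.bam" \
      "$workdir/files/AUR-AER5-NT1-A-2-1-R-A742-41_R1.fastq.gz"

# Rename every file whose name contains an old barcode.
while IFS="$(printf '\t')" read -r old new; do
    for f in "$workdir/files"/*"$old"*; do
        [ -e "$f" ] || continue     # no file matched this barcode
        dir=$(dirname "$f")
        base=$(basename "$f")
        new_base=$(printf '%s\n' "$base" | sed "s/$old/$new/")
        mv "$f" "$dir/$new_base"
    done
done < "$workdir/barcode_map.tsv"

ls "$workdir/files"
```

Replacing `mv` with `echo mv` gives a dry run that only prints the planned renames.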

2. DNA-Seq

2.1. Location

The DNA-Seq files can be found in:

achakka: DNA-Seq$ pwd
/pghbio/aurora/data/aurora/DNA-Seq
achakka: DNA-Seq$ tree -d -L 2
.
├── Metadata
├── WES
│   ├── Annotation
│   ├── BAMs
│   ├── clonality_cellularity
│   └── VCFs
└── WGS
    ├── Annotation
    ├── BAMs
    ├── CNVs
    └── VCFs

2.2. Files summary

The DNA-Seq data includes data for WES and WGS samples.

For WES, we have the following files:

  1. Fastq
  2. BAMs
  3. VCFs
  4. Annotation VCFs
  5. Cellularity and Clonality

The summary of files can be seen below:

| File_type | No. of files |
|---|---|
| BAMs | 198 |
| BAIs | 198 |
| VCFs | 193 |
| Ann VCFs | 200 |
| Cellularity & Clonality | 47 |

For WGS, we have the following files:

  1. Fastq
  2. BAMs
  3. VCFs
  4. Annotation VCFs
  5. CNVs

A table summarizing the number of files can be seen below:

| File_type | No. of files |
|---|---|
| BAMs | 193 |
| BAIs | 193 |
| VCFs | 193 |
| Ann VCFs | 193 |
| CNV CSV | 191 |
| CNV modeled PNGs | 191 |
| CNV denoised PNGs | 191 |

2.3. VCF Annotation

WES and WGS VCF files were annotated using Vcf2maf and CRAVAT.

2.3.1 Vcf2maf

WES and WGS VCF files were filtered to keep only variants labeled PASS. The filtered files were then annotated using Vcf2maf to produce MAF files.

The MAF files were combined into a single MAF file and shared.
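The PASS filtering step can be sketched with awk, assuming the standard VCF layout in which FILTER is the seventh tab-separated column. The tiny VCF below is made up for illustration, and this is not necessarily the command that was used; `bcftools view -f PASS` is a common alternative.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
in_vcf="$workdir/example.vcf"
out_vcf="$workdir/example.pass.vcf"

# Tiny made-up VCF: two PASS records and one filtered record.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n'
    printf 'chr1\t200\t.\tG\tC\t.\tgermline\t.\n'
    printf 'chr2\t300\t.\tC\tG\t.\tPASS\t.\n'
} > "$in_vcf"

# Keep header lines (starting with '#') and records whose FILTER column is PASS.
awk -F'\t' '/^#/ || $7 == "PASS"' "$in_vcf" > "$out_vcf"

# Number of variant records that survived the filter.
grep -vc '^#' "$out_vcf"
```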

2.3.2 CRAVAT

CRAVAT (https://run.opencravat.org/submit/nocache/index.html) takes input in the VCF format or in its own tab-separated format. We converted the MAF files to the input format CRAVAT requires, because the VCF files were causing issues.

The CRAVAT results can be found in the folder: WES_144samples_CRAVAT_input.maf/

CRAVAT produced 4 TSV files:

  1. WES_144samples_CRAVAT_input.maf.gene.tsv
  2. WES_144samples_CRAVAT_input.maf.mapping.tsv
  3. WES_144samples_CRAVAT_input.maf.sample.tsv
  4. WES_144samples_CRAVAT_input.maf.variant.tsv

Filtered Variants and Depth values
The variant file (WES_144samples_CRAVAT_input.maf.variant.tsv) is huge (~7GB). As requested by Dr. Adrian Lee, variants that were not of interest, i.e., those that do not result in an amino acid change, were filtered out. The annotations that were filtered out are:

  1. 2kb downstream
  2. 2kb upstream
  3. 3-prime utr
  4. 5-prime utr
  5. synonymous
  6. intron
  7. unknown
  8. blanks (any empty cells)
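A minimal sketch of this kind of annotation filter, using awk on a tab-separated file. The column name `consequence` and the example rows are hypothetical (the real CRAVAT column name and values may differ), so treat this as an illustration of the approach rather than the filter that was run.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
in_tsv="$workdir/variants.tsv"
out_tsv="$workdir/variants_filtered.tsv"

# Made-up example rows; the "consequence" column name is hypothetical.
{
    printf 'gene\tconsequence\tsamples\n'
    printf 'GENE1\tmissense\tS1\n'
    printf 'GENE2\tsynonymous\tS2\n'
    printf 'GENE3\t\tS3\n'
    printf 'GENE4\tintron\tS1;S2\n'
    printf 'GENE5\tframeshift insertion\tS4\n'
} > "$in_tsv"

awk -F'\t' -v col="consequence" '
    BEGIN {
        # annotations to drop, as listed in the text
        n = split("2kb downstream,2kb upstream,3-prime utr,5-prime utr,synonymous,intron,unknown", drop_list, ",")
        for (i = 1; i <= n; i++) drop[drop_list[i]] = 1
        drop[""] = 1   # blanks (empty cells)
    }
    NR == 1 {
        # locate the annotation column by its header name
        for (i = 1; i <= NF; i++) if ($i == col) c = i
        if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
        print; next
    }
    !($c in drop)
' "$in_tsv" > "$out_tsv"

cat "$out_tsv"
```

Streaming the file through awk avoids loading the whole ~7GB table into memory.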

The CRAVAT results did not include depth values, so depth values obtained from the MAF file were added to this filtered variant list. The result can be found in the file cravat_variants_filtered_with_depth.tsv. 7 new columns were added; they are described below:

  1. t_depth - tumor overall depth
  2. t_ref_count - tumor ref allele depth
  3. t_alt_count - tumor alt allele depth
  4. t_af - tumor allele frequency
  5. n_depth - normal overall depth
  6. n_ref_count - normal ref allele depth
  7. n_alt_count - normal alt allele depth

Notes:

  1. If a variant is present in more than one sample, the sample name column lists all the samples which have this variant. They are separated by ‘;’.
  2. To maintain the same format, each of the 7 depth values is also separated by ‘;’ for multi-sample variants.

For example (showing only the sample name and depth columns), a variant present in only one sample and a variant present in multiple samples look like this:

base:samples	t_depth	t_ref_count	t_alt_count	t_af	n_depth	n_ref_count	n_alt_count
A738_H06	8	3	5	0.625	571	571	0
A738_H06;A739_C08	14;62	5;52	9;10	0.642857142857143;0.161290322580645	665;690	665;690	0;0
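For downstream analysis it can be convenient to expand the ‘;’-separated rows into one row per sample. The sketch below does this with awk on the two example rows above; it assumes the k-th entry of every depth column belongs to the k-th sample named in the first column, which is what the example suggests but is not stated explicitly.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
in_tsv="$workdir/depth_example.tsv"
out_tsv="$workdir/depth_per_sample.tsv"

# The two example rows shown above.
{
    printf 'base:samples\tt_depth\tt_ref_count\tt_alt_count\tt_af\tn_depth\tn_ref_count\tn_alt_count\n'
    printf 'A738_H06\t8\t3\t5\t0.625\t571\t571\t0\n'
    printf 'A738_H06;A739_C08\t14;62\t5;52\t9;10\t0.642857142857143;0.161290322580645\t665;690\t665;690\t0;0\n'
} > "$in_tsv"

# Emit one row per sample: entry k of each ';'-separated column is paired
# with sample k from the first column (an assumption, see the text).
awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print; next }
    {
        n = split($1, sample, ";")
        for (k = 1; k <= n; k++) {
            row = sample[k]
            for (i = 2; i <= NF; i++) {
                split($i, v, ";")
                row = row OFS v[k]
            }
            print row
        }
    }
' "$in_tsv" > "$out_tsv"

cat "$out_tsv"
```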

Significant Genes List
A list of 74 significant genes was provided. WES_144samples_CRAVAT_input.maf.variant.tsv was filtered to keep only variants found in these 74 genes. All variants in these genes were kept and no other filters were applied. Depth values were added to this file as well. It can be found in: cravat_variants_significant_genes_with_depth.tsv
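Restricting a variant table to a gene list can be sketched with a two-file awk lookup. The gene symbols, the file names and the assumption that the gene symbol sits in the first column are all placeholders for illustration, not the layout of the actual CRAVAT file.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Placeholder gene list, one symbol per line; the real list has 74 genes.
printf '%s\n' 'GENE_A' 'GENE_C' > "$workdir/significant_genes.txt"

# Made-up variant rows with the gene symbol in the first column.
{
    printf 'gene\tvariant\tsamples\n'
    printf 'GENE_A\tv1\tS1\n'
    printf 'GENE_B\tv2\tS2\n'
    printf 'GENE_C\tv3\tS1;S3\n'
    printf 'GENE_A\tv4\tS4\n'
} > "$workdir/variants.tsv"

# First file: remember the wanted genes. Second file: keep the header and
# every row whose gene is in the list; no other filter is applied.
awk -F'\t' '
    NR == FNR { keep[$1] = 1; next }
    FNR == 1 || ($1 in keep)
' "$workdir/significant_genes.txt" "$workdir/variants.tsv" \
    > "$workdir/variants_significant_genes.tsv"

cat "$workdir/variants_significant_genes.tsv"
```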

2.4. Notes

  1. 7 WES samples were re-analyzed. We received 5 BAMs and 7 annotation VCFs.
  2. Clonality and cellularity information was provided with 47 normal patient IDs.

2.5. Issues

  1. 11 patients seem to have tumor name mismatches in the VCF files. This is true for both WES and WGS samples.
  2. There are 144 mutect2 variant calling VCF files.
  3. In 53 of these VCF files, the tumor name does not match the VCF file.
  4. These 53 VCF files belong to 11 patients: AUR-AE5H, AUR-AER4, AUR-AER6, AUR-AF99, AUR-AFE4, AUR-AFE5, AUR-AFEA, AUR-AFEC, AUR-AFKB, AUR-AFSO and AUR-AFUL.
  5. Ben acknowledged the issue and we are yet to receive the corrected VCF files.
  6. The CNV files provided include only the ratios at the gene level. We need segmentation files as well.

3. RNA-Seq

3.1. Location

The RNA-Seq files can be found in the folder:

achakka: RNA-Seq$ pwd
/pghbio/aurora/data/aurora/RNA-Seq
achakka: RNA-Seq$ tree -d -L 2
.
└── Aurora
    ├── BAMs
    ├── Fastq
    ├── Level4
    ├── MAFs
    ├── Quants
    └── VCFs

3.2. Files summary

The RNA-Seq data includes Fastq files, BAMs, quants, VCFs and MAFs.

A table summarizing the number of files can be seen below:

File_type No. of files
Fastq R1 253
Fastq R2 253
BAMs 253
BAIs 253
Quant Raw 253
Quant Gene level 253
Quant Gene normalized 253
Level4 (SABCS) 3
VCFs 252
MAFs 2

3.3. Notes

  1. Initially we received 237 fastqs, BAMs, BAIs and Quants (raw, gene and gene_normalized)
  2. Later 16 RNA-Seq samples were resequenced and their fastqs, BAMs, BAIs and Quants (raw, gene and gene_normalized) were added to the list, bringing the total to 253 samples.
  3. A few samples that were completely absent before were among the 16 re-sequenced samples. So it is not strictly re-sequencing, as a few of the samples are new.
  4. Level4 files, which include 125 samples from the SABCS conference, were provided.
  5. We later received VCFs as well, but only for 252 samples. Upon further inspection, we noticed that only 251 aliquots overlap.
  6. MAF files were also given. The one big file (~65GB) is assumed to contain the variants from all VCFs.
  7. A small MAF file, which includes only variants from the genes BRCA1, BRCA2, PALB2, ATM and CHEK2, was also provided. This is assumed to be a subset of the big file.

3.4. Issues

3.4.1. Re-sequencing issues

  1. It was assumed the re-sequenced samples were better than the previous ones. But that's not always the case.
  2. Among the 125 SABCS samples, sometimes older samples were picked instead of the new re-sequenced samples.

3.4.2. VCF issues

  1. We found 251 common aliquots (samples) between my sheet and their mapping file.
  2. AUR-AFE4-TTM2-A-1-1-R-A742-41 and AUR-AFE5-TTM10-A-1-1-R-A742-41 are missing in the VCF mapping file.
  3. AUR-AE5H-TTM1-A-1-1-R-A742-41 is missing in our list. I double checked the list of files originally provided and can find no record of this sample.
  4. Interestingly AUR-AE5H-TTM1 is absent in RNA-Seq, but present in WES/WGS and Methylation.
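A comparison like the one above (253 aliquots in our list, 252 in the mapping file, 251 in common) can be reproduced with `comm` on two sorted lists. The sketch below uses miniature made-up lists: the shared entry `AUR-XXXX-...` is a placeholder, while the other three barcodes are the ones named in the issues above. The file names are hypothetical.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Miniature stand-ins for the two aliquot lists (comm needs sorted input).
printf '%s\n' \
    'AUR-XXXX-TTM1-A-1-1-R-A742-41' \
    'AUR-AFE4-TTM2-A-1-1-R-A742-41' \
    'AUR-AFE5-TTM10-A-1-1-R-A742-41' | sort > "$workdir/our_list.txt"
printf '%s\n' \
    'AUR-XXXX-TTM1-A-1-1-R-A742-41' \
    'AUR-AE5H-TTM1-A-1-1-R-A742-41' | sort > "$workdir/vcf_mapping.txt"

# -12: lines in both files, -23: only in the first, -13: only in the second
comm -12 "$workdir/our_list.txt" "$workdir/vcf_mapping.txt" > "$workdir/common.txt"
comm -23 "$workdir/our_list.txt" "$workdir/vcf_mapping.txt" > "$workdir/missing_in_mapping.txt"
comm -13 "$workdir/our_list.txt" "$workdir/vcf_mapping.txt" > "$workdir/missing_in_our_list.txt"

wc -l "$workdir/common.txt" "$workdir/missing_in_mapping.txt" "$workdir/missing_in_our_list.txt"
```

With the real lists, the three counts should come out as 251, 2 and 1 if the numbers reported above are right.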

3.5. SABCS

For the SABCS conference, the best 125 RNA-Seq aliquot samples were selected. The 16 re-sequenced samples are not necessarily a part of these.

4. Methylation

Four levels of files were provided for the Methylation data, and they were submitted as V1 and V2.

4.1. Location

The files are located in the following directories:

achakka: Methylation$ pwd
/pghbio/aurora/data/aurora/Methylation
achakka: Methylation$ tree -d -L 4
.
├── V1
│   └── VARI_MethylationEPIC
│       └── MethylationEPIC__Current
│           ├── MethylationEPIC_Level_1.Batch.Version
│           ├── MethylationEPIC_Level_2.Batch.Version
│           ├── MethylationEPIC_Level_3.Batch.Version
│           └── MethylationEPIC_Level_4.Version
└── V2
    └── VARI_MethylationEPIC
        ├── MethylationEPIC_Archives
        │   └── MethylationEPIC_Level_4.Version
        └── MethylationEPIC__Current
            ├── MethylationEPIC_Level_1.Batch.Version
            ├── MethylationEPIC_Level_2.Batch.Version
            ├── MethylationEPIC_Level_3.Batch.Version
            └── MethylationEPIC_Level_4.Version

4.2. Files summary

File_type No. of files
IDAT (level1) 396
TSV (level2) 198
TSV (level3) 198
TSV (level4) 1
TSV (level5) 1

4.3. Notes

  1. V1 files included all levels up to level 5.
  2. V2 files, which were reanalyzed, were also provided, but we have not yet been provided with the level 5 file.
  3. The following email was received on Aug 21, 2020:
Adrian,
Can you review the file Toshi sent to see if it has the type of annotation we need to create the freeze set.
I've also moved the data to Box under Molecular Data/Freeze Set

Anish,
Please me a note of this file and add it to the documentation.

thanks
uMa

4.4. Issues

  1. For V2, the level 4 files provide only the probe IDs.
  2. We need to know whether annotations will be provided for them, or whether we need to annotate them ourselves.

5. H&E Slides

5.1. Location

The H&E slide files are in .svs format. They can be found in the following location:

achakka: HE_slides_101719$ pwd
/pghbio/aurora/data/aurora/HE_slides_101719
achakka: HE_slides_101719$ tree -d
.
├── Metadata
├── nationwidechildrens.org_BRCA.tissue_images.Level_1.517.0.AUR
└── nationwidechildrens.org_BRCA.tissue_images.Level_1.620.0.AUR

5.2. Files

In total there are 228 H&E .svs images.

6. Metadata

Includes pipeline and metadata information.

6.1. Location

The metadata files can be found in:

achakka: Metadata$ pwd
/pghbio/aurora/data/aurora/Metadata
achakka: Metadata$ tree
.
├── DNA-Seq
│   ├── 180718_Aurora_NCH.xlsx
│   └── Aurora_DNAseq_methods.docx
├── Documentation
├── HandE_slides
│   ├── AUR\ DNA\ A78F\ VA.xlsx
│   ├── AUR\ DNA\ A78G\ IGM.xlsx
│   ├── AUR\ RNA\ A78E\ UNC.xlsx
│   └── slide_to_aur_Id_Link.xlsx
├── Meth
│   ├── Data.description_180921.pdf
│   └── Meth_sample.mapping.tsv
├── Misc
│   ├── Aurora_DATA_2018-09-27_1152.csv
│   ├── Aurora_Patientid_Clinical_Combined_wLabels_10_3_2018.xlsx
│   └── Aurora_specimen_ids.xlsx
└── RNA-Seq
    ├── b1.AURORA_pilot_RNAseq.metadata.final.tsv
    ├── b2_metadata.tsv
    ├── Nextflow
    │   ├── readme_email.txt
    │   └── RNA-Seq_Aurora.zip
    └── RNA.vcf.mapping.file.txt

7. cBio

On Hold

7.1. Issues

  1. We need discrete CN values. We have only been provided with log2 ratios.
  2. MAF files are needed for the variant calls. We have them in VCF format.

8. ESTIMATE

To calculate tumor purity, the ESTIMATE algorithm was run on gene expression data.

8.1. Input

The only required input is a gene expression data set with HUGO gene symbols.

I used two datasets for calculating tumor purity, which gives us two results. The datasets are:

  1. SABCS 125 samples expression dataset provided in Box: AURORA_UQN_125_log2_w0_70p_batch_corrected_mdctr.st.txt
  2. Gene level count data from 253 samples available at PSC: /pghbio/aurora/data/aurora/RNA-Seq/Aurora/Quants/gene

8.1.1. SABCS

The file with log2 values provided in Box was directly used for analysis and the results were generated.

It wasn't clear whether log2 or unlogged values should be used. So I unlogged the data as well and ran the algorithm on it. Both produced the same results.

library(estimate)

## SAPS samples
# log values
filterCommonGenes(input.f="../../Results/RNA-Seq/AURORA_UQN_125_log2_w0_70p_batch_corrected_mdctr.st.txt", 
                  output.f="../../Results/RNA-Seq/ESTIMATE/SAPS/genes_log.gct", 
                  id="GeneSymbol")
estimateScore("../../Results/RNA-Seq/ESTIMATE/SAPS/genes_log.gct", 
              "../../Results/RNA-Seq/ESTIMATE/SAPS/estimate_score_log.gct", 
              platform="illumina")

8.1.2. 253 samples

The gene level counts from 253 samples were used for this analysis. The first two rows of each file contain the sample name and header, so they had to be removed before proceeding with the analysis.

The raw gene level counts were normalized and converted to CPM values using the edgeR package. This CPM matrix was provided as input to the ESTIMATE algorithm.

library("edgeR")
library(estimate)

# mapping between aliquots and file names (gene-level quants)
map.file <- read.delim2("rnaseq_mapping.txt")
# list.files() takes a regular expression, not a shell glob
files <- list.files("../../Results/RNA-Seq/Quants/gene", pattern="salmon_gene\\.txt$", full.names=TRUE)

# list of gene level quant files (full path plus bare file name)
metadata <- data.frame(files = files, filenames = basename(files), stringsAsFactors = FALSE)

# merge mapping file with filenames
metadata <- merge.data.frame(metadata, map.file, by.x = "filenames", by.y = "Gene")

## get count data as dge object
# read the files in the merged order, so that the count columns line up
# with the metadata rows even if merge() reorders or drops files
counts = readDGE(metadata$files)$counts
colnames(counts) = metadata$BCR.aliquot.barcode

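# normalize
d = DGEList(counts=counts)
# estimate normalization factors using TMM
d = calcNormFactors(d, method="TMM")

# dispersion estimates without a design matrix; cpm() below does not use them
d = estimateCommonDisp(d)
d = estimateTagwiseDisp(d)
# CPM computed on the TMM-normalized (effective) library sizes
n_cpms = cpm(d, normalized.lib.sizes = TRUE)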

# add gene names as a column
gene_names_n <- row.names(n_cpms)
n_cpms <- cbind(gene_names_n, n_cpms)

# write the cpm output
write.table(n_cpms, file = "../../Results/RNA-Seq/Counts/Aurora_253_samples_gene_counts_cpm_TMMnormalized.txt",
            quote = FALSE, sep = "\t", row.names = FALSE)

## ESTIMATE tumor purity
filterCommonGenes(input.f="../../Results/RNA-Seq/Counts/Aurora_253_samples_gene_counts_cpm_TMMnormalized.txt", 
                  output.f="../../Results/RNA-Seq/ESTIMATE/253_samples/genes_ncpms.gct", 
                  id="GeneSymbol")
estimateScore("../../Results/RNA-Seq/ESTIMATE/253_samples/genes_ncpms.gct", 
              "../../Results/RNA-Seq/ESTIMATE/253_samples/estimate_score_ncpms.gct", 
              platform="illumina")

8.2. Results

The results for both datasets can be found in the Excel file estimateresults.xlsx on Box.

The stromal, immune and tumor scores of the 125 SABCS samples in the SABCS dataset do not match the scores of the same 125 samples in the 253-sample dataset.

8.3. Notes

  1. Using log2 or unlog values does not seem to make a difference in the scores.
  2. The scores for 125 SABCS samples are different between the datasets.
  3. As a sanity check, the mapping of file names to aliquot barcodes was rechecked against the column names provided in the SABCS dataset. The file mappings were right, so the difference in values is not caused by incorrect file mappings.

9. Sample Mislabeling

Dr. Adrian Lee's email (Jun 23, 2020)

AFUG TTP1 and AER6 TTM4

Raw and processed sequencing files (e.g. the whole sample folder from that patient) will be archived at PSC and made not available.

For processed data at Box:

  1. Susana to remove samples from RNAseq count matrix
  2. Toshi to remove samples from Level 4 and 5 data frames
  3. Ben we will delete the clonality and CNV files for these cases as these remain as individual files.

AER5 NT1 and AER5 TTM2 (TTP3 as well?)

One suggestion is to create dummy names e.g AER5 TTPX (AER5 NT1) and AER5 NTX (AER5 TTM1) and then virtually link files to these dummy names. [Note AER5 NT1 is actually a primary taken at autopsy].

Alternative is to rename at the source e.g. bioXML, REDCAP and all files downstream. We will discuss what this would look like and report back next Tues.

AEF9 TTM1, 2, and 4

Needs some more discussion

Dr. Adrian Lee's email (July 2, 2020)

Dear All,
A summary of what I heard on Tues and how we intend to implement this (perhaps with final approval after next Tues meeting).
AFUG TTP1 and AER6 TTM4 – will proceed with archiving of raw files at PSC out of view, and removal of downstream analysis files.

AER5 NT1 and AER5 TTM2

AER5 TTM2 - as it was harvested as a tumor goal is to leave it as TTM2, but clearly has low purity – will leave labeling as is and note that it was used as a normal for DNA analysis.
AER5 NT1 – clearly has primary breast tumor in it. Will rename to either AER5 NT1_TTPX or AER5 TTPX. I don’t think it really matters how this is handled, perhaps former is clearer (AER5 NT1_TTPX). BTW not sure if the X could throw a spanner in any analysis/file searching etc and whether it should just be the next number e.g. TTP4.

Given that we will rename AER5 NT1 some comments and questions for each site: REDCAP – Katie you will add notes on the switch in Redcap. Are there any data fields in this sample that need changing. As we use this to harvest clinical data and link to XML and downstream data files we need to know what you think will change in Redcap, if anything, and if it changes a pull of the new data will be important. XML - We think it best if NCH alters the XML for this sample and then send us the new file – if we alter the XML it will be conflicting with the one at NCH.

Once we have the edited XML we will use this to link to the clinical data and all downstream files. We can then do a ‘find and replace’ on file names.
I enclosed the slides we showed on Tues on how this will happen on the backend.

All the best
Adrian

tags: Aurora