GitHub - melissacline/CAGI5-BRCA-Assessment

CAGI5-BRCA-Assessment

The assessment can be repeated with the following steps:

Download the data files from CAGI. Make the src and data subdirectories. Create the ../data/actual and ../data/prediction subdirectories. Copy the ENIGMA file CAGI_megamultifac_2018-06-01.txt into actual. Copy the prediction files into ../data/prediction.
Use these commands to build the file ../data/actual/variant.summary.txt. Import into Google sheets. Label the domains by hand. Build a pivot table to serve as Table 1 from the paper.

cut -f1,3,4 ../data/predictions.merged.txt |tail -n +2 |awk '{ntokens = split($1,tokens,  ":c."); split(tokens[2], subtokens, "[ACGT_]"); print tokens[1] "\t" subtokens[1] "\t" $2 "\t" $3}' |sort -n -k2  |grep BRCA1  > ../data/actual/variant.summary.txt
cut -f1,3,4 ../data/predictions.merged.txt |tail -n +2 |awk '{ntokens = split($1,tokens,  ":c."); split(tokens[2], subtokens, "[ACGT_]"); print tokens[1] "\t" subtokens[1] "\t" $2 "\t" $3}' |sort -n -k2  |grep BRCA2  >> ../data/actual/variant.summary.txt

Create the subdirectory ../data/vep_raw. Create the file vep_input.txt, which contains the cDNA HGVS representation of each variant. Use the canonical RefSeq transcripts NM_007294.3 for BRCA1 and NM_000059.3 for BRCA2. The nucleotide HGVS suffix is in the ENIGMA data. Run VEP (online) with vep_raw/vep_input.txt, generating vep_raw/vep_output.txt. Preprocess as shown, to add VEP as a new "predictor"

vep_preprocess.bash

Merge the actual clinical signficance and predictions

mergeData.py > ../data/predictions.merged.txt

Relabel the header rows and method descriptions by hand. Sort the rows to put the methods in alphabetical order.

Creates ../data/predictions.merged.labeled.sorted.txt

Import this file into a spreadsheet. Creates Supplemental Table 1.

5b. This step is presented for the reader convenience. The authors of LEAP included data that indicate which features were most important in the prediction of each variant, for LEAP 1 and LEAP 2. The following command line generates a sorted list of the features that were most important for LEAP 1 for variants classified as benign or likely benign:

tail -n +2 predictions.merged.labeled.sorted.txt \
  | awk -F'\t' '{ if ($3 < 3) { print $28 }}' |awk -F'=' '{ print $2}' \
  | awk -F';' '{ for (ii = 1; ii <= NF; ii++) { print $ii}}' \
  | sed 's/^ [0-9].//' |sort |uniq -c |sort -n -r |more

To see the equivalent results for variants classified as pathogenic or likely pathogenic, in the first awk command, replace if ($3 < 3) with if ($3 > 3)

To see the equivalent results for LEAP 2, in the first awk command, print $31 rather than $28.

Generate Supplemental Figures 1 and 2, Figure 3, and the heatmap of Figure 1 as follows:

Rscript boxplots.heatmap.all.predictions.R

To finish Figure 1, the legend is added by hand in Google Slides.

Compute the assessment statistics, with R

Rscript computeAssessmentStats.R

Creates assessment.stats.txt

Sort the lines in assessment.stats.txt and clean up the labels by hand to create assessment.stats.sorted.txt. Import this file into an EXCEL spreadsheet to create Supplemental Table S2.
Generate Figure 3, Supplemental Figues 1 and 2, and the dendrogram of Figure 1 as follows:

Rscript boxplots.heatmap.all.predictions.R

Generate Figure 2 as follows:

Rscript summary.stat.barcharts.R

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
data		data
src		src
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CAGI5-BRCA-Assessment

About

Releases

Packages

Languages

melissacline/CAGI5-BRCA-Assessment

Folders and files

Latest commit

History

Repository files navigation

CAGI5-BRCA-Assessment

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages