Questions for homework 3 #21

k8hertweck · 2019-10-22T22:29:07Z

Include questions about homework 3 here! @gavinha and I will be happy to help answer them.

LABrumage · 2019-10-23T18:26:57Z

Is the code for 1a and 1b supposed to go all in the same code block below 1b? Also, for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments? Thank you!

k8hertweck · 2019-10-23T19:20:38Z

@LABrumage Sorry about leaving out a code chunk! Please add one for 1a.

@gavinha, can you speak to this question:

for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments?

rcsegura · 2019-10-24T01:45:25Z

We've (@LABrumage + @sbest0128) been trying to get the Segment_Mean for the copy number segments in problem 1b, but we are stuck. We have the overlaps, but we can't filter out which means are associated with the overlap values - is there anything that we should try doing?

gavinha · 2019-10-24T03:25:19Z

We've (@LABrumage + @sbest0128) been trying to get the Segment_Mean for the copy number segments in problem 1b, but we are stuck. We have the overlaps, but we can't filter out which means are associated with the overlap values - is there anything that we should try doing?

Hi @rcsegura

Please refer to the R Markdown file from the lecture:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Lines 165 to 166 in b4828fb

    
           segs.tile2.means <- segs.gr[ind.segs.overlap.tile2]$Segment_Mean 
        
           mean(segs.tile2.means)

Let me know if you still have any questions.

Best,
Gavin

gavinha · 2019-10-24T03:27:07Z

@LABrumage Sorry about leaving out a code chunk! Please add one for 1a.

@gavinha, can you speak to this question:

for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments?

Yes, after finding the overlapping ranges for segs.gr, use the unique function, followed by the sort function. Then, print out the first 5 rows.

k8hertweck · 2019-10-24T04:10:35Z

I received the following question via email:

For question 1a, it is asking for overlaps with the regions chr8:128,746,347-128,755,810. Does this mean that for chr8:128, we are only looking for overlap at that specific base pair?

In other words, would that mean that the start and stop ranges would both be 128?

The commas are not separating numbers, but instead represent numbers higher than 128 million

k8hertweck · 2019-10-24T23:58:56Z

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

LABrumage · 2019-10-25T00:22:57Z

@gavinha is this on the right track for 1b (where object overlap1b is the GRanges object containing the overlaps I found)?
ind.segs.overlap <-subjectHits(overlap1b)
segs.gr[ind.segs.overlap]
segs.means <- segs.gr[ind.segs.overlap]$Segment_Mean
mean(segs.means)
I got 0.1747237 as the mean of the segment means

gavinha · 2019-10-25T00:23:36Z

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Line 164 in b4828fb

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

gavinha · 2019-10-25T00:27:30Z

@gavinha is this on the right track for 1b (where object overlap1b is the GRanges object containing the overlaps I found)?
ind.segs.overlap <-subjectHits(overlap1b)
segs.gr[ind.segs.overlap]
segs.means <- segs.gr[ind.segs.overlap]$Segment_Mean
mean(segs.means)
I got 0.1747237 as the mean of the segment means

Those specific lines of code look correct. Ifoverlap1b is the output of the findOverlaps function, then it is a Hits object (not GRanges) with the indices of overlap in your query and subject.

Also, make sure you are using the correct function to assign ind.segs.overlap; which should you use subjectHits or queryHits? Make sure you use the one that matches the segs.gr when you called findOverlaps

gavinha · 2019-10-25T01:03:01Z

Hi class,

In Problem 4d, if you used data.table as described in:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Lines 244 to 246 in b4828fb

    
           genoData <- data.table(do.call(cbind, geno(vcf))) 
        
           colnames(genoData) <- rownames(geno(header(vcf))) 
        
           genoData

you may notice that in the R Markdown itself shows something like this:


GT<list> | DP<list> | GQ<list> | ADALL<list> | AD<list> | IGT<list> | IPS<list> | PS<list>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]

Don't worry! Once you Knit, the output will be fine.

Hope this helps.
Gavin

k8hertweck · 2019-10-25T23:40:01Z

I've talked to a few folks who are struggling to conceptualize homework 3. If you haven't worked with genomic data and/or haven't spent a lot of time working in R, it's possible this homework will be particularly challenging. Here are a few things that may help to keep in mind:

The homework questions rely heavily on the code provided in the lecture 8 notebooks. Read the homework questions carefully, noting when there are references to specific parts of the notebooks.
Work through the code in the lecture 8 notebooks line-by-line (not just in chunks), and think about what is happening each step of the way. Make extra code comments for yourself so you understand what each line is doing.
Look up the help documentation for functions that appear to be doing "heavy lifting" in manipulating the genomic data.
Inspect the output objects for each cell, including the type of object and what information is contained within them. This will help you troubleshoot.
If any of the terms used by the help documentation or lecture 8 notebooks are unclear, reference the lecture 7 materials to help you understand the biology behind the code.

A few other more specific answers to questions I've received about genomic data/analyses:

Where did these data come from? The human genome (hg19, loaded as a part of a Bioconductor package) is used to analyze the BRCA.genome_wide_snp_6_broad_Level_3_scna.seg dataset. This is a segment data format file, which you can read about in Lecture8_GenomicData.Rmd, as well as lecture 7 (starting at line 39). For more information about the Segment_Mean, I recommend this article.
What is tiling, and why does it matter? Tiling is a method of splitting the genome into chunks so that portions of chromosomes can be assessed and summarized independently. It's a way to assess whether variation is consistent across the genome, or differs among parts.

gavinha · 2019-10-26T00:35:08Z

Thanks @k8hertweck
If you also want more information about the Copy Number Segmentation file, see the documentation here:
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/

zyaffe · 2019-10-26T22:27:36Z

What VCF file are we supposed to load for problem #4? 'GIAB_highconf_v.3.3.2.vcf.gz' from lecture 7?

k8hertweck · 2019-10-26T22:39:50Z

What VCF file are we supposed to load for problem #4? 'GIAB_highconf_v.3.3.2.vcf.gz' from lecture 7?

Yes, that's the correct file, @zyaffe

LABrumage · 2019-10-27T19:18:41Z

For Problem 4d where we have to combine the genotype info into a table, all of the necessary columns appear to be present but the entries in the table are <chr [1]>, <int[1]>, and so forth. Is this what the entries are supposed to read as? I used the vcf object where I loaded info for chromosome 8 (defined back in 4a).
Here's my code for 4d:
genoData <- data.table(do.call(cbind, geno(vcf)))
colnames(genoData) <- rownames(geno(header(vcf)))
genoData

alexgal8 · 2019-10-28T00:36:56Z

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Line 164 in b4828fb

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:
GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|

seqinfo: 24 sequences from hg19 genome

Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Amandakr713 · 2019-10-28T01:47:08Z

For 2d, I found two windows that have the highest count of deletion segments. Should we be looking for only one window as the answer?

gavinha · 2019-10-28T04:07:02Z

For Problem 4d where we have to combine the genotype info into a table, all of the necessary columns appear to be present but the entries in the table are <chr [1]>, <int[1]>, and so forth. Is this what the entries are supposed to read as? I used the vcf object where I loaded info for chromosome 8 (defined back in 4a).
Here's my code for 4d:
genoData <- data.table(do.call(cbind, geno(vcf)))
colnames(genoData) <- rownames(geno(header(vcf)))
genoData

Hi @LABrumage

Please refer to my message 3 days ago about question 4d.

gavinha · 2019-10-28T04:19:51Z

For 2d, I found two windows that have the highest count of deletion segments. Should we be looking for only one window as the answer?

Hi @Amandakr713

You can return both or just one of them.

gavinha · 2019-10-28T04:27:45Z

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Line 164 in b4828fb

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|
seqinfo: 24 sequences from hg19 genome

Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8

The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Lines 142 to 150 in fce205c

    
           ## 3.2 Finding overlapping ranges 
        
           One of the most useful features of `GenomicRanges` is to simply identify the ranges that overlap between two `GRanges` objects. The `findOverlaps` function is a basic method in the `GRanges` class for finding the indices of the elements that overlap between two `GRanges`. The argmuents `query` for your main `tiles.subset` and `subject` for the `segs.gr`. The `type` argument describes the type of overlap, such as `any`, `within`, `start`, `end`, `equal`, and there are additional arguments for criteria for overlap such as `minoverlap` size. 
        
           In this example, we will find which copy number alteration segments from `segs.gr` overlap in *any* way with our ranges in `tiles.subset` (`17:35000000-37000000`). For the criteria of any overlap, we set `type = "any"`. 
        
           ```{r} 
        
           hits1 <- findOverlaps(query = tiles.subset, subject = segs.gr, type = "any") 
        
           class(hits1) 
        
           hits1 
        
           ```

Amandakr713 · 2019-10-28T04:29:31Z

In 4e, where do the DNAString and DNAStringSet objects exist?

gavinha · 2019-10-28T04:34:39Z

In 4e, where do the DNAString and DNAStringSet objects exist?

Hi @Amandakr713

In the lecture notes:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Lines 220 to 223 in fce205c

    
           The `rowRanges` function will return a `GRanges` object containing the coordinates, REF/ALT bases, quality, and filtering status of the variants. 
        
           ```{r} 
        
           rowRanges(vcf) 
        
           ```

Find the column referring to the reference base REF and alternate base ALT in this object. To access the columns, use rowRanges(vcf)$REF.

alexgal8 · 2019-10-28T05:09:12Z

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Line 164 in b4828fb

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|
seqinfo: 24 sequences from hg19 genome
Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8

The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Lines 142 to 150 in fce205c

## 3.2 Finding overlapping ranges

One of the most useful features of `GenomicRanges` is to simply identify the ranges that overlap between two `GRanges` objects. The `findOverlaps` function is a basic method in the `GRanges` class for finding the indices of the elements that overlap between two `GRanges`. The argmuents `query` for your main `tiles.subset` and `subject` for the `segs.gr`. The `type` argument describes the type of overlap, such as `any`, `within`, `start`, `end`, `equal`, and there are additional arguments for criteria for overlap such as `minoverlap` size.

In this example, we will find which copy number alteration segments from `segs.gr` overlap in *any* way with our ranges in `tiles.subset` (`17:35000000-37000000`). For the criteria of any overlap, we set `type = "any"`.

```{r}

hits1 <- findOverlaps(query = tiles.subset, subject = segs.gr, type = "any")

class(hits1)

hits1

```

Yes, I've got the overlaps but now I'm trying to query the correct rows from the original 'segs.gr'.

Amandakr713 · 2019-10-28T05:26:40Z

Thank you @gavinha !
For 4f, some of the REF and ALT columns contain multiple bases (ex: GCA). Did I execute the code incorrectly? I thought we should only expect single bases in these columns.

gavinha · 2019-10-28T06:13:56Z

Thank you @gavinha !
For 4f, some of the REF and ALT columns contain multiple bases (ex: GCA). Did I execute the code incorrectly? I thought we should only expect single bases in these columns.

Hi @Amandakr713
Some of the alleles do contain multiple bases representing short, multi-base INDELs.

Thank you for bringing up this concern. I just realized that this question is an older version that is much more challenging. I had come up with a slightly different version but did not manage to upload the changes. I will talk to Kate about this but likely, all of 4f will be a question only for bonus marks.

For your reference, the more straight-forward version of the question is simply a change in the region of interest to chr8:129127000-129128000. This region only contains 5 SNPs and the alleles are single bases. Manually writing out the answer would be much easier, as you can tell.

Reporting the answer with these 5 SNPs in chr8:129127000-129128000 will also get full marks.

I not sure if other students actually come to GitHub to read these comments so Kate will send an announcement about this.

Thanks for bringing this to my attention and sorry for the inconvenience.

alexgal8 · 2019-10-28T06:24:55Z

@gavinha In regard to problem 2 and finding copy number alterations, can you direct me to the part in the lecture that addresses this? I remember you going over it but now I can't find the section in the lecture materials. Thanks!

gavinha · 2019-10-28T06:25:49Z

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Line 164 in b4828fb

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|
seqinfo: 24 sequences from hg19 genome
Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8
The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Lines 142 to 150 in fce205c

## 3.2 Finding overlapping ranges

One of the most useful features of `GenomicRanges` is to simply identify the ranges that overlap between two `GRanges` objects. The `findOverlaps` function is a basic method in the `GRanges` class for finding the indices of the elements that overlap between two `GRanges`. The argmuents `query` for your main `tiles.subset` and `subject` for the `segs.gr`. The `type` argument describes the type of overlap, such as `any`, `within`, `start`, `end`, `equal`, and there are additional arguments for criteria for overlap such as `minoverlap` size.

In this example, we will find which copy number alteration segments from `segs.gr` overlap in *any* way with our ranges in `tiles.subset` (`17:35000000-37000000`). For the criteria of any overlap, we set `type = "any"`.

```{r}

hits1 <- findOverlaps(query = tiles.subset, subject = segs.gr, type = "any")

class(hits1)

hits1

```

Yes, I've got the overlaps but now I'm trying to query the correct rows from the original 'segs.gr'.

Assuming you used something like hits <- findOverlaps(query = ..., subjects = ...), then you can find the overlapping indices for segs.gr using either queryHits(hits) or subjectHits(hits) depending on whether segs.gr is the query or the subject in the findOverlaps command.

gavinha · 2019-10-28T06:30:15Z

@gavinha In regard to problem 2 and finding copy number alterations, can you direct me to the part in the lecture that addresses this? I remember you going over it but now I can't find the section in the lecture materials. Thanks!

@alexgal8
Please refer to Section 3 of the lecture notes:

tfcb_2019/lectures/lecture08/Lecture8_GenomicData.Rmd

Line 121 in fce205c

# 3. Operations and features of GenomicRanges

You need to make use of basic operations/functionality of Granges objects to answer the questions. There is nothing specific about copy number alterations for Problem 2. In fact, you are still using the same copy number alteration segments from Problem 1.

k8hertweck · 2019-10-28T17:08:39Z

Please see this corrected language for question 4f:

f. Find the phased SNPs for both haplotypes within the region chr8: 129,127,000-129,128,000. (4 points)

Given the lateness of this change, and difficulty in communicating updates, all answers to question 4f will be counted as extra credit, and all reasonable genomic regions will be acceptable for receiving at least partial extra credit for this question.

zyaffe · 2019-10-28T21:27:19Z

For 2c-e, are we using the same tiles in chromosome X that we made in 2b or should we be using separate tiling over chromosomes 1-22 + X instead?

k8hertweck · 2019-10-28T21:45:59Z

@zyaffe The question intended for all chromosomes, but we'll accept either interpretation.

alexgal8 · 2019-10-28T23:05:03Z

For 2d, I have the list of the overlaps and segment means but I am stuck on what to do next. How do I group ranges by tile? How to I sort by < -0.3? (I am still trying to learn R basics and am struggling to find everything I need to include with sort() to get it to work with these data).

gavinha · 2019-10-28T23:35:15Z

For 2d, I have the list of the overlaps and segment means but I am stuck on what to do next. How do I group ranges by tile? How to I sort by < -0.3? (I am still trying to learn R basics and am struggling to find everything I need to include with sort() to get it to work with these data).

During the lecture, I showed an example (not in the notes) on how to select segments from a GRanges object based on a condition applied to the Segment_Mean. Specifically, I showed how you would find deletions. To get you started:

## returns indices (i.e. row numbers) in segs.gr which have Segment_Mean less than -0.3 
ind <- which(segs.gr$Segment_Mean < -0.3)   
## look at the rows in segs.gr that have Segment_Mean < -0.3
segs.gr[ind]

Once you have this, then perform the overlap.

I am at Fred Hutch today until 5pm and available for office hours tomorrow 9:30-11am. My office is M1-B869 in the Arnold Building.

yliu234 · 2019-10-29T04:09:10Z

Hi @gavinha and @k8hertweck, for the bonus question (2 points): provide code to determine the two haplotype sequences, can you elaborate on what the question is looking for? How does it differ from question 4f?

gavinha · 2019-10-29T17:44:22Z

Hi @yliu234

The question asks you to print out the sequence of alleles for each of the two haplotypes. You can simply read it off the screen and type it out manually.
You can also write code to extract the alleles automatically from the VCF object; this code is how you would generalize to analyzing any region in the genome.

Hope this makes sense.
Gavin

k8hertweck added the homework questions about homework label Oct 31, 2019

Questions for homework 3 #21

Questions for homework 3 #21

Comments

k8hertweck commented Oct 22, 2019

LABrumage commented Oct 23, 2019

k8hertweck commented Oct 23, 2019

rcsegura commented Oct 24, 2019

gavinha commented Oct 24, 2019

gavinha commented Oct 24, 2019 • edited Loading

k8hertweck commented Oct 24, 2019

k8hertweck commented Oct 24, 2019

LABrumage commented Oct 25, 2019

gavinha commented Oct 25, 2019

gavinha commented Oct 25, 2019 • edited Loading

gavinha commented Oct 25, 2019 • edited Loading

k8hertweck commented Oct 25, 2019

gavinha commented Oct 26, 2019

zyaffe commented Oct 26, 2019

k8hertweck commented Oct 26, 2019

LABrumage commented Oct 27, 2019 • edited Loading

alexgal8 commented Oct 28, 2019 • edited Loading

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results: GRanges object with 0 ranges and 3 metadata columns: seqnames ranges strand | Sample Num_Probes Segment_Mean |

Amandakr713 commented Oct 28, 2019

gavinha commented Oct 28, 2019

gavinha commented Oct 28, 2019

gavinha commented Oct 28, 2019

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

Amandakr713 commented Oct 28, 2019

gavinha commented Oct 28, 2019

alexgal8 commented Oct 28, 2019

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

Amandakr713 commented Oct 28, 2019

gavinha commented Oct 28, 2019

alexgal8 commented Oct 28, 2019

gavinha commented Oct 28, 2019

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

gavinha commented Oct 28, 2019

k8hertweck commented Oct 28, 2019 • edited Loading

zyaffe commented Oct 28, 2019

k8hertweck commented Oct 28, 2019

alexgal8 commented Oct 28, 2019

gavinha commented Oct 28, 2019

yliu234 commented Oct 29, 2019

gavinha commented Oct 29, 2019

gavinha commented Oct 24, 2019 •

edited

Loading

gavinha commented Oct 25, 2019 •

edited

Loading

gavinha commented Oct 25, 2019 •

edited

Loading

LABrumage commented Oct 27, 2019 •

edited

Loading

alexgal8 commented Oct 28, 2019 •

edited

Loading

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:
GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|

k8hertweck commented Oct 28, 2019 •

edited

Loading