Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions for homework 3 #21

Open
k8hertweck opened this issue Oct 22, 2019 · 36 comments
Open

Questions for homework 3 #21

k8hertweck opened this issue Oct 22, 2019 · 36 comments
Labels
homework questions about homework

Comments

@k8hertweck
Copy link
Collaborator

Include questions about homework 3 here! @gavinha and I will be happy to help answer them.

@LABrumage
Copy link

Is the code for 1a and 1b supposed to go all in the same code block below 1b? Also, for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments? Thank you!

@k8hertweck
Copy link
Collaborator Author

@LABrumage Sorry about leaving out a code chunk! Please add one for 1a.

@gavinha, can you speak to this question:

for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments?

@rcsegura
Copy link

We've (@LABrumage + @sbest0128) been trying to get the Segment_Mean for the copy number segments in problem 1b, but we are stuck. We have the overlaps, but we can't filter out which means are associated with the overlap values - is there anything that we should try doing?

@gavinha
Copy link
Contributor

gavinha commented Oct 24, 2019

We've (@LABrumage + @sbest0128) been trying to get the Segment_Mean for the copy number segments in problem 1b, but we are stuck. We have the overlaps, but we can't filter out which means are associated with the overlap values - is there anything that we should try doing?

Hi @rcsegura

Please refer to the R Markdown file from the lecture:

segs.tile2.means <- segs.gr[ind.segs.overlap.tile2]$Segment_Mean
mean(segs.tile2.means)

Let me know if you still have any questions.

Best,
Gavin

@gavinha
Copy link
Contributor

gavinha commented Oct 24, 2019

@LABrumage Sorry about leaving out a code chunk! Please add one for 1a.

@gavinha, can you speak to this question:

for 1a printing out the first five unique TCGA IDs, is that supposed to be drawn from sorting the overlapping segments?

Yes, after finding the overlapping ranges for segs.gr, use the unique function, followed by the sort function. Then, print out the first 5 rows.

@k8hertweck
Copy link
Collaborator Author

I received the following question via email:

For question 1a, it is asking for overlaps with the regions chr8:128,746,347-128,755,810. Does this mean that for chr8:128, we are only looking for overlap at that specific base pair?

In other words, would that mean that the start and stop ranges would both be 128?

The commas are not separating numbers, but instead represent numbers higher than 128 million

@k8hertweck
Copy link
Collaborator Author

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

@LABrumage
Copy link

@gavinha is this on the right track for 1b (where object overlap1b is the GRanges object containing the overlaps I found)?
ind.segs.overlap <-subjectHits(overlap1b)
segs.gr[ind.segs.overlap]
segs.means <- segs.gr[ind.segs.overlap]$Segment_Mean
mean(segs.means)
I got 0.1747237 as the mean of the segment means

@gavinha
Copy link
Contributor

gavinha commented Oct 25, 2019

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

@gavinha
Copy link
Contributor

gavinha commented Oct 25, 2019

@gavinha is this on the right track for 1b (where object overlap1b is the GRanges object containing the overlaps I found)?
ind.segs.overlap <-subjectHits(overlap1b)
segs.gr[ind.segs.overlap]
segs.means <- segs.gr[ind.segs.overlap]$Segment_Mean
mean(segs.means)
I got 0.1747237 as the mean of the segment means

Those specific lines of code look correct. Ifoverlap1b is the output of the findOverlaps function, then it is a Hits object (not GRanges) with the indices of overlap in your query and subject.

Also, make sure you are using the correct function to assign ind.segs.overlap; which should you use subjectHits or queryHits? Make sure you use the one that matches the segs.gr when you called findOverlaps

@gavinha
Copy link
Contributor

gavinha commented Oct 25, 2019

Hi class,

In Problem 4d, if you used data.table as described in:

genoData <- data.table(do.call(cbind, geno(vcf)))
colnames(genoData) <- rownames(geno(header(vcf)))
genoData

you may notice that in the R Markdown itself shows something like this:


GT<list> | DP<list> | GQ<list> | ADALL<list> | AD<list> | IGT<list> | IPS<list> | PS<list>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]>
<chr [1]>    <int [1]>    <int [1]>    <int [2]>    <int [2]>    <chr [1]>    <chr [1]>    <chr [1]

Don't worry! Once you Knit, the output will be fine.

Hope this helps.
Gavin

@k8hertweck
Copy link
Collaborator Author

I've talked to a few folks who are struggling to conceptualize homework 3. If you haven't worked with genomic data and/or haven't spent a lot of time working in R, it's possible this homework will be particularly challenging. Here are a few things that may help to keep in mind:

  • The homework questions rely heavily on the code provided in the lecture 8 notebooks. Read the homework questions carefully, noting when there are references to specific parts of the notebooks.
  • Work through the code in the lecture 8 notebooks line-by-line (not just in chunks), and think about what is happening each step of the way. Make extra code comments for yourself so you understand what each line is doing.
  • Look up the help documentation for functions that appear to be doing "heavy lifting" in manipulating the genomic data.
  • Inspect the output objects for each cell, including the type of object and what information is contained within them. This will help you troubleshoot.
  • If any of the terms used by the help documentation or lecture 8 notebooks are unclear, reference the lecture 7 materials to help you understand the biology behind the code.

A few other more specific answers to questions I've received about genomic data/analyses:

  • Where did these data come from? The human genome (hg19, loaded as a part of a Bioconductor package) is used to analyze the BRCA.genome_wide_snp_6_broad_Level_3_scna.seg dataset. This is a segment data format file, which you can read about in Lecture8_GenomicData.Rmd, as well as lecture 7 (starting at line 39). For more information about the Segment_Mean, I recommend this article.
  • What is tiling, and why does it matter? Tiling is a method of splitting the genome into chunks so that portions of chromosomes can be assessed and summarized independently. It's a way to assess whether variation is consistent across the genome, or differs among parts.

@gavinha
Copy link
Contributor

gavinha commented Oct 26, 2019

Thanks @k8hertweck
If you also want more information about the Copy Number Segmentation file, see the documentation here:
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/

@zyaffe
Copy link

zyaffe commented Oct 26, 2019

What VCF file are we supposed to load for problem #4? 'GIAB_highconf_v.3.3.2.vcf.gz' from lecture 7?

@k8hertweck
Copy link
Collaborator Author

What VCF file are we supposed to load for problem #4? 'GIAB_highconf_v.3.3.2.vcf.gz' from lecture 7?

Yes, that's the correct file, @zyaffe

@LABrumage
Copy link

LABrumage commented Oct 27, 2019

For Problem 4d where we have to combine the genotype info into a table, all of the necessary columns appear to be present but the entries in the table are <chr [1]>, <int[1]>, and so forth. Is this what the entries are supposed to read as? I used the vcf object where I loaded info for chromosome 8 (defined back in 4a).
Here's my code for 4d:
genoData <- data.table(do.call(cbind, geno(vcf)))
colnames(genoData) <- rownames(geno(header(vcf)))
genoData

@alexgal8
Copy link

alexgal8 commented Oct 28, 2019

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:
GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|

seqinfo: 24 sequences from hg19 genome

Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

@Amandakr713
Copy link

For 2d, I found two windows that have the highest count of deletion segments. Should we be looking for only one window as the answer?

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

For Problem 4d where we have to combine the genotype info into a table, all of the necessary columns appear to be present but the entries in the table are <chr [1]>, <int[1]>, and so forth. Is this what the entries are supposed to read as? I used the vcf object where I loaded info for chromosome 8 (defined back in 4a).
Here's my code for 4d:
genoData <- data.table(do.call(cbind, geno(vcf)))
colnames(genoData) <- rownames(geno(header(vcf)))
genoData

Hi @LABrumage

Please refer to my message 3 days ago about question 4d.

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

For 2d, I found two windows that have the highest count of deletion segments. Should we be looking for only one window as the answer?

Hi @Amandakr713

You can return both or just one of them.

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|
seqinfo: 24 sequences from hg19 genome

Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8

The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

## 3.2 Finding overlapping ranges
One of the most useful features of `GenomicRanges` is to simply identify the ranges that overlap between two `GRanges` objects. The `findOverlaps` function is a basic method in the `GRanges` class for finding the indices of the elements that overlap between two `GRanges`. The argmuents `query` for your main `tiles.subset` and `subject` for the `segs.gr`. The `type` argument describes the type of overlap, such as `any`, `within`, `start`, `end`, `equal`, and there are additional arguments for criteria for overlap such as `minoverlap` size.
In this example, we will find which copy number alteration segments from `segs.gr` overlap in *any* way with our ranges in `tiles.subset` (`17:35000000-37000000`). For the criteria of any overlap, we set `type = "any"`.
```{r}
hits1 <- findOverlaps(query = tiles.subset, subject = segs.gr, type = "any")
class(hits1)
hits1
```

@Amandakr713
Copy link

In 4e, where do the DNAString and DNAStringSet objects exist?

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

In 4e, where do the DNAString and DNAStringSet objects exist?

Hi @Amandakr713

In the lecture notes:

The `rowRanges` function will return a `GRanges` object containing the coordinates, REF/ALT bases, quality, and filtering status of the variants.
```{r}
rowRanges(vcf)
```

Find the column referring to the reference base REF and alternate base ALT in this object. To access the columns, use rowRanges(vcf)$REF.

@alexgal8
Copy link

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|
seqinfo: 24 sequences from hg19 genome
Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8

The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

## 3.2 Finding overlapping ranges
One of the most useful features of `GenomicRanges` is to simply identify the ranges that overlap between two `GRanges` objects. The `findOverlaps` function is a basic method in the `GRanges` class for finding the indices of the elements that overlap between two `GRanges`. The argmuents `query` for your main `tiles.subset` and `subject` for the `segs.gr`. The `type` argument describes the type of overlap, such as `any`, `within`, `start`, `end`, `equal`, and there are additional arguments for criteria for overlap such as `minoverlap` size.
In this example, we will find which copy number alteration segments from `segs.gr` overlap in *any* way with our ranges in `tiles.subset` (`17:35000000-37000000`). For the criteria of any overlap, we set `type = "any"`.
```{r}
hits1 <- findOverlaps(query = tiles.subset, subject = segs.gr, type = "any")
class(hits1)
hits1
```

Yes, I've got the overlaps but now I'm trying to query the correct rows from the original 'segs.gr'.

@Amandakr713
Copy link

Thank you @gavinha !
For 4f, some of the REF and ALT columns contain multiple bases (ex: GCA). Did I execute the code incorrectly? I thought we should only expect single bases in these columns.

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

Thank you @gavinha !
For 4f, some of the REF and ALT columns contain multiple bases (ex: GCA). Did I execute the code incorrectly? I thought we should only expect single bases in these columns.

Hi @Amandakr713
Some of the alleles do contain multiple bases representing short, multi-base INDELs.

Thank you for bringing up this concern. I just realized that this question is an older version that is much more challenging. I had come up with a slightly different version but did not manage to upload the changes. I will talk to Kate about this but likely, all of 4f will be a question only for bonus marks.

For your reference, the more straight-forward version of the question is simply a change in the region of interest to chr8:129127000-129128000. This region only contains 5 SNPs and the alleles are single bases. Manually writing out the answer would be much easier, as you can tell.

Reporting the answer with these 5 SNPs in chr8:129127000-129128000 will also get full marks.

I not sure if other students actually come to GitHub to read these comments so Kate will send an announcement about this.

Thanks for bringing this to my attention and sorry for the inconvenience.

@alexgal8
Copy link

@gavinha In regard to problem 2 and finding copy number alterations, can you direct me to the part in the lecture that addresses this? I remember you going over it but now I can't find the section in the lecture materials. Thanks!

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

Another question for @gavinha, regarding 1a: after creating a GRanges object representing the overlapping segments, it looks like the TCGA ids aren't present any longer?

After using findOverlaps, you need to use the indices to query the correct rows from the original segs.gr. See the Lecture notes here:

segs.gr[ind.segs.overlap.tile2]

Let me know if you have any further questions.

When I run this segs.gr[ind.segs.overlap.tile2] after using find0verlaps, I get no results:

GRanges object with 0 ranges and 3 metadata columns:
seqnames ranges strand | Sample Num_Probes Segment_Mean
|
seqinfo: 24 sequences from hg19 genome
Do I need to type something in addition to segs.gr[ind.segs.overlap.tile2]? When I try to change the end to a different tile, it won't let me, the only tile that works is 2. Also, how do I identify the TCGA id? Is it labeled?

Hi @alexgal8
The variable that you use to index segs.gr (i.e. what you put between the square brackets). The indices are supposed to refer to the rows in segs.gr that overlap chr8:128746347-128755810. Please refer to the lecture notes on how to find overlaps.

## 3.2 Finding overlapping ranges
One of the most useful features of `GenomicRanges` is to simply identify the ranges that overlap between two `GRanges` objects. The `findOverlaps` function is a basic method in the `GRanges` class for finding the indices of the elements that overlap between two `GRanges`. The argmuents `query` for your main `tiles.subset` and `subject` for the `segs.gr`. The `type` argument describes the type of overlap, such as `any`, `within`, `start`, `end`, `equal`, and there are additional arguments for criteria for overlap such as `minoverlap` size.
In this example, we will find which copy number alteration segments from `segs.gr` overlap in *any* way with our ranges in `tiles.subset` (`17:35000000-37000000`). For the criteria of any overlap, we set `type = "any"`.
```{r}
hits1 <- findOverlaps(query = tiles.subset, subject = segs.gr, type = "any")
class(hits1)
hits1
```

Yes, I've got the overlaps but now I'm trying to query the correct rows from the original 'segs.gr'.

Assuming you used something like hits <- findOverlaps(query = ..., subjects = ...), then you can find the overlapping indices for segs.gr using either queryHits(hits) or subjectHits(hits) depending on whether segs.gr is the query or the subject in the findOverlaps command.

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

@gavinha In regard to problem 2 and finding copy number alterations, can you direct me to the part in the lecture that addresses this? I remember you going over it but now I can't find the section in the lecture materials. Thanks!

@alexgal8
Please refer to Section 3 of the lecture notes:

# 3. Operations and features of GenomicRanges

You need to make use of basic operations/functionality of Granges objects to answer the questions. There is nothing specific about copy number alterations for Problem 2. In fact, you are still using the same copy number alteration segments from Problem 1.

@k8hertweck
Copy link
Collaborator Author

k8hertweck commented Oct 28, 2019

Please see this corrected language for question 4f:

f. Find the phased SNPs for both haplotypes within the region chr8: 129,127,000-129,128,000. (4 points)

Given the lateness of this change, and difficulty in communicating updates, all answers to question 4f will be counted as extra credit, and all reasonable genomic regions will be acceptable for receiving at least partial extra credit for this question.

@zyaffe
Copy link

zyaffe commented Oct 28, 2019

For 2c-e, are we using the same tiles in chromosome X that we made in 2b or should we be using separate tiling over chromosomes 1-22 + X instead?

@k8hertweck
Copy link
Collaborator Author

@zyaffe The question intended for all chromosomes, but we'll accept either interpretation.

@alexgal8
Copy link

For 2d, I have the list of the overlaps and segment means but I am stuck on what to do next. How do I group ranges by tile? How to I sort by < -0.3? (I am still trying to learn R basics and am struggling to find everything I need to include with sort() to get it to work with these data).

@gavinha
Copy link
Contributor

gavinha commented Oct 28, 2019

For 2d, I have the list of the overlaps and segment means but I am stuck on what to do next. How do I group ranges by tile? How to I sort by < -0.3? (I am still trying to learn R basics and am struggling to find everything I need to include with sort() to get it to work with these data).

During the lecture, I showed an example (not in the notes) on how to select segments from a GRanges object based on a condition applied to the Segment_Mean. Specifically, I showed how you would find deletions. To get you started:

## returns indices (i.e. row numbers) in segs.gr which have Segment_Mean less than -0.3 
ind <- which(segs.gr$Segment_Mean < -0.3)   
## look at the rows in segs.gr that have Segment_Mean < -0.3
segs.gr[ind] 

Once you have this, then perform the overlap.

I am at Fred Hutch today until 5pm and available for office hours tomorrow 9:30-11am. My office is M1-B869 in the Arnold Building.

@yliu234
Copy link

yliu234 commented Oct 29, 2019

Hi @gavinha and @k8hertweck, for the bonus question (2 points): provide code to determine the two haplotype sequences, can you elaborate on what the question is looking for? How does it differ from question 4f?

@gavinha
Copy link
Contributor

gavinha commented Oct 29, 2019

Hi @yliu234

The question asks you to print out the sequence of alleles for each of the two haplotypes. You can simply read it off the screen and type it out manually.
You can also write code to extract the alleles automatically from the VCF object; this code is how you would generalize to analyzing any region in the genome.

Hope this makes sense.
Gavin

@k8hertweck k8hertweck added the homework questions about homework label Oct 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
homework questions about homework
Projects
None yet
Development

No branches or pull requests

8 participants