Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gene_Centic_Coding Unable to Analyze Gene #9

Closed
kwdoyle opened this issue Oct 24, 2023 · 10 comments
Closed

Gene_Centic_Coding Unable to Analyze Gene #9

kwdoyle opened this issue Oct 24, 2023 · 10 comments

Comments

@kwdoyle
Copy link

kwdoyle commented Oct 24, 2023

Hello,

While running theGene_Centic_Coding function, I noticed a strange issue while processing through a list of genes for a specific chromosome.

On any given gene, the function seems to work properly until the internal coding function attempts to run the STAAR function:

try(pvalues <- STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, 
    rare_maf_cutoff = rare_maf_cutoff, rv_num_cutoff = rv_num_cutoff), 
    silent = silent)

I am receiving the following error, and thus no results from the current gene:

Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  :     Dimensions don't match for genotype and annotation!

This error occurs virtually for all genes. Looking into this, it appears the issue is how the annotation data is subset for the final list of variants that are lof in plof:

Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[lof.in.plof, ]

When I run this, lof.in.plof is a vector of NAs, TRUEs, and FALSEs, with the number of TRUEs corresponding to the final filtered number of variants to use (in my case, 5). When the annotation data in Anno.Int.PHRED.sub is subset using this vector, however, the final dimensions of the table still contain the number of rows that correspond to the previous number of variants (which, in my case, was 129).

The Geno matrix has the dimensions [n samples x 5 variants]. When Anno.Int.PHRED.sub.category is passed to the STAAR function, however, its dimensions are still [n samples x 129 variants], causing the error.

If I wrap the which function around lof.in.plof, the dimensions of the resulting table are [n samples x 5] and STAAR is able to run properly and gives no error:

Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[which(lof.in.plof),]

I assume this fix makes sense and there shouldn't be a reason Anno.Int.PHRED.sub.category should still contain rows with NA data..? The final dimensions of this annotation table should indeed match that of the genotype matrix, no?

@xihaoli
Copy link
Owner

xihaoli commented Oct 24, 2023

Hi @kwdoyle,

Thanks for your questions. I haven't encountered this issue before, since I don't think lof.in.plof should contain NA for any variant based on its definition. If you want to delve deeper, I would recommend you re-annotate your genotype data using the FAVOR Essential Database and see if this persists.

Best,
Xihao

@xihaoli
Copy link
Owner

xihaoli commented Oct 24, 2023

On another note, what are those variants with lof.in.plof being NA in your dataset? Are these SNVs or indels? Feel free to send an example or two for such variants.

Thank you!

@kwdoyle
Copy link
Author

kwdoyle commented Oct 26, 2023

So all variants do indeed have a value for lof.in.coding, which is reassuring. It seems like the NA issue in lof.in.plof is because not all variants have a value for MetaSVM_pred, which is used in its creation. A few, for example, would be 5-1253777-G,T 5-1253795-G,A 5-1253804-G,A. For chromosome 5, at least, all are SNVs.

@kwdoyle
Copy link
Author

kwdoyle commented Oct 26, 2023

I think this is another data-loading issue. MetaSVM should be either "T" or "D", which it is in the merged csv, yet the values are "TRUE" and "" within the annotated GDS. I'll have to check why this is happening.

@xihaoli
Copy link
Owner

xihaoli commented Oct 26, 2023

Hi @kwdoyle,

Thanks so much for your input. The issue might still be related to auto-assign the column classes. Please feel free to send me an email and if you'd like we can do a quick call to get it right.

Best,
Xihao

@xihaoli
Copy link
Owner

xihaoli commented Oct 26, 2023

In brief, if you annotate your genotype data using the FAVOR Essential Database, this issue should not persist.

@kwdoyle
Copy link
Author

kwdoyle commented Oct 26, 2023

Yes, so this issue was due to read_csv in gds2agds.r assigning the wrong column class to MetaSVM_pred. The "T" or "D" values were read in as logicals, converting any "T"s to TRUEs and the "D"s to NAs. This incorrect data was then used to annotate the GDS.

Since I chose to add all 160 annotations from the FAVOR database, it would have been inconvenient to assign the column classes for each one within read_csv. I instead used data.table::fread to read in the data, which correctly assigned all column classes.

For reference, I checked which columns were read in differently between read_csv and fread:

image

Anything read in by read_csv as logical would potentially be a problem, and fread does correctly read those in instead.

@xihaoli
Copy link
Owner

xihaoli commented Oct 26, 2023

Thank you @kwdoyle. This is very helpful!

If you would like to contribute some documents/scripts that you use, please let me know. I can add you as a collaborator of the STAARpipeline-Tutorial repo so that you can contribute to this section.

Best,
Xihao

@kwdoyle
Copy link
Author

kwdoyle commented Oct 27, 2023

That would be great, as I've been making some modifications to these scripts to be generally applicable to other scenarios. Mainly, being independent from the Harvard cluster job IDs used to select the current chromosome to analyze.

@xihaoli
Copy link
Owner

xihaoli commented Oct 27, 2023

Sounds perfect, thank you @kwdoyle! I've invited you to be part of the STAARpipeline-Tutorial repo. Look forward to your contributions!

p.s. I'll close this issue and the other issue in the STAARpipeline-Tutorial repo.

Best,
Xihao

@xihaoli xihaoli closed this as completed Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants