-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emtpy .mtx #10
Comments
Hi Chris The weights.tsv file columns are: the allele name, weight of the allele in the expectation maximization (EM) algorithm, and the number of reads explained by that allele. By default it reports the top 10 alleles for each gene. You should also see a pairs file, where the first two columns are the allele name and the third column is the number of reads explained by this pair of alleles - this is relevant when the individual is heterozygous. By default it reports the top 5 pairs for each gene. The genotyping algorithm assumes the individual is diploid and will not work properly if there are cells from more than one individual in the dataset. Please let me know if you have further questions |
Hi Charlotte (@cdarby), Thanks for the quick response. I agree that I don't necessarily need the .mtx file for my purposes, I was just curious if this could suggest that the pipeline is not working properly. For the record, here's the output log: Done separate de Bruijn graph construction Done separate de Bruijn graph construction Done separate de Bruijn graph construction On a semi-related note, are there any 'red-flags' that you could look for in the outputs that would suggest the HLA typing was inaccurate? I do not have HLA-type ground-truth for these samples and I am basically trying to determine whether cells with the predicted the HLA-type would engage in an allogeneic response. For example, when I look at the 'pairs' file from one run, I see that the top 5 pairs explain similar proportions of the total reads (e.g., +/- 0.2%): Given this result, how confident would you be in claiming that these cells are heterozygous at HLA-A with the 02:52 and 24:234 alleles? Chris |
Follow-up question: I understand that the algorithm won't work properly if more than one donor is present in the dataset. However, is this also true if you use donor-specific cell ID lists for the input? For example, I assigned cells into donor groups using an in silico genotyping approach and then processed the same bam file using these different cell ID sets -- this resulted in the same set of HLA alleles being identified for the two unrelated donors. If there's no way to handle this within the algorithm itself, what are your thoughts on parsing the FASTQs according to the cell ID lists before generating BAM files which can then be used for scHLAcount? |
Hi Chris Quick note - the genotyping algorithm is still a work-in-progress, although none of the team are actively developing it right now. Limited testing was performed on samples with known HLA type. You may consider running other tools designed for genotyping from other sequencing types (e.g. bulk RNA-seq) on your reads to see if they agree with the scHLAcount results. For the HLA-A results you show, I would say there is not enough evidence to prefer one of the two-field resolution types over the other because the top pairs are nearly exactly tied. It does seem to give evidence for the genotype at one-field resolution of A02/A24 though. In terms of giving you confidence that the scHLAcount genotype agrees with the genotype that would be reported by HLA typing assays, we just didn't get to that point with validating the performance of the genotyping algorithm. If I were to brainstorm how you might validate the genotypes from scHLAcount or another software, perhaps you could get the coding sequences of the top alleles from the IMGT/HLA database, remap the HLA-A reads to the allele sequences using Bowtie2/BWA-MEM for example, and explore whether there are fewer mismatches in the alignments to the personalized alleles compared to the STAR alignments (from Cell Ranger) to the reference genome. I see you're using 3'GEX data. Anecdotally, we noticed that genotyping from 3'GEX was less reliable than from 5'GEX, possibly due to lower coverage in the typing exons (of HLA class I). We did not test any datasets using V1 chemistry, so I can't speak to any considerations for that dataset. It looks like the command line output you posted only completed the database graph construction and did not report analysis of any reads, so I am surprised that genotypes were reported. (e.g. there should be a log entry that says "Processed X reads, Y pseudoaligned to HLA reference".) Did it crash? I checked the code and it appears that scHLAcount only ignores reads with barcodes that are not in the list you provide for the molecule counting step and uses all reads for the genotyping step. This would explain why you see the same results when providing different barcode files. To manually separate the files, given your lists of cell barcodes, I suppose you could write a script to scan the "CB" tags in the BAM files and match them with the ones on the lists. Charlotte |
Hi @cdarby
Thanks for making this tool! I'm having some trouble interpreting the outputs, and I'm hoping you can help.
So I've run the pipeline twice now on two different 3' scRNA-seq PBMC datasets -- one is the Frozen Donor B data from Zheng et al (2017, Nat Comm - V1 chemistry), and another is a dataset of 4 unrelated PBMC donors that I generated (V2 chemistry). It may be worth noting that I can split the cellID list by donor if necessary, but I ran it as a batch here.
The pipeline completes successfully, and I see distinct allele labels in the labels.tsv file for each output. However, I have no counts in the .mtx file for either dataset, and I don't really know what to make of the weights.tsv output.
If the .mtx matrix is 'empty', does this mean that scHLAcount is not accurately characterizing the data? My end goal for this analysis is to get an HLA-type for each donor (as you would normally do with standard HLA typing assays).
Chris McGinnis
Here's my script, for reference:
./sc_hla_count --bam /Users/gartnerlab/Desktop/scHLAcount/possorted_genome_bam.bam --cell-barcodes /Users/gartnerlab/Desktop/SLNR_PBMC/REVIEW/cellIDs_B.txt -d /Users/gartnerlab/Desktop/scHLAcount/IMGT/
./sc_hla_count --bam /Volumes/MULTIseq_1/SLNR_PBMC/crRNA/SLNR_L2/outs/possorted_genome_bam.bam --cell-barcodes /Users/gartnerlab/Desktop/SLNR_PBMC/REVIEW/cellIDs_ficoll.txt -d /Users/gartnerlab/Desktop/scHLAcount/IMGT/
The text was updated successfully, but these errors were encountered: