Manual Questions
FASTQ files are text files used to store sequence data (usually nucleotide or amino acid sequences) together with their quality scores. Every record consists of 4 lines (a minimal parsing sketch follows the list):
- A line starting with @ followed by the sequence identifier. Sometimes the identifier is followed by a description of the sequence
- A line containing the biological sequence itself
- A separator line starting with +, optionally followed by the same identifier
- A line storing the quality scores associated with the nucleotides/amino acids stored in line 2.
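As an illustration, here is a minimal FASTQ parsing sketch in Python. It assumes every record spans exactly four unwrapped lines, and the file name `reads.fastq` is hypothetical:

```python
# Minimal FASTQ reader: yields (identifier, sequence, quality) tuples.
# Assumes each record spans exactly 4 lines (no wrapped sequences).
def parse_fastq(path):
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip()
            separator = handle.readline().rstrip()
            quality = handle.readline().rstrip()
            assert header.startswith("@") and separator.startswith("+")
            yield header[1:], sequence, quality

for identifier, sequence, quality in parse_fastq("reads.fastq"):
    print(identifier, len(sequence))
```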
Quality scores (Phred) can be stored using several different encodings. Some of the most popular are Sanger and Solexa/Illumina, and there are 3 different versions of the Illumina encoding. All encoding methods use ASCII symbols to encode an integer score; the boundaries of this score vary depending on the encoding standard.
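A sketch of how the decoding works: the score is the character's ASCII code minus an encoding-specific offset (33 for Sanger/Illumina 1.8+, 64 for the older Illumina encodings). The quality strings below are made-up examples:

```python
# Phred decoding: score = ord(char) - offset.
# Sanger/Illumina 1.8+ uses offset 33; Illumina 1.3-1.7 used offset 64.
def decode_quality(qual_string, offset=33):
    return [ord(char) - offset for char in qual_string]

print(decode_quality("IIIIHHGF"))      # Sanger: [40, 40, 40, 40, 39, 39, 38, 37]
print(decode_quality("hhhhggfe", 64))  # old Illumina: same scores
```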
Paired end reads can be identified in different ways. In files containing Illumina reads, forward reads are identified by appending /1 at the end of the read identifier. Reverse reads are instead marked by /2.
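For example, a small sketch that strips the /1 or /2 suffix to recover the fragment identifier shared by both mates (the read names are hypothetical):

```python
# Mates of the same fragment share an identifier, differing only in /1 vs /2.
def fragment_id(read_id):
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id

print(fragment_id("SRR001666.1/1"))  # -> SRR001666.1 (forward read)
print(fragment_id("SRR001666.1/2"))  # -> SRR001666.1 (reverse read)
```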
In terms of Phred quality scores, the datasets at my disposal were OK. That being said, as already mentioned in the analysis journal, both the DNA and RNA FASTQ files were affected by contamination problems. Furthermore, the DNA FASTQ files contained many very short (< 200 bp) reads.
What can generate the issues you observe in your data? Can these cause any problems during subsequent analyses?
Contamination issues can be caused by human errors that can occur every time the samples are manipulated. If the contamination is significant enough, then yes, it can have a negative effect on the subsequent analyses.
For DNA data, most of the raw reads were discarded by the correction and trimming procedure.
- # of raw reads: 215,276
- # of reads after trimming: 8,804
For RNA data the fraction of reads discarded by the trimming procedure varied across samples. For Continuous samples around 10% of the reads were discarded, while for Batch samples the fraction was on the order of 40%.
If the number of surviving reads is too low, this can compromise the subsequent analysis steps. For instance, if the coverage of DNA reads drops below 20x, my genome assembly might become too fragmented. For RNA data, if the coverage is too low, there is a risk that my transcriptome assembly will miss lowly expressed transcripts.
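As a reminder of the arithmetic behind such thresholds, expected coverage is the total number of sequenced bases divided by the genome size; the read length and genome size below are hypothetical:

```python
# Expected coverage = (number of reads * mean read length) / genome size.
num_reads = 8_804          # surviving reads (from the counts above)
mean_read_length = 10_000  # hypothetical long-read length
genome_size = 2_200_000    # hypothetical genome size
print(num_reads * mean_read_length / genome_size)  # ~40x
```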
The quality improved after the trimming procedure.
I did not use Trimmomatic for read trimming; however, the LEADING parameter sets a quality threshold below which bases are cut off from the start of the reads, while TRAILING does the same for bases at the end of a read. The SLIDINGWINDOW parameter indicates the size of the window and the minimum average quality that should be used when performing the trimming. These settings can be used to alter the stringency of the trimming: increasing the window size and/or lowering the average quality threshold makes the trimming less aggressive, while decreasing the window size and/or raising the threshold makes it more aggressive.
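A simplified sketch of the sliding-window idea (not Trimmomatic's actual implementation): scan the read and cut it at the first window whose average quality falls below the threshold:

```python
# Simplified SLIDINGWINDOW-style trimming: cut the read at the first window
# whose average quality drops below min_avg_quality.
def sliding_window_trim(sequence, qualities, window_size=4, min_avg_quality=15):
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            return sequence[:start], qualities[:start]
    return sequence, qualities

seq, quals = sliding_window_trim("ATGGCGTA", [40, 40, 38, 30, 12, 10, 8, 5])
print(seq)  # "ATGG" -- trimmed where the quality degrades
```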
I can get information about the read length distribution, as well as about the overlap coverage and the number of contigs.
Ideally I expect only one (2 if we count the provirus), but I will probably get a few more. In the end I only obtained 3 contigs (and none of them was the provirus sequence).
Unitigs can be seen as contigs with a higher degree of confidence, as they do not contain sequences with conflicting information. Contigs are often made up of one or more unitigs.
A scaffold is a set of ordered and oriented contigs. Scaffolds might contain gaps, while contigs usually do not (and if they do, the gaps are usually quite small).
What are the kmers? What kmer(s) should you use? What are the problems and benefits of choosing a small kmer? And a big kmer?
The k-mers of a given sequence are all the words (subsequences) of length k contained in the original sequence (see the extraction sketch after this list). Benefits of small k-mers:
- decrease the amount of space necessary to store the original sequence
- increase the chance for the k-mers to overlap, and thus also the chance of constructing a complete de Bruijn graph
Benefits of long k-mers:
- Easier to reconstruct the genome, as there are fewer paths to traverse in the de Bruijn graph
- Allow reconstructing regions rich in short repeats
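A minimal k-mer extraction sketch (the sequence is a made-up example):

```python
# All k-mers of length k in a sequence: one window per starting position.
def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequence = "ATGGCGTGCA"
print(kmers(sequence, 3))  # small k: many short, heavily overlapping words
print(kmers(sequence, 7))  # large k: fewer, more specific words
```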
Some assemblers can include a read-correction step before doing the assembly. What is this step doing?
This step tries to generate sequences with a lower error rate by using the consensus information from multiple overlapping reads.
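A toy illustration of the consensus idea (real correctors typically work on k-mer spectra rather than pre-aligned reads):

```python
from collections import Counter

# Majority vote per column over reads assumed to be already aligned.
def consensus(aligned_reads):
    length = min(len(read) for read in aligned_reads)
    return "".join(
        Counter(read[i] for read in aligned_reads).most_common(1)[0][0]
        for i in range(length)
    )

print(consensus(["ATGGC", "ATGAC", "ATGGC"]))  # -> "ATGGC" (the A is outvoted)
```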
Not really, but I am aware that FASTA files can also contain other letters. For instance, N can be used to indicate any nucleotide. The following is the list of letters allowed in nucleotide FASTA files: A, C, G, T, U, R, Y, K, M, S, W, B, D, H, V, N, -.
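A small validation sketch against that alphabet:

```python
# IUPAC nucleotide codes plus the gap character, as listed above.
IUPAC_ALPHABET = set("ACGTURYKMSWBDHVN-")

def is_valid_fasta_sequence(sequence):
    return set(sequence.upper()) <= IUPAC_ALPHABET

print(is_valid_fasta_sequence("ACGTN-"))  # True
print(is_valid_fasta_sequence("ACGTZ"))   # False: Z is not an allowed code
```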
What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...)?
N50, N90, etc. are statistics that describe the quality of a genome assembly in terms of sequence length and contiguity: N50 is the length of the shortest contig in the set of longest contigs that together cover at least 50% of the total assembly length (N90 is defined analogously with 90%). I don't believe there is one single measurement capable of evaluating all the different aspects that affect assembly quality. That being said, generally speaking I'd say that assembly coverage, N50, BUSCO score, number of misassemblies and number of ORFs, when used together, are sufficient to accurately describe the quality of an assembly.
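A minimal N50 computation sketch (the contig lengths are made up):

```python
# N50: sort contig lengths in descending order and accumulate until at least
# half of the total assembly length is covered; the current length is N50.
def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

print(n50([100, 200, 300, 400, 500]))  # total 1500; 500+400 >= 750 -> 400
```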
I think my assembly is better, as I removed the overlapping region at the 3' end, while the assembly from the paper includes this overlap.
See question above.
Prokka predicts all sorts of features: coding regions (CDS), rRNA, tRNA, ncRNA, CRISPR regions, etc. Generally speaking, I believe that rRNA and tRNA are the two types of features that are easiest to detect reliably.
How many features of each kind are detected in your contigs? Do you detect the same number of features than the authors? How do they differ?
See the journal for details. My annotation has slightly more genes than the one from the paper, while for rRNA and tRNA the numbers are the same.
Mainly because of transcript splicing, which can produce many different isoforms of the same transcript. For this reason, to produce a high-quality annotation it's usually a good idea to also sequence the transcriptome.
How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?
I don't know the exact number, but I'd say roughly 1/4 of the genes. To solve this issue I would use the hypothetical protein sequence to search protein databases (such as UniProt) using BLASTP. If sequence-based approaches do not work, then I would use some of the methods based on protein structure, like homology modeling or threading.
98% for the genome assembly and 92% for the transcriptome.
What potential issues can cause mRNA reads not to map properly to genes in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?
Aside from sequencing errors, contamination could cause a low mapping rate, as could using a non-splice-aware aligner to map eukaryotic mRNAs to the genome. For prokaryotes, which lack introns, the splicing issue does not apply.