Skip to content
This repository has been archived by the owner on Mar 13, 2020. It is now read-only.

Manual Questions

Roberto Rossini edited this page May 30, 2019 · 2 revisions

Answers to suggested questions from the lab manual

Reads Quality Control

What is the structure of a FASTQ file?

FASTQ files are text files used to store sequence data (usually nucleotide or amino acid sequences) together with their quality score. Every record consists of 4 lines:

  • A line starting with @ followed by the sequence identifier. Sometime the identifier is followed by a description of the sequence
  • A line containing the biological sequence itself
  • A line starting with +
  • A line storing the quality scores associated to the nucleotides/amino acids stored in line 2.

How is the quality of the data stored in the FASTQ files? How are paired reads identified?

Quality scores (Phred) can be stored using several different encodings. There are several different encoding. Some of the most popular are Sanger and Solexa/Illumina. There are 3 different versions of the Illumina encoding. All encoding methods use ASCII symbols to encode an integer score. The boundaries of this score varies depending on the encoding standard.

Paired end reads can be identified in different ways. In files containing Illumina reads, forward reads are identified by appending /1 at the end of the read identifier. Reverse reads are instead marked by /2.

How is the quality of your data?

The quality in terms of Phred scores, the datasets at my disposal were OK. That being said, as already mentioned in the analysis journal, both DNA and RNA FASTQ files were affected by contamination problems. Furthermore DNA FASTQ files had many very short (< 200bp) reads.

What can generate the issues you observe in your data? Can these cause any problems during subsequent analyses?

Contamination issues can be cause by human errors that can occur every time the samples are manipulated. If the contamination is significant enough, then yes, it can have a negative effect on the subsequent analyses

Reads Preprocessing

How many reads have been discarded after trimming?

For DNA data, most of the raw reads were discarded by the correction and trimming procedure.

  • # of raw reads: 215.276
  • # of reads after trimming: 8.804

For RNA data the fraction of reads that were discarded by the trimming procedure varied across samples. For Continuous samples around 10% of the reads were discarded, while for Batch samples, the fraction was in the order of 40%.

How can this affect your future analyses and results?

If the number of surviving reads is to low, this can compromise the subsequent analyses steps. For instance if the coverage of DNA reads drops below 20x, my genome assembly might become too fragmented. For RNA data, if the coverage is too low, there is a risk that my transcriptome assembly is missing lowly expressed transcripts

How is the quality of your data after trimming?

The quality was improved after the trimming procedure

What do the LEADING, TRAILING and SLIDINGWINDOW options do?

I did not use Trimmomatic for read trimming, however the LEADING parameter indicates how many bases should be cut off at the start of the reads, while TRAILING defines the number of bases that should be dropped from the end of a read. The SLIDINGWINDOW parameter indicates the size of the window and the average quality that should be used when performing the trimming. This settings can be used to alter the sensibility of the trimming by increasing the window size and/or lowering the average score. Vice versa by decreasing the window size and/or raising the average score threshold it is possible to increase the sensibility of the trimming software.

Genome assembly

What information can you get from the plots and reports given by the assembler (if you get any)?

I can get information about read length distribution. Information about overlap coverage and number of contigs.

How many contigs do you expect? How many do you obtain?

Ideally I expect only one (2 if we count the provirus), but I will probably get few more. In the end I only obtained 3 contigs (and none of them was the provirus sequence)

What is the difference between a ‘contig’ and a ‘unitig’?

Unitigs can be seen as contigs with an higher degree of confidence, as they do not contain sequences with conflicting information. Contigs are often made up by one or more unitig(s)

What is the difference between a ‘contig’ and a ‘scaffold’?

A scaffold is a set of oriented contigs. Scaffold might contain gaps, while contig usually do not have gaps (and if they do, they are usually quite small)

What are the kmers? What kmer(s) should you use? What are the problems and benefits of choosing a small kmer? And a big kmer?

Kmers for a given sequence, are all the words (sequences) of length n that are contained in the original sequence. Benefits of small kmers:

  • decrease amount of space necessary to store the original sequence
  • increases the chance for all the kmers to overlap and thus the chances of constructing a complete de Bruijn graph is also increased

Benefits of long kmers:

  • Easier to reconstruct the genome, as there are fewer paths to traverse the De-Brujin graph
  • Allow to reconstruct regions rich in short repeats

Some assemblers can include a read-correction step before doing the assembly. What is this step doing?

This step tries to generate sequences with lower error rate, by using multiple overlapping reads

Can you see any other letter apart from AGTC in your assembly? If so, what are those?

Not really, but I am aware that FASTA files can also contain other letters. For instance, N can be used to indicate a generic nucleic acid. The following is the list of letters allowed in FASTA files: A, C, G, T, U, R, Y, K, M, S, W, B, D, H, V, N, -.

Genome Assembly Evaluation

What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...)

N50, N90 etc. are a set of statistics that are used to describe the quality of a genome assembly in terms of sequence length and contiguity. I don't believe there is one single measurement capable of evaluating all the different aspects that affect an assembly quality. That being said, generally speaking I'd say that assembly coverage, N50, BUSCO score, number of missasemblies and number of ORFs, when used together are sufficient to accurately describe the quality of an assembly.

How does your assembly compare with the reference assembly? What can have caused the differences?

I think my assembly it's better, as I removed the overlapping region at the 3', while the assembly from the paper includes this overlap.

Why do you think your assembly is better/worse than the public one?

See question above.

Genome Assembly

What types of features are detected by the software? Which ones are more reliable a priori?

Prokka predicts all sorts of features. Coding regions, rRNA, tRNA, ncRNA, CRISPR regions etc. Generally speaking I believe that rRNA and tRNA are the two types of feature that are easy to reliably detect.

How many features of each kind are detected in your contigs? Do you detect the same number of features than the authors? How do they differ?

See the journal for details. My annotation has slightly more genes than the one from the paper, while for rRNA and tRNA the numbers are the same.

Why is more difficult to do the functional annotation in eukaryotic genomes?

Mainly because of transcript splicing, that can produce many different isoforms of the same transcripts. For these reasons to produce an high quality annotation, it's usually a good idea to also sequence the transcriptome

How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?

I don't know the exact number, but I'd say roughly 1/4 of the genes. To solve this issue I would use the hypothetical protein sequence to search on protein databases (such as UniProt) using tBLASTx. If sequence approaches do not work, then I would use some of the methods based on protein structure, like homology modeling or threading.

Mapping

What percentage of your reads map back to your contigs? Why do you think that is?

98% for the genome assembly and 92% for the transcriptome.

What potential issues can cause mRNA reads not to map properly to gene in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?

Aside for sequencing errors, contamination could cause a low mapping rate. Also using a non splice-aware aligner to map eukaryotic mRNAs to the genome.

Clone this wiki locally