This tutorial is a step-by-step guide for using SciApps to perform bulked segregant analysis. The example data comes from a study of the sorghum Ms9 gene, located on chromosome 2, which encodes a PHD-finger transcription factor critical for pollen development (reference). The ms9 mutant plants (Mu574, right in the image) were crossed with WT BTx623 pollen (left) to generate F1 seeds. The F1 plants produced F2 seeds through self-fertilization. Leaf tissues from 20 confirmed F2 mutants were pooled, quality-checked, and subjected to 150-bp paired-end sequencing on an Illumina X-10 instrument.
To reduce the total computation time, we start the tutorial with variant filtering, skipping read alignment with Bowtie2 and SNP calling with bcftools. SNPs are filtered by the EMS mutation type (G to A or C to T), read depth, background, and minimum allele frequency. SnpEff is then used to annotate the filtered SNPs. Annotated SNPs are passed to SIFT4G for predicting amino acid substitution effects. Finally, results from all three steps are combined for visualization using the bsa_viewer app.
bsa_viewer provides an interactive visualization interface for confirming whether the candidate genes are true or false positives. Finally, we use Ensembl Plants/Gramene to identify mutant lines that contain independent mutant alleles of the candidate gene for further verification.
Example Data
Apps:
App name | App link | Description | Notes/other links |
---|---|---|---|
Bowtie2 | Bowtie2-2.3.2 | Fast and sensitive read alignment | Bowtie2 documentation |
bcftools_call | bcftools_call-1.8 | SNP/indel calling | bcftools documentation |
bcftools_filter | bcftools_filter-1.8 | SNP/indel filtering | bcftools documentation |
SnpEff | SnpEff-4.3.1 | Annotating variants | SnpEff documentation |
SIFT4G | SIFT4G-0.0.1 | A faster version of SIFT that predicts whether an amino acid substitution affects protein function | SIFT4G documentation |
bsa_viewer | bsa_viewer-0.0.1 | Interactive visualization of variants and segregation | Shiny documentation |
This is a one-time operation. If you have completed this step before, log in to SciApps directly.
Log into CyVerse User portal at https://user.cyverse.org.
By default, you will land on the 'Services' page. Click on 'AVAILABLE', then 'REQUEST ACCESS' for SciApps.
Click on 'MY SERVICES', then click on 'LAUNCH' for Discovery Environment.
Once in Discovery Environment, click to open the 'Data' window. You should see the sci_data folder under your root folder: /iplant/home/YOUR_USER_NAME/sci_data.
This step demonstrates how to upload data (using CyVerse Discovery Environment) to the sci_data folder so that it can be accessed from SciApps.
Click sci_data folder to open it.
Click 'Upload', then 'Import from URL' to import this URL: https://data.cyverse.org/dav-anon/iplant/home/lwang/sci_data/results/bcftools_call-1.8_8ac63e0f-53da-427d-b58f-a379f1b10443/snp_bw2_ms9_1.vcf.gz
Note
Alternatively, you can click the above URL to download the file to your computer, then use 'Simple Upload from Desktop' to upload the file.
Note
This may take a few minutes. You can check the status by clicking the 'Bell' icon in the top right corner of DE. Once the import has completed, 'Refresh' the window to see the file. This is a variant file in gzipped VCF format, produced by aligning the raw reads to the Sorghum v3 assembly with Bowtie2 and calling variants with bcftools.
Warning
If you are using the Chrome web browser and have Grammarly turned on, the 'Import from URL' button will not be activated after pasting the URL. You can turn off Grammarly for the page and reload the import form, or switch to a different web browser.
Alternatively, use Cyberduck or iCommands for bulk data transfer to the sci_data folder.
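If you downloaded the example file to your computer before uploading it, a quick sanity check can confirm that it is a readable gzipped VCF. The following is a minimal sketch, not part of SciApps; the function name summarize_vcf is hypothetical, and it only assumes the standard VCF layout (header lines starting with '#', tab-separated records with the chromosome in the first column).

```python
import gzip
from collections import Counter


def summarize_vcf(path):
    """Count variant records per chromosome in a gzipped VCF file.

    Hypothetical helper for a local sanity check; not a SciApps function.
    """
    per_chrom = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                # Skip meta-information lines (##...) and the column header (#CHROM...)
                continue
            chrom = line.split("\t", 1)[0]
            per_chrom[chrom] += 1
    return per_chrom
```

For example, summarize_vcf("snp_bw2_ms9_1.vcf.gz") returns a counter keyed by chromosome name, which should be non-empty if the download succeeded.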
This step should take less than 2 minutes with the example data. Three output files will be generated.
'LAUNCH' SciApps from your CyVerse user portal, or log into SciApps with your CyVerse user credentials at https://www.SciApps.org.
Click the Variant analysis category (left panel) to find or search for bcftools_filter, then click to load bcftools_filter-1.8.
Under “Specify the variant file”, click Browse DataStore, then navigate to the sci_data folder (under 'home'); select the variant file and click 'Select and Close'.
Tip
Click 'Refresh' if you cannot see a newly uploaded file.
Leave other parameters as default, and click Submit Job. You will be asked to confirm; click "Submit". You will be prompted to check the job status in the right panel.
Note
Click the info (i) icon to check the analysis status. The 'eye' icon (for visualization) is grayed out before the job is completed.
- Specify the maximum read depth: Filters out SNPs that might be in a repeat region
- Specify the minimum read depth: Filters out SNPs that might be due to sequencing or alignment errors
- Keep EMS snps only: Whether or not to just keep EMS mutations (GC to AT)
- Specify the minimum variant ...: Only SNPs with allele frequency above the threshold will be kept
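To make the filtering logic concrete, the sketch below applies the EMS mutation type, read depth, and allele frequency criteria described above to a single SNP. It is an illustration only and is not the bcftools_filter implementation: the function names are hypothetical, the thresholds are supplied by the caller (they are not the app's defaults), the frequency cutoff is assumed to be inclusive, and the background filter is not covered.

```python
# EMS typically induces G-to-A or C-to-T transitions (GC to AT)
EMS_CHANGES = {("G", "A"), ("C", "T")}


def is_ems_snp(ref, alt):
    """Return True if the REF-to-ALT change is an EMS-type transition."""
    return (ref.upper(), alt.upper()) in EMS_CHANGES


def passes_filters(ref, alt, depth, alt_freq,
                   min_depth, max_depth, min_freq, ems_only=True):
    """Apply the mutation type, read depth, and allele frequency filters.

    All thresholds are caller-supplied, hypothetical values.
    """
    if ems_only and not is_ems_snp(ref, alt):
        return False
    if not (min_depth <= depth <= max_depth):
        # Too shallow suggests sequencing or alignment error;
        # too deep suggests a repeat region
        return False
    # Assumption: the threshold itself is treated as passing
    return alt_freq >= min_freq
```

With illustrative thresholds, passes_filters("G", "A", depth=20, alt_freq=0.9, min_depth=5, max_depth=100, min_freq=0.5) keeps the SNP, while the same call with ref="A", alt="G" rejects it as a non-EMS change.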
Once COMPLETED, click '1: bcftools_filter-1.8' (from the History panel) to expand outputs. There are three output files.
Note
bsa_plot.txt.gz contains the allele frequencies and P-values that are fed into the 'bsa_viewer' app. flt_snp_bw2_ms9_1.vcf.gz and flt_snp_bw2_ms9_1.vcf.gz.tbi are the filtered variant file and its index file.
This step annotates the filtered SNPs with SnpEff and outputs an annotated VCF file.
Click the Variant analysis category (left panel) to find or search for SnpEff, then click to load SnpEff-4.3.1.
Click 1: bcftools_filter-1.8 in the History panel to expand its outputs, then drag and drop flt_snp_bw2_ms9_1.vcf.gz into the Specify the variant file field.
Leave others as defaults, then click the "Submit Job" button.
Note
The annotation file is optional. Any variant that intersects an interval defined in it will be annotated using the "name" field (fourth column) of the input annotation file (in bed format).
Once COMPLETED, click '2: SnpEff-4.3.1' to expand outputs.
Note
There are three output files:
- genes.txt.gz: a text file summarizing the number of variant types per gene
- snpEff_flt_snp_bw2_ms9_1.vcf.gz: an annotated VCF file
- summary.html: an HTML file containing summary statistics about the variants and their annotations
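If you download the annotated VCF file and want to inspect it outside SciApps, the sketch below pulls the predicted effect out of each record. It assumes the annotations are stored in the INFO column under the ANN key, using the pipe-separated field order SnpEff documents for that key (allele, annotation, impact, gene name, and so on, with HGVS.p as the eleventh field); whether the SciApps run writes ANN rather than the older EFF key is an assumption. The function name parse_ann is hypothetical.

```python
def parse_ann(info):
    """Extract effect, impact, gene, and protein change from a VCF INFO string.

    Assumes SnpEff's ANN field layout; returns an empty list if ANN is absent.
    """
    for entry in info.split(";"):
        if not entry.startswith("ANN="):
            continue
        records = []
        # One variant can carry several comma-separated annotations
        for ann in entry[len("ANN="):].split(","):
            fields = ann.split("|")
            if len(fields) < 11:
                # Skip malformed or truncated annotations
                continue
            records.append({
                "effect": fields[1],   # e.g. missense_variant
                "impact": fields[2],   # HIGH, MODERATE, LOW, or MODIFIER
                "gene": fields[3],
                "hgvs_p": fields[10],  # protein-level change, may be empty
            })
        return records
    return []
```

Each returned dictionary corresponds to one transcript-level annotation, so a single SNP may yield more than one entry.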
For nonsynonymous SNPs, we use SIFT to predict whether they will alter the protein function.
Click the Variant analysis category (left panel) to find or search for SIFT4G, then click to load SIFT4G-0.0.1.
Click 2: SnpEff-4.3.1 in the History panel to expand its outputs, then drag and drop snpEff_flt_snp_bw2_ms9_1.vcf.gz into the Specify the variant file field.
Leave other parameters as default, and click the "Submit Job" button.
Note
- Check Multitranscripts: If Yes, estimate the mutation effect on each isoform
- Is the variant file sorted: If No, the variant file will be sorted
Once COMPLETED, click '3: SIFT4G-0.0.1' to expand its outputs.
Note
There are two output files:
- annotations_flt_snp_bw2_ms9_1.xls.fz: an XLS file with variant annotation
- predictions_flt_snp_bw2_ms9_1.vcf.gz: a VCF file with variant effect prediction
This step combines the results from Steps 3, 4, and 5 with homologous genes retrieved from Gramene/Ensembl Plants. The output file, bsa_plot.view.tgz, can be interactively visualized through a Shiny app.
Click the Variant analysis category (left panel) to find or search for bsa_viewer, then click to load bsa_viewer-0.0.1.
Click to expand outputs of the three jobs in the History panel, then drag and drop outputs to the input fields as shown below:
Click the "Submit Job" button. Once COMPLETED, click the 'eye' icon for the bsa_viewer-0.0.1 job in the History panel to open the following dialog window. Select the output file bsa_plot.view.tgz, then click 'Visualize' to open the Shiny app.
Warning
The interactive BSA viewer opens in a new tab of your web browser, so please check whether your browser is blocking pop-ups from SciApps and allow them if needed.
By default, the BSA viewer displays the linking probability plots along the chromosome for 'All chromosomes', with a blue horizontal line indicating the 10^-5 significance threshold. As shown below, we can use the significance threshold to rule out two candidate genes on chromosome 5.
Note
A t-test is used to test whether a region of the chromosome is segregating in the population (low P-values) or not (high P-values). A blue horizontal line is drawn to indicate the 10^-5 significance threshold.
Nonsynonymous SNPs are marked as blue circles in the plot and are filled in red if they are significant. Stop-gain mutations, mutations at splice donor or acceptor sites, and missense mutations with a SIFT score <= 0.05 and median info <= 3.25 are considered to have significant effects.
Nonsynonymous SNPs are also displayed in the table with the associated gene IDs, homologous genes from Arabidopsis thaliana and Oryza sativa ssp. japonica, SnpEff annotation, and SIFT score.
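The rule for calling a mutation significant can be written as a small function. This is a minimal sketch of the criteria stated above, not the bsa_viewer source code. The effect names are the Sequence Ontology terms that SnpEff uses (stop_gained, splice_donor_variant, splice_acceptor_variant, missense_variant); the function name and the handling of missing SIFT values are assumptions.

```python
# Effects treated as significant regardless of SIFT prediction
HIGH_EFFECTS = {"stop_gained", "splice_donor_variant", "splice_acceptor_variant"}

SIFT_SCORE_MAX = 0.05    # SIFT score cutoff from the tutorial text
MEDIAN_INFO_MAX = 3.25   # median info cutoff from the tutorial text


def is_significant(effect, sift_score=None, median_info=None):
    """Return True if a SNP meets the significance criteria described above.

    `effect` may combine several terms with '&', as SnpEff does.
    Assumption: a missense SNP without SIFT values is not called significant.
    """
    terms = set(effect.split("&"))
    if terms & HIGH_EFFECTS:
        return True
    if "missense_variant" in terms:
        if sift_score is None or median_info is None:
            return False
        return sift_score <= SIFT_SCORE_MAX and median_info <= MEDIAN_INFO_MAX
    return False
```

For example, is_significant("missense_variant", sift_score=0.01, median_info=2.9) is True, while the same SNP with median_info=3.5 is not, because a high median info indicates a low-confidence SIFT prediction.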
As shown above, two candidate genes (red dots) are located on chromosome 2. Choose "Chromosome 2" in the left panel to focus on Chromosome 2 for both the plot and the table below the plot.
Tip
Clicking the nonsynonymous SNP on the plot will highlight the SNP in the table, and vice versa.
Choose a different window size (in the left panel) to estimate the linking probability.
Use the left panel to switch to the SNP ratio plot, then, if needed, use the slider bar to change the smoothness of the fitted curve for the plot.
With two candidate genes left from the last step (red dots above the threshold), Sobic.002G221000 (Sb02g026200) looks promising since it encodes a PHD-finger transcription factor that is critical for pollen development in Arabidopsis (ref). Click on the gene ID Sobic.002G221000 (Sb02g026200) to query the EMS SNP database. The query shows that ARS178 carries a significant mutation in the same gene. You can acquire the seeds, plant them, and cross both lines for a complementation test.
Step 8: Finding mutant lines with significant mutation on the same candidate gene using Ensembl Plants/Gramene (Optional)
Alternatively, there is an EMS SNP database available at Ensembl Plants/Gramene. With the database, we can find the mutant lines that carry the independent mutations in the same gene.
- Go to Ensembl Plants.
- Select Sorghum bicolor under "All genomes".
- Search for SORBI_3002G221000 and click SORBI_3002G221000 (after 'Gene ID') to open the gene page.
- Click the Variant table under "Genetic Variation" from the left panel.
- Filter SNPs by SIFT score <= 0.05 to find that SNP tmp_2_61310404_C_T is the only one left. Click tmp_2_61310404_C_T to open the Variant page.
- Click 247 sample genotypes (in the sixth row). Then sort the Genotype column twice, or until C|T appears as the first entry. The mutation is from the EMS pool named ARS178.
This tutorial covers how to use SciApps for bulked segregant analysis, including accessing data in the CyVerse Data Store, launching jobs, visualizing results, and using the EMS SNP database or Ensembl Plants/Gramene to find the mutant lines that carry different mutations in the same gene. The analysis can also be done as an automatic workflow, available as BSA-Seq under Workflow/Public workflows. The diagram of the workflow is shown below.