The Bulked Segregant Analysis Tutorial

This tutorial is a step-by-step guide for using SciApps to perform bulked segregant analysis. The example data used here is from studying the sorghum Ms9 gene, located in chromosome 2, which encodes a PHD-finger transcription factor critical for pollen development (reference). The ms9 mutant plants (Mu574, right in the image) are crossed by the WT BTx623 pollen (left) to generate the F1 seeds. The F1 plants produced F2 seeds through self-fertilization. Leaf tissues from 20 confirmed F2 mutants are pooled, quality-checked, and subjected to 150-bp paired-end sequencing on an Illumina X-10 instrument.

To reduce the total computation time, we start the tutorial with variant filtering, skipping the alignment of raw reads with Bowtie2 and calling SNPs with bcftools. SNPs are filtered by the EMS mutation type (G to A or C to T), read depth, background, and minimum allele frequency. SnpEff is then used to annotate the filtered SNPs. Annotated SNPs are passed to SIFT4G for predicting amino acid substitution effects. Finally, results from all three steps are combined for visualization using the app, bsa_viewer.

bsa_viewer provides an interactive visualization interface for confirming whether the candidate genes are true or false positives. Then, at the last, we use Ensembl Plants/Gramene to identify mutant lines that contained independent mutation alleles in the candidate gene for further verification.

Example Data

Apps:

App name	App link	Description	Notes/other links
Bowtie2	Bowtie2-2.3.2	Fast and sensitive read alignment	Bowtie2 documentation
bcftools_call	bcftools_call-1.8	SNP/indel calling	bcftools documentation
bcftools_filter	bcftools_filter-1.8	SNP/indel filtering	bcftools documentation
SnpEff	SnpEff-4.3.1	Annotating variants	SnpEff documentation
SIFT4G	SIFT4G-0.0.1	A faster version of SIFT that predicts whether an amino acid substitution affects protein function	SIFT4G documentation
bsa_viewer	bsa_viewer-0.0.1	Interactive visualization of variants and segregation	Shiny documentation

Step 1: Requiring access to SciApps

This is one-time operation. Please login to SciApps directly if you have completed this step before.

Log into CyVerse User portal at https://user.cyverse.org.

By default, you will be under the 'Services' page, click on 'AVAILABLE', then 'REQUEST ACCESS' to SciApps.

Click on 'MY SERVICES', then click on 'LAUNCH' for Discovery Environment.

Once in Discovery Environment, click to open the 'Data' window. You should see the sci_data folder under your root folder:/iplant/home/YOUR_USER_NAME/sci_data.

Step 2: Uploading data for SciApps

This step will demo how to upload data (using CyVerse Discovery Environment) to the sci_data folder for accessing from SciApps.

Click sci_data folder to open it.

Click 'Upload', then 'Import from URL' to import this URL: https://data.cyverse.org/dav-anon/iplant/home/lwang/sci_data/results/bcftools_call-1.8_8ac63e0f-53da-427d-b58f-a379f1b10443/snp_bw2_ms9_1.vcf.gz

Note

Alternatively, you can click the above URL to download the file to your computer, then use 'Simple Upload from Desktop' to upload the file.

Note

This may take a few minutes. You can check the status by clicking the 'Bell' on the top right corner of DE. Once importing completed, 'Refresh' the window to see the file. This is a variant file in gzipped VCF format from aligning the raw reads to Sorghum v3 assembly with Bowtie2 and calling variants with bcftools.

Warning

If you are using the Chrome web browser and have grammarly turned on, the 'Import from URL' button will not be activated after pasting the URL. You can turn off grammarly for the page and reload the import form or switch to a different web browser.

Alternatively, use Cyberduck or iCommands for bulk data transfer to the sci_data folder.

Step 3: Variant filtering with bcftools_filter

This step should take less than 2 minutes with the example data. Three output files will be generated.

'LAUNCH' SciApps from your CyVerse user portal, or log into SciApps with your CyVerse user credentials at https://www.SciApps.org.

Click the Variant analysis category (left panel) to find or search for bcftools_filter, then click to load bcftools_filter-1.8.

Under “Specify the variant file”, click Browse DataStore, then navigate to the sci_data folder (under 'home'); select the variant file and click 'Select and Close'.

Tip

Click 'Refresh' if you can not see a newly uploaded file.

Leave other parameters as default, and click Submit Job. You will be asked to confirm; click "Submit". You will be prompted to check the job status in the right panel.

Note

Click the info (i) icon to check the analysis status. The 'eye' icon (for visualization) is grayed out before the job is completed.

Specify the maximum read depth: Filtered out SNPs that might be in the repeat region

Specify the minimum read depth: Filtered out SNPs that might be due to sequencing or alignment error

Keep EMS snps only: Whether or not to just keep EMS mutations (GC to AT)

Specify the minimum variant ...: Only SNPs with allele frequency above the threshold will be kept

Once COMPLETED, click '1: bcftools_filter-1.8' (from the History panel) to expand outputs. There are three output files.

Note

bsa_plot.txt.gz is the file containing allele frequencies and P-values for feeding into the 'bsa_viewer' app. flt_snp_bw2_ms9_1.vcf.gz and flt_snp_bw2_ms9_1.vcf.gz.tbi are filtered variant file and its index file.

Step 4: Annotating variant with SnpEff

This step annotates the filtered SNPs with SnpEff and outputs an annotated VCF file.

Click the Variant analysis category (left panel) to find or search for SnpEff, then click to load SnpEff-4.3.1.

Click 1: bcftools_filter-1.8 in the History panel to expand its outputs, then drag and drop flt_snp_bw2_ms9_1.vcf.gz into the Specify the variant file field.

Leave others as defaults, then click the "Submit Job" button.

Note

The annotation file is optional. Any variant that intersects an interval defined in it will be annotated using the "name" field (fourth column) of the input annotation file (in bed format).

Once COMPLETED, click '2: SnpEff-4.3.1' to expand outputs.

Note

There are three output files:

genes.txt.gz: a text file summarizing the number of variant types per gene

snpEff_flt_snp_bw2_ms9_1.vcf.gz: an annotated VCF file

summary.html: an HTML file containing summary statistics about the variants and their annotations

Step 5: Predicting variant effects with SIFT

For nonsynonymous SNPs, we use SIFT to predict whether they will alter the protein function.

Click the Variant analysis category (left panel) to find or search for SIFT4G, then click to load SIFT4G-0.0.1.

Click 2: SnpEff-4.3.1 in the History panel to expand its outputs, then drag and drop snpEff_flt_bw2_ms9_1.vcf.gz into the Specify the variant file field.

Leave other parameters as default, and click the "Submit Job" button.

Note

Check Multitranscripts: If Yes, estimate the mutation effect on each isoform

Is the variant file sorted: If No, the variant file will be sorted

Once COMPLETED, click '3: SIFT4G-0.0.1' to expand its outputs.

Note

There are two output files:

annotations_flt_snp_bw2_ms9_1.xls.fz: an XLS file with variant annotation

predictions_flt_snp_bw2_ms9_1.vcf.gz: a VCF file with variant effect prediction

Step 6: Visualizing EMS outputs

This step combines the results from Step 3, 4, and 5, and homologous genes grabbed from Gramene/ensemblPlants. The output file, bsa_plot.view.tgz, can be interactively visualized through a Shiny app.

Click the Variant analysis category (left panel) to find or search for bsa_viewer, then click to load bsa_viewer-0.0.1.

Click to expand outputs of the three jobs in the History panel, then drag and drop outputs to the input fields as shown below:

Click the "Submit Job" button. Once COMPLETED, click the 'eye' icon for the bsa_viewer-0.0.1 job in the History panel to open the following dialog window. Select the output file bsa_plot.view.tgz, then click 'Visualize' to open the Shiny app.

Warning

The interactive BSA viewer will be displayed in a new tab of your web browser window, so please check if pop-ups from SciApps are blocked by your browser and disable it if needed.

By default, the BSA viewer displays the linking probability plots along the chromosome for 'All chromosomes', with a blue horizontal line indicating the 10^-5 significance threshold. As shown below, we can use the significance threshold to rule out two candidate genes in chromosome 5.

Note

T-test is used to test whether a region of the chromosome is segregated in the population (low P-values) or not (high P-values). A blue horizontal line is drawn to indicate the 10^-5 significance threshold.

Nonsynonymous SNPs are marked as blue circles in the plot and filled with the red color if it is significant. Stop-gain mutations, mutations at splice donor or acceptor sites, or missense mutations with a SIFT score <= 0.05 and median info <= 3.25 are considered as having significant effects.

Nonsynonymous SNPs are also displayed in the table with the associated gene ids, paralogous genes from Arabidopsis thaliana and Oryza sativa ssp. japonica, SnpEff annotation, and SIFT score.

As shown above, two candidate genes (red dots) are located on chromosome 2. Choose "Chromosome 2" in the left panel to focus on Chromosome 2 for both the plot and the table below the plot.

Tip

Clicking the nonsynonymous SNP on the plot will highlight the SNP in the table, and vice versa.

Choose a different window size (in the left panel) to estimate the linking probability.

Use the left panel to switch to the SNP ratio plot, then, if needed, use the slider bar to change the smoothness of the fitted curve for the plot.

Step 7: Finding mutant lines with significant mutation on the same candidate gene

With two candidate genes left from the last step (red dots above the threshold), Sobic.002G221000(Sb02g026200) looks promising since it encodes a PHD-finger transcription factor that is critical for pollen development in Arabidopsis (ref). Click on the gene id Sobic.002G221000(Sb02g026200) to query the EMS SNP database. It returns that ARS178 has a significant mutation on the same gene. You can acquire the seeds, plant, and cross both lines for a complementation test.

Step 8: Finding mutant lines with significant mutation on the same candidate gene using Ensembl Plants/Gramene (Optional)

Alternatively, there is an EMS SNP database available at Ensembl Plants/Gramene. With the database, we can find the mutant lines that carry the independent mutations in the same gene.

Go to Ensembl Plants.

Select Sorghum bicolor under "All genomes".

Search for SORBI_3002G221000 and click SORBI_3002G221000 (after 'Gene ID') to open the gene page.

Click the Variant table under "Genetic Variation" from the left panel.

Filter SNPs by SIFT score <= 0.05 to find that SNP tmp_2_61310404_C_T is the only one left. Click tmp_2_61310404_C_T to open the Variant page.

Click 247 sample genotypes (in the six-row). Then sort the Genotype twice or until seeing C|T as the first one. The mutation is from the EMS pool named ARS178.

Summary

This tutorial covers how to use SciApps for bulked segregant Analysis, including accessing data in CyVerse Data Store, launching jobs, visualizing results, and using the EMS SNP database or Ensembl Plants/Gramene to find the mutant lines that carry different mutations in the same gene. The analysis can also be done as an automatic workflow, available as BSA-Seq under Workflow/Public workflows. The diagram of the workflow is shown below.

More help and additional information

Efficient Identification of Causal Mutations through Sequencing of Bulked F2 from Two Allelic Bloomless Mutants of Sorghum bicolor

Fix or improve this documentation

Search for an answer: |CyVerse Learning Center|
Ask us for help: click |Intercom| on the lower right-hand side of the page
Report an issue or submit a change: |Github Repo Link|
Send feedback: Tutorials@CyVerse.org

Learning Center Home

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bsa.rst

bsa.rst

The Bulked Segregant Analysis Tutorial

Step 1: Requiring access to SciApps

Step 2: Uploading data for SciApps

Step 3: Variant filtering with bcftools_filter

Step 4: Annotating variant with SnpEff

Step 5: Predicting variant effects with SIFT

Step 6: Visualizing EMS outputs

Step 7: Finding mutant lines with significant mutation on the same candidate gene

Step 8: Finding mutant lines with significant mutation on the same candidate gene using Ensembl Plants/Gramene (Optional)

Summary

More help and additional information

Files

bsa.rst

Latest commit

History

bsa.rst

File metadata and controls

The Bulked Segregant Analysis Tutorial

Step 1: Requiring access to SciApps

Step 2: Uploading data for SciApps

Step 3: Variant filtering with bcftools_filter

Step 4: Annotating variant with SnpEff

Step 5: Predicting variant effects with SIFT

Step 6: Visualizing EMS outputs

Step 7: Finding mutant lines with significant mutation on the same candidate gene

Step 8: Finding mutant lines with significant mutation on the same candidate gene using Ensembl Plants/Gramene (Optional)

Summary

More help and additional information