Guest Instructor: Haikuo Li, Ph.D., Yale University
February 2025
In this lab section, you will:
→ Analyze and visualize fragment inserts in snATAC-seq data
→ Understand quality control procedures in snATAC-seq data analysis
→ Get familiar with samtools
→ Learn basic Python coding skills
→ (Optional) Downstream snATAC-seq data analysis and data mining
The main lab task is primarily based on Python.
You may run Python on the Yale HPC (recommended) (https://beng469.ycrc.yale.edu/), or on your laptop or other resources.
You may run Python in Jupyter Notebook (recommended) or in the Linux Shell interface.
• Have pysam, pandas and matplotlib installed in your Python.
o If you are using the Yale HPC, load the miniconda module and create a new conda environment containing python, pysam, pandas, matplotlib and jupyter. The Unix commands are provided below:
## You must be on a compute node to run anything, so request one with salloc
salloc
## This command unloads any modules that are currently loaded
module purge
## now let's create a new miniconda environment
module load miniconda
conda create -n atac_class python jupyter jupyterlab pysam matplotlib pandas
#enter y when the system asks you. This takes 3-5 minutes
conda activate atac_class
#you shouldn't see any errors with this command
ycrc_conda_env.sh update
# The new conda environment is now available in Jupyter Notebook on the Yale HPC
• Whether or not you use the Yale HPC, test your setup by running these 3 commands in Python. They should all run without any error messages.
import pysam
import collections
import matplotlib.pyplot as plt
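If one of the imports fails, it can help to see exactly which package is missing. The following is a minimal sketch, not part of the official lab material: the helper name check_packages is hypothetical, and it simply tries to import each package and reports its version when one is exposed.

```python
import importlib


def check_packages(names):
    """Return {name: version string, or None if the import fails}."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module defines __version__, so fall back to "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in check_packages(["pysam", "pandas", "matplotlib"]).items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED should be added to the conda environment before the lab.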
We will also use samtools (https://www.htslib.org/), a command-line tool run from the Linux shell. If you are using the Yale HPC, you can make samtools available with this Unix command:
module load SAMtools
• If you are not using the Yale HPC, make sure samtools is installed. To check that the installation succeeded, you may run:
samtools --version
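If you work mostly in Jupyter Notebook, the same check can be done from Python. This is only a sketch under the assumption that samtools should be on your PATH. The helper name tool_version is hypothetical, and it uses only the Python standard library.

```python
import shutil
import subprocess


def tool_version(executable="samtools"):
    """Return the first line of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


version = tool_version()
print(version if version is not None else "samtools not found on PATH")
```

On the Yale HPC, remember that the module must be loaded in the same session that started Python, otherwise the tool will not be found.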
• We will download some BAM files provided by Cusanovich and Hill, et al. (database link: https://atlas.gs.washington.edu/mouse-atac/data/).
• First, we will analyze the Cerebellum BAM data (Cerebellum_62216.bam). Since the original BAM file is large (2.1 GB), we generated a 10% downsampled subset for you (Link: https://docs.google.com/uc?export=download&id=1bubrwts2I_J-woZTVwyzlkIuyMX9_3rL). You should download this subset and upload it to your own Linux workspace.
• Second, choose any tissue you like, other than the cerebellum, that is available in this database. Download its BAM file (note: not the .bam.bai file, which is the index) and make sure the .bam file is available in your own Linux workspace (you may use wget).
• Search online and learn what a SAM/BAM file is.
• (Optional) For Python training purposes, download a small demo dataset from my GitHub (https://github.com/HaikuoLi/Yale_BENG469_teaching/blob/main/ATAC_meta.csv) and upload it to your workspace. You may skip this if you are already comfortable with Python.
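Once a BAM file is in your workspace, a quick way to confirm that it is readable is to tally fragment (insert) sizes with pysam and collections. The sketch below is an illustration under stated assumptions, not necessarily the method used in the lab: it assumes paired-end reads, counts only read 1 of each properly paired fragment so that each fragment contributes once, and drops fragments longer than a hypothetical cutoff of 1000 bp. The function names are hypothetical as well.

```python
import collections


def count_insert_sizes(reads, max_size=1000):
    """Tally fragment lengths from an iterable of pysam-style alignment records.

    Each record must expose `is_proper_pair`, `is_read1`, and
    `template_length` (the TLEN field of a SAM record).
    """
    counts = collections.Counter()
    for read in reads:
        # Count each fragment once by keeping only read 1 of proper pairs.
        if not (read.is_proper_pair and read.is_read1):
            continue
        size = abs(read.template_length)  # TLEN is negative for the reverse mate
        if 0 < size <= max_size:
            counts[size] += 1
    return counts


def insert_sizes_from_bam(bam_path, max_size=1000):
    """Open a BAM file with pysam and tally its fragment lengths."""
    import pysam  # imported here so the counting logic can be tried without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # until_eof=True reads the file sequentially, so no .bai index is required.
        return count_insert_sizes(bam.fetch(until_eof=True), max_size)
```

For example, counts = insert_sizes_from_bam("my_sample.bam"), where my_sample.bam is a placeholder for your own file, returns a Counter that maps fragment length to the number of fragments. The sorted items can then be passed to plt.plot to draw the fragment size distribution.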
If we have enough time for extra lab tasks, we will learn how to perform downstream snATAC-seq data analysis with SnapATAC2 or Signac, two popular analysis packages, following publicly available vignettes.
• If you prefer Python, have SnapATAC2 installed.
• If you prefer R (RStudio), have Signac installed.