precellar.SeqSpec.delete_read
SeqSpec.delete_read(read_id)
Delete a read from the SeqSpec object.
diff --git a/.buildinfo b/.buildinfo new file mode 100644 index 0000000..7b9d173 --- /dev/null +++ b/.buildinfo @@ -0,0 +1,4 @@ +# Sphinx build info version 1 +# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. +config: f2af386d6e99bb22829f120be7f0f3d7 +tags: 645f666f9bcd5a90fca523b33c5a78b7 diff --git a/.doctrees/_autosummary/precellar.SeqSpec.delete_read.doctree b/.doctrees/_autosummary/precellar.SeqSpec.delete_read.doctree new file mode 100644 index 0000000..0b99698 Binary files /dev/null and b/.doctrees/_autosummary/precellar.SeqSpec.delete_read.doctree differ diff --git a/.doctrees/_autosummary/precellar.SeqSpec.doctree b/.doctrees/_autosummary/precellar.SeqSpec.doctree new file mode 100644 index 0000000..599ec4d Binary files /dev/null and b/.doctrees/_autosummary/precellar.SeqSpec.doctree differ diff --git a/.doctrees/_autosummary/precellar.SeqSpec.to_yaml.doctree b/.doctrees/_autosummary/precellar.SeqSpec.to_yaml.doctree new file mode 100644 index 0000000..f8a300b Binary files /dev/null and b/.doctrees/_autosummary/precellar.SeqSpec.to_yaml.doctree differ diff --git a/.doctrees/_autosummary/precellar.SeqSpec.update_read.doctree b/.doctrees/_autosummary/precellar.SeqSpec.update_read.doctree new file mode 100644 index 0000000..3d7ad6f Binary files /dev/null and b/.doctrees/_autosummary/precellar.SeqSpec.update_read.doctree differ diff --git a/.doctrees/_autosummary/precellar.align.doctree b/.doctrees/_autosummary/precellar.align.doctree new file mode 100644 index 0000000..775fbb8 Binary files /dev/null and b/.doctrees/_autosummary/precellar.align.doctree differ diff --git a/.doctrees/_autosummary/precellar.make_fragment.doctree b/.doctrees/_autosummary/precellar.make_fragment.doctree new file mode 100644 index 0000000..6310367 Binary files /dev/null and b/.doctrees/_autosummary/precellar.make_fragment.doctree differ diff --git a/.doctrees/_autosummary/precellar.make_genome_index.doctree 
b/.doctrees/_autosummary/precellar.make_genome_index.doctree new file mode 100644 index 0000000..362bedb Binary files /dev/null and b/.doctrees/_autosummary/precellar.make_genome_index.doctree differ diff --git a/.doctrees/_autosummary/precellar.utils.strip_barcode_from_fastq.doctree b/.doctrees/_autosummary/precellar.utils.strip_barcode_from_fastq.doctree new file mode 100644 index 0000000..6536212 Binary files /dev/null and b/.doctrees/_autosummary/precellar.utils.strip_barcode_from_fastq.doctree differ diff --git a/.doctrees/api.doctree b/.doctrees/api.doctree new file mode 100644 index 0000000..a95c509 Binary files /dev/null and b/.doctrees/api.doctree differ diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle new file mode 100644 index 0000000..f2e0ed4 Binary files /dev/null and b/.doctrees/environment.pickle differ diff --git a/.doctrees/index.doctree b/.doctrees/index.doctree new file mode 100644 index 0000000..a0dadf6 Binary files /dev/null and b/.doctrees/index.doctree differ diff --git a/.doctrees/nbsphinx/tutorials/generic.ipynb b/.doctrees/nbsphinx/tutorials/generic.ipynb new file mode 100644 index 0000000..c158718 --- /dev/null +++ b/.doctrees/nbsphinx/tutorials/generic.ipynb @@ -0,0 +1,246 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Processing barcoded Fastq files\n", + "\n", + "You would likely encounter barcoded fastq files when working with single cell ATAC-seq data.\n", + "As on early days of single cell ATAC-seq, cell barcodes are usually added to the read name of the fastq files.\n", + "This notebook demonstrates how to process these barcoded fastq files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import precellar" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting cell barcodes from read names" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@CCAGCACAAGCCATCCTATCGT:A00953:155:HVCHLDRXX:1:1101:1036:1031 1:N:0:1\n", + "ANCTTGGATCATCAGGTTTGTCTGTAGCTGATTTATTTCTTTAAGTTTCCC\n", + "+\n", + "F#FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n", + "@TAACCACTACGAATGACTGACA:A00953:155:HVCHLDRXX:1:1101:1127:1031 1:N:0:1\n", + "TNCCAGGACCAGTGACCGTCACCCGCAGTAAGGATCGGGGCGGCTCCGCCA\n", + "+\n", + "F#:FFFFFFFFF:FFFFF:FF,F,FFFFFFFF,FFF:FFFF:FFFFFF,FF\n", + "@CGATATGTAGGGGACTAATTCC:A00953:155:HVCHLDRXX:1:1101:1145:1031 1:N:0:1\n", + "GNCGGATCACAAGGTCAGGAGTTCGAGACCTGGCTGGCCAACACGGTGAAA\n", + "\n", + "gzip: stdout: Broken pipe\n" + ] + } + ], + "source": [ + "!zcat R1.fq.gz | head" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "precellar.utils.strip_barcode_from_fastq(\n", + " 'R1.fq.gz',\n", + " 'R1_processed.fq.zst',\n", + " out_barcode='I1.fq.zst',\n", + " regex=\"^([ACTG]+):\",\n", + " right_add=1,\n", + ")\n", + "\n", + "precellar.utils.strip_barcode_from_fastq(\n", + " 'R2.fq.gz',\n", + " 'R2_processed.fq.zst',\n", + " regex=\"^([ACTG]+):\",\n", + " right_add=1,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[90m[\u001b[0m2024-10-01T15:18:02Z \u001b[32mINFO \u001b[0m cached_path::cache\u001b[90m]\u001b[0m Starting download of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:02Z \u001b[32mINFO \u001b[0m 
cached_path::cache\u001b[90m]\u001b[0m Downloaded 2643 bytes\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:02Z \u001b[32mINFO \u001b[0m cached_path::cache\u001b[90m]\u001b[0m New version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml cached\n" + ] + } + ], + "source": [ + "assay = precellar.SeqSpec(\"https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/generic_atac.yaml\")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "└── atac(153-1150)\n", + " ├── atac-illumina_p5(29)\n", + " ├── atac-read1(34) [↓R1(1-98)]\n", + " ├── gDNA(1-1000)\n", + " ├── atac-read2(34) [↑R2(1-98), ↓I1(22)]\n", + " ├── atac-cell_barcode(22)\n", + " └── atac-illumina_p7(24)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "assay" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "assay.update_read(\"R1\", fastq=\"R1_processed.fq.zst\")\n", + "assay.update_read(\"I1\", fastq=\"I1.fq.zst\")\n", + "assay.update_read(\"R2\", fastq=\"R2_processed.fq.zst\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "└── atac(153-1150)\n", + " ├── atac-illumina_p5(29)\n", + " ├── atac-read1(34) [↓R1(51)]\n", + " ├── gDNA(1-1000)\n", + " ├── atac-read2(34) [↑R2(51), ↓I1(22)]\n", + " ├── atac-cell_barcode(22)\n", + " └── atac-illumina_p7(24)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "assay" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[90m[\u001b[0m2024-10-01T15:18:10Z \u001b[32mINFO \u001b[0m 
precellar::align\u001b[90m]\u001b[0m Counting barcodes...\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:10Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:10Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:10Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Found 2500 barcodes. 100.00% of them have an exact match in whitelist\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:10Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Aligning reads...\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:10Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n", + "\u001b[90m[\u001b[0m2024-10-01T15:18:10Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n", + "100%|██████████| 2500/2500 [00:00<00:00, 15545.42it/s]" + ] + } + ], + "source": [ + "qc = precellar.align(\n", + " assay, \"/data/kzhang/GRCh38/hg38.fa.gz\",\n", + " modality=\"atac\",\n", + " output_fragment=\"atac_fragments.tsv.zst\",\n", + " num_threads=32,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'frac_q30_bases_read1': 0.8179764705882353,\n", + " 'frac_valid_barcode': 1.0,\n", + " 'sequenced_read_pairs': 2500.0,\n", + " 'frac_q30_bases_barcode': 1.0,\n", + " 'frac_unmapped': 0.07640000000000002,\n", + " 'sequenced_reads': 5000.0,\n", + " 'frac_fragment_flanking_single_nucleosome': 0.0029791459781529296,\n", + " 'frac_confidently_mapped': 0.8524,\n", + " 
'frac_fragment_in_nucleosome_free_region': 0.010427010923535254,\n", + " 'frac_q30_bases_read2': 0.9442745098039216,\n", + " 'frac_nonnuclear': 0.0128,\n", + " 'frac_duplicates': 0.004940711462450593}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qc" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/.doctrees/nbsphinx/tutorials/txg_multiome.ipynb b/.doctrees/nbsphinx/tutorials/txg_multiome.ipynb new file mode 100644 index 0000000..21302b5 --- /dev/null +++ b/.doctrees/nbsphinx/tutorials/txg_multiome.ipynb @@ -0,0 +1,261 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Preprocess 10X Genomics Gene Expression + ATAC Data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import precellar" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create the SeqSpec object" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We first create a SeqSpec object from the 10X multiome seqspec template file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[90m[\u001b[0m2024-10-01T15:16:56Z \u001b[32mINFO \u001b[0m cached_path::cache\u001b[90m]\u001b[0m Cached version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml is up-to-date\n" + ] + } + ], + "source": [ + "assay = precellar.SeqSpec(\"https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The SeqSpec object can be visualized as a tree structure." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "├── rna(153-1150)\n", + "│ ├── rna-illumina_p5(29)\n", + "│ ├── rna-truseq_read1(29) [↓rna-R1(28)]\n", + "│ ├── rna-cell_barcode(16)\n", + "│ ├── rna-umi(12)\n", + "│ ├── rna-cDNA(1-1000)\n", + "│ ├── rna-truseq_read2(34) [↓rna-I1(8), ↑rna-R2(1-98)]\n", + "│ ├── rna-index7(8)\n", + "│ └── rna-illumina_p7(24)\n", + "└── atac(153-1150)\n", + " ├── atac-illumina_p5(29)\n", + " ├── atac-cell_barcode(16)\n", + " ├── atac-linker(8)\n", + " ├── atac-nextera_read1(33) [↓atac-R1(1-98), ↑atac-I2(24)]\n", + " ├── atac-gDNA(1-1000)\n", + " ├── atac-nextera_read2(34) [↓atac-I1(8), ↑atac-R2(1-98)]\n", + " ├── atac-index7(8)\n", + " └── atac-illumina_p7(24)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "assay" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We then add ATAC data to the SeqSpec object. Specifically, we add R1, R2 and index fastq files. Note that the index read, ususally named \"*_R2_*\" in the 10X multiome data, can be sequenced in either direction, depending on the sequencing platform. 
Older platforms such as NovaSeq 6000 with v1.0 reagent kits, MiniSeq with Rapid Reagent kits, MiSeq, HiSeq 2500, and HiSeq 2000 use the forward strand workflow. iSeq 100, MiniSeq with Standard reagent kits, NextSeq Systems, NovaSeq 6000 with v1.5 reagent kits, HiSeq X, HiSeq 4000, and HiSeq 3000 use the reverse strand workflow.\n", + "\n", + "We need to pay special attention to the orientation of the index read. In this example, the index read is sequenced in the reverse direction." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "assay.update_read(\"atac-R1\", fastq=\"/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R1_001.fq.zst\")\n", + "assay.update_read(\"atac-I2\", fastq=\"/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R2_001.fq.zst\")\n", + "assay.update_read(\"atac-R2\", fastq=\"/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R3_001.fq.zst\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can visualize the SeqSpec object again after adding the ATAC data. Note that read information has been updated, reflecting the real read length of the fastq files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "├── rna(153-1150)\n", + "│ ├── rna-illumina_p5(29)\n", + "│ ├── rna-truseq_read1(29) [↓rna-R1(28)]\n", + "│ ├── rna-cell_barcode(16)\n", + "│ ├── rna-umi(12)\n", + "│ ├── rna-cDNA(1-1000)\n", + "│ ├── rna-truseq_read2(34) [↓rna-I1(8), ↑rna-R2(1-98)]\n", + "│ ├── rna-index7(8)\n", + "│ └── rna-illumina_p7(24)\n", + "└── atac(153-1150)\n", + " ├── atac-illumina_p5(29)\n", + " ├── atac-cell_barcode(16)\n", + " ├── atac-linker(8)\n", + " ├── atac-nextera_read1(33) [↓atac-R1(150), ↑atac-I2(24)]\n", + " ├── atac-gDNA(1-1000)\n", + " ├── atac-nextera_read2(34) [↓atac-I1(8), ↑atac-R2(150)]\n", + " ├── atac-index7(8)\n", + " └── atac-illumina_p7(24)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "assay" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can save the SeqSpec object to a yaml file and load it next time." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"10x_rna_atac.yaml\", \"w\") as f:\n", + " f.write(assay.to_yaml())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Align ATAC reads\n", + "\n", + "We can now start reads mapping and create the fragment file. If you do not have the genome index, you can create it using the `make_genome_index` function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[90m[\u001b[0m2024-10-01T15:17:03Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Counting barcodes...\n", + "\u001b[90m[\u001b[0m2024-10-01T15:17:04Z \u001b[32mINFO \u001b[0m cached_path::cache\u001b[90m]\u001b[0m Cached version of https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/atac_737K-arc-v1.txt.gz is up-to-date\n", + "\u001b[90m[\u001b[0m2024-10-01T15:17:04Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n", + "\u001b[90m[\u001b[0m2024-10-01T15:17:11Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Found 12387063 barcodes. 94.19% of them have an exact match in whitelist\n", + "\u001b[90m[\u001b[0m2024-10-01T15:17:11Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Aligning reads...\n", + "\u001b[90m[\u001b[0m2024-10-01T15:17:11Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n", + "\u001b[90m[\u001b[0m2024-10-01T15:17:11Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (atac-R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n", + "100%|██████████| 12387063/12387063 [05:33<00:00, 37127.51it/s]" + ] + } + ], + "source": [ + "qc = precellar.align(\n", + " assay, \"/data/kzhang/genome/mm10\",\n", + " modality=\"atac\",\n", + " output_fragment=\"atac_fragments.tsv.zst\",\n", + " num_threads=32,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "QC metrics: " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"{'frac_fragment_flanking_single_nucleosome': 2.912761037492815e-05, 'frac_confidently_mapped': 0.9292046064511015, 'frac_q30_bases_read1': 0.9173089515515771, 'frac_valid_barcode': 0.9815396111249293, 'frac_duplicates': 0.33183332695206896, 'frac_unmapped': 0.008819422106194463, 'frac_q30_bases_barcode': 0.9219054690365263, 'frac_nonnuclear': 7.443249461151526e-05, 'frac_q30_bases_read2': 0.9229901653577338, 'sequenced_read_pairs': 12387063.0, 'sequenced_reads': 24774126.0, 'frac_fragment_in_nucleosome_free_region': 4.518862917979508e-05}\n" + ] + } + ], + "source": [ + "print(qc)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/.doctrees/tutorials/generic.doctree b/.doctrees/tutorials/generic.doctree new file mode 100644 index 0000000..c711cc6 Binary files /dev/null and b/.doctrees/tutorials/generic.doctree differ diff --git a/.doctrees/tutorials/index.doctree b/.doctrees/tutorials/index.doctree new file mode 100644 index 0000000..495a0f7 Binary files /dev/null and b/.doctrees/tutorials/index.doctree differ diff --git a/.doctrees/tutorials/txg_multiome.doctree b/.doctrees/tutorials/txg_multiome.doctree new file mode 100644 index 0000000..0c4e86e Binary files /dev/null and b/.doctrees/tutorials/txg_multiome.doctree differ diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/_autosummary/precellar.SeqSpec.delete_read.html b/_autosummary/precellar.SeqSpec.delete_read.html new file mode 100644 index 0000000..1423482 --- /dev/null +++ b/_autosummary/precellar.SeqSpec.delete_read.html @@ -0,0 +1,561 @@ + + + + + + +
+A SeqSpec object.
+A SeqSpec object is used to annotate sequencing libraries produced by genomics assays. Genomic library structure depends on both the assay and sequencer (and kits) used to generate and bind the assay-specific construct to the sequencing adapters to generate a sequencing library. SeqSpec is specific to both a genomics assay and sequencer and provides a standardized format for describing the structure of sequencing libraries and the resulting sequencing reads. See pachterlab/seqspec for more details.
+path – The local path or URL to the seqspec file.
+See also
+ +Methods
+delete_read | Delete a read from the SeqSpec object.
+to_yaml |
+update_read | Update read information in the SeqSpec object.
Update read information in the SeqSpec object.
+This method updates the read information in the SeqSpec object. If the read does not exist, it will be created.
+read_id (str) – The id of the read.
modality (str | None) – The modality of the read.
primer_id (str | None) – The id of the primer.
is_reverse (bool | None) – Whether the read is reverse.
fastq (Path | list[Path] | None) – The path to the fastq file containing the reads.
min_len (int | None) – The minimum length of the read. If not provided, the minimum length is inferred from the fastq file.
max_len (int | None) – The maximum length of the read. If not provided, the maximum length is inferred from the fastq file.
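As a minimal sketch of how `update_read` might be used to attach several fastq files at once, the helper below loops over a mapping of read ids to fastq paths and calls `update_read(read_id, fastq=...)` for each one, mirroring the calls made in the tutorials. The helper name `attach_fastqs` and the stand-in class `RecordingSpec` are hypothetical; they exist only so the snippet runs without precellar installed. With the real library you would pass a `precellar.SeqSpec` object instead of the stand-in.

```python
def attach_fastqs(assay, fastqs):
    """Call assay.update_read(read_id, fastq=path) for every entry in fastqs.

    `assay` is expected to behave like a precellar.SeqSpec; this helper
    (a hypothetical name) relies only on its update_read method.
    """
    for read_id, path in fastqs.items():
        assay.update_read(read_id, fastq=path)
    return assay


class RecordingSpec:
    """Hypothetical stand-in for SeqSpec that records the calls it receives."""

    def __init__(self):
        self.reads = {}

    def update_read(self, read_id, fastq=None):
        self.reads[read_id] = fastq


spec = attach_fastqs(
    RecordingSpec(),
    {
        "R1": "R1_processed.fq.zst",
        "I1": "I1.fq.zst",
        "R2": "R2_processed.fq.zst",
    },
)
print(spec.reads)
```

Because the documentation states that a missing read is created, the same loop should also work for read ids that are not yet in the template, although that case is not exercised here.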
Align fastq reads to the reference genome and generate unique fragments.
+seqspec (SeqSpec | Path) – A SeqSpec object or file path to the yaml sequencing specification file, see pachterlab/seqspec.
genom_index (Path) – File path to the genome index. The genome index can be created by the make_genome_index function.
modality (str) – The modality of the sequencing data, e.g., “rna” or “atac”.
output_bam (Path | None) – File path to the output bam file. If None, the bam file will not be generated.
output_fragment (Path | None) – File path to the output fragment file. If None, the fragment file will not be generated.
shift_left (int) – The number of bases to shift the left end of the fragment.
shift_right (int) – The number of bases to shift the right end of the fragment.
compression (str | None) – The compression algorithm to use for the output fragment file. If None, the compression algorithm will be inferred from the file extension.
compression_level (int | None) – The compression level to use for the output fragment file.
temp_dir (Path | None) – The temporary directory to use.
num_threads (int) – The number of threads to use.
A dictionary containing the QC metrics of the alignment and fragment generation.
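The returned dictionary can be inspected directly, but a small formatter makes it easier to scan. The sketch below assumes only what the tutorial output shows: keys beginning with `frac_` hold fractions between 0 and 1, and the remaining keys hold counts stored as floats. The function name `format_qc` is hypothetical and not part of precellar; the sample values are copied from the tutorial output.

```python
def format_qc(qc):
    """Return one readable line per QC metric, sorted by metric name.

    Assumption: keys starting with "frac_" are fractions in [0, 1];
    all other keys are counts stored as floats.
    """
    lines = []
    for name in sorted(qc):
        value = qc[name]
        if name.startswith("frac_"):
            lines.append(f"{name}: {value:.2%}")
        else:
            lines.append(f"{name}: {int(value)}")
    return lines


# A few values taken from the tutorial output.
qc = {
    "frac_valid_barcode": 1.0,
    "sequenced_read_pairs": 2500.0,
    "frac_confidently_mapped": 0.8524,
    "sequenced_reads": 5000.0,
}
for line in format_qc(qc):
    print(line)
```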
+Create a genome index from a fasta file.
+fasta (Path) – File path to the fasta file.
genome_prefix (Path) – File path to the genome index.
Remove barcode from the read names of fastq records.
+The old practice of storing barcodes in read names is not recommended. This function extracts barcodes from read names using regular expressions and writes them to a separate fastq file.
+in_fq (Path) – File path to the input fastq file.
out_fq (Path) – File path to the output fastq file.
regex (str) – Extract barcodes from read names of fastq records using regular expressions. Regular expressions should contain exactly one capturing group (parentheses group the regex between them) that matches the barcodes. For example, regex="(..:..:..:..):\\w+$" extracts bd:69:Y6:10 from A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG. You can test your regex on this website: https://regex101.com/.
out_barcode (Path | None) – File path to the output fastq file containing the extracted barcodes. If None, the barcodes will not be written to a file.
left_add (int) – Additional bases to strip from the left side of the barcode.
right_add (int) – Additional bases to strip from the right side of the barcode.
compression (str | None) – Compression algorithm to use. If None, the compression algorithm will be inferred from the file extension.
compression_level (int | None) – Compression level to use.
num_threads (int) – The number of threads to use.
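To make the roles of regex, left_add and right_add concrete, here is a pure-Python sketch of the extraction on a single read name. It is not the library's implementation: it uses Python's `re` module, and it assumes that left_add and right_add widen the span removed from the read name around the captured barcode, which is one reading of the parameter descriptions above. The function name `strip_barcode` is hypothetical. The first read name comes from the tutorial and the second from the regex parameter description.

```python
import re


def strip_barcode(name, regex, left_add=0, right_add=0):
    """Return (barcode, stripped_name) for one read name.

    Assumed semantics: the first capturing group is the barcode, and the
    removed span is the group widened by left_add / right_add characters.
    """
    match = re.search(regex, name)
    if match is None:
        raise ValueError(f"no barcode found in read name: {name!r}")
    start, end = match.span(1)
    start = max(0, start - left_add)
    end = min(len(name), end + right_add)
    return match.group(1), name[:start] + name[end:]


# Tutorial case: the barcode is a prefix followed by a colon, so
# right_add=1 also removes the separator.
barcode, stripped = strip_barcode(
    "CCAGCACAAGCCATCCTATCGT:A00953:155:HVCHLDRXX:1:1101:1036:1031",
    r"^([ACTG]+):",
    right_add=1,
)
print(barcode)   # CCAGCACAAGCCATCCTATCGT
print(stripped)  # A00953:155:HVCHLDRXX:1:1101:1036:1031

# Documentation case: the barcode sits before the final field.
doc_barcode, _ = strip_barcode(
    "A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG",
    r"(..:..:..:..):\w+$",
)
print(doc_barcode)  # bd:69:Y6:10
```

Trying a pattern this way on a handful of read names before running the full conversion is a cheap way to confirm that the capturing group picks up the barcode and nothing else.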