scran, in Python

Overview

The scranpy package provides Python bindings to the single-cell analysis methods in the libscran C++ libraries. It performs the standard steps in a typical single-cell analysis including quality control, normalization, feature selection, dimensionality reduction, clustering and marker detection. This package is effectively a mirror of its counterparts in Javascript (scran.js) and R (scrapper), which are based on the same underlying C++ libraries and concepts.

Quick start

Let's fetch a dataset from the scrnaseq package:

import scrnaseq 
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True)
print(sce)
## class: SingleCellExperiment
## dimensions: (20006, 3005)
## assays(1): ['counts']
## row_data columns(1): ['featureType']
## row_names(20006): ['Tspan12', 'Tshz1', 'Fnbp1l', ..., 'mt-Rnr2', 'mt-Rnr1', 'mt-Nd4l']
## column_data columns(9): ['tissue', 'group #', 'total mRNA mol', 'well', 'sex', 'age', 'diameter', 'level1class', 'level2class']
## column_names(3005): ['1772071015_C02', '1772071017_G12', '1772071017_A05', ..., '1772063068_D01', '1772066098_A12', '1772058148_F03']
## main_experiment_name: gene
## reduced_dims(0): []
## alternative_experiments(2): ['repeat', 'ERCC']
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

Then we call scranpy's analyze() functions, with some additional information about the mitochondrial subset for quality control purposes.

import scranpy
results = scranpy.analyze(
    sce,
    rna_subsets = {
        "mito": [name.startswith("mt-") for name in sce.get_row_names()]
    }
)

This will perform all of the usual steps for a routine single-cell analysis, as described in Bioconductor's Orchestrating single cell analysis book. It returns an object containing clusters, t-SNEs, UMAPs, marker genes, and so on:

print(results.clusters)
## [ 0  0  0 ...  6  6 13]

print(results.tsne)
## [[  6.24189264   6.12559936   5.41776875 ...  20.07822751  18.25022123
##    14.78338538]
##  [-28.82249121 -28.18510674 -28.92849565 ...   7.73694327   3.70750309
##     7.13103324]]

print(results.umap)
## [[ 9.84396648  9.73148155  9.83376408 ... -6.64551735 -5.74155378
##   -4.41887522]
##  [-1.26350224 -1.16540933 -1.13979638 ... -5.63315582 -4.83151293
##   -6.02484226]]

first_markers = results.rna_markers.to_biocframes(summaries=["median"])[0]
first_markers.set_row_names(results.rna_row_names, in_place=True)
print(first_markers)
## BiocFrame with 20006 rows and 6 columns
##                        mean            detected     cohens_d_median          auc_median   delta_mean_median delta_detected_median
##          <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>    <ndarray[float64]>
## Tspan12 0.35759151503119846  0.3157894736842105 0.31138667468315545  0.5989624247185128 0.31138667468315545   0.31138667468315545
##   Tshz1  0.5997779968274406 0.41776315789473684 0.36865087228075244  0.6031352215283973 0.36865087228075244   0.36865087228075244
##  Fnbp1l  1.1660581606996154              0.6875  0.7644031115934953  0.6905056759545924  0.7644031115934953    0.7644031115934953
##                         ...                 ...                 ...                 ...                 ...                   ...
## mt-Rnr2   6.966227511583628  0.9967105263157895 -0.7666238430581961  0.2928277982073087 -0.7666238430581961   -0.7666238430581961
## mt-Rnr1   4.914541788016454  0.9769736842105263 -0.4847704371628273  0.3834708208344696 -0.4847704371628273   -0.4847704371628273
## mt-Nd4l  3.2901199968427246  0.9342105263157894 -0.5903983282435646 0.30724666142969365 -0.5903983282435646   -0.5903983282435646

Users can also convert the results into a SingleCellExperiment for easier manipulation:

print(results.to_singlecellexperiment())
## class: SingleCellExperiment
## dimensions: (20006, 2874)
## assays(2): ['filtered', 'normalized']
## row_data columns(5): ['mean', 'variance', 'fitted', 'residual', 'is_highly_variable']
## row_names(20006): ['Tspan12', 'Tshz1', 'Fnbp1l', ..., 'mt-Rnr2', 'mt-Rnr1', 'mt-Nd4l']
## column_data columns(5): ['sum', 'detected', 'subset_proportion_mito', 'size_factors', 'clusters']
## column_names(2874): ['1772071015_C02', '1772071017_G12', '1772071017_A05', ..., '1772066097_D04', '1772063068_D01', '1772066098_A12']
## main_experiment_name:
## reduced_dims(3): ['pca', 'tsne', 'umap']
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

Check out the reference documentation for more details.

Multiple batches

To demonstrate, let's grab two pancreas datasets from the scrnaseq package. Each dataset represents a separate batch of cells generated in different studies.

import scrnaseq 
gsce = scrnaseq.fetch_dataset("grun-pancreas-2016", "2023-12-14", realize_assays=True)
msce = scrnaseq.fetch_dataset("muraro-pancreas-2016", "2023-12-19", realize_assays=True)

They don't have the same features, so we'll just take the intersection of their row names before combining them into a single SingleCellExperiment object:

import biocutils
common = biocutils.intersect(gsce.get_row_names(), msce.get_row_names())
combined = biocutils.relaxed_combine_columns(
    gsce[biocutils.match(common, gsce.get_row_names()), :],
    msce[biocutils.match(common, msce.get_row_names()), :]
)
print(combined)
## class: SingleCellExperiment
## dimensions: (18499, 4800)
## assays(1): ['counts']
## row_data columns(2): ['symbol', 'chr']
## row_names(18499): ['A1BG-AS1__chr19', 'A1BG__chr19', 'A1CF__chr10', ..., 'ZYX__chr7', 'ZZEF1__chr17', 'ZZZ3__chr1']
## column_data columns(4): ['donor', 'sample', 'label', 'plate']
## column_names(4800): ['D2ex_1', 'D2ex_2', 'D2ex_3', ..., 'D30-8_94', 'D30-8_95', 'D30-8_96']
## main_experiment_name: endogenous
## reduced_dims(0): []
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

We can now perform a batch-aware analysis, where the blocking factor is also used in relevant functions to avoid problems with batch effects.

import scranpy
block = ["grun"] * gsce.shape[1] + ["muraro"] * msce.shape[1]
results = scranpy.analyze(combined, block=block) # no mitochondrial genes in this case...

This yields mostly the same set of results as before, but with an extra MNN-corrected embedding for clustering, visualization, etc.

results.mnn_corrected.corrected
## array([[-1.87690275e+01, -2.20133721e+01, -2.01364711e+01, ...,
##          1.60988874e+01, -2.10494187e+01, -9.41325421e+00],
##        [ 9.95069366e+00,  1.12168142e+01,  1.40745981e+01, ...,
##         -5.63689417e+00, -1.46003926e+01, -4.02325382e+00],
##        [ 1.17917046e+01,  8.40756681e+00,  1.24557851e+01, ...,
##          3.65281722e+00, -1.13280613e+01, -1.12939448e+01],
##        ...,
##        [-4.20177077e+00,  3.64443391e-01,  1.13834851e+00, ...,
##          1.43898885e-02, -2.24228270e+00, -5.89749453e-01],
##        [-2.49456306e+00,  6.82624452e-01,  2.30363317e+00, ...,
##          1.09145269e+00,  3.17776365e+00,  8.27058276e-01],
##        [-2.03562222e+00,  2.04701389e+00,  5.64774034e-01, ...,
##          4.31078606e-01, -4.02375136e-01,  8.52493315e-01]],
##       shape=(25, 3984))

Multiple modalities

Let's grab a 10X Genomics immune profiling dataset (see here), which contains count data for the entire transcriptome and targeted proteins:

import singlecellexperiment
sce = singlecellexperiment.read_tenx_h5("immune_3.0.0-tenx.h5", realize_assays=True)
sce.set_row_names(sce.get_row_data().get_column("id"), in_place=True)
## class: SingleCellExperiment
## dimensions: (33555, 8258)
## assays(1): ['counts']
## row_data columns(7): ['feature_type', 'genome', 'id', 'name', 'pattern', 'read', 'sequence']
## row_names(33555): ['ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', ..., 'IgG2b', 'CD127', 'CD15']
## column_data columns(1): ['barcodes']
## column_names(0):
## main_experiment_name:
## reduced_dims(0): []
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

We split it to genes and ADTs:

feattypes = sce.get_row_data().get_column("feature_type")
gene_data = sce[[x == "Gene Expression" for x in feattypes],:]
adt_data = sce[[x == "Antibody Capture" for x in feattypes],:]

And now we can run the analysis:

import scranpy
results = scranpy.analyze(
    gene_data,
    adt_x = adt_data, 
    rna_subsets = { 
        "mito": [n.startswith("MT-") for n in gene_data.get_row_data().get_column("name")]
    },
    adt_subsets = {
        "igg": [n.startswith("IgG") for n in adt_data.get_row_data().get_column("name")]
    }
)

This returns ADT-specific results in the relevant fields, as well as a set of combined PCs for use in clustering, visualization, etc.

print(results.adt_size_factors)
## [0.79359408 0.79410332 0.89536413 ... 0.79207839 0.66492723 0.76847637]

print(results.combined_pca.combined)
## [[-9.97603155e+00 -1.04045057e+01 -1.26408576e+01 ... -1.29361354e+01
##   -1.09887392e+01 -1.08070608e+01]
##  [ 7.47726554e+00  6.77629373e+00  1.78091509e+00 ...  2.22256539e+00
##    5.96667219e+00  7.10437993e+00]
##  [-2.63898263e+00 -1.24485522e+00  5.51002546e+00 ...  5.21037673e+00
##   -5.54233035e+00 -3.38828724e+00]
##  ...
##  [-2.04699441e-01 -4.38991650e-01 -2.87170731e+00 ...  2.36527395e+00
##    7.05969255e-01 -2.46180209e-01]
##  [ 4.75688909e-01 -1.54557081e-01 -1.30053159e+00 ...  2.81492567e+00
##    1.21607502e+00 -3.12194853e-01]
##  [ 8.56575012e-02  8.74924626e-03 -7.17362957e-04 ...  1.65769854e-01
##    1.73927253e-01  5.04057044e-02]]

second_markers = results.adt_markers.to_biocframes(summaries=["min_rank"])[1]
second_markers.set_row_names(results.adt_row_names, in_place=True)
print(second_markers)
## BiocFrame with 17 rows and 6 columns
##                      mean           detected cohens_d_min_rank      auc_min_rank delta_mean_min_rank delta_detected_min_rank
##        <ndarray[float64]> <ndarray[float64]> <ndarray[uint32]> <ndarray[uint32]>   <ndarray[uint32]>       <ndarray[uint32]>
##    CD3  11.04397358391642                1.0                 1                 1                   1                       1
##   CD19  4.072383130863625                1.0                 4                 4                   4                       4
## CD45RA 10.481785289114054                1.0                 1                 1                   1                       1
##                       ...                ...               ...               ...                 ...                     ...
##  IgG2b 2.8690172565558263 0.9911190053285968                 6                 5                   6                       6
##  CD127  6.258223461814724                1.0                 2                 2                   2                       2
##   CD15  5.366264191077669                1.0                 4                 4                   4                       4

Customizing the analysis

Most parameters can be changed by modifying the relevant arguments in analyze(). For example:

import scrnaseq 
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True)
is_mito = [name.startswith("mt-") for name in sce.get_row_names()]

import scranpy
results = scranpy.analyze(
    sce,
    rna_subsets = {
        "mito": is_mito
    },
    build_snn_graph_options = {
        "num_neighbors": 10
    },
    cluster_graph_options = {
        "multilevel_resolution": 2
    },
    run_pca_options = {
        "number": 15
    },
    run_tsne_options = {
        "perplexity": 25
    },
    run_umap_options = {
        "min_dist": 0.05
    }
)

For finer control, users can call each step individually via lower-level functions. A typical RNA analysis might be implemented as:

counts = sce.assay(0)
qcmetrics = scranpy.compute_rna_qc_metrics(counts, subsets=is_mito)
thresholds = scranpy.suggest_rna_qc_thresholds(qcmetrics)
filter = scranpy.filter_rna_qc_metrics(thresholds, metrics)

import delayedarray # avoid an actual copy of the matrix.
filtered = delayedarray.DelayedArray(rna_x)[:,filter]

sf = scranpy.center_size_factors(qcmetrics.sum[filter])
normalized = scranpy.normalize_counts(filtered, sf)

vardf = scranpy.model_gene_variances(normalized)
hvgs = scranpy.choose_highly_variable_genes(vardf.residual)
pca = scranpy.run_pca(normalized[hvgs,:])

nn_out = scranpy.run_all_neighbor_steps(pca.components)
clusters = nn_out.cluster_graph.membership
markers = scranpy.score_markers(normalized, groups=clusters)

Check out analyze.py for more details.

Name	Name	Last commit message	Last commit date
Latest commit LTLA Provide option to skip DelayedArray wrapper before normalization. Mar 17, 2025 00283a4 · Mar 17, 2025 History 125 Commits
.github/workflows	.github/workflows	Use the trusted publisher workflow to publish to PyPI.	Jan 3, 2025
docs	docs	Overhaul to use the new libscran libraries via pybind11 (#83 )	Jan 3, 2025
lib	lib	Overhaul to use the new libscran libraries via pybind11 (#83 )	Jan 3, 2025
src/scranpy	src/scranpy	Provide option to skip DelayedArray wrapper before normalization.	Mar 17, 2025
tests	tests	Provide option to skip DelayedArray wrapper before normalization.	Mar 17, 2025
.coveragerc	.coveragerc	Initial commit	Jun 28, 2023
.gitignore	.gitignore	Removed an accidental commit of a stub lib file.	Jan 8, 2025
.gitmodules	.gitmodules	Switch to assorthead for the vendored C++ libraries. (#51 )	Sep 11, 2023
.pre-commit-config.yaml	.pre-commit-config.yaml	Overhaul to use the new libscran libraries via pybind11 (#83 )	Jan 3, 2025
.readthedocs.yml	.readthedocs.yml	Initial commit.	Jul 25, 2023
AUTHORS.md	AUTHORS.md	Add pre-commit config (#42 )	Aug 30, 2023
CHANGELOG.md	CHANGELOG.md	Provide option to skip DelayedArray wrapper before normalization.	Mar 17, 2025
CONTRIBUTING.md	CONTRIBUTING.md	Initial commit	Jun 28, 2023
LICENSE.txt	LICENSE.txt	Update license	Aug 11, 2023
README.md	README.md	Updated README links to libscran.	Jan 3, 2025
pyproject.toml	pyproject.toml	New release with recompiled binaries for updated C++ libs.	Jan 8, 2025
setup.cfg	setup.cfg	Overhaul to use the new libscran libraries via pybind11 (#83 )	Jan 3, 2025
setup.py	setup.py	Overhaul to use the new libscran libraries via pybind11 (#83 )	Jan 3, 2025
tox.ini	tox.ini	Initial commit.	Jul 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scran, in Python

Overview

Quick start

Multiple batches

Multiple modalities

Customizing the analysis

About

Releases

Packages

Contributors 3

Languages

License

libscran/scranpy

Folders and files

Latest commit

History

Repository files navigation

scran, in Python

Overview

Quick start

Multiple batches

Multiple modalities

Customizing the analysis

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages