Rank-biserial correlation coefficient for big data marker detection

Teichlab/rbcde

Rank-biserial correlation

RBCDE is a Python implementation of the rank-biserial correlation coefficient (Cureton, 1956), an effect size equivalent of the Wilcoxon test (Kerby, 2014). The Wilcoxon test was in turn found to perform well on single cell data problems (Soneson, 2018), and effect size analyses are recommended for problems with large sample sizes (Sullivan, 2012). The package provides both a scanpy-compatible version and a standalone function that ingests a data matrix and an assignment vector.
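For orientation, the coefficient can be derived from the Mann-Whitney U statistic by rescaling it to the range [-1, 1]. The sketch below is a minimal illustration of that textbook relationship using numpy and scipy. The function name `rank_biserial` is hypothetical, and this is not necessarily how RBCDE computes the value internally.

```python
import numpy as np
from scipy.stats import rankdata


def rank_biserial(group, rest):
    """Rank-biserial correlation between two samples, in [-1, 1]."""
    group = np.asarray(group, dtype=float)
    rest = np.asarray(rest, dtype=float)
    n1, n2 = group.size, rest.size
    # Rank all observations together; ties receive their average rank.
    ranks = rankdata(np.concatenate([group, rest]))
    # Mann-Whitney U statistic for the first sample, from its rank sum.
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    # Rescale U from [0, n1 * n2] to [-1, 1].
    return 2 * u1 / (n1 * n2) - 1


# Every value in the first sample exceeds every value in the second.
print(rank_biserial([3, 4, 5], [1, 2]))
```

A value near 1 means the first sample almost always ranks above the second, a value near -1 means the opposite, and 0 means neither sample tends to rank higher.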

Citation

Stay tuned!

Installation

RBCDE depends on numpy, scipy and pandas. The package is available on PyPI and can be installed with pip:

pip3 install rbcde

Usage and Documentation

RBCDE can slot into a scanpy workflow. It accepts an object with per-cell normalised data stored as a layer or in .raw, and the desired clustering/grouping vector as an .obs column. If mirroring the scanpy PBMC tutorial, the output of sc.pp.log1p() is a generally useful data representation. The output of the earlier step, sc.pp.normalize_total(), yields identical RBCDE results, as the log transformation is monotonic and the coefficient depends only on ranks.

import rbcde
rbcde.RBC(adata)
degs, plot_dict = rbcde.filter_markers(adata)
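The equivalence of sc.pp.normalize_total() and sc.pp.log1p() output mentioned above can be checked without scanpy. The following sketch uses simulated values (an assumption, not real expression data) to show that log1p leaves the per-gene ranking of cells unchanged, which is all a rank-based coefficient sees.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
# Hypothetical normalised expression values for one gene across 200 cells.
normalised = rng.gamma(shape=2.0, scale=50.0, size=200)
# Mimic dropout: a block of cells with zero expression (tied values).
normalised[:50] = 0.0
logged = np.log1p(normalised)

# log1p is strictly increasing, so the ordering of cells, ties included,
# is the same before and after the transformation.
same_ranks = np.array_equal(rankdata(normalised), rankdata(logged))
print(same_ranks)
```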

rbcde.RBC()'s clus_key argument controls which .obs column is used for the grouping, and a combination of layer and use_raw instructs the function to retrieve expression data from .X, .layers or .raw. rbcde.filter_markers() takes the computed coefficient values and thresholds them into a list of per-cluster markers. The thresholding is controlled via the thresh argument, with a range of literature critical values available. The second output is a helper dictionary in the format that scanpy plotting functions accept as their var_names argument. Consult the demonstration notebook for a usage example.
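To make the thresholding step concrete, here is a small pandas sketch of turning a coefficient table into a per-cluster dictionary of the kind scanpy plotting functions accept as var_names. The table values, the gene and cluster names, the function `markers_by_cluster` and the 0.5 cut-off are all made up for illustration; rbcde.filter_markers() may store and filter its results differently.

```python
import pandas as pd

# Hypothetical coefficient table: rows are genes, columns are clusters.
coefficients = pd.DataFrame(
    {"cluster_0": [0.92, 0.10, -0.30], "cluster_1": [-0.15, 0.81, 0.05]},
    index=["gene_a", "gene_b", "gene_c"],
)


def markers_by_cluster(coefs, thresh):
    """Map each cluster to the genes whose coefficient exceeds thresh."""
    return {
        cluster: coefs.index[coefs[cluster] > thresh].tolist()
        for cluster in coefs.columns
    }


# 0.5 is an arbitrary cut-off chosen for this example only.
plot_dict = markers_by_cluster(coefficients, thresh=0.5)
print(plot_dict)
```

A dictionary of this shape groups the plotted genes by cluster label when passed as var_names to a scanpy plotting function.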

Analogous functions exist for scanpy-independent data analysis, and can ingest any data matrix with variables as rows and observations as columns along with a vector of cluster/group assignments for the observations and a second vector of variable names. The filtering function does not produce a helper dictionary, only yielding the marker data frame.

results = rbcde.matrix.RBC(data, clusters, vars)
degs = rbcde.matrix.filter_markers(results)
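The expected input layout (variables as rows, observations as columns) can be illustrated with an independent one-vs-rest computation. This is a sketch under stated assumptions rather than the package's own code: the function name `rbc_one_vs_rest` and the genes-by-clusters output layout are hypothetical, and each cluster is compared against all remaining observations.

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata


def rbc_one_vs_rest(data, clusters, var_names):
    """Rank-biserial coefficient of each cluster versus all other observations.

    data has shape (n_variables, n_observations); clusters has one label
    per observation; var_names has one name per variable.
    """
    data = np.asarray(data, dtype=float)
    clusters = np.asarray(clusters)
    # Rank each variable (row) across all observations once.
    ranks = rankdata(data, axis=1)
    n_obs = data.shape[1]
    out = {}
    for label in np.unique(clusters):
        mask = clusters == label
        n1 = mask.sum()
        n2 = n_obs - n1  # assumes at least two clusters, so n2 > 0
        # Mann-Whitney U of the cluster for every variable at once.
        u1 = ranks[:, mask].sum(axis=1) - n1 * (n1 + 1) / 2
        out[label] = 2 * u1 / (n1 * n2) - 1
    return pd.DataFrame(out, index=var_names)


# Two variables, four observations split into clusters "a" and "b".
data = [[5, 6, 1, 2],
        [1, 2, 5, 6]]
results = rbc_one_vs_rest(data, ["a", "a", "b", "b"], ["g1", "g2"])
print(results)
```

Ranking each row once and reusing the ranks for every cluster avoids re-sorting the data per comparison, which matters when the matrix is large.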

An HTML render of the RBCDE function docstrings, detailing all the parameters, can be accessed at ReadTheDocs.

Example Notebook

rbc_demo.ipynb computes the rank-biserial correlation coefficient for demonstration 10X PBMC data, yielding a similar standard of markers to established approaches while reporting only ~13% of the gene total. This more compact summary does not require any heuristic filtering to obtain. For the 15-cell megakaryocyte cluster, RBCDE identifies more markers than hypothesis testing, illustrating the utility of effect size for characterising rare subpopulations. The full marker export yielded by the analysis can be found at examples/markers.csv
