Use at least 20 samples for all splicing calculations #311

Merged: 44 commits, Jun 12, 2015
5448dea
update all NMF and modality calculations to require 20 samples, fix b…
olgabot Jun 9, 2015
304959a
change number of samples to 50 for splicing testing
olgabot Jun 9, 2015
75f749e
use n=20 as default instead of n=0.5
olgabot Jun 9, 2015
5c7cea3
fix tests for splicing modalities and NMF space
olgabot Jun 9, 2015
7761406
update all NMF and modality calculations to require 20 samples, fix b…
olgabot Jun 9, 2015
0a0e9ac
change number of samples to 50 for splicing testing
olgabot Jun 9, 2015
6946175
use n=20 as default instead of n=0.5
olgabot Jun 9, 2015
65703c4
fix tests for splicing modalities and NMF space
olgabot Jun 9, 2015
5bed758
if there are zero non-NA values in a dataset, make sure all its log b…
olgabot Jun 10, 2015
0f57cf9
Merge remote-tracking branch 'origin/splicing_min20' into splicing_min20
olgabot Jun 10, 2015
e742b7c
don't estimate modalities on data with no counts
olgabot Jun 10, 2015
54b6b78
actually get column names instead of boolean series
olgabot Jun 10, 2015
da210e4
get proper columns for ambiguous_columns and use reindex for just col…
olgabot Jun 10, 2015
732aa6d
intersect non_na_columns and ambiguous_columns to get only the column…
olgabot Jun 10, 2015
a6e1f52
modality_assignments is a series
olgabot Jun 10, 2015
548184e
return bayesfactors not estimated modalities
olgabot Jun 10, 2015
30b324c
Return dataframe with the same columns, but everything without enoug…
olgabot Jun 10, 2015
9777613
concatenate 1 and 2 not just 1...
olgabot Jun 10, 2015
b4908c6
separate the fit transform code to have a subroutine for each individ…
olgabot Jun 10, 2015
0917709
Fix concatenating of two- and one-parameter models
olgabot Jun 11, 2015
2b4ab88
Remove old code that was commented out
olgabot Jun 11, 2015
b009f3f
Add assign_modalities function to operate on bayes factor dataframes
olgabot Jun 11, 2015
8d7309d
bayes_factors --> log2_bayes_factors (since that's the truth)
olgabot Jun 11, 2015
1e1c20b
make separate modality_scores function that modality_assignments calls
olgabot Jun 11, 2015
e068e07
modality_scores --> modality_log2bf
olgabot Jun 11, 2015
4ed8906
change naming of modalities in test to the new names
olgabot Jun 11, 2015
98193d2
Change test of fit_Transform to only test getting the scores
olgabot Jun 11, 2015
9f76d4f
Actually test modality_log2bf and modality_assignments
olgabot Jun 11, 2015
512fb85
actually test assign_modalities in compute.splicing
olgabot Jun 11, 2015
7e97269
use estimator instead of self testing object
olgabot Jun 11, 2015
9ce0a6b
self --> splicing so don't use test object
olgabot Jun 11, 2015
bc40383
change model variable names to reflect psi~1, psi~0 renaming
olgabot Jun 11, 2015
5891423
only check modality correctness on the events which had enough sample…
olgabot Jun 11, 2015
24d70c7
The "true modalities" are only what kind of beta distribution was giv…
olgabot Jun 11, 2015
4a6f64c
Add a "positive control" test for modalities. If this one doesn't pas…
olgabot Jun 11, 2015
9c0e72e
Add -v flag for verbose pytest output so we know what's failing even …
olgabot Jun 11, 2015
3743853
simplify names to test and true
olgabot Jun 11, 2015
7c84e4d
correct docstring
olgabot Jun 11, 2015
90e584a
don't do NMF transitions if the dataframe is entirely NA
olgabot Jun 11, 2015
9fece26
remove big nmf space transitions because the choice of "big" is stil …
olgabot Jun 12, 2015
a251090
remove big nmf space transitions because the choice of "big" is stil …
olgabot Jun 12, 2015
52aaafc
pep8 fixes
olgabot Jun 12, 2015
6558f8b
Remove big_nmf_space transitions because the cutoff is too arbitrary
olgabot Jun 12, 2015
769d7a3
Add release notes about modalities, minimum 20 samples for splicing an…
olgabot Jun 12, 2015
2 changes: 1 addition & 1 deletion Makefile
@@ -7,7 +7,7 @@ test:

coverage:
cp testing/matplotlibrc .
-coverage run --source flotilla --omit=test --module py.test
+coverage run --source flotilla --omit=test --module py.test -v
rm matplotlibrc

lint:
13 changes: 11 additions & 2 deletions doc/releases/v0.2.8.txt
@@ -1,5 +1,11 @@
v0.2.8 (........)
-------------------------
+-----------------

+New features
+~~~~~~~~~~~~
+
+- Added ``Study.modality_log2bf()`` which will get the log2 bayes factor
+  for each splicing event's fit to each modality in each phenotype
Bug fixes
~~~~~~~~~
@@ -15,4 +21,7 @@ Miscellaneous
- Change modality estimation to a two-step process: Estimate :math:`\Psi~0` and :math:`\Psi~1`
first, which change 1 parameter of the Beta distribution at a time,
then bimodal and middle, which change both parameters of the Beta
-  distribution at once.
+  distribution at once.
+- Make sure both modality estimation and NMF space calculation use at least
+  20 samples per event
+- Get rid of ``big_nmf_space_transitions`` for now
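The release notes above describe a threshold rule: each splicing event is assigned the modality with the largest log2 Bayes factor, falling back to "ambiguous" when nothing clears the threshold. A minimal pandas sketch of that rule; the event names, values, and threshold of 3 are invented for illustration:

```python
import pandas as pd

# Hypothetical log2 Bayes factors: one row per modality, one column per event
logbf_thresh = 3  # assumed threshold, standing in for self.logbf_thresh
log2_bf = pd.DataFrame(
    {"event1": [8.0, 0.5, 1.0, 0.2],
     "event2": [1.0, 2.0, 0.5, 0.1],
     "event3": [0.3, 0.4, 9.0, 1.2]},
    index=["Psi~1", "Psi~0", "bimodal", "middle"])

# Insert an 'ambiguous' row pinned at the threshold: idxmax only picks a real
# modality when some Bayes factor exceeds the threshold; otherwise we cannot
# reject the uniform null hypothesis and the event stays 'ambiguous'
log2_bf.loc["ambiguous"] = logbf_thresh
assignments = log2_bf.idxmax()
```

With these toy numbers, `assignments` maps event1 to Psi~1 and event3 to bimodal, while event2 falls back to ambiguous because none of its factors reach the threshold.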
106 changes: 69 additions & 37 deletions flotilla/compute/splicing.py
@@ -107,20 +107,52 @@ def _logsumexp(self, logliks):
logsumexps['ambiguous'] = self.logbf_thresh
return logsumexps

-    def _guess_modality(self, logsumexps):
-        """Guess the most likely modality.
+    def assign_modalities(self, log2_bayes_factors, reset_index=False):
+        """Guess the most likely modality for each event

-        If no modalities have logsumexp'd logliks greater than the log Bayes
-        factor threshold, then they are assigned the 'uniform' modality,
-        which is the null hypothesis
-        """
+        For each event that has at least one non-NA value, if no modalities
+        have logsumexp'd logliks greater than the log Bayes factor threshold,
+        then they are assigned the 'ambiguous' modality, because we cannot
+        reject the null hypothesis that these did not come from the uniform
+        distribution.
+
+        Parameters
+        ----------
+        log2_bayes_factors : pandas.DataFrame
+            A (4, n_events) dataframe with bayes factors for the Psi~1, Psi~0,
+            bimodal, and middle modalities. If an event has no bayes factors
+            for any of those modalities, it is ignored
+        reset_index : bool
+            If True, remove the first level of the index from the dataframe.
+            Useful if you are using this function to apply to a grouped
+            dataframe where the first level is something other than the
+            modality, e.g. the celltype
+
+        Returns
+        -------
+        modalities : pandas.Series
+            A (n_events,) series with the most likely modality for each event

-        if all(logsumexps[self.one_param_models.keys()] > self.logbf_thresh):
-            return logsumexps[self.one_param_models.keys()].idxmax()
+        """
+        if reset_index:
+            x = log2_bayes_factors.reset_index(level=0, drop=True)
+        else:
+            x = log2_bayes_factors
+        not_na = (x.notnull() > 0).any()
+        not_na_columns = not_na[not_na].index
+        x.ix['ambiguous', not_na_columns] = self.logbf_thresh
+        return x.idxmax()
+
+    def _fit_transform_one_step(self, data, models):
+        non_na = data.count() > 0
+        non_na_columns = non_na[non_na].index
+        data_non_na = data[non_na_columns]
+        if data_non_na.empty:
+            return pd.DataFrame()
         else:
-            other_models = logsumexps.index.difference(
-                self.one_param_models.keys())
-            return logsumexps[other_models].idxmax()
+            return data_non_na.apply(lambda x: pd.Series(
+                {k: v.logsumexp_logliks(x)
+                 for k, v in models.iteritems()}), axis=0)

def fit_transform(self, data):
"""Get the modality assignments of each splicing event in the data
@@ -133,9 +165,9 @@ def fit_transform(self, data):

Returns
-------
-        modality_assignments : pandas.Series
-            A (n_events,) series of the estimated modality for each splicing
-            event
+        log2_bayes_factors : pandas.DataFrame
+            A (n_modalities, n_events) dataframe of the estimated log2
+            bayes factor for each splicing event, for each modality

Raises
------
@@ -145,29 +177,29 @@
assert np.all(data.values.flat[np.isfinite(data.values.flat)] <= 1)
assert np.all(data.values.flat[np.isfinite(data.values.flat)] >= 0)

-        # Estimate Psi~0/Psi~1 first
-        logsumexp_logliks1 = data.apply(
-            lambda x: pd.Series(
-                {k: v.logsumexp_logliks(x)
-                 for k, v in self.one_param_models.iteritems()}), axis=0)
-        logsumexp_logliks1.ix['ambiguous'] = self.logbf_thresh
-        modality_assignments1 = logsumexp_logliks1.idxmax()
-
-        # Take everything that was ambiguous for included/excluded and estimate
-        # bimodal and middle
-        data2 = data.ix[:, modality_assignments1 == 'ambiguous']
-        logsumexp_logliks2 = data2.apply(
-            lambda x: pd.Series(
-                {k: v.logsumexp_logliks(x)
-                 for k, v in self.two_param_models.iteritems()}), axis=0)
-        logsumexp_logliks2.ix['ambiguous'] = self.logbf_thresh
-        modality_assignments2 = logsumexp_logliks2.idxmax()
-
-        # Combine the results
-        modality_assignments = modality_assignments1
-        modality_assignments[modality_assignments2.index] = \
-            modality_assignments2.values
-        return modality_assignments
+        # Estimate Psi~0/Psi~1 first (only one parameter changes with each
+        # parameterization)
+        logbf_one_param = self._fit_transform_one_step(data,
+                                                       self.one_param_models)
+
+        # Take everything that was below the threshold for included/excluded
+        # and estimate bimodal and middle (two parameters change in each
+        # parameterization)
+        ind = (logbf_one_param < self.logbf_thresh).all()
+        ambiguous_columns = ind[ind].index
+        data2 = data.ix[:, ambiguous_columns]
+        logbf_two_param = self._fit_transform_one_step(data2,
+                                                       self.two_param_models)
+        log2_bayes_factors = pd.concat([logbf_one_param, logbf_two_param],
+                                       axis=0)
+
+        # Make sure the returned dataframe has the same number of columns
+        empty = data.count() == 0
+        empty_columns = empty[empty].index
+        empty_df = pd.DataFrame(np.nan, index=log2_bayes_factors.index,
+                                columns=empty_columns)
+        log2_bayes_factors = pd.concat([log2_bayes_factors, empty_df], axis=1)
+        return log2_bayes_factors


def switchy_score(array):
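The rewritten `fit_transform` above scores every event against the one-parameter models first, then sends only the events whose Bayes factors all stay below the threshold to the two-parameter models, and stacks the two score tables. A toy re-creation of that control flow, with invented stand-in scoring lambdas instead of the PR's real beta-distribution models (every name and number here is illustrative only):

```python
import numpy as np
import pandas as pd

logbf_thresh = 3  # assumed threshold, standing in for self.logbf_thresh

# Stand-in "models": each maps a column of psi values to one score. The real
# code logsumexps log-likelihoods of fitted Beta distributions instead.
one_param_models = {"Psi~1": lambda x: 20 * (x.mean() - 0.5),
                    "Psi~0": lambda x: 20 * (0.5 - x.mean())}
two_param_models = {"bimodal": lambda x: 10 * (x - 0.5).abs().mean(),
                    "middle": lambda x: 10 * (1 - (x - 0.5).abs()).mean()}

def fit_one_step(data, models):
    # One row of scores per model, one column per event with any non-NA data
    non_na = data.count() > 0
    data_non_na = data.loc[:, non_na[non_na].index]
    if data_non_na.empty:
        return pd.DataFrame()
    return data_non_na.apply(lambda col: pd.Series(
        {name: model(col) for name, model in models.items()}), axis=0)

psi = pd.DataFrame({"event1": [0.9, 1.0, 0.95],    # clearly "included"
                    "event2": [0.5, 0.45, 0.55]})  # near 0.5, needs step two

# Step 1: one-parameter models for every event
logbf1 = fit_one_step(psi, one_param_models)
# Step 2: two-parameter models only where step 1 stayed below the threshold
below = (logbf1 < logbf_thresh).all()
logbf2 = fit_one_step(psi.loc[:, below[below].index], two_param_models)
# Stack the two steps; events skipped in step 2 get NaN for those rows
log2_bayes_factors = pd.concat([logbf1, logbf2], axis=0)
```

The concat mirrors the PR's design choice of returning the raw log2 Bayes factor table rather than modality labels, leaving the thresholded assignment to a separate `assign_modalities` step.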
89 changes: 9 additions & 80 deletions flotilla/data_model/base.py
@@ -1053,7 +1053,7 @@ def plot_feature(self, feature_id, sample_ids=None,
phenotype_to_color=None,
phenotype_to_marker=None, nmf_xlabel=None,
nmf_ylabel=None,
-                     nmf_space=False, fig=None, axesgrid=None):
+                     nmf_space=False, fig=None, axesgrid=None, n=20):
"""
Plot the violinplot of a feature. Have the option to show NMF movement
"""
@@ -1101,13 +1101,13 @@ def plot_feature(self, feature_id, sample_ids=None,
phenotype_to_color=phenotype_to_color,
phenotype_to_marker=phenotype_to_marker,
order=phenotype_order, ax=axes[1],
-                        xlabel=nmf_xlabel, ylabel=nmf_ylabel)
+                        xlabel=nmf_xlabel, ylabel=nmf_ylabel, n=n)
except KeyError:
continue
sns.despine()
fig.tight_layout()

-    def nmf_space_positions(self, groupby, n=0.5):
+    def nmf_space_positions(self, groupby, n=20):
"""Calculate NMF-space position of splicing events in phenotype groups

Parameters
@@ -1138,16 +1138,17 @@ def nmf_space_positions(self, groupby, n=0.5):
# lambda x: x if x.count() >= n else pd.Series(np.nan,
# index=x.index))
df = at_least_n_per_group_per_event.groupby(groupby).apply(
-            lambda x: self.binned_nmf_reduced(data=x))
+            lambda x: self.binned_nmf_reduced(data=x) if
+            x.notnull().sum().sum() > 0 else pd.DataFrame())
df = df.swaplevel(0, 1)
df = df.sort_index()
return df

def plot_nmf_space_transitions(self, feature_id, groupby,
phenotype_to_color,
phenotype_to_marker, order, ax=None,
-                                  xlabel=None, ylabel=None):
-        nmf_space_positions = self.nmf_space_positions(groupby)
+                                  xlabel=None, ylabel=None, n=20):
+        nmf_space_positions = self.nmf_space_positions(groupby, n=n)

nmf_space_transitions(nmf_space_positions, feature_id,
phenotype_to_color,
Expand All @@ -1156,7 +1157,7 @@ def plot_nmf_space_transitions(self, feature_id, groupby,

@staticmethod
def transition_distances(positions, transitions):
"""Get NMF distance of features between phenotype transitions
"""Get cartesian distance of phenotype transitions in NMF space

Parameters
----------
@@ -1186,7 +1187,7 @@ def transition_distances(positions, transitions):
pass
return distances

-    def nmf_space_transitions(self, groupby, phenotype_transitions, n=0.5):
+    def nmf_space_transitions(self, groupby, phenotype_transitions, n=20):
"""Get distance in NMF space of different splicing events

Parameters
@@ -1225,78 +1226,6 @@ def nmf_space_transitions(self, groupby, phenotype_transitions, n=0.5):
axis=0)
return nmf_space_transitions

-    def big_nmf_space_transitions(self, groupby, phenotype_transitions, n=0.5):
-        """Get features whose change in NMF space between phenotypes is large
-
-        Parameters
-        ----------
-        groupby : mappable
-            A sample id to phenotype group mapping
-        phenotype_transitions : list of length-2 tuples of str
-            List of ('phenotype1', 'phenotype2') transitions whose change in
-            distribution you are interested in
-        n : int
-            Minimum number of samples per phenotype, per event
-
-        Returns
-        -------
-        big_transitions : pandas.DataFrame
-            A (n_events, n_transitions) dataframe of the NMF distances between
-            splicing events
-        """
-        nmf_space_transitions = self.nmf_space_transitions(
-            groupby, phenotype_transitions, n=n)
-
-        # get the mean and standard dev of the whole array
-        n = nmf_space_transitions.count().sum()
-        mean = nmf_space_transitions.sum().sum() / n
-        std = np.sqrt(np.square(nmf_space_transitions - mean).sum().sum() / n)
-
-        big_transitions = nmf_space_transitions[
-            nmf_space_transitions > (mean + std)].dropna(how='all')
-        return big_transitions
-
-    def plot_big_nmf_space_transitions(self, phenotype_groupby,
-                                       phenotype_transitions,
-                                       phenotype_order, color,
-                                       phenotype_to_color,
-                                       phenotype_to_marker, n=0.5):
-        """Violinplots and NMF transitions of features different in phenotypes
-
-        Plot violinplots and NMF-space transitions of features that have large
-        NMF-space transitions between different phenotypes
-
-        Parameters
-        ----------
-        n : int
-            Minimum number of samples per phenotype, per event
-
-        Returns
-        -------
-
-        Raises
-        ------
-        """
-        big_transitions = self.big_nmf_space_transitions(phenotype_groupby,
-                                                         phenotype_transitions,
-                                                         n=n)
-        nrows = big_transitions.shape[0]
-        ncols = 2
-        figsize = 4 * ncols, 4 * nrows
-
-        fig, axesgrid = plt.subplots(nrows=nrows, ncols=ncols,
-                                     figsize=figsize)
-        if nrows == 1:
-            axesgrid = [axesgrid]
-        for feature_id in big_transitions.index:
-            self.plot_feature(feature_id, phenotype_groupby=phenotype_groupby,
-                              phenotype_order=phenotype_order, color=color,
-                              phenotype_to_color=phenotype_to_color,
-                              phenotype_to_marker=phenotype_to_marker,
-                              nmf_space=True, fig=fig, axesgrid=axesgrid)

def plot_two_samples(self, sample1, sample2, fillna=None,
**kwargs):
"""
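The recurring change across base.py is replacing the old `n=0.5` default with `n=20`, so NMF-space positions are only computed for events that have at least n measured samples in each phenotype group. A small sketch of that per-group masking; the psi values, sample ids, and phenotype names are invented, and n=2 stands in for the PR's 20 just to keep the toy dataset small:

```python
import numpy as np
import pandas as pd

n = 2  # the PR uses 20; 2 keeps this example tiny

# Hypothetical (samples x events) psi matrix and sample-to-phenotype mapping
psi = pd.DataFrame(
    {"event1": [0.1, 0.2, 0.15, 0.9, 0.8],
     "event2": [0.5, np.nan, np.nan, 0.4, 0.6]},
    index=["s1", "s2", "s3", "s4", "s5"])
groupby = pd.Series(["iPSC", "iPSC", "iPSC", "NPC", "NPC"], index=psi.index)

def mask_low_n(df):
    # Within one phenotype group, blank out any event with fewer than n
    # measured (non-NA) samples so downstream NMF never sees it
    enough = df.count() >= n
    out = df.copy()
    out.loc[:, ~enough] = np.nan
    return out

masked = psi.groupby(groupby, group_keys=False).apply(mask_low_n)
```

Here event2 is blanked for the iPSC group, where only one sample was measured, but kept for NPC; in the PR the analogous filter runs inside `nmf_space_positions` before the grouped `binned_nmf_reduced` call.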