[WIP] added a silhouette calculation (#876)
* Develop (#827)

* Merged into the wrong branch without noticing :( (#814)

* use better conda link (#799)

* Estimated filtering fix (#813)

* oops

* fix testing and set a max number of filtered reads

* apparently a bunch of things were getting skipped

* fix wrappers

* update computeMatrix wrapper

* Decrease memory needs (#817)

* Use an iterator to not blow memory up

* Update a bit more

* The GC bias stuff is all deprecated, I'm not fixing that old code

* Cache resulting counts rather than just decreasing the bin size (#818)

* Cache resulting counts rather than just decreasing the bin size

* sanity check

* Implement #815

* [skip ci] update change log

* Implement #816 (#825)

* Implement #816

* expose option

* Add a test using pseudocounts and skipZeroOverZero

* syntax

* Fix tests

* Make --skipZeroOverZero a galaxy macro and add to bigwigCompare

* [ci skip] a bit of formatting

* Fix #822 (#826)

* fixes linting issues (#837)

* Delete #test.bg# (#859)

File is removed upon clean.

* Release 3.3.1 (#872)

* copy changes from bgruening

* this file should not be here since years (#845)

* Add Arabidopsis TAIR10 (A_thaliana_Jun_2009) (#853)

Using output from:
faCount A_thaliana_Jun_2009.fa 
#seq	len	A	C	G	T	N	cpg
Chr1	30427671	9709674	5435374	5421151	9697113	164359	697370
Chr2	19698289	6315641	3542973	3520766	6316348	2561	457572
Chr3	23459830	7484757	4258333	4262704	7448059	5977	559031
Chr4	18585056	5940546	3371349	3356091	5914038	3032	439585
Chr5	26975502	8621974	4832253	4858759	8652238	10278	630299
ChrC	154478	48546	28496	27570	49866	0	4639
ChrM	366924	102464	82661	81609	100190	0	13697
total	119667750	38223602	21551439	21528650	38177852	186207	2802193
hpc $ python
Python 2.7.11 (default, Jul 25 2019, 12:10:26) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 119667750-186207
119481543

* Fix python version in Azure tests  (#860)

* Fix python version

* Update azure-pipelines.yml

* fixed typo (#864)

* fixed typo

* Update test images, skip testing if the wrong matplotlib version is used (#865)

* Update test images, skip testing if the wrong matplotlib version is used

* Update test-template.yml

* linting

* can't conda activate on azure

* now the heatmap is correct and the profile is wrong

* lint

* only one test should fail now

* Fix #844

* Should fix one test at least

* fix last tests

* fix #838 (#843)

* fix #838

* fixes

* Update CHANGES.txt

* Close #868 #867 and #851 (#869)

* Fix #868

* Fix #867

* Default ALL the things!

* Fix #866 (#871)

* release 3.3.1

* try github actions

* each action is a file

* OK, that's inflexible

* OK, the action.yml thing is a mess

* syntax

* ok, try this

* uses

* spacing

* ok

* do anchors work?

* boo, so duplicative!

* oops

* maybe this will work for pypi

* ensure dist is empty

* nev

* rename

* bump version number

* added a silhouette calculation

* remove sklearn, implement with scipy and numpy

* Update requirements.txt

* Update setup.py

* Update heatmapper.py

* update galaxy wrapper

* Fix run time issues

* refactor, the order matters here.

* removing debugging stuff

* Update heatmapper.py

* Update plotHeatmap.py

Co-authored-by: Devon Ryan <dpryan79@users.noreply.github.com>
Co-authored-by: Björn Grüning <bjoern@gruenings.eu>
Co-authored-by: Steffen Möller <steffen_moeller@gmx.de>
4 people committed Jan 21, 2020
1 parent 3115f1b commit e8856ce
Showing 4 changed files with 73 additions and 11 deletions.
47 changes: 44 additions & 3 deletions deeptools/heatmapper.py
@@ -991,7 +991,10 @@ def save_matrix_values(self, file_name):
def save_BED(self, file_handle):
boundaries = np.array(self.matrix.group_boundaries)
# Add a header
file_handle.write("#chrom\tstart\tend\tname\tscore\tstrand\tthickStart\tthickEnd\titemRGB\tblockCount\tblockSizes\tblockStart\tdeepTools_group\n")
file_handle.write("#chrom\tstart\tend\tname\tscore\tstrand\tthickStart\tthickEnd\titemRGB\tblockCount\tblockSizes\tblockStart\tdeepTools_group")
if self.matrix.silhouette is not None:
file_handle.write("\tsilhouette")
file_handle.write("\n")
for idx, region in enumerate(self.matrix.regions):
# the label id corresponds to the last boundary
# that is smaller than the region index.
@@ -1013,11 +1016,14 @@ def save_BED(self, file_handle):
region[5],
region[4]))
file_handle.write(
'\t{0}\t{1}\t{2}\t{3}\n'.format(
'\t{0}\t{1}\t{2}\t{3}'.format(
len(region[1]),
",".join([str(int(y) - int(x)) for x, y in region[1]]),
",".join([str(int(x) - int(starts[0])) for x, y in region[1]]),
self.matrix.group_labels[label_idx]))
if self.matrix.silhouette is not None:
file_handle.write("\t{}".format(self.matrix.silhouette[idx]))
file_handle.write("\n")
file_handle.close()

@staticmethod
@@ -1053,6 +1059,23 @@ def get_num_individual_matrix_cols(self):
return matrixCols


def computeSilhouetteScore(d, idx, labels):
"""
Given a square distance matrix with NaN diagonals, compute the silhouette score
of a given row (idx). Each row should have an associated label (labels).
"""
keep = ~np.isnan(d[idx, ])
foo = np.bincount(labels[keep], weights=d[idx, ][keep])
groupSizes = np.bincount(labels[keep])
intraIdx = labels[idx]
if groupSizes[intraIdx] == 1:
return 0
intra = foo[labels[idx]] / groupSizes[intraIdx]
interMask = np.arange(len(foo))[np.arange(len(foo)) != labels[idx]]
inter = np.min(foo[interMask] / groupSizes[interMask])
return (inter - intra) / max(inter, intra)


class _matrix(object):
"""
class to hold heatmapper matrices
@@ -1082,6 +1105,7 @@ def __init__(self, regions, matrix, group_boundaries, sample_boundaries,
self.sample_boundaries = sample_boundaries
self.sort_method = None
self.sort_using = None
self.silhouette = None

if group_labels is None:
self.group_labels = ['group {}'.format(x)
@@ -1225,7 +1249,7 @@ def sort_groups(self, sort_using='mean', sort_method='no', sample_list=None):
self.regions = _sorted_regions
self.set_sorting_method(sort_method, sort_using)

def hmcluster(self, k, method='kmeans', clustering_samples=None):
def hmcluster(self, k, evaluate_silhouette=True, method='kmeans', clustering_samples=None):
matrix = np.asarray(self.matrix)
matrix_to_cluster = matrix
if clustering_samples is not None:
@@ -1295,8 +1319,25 @@ def hmcluster(self, k, method='kmeans', clustering_samples=None):

self.regions = _clustered_regions
self.matrix = np.vstack(_clustered_matrix)

return idx

def computeSilhouette(self, k):
if k > 1:
from scipy.spatial.distance import pdist, squareform

silhouette = np.repeat(0.0, self.group_boundaries[-1])
groupSizes = np.subtract(self.group_boundaries[1:], self.group_boundaries[:-1])
labels = np.repeat(np.arange(k), groupSizes)

d = pdist(self.matrix)
d2 = squareform(d)
np.fill_diagonal(d2, np.nan) # This excludes the diagonal
for idx in range(len(labels)):
silhouette[idx] = computeSilhouetteScore(d2, idx, labels)
sys.stderr.write("The average silhouette score is: {}\n".format(np.mean(silhouette)))
self.silhouette = silhouette

def removeempty(self):
"""
removes matrix rows containing only zeros or nans
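
The new helper implements the standard silhouette, s(i) = (b - a) / max(a, b), where a is the mean distance from region i to the other regions of its own cluster and b is the smallest mean distance to any other cluster. A minimal sketch of its behaviour on a toy distance matrix (this assumes a deepTools install that already contains this commit, so that computeSilhouetteScore is importable; everything else is plain numpy/scipy):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from deeptools.heatmapper import computeSilhouetteScore  # added by this commit

# Toy data: two well-separated clusters of three 1-D "regions" each.
mat = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

d = squareform(pdist(mat))   # square distance matrix
np.fill_diagonal(d, np.nan)  # exclude self-distances, as computeSilhouette() does

scores = [computeSilhouetteScore(d, i, labels) for i in range(len(labels))]
print(np.mean(scores))       # ~0.97: every region sits firmly inside its cluster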
11 changes: 11 additions & 0 deletions deeptools/parserCommon.py
@@ -505,6 +505,17 @@ def heatmapperOptionalArgs(mode=['heatmap', 'profile'][0]):
'fail with an error if a cluster has very few members compared to the '
'total number of regions.',
type=int)
cluster.add_argument(
'--silhouette',
help='Compute the silhouette score for regions. This is only'
' applicable if clustering has been performed. The silhouette score'
' is a measure of how similar a region is to other regions in the'
' same cluster as opposed to those in other clusters. It will be reported'
' in the final column of the BED file with regions. The '
'silhouette evaluation can be very slow when you have more '
'than 100 000 regions.',
action='store_true'
)

optional = parser.add_argument_group('Optional arguments')

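
As a usage sketch (file names here are placeholders), the new flag is meant to be combined with one of the existing clustering options:

plotHeatmap -m matrix.gz -out heatmap.png \
    --kmeans 3 --silhouette \
    --outFileSortedRegions clustered_regions.bed

The per-region scores then appear in the final column of clustered_regions.bed, and the mean score is printed to stderr.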
19 changes: 11 additions & 8 deletions deeptools/plotHeatmap.py
@@ -780,14 +780,11 @@ def main(args=None):
if args.sortRegions == 'keep':
args.sortRegions = 'no' # These are the same thing
if args.kmeans is not None:
hm.matrix.hmcluster(args.kmeans, method='kmeans',
clustering_samples=args.clusterUsingSamples)
else:
if args.hclust is not None:
print("Performing hierarchical clustering."
"Please note that it might be very slow for large datasets.\n")
hm.matrix.hmcluster(args.hclust, method='hierarchical',
clustering_samples=args.clusterUsingSamples)
hm.matrix.hmcluster(args.kmeans, method='kmeans', clustering_samples=args.clusterUsingSamples)
elif args.hclust is not None:
print("Performing hierarchical clustering."
"Please note that it might be very slow for large datasets.\n")
hm.matrix.hmcluster(args.hclust, method='hierarchical', clustering_samples=args.clusterUsingSamples)

group_len_ratio = np.diff(hm.matrix.group_boundaries) / len(hm.matrix.regions)
if np.any(group_len_ratio < 5.0 / 1000):
@@ -816,6 +813,12 @@ def main(args=None):
sort_method=args.sortRegions,
sample_list=sortUsingSamples)

if args.silhouette:
if args.kmeans is not None:
hm.matrix.computeSilhouette(args.kmeans)
elif args.hclust is not None:
hm.matrix.computeSilhouette(args.hclust)

if args.outFileNameMatrix:
hm.save_matrix_values(args.outFileNameMatrix)

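For reference, the call sequence main() now performs when clustering and --silhouette are both requested looks roughly like this (a sketch only; read_matrix_file is the pre-existing loader on the heatmapper class, and the file names are placeholders):

from deeptools import heatmapper

hm = heatmapper.heatmapper()
hm.read_matrix_file('matrix.gz')         # a computeMatrix output

k = 3
hm.matrix.hmcluster(k, method='kmeans')  # assign every region to one of k clusters
hm.matrix.computeSilhouette(k)           # fills hm.matrix.silhouette, prints the mean to stderr

# save_BED() writes the score as the final BED column and closes the handle itself.
hm.save_BED(open('clustered_regions.bed', 'w'))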
7 changes: 7 additions & 0 deletions galaxy/wrapper/deepTools_macros.xml
@@ -164,6 +164,12 @@
</when>
<when value="none" />
</conditional>
<param argument="--silhouette" type="boolean" truevalue="--silhouette" falsevalue=""
label="Compute the silhouette score for each region"
help="Compute the silhouette score for regions. This is only applicable if clustering has been performed.
The silhouette score is a measure of how similar a region is to other regions in the same cluster as
opposed to those in other clusters. It will be reported in the final column of the BED file with regions.
The silhouette evaluation can be very slow when you have more than 100 000 regions." />
</when>
<when value="yes" />
</conditional>
@@ -181,6 +187,7 @@
--hclust $advancedOpt.used_multiple_regions.clustering.n_hclust
#end if
#end if
$advancedOpt.used_multiple_regions.silhouette
#end if
</token>
