Commit 09d8090

2 parents f29b29a + 1aa10ea commit 09d8090

54 files changed: +3681, -2531 lines

AUTHORS.rst (+1)
@@ -97,5 +97,6 @@ People
 
 * `Gilles Louppe <http://www.montefiore.ulg.ac.be/~glouppe>`_
 
+
 If I forgot anyone, do not hesitate to send me an email to
 fabian.pedregosa@inria.fr and I'll include you in the list.

doc/datasets/index.rst (+1)
@@ -116,6 +116,7 @@ can be used to build artifical datasets of controled size and complexity.
    :template: function.rst
 
    make_classification
+   make_multilabel_classification
    make_regression
    make_blobs
    make_friedman1
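
For context, the generator being added to the docs can be exercised directly. A minimal sketch, assuming the tuple-of-labels return format that this commit's documentation describes; the keyword arguments and output shown are assumptions about this era of the API, not part of the diff:

    from sklearn.datasets import make_multilabel_classification

    # Small synthetic multilabel problem: each sample gets a feature vector
    # X[i] and a tuple of class labels Y[i], possibly several labels at once.
    X, Y = make_multilabel_classification(n_samples=5, n_features=10,
                                          n_classes=3, random_state=0)
    print Y   # e.g. [(0, 2), (1,), (0, 1, 2), ...]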

doc/modules/classes.rst (+35)
@@ -145,6 +145,7 @@ Samples generator
    :template: function.rst
 
    datasets.make_classification
+   datasets.make_multilabel_classification
    datasets.make_regression
    datasets.make_blobs
    datasets.make_friedman1
@@ -588,6 +589,7 @@ See the :ref:`clustering` section of the user guide for further details.
    :template: function.rst
 
    metrics.adjusted_rand_score
+   metrics.adjusted_mutual_info_score
    metrics.homogeneity_completeness_v_measure
    metrics.homogeneity_score
    metrics.completeness_score
@@ -640,6 +642,39 @@ Pairwise metrics
    mixture.VBGMM
 
 
+.. _multiclass_ref:
+
+:mod:`sklearn.multiclass`: Multiclass and multilabel classification
+====================================================================
+
+.. automodule:: sklearn.multiclass
+   :no-members:
+   :no-inherited-members:
+
+**User guide:** See the :ref:`multiclass` section for further details.
+
+.. currentmodule:: sklearn
+
+.. autosummary::
+   :toctree: generated
+   :template: class.rst
+
+   multiclass.OneVsRestClassifier
+   multiclass.OneVsOneClassifier
+   multiclass.OutputCodeClassifier
+
+.. autosummary::
+   :toctree: generated
+   :template: function.rst
+
+   multiclass.fit_ovr
+   multiclass.predict_ovr
+   multiclass.fit_ovo
+   multiclass.predict_ovo
+   multiclass.fit_ecoc
+   multiclass.predict_ecoc
+
+
 .. _naive_bayes_ref:
 
 :mod:`sklearn.naive_bayes`: Naive Bayes
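
The new entries in the reference above can be tried in a couple of lines. A minimal sketch for the newly listed clustering metric; the example values are illustrative only and are not taken from the commit:

    from sklearn import metrics

    # Agreement between two label assignments, corrected for chance.
    # Identical partitions (up to renaming of the labels) score 1.0.
    labels_true = [0, 0, 1, 1, 2, 2]
    labels_pred = [1, 1, 0, 0, 2, 2]
    print metrics.adjusted_mutual_info_score(labels_true, labels_pred)  # 1.0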

doc/modules/multiclass.rst (+37, -7)
@@ -1,17 +1,23 @@
 
 .. _multiclass:
 
-=====================
-Multiclass algorithms
-=====================
+====================================
+Multiclass and multilabel algorithms
+====================================
 
 .. currentmodule:: sklearn.multiclass
 
-This module implements multiclass learning algorithms:
+This module implements multiclass and multilabel learning algorithms:
     - one-vs-the-rest / one-vs-all
     - one-vs-one
     - error correcting output codes
 
+Multiclass classification means classification with more than two classes.
+Multilabel classification is a different task, where a classifier is used to
+predict a set of target labels for each instance; i.e., the set of target
+classes is not assumed to be disjoint as in ordinary (binary or multiclass)
+classification. This is also called any-of classification.
+
 The estimators provided in this module are meta-estimators: they require a base
 estimator to be provided in their constructor. For example, it is possible to
 use these estimators to turn a binary classifier or a regressor into a
@@ -26,9 +32,15 @@ improves.
 multiclass classification out-of-the-box. Below is a summary of the
 classifiers supported in scikit-learn grouped by the strategy used.
 
-- Inherently multiclass: Naive Bayes, LDA.
-- One-Vs-One: SVC.
-- One-Vs-All: LinearSVC, LogisticRegression, SGDClassifier, RidgeClassifier.
+- Inherently multiclass: Naive Bayes, :class:`LDA`.
+- One-Vs-One: :class:`SVC`.
+- One-Vs-All: :class:`LinearSVC`, :class:`LogisticRegression`,
+  :class:`SGDClassifier`, :class:`RidgeClassifier`.
+
+.. note::
+
+    At the moment there are no evaluation metrics implemented for multilabel
+    learnings.
 
 
 One-Vs-The-Rest
@@ -57,6 +69,24 @@ fair default choice. Below is an example::
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 
+Multilabel learning with OvR
+----------------------------
+
+``OneVsRestClassifier`` also supports multilabel classification.
+To use this feature, feed the classifier a list of tuples containing
+target labels, like in the example below.
+
+
+.. figure:: ../auto_examples/images/plot_multilabel_1.png
+    :target: ../auto_examples/plot_multilabel.html
+    :align: center
+    :scale: 75%
+
+
+.. topic:: Examples:
+
+    * :ref:`example_plot_multilabel.py`
+
 
 One-Vs-One
 ==========
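
The multilabel usage described in the new "Multilabel learning with OvR" section can be sketched as follows; the tuple-of-labels target format follows the documentation added in this commit and is an assumption about this era of the API rather than a verbatim part of the diff:

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # Four samples with two features each; each target is a tuple of labels,
    # so a sample may belong to several classes at once.
    X = [[0., 1.], [1., 0.], [1., 1.], [0., 0.]]
    Y = [(0,), (1,), (0, 1), (2,)]

    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    print clf.predict([[1., 1.]])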

doc/sphinxext/gen_rst.py (+22)
@@ -7,6 +7,7 @@
 Files that generate images should start with 'plot'
 
 """
+from time import time
 import os
 import shutil
 import traceback
@@ -256,6 +257,7 @@ def generate_file_rst(fname, target_dir, src_dir, plot_gallery):
          os.stat(src_file).st_mtime):
         # We need to execute the code
         print 'plotting %s' % fname
+        t0 = time()
         import matplotlib.pyplot as plt
         plt.close('all')
         cwd = os.getcwd()
@@ -304,6 +306,8 @@ def generate_file_rst(fname, target_dir, src_dir, plot_gallery):
         finally:
             os.chdir(cwd)
             sys.stdout = orig_stdout
+
+            print " - time elapsed : %.2g sec" % (time() - t0)
     else:
         figure_list = [f[len(image_dir):]
                        for f in glob.glob(image_path % '[1-9]')]
@@ -339,3 +343,21 @@ def generate_file_rst(fname, target_dir, src_dir, plot_gallery):
 def setup(app):
     app.connect('builder-inited', generate_example_rst)
     app.add_config_value('plot_gallery', True, 'html')
+
+    # Sphinx hack: sphinx copies generated images to the build directory
+    # each time the docs are made. If the desired image name already
+    # exists, it appends a digit to prevent overwrites. The problem is,
+    # the directory is never cleared. This means that each time you build
+    # the docs, the number of images in the directory grows.
+    #
+    # This question has been asked on the sphinx development list, but there
+    # was no response: http://osdir.com/ml/sphinx-dev/2011-02/msg00123.html
+    #
+    # The following is a hack that prevents this behavior by clearing the
+    # image build directory each time the docs are built. If sphinx
+    # changes their layout between versions, this will not work (though
+    # it should probably not cause a crash). Tested successfully
+    # on Sphinx 1.0.7
+    build_image_dir = '_build/html/_images'
+    if os.path.exists(build_image_dir):
+        shutil.rmtree(build_image_dir)

doc/whats_new.rst (+33, -13)
@@ -25,17 +25,17 @@ Changelog
   - Faster tests by `Fabian Pedregosa`_.
 
   - Silhouette Coefficient cluster analysis evaluation metric added as
-    ``sklearn.metrics.silhouette_score`` by Robert Layton.
+    :func:`sklearn.metrics.silhouette_score` by Robert Layton.
 
-  - Fixed a bug in `KMeans` in the handling of the `n_init` parameter:
-    the clustering algorithm used to be run `n_init` times but the last
+  - Fixed a bug in :ref:`k_means` in the handling of the ``n_init`` parameter:
+    the clustering algorithm used to be run ``n_init`` times but the last
     solution was retained instead of the best solution.
 
   - Minor refactoring in :ref:`sgd` module; consolidated dense and sparse
     predict methods.
 
   - Adjusted Mutual Information metric added as
-    ``sklearn.metrics.adjusted_mutual_info_score`` by Robert Layton.
+    :func:`sklearn.metrics.adjusted_mutual_info_score` by Robert Layton.
 
   - Models like SVC/SVR/LinearSVC/LogisticRegression from libsvm/liblinear
     now support scaling of C regularization parameter by the number of
@@ -54,7 +54,24 @@ Changelog
 
   - Fix a bug due to atom swapping in :ref:`OMP` by `Vlad Niculae`_.
 
-  - :ref:`SparseCoder` by `Vlad Niculae`_.
+  - :ref:`SparseCoder` by `Vlad Niculae`_.
+
+  - :ref:`mini_batch_kmeans` performance improvements by `Olivier Grisel`_.
+
+  - :ref:`k_means` support for sparse matrices by `Mathieu Blondel`_.
+
+  - Improved documentation for developers and for the :mod:`sklearn.utils`
+    module, by `Jake VanderPlas`_.
+
+  - Vectorized 20newsgroups dataset loader
+    (:func:`sklearn.datasets.fetch_20newsgroups_vectorized`) by
+    `Mathieu Blondel`_.
+
+  - :ref:`multiclass` by `Lars Buitinck`_.
+
+  - Utilities for fast computation of mean and variance for sparse matrices
+    by `Mathieu Blondel`_.
+
 
 API changes summary
 -------------------
@@ -66,10 +83,10 @@ version 0.9:
     had ``overwrite_`` parameters; these have been replaced with ``copy_``
     parameters with exactly the opposite meaning.
 
-    This particularly affects some of the estimators in ``linear_models``.
+    This particularly affects some of the estimators in :mod:`linear_model`.
     The default behavior is still to copy everything passed in.
 
-  - The SVMlight dataset loader ``sklearn.datasets.load_svmlight_file`` no
+  - The SVMlight dataset loader :func:`sklearn.datasets.load_svmlight_file` no
     longer supports loading two files at once; use ``load_svmlight_files``
     instead. Also, the (unused) ``buffer_mb`` parameter is gone.
 
@@ -80,13 +97,14 @@ version 0.9:
   - The :ref:`covariance` module now has a robust estimator of
     covariance, the Minimum Covariance Determinant estimator.
 
-  - Cluster evaluation metrics in ``metrics.cluster.py`` have been refactored
+  - Cluster evaluation metrics in :mod:`metrics.cluster` have been refactored
     but the changes are backwards compatible. They have been moved to the
-    ``metrics.cluster.supervised``, along with ``metrics.cluster.unsupervised``
-    which contains the Silhouette Coefficient.
+    :mod:`metrics.cluster.supervised`, along with
+    :mod:`metrics.cluster.unsupervised` which contains the Silhouette
+    Coefficient.
 
-  - The permutation_test_score function now behaves the same way as
-    cross_val_score (i.e. uses the mean score across the folds.)
+  - The ``permutation_test_score`` function now behaves the same way as
+    ``cross_val_score`` (i.e. uses the mean score across the folds.)
 
   - Cross Validation generators now use integer indices (``indices=True``)
     by default instead of boolean masks. This make it more intuitive to
@@ -99,10 +117,12 @@ version 0.9:
     as opposed to the regression setting.
 
   - Fixed an off-by-one error in the SVMlight/LibSVM file format handling;
-    files generated using ``sklearn.datasets.dump_svmlight_file`` should be
+    files generated using :func:`sklearn.datasets.dump_svmlight_file` should be
     re-generated. (They should continue to work, but accidentally had one
     extra column of zeros prepended.)
 
+  - ``BaseDictionaryLearning`` class replaced by ``SparseCodingMixin``.
+
 
 .. _changes_0_9:
 

examples/document_clustering.py (+45, -19)
@@ -1,33 +1,54 @@
 """
-===============================================
-Clustering text documents using MiniBatchKmeans
-===============================================
+=======================================
+Clustering text documents using k-means
+=======================================
 
 This is an example showing how the scikit-learn can be used to cluster
 documents by topics using a bag-of-words approach. This example uses
 a scipy.sparse matrix to store the features instead of standard numpy arrays.
 
+Two algorithms are demoed: ordinary k-means and its faster cousin minibatch
+k-means.
+
 """
-print __doc__
 
 # Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
+#         Lars Buitinck <L.J.Buitinck@uva.nl>
 # License: Simplified BSD
 
-from time import time
-import logging
-import numpy as np
-
 from sklearn.datasets import fetch_20newsgroups
 from sklearn.feature_extraction.text import Vectorizer
 from sklearn import metrics
 
-from sklearn.cluster import MiniBatchKMeans
+from sklearn.cluster import KMeans, MiniBatchKMeans
+
+import logging
+from optparse import OptionParser
+import sys
+from time import time
+
+import numpy as np
 
 
 # Display progress logs on stdout
 logging.basicConfig(level=logging.INFO,
                     format='%(asctime)s %(levelname)s %(message)s')
 
+# parse commandline arguments
+op = OptionParser()
+op.add_option("--no-minibatch",
+              action="store_false", dest="minibatch", default=True,
+              help="Use ordinary k-means algorithm.")
+
+print __doc__
+op.print_help()
+
+(opts, args) = op.parse_args()
+if len(args) > 0:
+    op.error("this script takes no arguments.")
+    sys.exit(1)
+
+
 ###############################################################################
 # Load some categories from the training set
 categories = [
@@ -61,23 +82,28 @@
 print "n_samples: %d, n_features: %d" % X.shape
 print
 
+
 ###############################################################################
-# Sparse MiniBatchKmeans
+# Do the actual clustering
+
+if opts.minibatch:
+    km = MiniBatchKMeans(k=true_k, init='k-means++', n_init=1,
+                         init_size=1000,
+                         batch_size=1000, verbose=1)
+else:
+    km = KMeans(k=true_k, init='random', max_iter=100, n_init=1, verbose=1)
 
-mbkm = MiniBatchKMeans(k=true_k, init='k-means++', n_init=1,
-                       init_size=1000,
-                       batch_size=1000, verbose=1)
-print "Clustering sparse data with %s" % mbkm
+print "Clustering sparse data with %s" % km
 t0 = time()
-mbkm.fit(X)
+km.fit(X)
 print "done in %0.3fs" % (time() - t0)
 print
 
-print "Homogeneity: %0.3f" % metrics.homogeneity_score(labels, mbkm.labels_)
-print "Completeness: %0.3f" % metrics.completeness_score(labels, mbkm.labels_)
-print "V-measure: %0.3f" % metrics.v_measure_score(labels, mbkm.labels_)
+print "Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_)
+print "Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_)
+print "V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_)
 print "Adjusted Rand-Index: %.3f" % \
-    metrics.adjusted_rand_score(labels, mbkm.labels_)
+    metrics.adjusted_rand_score(labels, km.labels_)
 print "Silhouette Coefficient: %0.3f" % metrics.silhouette_score(
     X, labels, sample_size=1000)
 
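
With the new ``--no-minibatch`` flag, the reworked example can presumably be run in both modes for comparison, e.g. ``python examples/document_clustering.py`` for minibatch k-means (the default) and ``python examples/document_clustering.py --no-minibatch`` for ordinary k-means; the exact invocation path is an assumption, not part of the commit.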