Release for JSS paper.

mhahsler · Oct 23, 2019 · 50cacb7
1 parent 70adc7b
commit 50cacb7
Showing 9 changed files with 966 additions and 919 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: dbscan
-Version: 1.1-4.1
-Date: 2019-xx-xx
+Version: 1.1-5
+Date: 2019-10-22
 Title: Density Based Clustering of Applications with Noise (DBSCAN) and Related
     Algorithms
 Authors@R: c(person("Michael", "Hahsler", role = c("aut", "cre", "cph"),
@@ -14,7 +14,8 @@ Description: A fast reimplementation of several density-based algorithms of
     the clustering structure) clustering algorithms HDBSCAN (hierarchical DBSCAN) and the LOF (local outlier
     factor) algorithm. The implementations use the kd-tree data structure (from
     library ANN) for faster k-nearest neighbor search. An R interface to fast kNN
-    and fixed-radius NN search is also provided.
+    and fixed-radius NN search is also provided. 
+    See Hahsler M, Piekenbrock M and Doran D (2019) <doi:10.18637/jss.v091.i01>.
 Imports:
     Rcpp (>= 1.0.0),
     graphics,

diff --git a/NEWS.md b/NEWS.md
@@ -1,4 +1,4 @@
-# dbscan 1.1-4.1 (2019-xx-xx)
+# dbscan 1.1-5 (2019-10-22)
 
 ## New Features
 * kNN and frNN gained parameter query to query neighbors for points not in the data.
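
Reviewer note (not part of the commit): the `query` parameter mentioned in the NEWS entry above can be tried with a minimal R sketch like the one below. The data and the names `x`, `q` and `nn` are made up for illustration. The result layout (an `id` and a `dist` matrix with one row per query point) is an assumption taken from the package documentation, not from this diff.

```r
library(dbscan)

set.seed(1)
x <- matrix(runif(200), ncol = 2)  # 100 reference points in 2-d (illustrative data)
q <- matrix(runif(10), ncol = 2)   # 5 new points that are not part of x

# Find the 3 nearest neighbors in x for each row of q.
nn <- kNN(x, k = 3, query = q)

dim(nn$id)    # expected layout: one row per query point, one column per neighbor
head(nn$dist) # distances from each query point to its neighbors in x
```

`frNN()` accepts `query` in the same way, with `eps` in place of `k`.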

diff --git a/inst/CITATION b/inst/CITATION
@@ -0,0 +1,19 @@
+bibentry(bibtype = "Article",
+  title        = "{dbscan}: Fast Density-Based Clustering with {R}",
+  author       = c(person(given = "Michael",
+                          family = "Hahsler",
+                          email = "mhahsler@lyle.smu.edu"),
+                   person(given = "Matthew",
+                          family = "Piekenbrock"),
+                   person(given = "Derek",
+                          family = "Doran",
+                          email = "derek.doran@wright.edu")),
+  journal      = "Journal of Statistical Software",
+  year         = "2019",
+  volume       = "91",
+  number       = "1",
+  pages        = "1--30",
+  doi          = "10.18637/jss.v091.i01",
+  header       = "To cite dbscan in publications use:"
+)
+
diff --git a/man/dbscan.Rd b/man/dbscan.Rd
@@ -42,7 +42,7 @@ dbscan(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, ...)
 \emph{Note:} use \code{dbscan::dbscan} to call this
 implementation when you also use package \pkg{fpc}.
 
-This implementation of DBSCAN implements the original algorithm as described by
+This implementation of DBSCAN (Hahsler et al, 2019) implements the original algorithm as described by
 Ester et al (1996). DBSCAN estimates the density around each data point by counting the number of points in a user-specified eps-neighborhood and applies a user-specified minPts threshold to identify core, border and noise points. In a second step, core points are joined into a cluster if they are density-reachable (i.e., there is a chain of core points where one falls inside the eps-neighborhood of the next). Finally, border points are assigned to clusters. The algorithm only needs
 parameters \code{eps} and \code{minPts}.
 
@@ -72,7 +72,10 @@ cluster will be reported as members of the noise cluster 0.
     \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.}
 }
 \references{
-Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. \emph{Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).}
+Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R.
+  \emph{Journal of Statistical Software}, 91(1), 1-30. \doi{10.18637/jss.v091.i01}
+
+Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. \emph{Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)}, 226-231.
 
 Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based Clustering
 Based on Hierarchical Density Estimates. \emph{Proceedings of the 17th

diff --git a/man/hdbscan.Rd b/man/hdbscan.Rd
@@ -35,10 +35,10 @@ hdbscan(x, minPts, xdist = NULL,
 
 }
 \details{
-Computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction
+This fast implementation of HDBSCAN (Hahsler et al, 2019) computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction
 proposed by Campello et al. (2013). HDBSCAN essentially computes the hierarchy of all DBSCAN* clusterings, and then uses a stability-based extraction method to find optimal cuts in the hierarchy, thus producing a flat solution.
 
-Additional, related algorithms including the "Global-Local Outlier Score from Hierarchies" (GLOSH) (see section 6 of Campello et al. 2015) outlier scores and ability to cluster based on instance-level constraints (see section 5.3 of Campello et al. 2015) are supported. The algorithms only need the parameter \code{minPts}.
+Additional, related algorithms including the "Global-Local Outlier Score from Hierarchies" (GLOSH) (see section 6 of Campello et al., 2015) outlier scores and ability to cluster based on instance-level constraints (see section 5.3 of Campello et al. 2015) are supported. The algorithms only need the parameter \code{minPts}.
 
 Note that \code{minPts} not only acts as a minimum cluster size to detect, but also as a "smoothing" factor of the density estimates implicitly computed from HDBSCAN.
 }
@@ -54,11 +54,12 @@ Note that \code{minPts} not only acts as a minimum cluster size to detect, but a
   %% ...
 }
 \references{
-Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. \emph{Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).}
+Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R.
+  \emph{Journal of Statistical Software}, 91(1), 1-30. \doi{10.18637/jss.v091.i01}
 
-Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. \emph{Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013,} Lecture Notes in Computer Science 7819, p. 160.
+Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on Hierarchical Density Estimates. \emph{Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013,} Lecture Notes in Computer Science 7819, p. 160.
 
-Campello, Ricardo JGB, et al. "Hierarchical density estimates for data clustering, visualization, and outlier detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.1 (2015): 5.
+Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. \emph{ACM Transactions on Knowledge Discovery from Data (TKDD),} 10(5):1-51.
 }
 
 \seealso{

diff --git a/man/optics.Rd b/man/optics.Rd
@@ -45,7 +45,7 @@ extractXi(object, xi, minimum = FALSE, correctPredecessors = TRUE)
     details on how to control the search strategy.}
 }
 \details{
-This implementation of OPTICS implements the original algorithm as described by
+This implementation of OPTICS (Hahsler et al, 2019) implements the original algorithm as described by
 Ankerst et al (1999). OPTICS is an ordering algorithm using similar concepts
 to DBSCAN. However, for OPTICS
 \code{eps} is only an upper limit for the neighborhood size used to reduce
@@ -101,9 +101,12 @@ See \code{\link{frNN}} for more information on the parameters related to nearest
     \item{clusters_xi }{ data.frame containing the start and end of each cluster found in the OPTICS ordering. }
 }
 \references{
-Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD international conference on Management of data. ACM Press. pp. 49--60.
+Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R.
+  \emph{Journal of Statistical Software}, 91(1), 1-30. \doi{10.18637/jss.v091.i01}
 
-Erich Schubert, Michael Gertz (2018). Improving the Cluster Structure Extracted from OPTICS Plots. Lernen, Wissen, Daten, Analysen (LWDA 2018). pp. 318--329.
+Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD international conference on Management of data. ACM Press. pp. 49-60.
+
+Erich Schubert, Michael Gertz (2018). Improving the Cluster Structure Extracted from OPTICS Plots. Lernen, Wissen, Daten, Analysen (LWDA 2018). pp. 318-329.
 }
 
 \author{

diff --git a/vignettes/dbscan.Rnw b/vignettes/dbscan.Rnw
@@ -127,6 +127,13 @@ This article presents an overview of the \proglang{R} package~\pkg{dbscan}
 focusing on DBSCAN and OPTICS, outlining its operation and experimentally
 compares its performance with implementations in other open-source software. We first review the concept of density-based clustering and present the DBSCAN and OPTICS algorithms in Section~\ref{sec:dbc}. This section concludes with a short review of existing software packages that implement these algorithms. Details about \pkg{dbscan}, with examples of its use, are presented in Section~\ref{sec:dbscan}. A performance evaluation is presented in Section~\ref{sec:eval}. Concluding remarks are offered in Section~\ref{sec:conc}.
 
+A version of this article describing the package \pkg{dbscan} was published as \cite{hahsler2019dbscan} and should be cited.
+
+<<echo=FALSE>>=
+options(useFancyQuotes = FALSE)
+citation("dbscan")
+@
+
 \section{Density-based clustering}\label{sec:dbc}
 Density-based clustering is now a well-studied field. Conceptually, the idea behind density-based clustering is simple: given a set of data points, define a structure that accurately reflects the underlying density~\citep{sander2011density}. An important distinction between density-based clustering and alternative approaches to cluster analysis, such as the use of \emph{(Gaussian) mixture models}~\citep[see][]{jain1999review}, is that the latter represents a \emph{parametric} approach in which the observed data are assumed to have been produced by a mixture of either Gaussian or other parametric families of distributions.
 While certainly useful in many applications, parametric approaches naturally assume clusters will exhibit some type of convex (generally hyper-spherical or hyper-elliptical) shape. Other approaches, such as $k$-means clustering (where the $k$ parameter signifies the user-specified number of clusters to find), share this common theme of `minimum variance', where the underlying assumption is made that ideal clusters are found by minimizing some measure of intra-cluster variance (often referred to as cluster cohesion) and maximizing the inter-cluster variance (cluster separation)~\citep{arbelaitz2013extensive}. Conversely, the label density-based clustering is used for methods which do not assume parametric distributions, are capable of finding arbitrarily-shaped clusters, handle varying amounts of noise, and require no prior knowledge regarding how to set the number of clusters $k$. This methodology is best expressed in the DBSCAN algorithm, which we discuss next.
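
Reviewer note (not part of the commit): the contrast drawn in the vignette text above, between minimum-variance methods such as $k$-means and density-based clustering, can be illustrated with a small R sketch. It uses the `dbscan(x, eps, minPts)` signature shown in the man/dbscan.Rd hunk. The two-ring data set, the helper `ring()`, and the parameter values `eps = 0.8` and `minPts = 5` are assumptions chosen for illustration, not values taken from the paper.

```r
library(dbscan)

set.seed(42)
n <- 300  # points per ring (illustrative choice)

# Hypothetical helper for this sketch: noisy points scattered around a circle.
ring <- function(n, radius, sd = 0.05) {
  theta <- runif(n, 0, 2 * pi)
  cbind(radius * cos(theta), radius * sin(theta)) +
    matrix(rnorm(2 * n, sd = sd), ncol = 2)
}

# Two concentric rings: non-convex clusters that share the same center.
x <- rbind(ring(n, radius = 1), ring(n, radius = 3))

# k-means needs the number of clusters and favors compact, convex groups.
km <- kmeans(x, centers = 2)

# DBSCAN only needs eps and minPts; cluster 0 (if present) marks noise points.
db <- dbscan::dbscan(x, eps = 0.8, minPts = 5)

# Cross-tabulate the two labelings to see where they disagree.
table(kmeans = km$cluster, dbscan = db$cluster)
```

Plotting `x` colored by `km$cluster` and by `db$cluster` side by side makes the difference easier to see than the table alone.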