This library requires the Carrot2 Document Clustering Server, an open source clustering engine available at http://project.carrot2.org/index.html. Installation and configuration instructions can be found at http://project.carrot2.org/documentation.html. Carrot2 was originally designed for clustering search results from web queries, and thus uses a "search result" metaphor (which we've upheld), but it can also be used for any small collection of documents (up to a few thousand).
Install the package:
npm install carrot2
The basic use of node-carrot2 involves providing a set of documents to the cluster server and receiving a SearchResult object through a callback. For a complete example, refer to examples/basic.js.
var carrot2 = require('carrot2');
DocumentClusteringServer can accept an optional parameter object with host and port properties.
var dcs = new carrot2.DocumentClusteringServer(params);
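The optional parameter object can be assembled ahead of time. The following is a minimal sketch, not part of node-carrot2: buildServerParams is a hypothetical helper, and the default host and port shown are assumptions that should be adjusted to match your own DCS installation.

```javascript
// Hypothetical helper (not part of node-carrot2): fills in connection
// defaults and checks the values before the server object is constructed.
// ASSUMPTION: the defaults below are illustrative, not taken from the library.
const DEFAULT_DCS_PARAMS = { host: 'localhost', port: 8080 };

function buildServerParams(overrides) {
  // Copy the defaults, then apply any caller-supplied values on top.
  const params = Object.assign({}, DEFAULT_DCS_PARAMS, overrides || {});
  if (typeof params.host !== 'string' || params.host.length === 0) {
    throw new TypeError('host must be a non-empty string');
  }
  if (!Number.isInteger(params.port) || params.port < 1 || params.port > 65535) {
    throw new RangeError('port must be an integer between 1 and 65535');
  }
  return params;
}

const serverParams = buildServerParams({ port: 9090 });
console.log(serverParams); // { host: 'localhost', port: 9090 }
```

The resulting object can then be passed as the params argument to the DocumentClusteringServer constructor.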
Each document contains an id, title, url, snippet, and optional custom parameters:
var sr = new carrot2.SearchResult();
sr.addDocument("ID", "Title", "http://www.site.com/", "This is a snippet.", {my_key1: "my_value1", my_key2: "my_value2"});
dcs.cluster(sr, {algorithm:'lingo'}, [
{key:"LingoClusteringAlgorithm.desiredClusterCountBase", value:10},
{key:"LingoClusteringAlgorithm.phraseLabelBoost", value:1.0}
], function(err, sr) {
// Stop on error; sr may be undefined in that case.
if (err) return console.log(err);
// Clusters found by the server.
var clusters = sr.clusters;
});
For a complete list of customizable Carrot2 attributes, refer to the Component documentation: http://download.carrot2.org/head/manual/index.html#chapter.components.
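The attribute list passed to cluster is an array of key/value pairs, as in the example above. When the attributes already live in a plain object, a small conversion keeps call sites shorter. This is a sketch only; toAttributeList is a hypothetical helper and not part of node-carrot2.

```javascript
// Hypothetical helper (not part of node-carrot2): converts a plain object
// into the [{key, value}, ...] array shape used in the cluster() examples.
function toAttributeList(attributes) {
  return Object.keys(attributes).map(function (key) {
    return { key: key, value: attributes[key] };
  });
}

const attributeList = toAttributeList({
  'LingoClusteringAlgorithm.desiredClusterCountBase': 10,
  'LingoClusteringAlgorithm.phraseLabelBoost': 1.0
});
console.log(attributeList.length); // 2
```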
NOTE: Currently the DCS parameters object supports algorithm, ids (the set of document IDs to use; defaults to all), and max (the maximum number of documents to supply). Possible algorithms are:
lingo - Lingo Clustering (default)
stc - Suffix Tree Clustering
kmeans - Bisecting k-means
url - By URL Clustering
source - By Source Clustering
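Because an unrecognized algorithm name is easy to mistype, the parameters object can be checked before it is sent. The sketch below is a hypothetical helper, not part of node-carrot2; it only knows the algorithm names listed above, and it passes ids through unchanged because this README does not specify its type.

```javascript
// Algorithm names as listed in this README.
const ALGORITHMS = ['lingo', 'stc', 'kmeans', 'url', 'source'];

// Hypothetical helper (not part of node-carrot2): builds the DCS parameters
// object and rejects values the README does not list.
function buildClusterParams(options) {
  const opts = options || {};
  // The README names lingo as the default algorithm.
  const params = { algorithm: opts.algorithm || 'lingo' };
  if (ALGORITHMS.indexOf(params.algorithm) === -1) {
    throw new Error('Unknown algorithm: ' + params.algorithm);
  }
  if (opts.ids !== undefined) {
    params.ids = opts.ids; // passed through as-is; its type is not documented here
  }
  if (opts.max !== undefined) {
    if (!Number.isInteger(opts.max) || opts.max < 1) {
      throw new RangeError('max must be a positive integer');
    }
    params.max = opts.max;
  }
  return params;
}

console.log(buildClusterParams({ algorithm: 'kmeans', max: 50 }));
// { algorithm: 'kmeans', max: 50 }
```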
Alternatively, you can cluster results from an external search engine by supplying a query string instead of a SearchResult to the cluster method. For a complete example, refer to examples/external.js.
dcs.cluster('my query', {algorithm:'stc', source:"bing-web"}, [
{key:"LingoClusteringAlgorithm.desiredClusterCountBase", value:10},
{key:"LingoClusteringAlgorithm.phraseLabelBoost", value:1.0}
], function(err, sr) {
// Stop on error; sr may be undefined in that case.
if (err) return console.log(err);
// Clusters found by the server.
var clusters = sr.clusters;
});
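Both forms of cluster take a Node-style callback as the last argument, so they can be wrapped in a Promise if that suits the calling code better. The sketch below assumes only the four-argument call shape shown in the examples above; clusterAsync and the stand-in fakeDcs object are hypothetical and not part of node-carrot2.

```javascript
// Hypothetical wrapper (not part of node-carrot2): turns the callback-style
// cluster(input, params, attributes, callback) call into a Promise.
function clusterAsync(dcs, input, params, attributes) {
  return new Promise(function (resolve, reject) {
    dcs.cluster(input, params, attributes, function (err, sr) {
      if (err) {
        reject(err);
        return;
      }
      resolve(sr);
    });
  });
}

// Stand-in for a DocumentClusteringServer so the sketch runs without a live DCS.
const fakeDcs = {
  cluster: function (input, params, attributes, callback) {
    callback(null, { query: input, clusters: [] });
  }
};

clusterAsync(fakeDcs, 'my query', { algorithm: 'stc', source: 'bing-web' }, [])
  .then(function (sr) { console.log(sr.clusters.length); })
  .catch(function (err) { console.log(err); });
```

With a real server, pass the DocumentClusteringServer instance in place of fakeDcs.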
NOTE: For external sources, the DCS parameters object also supports source (the search engine to use) and results (the number of search results to fetch from the source). Possible external sources include:
etools - eTools Metasearch Engine
bing-web - Bing Search
boss-web - Yahoo Web Search
wiki - Wikipedia Search (with Yahoo Boss)
boss-images - Yahoo Image Search
boss-news - Yahoo Boss News Search
pubmed - PubMed medical database
indeed - Jobs from indeed.com
xml - XML
google-desktop - Google Desktop search
solr - Solr Search Engine
A SearchResult object returned in a cluster callback looks like:
{ query: 'seattle',
cap: 100,
id_increment: 0,
documents: [ ... ],
documentHash: { ... },
idHash: {},
clusters:
[ { id: '[\'Washington\']',
size: 13,
score: 39.551955526331575,
phrases: [ 'Washington' ],
documents:
[ { id: 1 },
{ id: 4 },
{ id: 25 },
{ id: 26 },
{ id: 36 },
{ id: 39 },
{ id: 45 },
{ id: 47 },
{ id: 64 },
{ id: 71 },
{ id: 73 },
{ id: 75 },
{ id: 95 } ],
attributes: { score: 39.551955526331575 } },
...
],
clusterHash:
{ '[\'Washington\']':
{ id: '[\'Washington\']',
size: 13,
score: 39.551955526331575,
phrases: [ 'Washington' ],
documents:
[ { id: 1 },
{ id: 4 },
{ id: 25 },
{ id: 26 },
{ id: 36 },
{ id: 39 },
{ id: 45 },
{ id: 47 },
{ id: 64 },
{ id: 71 },
{ id: 73 },
{ id: 75 },
{ id: 95 } ],
attributes: { score: 39.551955526331575 } },
...
}
}
For detailed documentation on the Carrot2 JSON output format, see http://download.carrot2.org/head/manual/index.html#section.architecture.output-json.
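Since each cluster lists its members only by id, a common next step is to map those ids back to the documents. The following sketch works on an object shaped like the output shown above. The sample data is made up for illustration, summarizeClusters is a hypothetical helper, and the assumption that each entry in documents carries id and title fields is inferred from the addDocument arguments rather than confirmed by the library.

```javascript
// Illustrative data only, shaped like the SearchResult output shown above.
// ASSUMPTION: each entry in documents has id and title fields.
const sampleResult = {
  query: 'seattle',
  documents: [
    { id: 1, title: 'Doc one' },
    { id: 4, title: 'Doc four' },
    { id: 7, title: 'Doc seven' }
  ],
  clusters: [
    { id: "['Coffee']", size: 1, score: 12.0, phrases: ['Coffee'],
      documents: [{ id: 7 }] },
    { id: "['Washington']", size: 2, score: 39.5, phrases: ['Washington'],
      documents: [{ id: 1 }, { id: 4 }] }
  ]
};

// Hypothetical helper (not part of node-carrot2): returns one summary per
// cluster, highest score first, with member ids resolved to document titles.
function summarizeClusters(sr) {
  const byId = new Map();
  (sr.documents || []).forEach(function (doc) { byId.set(doc.id, doc); });

  return (sr.clusters || [])
    .slice() // copy so the original cluster order is left untouched
    .sort(function (a, b) { return b.score - a.score; })
    .map(function (cluster) {
      return {
        label: cluster.phrases.join(', '),
        size: cluster.size,
        titles: cluster.documents
          .map(function (ref) { return byId.get(ref.id); })
          .filter(Boolean) // skip ids with no matching document
          .map(function (doc) { return doc.title; })
      };
    });
}

summarizeClusters(sampleResult).forEach(function (summary) {
  console.log(summary.label + ' (' + summary.size + '): ' + summary.titles.join('; '));
});
```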