The spam datasets come from:
The "Toys and Games" and "Patio, Lawn and Garden" datasets come from:
The Titanic dataset comes from:
The Game of Thrones dataset comes from: The updated "isAlive" values were collected and prepared by Sarah Yurick.
To run self-taught clustering, do %run a b c d e f g, where the seven arguments are given in this order. a selects the datasets to use ("spam", "amazon", or "survival"). b is the number of self-taught clustering iterations to run. c is a hyperparameter that controls how the auxiliary data is weighted against the target data. d is the maximum number of features to use. e is the maximum number of rows to use from the auxiliary dataset. f is the maximum number of rows to use from the target dataset. g specifies the dimensionality reduction to perform before self-taught clustering, and may be any of the following: pca, sparse, truncated, kernel, LPP, pca_double, sparse_double, truncated_double, kernel_double, or LPP_double.
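
As a rough illustration of the argument order above, here is a minimal sketch of how the seven positional values could be parsed and validated with argparse. It is not the repository's actual parsing code. The names (dataset, iterations, aux_weight, and so on) are hypothetical, and the sketch assumes that c is a float, that b, d, e, and f are integers, and that g may be left off to skip dimensionality reduction.

```python
import argparse

# Allowed values taken from the usage description above.
DATASETS = ("spam", "amazon", "survival")
_BASE_REDUCTIONS = ("pca", "sparse", "truncated", "kernel", "LPP")
REDUCTIONS = _BASE_REDUCTIONS + tuple(name + "_double" for name in _BASE_REDUCTIONS)


def build_parser():
    """Build a parser for the seven positional arguments a through g.

    All argument names and types here are assumptions made for illustration.
    """
    parser = argparse.ArgumentParser(description="Self-taught clustering arguments")
    parser.add_argument("dataset", choices=DATASETS)        # a
    parser.add_argument("iterations", type=int)             # b
    parser.add_argument("aux_weight", type=float)           # c (assumed to be a float)
    parser.add_argument("max_features", type=int)           # d
    parser.add_argument("max_aux_rows", type=int)           # e
    parser.add_argument("max_target_rows", type=int)        # f
    # g: assumed optional here; None means no dimensionality reduction.
    parser.add_argument("reduction", nargs="?", default=None, choices=REDUCTIONS)
    return parser


def parse_args(argv):
    """Parse a list of argument strings, e.g. the values that follow %run."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    # Example values only; they are not recommended settings.
    args = parse_args(["spam", "10", "0.5", "1000", "5000", "2000", "pca"])
    print(args)
```

With a parser like this, an unsupported value for a or g is rejected before any clustering starts, which makes a mistake in argument order easier to spot.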