Benchmarks
We recently ran fresh benchmarks against Spark v1.1/v1.2 and GraphLab clusters, and include updated numbers from other recently published benchmarks.
The algorithms and datasets are listed below.
- Logistic Regression for text classification (RCV1)
- Logistic Regression for click prediction (Criteo)
- K-means (MNIST 8M digit recognition)
- Matrix Factorization (Netflix dataset)
- Topic Modeling (NYTimes dataset)
- Random Forests (Year Prediction dataset)
- PageRank on social graphs (Twitter Followers)
- PageRank on a web graph (AltaVista crawl)
RCV1-v2 (Reuters news data, LYRL2004 distribution) benchmarks are for OAA (One Against All) classification, since RCV1-v2 has 103 independent topic labels. RCV1-v2 is a small dataset (0.5 GB).
BIDMach was run on a single Amazon g2.2xlarge instance, while Spark was run on a cluster of m3.2xlarge high-memory instances. The other systems were run on an 8-core Intel E5-2660 machine. All systems with Intel MKL support were compiled with MKL linked in. Energy and cost figures are estimates only, based on a basic CPU (m3.2xlarge) or GPU (g2.2xlarge) EC2 instance.
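The cost and energy arithmetic behind those columns is simple; here is a minimal sketch in Scala. The hourly prices and power draws below are illustrative assumptions (EC2 pricing changes over time), not the exact figures used for the tables.

```scala
// Rough cost/energy estimator for a benchmark run.
// The instance prices and power draws are assumed values for illustration,
// not the exact figures behind the tables on this page.
object RunEstimate {
  case class Instance(pricePerHour: Double, watts: Double)

  val m3_2xlarge = Instance(pricePerHour = 0.53, watts = 250.0) // assumed CPU node
  val g2_2xlarge = Instance(pricePerHour = 0.65, watts = 300.0) // assumed GPU node

  // Dollars and kilojoules for `seconds` of wall-clock time on `nodes` instances.
  def estimate(inst: Instance, nodes: Int, seconds: Double): (Double, Double) = {
    val cost   = inst.pricePerHour * nodes * seconds / 3600.0
    val energy = inst.watts * nodes * seconds / 1000.0 // watts * secs = J; /1000 = KJ
    (cost, energy)
  }

  def main(args: Array[String]): Unit = {
    val (cost, energy) = estimate(g2_2xlarge, nodes = 1, seconds = 14)
    println(f"cost: $$$cost%.4f, energy: $energy%.1f KJ")
  }
}
```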
Logistic Regression (sparse) on RCV1
| System | Nodes/Cores | nclasses | AUC | Time | Cost | Energy (KJ) |
| --- | --- | --- | --- | --- | --- | --- |
| Spark | 18/72 | 1 | 0.93 | 30s | $0.07 | 120 |
| BIDMach | 1 | 103 | 0.96 | 14s | $0.002 | 3 |
| VW | 1/8 | 103 | 0.96 | 130s | $0.02 | 30 |
| Scikit-Learn | 1/8 | 103 | 0.96 | 576s | $0.08 | 120 |
| Liblinear | 1/8 | 103 | 0.96 | 250s | $0.04 | 60 |
VW = Vowpal Wabbit
Accuracy (AUC; higher is better) is quoted only for topic 6, which is roughly balanced between positive and negative instances. Accuracy across all classes is somewhat higher, but tracks the topic 6 accuracy.
All the systems were run as OAA (multiclass) classifiers except Spark and Liblinear, neither of which supports multi-label classification directly; Liblinear was instead trained separately for each topic.
Spark's accuracy was lower than the other systems' after 3 passes over the dataset, perhaps because Spark's SGD implementation performs only a single SGD update per pass over the dataset, whereas the other systems make one update per minibatch.
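To make the per-minibatch distinction concrete, here is a simplified sketch in plain Scala (not BIDMach's actual implementation) of one training pass of minibatch SGD for binary logistic regression; a dense model is used for clarity even though RCV1 is sparse.

```scala
// Minibatch SGD for binary logistic regression, dense for clarity.
// The point is that the weights are updated after every minibatch, so one
// pass over the data performs numSamples/batchSize updates, not just one.
object MinibatchLR {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def trainPass(x: Array[Array[Double]], y: Array[Double],
                w: Array[Double], lr: Double, batchSize: Int): Unit = {
    for (start <- x.indices by batchSize) {
      val end  = math.min(start + batchSize, x.length)
      val grad = new Array[Double](w.length)
      for (i <- start until end) {
        val p   = sigmoid(x(i).zip(w).map { case (xi, wi) => xi * wi }.sum)
        val err = p - y(i)
        for (j <- w.indices) grad(j) += err * x(i)(j)
      }
      // update immediately: once per minibatch, not once per pass
      for (j <- w.indices) w(j) -= lr * grad(j) / (end - start)
    }
  }
}
```

An OAA run like those in the table simply trains 103 such models, one per topic label.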
VW has a fast SGD implementation that uses hardware acceleration (Intel intrinsics), and is probably the fastest CPU-based single-machine system in use today. See e.g.
http://fastml.com/vowpal-wabbit-liblinear-sbm-and-streamsvm-compared/
BIDMach with a GPU is an order of magnitude faster than any of the single-machine CPU systems, and outperforms Spark on an 18-node cluster while training 103x as many models.
The reference for RCV1-v2 is:
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
Logistic Regression on Criteo
Criteo released a medium-sized (12 GB) click-prediction dataset with a single target (click / no click) and a very sparse set of features. This is representative of many click-prediction tasks in industry.
| System | Nodes/Cores | npasses | AUC | Time | Cost | Energy (KJ) |
| --- | --- | --- | --- | --- | --- | --- |
| Spark | 8/32 | 10 | 0.62 | 964s | $0.64 | 1500 |
| Spark | 32/128 | 10 | 0.62 | 400s | $1.00 | 2500 |
| BIDMach | 1 | 1 | 0.66 | 81s | $0.01 | 6 |
| BIDMach | 1 | 10 | 0.72 | 805s | $0.13 | 60 |
BIDMach was run as before on a single Amazon g2.2xlarge instance, while Spark 1.1 was run on a cluster of m3.2xlarge high-memory instances. Once again Spark's SGD did not perform well, and running more iterations did not improve it. Random guessing yields an AUC of 0.5, so BIDMach's lift over random (0.72 - 0.5 = 0.22) is nearly twice Spark's (0.62 - 0.5 = 0.12). BIDMach was around 10x faster on a per-node basis. Single-target logistic regression is a worst case for GPU acceleration, but BIDMach's numbers were quite respectable on this problem: overall speed was higher for comparable accuracy, and cost and energy use were far lower.
The dataset is available here:
http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/
K-means on MNIST-8M
The MNIST-8M dataset is a medium-sized (25 GB) collection of images of handwritten digits: 8 million 28x28 images labeled 0-9. K-means was run for 20 iterations. The BIDMach numbers below are for a Titan-X GPU.
| System | Nodes/Cores | nclust | SumSq | Time | Cost | Energy (KJ) |
| --- | --- | --- | --- | --- | --- | --- |
| Spark | 32/128 | 256 | 1.5e13 | 180s | $0.45 | 1150 |
| BIDMach | 1 | 256 | 1.44e13 | 320s | $0.06 | 90 |
| Scikit-Learn | 1/8 | 256 | - | 3200x4 * | $1.0 | 10 |
| Spark | 96/384 | 4096 | 1.05e13 | 1100s | $9.00 | 22000 |
| BIDMach | 1 | 4096 | 0.995e13 | 735s | $0.12 | 140 |
BIDMach was competitive with a small Spark cluster (32 nodes) on the smaller problem (256 clusters), but outperformed a 96-node cluster on the larger problem (4096 clusters). This is typical for compute-intensive problems, where BIDMach gets full leverage from its GPU matrix kernels.
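For reference, the SumSq column above is the K-means objective: the total squared distance from each point to its nearest centroid (lower is better). A minimal sketch in plain Scala, not BIDMach's GPU kernels:

```scala
// K-means quality metric: the sum over all points of the squared distance
// to the nearest centroid (the "SumSq" column). Simplified sketch only.
object KMeansObjective {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def sumSq(points: Seq[Array[Double]], centroids: Seq[Array[Double]]): Double =
    points.map(p => centroids.map(c => sqDist(p, c)).min).sum

  def main(args: Array[String]): Unit = {
    val pts = Seq(Array(0.0, 0.0), Array(1.0, 1.0), Array(9.0, 9.0))
    val cts = Seq(Array(0.5, 0.5), Array(9.0, 9.0))
    println(sumSq(pts, cts)) // 1.0 = 0.5 + 0.5 + 0.0
  }
}
```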
The reference for the dataset is:
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
Matrix Factorization on Netflix
The Netflix dataset is a small (about 4 GB) collection of movie ratings from 0.5 million users on 17,770 movies. Sparse matrix factorization was used, either as a batch algorithm (also known as Alternating Least Squares, ALS) or as an SGD implementation. We compared against Spark 1.1 and the archived GraphLab open-source distribution; the newer commercial version of GraphLab does not yet support distributed computing.
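To illustrate the SGD variant, here is the update for a single observed rating, sketched in plain Scala; BIDMach's actual implementation batches these updates into GPU matrix kernels, and ALS instead alternates exact least-squares solves for the user and movie factors.

```scala
// One SGD step of sparse matrix factorization: for an observed rating r of
// a movie by a user, nudge the user factor u and movie factor v to reduce
// the squared prediction error, with L2 regularization. Sketch only.
object MFSGD {
  def update(u: Array[Double], v: Array[Double], r: Double,
             lr: Double, reg: Double): Unit = {
    val pred = u.zip(v).map { case (a, b) => a * b }.sum
    val err  = r - pred
    for (k <- u.indices) {
      val uk = u(k)
      u(k) += lr * (err * v(k) - reg * uk)
      v(k) += lr * (err * uk - reg * v(k))
    }
  }
}
```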
BIDMach was run on a g2.2xlarge instance, Spark on an m3.2xlarge cluster, and GraphLab on a cc2.8xlarge cluster (the default in the GraphLab EC2 script).
| System | Nodes/Cores | Dim | RMSE | Time | Cost | Energy (KJ) |
| --- | --- | --- | --- | --- | --- | --- |
| Graphlab | 18/576 | 100 | - | 376s | $3.50 | 10,000 |
| Spark | 32/128 | 100 | 0.82 | 146s | $0.40 | 1000 |
| BIDMach | 1 | 100 | 0.83 | 90s | $0.015 | 20 |
| Spark | 32/128 | 200 | 0.82 | 544s | $1.45 | 3500 |
| BIDMach | 1 | 200 | 0.83 | 129s | $0.02 | 30 |
| BIDMach | 1 | 500 | 0.83 | 600s | $0.10 | 150 |
Topic Modeling (LDA) on NYTimes
BIDMach includes several implementations of LDA: an online Variational Bayes (VB) algorithm [3], a batch VB algorithm [1], and a "cooled" Gibbs sampler described in a forthcoming paper [6]. We compared against four systems, all custom LDA implementations (a skeleton of the online VB update appears after this list):
- David Blei's original C++ implementation of batch VB [1]
- Tom Griffiths' implementation of Collapsed Gibbs Sampling in Matlab/C [2]
- Yan et al.'s implementation of Collapsed Gibbs Sampling with GPU acceleration [4]
- Ahmed et al.'s implementation of VB using a customized cluster implementation [5]
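As promised above, here is a skeleton of the global step of online VB [3], sketched in plain Scala. The local E-step that produces the minibatch estimate is elided, and tau0/kappa are illustrative defaults.

```scala
// Global update of online Variational Bayes for LDA (Hoffman et al. [3]):
// the topic-word statistics lambda are blended with lambdaHat, the estimate
// the current minibatch would imply if it were the whole corpus, using a
// step size that decays with the minibatch counter t.
object OnlineVB {
  // step size rho_t = (tau0 + t)^(-kappa), with kappa in (0.5, 1]
  def rho(t: Int, tau0: Double = 1024.0, kappa: Double = 0.7): Double =
    math.pow(tau0 + t, -kappa)

  def globalStep(lambda: Array[Array[Double]],
                 lambdaHat: Array[Array[Double]], t: Int): Unit = {
    val r = rho(t)
    for (k <- lambda.indices; w <- lambda(k).indices)
      lambda(k)(w) = (1 - r) * lambda(k)(w) + r * lambdaHat(k)(w)
  }
}
```

The single-node comparison below is on the NYTimes dataset: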
| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
| --- | --- | --- | --- | --- | --- |
| BIDMach online VB (680 GPU) | 40 secs | 2 | 40 MB/s | 25 | 250 |
| BIDMach Cooled GS (680 GPU) | 90 secs | 3 | 20 MB/s | 30 | 300 |
| BIDMach batch VB (680 GPU) | 400 secs | 20 | 50 MB/s | 25 | 250 |
| Yan GPU Collapsed GS | 5400 secs | 1000 | 100 MB/s | 0.4 (est.) | 4 |
| Blei batch VB | 252,000 secs | 20 | 0.1 MB/s | 0.05 (est.) | 0.5 |
| Griffiths Collapsed GS | 225,000 secs | 1000 | 2 MB/s | 0.008 (est.) | 0.08 |
To compare with the Yahoo_LDA cluster algorithm [5], which was benchmarked on proprietary news articles, we constructed a repeating stream of NYTimes articles as per [5]. There were two problem instances; the first is 200 million articles with 256 topics:
| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
| --- | --- | --- | --- | --- | --- |
| BIDMach Cooled GS (1 node) | 30,000 secs | 2 | 10 MB/s | 30 | 300 |
| Yahoo_LDA (100 nodes) | 120,000 secs | 1000 | 1300 MB/s | 8 (est.) | 0.8 |
The second is 2 billion articles with 256 topics:
| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
| --- | --- | --- | --- | --- | --- |
| BIDMach Cooled GS (1 node) | 300,000 secs | 2 | 10 MB/s | 30 | 300 |
| Yahoo_LDA (1000 nodes) | 240,000 secs | 1000 | 7000 MB/s | 40 (est.) | 0.4 |
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[2] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl 1):5228-5235, 2004.
[3] M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In NIPS, volume 2, page 5, 2010.
[4] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, volume 9, pages 2134-2142, 2009.
[5] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 123-132. ACM, 2012.
[6] H. Zhao and J. Canny. Cooled Gibbs parameter estimation. Under review, 2014.
Random Forests on Year Prediction
For this test we used the UCI Year Prediction dataset that was used in a recent Spark v1.2 benchmark:
https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html (about 0.2 GB)
We ran forests with 100 trees at depth 10 to match the benchmark reported in that blog (depth ~20 is really needed for good accuracy, but we don't have Spark times at that depth). Spark was run on 16 r3.2xlarge EC2 workers; BIDMach was run on a single g2.2xlarge.
Here are the times:
| System | Time (s) | Cost ($) | Energy (KJ) |
| --- | --- | --- | --- |
| Spark (16 workers) | 150 | 0.50 | 500 |
| BIDMach | 300 | 0.06 | 60 |
It's fair to ask whether the comparison is meaningful, since Spark is a cluster implementation and can in principle handle much larger datasets. There are other RF implementations that are faster per machine (Wise.io's, and the most recent Scikit-Learn implementation, are very good). We think the comparison is appropriate because BIDMach's RF implementation is scalable, unlike the other single-machine systems: it is an out-of-core implementation that pulls the data repeatedly off disk. We have computed forests on 100 GB datasets, and it should be possible to go at least an order of magnitude beyond that, still on one machine.
With forests, of course, scaling is perfectly linear in the number of trees. BIDMach's time above was based on blocks of 25 trees, so the time can be cut 4x by running on 4 nodes, for the same total cost and energy. Our implementation focuses on scalability on a single node, squeezing memory use down so that we get competitive running time while avoiding the overhead of distributing the computation of individual trees (a sketch of the out-of-core pattern follows). We think the result is very competitive with other forest implementations in production environments; in fact, we are testing this right now :-)
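Here is a sketch of that out-of-core pattern; the helper types are hypothetical placeholders, not BIDMach's API.

```scala
// Out-of-core Random Forest training pattern: the dataset never fits in
// memory, so each tree level takes one streaming pass over disk-resident
// chunks, accumulating split statistics for a block of trees (e.g. 25)
// grown together. Memory scales with the block, not with the dataset.
// Chunk, TreeBlock and chunks() are hypothetical placeholders.
object OutOfCoreRF {
  trait Chunk                          // one disk-resident slice of the data
  trait TreeBlock {                    // a block of trees grown together
    def accumulateStats(c: Chunk, depth: Int): Unit
    def splitNodes(depth: Int): Unit   // turn accumulated stats into splits
  }
  def chunks(path: String): Iterator[Chunk] = ??? // stream chunks from disk

  def train(block: TreeBlock, path: String, maxDepth: Int): Unit = {
    for (depth <- 0 until maxDepth) {
      for (c <- chunks(path)) block.accumulateStats(c, depth) // one full pass
      block.splitNodes(depth)
    }
  }
}
```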
PageRank
It turns out that most data frameworks (Hadoop, Spark, GraphLab) use a reduce operation that isn't scalable [2]. In these systems, messages are exchanged all-to-all in a network of N nodes, and message size shrinks as O(1/N) (fixed amount of data per node) or O(1/N^2) (fixed total dataset size). Eventually messages become too small to be sent efficiently. Butterfly allreduce primitives overcome this problem [1], [2], and also provide error tolerance and speed.
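To illustrate the butterfly pattern, here is a small single-process simulation of a dense recursive-doubling allreduce. Kylix [2] is a sparse, heterogeneous-degree refinement of this idea; the sketch shows only the basic communication structure.

```scala
// Simulation of a butterfly (recursive-doubling) allreduce on N = 2^d nodes.
// In round k, node i exchanges its partial sum with partner i XOR 2^k, so
// after log2(N) rounds every node holds the full sum. Each node sends only
// log2(N) messages (vs. N-1 for direct all-to-all), and message size stays
// constant instead of shrinking as O(1/N).
object ButterflyAllreduce {
  def allreduce(values: Array[Double]): Array[Double] = {
    val n = values.length
    require(n > 0 && (n & (n - 1)) == 0, "node count must be a power of two")
    val state = values.clone
    var step = 1
    while (step < n) {
      val next = Array.tabulate(n)(i => state(i) + state(i ^ step)) // pairwise exchange + add
      Array.copy(next, 0, state, 0, n)
      step <<= 1
    }
    state // every entry now equals the global sum
  }

  def main(args: Array[String]): Unit = {
    println(allreduce(Array(1.0, 2.0, 3.0, 4.0)).mkString(", ")) // 10.0 x 4
  }
}
```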
The tables below show the performance of BIDMach running on a cluster of 64 nodes for one iteration of the PageRank algorithm, using the "Kylix" allreduce primitive [2]. The first benchmark is the Twitter Followers graph, a power-law graph with 40 million vertices and 1.4 billion edges.
| System | Time |
| --- | --- |
| BIDMach/Kylix (64 nodes) | 0.5 secs |
| PowerGraph (64 nodes) | 3.6 secs |
| Hadoop (90 nodes) | 250 secs |
The second benchmark is the Yahoo AltaVista web crawl, a power-law graph with 1.4 billion vertices and 6 billion edges.
| System | Time |
| --- | --- |
| BIDMach/Kylix (64 nodes) | 2.5 secs |
| PowerGraph (64 nodes) | 7 secs |
| Hadoop (90 nodes) | 1000 secs |
BIDMach with Kylix is 3-7x faster than the next-fastest system for this problem (PowerGraph) on a 64-node cluster. This gap should grow on larger clusters, thanks to the better asymptotic performance of butterflies.
[1] Huasha Zhao and John Canny. Butterfly Mixing: Accelerating Incremental-Update Algorithms on Clusters. Proc. 2013 SIAM International Conference on Data Mining (SDM 2013).
[2] H. Zhao and J. Canny. Kylix: A Sparse Allreduce for Commodity Clusters. Int. Conf. on Parallel Processing (ICPP), 2014.
(1) Allreduce can be implemented in the MapReduce framework, but it requires two reduce stages. Our approach provides allreduce as a single-stage primitive.