Home
Yanpei Chen, Sara Alspaugh, Archana Ganapathi (1), Rean Griffith (2), Randy Katz
{ychen2, alspaugh, randy} [at] eecs [dot] berkeley [dot] edu, (1) aganapathi [at] splunk [dot] com, (2) rean [at] vmware [dot] com.
Version 1.4. Released January 2012.
MapReduce systems face enormous challenges due to the increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. SWIM includes:
- A repository of real-life MapReduce workloads from production systems.
- Workload synthesis tools to generate representative test workloads by sampling historical MapReduce cluster traces.
- Workload replay tools to execute the historical or test workloads with low performance overhead.
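As a rough illustration of what "sampling historical MapReduce cluster traces" can look like, the following is a minimal sketch, not SWIM's actual synthesis tool. It assumes a trace is a list of job records with a submit time and per-stage data sizes, and it builds a shorter test workload by picking random time windows from the trace and concatenating them. The `Job` type, the `synthesize` function, and all parameter names are hypothetical.

```python
import random
from typing import List, NamedTuple


class Job(NamedTuple):
    """Hypothetical trace record; field names are assumptions."""
    submit_time: float   # seconds since the start of the trace
    input_bytes: int
    shuffle_bytes: int
    output_bytes: int


def synthesize(trace: List[Job], trace_length: float, window: float,
               num_windows: int, seed: int = 0) -> List[Job]:
    """Sample `num_windows` random time windows from `trace` and
    concatenate them into one shorter workload."""
    rng = random.Random(seed)
    workload: List[Job] = []
    for i in range(num_windows):
        # Pick where this window starts in the historical trace.
        start = rng.uniform(0, trace_length - window)
        # Where this window is placed in the synthetic workload.
        offset = i * window
        for job in trace:
            if start <= job.submit_time < start + window:
                # Keep the job's data sizes; shift only its submit time.
                shifted = job.submit_time - start + offset
                workload.append(job._replace(submit_time=shifted))
    workload.sort(key=lambda j: j.submit_time)
    return workload


# Example: a toy 1000-second trace with one job every 10 seconds,
# condensed into a 300-second test workload.
toy_trace = [Job(float(t), 1000 * t, 100 * t, 10 * t) for t in range(0, 1000, 10)]
test_workload = synthesize(toy_trace, trace_length=1000.0, window=100.0, num_windows=3)
```

Sampling whole time windows, rather than individual jobs, keeps the arrival pattern and the job mix inside each window intact, which is one plausible way to keep a shortened workload representative of the original.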
SWIM enables rigorous performance measurement of MapReduce systems. SWIM contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns. This represents an advance over previous MapReduce pseudo-benchmarks of limited diversity and scope. SWIM informs both highly targeted, workload-specific optimizations and designs intended to bring general benefit.
We believe MapReduce cluster operators can use SWIM to accomplish other previously challenging tasks, including but not limited to provisioning and planning resources in multiple dimensions, tuning configurations for diverse job types within a workload, anticipating workload consolidation behavior, and quantifying workload superposition in multiple dimensions.
SWIM is currently integrated with Hadoop. The performance and evaluation science behind it is extensible to MapReduce systems in general.
You can learn more about SWIM from our IEEE MASCOTS 2011 paper, "The Case for Evaluating MapReduce Performance Using Workload Suites".
This page contains an early release of SWIM, and we expect to populate this page with additional workloads and examples as they become available. We welcome and appreciate all comments, suggestions, bug fix requests, use cases, and success stories.
SWIM is currently open-source under the New BSD License, except for files derived from Apache Hadoop, which are under the Apache License 2.0.
Please either use the git repository directly or download the repository as a compressed archive.
Analyze historical cluster traces and synthesize representative workloads
Performance measurement by executing synthetic or historical workloads
Our IEEE MASCOTS 2011 paper discusses the scientific/engineering details of the workload synthesis and replay methods.
We thank the various industry partners and government sponsors of UC Berkeley RAD Lab and its successor AMP Lab for their support and feedback on the initial versions of SWIM.