yanpeichen edited this page Apr 29, 2013 · 37 revisions

SWIM

Statistical Workload Injector for MapReduce (SWIM)

Yanpei Chen (1), Sara Alspaugh, Archana Ganapathi (2), Rean Griffith (3), Randy Katz

(1) yanpei [at] cloudera [dot] com, {alspaugh, randy} [at] eecs [dot] berkeley [dot] edu, (2) aganapathi [at] splunk [dot] com, (3) rean [at] vmware [dot] com.

Additional contributions from Madalin Mihailescu (madalin [at] cs [dot] toronto [dot] edu) and Andrew Ferguson (adf [at] cs [dot] brown [dot] edu).

Last update February 2013.

Overview

MapReduce systems face enormous challenges due to the growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. SWIM includes:

  1. A repository of real-life MapReduce workloads from production systems.
  2. Workload synthesis tools that generate representative test workloads by sampling historical MapReduce cluster traces.
  3. Workload replay tools that execute the historical or test workloads with low performance overhead.
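
To make the idea of synthesizing a workload by sampling a trace concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not SWIM's actual implementation: the `Job` record, its field names, and the `synthesize` function are all hypothetical, and the approach shown (picking random time windows from the trace and concatenating them) is only one simple way to sample a trace.

```python
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Job:
    """Hypothetical per-job trace record; field names are assumptions."""
    submit_time: float   # seconds since the start of the trace
    input_bytes: int
    shuffle_bytes: int
    output_bytes: int


def synthesize(trace: List[Job], trace_length: float, window: float,
               num_windows: int, seed: int = 0) -> List[Job]:
    """Sample `num_windows` random windows of `window` seconds from the
    trace and concatenate them into a shorter synthetic workload."""
    if window <= 0 or window > trace_length:
        raise ValueError("window must be in (0, trace_length]")
    rng = random.Random(seed)  # seeded so the sample is reproducible
    workload = []
    for i in range(num_windows):
        # Pick a window start uniformly at random within the trace.
        start = rng.uniform(0, trace_length - window)
        for job in trace:
            if start <= job.submit_time < start + window:
                # Shift the job so that sampled window i begins at i * window.
                shifted = job.submit_time - start + i * window
                workload.append(replace(job, submit_time=shifted))
    workload.sort(key=lambda j: j.submit_time)
    return workload
```

Under these assumptions, calling `synthesize(trace, trace_length=86400.0, window=3600.0, num_windows=4)` would turn a one-day trace into a four-hour test workload that keeps each sampled job's data sizes and its arrival offset within its window.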

SWIM enables rigorous performance measurement of MapReduce systems. SWIM contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns. This represents an advance over previous MapReduce pseudo-benchmarks of limited diversity and scope. SWIM informs both highly targeted, workload-specific optimizations and designs intended to bring general benefit.

We believe MapReduce cluster operators can use SWIM to accomplish other previously challenging tasks, including but not limited to resource provisioning and planning in multiple dimensions, configuration tuning for diverse job types within a workload, anticipating workload consolidation behavior, and quantifying workload superposition in multiple dimensions.

SWIM is currently integrated with Hadoop. The performance and evaluation science behind it is extensible to MapReduce systems in general.

You can learn more about SWIM from our IEEE MASCOTS 2011 paper The Case for Evaluating MapReduce Performance Using Workload Suites.

This page contains an early release of SWIM, and we expect to populate this page with additional workloads and examples as they become available. We welcome and appreciate all comments, suggestions, bug fix requests, use cases, and success stories.

Download SWIM

SWIM is currently open-source under the New BSD License, except for files derived from Apache Hadoop, which are under the Apache License 2.0.

Please use either the git repository directly or download the repository as a compressed archive.

Using SWIM

Analyze historical cluster traces and synthesize representative workload

Workloads repository

Performance measurement by executing synthetic or historical workloads

Our IEEE MASCOTS 2011 paper The Case for Evaluating MapReduce Performance Using Workload Suites discusses the scientific/engineering details of the workload synthesis and replay methods.

Our VLDB 2012 paper Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads talks more about the workloads currently in the SWIM repository, as well as some additional workloads that we're trying to make public.

Community

Please join the SWIMapReduce-general mailing list for updates on the latest developments and for questions and discussions among community members.

Also please consider adding a brief note on how you are using SWIM to the SWIM projects page.

Contribute

Please feel free to fork your own version of SWIM and issue pull requests for any bug fixes or improvements that you believe would benefit the community.

The CHANGELOG file records the pre-GitHub improvements to SWIM. Since the SWIM repository migrated to GitHub in late January 2012, this file has not been maintained, and the GitHub commit records serve as the official change log.

Acknowledgements

We thank the various industry partners and government sponsors of the UC Berkeley Reliable, Adaptive, and Distributed Systems Laboratory and its successor, the Algorithms, Machines, and People Laboratory, for their support and feedback on the initial versions of SWIM.

We also thank all members of the community for contributing the subsequent improvements.