All about reducers count
Finding the right number of reducers for a flow, or for part of it, is an important and tedious step toward optimal performance and efficient use of cluster resources. This page describes how Scalding behaves and the facilities it provides to help in that process.
By default, Scalding reads the value of "mapred.reduce.tasks" from the JobConf and uses it as the number of reducers throughout the whole flow. If the key is not present, the number of reducers defaults to 1.
Job developers may specify the number of reducers for each aggregation using the withReducers method on *Grouped instances.
Scalding can optionally be instrumented to estimate the ideal number of reducers at each step of the flow using the facilities located in the reducer_estimation package. Estimation logic is provided by ReducerEstimator implementations and injected through Cascading's FlowStepStrategy mechanism.
The following command-line argument instructs Scalding to use one or more ReducerEstimator implementations:
-Dscalding.reducer.estimator.classes="Comma-separated list of fully qualified estimator classes"
Defaults to ""
A programmatic approach is also provided:
class MyJob(args: Args) extends Job(args) {
...
scaldingConfig.addReducerEstimator(...)
...
}
Whether the value computed by the estimators takes precedence over a programmatically defined number of reducers can be set with
-Dscalding.reducer.estimator.override=boolean
Defaults to false
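The precedence rules described above can be sketched in plain Scala. This is a minimal model, not Scalding's implementation: the JobConf is represented as a `Map[String, String]`, and `ReducerChoice`, `configuredReducers` and `effectiveReducers` are hypothetical names. It assumes that an estimate is always used when no explicit reducer count was given, and that it replaces an explicit count only when the override flag is true.

```scala
object ReducerChoice {
  // Hypothetical model: the JobConf is represented as a plain Map of settings.
  // Falls back to 1 when "mapred.reduce.tasks" is absent, as described above.
  def configuredReducers(conf: Map[String, String]): Int =
    conf.get("mapred.reduce.tasks").map(_.trim.toInt).getOrElse(1)

  // explicit:  count requested programmatically (e.g. via withReducers), if any
  // estimated: count recommended by the estimators, if any
  def effectiveReducers(
      conf: Map[String, String],
      explicit: Option[Int],
      estimated: Option[Int]): Int = {
    // Assumption: the flag defaults to false when the key is missing.
    val overrideExplicit =
      conf.get("scalding.reducer.estimator.override").exists(_.trim.toBoolean)
    val fallback = explicit.getOrElse(configuredReducers(conf))
    estimated match {
      // The estimate wins if nothing was set explicitly, or if override is on.
      case Some(n) if explicit.isEmpty || overrideExplicit => n
      case _ => fallback
    }
  }
}
```

For example, with an explicit count of 50 and an estimate of 7, this sketch returns 50 unless `scalding.reducer.estimator.override` is set to true, in which case it returns 7.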
Scalding comes with two built-in ReducerEstimator implementations:
- InputSizeReducerEstimator computes the number of reducers based on the input size and a fixed "bytesPerReducer" value, which can be set by passing
  -Dscalding.reducer.estimator.bytes.per.reducer=12345
  Defaults to 8GB.
- RatioBasedEstimator computes the average ratio of mapper bytes to reducer bytes of previous steps and uses that to scale the estimate produced by InputSizeReducerEstimator.
  RatioBasedEstimator optionally ignores history items whose input size is drastically different from that of the current job. The
  -Dscalding.reducer.estimator.input.ratio.threshold=float
  parameter specifies the lower bound on the allowable input size ratio. Defaults to 0.10 (10%), which sets the upper bound to 10x.
  The number of historical steps used for the computation can be set with
  -Dscalding.reducer.estimator.max.history=int
  Defaults to 1.
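The arithmetic behind the two built-in estimators can be illustrated with a small, self-contained sketch. It is not the library's code: `EstimatorSketch`, `HistoryItem` and the function names are hypothetical, rounding up and the minimum of one reducer are assumptions, and the scaling factor is assumed to be the average of reducer bytes divided by mapper bytes over the accepted history items. Returning `None` when no history item qualifies is also an assumption.

```scala
object EstimatorSketch {
  // One past run of a step; the field names are hypothetical.
  final case class HistoryItem(inputBytes: Long, mapperBytes: Long, reducerBytes: Long)

  // Input-size estimate: one reducer per bytesPerReducer of input,
  // rounded up, and never fewer than one reducer.
  def inputSizeEstimate(inputBytes: Long, bytesPerReducer: Long): Int = {
    require(bytesPerReducer > 0, "bytesPerReducer must be positive")
    math.max(1L, (inputBytes + bytesPerReducer - 1) / bytesPerReducer).toInt
  }

  // A history item is usable only if its input size lies within
  // [threshold, 1 / threshold] times the current input size
  // (0.10 gives a lower bound of 10% and an upper bound of 10x).
  def withinThreshold(historyInput: Long, currentInput: Long, threshold: Double): Boolean = {
    val ratio = historyInput.toDouble / currentInput.toDouble
    ratio >= threshold && ratio <= 1.0 / threshold
  }

  // Ratio-based estimate: scale the input-size estimate by the average
  // reducer-to-mapper byte ratio observed in the usable history items.
  def ratioBasedEstimate(
      currentInput: Long,
      bytesPerReducer: Long,
      history: Seq[HistoryItem],
      threshold: Double): Option[Int] = {
    val ratios = history
      .filter(h => h.mapperBytes > 0 && withinThreshold(h.inputBytes, currentInput, threshold))
      .map(h => h.reducerBytes.toDouble / h.mapperBytes.toDouble)
    if (ratios.isEmpty) None // assumption: no usable history means no estimate
    else {
      val avgRatio = ratios.sum / ratios.size
      val base = inputSizeEstimate(currentInput, bytesPerReducer)
      Some(math.max(1, math.ceil(base * avgRatio).toInt))
    }
  }
}
```

Under these assumptions, 20 GB of input at 8 GB per reducer gives 3 reducers. If a comparable past run shuffled half as many bytes to its reducers as its mappers read, an input-size estimate of 10 would be scaled down to 5.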
Custom ReducerEstimator implementations may also be provided.
For each FlowStep, both the value originally present in the JobConf and the value computed by the estimators can be viewed in Hadoop's JobTracker under the following keys:
/** Output param: what the Reducer Estimator recommended, regardless of if it was used. */
"scalding.reducer.estimator.result"
/** Output param: what the original job config was. */
"scalding.reducer.estimator.original.mapred.reduce.tasks"