Configuration Reference
This page describes options that can be set on the JVM with -D
to configure the MongoDB Hadoop Connector. For example:
bin/hadoop jar MyJob.jar com.enterprise.JobMainClass \
-Dmongo.input.split.read_shard_chunks=true \
-Dmongo.input.uri=mongodb://...
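These properties can also be set programmatically on the job's Configuration before submission. A minimal sketch, assuming placeholder values for the connection string and job name:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        // Every option on this page is an ordinary Hadoop configuration property,
        // so -D flags and Configuration.set(...) are interchangeable.
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection"); // placeholder URI
        conf.setBoolean("mongo.input.split.read_shard_chunks", true);
        Job job = Job.getInstance(conf, "mongo-hadoop-example"); // placeholder job name
        // ... configure mapper, reducer, and formats here, then submit:
        // job.waitForCompletion(true);
    }
}
```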
Jump to MapReduce options, input/output options, or miscellaneous options.
Configure types and classes used in MapReduce jobs.
Sets the class to be used as the implementation for Mapper.
Sets the class to be used as the implementation for Reducer.
Sets the class to be used as the combiner. Must implement Reducer.
Sets the class to be used as the implementation for Partitioner.
Sets the output class of the key produced by your Mapper. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.
Sets the output class of the value produced by your Mapper. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.
Sets the output class of the key produced by your Reducer.
Sets the output class of the value produced by your Reducer. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.
The Comparator to use in the Job. Note that keys emitted by Mappers either need to be WritableComparable, or you need to specify a Comparator capable of comparing the key type emitted by Mappers.
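For orientation, these options cover the same ground as the class and type setters on the standard Hadoop Job API. The sketch below shows the equivalent calls; the MyMapper/MyReducer placeholders and the Text/IntWritable types are illustrative choices, not anything the connector requires:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.bson.BSONObject;

public class TypeSetup {
    // Placeholder classes standing in for your real job implementations.
    public static class MyMapper extends Mapper<Object, BSONObject, Text, IntWritable> { }
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }

    /** Wires up the same choices the options above control, via the Job API. */
    static void configureTypes(Job job) {
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyReducer.class);              // a combiner must implement Reducer
        job.setReducerClass(MyReducer.class);
        job.setPartitionerClass(HashPartitioner.class);
        job.setMapOutputKeyClass(Text.class);               // set these when Mapper and Reducer output types differ
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setSortComparatorClass(Text.Comparator.class);  // keys must be WritableComparable, or supply a Comparator
    }
}
```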
Configure the way the connector reads from and writes to MongoDB or BSON.
The InputFormat class to use. The MongoDB Hadoop Connector provides two: MongoInputFormat and BSONFileInputFormat, for reading from live MongoDB clusters and BSON dumps, respectively.
The OutputFormat class to use. The MongoDB Hadoop Connector provides two: MongoOutputFormat and BSONFileOutputFormat, for writing to live MongoDB clusters and BSON files, respectively.
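For example, the live-cluster formats can be selected on a Job like this (a sketch, assuming the classes live in the connector's com.mongodb.hadoop base package, as the other class names on this page suggest):

```java
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class FormatSetup {
    /** Reads from and writes to a live MongoDB cluster; swap in
     *  BSONFileInputFormat / BSONFileOutputFormat to work with BSON dumps. */
    static void configureFormats(Job job) {
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
    }
}
```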
A MongoDB connection string to authenticate against when constructing splits. If you are using a sharded cluster, your config database requires authentication, and its username/password differ from those of the collection that you are running Map/Reduce on, then you may need to use this option so that Hadoop can access collection metadata in order to compute splits. An example URI: mongodb://username:password@cyberdyne:27017/config.
This URI is also used by com.mongodb.hadoop.splitter.MongoSplitterFactory to determine which Splitter implementation to use, if one was not specified in mongo.splitter.class. If using MongoSplitterFactory, the user specified in this URI must also have read access to the input collection.
Note: this option cannot be used to provide credentials for reading from or writing to MongoDB. Please see the authentication options in the MongoDB connection string documentation for guidance on providing the correct options to mongo.input.uri and mongo.output.uri instead.
Specify one or more input collections (including auth and readPreference options) to use as the input data source for the Map/Reduce job. This takes the form of a MongoDB connection string, and by specifying options you can control how and where the input is read from.
Examples:
- mongodb://joe:12345@weyland-yutani:27017/analytics.users?readPreference=secondary
  Authenticate as "joe" with the password "12345" and read only from SECONDARY nodes of the "users" collection in the database "analytics".
- mongodb://joe:12345@weyland-yutani:27017/production.customers?readPreferenceTags=dc:tokyo,type:hadoop
  Authenticate as "joe" with the password "12345" and read the "customers" collection in the database "production" only on nodes tagged with "dc:tokyo" and "type:hadoop".
A MongoDB connection string describing the destination to which to write. Deprecated: if you are using a sharded cluster, you can specify a space-delimited list of URIs referring to separate mongos instances, and the output records will be written to them in round-robin fashion to improve throughput and avoid overloading a single mongos. This is unnecessary, because the MongoDB Java Driver balances connections automatically; just provide all mongos addresses in a single connection string.
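For example, a single connection string naming several mongos routers can be set directly; a sketch in which the host names and the output namespace are placeholders:

```java
import org.apache.hadoop.conf.Configuration;

public class OutputUriSetup {
    static void configureOutput(Configuration conf) {
        // One connection string listing every mongos router; the Java driver
        // balances across them, so the deprecated space-delimited form is not needed.
        conf.set("mongo.output.uri",
                "mongodb://mongos1:27017,mongos2:27017,mongos3:27017/reports.summaries");
    }
}
```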
Filter the input collection with a query. This query must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.
Example:
'mongo.input.query={"_id":{"$gt":{"$date":1182470400000}}}'
This is equivalent to {_id : {$gt : ISODate("2007-06-21T20:00:00-0400")}} in the mongo shell.
A projection document limiting the fields that appear in each document. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.
The cursor sort on each split. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.
The cursor limit on each split.
The cursor skip on each split.
Set noTimeout on the cursor when reading each split. This can be necessary if very large splits are being truncated because cursors are killed before all the data has been read.
Defaults to false, which decodes documents into BasicBSONObjects for the Mapper. Setting this option to true decodes documents into LazyBSONObjects instead.
Specify space-delimited MongoDB connection strings giving the addresses of one or more mongos instances. Mongos instances specified in this option will be read from in round-robin fashion. Deprecated: the MongoDB Java Driver does this automatically, so you can just provide a list of mongos addresses in a single connection string.
A key that names a field in each MongoDB document that should be used as the value for the key in each Mapper. This is _id by default. "Dot-separated" keys are supported to access nested fields, or fields contained within arrays. For example, the key location.addresses.0.street will be "Sesame" in the following document:
{
"location": {
"addresses": [
{"street": "Sesame", ...},
{"street": "Embarcadero", ...}
]
}
}
The number of documents to write in a single batch from the connector. This is 1000 by default.
Configure how the connector creates splits.
The name of the Splitter class to use. If left unset, the MongoDB Hadoop Connector will attempt to make a best guess as to which Splitter to use. The connector provides the following Splitters:
- com.mongodb.hadoop.splitter.StandaloneMongoSplitter
  The default Splitter implementation for standalone and replica set configurations. It runs splitVector on the input collection to create ranges to use for InputSplits.
- com.mongodb.hadoop.splitter.ShardMongoSplitter
  The default Splitter implementation for sharded clusters when mongo.input.split.read_from_shards is set to true (it is false by default). This Splitter creates one InputSplit per shard.
- com.mongodb.hadoop.splitter.ShardChunkMongoSplitter
  The default Splitter implementation for sharded clusters with all the default settings. This Splitter creates one InputSplit per shard chunk.
- com.mongodb.hadoop.splitter.MultiMongoCollectionSplitter
  This Splitter can read from multiple MongoDB collections simultaneously. It delegates the task of creating InputSplits to child Splitters for each collection.
- com.mongodb.hadoop.splitter.MongoPaginatingSplitter
  This Splitter builds incremental range queries to cover a query. It requires a bit more work to calculate split boundaries, but it performs better than the other Splitters when a mongo.input.query is given (see the sketch after this list).
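For instance, to force the query-friendly splitter above rather than letting MongoSplitterFactory guess, the class can be set explicitly; a minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;

public class SplitterSetup {
    static void useQueryFriendlySplitter(Configuration conf) {
        // Overrides the connector's best-guess choice of Splitter.
        conf.set("mongo.splitter.class",
                "com.mongodb.hadoop.splitter.MongoPaginatingSplitter");
    }
}
```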
This setting causes data to be read from MongoDB by adding $gte/$lt clauses to the query to limit the boundaries of each split, instead of the default behavior of limiting with $min/$max. This may be useful when mongo.input.query is being used to filter Mapper input, but the index that would be used by $min/$max is less selective than the query and available indexes. Setting this to 'true' lets the MongoDB server select the best index possible while still respecting the split boundaries. This defaults to false if not set.
Limitations
This setting should only be used when:
- The query specified in mongo.input.query does not already have filters applied against the split field; an error will be logged otherwise.
- The same data type is used for the split field in all documents; otherwise you may get incomplete results.
- The split field contains simple scalar values, and is not a compound key (e.g., "foo" is acceptable but {"k":"foo", "v":"bar"} is not).
Defaults to true. When set, attempts to create multiple input splits from the input collection, for parallelism. When set to false, the entire collection will be treated as a single input split.
Deprecated. Sets slaveOk on all outbound connections when reading data from splits. If you would like to read from secondaries, just specify the appropriate readPreference as part of your mongo.input.uri setting.
When reading data for splits, query the shard directly rather than going through the mongos. This may produce inaccurate results if there are aborted or incomplete chunk migrations in progress during execution of the job. Make sure to turn off the balancer and clean up orphaned chunks before using this option.
Use the chunk boundaries on a MongoDB sharded cluster's config server to determine input splits. Chunk data is read through one or more mongos instances, as described in the connection string provided to mongo.input.uri or in mongo.input.mongos_hosts. If both this option and mongo.input.split.read_from_shards are true, then this option takes priority.
If you want to customize the split point for efficiency reasons (such as a different distribution) on an un-sharded collection, you may set this to any valid field name. The restrictions on this key are the same rules that apply when sharding an existing MongoDB collection: you must have an index on the field, and follow the other rules outlined in the MongoDB documentation.
Lower-bound for splits created using the index described by mongo.input.split.split_key_pattern. This value must be set to a JSON string that describes a point in the index. This setting must be used in conjunction with mongo.input.split.split_key_pattern and mongo.input.split.split_key_max.
Upper-bound for splits created using the index described by mongo.input.split.split_key_pattern. This value must be set to a JSON string that describes a point in the index. This setting must be used in conjunction with mongo.input.split.split_key_pattern and mongo.input.split.split_key_min.
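Putting the three settings together, here is a sketch using a hypothetical customer_id index; the field name, the bounds, and the index-style JSON used for the pattern value are all assumptions made for illustration:

```java
import org.apache.hadoop.conf.Configuration;

public class SplitKeySetup {
    static void configureSplitKey(Configuration conf) {
        // Hypothetical { customer_id: 1 } index; min/max are JSON points in that index.
        conf.set("mongo.input.split.split_key_pattern", "{\"customer_id\": 1}");
        conf.set("mongo.input.split.split_key_min", "{\"customer_id\": 0}");
        conf.set("mongo.input.split.split_key_max", "{\"customer_id\": 1000000}");
    }
}
```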
When set to true, will attempt to read and calculate split points for each BSON file in the input. When set to false, will create just one split for each input file, consisting of the entire length of the file. Defaults to true.
The split size when computing splits with StandaloneMongoSplitter. Defaults to 8MB.
Set a lower bound on acceptable size for file splits (in bytes). Defaults to 1.
Set an upper bound on acceptable size for file splits (in bytes). Defaults to LONG_MAX_VALUE.
Automatically save any split information calculated for input files, by writing to corresponding .splits files. Defaults to true.
Build a .splits file on the fly when constructing an output BSON file. Defaults to false.
The location where .splits files should go. If left unset, .splits files will go into the same directory as their BSON file counterpart, or into the current working directory.
Set the class name for a [PathFilter](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PathFilter.html) to filter which files to process when scanning directories for BSON input files. Defaults to null (no additional filtering). You can set this value to com.mongodb.hadoop.BSONPathFilter to restrict input to only files ending with the ".bson" extension.
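A minimal sketch of what such a PathFilter can look like, in case the stock BSONPathFilter does not fit; the suffix check here is just an illustration:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

/** Accepts only paths ending in ".bson", similar in spirit to
 *  com.mongodb.hadoop.BSONPathFilter; shown purely as an example. */
public class BsonOnlyPathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return path.getName().toLowerCase().endsWith(".bson");
    }
}
```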
When set to true, turns on INFO-level logging for the Job.
When set to true, runs the Job in the background. If set to false, the RunningJob blocks until the Job is finished. See http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#waitForCompletion%28%29.