This is an example of how to configure a compaction strategy. By default, Accumulo will always use the DefaultCompactionStrategy unless these steps are taken to change the configuration. Use the strategy and settings that best fit your Accumulo setup. This example shows how to configure a non-default strategy. Note that this example requires Hadoop native libraries built with snappy in order to use snappy compression. Within this example, commands starting with user@uno> are run from within the Accumulo shell, whereas commands beginning with $ are executed from a command line terminal.
Start by creating a table that will be used for the compactions.
user@uno> createnamespace examples
user@uno> createtable examples.test1
Take note of the TableID for examples.test1. This will be needed later. The TableID can be found by running:
user@uno> tables -l
accumulo.metadata => !0
accumulo.replication => +rep
accumulo.root => +r
examples.test1 => 2
The commands below will configure the desired compaction strategy. The goals are:
- Avoid compacting files over 250M.
- Compact files over 100M using gz.
- Compact files less than 100M using snappy.
- Limit the compaction throughput to 40MB/s.
Create a compaction service named cs1 that has three executors. The first executor, named small, has 8 threads and runs compactions smaller than 16M. The second executor, medium, runs compactions smaller than 128M with 4 threads. The last executor, large, runs all other compactions with 2 threads.
user@uno> config -s tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
user@uno> config -s 'tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]'
Create a compaction service named cs2 that has three executors. It has a similar configuration to cs1, but its executors have fewer threads. For service cs2, files over 250M should not be compacted. It also limits the total I/O of all compactions within the service to 40MB/s.
user@uno> config -s tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
user@uno> config -s 'tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]'
user@uno> config -s tserver.compaction.major.service.cs2.rate.limit=40M
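At this point the service configuration can be sanity checked from the shell. The shell's config command accepts a property filter (-f), so the following should list all of the cs1 and cs2 properties set above (exact output formatting varies by Accumulo version):
user@uno> config -f tserver.compaction.major.service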
Configurations can be verified for correctness with the check-compaction-config tool in Accumulo. Place your compaction configuration into a file and run the tool. For example, create a file myconfig that contains the following:
tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]
tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]
tserver.compaction.major.service.cs2.rate.limit=40M
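One way to create this file from the terminal is with a heredoc, shown below as a sketch; /path/to/myconfig is a placeholder path, and quoting EOF keeps the shell from expanding anything inside the file body.
$ cat > /path/to/myconfig <<'EOF'
tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]
tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]
tserver.compaction.major.service.cs2.rate.limit=40M
EOF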
The following command would check the configuration for errors:
$ accumulo check-compaction-config /path/to/myconfig
With the compaction configuration set, configure table-specific properties.
Configure the compression for table examples.test1. Files over 100M will be compressed using gz. All others will be compressed via snappy.
user@uno> config -t examples.test1 -s table.compaction.configurer=org.apache.accumulo.core.client.admin.compaction.CompressionConfigurer
user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.threshold=100M
user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.type=gz
user@uno> config -t examples.test1 -s table.file.compress.type=snappy
user@uno> config -t examples.test1 -s table.compaction.dispatcher=org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher
Set table examples.test1 to use compaction service cs1 for system compactions and service cs2 for user compactions.
user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service=cs1
user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.user=cs2
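To confirm the table-level settings, the same config filter can be scoped to the table. This is just a sanity check; the exact output formatting varies by version.
user@uno> config -t examples.test1 -f table.compaction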
If needed, chop compactions (the compactions Accumulo runs as part of table merge operations) can also be configured.
user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.chop=cs2
Generate some data and files in order to test the strategy:
$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 2000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
$ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"
View the tserver log in <accumulo_home>/logs for the compaction and find the name of the rfile that was compacted for your table. Print info about this file using the rfile-info tool. Replace the TableID with the TableID from above. Note that your filenames will differ from the ones in this example.
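If digging through the logs is inconvenient, listing the tablet's directory in HDFS is another way to find the rfile name. This assumes the default /accumulo instance volume and the default_tablet directory used in the commands below:
$ hdfs dfs -ls /accumulo/tables/2/default_tablet/
Then print info about the chosen file: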
$ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000a.rf
Details about the rfile will be printed. The compression type should match the type used in the compaction. In this case, snappy is used since the size is less than 100M.
Meta block : RFile.index
Raw size : 168 bytes
Compressed size : 127 bytes
Compression type : snappy
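rfile-info can print more than the summary above; depending on the version, it has options for dumping keys and values or printing histograms. List the options supported by your build:
$ accumulo rfile-info --help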
Continue by adding more data:
$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 1000000 --num 1000000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 2000000 --num 1000000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
$ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"
Again, view the tserver log in <accumulo_home>/logs for the compaction and find the name of the rfile that was compacted for your table. Print info about this file using the rfile-info tool:
$ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000o.rf
In this case, the compression type should be gz.
Meta block : RFile.index
Raw size : 56,044 bytes
Compressed size : 21,460 bytes
Compression type : gz
Examining the size of A000000o.rf within HDFS should verify that the rfile is greater than 100M.
$ hdfs dfs -ls -h /accumulo/tables/2/default_tablet/A000000o.rf