Skip to content

Alternative archiving tool with fast performance for huge numbers of small files

License

Notifications You must be signed in to change notification settings

replicon/fast-archiver

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

74 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fast-archiver

fast-archiver is a command-line tool for archiving directories, and restoring those archives written in [Go](http://golang.org).

fast-archiver uses a few techniques to try to be more efficient than traditional tools:

  1. It reads a number of files concurrently and then serializes the output. Most other tools use sequential file processing, where operations like open(), lstat(), and close() can cause a lot of overhead when reading huge numbers of small files. Making these operations concurrent means that the tool is more often reading and writing data than you would be otherwise.
  2. It begins archiving files before it has completed reading the directory entries that it is archiving, allowing for a fast startup time compared to tools that first create an inventory of files to transfer.

How Fast?

On a test workload of 2,089,214 files representing a total of 90 GiB of data, fast-archiver was compared with tar and rsync for reading data files and transfering them over a network. The test scenario was a PostgreSQL database, with many of the files being small, 8-24kiB in size.

Compared with tar, fast-archiver took 33% of the execution time (27m 38s vs. 1h 23m 23s) to read the test workload and output the archive to /dev/null. The tar output had to be redirected through cat to create a comparable scenario, because tar recognized /dev/null and shortcuts the actual data file reading and writing. Here's the raw timing output for some hard data:

$ time fast-archiver -c -o /dev/null /db/data
skipping symbolic link /db/data/pg_xlog
1008.92user 663.00system 27:38.27elapsed 100%CPU (0avgtext+0avgdata 24352maxresident)k
0inputs+0outputs (0major+1732minor)pagefaults 0swaps

$ time tar -cf - /db/data | cat > /dev/null
tar: Removing leading `/' from member names
tar: /db/data/base/16408/12445.2: file changed as we read it
tar: /db/data/base/16408/12464: file changed as we read it
32.68user 375.19system 1:23:23elapsed 8%CPU (0avgtext+0avgdata 81744maxresident)k
0inputs+0outputs (0major+5163minor)pagefaults 0swaps

Compared with rsync, fast-archiver piped over ssh can transfer the database from one machine to another in 1h 30m, vs. rsync in 3h.

These huge reductions in time may not be typical, but they happen to be the workload that fast-archiver was designed for.

Examples

Creates an archive (-c) reading the directory target1, and redirects the archive to the file named target1.fast-archive:

fast-archiver -c target1 > target1.fast-archive
fast-archiver -c -o target1.fast-archive target1

Extracts the archive target1.fast-archive into the current directory:

fast-archiver -x < target1.fast-archive
fast-archiver -x -i target1.fast-archive

Creates a fast-archive remotely, and restores it locally, piping the data through ssh:

ssh postgres@10.32.32.32 "cd /db; fast-archive -c data --exclude=data/\*.pid" | fast-archiver -x

Installation

The fast-archiver repository contains both a command-line tool (at the root) and a package called falib which contains the archive reading and writing code. To make the build work correctly with both the library and the command-line tool, it's necessary to setup the correct GOPATH and directory references.

Here's a quick set of steps to setup the build:

  • Install [Go](http://golang.org).
  • Setup $GOPATH, for example: export GOPATH=$HOME/go-projects. Probably better to set it up in your .bash_aliases.
  • go get -u github.com/replicon/fast-archiver

_or_

  • go get -d github.com/replicon/fast-archiver && $GOPATH/src/github.com/replicon/fast-archiver/build.sh

Command-line arguments

-x Extract archive mode.
-c Create archive mode.
--multicpu Allows concurrent activities to run on the specified number of CPUs. Since the archiving is dominated by I/O, additional CPUs tend to just add overhead in communicating between concurrent processes, but it could increase throughput in some scenarios. Defaults to 1.

Create-mode only

-o Output path for the archive. Defaults to stdout.
--exclude A colon-separated list of paths to exclude from the archive. Can include wildcards and other shell matching constructs.
--block-size Specifies the size of blocks being read from disk, in bytes. The larger the block size, the more memory fast-archiver will use, but it could result in higher I/O rates. Defaults to 4096, maximum value is 65535.
--dir-readers The maximum number of directories that will be read concurrently. Defaults to 16.
--file-readers The maximum number of files that will be read concurrently. Defaults to 16.
--queue-dir The maximum size of the queue for sub-directory paths to be processed. Defaults to 128.
--queue-read The maximum size of the queue for file paths to be processed. Defaults to 128.
--queue-write The maximum size of the block queue for archive output. Increasing this will increase the potential memory usage, as (queue-write * block-size) memory could be allocated for file reads. Defaults to 128.

Extract-mode only

-i Input path for the archive. Defaults to stdin.
--ignore-perms Do not restore permissions on files and directories.
--ignore-owners
 Do not restore uid and gid on files and directories.

About

Alternative archiving tool with fast performance for huge numbers of small files

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •