-
Notifications
You must be signed in to change notification settings - Fork 2
/
parsync.txt
298 lines (227 loc) · 13.5 KB
/
parsync.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
parsync - a parallel rsync wrapper for large data transfers
===========================================================
by Harry Mangalam <harry.mangalam@uci.edu>
v1.68 (Mac Beta)
Mar 27, 2017
//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// Convert this file to HTML & move to it's final dest with the command:
// export fileroot="/home/hjm/parsync/parsync"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.[ht]* moo.nac.uci.edu:~/public_html/parsync
== Download
If you already know you want it, get it here:
http://moo.nac.uci.edu/~hjm/parsync/parsync+utils.tar.gz[parsync+utils.tar.gz]
(contains 'parsync' plus the 'kdirstat-cache-writer', 'stats', and 'scut' utilities below)
Extract it into a dir on your $PATH and after verifying the other dependencies below,
give it a shot.
While parsync is developed for and test on Linux, the latest version of parsync has been
modified to (mostly) work on the Mac (tested on OSX 10.9.5). A number of the Linux-specific
dependencies have been removed and there are a number of Mac-specific work arounds. Thanks to
Phil Reese <preese@stanford.edu> for the code mods needed to get it started. It's the same package
and instructions for both platforms.
[[deps]]
== Dependencies
parsync requires the following utilities to work:
//* https://www.kernel.org/pub/software/network/ethtool/[ethtool] - std Linux tool for probing ethernet interfaces. 'Install from repositories.'
//* iwconfig - std Linux tool for probing wireless interfaces. 'Install from repositories.'
//* http://gael.roualland.free.fr/ifstat/[ifstat] - std Linux tool for extracting metrics from network interfaces. 'Install from repositories.'
* http://moo.nac.uci.edu/~hjm/parsync/utils/stats[stats] - self-writ Perl utility for providing descriptive stats on 'STDIN'
* http://moo.nac.uci.edu/~hjm/parsync/utils/scut[scut] - self-writ Perl utility like 'cut' that allows regex split tokens
* kdirstat-cache-writer (included in the tarball mentioned above), requires a
//* https://www.openfabrics.org/mediawiki/index.php/Basic_Commands#ibstat[ibstat] - part of the https://www.openfabrics.org/[OFED] package to identify Infiniband characteristics.
non-default Perl utility: *URI::Escape qw(uri_escape)*
--------------------------------------------------------------------
sudo yum install perl-URI # CentOS-like
sudo apt-get install liburi-perl # Debian-like
--------------------------------------------------------------------
'parsync' needs to be installed only on the SOURCE end of the transfer and
uses whatever 'rsync' is available on the TARGET. It uses a number of Linux-
specific utilities so if you're transferring between Linux and a FreeBSD
host, install parsync on the Linux side. In fact, as currently written, it will
'only PUSH data to remote targets'; it will not pull data as rsync itself can do.
This will probably in the near future.
== Overview
http://rsync.samba.org[rsync] is a fabulous data mover. Possibly more bytes have been moved
(or have been prevented from being moved) by rsync than by any other application.
So what's not to love?
For transferring large, deep file trees, rsync will pause while it generates
lists of files to process.
Since Version 3, it does this pretty fast, but on sluggish filesystems, it
can take hours or even days
before it will start to actually exchange rsync data.
Second, due to various bottlenecks, rsync will tend to use less than the
available bandwidth on high speed networks. Starting multiple instances of rsync
can improve this significantly. However, on such transfers, it is also easy to overload
the available bandwidth, so it would be nice to both limit the bandwidth used if
necessary and also to limit the load on the system.
parsync tries to satisfy all these conditions and more by:
* using the http://moo.nac.uci.edu/~hjm/parsync/utils/kdirstat-cache-writer[kdir-cache-writer]
utility from the beautiful http://kdirstat.sourceforge.net/[kdirstat] directory browser which
can produce lists of files very rapidly
* allowing re-use of the cache files so generated.
* doing crude loadbalancing of the number of active rsyncs, suspending and un-suspending
the processes as necessary.
* using rsync's own bandwidth limiter (--bwlimit) to throttle the total bandwidth.
* using rsync's own vast option selection is available as a pass-thru (tho limited to
those compatible with the '--files-from' option).
.Only use for LARGE data transfers
[IMPORTANT]
==========================================================================
The main use case for parsync is really only very large data transfers thru
fairly fast network connections (>1Gb/s). Below this speed, a single rsync
can saturate the connection, so there's little reason to use parsync and
in fact the overhead of testing the existence of and starting more rsyncs
tends to worsen its performance on small transfers to slightly less than
rsync alone.
==========================================================================
Beyond this introduction, parsync's internal help is about all you'll need to
figure out how to use it; below is what you'll see when you type 'parsync -h'.
There are still edge cases where parsync will fail or behave oddly,
especially with small data transfers, so I'd be happy to hear of such misbehavior or
suggestions to improve it.
Download the complete tarball of parsync, plus the required utilities here:
http://moo.nac.uci.edu/~hjm/parsync/parsync+utils.tar.gz[parsync+utils.tar.gz]
Unpack it, move the contents to a dir on your '$PATH', chmod it executable, and try it out.
parsync --help
or just
parsync
Below is what you should see:
parsync help
-------------
-------------------------------------------------------------------------
parsync version 1.67 (Mac compatibility beta) Jan 22, 2017
by Harry Mangalam <hjmangalam@gmail.com> || <harry.mangalam@uci.edu>
parsync is a Perl script that wraps Andrew Tridgell's miraculous 'rsync' to
provide some load balancing and parallel operation across network connections
to increase the amount of bandwidth it can use.
parsync is primarily tested on Linux, but (mostly) works on MaccOSX
as well.
parsync needs to be installed only on the SOURCE end of the
transfer and only works in local SOURCE -> remote TARGET mode
(it won't allow remote local SOURCE <- remote TARGET, emitting an
error and exiting if attempted).
It uses whatever rsync is available on the TARGET. It uses a number
of Linux-specific utilities so if you're transferring between Linux
and a FreeBSD host, install parsync on the Linux side.
The only native rsync option that parsync uses is '-a' (archive) &
'-s' (respect bizarro characters in filenames).
If you need more, then it's up to you to provide them via
'--rsyncopts'. parsync checks to see if the current system load is
too heavy and tries to throttle the rsyncs during the run by
monitoring and suspending / continuing them as needed.
It uses the very efficient (also Perl-based) kdirstat-cache-writer
from kdirstat to generate lists of files which are summed and then
crudely divided into NP jobs by size.
It appropriates rsync's bandwidth throttle mechanism, using '--maxbw'
as a passthru to rsync's 'bwlimit' option, but divides it by NP so
as to keep the total bw the same as the stated limit. It monitors and
shows network bandwidth, but can't change the bw allocation mid-job.
It can only suspend rsyncs until the load decreases below the cutoff.
If you suspend parsync (^Z), all rsync children will suspend as well,
regardless of current state.
Unless changed by '--interface', it tried to figure out how to set the
interface to monitor. The transfer will use whatever interface routing
provides, normally set by the name of the target. It can also be used for
non-host-based transfers (between mounted filesystems) but the network
bandwidth continues to be (usually pointlessly) shown.
[[NB: Between mounted filesystems, parsync sometimes works very poorly for
reasons still mysterious. In such cases (monitor with 'ifstat'), use 'cp'
or 'tnc' (https://goo.gl/5FiSxR) for the initial data movement and a single
rsync to finalize. I believe the multiple rsync chatter is interfering with
the transfer.]]
It only works on dirs and files that originate from the current dir (or
specified via "--rootdir"). You cannot include dirs and files from
discontinuous or higher-level dirs.
** the ~/.parsync files **
The ~/.parsync dir contains the cache (*.gz), the chunk files (kds*), and the
time-stamped log files. The cache files can be re-used with '--reusecache'
(which will re-use ALL the cache and chunk files. The log files are
datestamped and are NOT overwritten.
** Odd characters in names **
parsync will sometimes refuse to transfer some oddly named files, altho
recent versions of rsync allow the '-s' flag (now a parsync default)
which tries to respect names with spaces and properly escaped shell
characters. Filenames with embedded newlines, DOS EOLs, and other
odd chars will be recorded in the log files in the ~/.parsync dir.
** Because of the crude way that files are chunked, NP may be
adjusted slightly to match the file chunks. ie '--NP 8' -> '--NP 7'.
If so, a warning will be issued and the rest of the transfer will be
automatically adjusted.
OPTIONS
=======
[i] = integer number
[f] = floating point number
[s] = "quoted string"
( ) = the default if any
--NP [i] (sqrt(#CPUs)) ............... number of rsync processes to start
optimal NP depends on many vars. Try the default and incr as needed
--startdir [s] (`pwd`) .. the directory it works relative to. If you omit
it, the default is the CURRENT dir. You DO have
to specify target dirs. See the examples below.
--maxbw [i] (unlimited) .......... in KB/s max bandwidth to use (--bwlimit
passthru to rsync). maxbw is the total BW to be used, NOT per rsync.
--maxload [f] (NP+2) ........ max total system load - if sysload > maxload,
sleeps an rsync proc for 10s
--checkperiod [i] (5) .......... sets the period in seconds between updates
--rsyncopts [s] ... options passed to rsync as a quoted string (CAREFUL!)
this opt triggers a pause before executing to verify the command.
--interface [s] ............. network interface to /monitor/, not nec use.
default: `/sbin/route -n | grep "^0.0.0.0" | rev | cut -d' ' -f1 | rev`
above works on most simple hosts, but complex routes will confuse it.
--reusecache .......... don't re-read the dirs; re-use the existing caches
--email [s] ..................... email address to send completion message
(requires working mail system on host)
--barefiles ..... set to allow rsync of individual files, as oppo to dirs
--nowait ................ for scripting, sleep for a few s instead of wait
--version ................................. dumps version string and exits
--help ......................................................... this help
Examples
========
-- Good example 1 --
% parsync --maxload=5.5 --NP=4 --startdir='/home/hjm' dir1 dir2 dir3
hjm@remotehost:~/backups
where
= "--startdir='/home/hjm'" sets the working dir of this operation to
'/home/hjm' and dir1 dir2 dir3 are subdirs from '/home/hjm'
= the target "hjm@remotehost:~/backups" is the same target rsync would use
= "--NP=4" forks 4 instances of rsync
= -"-maxload=5.5" will start suspending rsync instances when the 5m system
load gets to 5.5 and then unsuspending them when it goes below it.
It uses 4 instances to rsync dir1 dir2 dir3 to hjm@remotehost:~/backups
-- Good example 2 --
% parsync --rsyncopts="--ignore-existing" --reusecache --NP=3
--barefiles *.txt /mount/backups/txt
where
= "--rsyncopts='--ignore-existing'" is an option passed thru to rsync
telling it not to disturb any existing files in the target directory.
= "--reusecache" indicates that the filecache shouldn't be re-generated,
uses the previous filecache in ~/.parsync
= "--NP=3" for 3 copies of rsync (with no "--maxload", the default is 4)
= "--barefiles" indicates that it's OK to transfer barefiles instead of
recursing thru dirs.
= "/mount/backups/txt" is the target - a local disk mount instead of a network host.
It uses 3 instances to rsync *.txt from the current dir to "/mount/backups/txt".
-- Error Example 1 --
% pwd
/home/hjm # executing parsync from here
% parsync --NP4 --compress /usr/local /media/backupdisk
why this is an error:
= '--NP4' is not an option (parsync will say "Unknown option: np4")
It should be '--NP=4'
= if you were trying to rsync '/usr/local' to '/media/backupdisk',
it will fail since there is no /home/hjm/usr/local dir to use as
a source. This will be shown in the log files in
~/.parsync/rsync-logfile-<datestamp>_#
as a spew of "No such file or directory (2)" errors
= the '--compress' is a native rsync option, not a native parsync option.
You have to pass it to rsync with "--rsyncopts='--compress'"
The correct version of the above command is:
% parsync --NP=4 --rsyncopts='--compress' --startdir=/usr local
/media/backupdisk
-- Error Example 2 --
% parsync --start-dir /home/hjm mooslocal hjm@moo.boo.yoo.com:/usr/local
why this is an error:
= this command is trying to PULL data from a remote SOURCE to a
local TARGET. parsync doesn't support that kind of operation yet.
The correct version of the above command is:
# ssh to hjm@moo, install parsync, then:
% parsync --startdir=/usr local hjm@remote:/home/hjm/mooslocal
-------------------------------------------------------------------------