<!doctype linuxdoc system>
<!-- This document is the SGML "linuxdoc" flavor described in the
"Howtos-with-LinuxDoc-mini-HOWTO", found at the following URL.
http://www.tldp.org/HOWTO/Howtos-with-LinuxDoc.html
This HOWTO was originally written by Sam Hopkins and is currently
maintained by Ed L. Cashin.
-->
<article>
<title>EtherDrive® storage and Linux 2.6
<!-- a technical "How To Guide" -->
<author>Sam Hopkins and Ed L. Cashin <tt/{sah,ecashin}@coraid.com/
<date>April 2008
<abstract>
Using network data storage with <url url="http://www.coraid.com/documents/AoEr10.txt"
name="ATA over Ethernet"> is easy after understanding a few
simple concepts.
This document explains how to use AoE targets from a Linux-based
operating system, but the basic principles are applicable to other
systems that use AoE devices. Below we begin by explaining the key
components of the network communication method, ATA over Ethernet
(AoE). Next, we discuss the way a Linux host uses AoE devices,
providing several examples.
A list of frequently asked questions follows, and the document ends
with
supplementary information.
</abstract>
<toc>
<sect>The EtherDrive System
<p>
The ATA over Ethernet network protocol allows any type of data
storage to be used over a local ethernet network. An "AoE target"
receives ATA read and write commands, executes them, and returns
responses to the "AoE initiator" that is using the storage.
These AoE commands and responses appear on the network as ethernet
frames with type 0x88a2, the IEEE-registered Ethernet type for <url
url="http://www.coraid.com/documents/AoEr10.txt" name="ATA over
Ethernet (AoE)">. An AoE target is identified by a pair of numbers:
the shelf address, and the slot address.
For example, the Coraid SR appliance can perform RAID internally on
its SATA disks, making the resulting storage capacity available on the
ethernet network as one or more AoE targets. All of the targets will
have the same shelf address because they are all exported by the same
SR. They will have different AoE slot addresses, so that each AoE
target is individually addressable. The SR documentation calls each
target a "LUN". Each LUN behaves like a network disk.
Using EtherDrive technology like the SR appliance is as simple as
sending and receiving AoE packets.
To a Linux-based system running the "aoe" driver, it doesn't matter
what the remote AoE device really is. All that matters is that the
AoE protocol can be used to communicate with a device identified by a
certain shelf and slot address.
<sect>How Linux Uses The EtherDrive System
<p>
For security and performance reasons, many people use a second,
dedicated network
interface card (NIC) for ATA over
Ethernet traffic.
A NIC must be up before it can perform any networking, including AoE.
On examining the output of the <tt>ifconfig</tt> command, you should
see your AoE NIC listed as "UP" before attempting to use an AoE device
reachable via that NIC.
You can <bf>activate the NIC</bf> with a simple <tt>ifconfig eth1
up</tt>, using the appropriate device name instead of "eth1". Note
that assigning an IP address is not necessary if the NIC is being used
only for AoE traffic, but having an IP address on a NIC used for AoE
will not interfere with AoE.
On a Linux system, block devices are used via special files called
device nodes. A familiar example is <tt>/dev/hda</tt>. When a block
device node is opened and used, the kernel translates operations on
the file into operations on the corresponding hardware.
Each accessible AoE target on your network is represented by a disk
device node in the <tt>/dev/etherd/</tt> directory and can be used
just like any other direct attached disk. The "aoe" device driver is
an open-source loadable kernel module authored by Coraid. It
translates system reads/writes on a device into AoE request frames for
the associated remote EtherDrive storage device. When the AoE
responses from the device are received, the corresponding system
read/write call is acknowledged as complete. The aoe device driver
handles retransmissions in the event of network congestion.
The association of AoE targets on your network to device nodes in
<tt>/dev/etherd/</tt> follows a simple naming scheme. Each device
node is named eX.Y, where X represents a shelf address and Y
represents a slot address. Both X and Y are decimal integers. As an
example, the following command displays the first 4 KiB of data from
the AoE target with shelf address 0 and slot address 1.
<tscreen><verb>
dd if=/dev/etherd/e0.1 bs=1024 count=4 | hexdump -C
</verb></tscreen>
Creating an ext3 filesystem on the same AoE target is as simple
as ...
<tscreen><verb>
mkfs.ext3 /dev/etherd/e0.1
</verb></tscreen>
Notice that the filesystem goes directly on the block device. There's
no need for any intermediate "format" or partitioning step.
Although partitions are not usually needed, they may be created using
a tool like fdisk or GNU parted.
Please see the <ref id="dospart" name="FAQ entry about partition
tables"> for important caveats.
Partitions are used by adding "p" and the partition number to
the device name. For example, <tt>/dev/etherd/e0.3p1</tt> is the
first partition on the AoE target with shelf address zero and slot
address three.
After creating a filesystem, it can be mounted in the normal way. It
is important to remember to unmount the filesystem before shutting
down your network devices. Without networking, there is no way to
unmount a filesystem that resides on a disk across the network.
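For example, the ext3 filesystem created above could be mounted and
later unmounted as follows (a sketch, assuming a mount point of
<tt>/mnt/aoe</tt>):
<tscreen><verb>
mkdir /mnt/aoe
mount /dev/etherd/e0.1 /mnt/aoe
# ... and before shutting down networking:
umount /mnt/aoe
</verb></tscreen>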
It is best to update your init scripts so that filesystems on
EtherDrive storage are unmounted early in the system-shutdown
procedure, before network interfaces are shut down.
<ref
id="aoeinit" name="An example"> is found below in the <ref id="faq"
name="list of Frequently Asked Questions">.
The device nodes in <tt>/dev/etherd/</tt> are usually created in one
of three ways:
<enum>
<item>Most distributions today use udev to dynamically create device nodes
as needed. You can configure udev to create the device nodes for your
AoE disks. (For an example of udev
configuration rules, see <ref id="udev" name="Why do my device nodes
disappear after a reboot?"> in the <ref id="faq" name="FAQ section"> below.)
<item>If you are using the standalone aoe driver, as opposed to the
one distributed with the Linux kernel, and you are not using udev, the
Makefile will create device
nodes for you when you do a "make install".
<item>If you are not using udev you can use static device nodes. Use
the <tt>aoe_dyndevs=0</tt> module load option for the aoe driver.
(You do not need this option if your aoe driver is older than version
aoe6-50.) Then the
<tt>aoe-mkdevs</tt> and <tt>aoe-mkshelf</tt> scripts in the <url
url="http://aoetools.sourceforge.net/" name="aoetools"> package can be
used to
create the static device nodes manually, as shown in the sketch after
this list. It is very important to
avoid using these static device nodes with an aoe driver that has the
aoe_dyndevs module parameter set to 1, because you could accidentally
use the wrong device.
</enum>
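Here is a minimal sketch of the static approach, assuming shelf
address zero:
<tscreen><verb>
modprobe aoe aoe_dyndevs=0
aoe-mkshelf /dev/etherd 0
</verb></tscreen>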
<sect>The ATA over Ethernet Tools
<p>
The aoe kernel driver allows Linux to do ATA over Ethernet. In
addition to the aoe driver, there is a collection of helpful programs
that operate outside of the kernel, in "user space". This collection
of tools and documentation is called the aoetools, and may be found at
<url
url="http://aoetools.sourceforge.net/"
name="http://aoetools.sourceforge.net/">.
Current aoe drivers from the Coraid website are bundled with a
compatible version of the aoetools. This HOWTO may make reference to
commands from the aoetools, like the aoe-stat command.
<sect1>Limiting AoE traffic to certain network interfaces
<p>
By default, the aoe driver will use any local network interface
available to reach an AoE target. Most of the time, though, the
administrator expects legitimate AoE targets to appear only on certain
ethernet interfaces, e.g., "eth1" and "eth2".
Using the <tt>aoe-interfaces</tt> command from the aoetools package
allows the administrator to limit AoE activity to a set list of
ethernet interfaces.
This configuration is especially important when some ethernet
interfaces are on networks where an unexpected AoE target with the
same shelf and slot address as a production AoE target might appear.
Please see the <tt>aoe-interfaces</tt> manpage
for more information.
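For example, to limit AoE traffic to eth2 and eth3 at runtime (a
sketch; check the manpage for your version's exact usage):
<tscreen><verb>
aoe-interfaces eth2 eth3
</verb></tscreen>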
At module load time the list of allowable interfaces may be set with
the "aoe_iflist" module parameter.
<tscreen><verb>
modprobe aoe 'aoe_iflist=eth2 eth3'
</verb></tscreen>
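To make the setting persistent, an options line can go in modprobe's
configuration (a sketch; the file location varies by distribution,
e.g., <tt>/etc/modprobe.conf</tt> or a file under
<tt>/etc/modprobe.d/</tt>):
<tscreen><verb>
options aoe aoe_iflist="eth2 eth3"
</verb></tscreen>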
<sect>EtherDrive storage and Linux Software RAID
<p>
Some AoE devices are internally redundant. A Coraid SR1521, for example,
might be exporting a 14-disk RAID 5 as a single 9.75 terabyte LUN.
In that case, the AoE target itself is performing RAID, enhancing
performance and reliability.
You can also perform RAID on the AoE initiator. Linux Software RAID
can increase performance by striping over multiple AoE targets and
reliability by using data redundancy. Reading the <url
url="http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html" name="Linux
Software RAID HOWTO"> before you start to work with RAID will likely
save time in the long run. The Linux
kernel has an "md" driver that performs the Software RAID, and there
are
several tool
sets that allow you to use this kernel feature.
The main software package for using the md driver is <url
url="http://www.cse.unsw.edu.au/~neilb/source/mdadm/" name="mdadm">.
Less popular alternatives include the older raidtools package <ref
id="archives" name="(discussed in the Archives below)">, and <url
url="http://evms.sourceforge.net/" name="EVMS">.
<sect1>Example: RAID 5 with mdadm
<p>
In this example we have five AoE targets in shelves 0-4, with each
shelf exporting a single LUN 0. The following mdadm command uses these five
AoE devices as RAID components, creating a level-5 RAID array. The md
configuration information is stored on the components themselves in
"md superblocks", which can be examined with another mdadm command.
<tscreen><verb>
# mdadm -C -n 5 --level=raid5 --auto=md /dev/md0 /dev/etherd/e[0-4].0
mdadm: array /dev/md0 started.
# mdadm --examine /dev/etherd/e0.0
/dev/etherd/e0.0:
Magic : a92b4efc
Version : 00.90.00
UUID : 46079e2f:a285bc60:743438c8:144532aa (local to host ellijay)
...
</verb></tscreen>
<p>
The <tt>/proc/mdstat</tt> file contains summary information about the
RAID as reported by the kernel itself.
<tscreen><verb>
# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : active raid5 etherd/e4.0[5] etherd/e3.0[3] etherd/e2.0[2] etherd/e1.0[1] etherd/e0.0[0]
5860638208 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
[>....................] recovery = 0.0% (150272/1465159552) finish=23605.3min speed=1032K/sec
unused devices: <none>
</verb></tscreen>
Until md finishes initializing the parity of the RAID, performance is
sub-optimal, and the RAID will not be usable if one of the components
fails during initialization. After initialization is complete, the md
device can continue
to be used even if one component fails.
Later the array can be stopped in order to shut it down cleanly in
preparation for a system reboot or halt.
<tscreen><verb>
# mdadm -S /dev/md0
</verb></tscreen>
In a system init script (see <ref id="aoeinit" name="the aoe-init
example in the FAQ">) an mdadm command can assemble the RAID
components using the configuration information that was stored on them
when the RAID was created.
<tscreen><verb>
# mdadm -A /dev/md0 /dev/etherd/e[0-4].0
mdadm: /dev/md0 has been started with 5 drives.
</verb></tscreen>
To make an xfs filesystem on the RAID array and mount it, the
following commands can be issued:
<tscreen><verb>
# mkfs -t xfs /dev/md0
# mkdir /mnt/raid
# mount /dev/md0 /mnt/raid
</verb></tscreen>
Once md has finished initializing the RAID, the storage is
single-fault tolerant: Any of the components can fail without making
the storage unavailable. Once a single component has failed, the md
device is said to be in a "degraded" state. Using a degraded array is
fine, but a degraded array cannot remain usable if another component
fails.
Adding hot spares makes the array even more robust. Having hot spares
allows md to bring a new component into the RAID as soon as one of its
components has failed so that the normal state may be achieved as
quickly as possible. You can check <tt>/proc/mdstat</tt> for
information on the initialization's progress.
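For example, adding another device to the healthy array turns it into
a hot spare (a sketch, assuming a sixth AoE target e5.0 exists):
<tscreen><verb>
# mdadm --add /dev/md0 /dev/etherd/e5.0
</verb></tscreen>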
The new write-intent bitmap feature can dramatically reduce the time
needed for re-initialization after a component fails and is later
added back to the array. Reducing the time the RAID spends in
degraded mode makes a double fault less likely. Please see the mdadm
manpages for details.
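For instance, with a reasonably recent mdadm, an internal write-intent
bitmap can be added to an existing array (a sketch):
<tscreen><verb>
# mdadm --grow --bitmap=internal /dev/md0
</verb></tscreen>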
<sect1>Important notes
<p>
<enum>
<item>Some Linux distributions come with an mdmonitor service running
by default. Unless you configure the mdmonitor to do what you want,
consider turning off this service with <tt>chkconfig mdmonitor
off</tt> and <tt>/etc/init.d/mdmonitor stop</tt> or your system's
equivalent commands. If mdadm is running in its "monitor" mode
without being properly configured, it may interfere with failover to
hot spares, the stopping of the RAID, and other actions.
<item>There is a problem with the way some 2.6 kernels determine
whether an I/O device is idle. On these kernels, RAID initialization
is about five times slower than it needs to be.
On these kernels you can do the following to work around the problem:
<tscreen><verb>
echo 100000 > /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min
</verb></tscreen>
</enum>
<sect>FAQ (contains important info)<label id="faq">
<p>
<sect1>Q: How does the system know about the AoE targets on the network?
<p>
A: When an AoE target comes online, it emits a broadcast
frame indicating its presence. In addition to this mechanism,
the AoE initiator may send out a query frame to discover
any new AoE targets.
The Linux aoe driver, for example, sends an
AoE query once per minute. The discovery can be triggered
manually with the "aoe-discover" tool, one of the
<url url="http://aoetools.sourceforge.net/" name="aoetools">.
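For example, to trigger discovery by hand:
<tscreen><verb>
aoe-discover
</verb></tscreen>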
<sect1>Q: How do I see what AoE devices the system knows about?
<p>
A: The /usr/sbin/aoe-stat program (from the <url
url="http://aoetools.sourceforge.net/" name="aoetools">) lists the devices
the system considers valid. It also displays the
status of the device (up or down). For example:
<tscreen><verb>
root@makki root# aoe-stat
e0.0 10995.116GB eth0 up
e0.1 10995.116GB eth0 up
e0.2 10995.116GB eth0 up
e1.0 1152.874GB eth0 up
e7.0 370.566GB eth0 up
</verb></tscreen>
<sect1>Q: What is the "closewait" state?
<p>
A: The "down,closewait" status means that the device went down but at
least one process still has it open. After all processes close the
device, it will become "up" again if the remote AoE device is
available and ready.
The user can also use the "aoe-revalidate" command to manually cause
the aoe driver to query the AoE device. If the AoE device is
available and ready, the device state on the Linux host will change
from "down,closewait" to "up".
<sect1>Q: How does the system know an AoE device has failed?
<p>
A: When an AoE target cannot complete a requested command, it will
indicate the failure in its response to the failed request.
The Linux aoe driver will mark the AoE device as failed upon
reception of such a response. In addition, if an AoE target
has not responded to a prior request within a default
timeout (currently three minutes) the aoe driver will fail
the device.
<sect1>Q: How do I take an AoE device out of the failed state?
<p>
A: If the aoe driver shows the device state to be "down", first
check the EtherDrive storage itself and the AoE network. Once any
problem has been rectified, you can use the "aoe-revalidate" command
from the <url
url="http://aoetools.sourceforge.net/" name="aoetools"> to ask
the aoe driver to change the state back to "up".
<p>
If the Linux Software RAID driver has marked the
device as "failed" (so
that an "F" shows up in the output of "cat /proc/mdstat"), then you
first
need to remove the device from the RAID using mdadm. Next you add the
device back to the array with mdadm.
<p>
An example follows, showing how (after manually failing e10.0) the
device is removed from the array and then added back. After adding
it back to the RAID, the md driver begins rebuilding the redundancy of
the array.
<tscreen><verb>
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.1[1] etherd/e10.0[0]
524224 blocks [2/2] [UU]
unused devices: <none>
root@kokone ~# mdadm --fail /dev/md0 /dev/etherd/e10.0
mdadm: set /dev/etherd/e10.0 faulty in /dev/md0
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.1[1] etherd/e10.0[2](F)
524224 blocks [2/1] [_U]
unused devices: <none>
root@kokone ~# mdadm --remove /dev/md0 /dev/etherd/e10.0
mdadm: hot removed /dev/etherd/e10.0
root@kokone ~# mdadm --add /dev/md0 /dev/etherd/e10.0
mdadm: hot added /dev/etherd/e10.0
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.0[2] etherd/e10.1[1]
524224 blocks [2/1] [_U]
[=>...................] recovery = 5.0% (26944/524224) finish=0.6min speed=13472K/sec
unused devices: <none>
root@kokone ~#
</verb></tscreen>
<sect1>Q: How can I use LVM with my EtherDrive storage?
<p>
A: With older <url url="http://sources.redhat.com/lvm2/"
name="LVM2"> releases, you may need to edit
lvm.conf, but the current version of LVM2 supports AoE
devices "out of the box".
You can also create md devices from your aoe devices and tell LVM to
use the md devices.
It's necessary to understand LVM itself in order to use AoE devices
with LVM. Besides the manpages for the LVM commands, the <url
url="http://tldp.org/HOWTO/LVM-HOWTO/" name="LVM HOWTO"> is a big help
if you are starting out with LVM.
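As a quick orientation, putting a logical volume on a single AoE
device might look like the following sketch (the volume group and
logical volume names here are hypothetical):
<tscreen><verb>
pvcreate /dev/etherd/e0.1
vgcreate vg_aoe /dev/etherd/e0.1
lvcreate -n lv_data -L 100G vg_aoe
mkfs.ext3 /dev/vg_aoe/lv_data
</verb></tscreen>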
If you have an old LVM2 that does not already detect and work with AoE
devices, you can add this line to the "devices" block of your
lvm.conf.
<tscreen><verb>
types = [ "aoe", 16 ]
</verb></tscreen>
If you are creating physical volumes out of RAIDs over EtherDrive
storage, make sure to turn on md component detection so that LVM2
doesn't go snooping around on the underlying EtherDrive disks.
<tscreen><verb>
md_component_detection = 1
</verb></tscreen>
The snapshots feature in LVM2 did not work in early 2.6 kernels.
Lately, Coraid customers have reported success using snapshots on
AoE-backed logical volumes when using a recent kernel and aoe driver.
Older aoe drivers, like version 22, may need <url
url="https://bugzilla.redhat.com/attachment.cgi?id=311070" name="a
fix"> to work correctly with snapshots.
Customers have reported data corruption and kernel panics when using
striped logical volumes (created with the "-i" option to lvcreate)
when using aoe driver versions prior to aoe6-48. No such problems
occur with normal logical volumes or with Software RAID's striping
(RAID 0).
Most systems have boot scripts that try to detect LVM physical volumes
early in the boot process, before AoE devices are available. When
working with LVM, you may need to help LVM recognize AoE devices
that are physical volumes by running vgscan after loading the aoe
module.
There have been reports that partitions can interfere with LVM's
ability to use an AoE device as a physical volume. For example, with
partitions e0.1p1 and e0.1p2 residing on e0.1, <tt>pvcreate /dev/etherd/e0.1</tt> might
complain,
<tscreen><verb>
Device /dev/etherd/e0.1 not found.
</verb></tscreen>
Removing the partitions allows LVM to create a physical volume from
e0.1.
<sect1>Q: I get an "invalid module format" error on modprobe. Why?
<p>
A: The aoe module and the kernel must be built to match one another.
On module load, the kernel version, SMP support (yes or no), the
compiler version, and the target processor must be the same for the
module as they were for the running kernel.
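A quick way to compare the module and the running kernel is to check
the module's version magic (a sketch):
<tscreen><verb>
modinfo aoe | grep vermagic
uname -r
</verb></tscreen>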
<sect1>Q: Can I allow multiple Linux hosts to use a filesystem that is on my EtherDrive storage?
<p>
A: Yes, but you're now taking advantage of the flexibility of
EtherDrive storage, using it like a SAN. Your software
must be "cluster aware", like <url
url="http://sources.redhat.com/cluster/gfs/" name="GFS">. Otherwise,
each host will assume
it is the sole user of the filesystem and data corruption will
result.
<sect1>Q: Can you give me an overview of GFS and related software?
<p>
A: Yes, here's a brief overview.
<sect2>Background
<p>
GFS is a scalable, journaled filesystem designed to be used by more
than one computer at a time. There is a separate journal for each
host using the filesystem. All the hosts working together are
called a cluster, and each member of the cluster is called a cluster
node.
<p>
To achieve acceptable performance, each cluster node remembers what
was on the block device the last time it looked. This is caching:
copies of the data in RAM are used temporarily instead of reading
directly from the block device.
<p>
To avoid chaos, the data in the RAM cache of every cluster node has
to match what's on the block device. The cluster nodes communicate
over TCP/IP to agree on who is in the cluster and who has the right
to use a particular part of the shared block device.
<sect2>Hardware
<p>
To allow the cluster nodes to control membership in the cluster and
to control access to the shared block storage, "fencing" hardware
can be used.
<p>
Some network switches can be dynamically configured to turn single
ports on and off, effectively fencing a node off from the rest of
the network.
<p>
Remote power switches can be told to turn an outlet off, powering a
cluster node down, so that it is certainly not accessing the shared
storage.
<sect2>Software
<p>
The RedHat Cluster Suite developers have created several pieces of
software besides the GFS filesystem itself to allow the cluster
nodes to coordinate cluster membership and to control access to the
shared block device.
<p>
These parts are listed here, on the GFS Project Page.
<p>
<url url="http://sources.redhat.com/cluster/gfs/" name=" http://sources.redhat.com/cluster/gfs/">
<p>
GFS and its related software are undergoing continuous heavy
development and are maturing slowly but steadily.
<p>
As might be expected, the developers working for RedHat target
RedHat Enterprise Linux as the ultimate platform for GFS and its
related software. They also use Fedora Core as a platform for
testing and innovation.
<p>
That means that when choosing a distribution for running GFS, recent
versions of Fedora Core, RedHat Enterprise Linux (RHEL), and RHEL
clones like CentOS should be considered. On these platforms, RPMs
are available that have a good chance of working "out of the box."
<p>
With a RedHat-based distro like Fedora Core, using GFS means seeking
out the appropriate documentation, installing the necessary RPMs,
and creating a few text files for configuring the software.
<p>
Here is a good overview of what the process is generally like. Note
that if you're using RPMs, then building and installing the software
will not be necessary.
<p>
<url url="http://sources.redhat.com/cluster/doc/usage.txt" name="http://sources.redhat.com/cluster/doc/usage.txt">
<sect2>Use
<p>
Once you have things ready, using the GFS is like using any other
filesystem.
<p>
Performance will be greatest when the filesystem operations of the
different nodes do not interfere with one another. For instance, if
all the nodes try to write to the same place in a directory or file,
much time will be spent in coordinating access (locking).
<p>
An easy way to eliminate a large amount of locking is to use the
"noatime" (no access time update) mount option. Even in traditional
filesystems the use of
this option often results in a dramatic performance benefit, because
it eliminates the need to write to the block storage just to record
the time that the file was last accessed.
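For example (a sketch, assuming a filesystem on
<tt>/dev/etherd/e0.1</tt> and an existing <tt>/mnt/data</tt>
directory):
<tscreen><verb>
mount -o noatime /dev/etherd/e0.1 /mnt/data
</verb></tscreen>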
<sect2>Fencing
<p>
There are several ways to keep a cluster node from accessing shared
storage when that node might have outdated assumptions about the
state of the cluster or the storage. Preventing the node from
accessing the storage is called "fencing", and it can be
accomplished in several ways.
<p>
One popular way is to simply kill the power to the fenced node by
using a remote power switch. Another is to use a network switch
that has ports that can be turned on and off remotely.
<p>
When the shared storage resource is a LUN on an SR, it is
possible to manipulate the LUN's mask list in order to accomplish
fencing. You can read about this technique in the <url
url="/support/linux/contrib/" name="Contributions area">.
<sect1>Q: How can I make a RAID of more than 27 components?
<p>
A: For Linux Software RAID, the kernel limits the number of disks in
one RAID to 27. However, you can easily overcome this limitation by
creating another level of RAID.
<p>
For example, to create a RAID 0 of thirty block devices,
you may create three ten-disk RAIDs (md1, md2, and md3) and then
stripe across them (md0 is a stripe over md1, md2, and md3).
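With mdadm, that layout might be created like this (a sketch,
assuming shelves 5, 6, and 7 each export LUNs 0 through 9):
<tscreen><verb>
mdadm -C /dev/md1 --level=raid0 -n 10 --auto=md /dev/etherd/e5.[0-9]
mdadm -C /dev/md2 --level=raid0 -n 10 --auto=md /dev/etherd/e6.[0-9]
mdadm -C /dev/md3 --level=raid0 -n 10 --auto=md /dev/etherd/e7.[0-9]
mdadm -C /dev/md0 --level=raid0 -n 3 --auto=md /dev/md1 /dev/md2 /dev/md3
</verb></tscreen>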
<p>
Here is an example raidtools configuration file that implements the
above scenario for shelves 5, 6, and 7: <url
url="raid0-30component.conf" name="multi-level RAID 0 configuration
file">. Non-trivial raidtab configuration files are easier to
generate from a script than to create by hand.
<p>
EtherDrive storage gives you a lot of freedom, so be creative.
<sect1>Q: Why do my device nodes disappear after a reboot?<label id="udev">
<p>
A: Some Linux distributions create device nodes dynamically. The
method of choice today is called "udev". The aoe driver and udev
work together when the following rules are installed.
<p>
These rules go into a file with a name like <tt>60-aoe.rules</tt>.
Look in your <tt>udev.conf</tt> file (usually
<tt>/etc/udev/udev.conf</tt>) for the line starting with <tt>udev_rules=</tt> to find out where rules go (usually <tt>/etc/udev/rules.d</tt>).
<tscreen><verb>
# These rules tell udev what device nodes to create for aoe support.
# They may be installed along the following lines. Check the section
# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
# aoe char devices
SUBSYSTEM=="aoe", KERNEL=="discover", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="err", NAME="etherd/%k", GROUP="disk", MODE="0440"
SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="flush", NAME="etherd/%k", GROUP="disk", MODE="0220"
# aoe block devices
KERNEL=="etherd*", NAME="%k", GROUP="disk"
</verb></tscreen>
<p>
Unfortunately the syntax for the udev rules file has changed several
times as new versions of udev appear. You will probably have to
modify the example above for your system, but the existing rules and
the udev documentation should help you.
<p>
There is an example script in the aoe driver,
<tt>linux/Documentation/aoe/udev-install.sh</tt>, that can install the
rules on most systems.
<p>
The udev system can only work with the aoe driver if the aoe driver is
loaded. To avoid confusion, make sure that you load the aoe driver at
boot time.
<sect1>Q: Why does RAID initialization seem slow?
<p>
A: The 2.6 Linux kernel has a problem with its RAID initialization
rate limiting feature. You can override this feature and speed up
RAID initialization by using the following commands. Note that these
commands change kernel memory, so the commands must be re-run after a
reboot.
<tscreen><verb>
echo 100000 > /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min
</verb></tscreen>
<sect1>Q: I can only use shelf zero! Why won't e1.9 work?
<p>
A: Every block device has a device file, usually in /dev, that has a
major and minor number. You can see these numbers using ls. Note the
high minor numbers (1744, 2400, and 2401) in the example below.
<tscreen><verb>
ecashin@makki ~$ ls -l /dev/etherd/
total 0
brw------- 1 root disk 152, 1744 Mar 1 14:35 e10.9
brw------- 1 root disk 152, 2400 Feb 28 12:21 e15.0
brw------- 1 root disk 152, 2401 Feb 28 12:21 e15.0p1
</verb></tscreen>
The 2.6 Linux kernel allows high minor device numbers like this, but
until recently, 255 was the highest minor number one could use. Some
distributions contain userland software that cannot understand the
high minor numbers that 2.6 makes possible.
Here's a crude but reliable test that can determine whether your
system is ready to use devices with high minor numbers. In the
example below, we tried to create a device node with a minor number of
1744, but ls shows it as 208.
<tscreen><verb>
root@kokone ~# mknod e10.9 b 152 1744
root@kokone ~# ls -l e10.9
brw-r--r-- 1 root root 158, 208 Mar 2 15:13 e10.9
</verb></tscreen>
On systems like this, you can still use the aoe driver with up to
256 disks if you're willing to live without support for partitions.
Just make sure that the device nodes and the aoe driver are both
configured with one partition per device.
The commands below show how to build the driver without partition
support and then create compatible device nodes for shelf 10.
<tscreen><verb>
make install AOE_PARTITIONS=1
rm -rf /dev/etherd
env n_partitions=1 aoe-mkshelf /dev/etherd 10
</verb></tscreen>
As of version 1.9.0, the mdadm command supports large minor device
numbers. The mdadm versions before 1.9.0 do not. If you would like
to use versions of mdadm older than 1.9.0, you can configure your
driver and device nodes as outlined above. Be aware that it's easy to
confuse yourself by building a driver that doesn't match the device
nodes.
<sect1>Q: How can I start my AoE storage on boot and shut it down when the system shuts down?<label id="aoeinit">
<p>
A: That is really a question about your own system, so it's a question
you, as the system administrator, are in the best position to answer.
<p>
In general, though, many Linux distributions follow the same patterns
when it comes to system "init scripts". Most use a System V style.
<p>
The example below should help get you started if you have never
created and installed an init script. Start by reading the comments
at the top. Make sure you understand how your system works and what
the script does, because every system is different.
Here is an overview of what happens when the aoe module is loaded and
the aoe module begins AoE device discovery. It should help you to
understand the example script below. Starting up the aoe module on
boot can be tricky if necessary parts of the system are not ready when
you want to use AoE.
To discover an AoE device, the aoe driver must receive a Query Config
reponse packet that indicates the device is available. A Coraid SR
broadcasts this response unsolicited when you run the <tt>online</tt>
SR command, but it is usually sent in response to an AoE initiator
broadcasting a Query Config command to discover devices on the
network. Once an AoE device has been discovered, the aoe driver sends
an ATA Device Identify command to get information about the disk
drive. When the disk size is known, the aoe driver will install the
new block device in the system.
The aoe driver will broadcast this AoE discovery command when loaded,
and then once a minute thereafter.
The AoE discovery that takes place on loading the aoe driver does not
take long, but it does take some time. That's why you'll see "sleep"
commands in the example aoe-init script below. If AoE discovery is
failing, try unloading the aoe module and tuning your init script by
invoking it at the command line.
You will often find that a delay is necessary after loading your
network drivers (and before loading the aoe driver). This delay
allows the network interface to initialize and to become usable. An
additional delay is necessary after loading the aoe driver, so that
AoE discovery has time to take place before any AoE storage is used.
Without such a delay, the initial AoE Query Config broadcast packet
might never go out onto the AoE network, and then the AoE initiator
will not know about any AoE targets until the next periodic Query
Config broadcast occurs, usually one minute later.
<tscreen><verb>
#! /bin/sh
# aoe-init - example init script for ATA over Ethernet storage
#
# Edit this script for your purposes. (Changing "eth1" to the
# appropriate interface name, adding commands, etc.) You might
# need to tune the sleep times.
#
# Install this script in /etc/init.d with the other init scripts.
#
# Make it executable:
# chmod 755 /etc/init.d/aoe-init
#
# Install symlinks for boot time:
# cd /etc/rc3.d && ln -s ../init.d/aoe-init S99aoe-init
# cd /etc/rc5.d && ln -s ../init.d/aoe-init S99aoe-init
#
# Install symlinks for shutdown time:
# cd /etc/rc0.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc1.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc2.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc6.d && ln -s ../init.d/aoe-init K01aoe-init
#
case "$1" in
"start")
# load any needed network drivers here
# replace "eth1" with your aoe network interface
ifconfig eth1 up
# time for network interface to come up
sleep 4
modprobe aoe
# time for AoE discovery and udev
sleep 7
# add your raid assemble commands here
# add any LVM commands if needed (e.g. vgchange)
# add your filesystem mount commands here
test -d /var/lock/subsys && touch /var/lock/subsys/aoe-init
;;
"stop")
# add your filesystem umount commands here
# deactivate LVM volume groups if needed
# add your raid stop commands here
rmmod aoe
rm -f /var/lock/subsys/aoe-init
;;
*)
echo "usage: `basename $0` {start|stop}" 1>&2
;;
esac
</verb></tscreen>
<sect1>Q: Why do I get "permission denied" when I'm root?
<p>
A: Some newer systems come with SELinux (Security-Enhanced Linux),
which can limit what the root user can do.
<p>
SELinux is usually good about creating entries in the system logs when
it prevents root from doing something, so examine your logs for such
messages.
<p>
Check the SELinux documentation for information on how to configure
or disable SELinux according to your needs.
<sect1>Q: Why does fdisk ask me for the number of cylinders?<label id="dospart">
<p>
A: Your fdisk is probably asking the kernel for the size of the disk
with a BLKGETSIZE block device ioctl, which returns the sector
count of the disk in a 32-bit number. If the sector count of the disk
does not fit in that 32-bit number (2 TB is the limit with 512-byte
sectors), the ioctl fails with an error. The error indicates that the
program should try the 64-bit ioctl (BLKGETSIZE64), but when fdisk
doesn't do that, it just asks the user to supply the number of
cylinders.
You can
tell fdisk the number of cylinders yourself. The number to use
(sectors / (255 * 63)) is printed by the following commands. Use the
appropriate device instead of "e0.0".
<tscreen><verb>
sectors=`cat /sys/block/etherd\!e0.0/size`
echo $sectors 255 63 '*' / p | dc
</verb></tscreen>
But no MSDOS partition table can ever work with more than 2TB. The
reason is that the numbers in the partition table itself are only 32
bits in size. That means you can't have a partition larger than 2TB
in size or starting further than 2TB from the beginning of the device.
Some options for multi-terabyte volumes are:
<enum>
<item>By doing without partitions, the filesystem can be created
directly on the AoE device itself (e.g., <tt>/dev/etherd/e1.0</tt>),
<item>LVM2, the Logical Volume Manager, is a sophisticated way of
allocating storage to create logical volumes of desired sizes, and
<item>GPT partition tables.
</enum>
The last item in the list above is a new kind of partition table that
overcomes the limitations of the older MSDOS-style partition table.
Andrew Chernow has related his successful experiences using GPT
partition tables on large AoE devices in <url
url="/support/linux/contrib/chernow/gpt.html"
name="this contributed document">.
Please note that some versions of the GNU parted tool, such as version
1.8.6, have a bug. This bug allows the user to create an MSDOS-style
partition table with partitions larger than two terabytes even though
these partitions are too large for an MSDOS partition table. The
result is that the filesystems on these partitions will only be usable
until the next reboot.
<sect1>Q: Can I use AoE equipment with Oracle software?
<p>
A: Oracle used to have a <url
url="http://www.oracle.com/technology/deploy/availability/htdocs/oscp.html"
name="Oracle Storage Compatibility Program">, but simple block-level
storage technologies do not require Oracle validation. ATA over
Ethernet provides simple, block-level storage.
Oracle used to have a list of frequently asked questions about
running Oracle on Linux, but they have replaced it with <url
url="http://www.oracle.com/technology/tech/linux/htdocs/oracleonlinux_faq.html"
name="documentation covering their own Linux distribution">. A third party
site continues to maintain a <url
url="http://www.orafaq.com/faqlinux.htm"
name="FAQ about running Oracle on Linux">.
<sect1>Q: Why do I have intermittent problems?
<p>
A: Make sure your network is in good shape. Having good patch cables,
reliable network switches with good flow control, and good network
cards will keep your network storage happy.
<sect1>Q: How can I avoid running out of memory when copying large files?
<p>
A: You can tell the Linux kernel not to wait so long before writing
data out to backing storage.
<tscreen><verb>
echo 3 > /proc/sys/vm/dirty_ratio
echo 4 > /proc/sys/vm/dirty_background_ratio
echo 32768 > /proc/sys/vm/min_free_kbytes
</verb></tscreen>
When a large MTU, like 9000, is being used on the AoE-side network
interfaces, a larger min_free_kbytes setting could be helpful. The more
RAM you have, the larger the number you might have to use.
There are also alternative settings to the above "ratio" settings, available as of kernel version 2.6.29. They are <tt>dirty_bytes</tt> and <tt>dirty_background_bytes</tt>, and they provide finer control for systems with large amounts of RAM.
If you find the /proc settings to be helpful, you can make them
permanent by editing /etc/sysctl.conf or by creating an init script
that performs the settings at boot time.
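The equivalent <tt>/etc/sysctl.conf</tt> entries for the settings
shown above would be:
<tscreen><verb>
vm.dirty_ratio = 3
vm.dirty_background_ratio = 4
vm.min_free_kbytes = 32768
</verb></tscreen>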
The Documentation/sysctl/vm.txt file for your kernel has details on the settings
available for your particular kernel, but some guiding principles are...
<itemize>
<item>Linux will use free RAM to cache the data that is on AoE targets, which is helpful.
<item>Writes to the AoE target go first to RAM, updating the cache. Those updated parts of the cached data are "dirty" until the changes are written out to the AoE target. Then they're "clean".
<item>If the system needs RAM for something else, clean parts of the cache can be repurposed immediately.
<item>The RAM that is holding dirty cache data cannot be reclaimed immediately, because it reflects updates to the AoE target that have not yet made it to the AoE target.
<item>Systems with much RAM and doing many writes will accumulate dirty data quickly.
<item>If the processes creating the write workload are forced by the Linux kernel to wait for the dirty data to be flushed out to the backing store (AoE targets), then I/O goes fast but the producers are naturally throttled, and the system stays responsive and stable.
<item>If the dirty data is flushed in "the background", though, then when there's too much dirty data to flush out, the system becomes unresponsive.
<item>Telling Linux to maintain a certain amount of truly free RAM, not used for caching, allows the system to have plenty of RAM for doing the work of flushing out the dirty data.
<item>Telling Linux to push dirty data out sooner keeps the backing store more consistent while it is being used (with regard to the danger of power failures, network failures, and the like). It also allows the system to quickly reclaim memory used for caching when needed, since the data is clean.
</itemize>
<sect1>Q: Why doesn't the aoe driver notice that an AoE device has disappeared or changed size?
<p>
A: Prior to the aoe6-15 driver, aoe drivers only learned an AoE device's
characteristics once, and the only way to use an AoE device that had
grown or to get rid of "phantom" AoE devices that were no longer