<!doctype linuxdoc system>
<!-- This document is the SGML "linuxdoc" flavor described in the
"Howtos-with-LinuxDoc-mini-HOWTO", found at the following URL.
http://www.tldp.org/HOWTO/Howtos-with-LinuxDoc.html
This HOWTO was originally written by Sam Hopkins and is currently
maintained by Ed L. Cashin.
-->
<article>
<title>EtherDrive® storage and Linux 2.6
<!-- a technical "How To Guide" -->
<author>Sam Hopkins and Ed L. Cashin <tt/{sah,ecashin}@coraid.com/
<date>April 2008
<abstract>
Using network data storage with <url url="http://www.coraid.com/documents/AoEr10.txt"
name="ATA over Ethernet"> is easy after understanding a few
simple concepts.
This document explains how to use AoE targets from a Linux-based
operating system, but the basic principles are applicable to other
systems that use AoE devices. Below we begin by explaining the key
components of the network communication method, ATA over Ethernet
(AoE). Next, we discuss the way a Linux host uses AoE devices,
providing several examples.
A list of frequently asked questions follows, and the document ends
with
supplementary information.
</abstract>
<toc>
<sect>The EtherDrive System
<p>
The ATA over Ethernet network protocol allows any type of data
storage to be used over a local ethernet network. An "AoE target"
receives ATA read and write commands, executes them, and returns
responses to the "AoE initiator" that is using the storage.
These AoE commands and responses appear on the network as ethernet
frames with type 0x88a2, the IEEE-registered Ethernet type for <url
url="http://www.coraid.com/documents/AoEr10.txt" name="ATA over
Ethernet (AoE)">. An AoE target is identified by a pair of numbers:
the shelf address, and the slot address.
For example, the Coraid SR appliance can perform RAID internally on
its SATA disks, making the resulting storage capacity available on the
ethernet network as one or more AoE targets. All of the targets will
have the same shelf address because they are all exported by the same
SR. They will have different AoE slot addresses, so that each AoE
target is individually addressable. The SR documentation calls each
target a "LUN". Each LUN behaves like a network disk.
Using EtherDrive technology like the SR appliance is as simple as
sending and receiving AoE packets.
To a Linux-based system running the "aoe" driver, it doesn't matter
what the remote AoE device really is. All that matters is that the
AoE protocol can be used to communicate with a device identified by a
certain shelf and slot address.
<sect>How Linux Uses The EtherDrive System
<p>
For security and performance reasons, many people use a second,
dedicated network
interface card (NIC) for ATA over
Ethernet traffic.
A NIC must be up before it can perform any networking, including AoE.
On examining the output of the <tt>ifconfig</tt> command, you should
see your AoE NIC listed as "UP" before attempting to use an AoE device
reachable via that NIC.
You can <bf>activate the NIC</bf> with a simple <tt>ifconfig eth1
up</tt>, using the appropriate device name instead of "eth1". Note
that assigning an IP address is not necessary if the NIC is being used
only for AoE traffic, but having an IP address on a NIC used for AoE
will not interfere with AoE.
On a Linux system, block devices are used via special files called
device nodes. A familiar example is <tt>/dev/hda</tt>. When a block
device node is opened and used, the kernel translates operations on
the file into operations on the corresponding hardware.
Each accessible AoE target on your network is represented by a disk
device node in the <tt>/dev/etherd/</tt> directory and can be used
just like any other direct attached disk. The "aoe" device driver is
an open-source loadable kernel module authored by Coraid. It
translates system reads/writes on a device into AoE request frames for
the associated remote EtherDrive storage device. When the AoE
responses from the device are received, the corresponding system
read/write call is acknowledged as complete. The aoe device driver
handles retransmissions in the event of network congestion.
The association of AoE targets on your network to device nodes in
<tt>/dev/etherd/</tt> follows a simple naming scheme. Each device
node is named eX.Y, where X represents a shelf address and Y
represents a slot address. Both X and Y are decimal integers. As an
example, the following command displays the first 4 KiB of data from
the AoE target with shelf address 0 and slot address 1.
<tscreen><verb>
dd if=/dev/etherd/e0.1 bs=1024 count=4 | hexdump -C
</verb></tscreen>
Creating an ext3 filesystem on the same AoE target is as simple
as ...
<tscreen><verb>
mkfs.ext3 /dev/etherd/e0.1
</verb></tscreen>
Notice that the filesystem goes directly on the block device. There's
no need for any intermediate "format" or partitioning step.
Although partitions are not usually needed, they may be created using
a tool like fdisk or GNU parted.
Please see the <ref id="dospart" name="FAQ entry about partition
tables"> for important caveats.
Partitions are used by adding "p" and the partition number to
the device name. For example, <tt>/dev/etherd/e0.3p1</tt> is the
first partition on the AoE target with shelf address zero and slot
address three.
After creating a filesystem, it can be mounted in the normal way. It
is important to remember to unmount the filesystem before shutting
down your network devices. Without networking, there is no way to
unmount a filesystem that resides on a disk across the network.
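For example, the ext3 filesystem created above could be mounted and
later unmounted as follows (a sketch, assuming a mount point of
<tt>/mnt/aoe</tt>):
<tscreen><verb>
mkdir /mnt/aoe
mount /dev/etherd/e0.1 /mnt/aoe
# ... and before shutting down networking:
umount /mnt/aoe
</verb></tscreen>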
It is best to update your init scripts so that filesystems on
EtherDrive storage are unmounted early in the system-shutdown
procedure, before network interfaces are shut down.
<ref
id="aoeinit" name="An example"> is found below in the <ref id="faq"
name="list of Frequently Asked Questions">.
The device nodes in <tt>/dev/etherd/</tt> are usually created in one
of three ways:
<enum>
<item>Most distributions today use udev to dynamically create device nodes
as needed. You can configure udev to create the device nodes for your
AoE disks. (For an example of udev
configuration rules, see <ref id="udev" name="Why do my device nodes
disappear after a reboot?"> in the <ref id="faq" name="FAQ section"> below.)
<item>If you are using the standalone aoe driver, as opposed to the
one distributed with the Linux kernel, and you are not using udev, the
Makefile will create device
nodes for you when you do a "make install".
<item>If you are not using udev you can use static device nodes. Use
the <tt>aoe_dyndevs=0</tt> module load option for the aoe driver.
(You do not need this option if your aoe driver is older than version
aoe6-50.) Then the
<tt>aoe-mkdevs</tt> and <tt>aoe-mkshelf</tt> scripts in the <url
url="http://aoetools.sourceforge.net/" name="aoetools"> package can be
used to
create the static device nodes manually, as shown in the sketch after
this list. It is very important to
avoid using these static device nodes with an aoe driver that has the
aoe_dyndevs module parameter set to 1, because you could accidentally
use the wrong device.
</enum>
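Here is a minimal sketch of the static approach, assuming shelf
address zero:
<tscreen><verb>
modprobe aoe aoe_dyndevs=0
aoe-mkshelf /dev/etherd 0
</verb></tscreen>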
<sect>The ATA over Ethernet Tools
<p>
The aoe kernel driver allows Linux to do ATA over Ethernet. In
addition to the aoe driver, there is a collection of helpful programs
that operate outside of the kernel, in "user space". This collection
of tools and documentation is called the aoetools, and may be found at
<url
url="http://aoetools.sourceforge.net/"
name="http://aoetools.sourceforge.net/">.
Current aoe drivers from the Coraid website are bundled with a
compatible version of the aoetools. This HOWTO may make reference to
commands from the aoetools, like the aoe-stat command.
<sect1>Limiting AoE traffic to certain network interfaces
<p>
By default, the aoe driver will use any local network interface
available to reach an AoE target. Most of the time, though, the
administrator expects legitimate AoE targets to appear only on certain
ethernet interfaces, e.g., "eth1" and "eth2".
Using the <tt>aoe-interfaces</tt> command from the aoetools package
allows the administrator to limit AoE activity to a set list of
ethernet interfaces.
This configuration is especially important when some ethernet
interfaces are on networks where an unexpected AoE target with the
same shelf and slot address as a production AoE target might appear.
Please see the <tt>aoe-interfaces</tt> manpage
for more information.
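For example, to limit AoE traffic to eth2 and eth3 at runtime (a
sketch; check the manpage for your version's exact usage):
<tscreen><verb>
aoe-interfaces eth2 eth3
</verb></tscreen>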
At module load time the list of allowable interfaces may be set with
the "aoe_iflist" module parameter.
<tscreen><verb>
modprobe aoe 'aoe_iflist=eth2 eth3'
</verb></tscreen>
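To make the setting persistent, an options line can go in modprobe's
configuration (a sketch; the file location varies by distribution,
e.g., <tt>/etc/modprobe.conf</tt> or a file under
<tt>/etc/modprobe.d/</tt>):
<tscreen><verb>
options aoe aoe_iflist="eth2 eth3"
</verb></tscreen>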
<sect>EtherDrive storage and Linux Software RAID
<p>
Some AoE devices are internally redundant. A Coraid SR1521, for example,
might be exporting a 14-disk RAID 5 as a single 9.75 terabyte LUN.
In that case, the AoE target itself is performing RAID, enhancing
performance and reliability.
You can also perform RAID on the AoE initiator. Linux Software RAID
can increase performance by striping over multiple AoE targets and
reliability by using data redundancy. Reading the <url
url="http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html" name="Linux
Software RAID HOWTO"> before you start to work with RAID will likely
save time in the long run. The Linux
kernel has an "md" driver that performs the Software RAID, and there
are
several tool
sets that allow you to use this kernel feature.
The main software package for using the md driver is <url
url="http://www.cse.unsw.edu.au/~neilb/source/mdadm/" name="mdadm">.
Less popular alternatives include the older raidtools package <ref
id="archives" name="(discussed in the Archives below)">, and <url
url="http://evms.sourceforge.net/" name="EVMS">.
<sect1>Example: RAID 5 with mdadm
<p>
In this example we have five AoE targets in shelves 0-4, with each
shelf exporting a single LUN 0. The following mdadm command uses these five
AoE devices as RAID components, creating a level-5 RAID array. The md
configuration information is stored on the components themselves in
"md superblocks", which can be examined with another mdadm command.
<tscreen><verb>
# mdadm -C -n 5 --level=raid5 --auto=md /dev/md0 /dev/etherd/e[0-4].0
mdadm: array /dev/md0 started.
# mdadm --examine /dev/etherd/e0.0
/dev/etherd/e0.0:
Magic : a92b4efc
Version : 00.90.00
UUID : 46079e2f:a285bc60:743438c8:144532aa (local to host ellijay)
...
</verb></tscreen>
<p>
The <tt>/proc/mdstat</tt> file contains summary information about the
RAID as reported by the kernel itself.
<tscreen><verb>
# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : active raid5 etherd/e4.0[5] etherd/e3.0[3] etherd/e2.0[2] etherd/e1.0[1] etherd/e0.0[0]
5860638208 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
[>....................] recovery = 0.0% (150272/1465159552) finish=23605.3min speed=1032K/sec
unused devices: <none>
</verb></tscreen>
Until md finishes initializing the parity of the RAID, performance is
sub-optimal, and the RAID will not be usable if one of the components
fails during initialization. After initialization is complete, the md
device can continue
to be used even if one component fails.
Later the array can be stopped in order to shut it down cleanly in
preparation for a system reboot or halt.
<tscreen><verb>
# mdadm -S /dev/md0
</verb></tscreen>
In a system init script (see <ref id="aoeinit" name="the aoe-init
example in the FAQ">) an mdadm command can assemble the RAID
components using the configuration information that was stored on them
when the RAID was created.
<tscreen><verb>
# mdadm -A /dev/md0 /dev/etherd/e[0-4].0
mdadm: /dev/md0 has been started with 5 drives.
</verb></tscreen>
To make an xfs filesystem on the RAID array and mount it, the
following commands can be issued:
<tscreen><verb>
# mkfs -t xfs /dev/md0
# mkdir /mnt/raid
# mount /dev/md0 /mnt/raid
</verb></tscreen>
Once md has finished initializing the RAID, the storage is
single-fault tolerant: Any of the components can fail without making
the storage unavailable. Once a single component has failed, the md
device is said to be in a "degraded" state. Using a degraded array is
fine, but a degraded array cannot remain usable if another component
fails.
Adding hot spares makes the array even more robust. Having hot spares
allows md to bring a new component into the RAID as soon as one of its
components has failed so that the normal state may be achieved as
quickly as possible. You can check <tt>/proc/mdstat</tt> for
information on the initialization's progress.
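For example, adding another device to the healthy array turns it into
a hot spare (a sketch, assuming a sixth AoE target e5.0 exists):
<tscreen><verb>
# mdadm --add /dev/md0 /dev/etherd/e5.0
</verb></tscreen>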
The new write-intent bitmap feature can dramatically reduce the time
needed for re-initialization after a component fails and is later
added back to the array. Reducing the time the RAID spends in
degraded mode makes a double fault less likely. Please see the mdadm
manpages for details.
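For instance, with a reasonably recent mdadm, an internal write-intent
bitmap can be added to an existing array (a sketch):
<tscreen><verb>
# mdadm --grow --bitmap=internal /dev/md0
</verb></tscreen>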
<sect1>Important notes
<p>
<enum>
<item>Some Linux distributions come with an mdmonitor service running
by default. Unless you configure the mdmonitor to do what you want,
consider turning off this service with <tt>chkconfig mdmonitor
off</tt> and <tt>/etc/init.d/mdmonitor stop</tt> or your system's
equivalent commands. If mdadm is running in its "monitor" mode
without being properly configured, it may interfere with failover to
hot spares, the stopping of the RAID, and other actions.
<item>There is a problem with the way some 2.6 kernels determine
whether an I/O device is idle. On these kernels, RAID initialization
is about five times slower than it needs to be.
On these kernels you can do the following to work around the problem:
<tscreen><verb>
echo 100000 > /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min
</verb></tscreen>
</enum>
<sect>FAQ (contains important info)<label id="faq">
<p>
<sect1>Q: How does the system know about the AoE targets on the network?
<p>
A: When an AoE target comes online, it emits a broadcast
frame indicating its presence. In addition to this mechanism,
the AoE initiator may send out a query frame to discover
any new AoE targets.
The Linux aoe driver, for example, sends an
AoE query once per minute. The discovery can be triggered
manually with the "aoe-discover" tool, one of the
<url url="http://aoetools.sourceforge.net/" name="aoetools">.
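For example, to trigger discovery by hand:
<tscreen><verb>
aoe-discover
</verb></tscreen>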
<sect1>Q: How do I see what AoE devices the system knows about?
<p>
A: The /usr/sbin/aoe-stat program (from the <url
url="http://aoetools.sourceforge.net/" name="aoetools">) lists the devices
the system considers valid. It also displays the
status of the device (up or down). For example:
<tscreen><verb>
root@makki root# aoe-stat
e0.0 10995.116GB eth0 up
e0.1 10995.116GB eth0 up
e0.2 10995.116GB eth0 up
e1.0 1152.874GB eth0 up
e7.0 370.566GB eth0 up
</verb></tscreen>
<sect1>Q: What is the "closewait" state?
<p>
A: The "down,closewait" status means that the device went down but at
least one process still has it open. After all processes close the
device, it will become "up" again if the remote AoE device is
available and ready.
The user can also use the "aoe-revalidate" command to manually cause
the aoe driver to query the AoE device. If the AoE device is
available and ready, the device state on the Linux host will change
from "down,closewait" to "up".
<sect1>Q: How does the system know an AoE device has failed?
<p>
A: When an AoE target cannot complete a requested command, it will
indicate the failure in its response to the failed request.
The Linux aoe driver will mark the AoE device as failed upon
reception of such a response. In addition, if an AoE target
has not responded to a prior request within a default
timeout (currently three minutes) the aoe driver will fail
the device.
<sect1>Q: How do I take an AoE device out of the failed state?
<p>
A: If the aoe driver shows the device state to be "down", first
check the EtherDrive storage itself and the AoE network. Once any
problem has been rectified, you can use the "aoe-revalidate" command
from the <url
url="http://aoetools.sourceforge.net/" name="aoetools"> to ask
the aoe driver to change the state back to "up".
<p>
If the Linux Software RAID driver has marked the
device as "failed" (so
that an "F" shows up in the output of "cat /proc/mdstat"), then you
first
need to remove the device from the RAID using mdadm. Next you add the
device back to the array with mdadm.
<p>
An example follows, showing how (after manually failing e10.0) the
device is removed from the array and then added back. After adding
it back to the RAID, the md driver begins rebuilding the redundancy of
the array.
<tscreen><verb>
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.1[1] etherd/e10.0[0]
524224 blocks [2/2] [UU]
unused devices: <none>
root@kokone ~# mdadm --fail /dev/md0 /dev/etherd/e10.0
mdadm: set /dev/etherd/e10.0 faulty in /dev/md0
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.1[1] etherd/e10.0[2](F)
524224 blocks [2/1] [_U]
unused devices: <none>
root@kokone ~# mdadm --remove /dev/md0 /dev/etherd/e10.0
mdadm: hot removed /dev/etherd/e10.0
root@kokone ~# mdadm --add /dev/md0 /dev/etherd/e10.0
mdadm: hot added /dev/etherd/e10.0
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.0[2] etherd/e10.1[1]
524224 blocks [2/1] [_U]
[=>...................] recovery = 5.0% (26944/524224) finish=0.6min speed=13472K/sec
unused devices: <none>
root@kokone ~#
</verb></tscreen>
<sect1>Q: How can I use LVM with my EtherDrive storage?
<p>
A: With older <url url="http://sources.redhat.com/lvm2/"
name="LVM2"> releases, you may need to edit
lvm.conf, but the current version of LVM2 supports AoE
devices "out of the box".
You can also create md devices from your aoe devices and tell LVM to
use the md devices.
It's necessary to understand LVM itself in order to use AoE devices
with LVM. Besides the manpages for the LVM commands, the <url
url="http://tldp.org/HOWTO/LVM-HOWTO/" name="LVM HOWTO"> is a big help
if you are starting out with LVM.
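As a quick orientation, putting a logical volume on a single AoE
device might look like the following sketch (the volume group and
logical volume names here are hypothetical):
<tscreen><verb>
pvcreate /dev/etherd/e0.1
vgcreate vg_aoe /dev/etherd/e0.1
lvcreate -n lv_data -L 100G vg_aoe
mkfs.ext3 /dev/vg_aoe/lv_data
</verb></tscreen>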
If you have an old LVM2 that does not already detect and work with AoE
devices, you can add this line to the "devices" block of your
lvm.conf.
<tscreen><verb>
types = [ "aoe", 16 ]
</verb></tscreen>
If you are creating physical volumes out of RAIDs over EtherDrive
storage, make sure to turn on md component detection so that LVM2
doesn't go snooping around on the underlying EtherDrive disks.
<tscreen><verb>
md_component_detection = 1
</verb></tscreen>
The snapshots feature in LVM2 did not work in early 2.6 kernels.
Lately, Coraid customers have reported success using snapshots on
AoE-backed logical volumes when using a recent kernel and aoe driver.
Older aoe drivers, like version 22, may need <url
url="https://bugzilla.redhat.com/attachment.cgi?id=311070" name="a
fix"> to work correctly with snapshots.
Customers have reported data corruption and kernel panics when using
striped logical volumes (created with the "-i" option to lvcreate)
when using aoe driver versions prior to aoe6-48. No such problems
occur with normal logical volumes or with Software RAID's striping
(RAID 0).
Most systems have boot scripts that try to detect LVM physical volumes
early in the boot process, before AoE devices are available. When
working with LVM, you may need to help LVM recognize AoE devices
that are physical volumes by running vgscan after loading the aoe
module.
There have been reports that partitions can interfere with LVM's
ability to use an AoE device as a physical volume. For example, with
partitions e0.1p1 and e0.1p2 residing on e0.1, <tt>pvcreate /dev/etherd/e0.1</tt> might
complain,
<tscreen><verb>
Device /dev/etherd/e0.1 not found.
</verb></tscreen>
Removing the partitions allows LVM to create a physical volume from
e0.1.
<sect1>Q: I get an "invalid module format" error on modprobe. Why?
<p>
A: The aoe module and the kernel must be built to match one another.
On module load, the kernel version, SMP support (yes or no), the
compiler version, and the target processor must be the same for the
module as they were for the running kernel.
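A quick way to compare the module and the running kernel is to check
the module's version magic (a sketch):
<tscreen><verb>
modinfo aoe | grep vermagic
uname -r
</verb></tscreen>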
<sect1>Q: Can I allow multiple Linux hosts to use a filesystem that is on my EtherDrive storage?
<p>
A: Yes, but you're now taking advantage of the flexibility of
EtherDrive storage, using it like a SAN. Your software
must be "cluster aware", like <url
url="http://sources.redhat.com/cluster/gfs/" name="GFS">. Otherwise,
each host will assume
it is the sole user of the filesystem and data corruption will
result.
<sect1>Q: Can you give me an overview of GFS and related software?
<p>
A: Yes, here's a brief overview.
<sect2>Background
<p>
GFS is a scalable, journaled filesystem designed to be used by more
than one computer at a time. There is a separate journal for each
host using the filesystem. All the hosts working together are
called a cluster, and each member of the cluster is called a cluster
node.
<p>
To achieve acceptable performance, each cluster node remembers what
was on the block device the last time it looked. This is caching:
copies of the data in RAM are used temporarily instead of reading
directly from the block device.
<p>
To avoid chaos, the data in the RAM cache of every cluster node has
to match what's on the block device. The cluster nodes communicate
over TCP/IP to agree on who is in the cluster and who has the right
to use a particular part of the shared block device.
<sect2>Hardware
<p>
To allow the cluster nodes to control membership in the cluster and
to control access to the shared block storage, "fencing" hardware
can be used.
<p>
Some network switches can be dynamically configured to turn single
ports on and off, effectively fencing a node off from the rest of
the network.
<p>
Remote power switches can be told to turn an outlet off, powering a
cluster node down, so that it is certainly not accessing the shared
storage.
<sect2>Software
<p>
The RedHat Cluster Suite developers have created several pieces of
software besides the GFS filesystem itself to allow the cluster
nodes to coordinate cluster membership and to control access to the
shared block device.
<p>
These parts are listed here, on the GFS Project Page.
<p>
<url url="http://sources.redhat.com/cluster/gfs/" name=" http://sources.redhat.com/cluster/gfs/">
<p>
GFS and its related software are undergoing continuous heavy
development and are maturing slowly but steadily.
<p>
As might be expected, the developers working for RedHat target
RedHat Enterprise Linux as the ultimate platform for GFS and its
related software. They also use Fedora Core as a platform for
testing and innovation.
<p>
That means that when choosing a distribution for running GFS, recent
versions of Fedora Core, RedHat Enterprise Linux (RHEL), and RHEL
clones like CentOS should be considered. On these platforms, RPMs
are available that have a good chance of working "out of the box."
<p>
With a RedHat-based distro like Fedora Core, using GFS means seeking
out the appropriate documentation, installing the necessary RPMs,
and creating a few text files for configuring the software.
<p>
Here is a good overview of what the process is generally like. Note
that if you're using RPMs, then building and installing the software
will not be necessary.
<p>
<url url="http://sources.redhat.com/cluster/doc/usage.txt" name="http://sources.redhat.com/cluster/doc/usage.txt">
<sect2>Use
<p>
Once you have things ready, using the GFS is like using any other
filesystem.
<p>
Performance will be greatest when the filesystem operations of the
different nodes do not interfere with one another. For instance, if
all the nodes try to write to the same place in a directory or file,
much time will be spent in coordinating access (locking).
<p>
An easy way to eliminate a large amount of locking is to use the
"noatime" (no access time update) mount option. Even in traditional
filesystems the use of
this option often results in a dramatic performance benefit, because
it eliminates the need to write to the block storage just to record
the time that the file was last accessed.
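For example (a sketch, assuming a filesystem on
<tt>/dev/etherd/e0.1</tt> and an existing <tt>/mnt/data</tt>
directory):
<tscreen><verb>
mount -o noatime /dev/etherd/e0.1 /mnt/data
</verb></tscreen>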
<sect2>Fencing
<p>
There are several ways to keep a cluster node from accessing shared
storage when that node might have outdated assumptions about the
state of the cluster or the storage. Preventing the node from
accessing the storage is called "fencing", and it can be
accomplished in several ways.
<p>
One popular way is to simply kill the power to the fenced node by
using a remote power switch. Another is to use a network switch
that has ports that can be turned on and off remotely.
<p>
When the shared storage resource is a LUN on an SR, it is
possible to manipulate the LUN's mask list in order to accomplish
fencing. You can read about this technique in the <url
url="/support/linux/contrib/" name="Contributions area">.
<sect1>Q: How can I make a RAID of more than 27 components?
<p>
A: For Linux Software RAID, the kernel limits the number of disks in
one RAID to 27. However, you can easily overcome this limitation by
creating another level of RAID.
<p>
For example, to create a RAID 0 of thirty block devices,
you may create three ten-disk RAIDs (md1, md2, and md3) and then
stripe across them (md0 is a stripe over md1, md2, and md3).
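With mdadm, that layout might be created like this (a sketch,
assuming shelves 5, 6, and 7 each export LUNs 0 through 9):
<tscreen><verb>
mdadm -C /dev/md1 --level=raid0 -n 10 --auto=md /dev/etherd/e5.[0-9]
mdadm -C /dev/md2 --level=raid0 -n 10 --auto=md /dev/etherd/e6.[0-9]
mdadm -C /dev/md3 --level=raid0 -n 10 --auto=md /dev/etherd/e7.[0-9]
mdadm -C /dev/md0 --level=raid0 -n 3 --auto=md /dev/md1 /dev/md2 /dev/md3
</verb></tscreen>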
<p>
Here is an example raidtools configuration file that implements the
above scenario for shelves 5, 6, and 7: <url
url="raid0-30component.conf" name="multi-level RAID 0 configuration
file">. Non-trivial raidtab configuration files are easier to
generate from a script than to create by hand.
<p>
EtherDrive storage gives you a lot of freedom, so be creative.
<sect1>Q: Why do my device nodes disappear after a reboot?<label id="udev">
<p>
A: Some Linux distributions create device nodes dynamically. The
method of choice today is called "udev". The aoe driver and udev
work together when the following rules are installed.
<p>
These rules go into a file with a name like <tt>60-aoe.rules</tt>.
Look in your <tt>udev.conf</tt> file (usually
<tt>/etc/udev/udev.conf</tt>) for the line starting with <tt>udev_rules=</tt> to find out where rules go (usually <tt>/etc/udev/rules.d</tt>).
<tscreen><verb>
# These rules tell udev what device nodes to create for aoe support.
# They may be installed along the following lines. Check the section
# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
# aoe char devices
SUBSYSTEM=="aoe", KERNEL=="discover", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="err", NAME="etherd/%k", GROUP="disk", MODE="0440"
SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="flush", NAME="etherd/%k", GROUP="disk", MODE="0220"
# aoe block devices
KERNEL=="etherd*", NAME="%k", GROUP="disk"
</verb></tscreen>
<p>
Unfortunately the syntax for the udev rules file has changed several
times as new versions of udev appear. You will probably have to
modify the example above for your system, but the existing rules and
the udev documentation should help you.
<p>
There is an example script in the aoe driver,
<tt>linux/Documentation/aoe/udev-install.sh</tt>, that can install the
rules on most systems.
<p>
The udev system can only work with the aoe driver if the aoe driver is
loaded. To avoid confusion, make sure that you load the aoe driver at
boot time.
<sect1>Q: Why does RAID initialization seem slow?
<p>
A: The 2.6 Linux kernel has a problem with its RAID initialization
rate limiting feature. You can override this feature and speed up
RAID initialization by using the following commands. Note that these
commands change kernel memory, so the commands must be re-run after a
reboot.
<tscreen><verb>
echo 100000 > /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min
</verb></tscreen>
<sect1>Q: I can only use shelf zero! Why won't e1.9 work?
<p>
A: Every block device has a device file, usually in /dev, that has a
major and minor number. You can see these numbers using ls. Note the
high minor numbers (1744, 2400, and 2401) in the example below.
<tscreen><verb>
ecashin@makki ~$ ls -l /dev/etherd/
total 0
brw------- 1 root disk 152, 1744 Mar 1 14:35 e10.9
brw------- 1 root disk 152, 2400 Feb 28 12:21 e15.0
brw------- 1 root disk 152, 2401 Feb 28 12:21 e15.0p1
</verb></tscreen>
The 2.6 Linux kernel allows high minor device numbers like this, but
until recently, 255 was the highest minor number one could use. Some
distributions contain userland software that cannot understand the
high minor numbers that 2.6 makes possible.
Here's a crude but reliable test that can determine whether your
system is ready to use devices with high minor numbers. In the
example below, we tried to create a device node with a minor number of
1744, but ls shows it as 208.
<tscreen><verb>
root@kokone ~# mknod e10.9 b 152 1744
root@kokone ~# ls -l e10.9
brw-r--r-- 1 root root 158, 208 Mar 2 15:13 e10.9
</verb></tscreen>
On systems like this, you can still use the aoe driver with up to
256 disks if you're willing to live without support for partitions.
Just make sure that the device nodes and the aoe driver are both
configured with one partition per device.
The commands below show how to build the driver without partition
support and then create compatible device nodes for shelf 10.
<tscreen><verb>
make install AOE_PARTITIONS=1
rm -rf /dev/etherd
env n_partitions=1 aoe-mkshelf /dev/etherd 10
</verb></tscreen>
As of version 1.9.0, the mdadm command supports large minor device
numbers. The mdadm versions before 1.9.0 do not. If you would like
to use versions of mdadm older than 1.9.0, you can configure your
driver and device nodes as outlined above. Be aware that it's easy to
confuse yourself by building a driver that doesn't match the device
nodes.
<sect1>Q: How can I start my AoE storage on boot and shut it down when the system shuts down?<label id="aoeinit">
<p>
A: That is really a question about your own system, so it's a question
you, as the system administrator, are in the best position to answer.
<p>
In general, though, many Linux distributions follow the same patterns
when it comes to system "init scripts". Most use a System V style.
<p>
The example below should help get you started if you have never
created and installed an init script. Start by reading the comments
at the top. Make sure you understand how your system works and what
the script does, because every system is different.
Here is an overview of what happens when the aoe module is loaded and
the aoe module begins AoE device discovery. It should help you to
understand the example script below. Starting up the aoe module on
boot can be tricky if necessary parts of the system are not ready when
you want to use AoE.
To discover an AoE device, the aoe driver must receive a Query Config
reponse packet that indicates the device is available. A Coraid SR
broadcasts this response unsolicited when you run the <tt>online</tt>
SR command, but it is usually sent in response to an AoE initiator
broadcasting a Query Config command to discover devices on the
network. Once an AoE device has been discovered, the aoe driver sends
an ATA Device Identify command to get information about the disk
drive. When the disk size is known, the aoe driver will install the
new block device in the system.
The aoe driver will broadcast this AoE discovery command when loaded,
and then once a minute thereafter.
The AoE discovery that takes place on loading the aoe driver does not
take long, but it does take some time. That's why you'll see "sleep"
commands in the example aoe-init script below. If AoE discovery is
failing, try unloading the aoe module and tuning your init script by
invoking it at the command line.
You will often find that a delay is necessary after loading your
network drivers (and before loading the aoe driver). This delay
allows the network interface to initialize and to become usable. An
additional delay is necessary after loading the aoe driver, so that
AoE discovery has time to take place before any AoE storage is used.
Without such a delay, the initial AoE Query Config broadcast packet
might never go out onto the AoE network, and then the AoE initiator
will not know about any AoE targets until the next periodic Query
Config broadcast occurs, usually one minute later.
<tscreen><verb>
#! /bin/sh
# aoe-init - example init script for ATA over Ethernet storage
#
# Edit this script for your purposes. (Changing "eth1" to the
# appropriate interface name, adding commands, etc.) You might
# need to tune the sleep times.
#
# Install this script in /etc/init.d with the other init scripts.
#
# Make it executable:
# chmod 755 /etc/init.d/aoe-init
#
# Install symlinks for boot time:
# cd /etc/rc3.d && ln -s ../init.d/aoe-init S99aoe-init
# cd /etc/rc5.d && ln -s ../init.d/aoe-init S99aoe-init
#
# Install symlinks for shutdown time:
# cd /etc/rc0.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc1.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc2.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc6.d && ln -s ../init.d/aoe-init K01aoe-init
#
case "$1" in
"start")
# load any needed network drivers here
# replace "eth1" with your aoe network interface
ifconfig eth1 up
# time for network interface to come up
sleep 4
modprobe aoe
# time for AoE discovery and udev
sleep 7
# add your raid assemble commands here
# add any LVM commands if needed (e.g. vgchange)
# add your filesystem mount commands here
test -d /var/lock/subsys && touch /var/lock/subsys/aoe-init
;;
"stop")
# add your filesystem umount commands here
# deactivate LVM volume groups if needed
# add your raid stop commands here
rmmod aoe
rm -f /var/lock/subsys/aoe-init
;;
*)
echo "usage: `basename $0` {start|stop}" 1>&2
;;
esac
</verb></tscreen>
<sect1>Q: Why do I get "permission denied" when I'm root?
<p>
A: Some newer systems come with SELinux (Security-Enhanced Linux),
which can limit what the root user can do.
<p>
SELinux is usually good about creating entries in the system logs when
it prevents root from doing something, so examine your logs for such
messages.
<p>
Check the SELinux documentation for information on how to configure
or disable SELinux according to your needs.
<sect1>Q: Why does fdisk ask me for the number of cylinders?<label id="dospart">
<p>
A: Your fdisk is probably asking the kernel for the size of the disk
with a BLKGETSIZE block device ioctl, which returns the sector
count of the disk in a 32-bit number. If the sector count of the disk
does not fit in that 32-bit number (2 TB is the limit with 512-byte
sectors), the ioctl fails with an error. The error indicates that the
program should try the 64-bit ioctl (BLKGETSIZE64), but when fdisk
doesn't do that, it just asks the user to supply the number of
cylinders.
You can
tell fdisk the number of cylinders yourself. The number to use
(sectors / (255 * 63)) is printed by the following commands. Use the
appropriate device instead of "e0.0".
<tscreen><verb>
sectors=`cat /sys/block/etherd\!e0.0/size`
echo $sectors 255 63 '*' / p | dc
</verb></tscreen>
But no MSDOS partition table can ever work with more than 2TB. The
reason is that the numbers in the partition table itself are only 32
bits in size. That means you can't have a partition larger than 2TB
in size or starting further than 2TB from the beginning of the device.
Some options for multi-terabyte volumes are:
<enum>
<item>By doing without partitions, the filesystem can be created
directly on the AoE device itself (e.g., <tt>/dev/etherd/e1.0</tt>),
<item>LVM2, the Logical Volume Manager, is a sophisticated way of
allocating storage to create logical volumes of desired sizes, and
<item>GPT partition tables.
</enum>
The last item in the list above is a new kind of partition table that
overcomes the limitations of the older MSDOS-style partition table.
Andrew Chernow has related his successful experiences using GPT
partition tables on large AoE devices in <url
url="/support/linux/contrib/chernow/gpt.html"
name="this contributed document">.
Please note that some versions of the GNU parted tool, such as version
1.8.6, have a bug. This bug allows the user to create an MSDOS-style
partition table with partitions larger than two terabytes even though
these partitions are too large for an MSDOS partition table. The
result is that the filesystems on these partitions will only be usable
until the next reboot.
<sect1>Q: Can I use AoE equipment with Oracle software?
<p>
A: Oracle used to have a <url
url="http://www.oracle.com/technology/deploy/availability/htdocs/oscp.html"
name="Oracle Storage Compatibility Program">, but simple block-level
storage technologies do not require Oracle validation. ATA over
Ethernet provides simple, block-level storage.
Oracle used to have a list of frequently asked questions about
running Oracle on Linux, but they have replaced it with <url
url="http://www.oracle.com/technology/tech/linux/htdocs/oracleonlinux_faq.html"
name="documentation covering their own Linux distribution">. A third party
site continues to maintain a <url
url="http://www.orafaq.com/faqlinux.htm"
name="FAQ about running Oracle on Linux">.
<sect1>Q: Why do I have intermittent problems?
<p>
A: Make sure your network is in good shape. Having good patch cables,
reliable network switches with good flow control, and good network
cards will keep your network storage happy.
<sect1>Q: How can I avoid running out of memory when copying large files?
<p>
A: You can tell the Linux kernel not to wait so long before writing
data out to backing storage.
<tscreen><verb>
echo 3 > /proc/sys/vm/dirty_ratio
echo 4 > /proc/sys/vm/dirty_background_ratio
echo 32768 > /proc/sys/vm/min_free_kbytes
</verb></tscreen>
When a large MTU, like 9000, is being used on the AoE-side network
interfaces, a larger min_free_kbytes setting could be helpful. The more
RAM you have, the larger the number you might have to use.
There are also alternative settings to the above "ratio" settings, available as of kernel version 2.6.29. They are <tt>dirty_bytes</tt> and <tt>dirty_background_bytes</tt>, and they provide finer control for systems with large amounts of RAM.
If you find the /proc settings to be helpful, you can make them
permanent by editing /etc/sysctl.conf or by creating an init script
that performs the settings at boot time.
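The equivalent <tt>/etc/sysctl.conf</tt> entries for the settings
shown above would be:
<tscreen><verb>
vm.dirty_ratio = 3
vm.dirty_background_ratio = 4
vm.min_free_kbytes = 32768
</verb></tscreen>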
The Documentation/sysctl/vm.txt file for your kernel has details on the settings
available for your particular kernel, but some guiding principles are...
<itemize>
<item>Linux will use free RAM to cache the data that is on AoE targets, which is helpful.
<item>Writes to the AoE target go first to RAM, updating the cache. Those updated parts of the cached data are "dirty" until the changes are written out to the AoE target. Then they're "clean".
<item>If the system needs RAM for something else, clean parts of the cache can be repurposed immediately.
<item>The RAM that is holding dirty cache data cannot be reclaimed immediately, because it reflects updates to the AoE target that have not yet made it to the AoE target.
<item>Systems with much RAM and doing many writes will accumulate dirty data quickly.
<item>If the processes creating the write workload are forced by the Linux kernel to wait for the dirty data to be flushed out to the backing store (AoE targets), then I/O goes fast but the producers are naturally throttled, and the system stays responsive and stable.
<item>If the dirty data is flushed in "the background", though, then when there's too much dirty data to flush out, the system becomes unresponsive.
<item>Telling Linux to maintain a certain amount of truly free RAM, not used for caching, allows the system to have plenty of RAM for doing the work of flushing out the dirty data.
<item>Telling Linux to push dirty data out sooner keeps the backing store more consistent while it is being used (with regard to the danger of power failures, network failures, and the like). It also allows the system to quickly reclaim memory used for caching when needed, since the data is clean.
</itemize>
<sect1>Q: Why doesn't the aoe driver notice that an AoE device has disappeared or changed size?
<p>
A: Prior to the aoe6-15 driver, aoe drivers only learned an AoE device's
characteristics once, and the only way to use an AoE device that had
grown or to get rid of "phantom" AoE devices that were no longer