How I built a 2.8TB RAID storage array

Discussion in 'Storage' started by Yeechang Lee, Feb 20, 2005.

  1. Yeechang Lee

    Yeechang Lee Guest

    My 2.8TB RAID 5 array is finally up and running. Here I'll discuss my
    initial intended specifications, what I actually ended up with, and
    associated commentary. Please see
    <URL:http://groups.google.ca/groups?selm=>
    and
    <URL:http://groups.google.ca/groups?selm=>
    for background material.

    STORAGE MEDIUM
    Initial: Eight 250GB SATA drives.
    Actual: Nine 400GB PATA drives; eight for use, one as a cold spare.
    Why: Found a stupendous sale at CompUSA Christmas week;
    just-released-in-November Seagate Barracuda 7200.8 400GB PATA drives
    at $230 each, with no quantity limitation. I'd have loved to have
    gone with the SATA model, but given that Froogle lists the lowest
    price for one at $350 (the PATA model retails at $250-350), it was an
    easy choice.


    CASE
    Initial: Antec tower case.
    Actual: Antec 4U rackmount case.
    Why: I'd always thought of rackmounts as unsuitable for anyone without
    an actual rack sitting in their data center, but after realizing that a
    rackmount case is simply a tower case sitting on its side, it was an
    easy decision given the space advantages. The Antec case here comes
    with Antec's True Power 550W EPS12V power supply, and both have great
    reputations. In practice, I found the Antec case remarkably easy to
    open up (one thumbscrew), easy to work with (all drive cages are
    removable), and roomy.


    MOTHERBOARD
    Initial: Unspecified, but probably something Athlon-based and cheap.
    Actual: Supermicro X5DAL-G Intel server motherboard.
    Why: I became convinced that the sheer volume of the PCI traffic
    generated by my proposed array under software RAID would overwhelm any
    non-server motherboard, resulting in errors. In addition, I wanted
    PCI-X slots for optimal performance. Even though I think AMD in
    general offers much better bang for the buck, I didn't want to spend
    the $$$ for Opteron, so a Xeon motherboard with an Intel server
    chipset was the best compromise.


    CONTROLLER CARDS
    Initial: Two Highpoint RocketRAID 454 cards.
    Actual: Two 3Ware 7506-4LP cards.
    Why: I needed PATA cards to go with my PATA drives, and also wanted to
    put the two PCI-X slots on my motherboard to use. I found exactly two
    PATA PCI-X controller cards: The 3Ware, and the Acard AEC-6897. Given
    that the Acard's Linux driver compatibility looked really, really
    iffy, I went with the 3Ware. I briefly considered the 7506-8 model,
    which would've saved me about $120, but figured I'd be better off
    distributing the bandwidth over two PCI-X slots rather than one.


    SOFTWARE
    Initial: Linux software RAID 5 and XFS or JFS.
    Actual: Linux software RAID 5 and JFS.
    Why: Initially I planned on software RAID knowing that the Highpoint
    (and the equivalent Promise and Adaptec cards) didn't do true hardware
    RAID. Even after switching over to 3Ware (which *does* do true
    hardware RAID), everything I saw and read convinced me that software
    RAID was still the way to go for performance, long-term compatibility,
    and even 400GB extra space (given I'd be building one large RAID 5
    array instead of two smaller ones).
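    (The arithmetic: one eight-drive RAID 5 array yields 7 x 400GB =
    2800GB of usable space, while two four-drive arrays would yield
    2 x 3 x 400GB = 2400GB, since each array gives up one drive's worth
    of capacity to parity. Hence the extra 400GB.)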

    I saw *lots* of conflicting benchmarks on whether XFS or JFS was the
    way to go. Ultimately
    <URL:http://pcbunn.cacr.caltech.edu/gae/3ware_raid_tests.htm> pushed
    me toward JFS, but I suspect I could have gone XFS with no difficulty
    whatsoever.


    COST
    As implied above, I paid $2070 plus sales tax for the drives. I lucked
    out and found a terrific eBay deal for a prebuilt system containing
    the above-mentioned case and motherboard, two Xeon 2.8GHz CPUs, a DVD
    drive, and 2GB memory for $1260 including shipping. Labor aside, I'd
    have paid *much* more to build an equivalent system myself. The 3Ware
    cards were $240 each, no shipping or tax, from Monarch Computer. With
    miscellaneous costs (such as a Cooler Master 4-in-3 drive cage and an
    80GB boot drive from Best Buy for $40 after rebates), I paid under
    $4100, tax and shipping included, for everything. At $1.46/GB *plus* a
    powerful dual-CPU system, boatloads of memory, and a spare drive, I am
    quite satisfied with the overall bang for the buck.
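    (That $1.46/GB is simply the roughly $4,100 total divided by the
    array's 2,800GB of usable space.)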


    ASSEMBLY: HARDWARE
    I spent most of my time on the physical assembly; it's
    astonishing just how long the simple tasks of opening up each
    retail-boxed drive, screwing the drive into the drive cage, putting
    the cage into the case, removing the cage and the drive when you
    realize you've put the drive in with the wrong mounting holes,
    reinstalling the drive and cage, etc., etc. take! My studio apartment
    still looks like a computer store exploded inside it.

    3Ware wisely provides PATA master-only cables with its cards, which
    saved some room, but my formerly-roomy case nonetheless looks like the
    rat's nest to end all rat's nests inside.


    ASSEMBLY: SOFTWARE
    I'd gone ahead and installed Fedora Core 3 on just the boot drive
    before the controller cards arrived. The 3Ware cards present each
    PATA drive as a SCSI device (/dev/sd[a-h]). Once booted, I used mdadm
    to create the RAID array (no partitions; just whole drives). While the
    array chugged along building the parity information (about four
    hours), I created one large LVM2 volume group and logical volume on
    top of the array, then one large JFS file system on top of that.
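
    For the record, the sequence went roughly like this (a sketch from
    memory; the exact flags are reconstructed, and older LVM2 releases may
    want an explicit extent count from 'vgdisplay' instead of the
    '100%FREE' shorthand):

    'mdadm --create /dev/md0 --level=5 --chunk=512 --raid-devices=8 /dev/sd[a-h]'
    'pvcreate /dev/md0'
    'vgcreate VolGroup01 /dev/md0'
    'lvcreate -n LogVol00 -l 100%FREE VolGroup01'
    'mkfs.jfs /dev/VolGroup01/LogVol00'
    'mount /dev/VolGroup01/LogVol00 /mnt/newspace'

    'cat /proc/mdstat' is handy for watching the initial parity build tick
    along.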

    By the way, I found a RAID-related bug with Fedora Core's bootscripts;
    see <URL:https://bugzilla.redhat.com/beta/show_bug.cgi?id=129633>.


    RESULTS
    'df -h':
    /dev/mapper/VolGroup01-LogVol00
    2.6T 221G 2.4T 9% /mnt/newspace


    'mdadm --detail /dev/md0':
    Version : 00.90.01
    Creation Time : Wed Feb 16 01:53:33 2005
    Raid Level : raid5
    Array Size : 2734979072 (2608.28 GiB 2800.62 GB)
    Device Size : 390711296 (372.61 GiB 400.09 GB)
    Raid Devices : 8
    Total Devices : 8
    Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Feb 19 16:26:34 2005
    State : clean
    Active Devices : 8
    Working Devices : 8
    Failed Devices : 0
    Spare Devices : 0

    Layout : left-symmetric
    Chunk Size : 512K

    Number Major Minor RaidDevice State
    0 8 0 0 active sync /dev/sda
    1 8 16 1 active sync /dev/sdb
    2 8 32 2 active sync /dev/sdc
    3 8 48 3 active sync /dev/sdd
    4 8 64 4 active sync /dev/sde
    5 8 80 5 active sync /dev/sdf
    6 8 96 6 active sync /dev/sdg
    7 8 112 7 active sync /dev/sdh
    Events : 0.319006


    'bonnie++ -s 4G -m 3ware-swraid5-type -p 3 ; \
    bonnie++ -s 4G -m 3ware-swraid5-type-c1 -y & \
    bonnie++ -s 4G -m 3ware-swraid5-type-c2 -y & \
    bonnie++ -s 4G -m 3ware-swraid5-type-c3 -y &'
    (To be honest these results are just a bunch of numbers to me, so any
    interpretations of them are welcome. I should mention that these were
    done with three distributed computing [BOINC, mprime, and
    [email protected]] projects running in the background. Although each was
    running at 'nice -n 19', they surely impacted CPU and perhaps disk
    performance somewhat.)

    Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
    3ware-swraid5-ty 4G 15749 50 15897 8 7791 6 10431 49 20245 11 138.1 2
    ------Sequential Create------ --------Random Create--------
    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
    files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
    16 381 6 +++++ +++ 208 3 165 7 +++++ +++ 192 4
    3ware-swraid5-type-c1,4G,15749,50,15897,8,7791,6,10431,49,20245,11,138.1,2,16,381,6,+++++,+++,208,3,165,7,+++++,+++,192,4
    done.
    Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
    3ware-swraid5-ty 4G 13739 46 17265 9 7930 6 10569 50 20196 11 146.7 2
    ------Sequential Create------ --------Random Create--------
    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
    files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
    16 383 7 +++++ +++ 207 3 162 7 +++++ +++ 191 4
    3ware-swraid5-type-c2,4G,13739,46,17265,9,7930,6,10569,50,20196,11,146.7,2,16,383,7,+++++,+++,207,3,162,7,+++++,+++,191,4
    done.
    Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
    3ware-swraid5-ty 4G 13288 43 16143 8 7863 6 10695 50 20231 12 149.6 2
    ------Sequential Create------ --------Random Create--------
    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
    files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
    16 537 9 +++++ +++ 207 3 161 7 +++++ +++ 188 4
    3ware-swraid5-type-c3,4G,13288,43,16143,8,7863,6,10695,50,20231,12,149.6,2,16,537,9,+++++,+++,207,3,161,7,+++++,+++,188,4


    FINAL NOTES, THOUGHTS, AND QUESTIONS
    I've noticed that over sync NFS, initiating a file copy from my older
    Athlon 1.4GHz system to the RAID array system is *much, much, much*
    slower (many minutes as opposed to seconds) than initiating the copy,
    in the same direction, from the array system. Why is this?

    I almost went with the SATA (8506) version of the 3Ware cards and a
    bunch of PATA-SATA adapters in order to maintain compatibility with
    future drives, likely to be SATA only. However, a colleague pointed
    out the foolishness of paying $200 extra ($120 for eight adapters plus
    $80 for the extra cost of the SATA cards) in order to (possibly)
    futureproof a $480 investment.

    I was concerned that the drives (and the PATA cables) would cause
    horrible heat and noise issues. These, surprisingly, didn't occur;
    according to 'sensors', internal temperatures only rose by a few
    degrees, and the server is just as (very) noisy now as it was before
    the RAID drives went in. I think I'll be able to get away with
    stuffing the array inside my hall closet after all.

    The server, before I put the cards and RAID drives into the system but
    with the distributed-computing projects putting the CPU at 100%
    utilization, took the power output on my Best Fortress 750VA/450W UPS
    from about 55% to about 76%. With the RAID up and running and again
    with 100% CPU utilization, output is 87-101% with the median at
    perhaps 93%. I realize I really ought to invest in another UPS, but
    with these figures I'm tempted to get by on what I currently have.

    Yes, I could've saved a considerable amount of money had I gone with,
    say, a used dual PIII server system with regular PCI slots (and, thus,
    $80 Highpoint RAID cards, again for the four PATA channels and not for
    their RAID functionality per se) and 512MB. And I suspect that for a
    home user like me performance wouldn't have been too much less. But I
    like to buy and build systems I can use for years and years without
    having to bother with upgrading, and figure I've made a long-term (at
    least 4-5 years, which is long term in the computer world) investment
    that provides me with much more than just storage functionality. And
    again, $1.46/GB is hard to beat.
     
    Yeechang Lee, Feb 20, 2005
    #1

  2. Yeechang Lee

    Yeechang Lee Guest

    Flat. The only thing special about them was that they lacked slave
    connectors.

    I'm glad they're flat; despite the (lack of) air flow, at some point I
    intend to try the fabled PATA cable origami methods I've heard about.

    This does concern me. How the heck do I tell them apart, even now? How
    do I figure out which drive is sda, which is sdb, which is sdc, etc.,
    etc.? Advice is appreciated.

    Not me; all my research told me that software was the way to go for
    both performance and downward-compatibility reasons.

    Thank you. It still amazes me to see that little '2.6T' label appear
    in the 'df -h' output.
     
    Yeechang Lee, Feb 20, 2005
    #2

  3. Anton Ertl

    Anton Ertl Guest

    One way is to disconnect them one by one, and see which drive is
    missing in the list (unless you want to test the md driver's
    reconstruction abilities, you should be doing this with a kernel that
    does not have an md driver, probably booting from CD). You can also
    use that method when a drive fails (but then it's even more important
    that the kernel does not have an md driver).

    Another way is to just look at which ports on the cards connect with
    which drives. They are typically marked on the card and/or in the
    manual with IDE0, IDE1, etc. You also have to find out which card is
    which. There may be a method to do this through the PCI IDs, but I
    would go for the disconnection method for that.
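
    On a 2.6 kernel with sysfs, something like this should also show which
    controller each disk hangs off (an untested sketch; the symlink
    targets contain the PCI address of the card, which lspci can then
    name):

    'ls -l /sys/block/sd?/device'
    'lspci'

    That still leaves working out which physical drive sits on which port
    of that card, though.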

    Followups set to comp.os.linux.hardware (because I read that, csiphs
    would probably be more appropriate).

    - anton
     
    Anton Ertl, Feb 20, 2005
    #3
  4. Yeechang Lee

    Yeechang Lee Guest

    PSU concerns are why I went with an Antec 550W supply as opposed to
    some 300-400W noname brand. Since my rackmount case does not have room
    for a redundant supply, I suspect this is the best I can do. As you
    say, PSU problems are relatively rare.

    That said, anyone know how I can dynamically measure the actual
    wattage used by my system, beyond just adding up each individual
    component's wattage?
     
    Yeechang Lee, Feb 20, 2005
    #4
  5. Al Dykes

    Al Dykes Guest


    http://www.ahernstore.com/p4400.html about $30. I've got one.
     
    Al Dykes, Feb 20, 2005
    #5

  6. Another option is the Watts-Up meter, which I've been using for a few
    years; it's been very solid and reliable. But I don't know whether it's
    any better than the Kill-A-Watt, which is only a quarter of the price.

    There's a new Watts-Up Pro that has a nifty-looking PC (Windows)
    interface: http://www.nooutage.com/wattsup-pro.htm ... So geekorific, I
    might have to get one.
     
    chocolatemalt, Feb 20, 2005
    #6
  7. Not necessarily. PCI (and PCI-X) bandwidth is per bus, not per slot.
    So if those two cards are in two slots on one PCI-X bus, that's not
    distributing the bandwidth at all. The motherboard may offer multiple
    PCI-X busses, in which case the OP may want to ensure the cards are in
    slots that correspond to different busses. The built-in NIC on most
    motherboards (along with most other built-in devices) is also on one
    (or more) of the PCI busses, so consider bandwidth used by those as well
    when distributing the load.
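
    ('lspci -tv' prints the PCI device tree grouped by bus, which should
    make it easy to check whether the two 3Ware cards and the onboard NIC
    ended up on the same bus or on separate ones.)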
     
    John-Paul Stewart, Feb 20, 2005
    #7
  8. Probably, yes.
    Depends on which PCI-X (version, clock) and whether the slots are on
    separate PCI buses or not.

    If on separate buses, the highest clock is attainable and they both get
    the full PCI-X bandwidth, say 1GB/s (133MHz) or 533MB/s (66MHz).
    If on the same bus, the clock is lower to start with and they have to
    share that bus's PCI-X bandwidth, say a still-plenty 400MB/s each
    (100MHz), but it may become iffy in the case of a 66MHz clock
    (266MB/s each) or even 50MHz.
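
    (Rough arithmetic: a 64-bit bus moves 8 bytes per clock, so 133MHz is
    about 1066MB/s, 100MHz about 800MB/s, 66MHz about 533MB/s, and 50MHz
    about 400MB/s raw, before protocol overhead; halve the per-card share
    when two cards sit on the same bus.)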
    What if?
     
    Folkert Rienstra, Feb 21, 2005
    #8
  9. Yeechang Lee

    Yeechang Lee Guest

    The Supermicro X5DAL-G motherboard does indeed offer a dedicated bus
    to each PCI-X slot, thus my desire to spread out the load with two
    cards. Otherwise I'd have gone with the 7506-8 eight-channel card
    instead and saved about $120.

    The built-in Gigabit Ethernet jack does indeed share one of the PCI-X
    slots' buses, but I only have a 100Mbit router right now. I wonder
    whether I should expect it to significantly contribute to overall
    bandwidth usage on that bus, either now or if/when I upgrade to
    Gigabit?
     
    Yeechang Lee, Feb 21, 2005
    #9
  10. Yeechang Lee

    Yeechang Lee Guest

    No, the consensus is that Linux software RAID 5 has the edge on even
    3Ware (the consensus hardware RAID leader). See, among others,
    <URL:http://www.chemistry.wustl.edu/~gelb/castle_raid.html> (which
    does note that software striping two 3Ware hardware RAID 5 solutions
    "might be competitive" with software) and
    <URL:http://staff.chess.cornell.edu/~schuller/raid.html> (which states
    that no, all-software still has the edge in such a scenario).
     
    Yeechang Lee, Feb 21, 2005
    #10
  11. If all you care about is "rod length check" long-sequential-read or
    long-sequential-write performance, that's probably true. If, of
    course, you restrict yourself to a single stream...

    ....of course, in the real world, people actually do short writes and
    multi-stream large access every once in a while. Software RAID is
    particularly bad at the former because it can't safely gather writes
    without NVRAM. Of course, both software implementations *and* typical
    cheap PCI RAID card (e.g. 3ware 7/8xxx) implementations are pretty
    awful at the latter, too, and for no good reason that I could ever see.
     
    Thor Lancelot Simon, Feb 21, 2005
    #11
  12. Steve Wolfe

    Steve Wolfe Guest

    No, one PCI-X card would be just as good.
    The numbers that you posted from Bonnie++, if I followed them correctly,
    showed max throughputs in the 20 MB/second range. That seems awfully slow
    for this sort of setup.

    As a comparison, I have two machines with software RAID 5 arrays, one a
    2x866 P3 system with 5x120-gig drives, the other an A64 system with 8x300
    gig drives, and both of them can read and write to/from their RAID 5 array
    at 45+ MB/s, even with the controller cards plugged into a single 32/33 PCI
    bus.

    To answer your question, GigE at full speed is a bit more than 100
    MB/sec. The PCI-X busses on that motherboard are both capable of at least
    100 MHz operation, which at 64 bits would give you a max *realistic*
    throughput of about 500 MB/second, so any performance detriment from using
    the gigE would likely be completely insignificant.

    I've got another machine with a 3Ware 7000-series card with a bunch of
    120-gig drives on it (I haven't looked at the machine in quite a while), and
    I was pretty disappointed with the performance from that controller. It
    works for the intended usage (point-in-time snapshots), but responsiveness
    of the machine under disk I/O is pathetic - even with dual Xeons.

    steve
     
    Steve Wolfe, Feb 21, 2005
    #12
  13. Yeechang Lee

    Yeechang Lee Guest

    Agreed. However, those benchmarks were done with no tuning whatsoever
    (and, as noted, the three distributed computing projects going full
    blast); since then I've done some minor tweaking, notably the noatime
    mount option, which has helped. I'd post newer benchmarks but the
    array's right now rebuilding itself due to a kernel panic I caused by
    trying to use smartctl to talk to the bare drives without invoking the
    special 3ware switch.
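    (The noatime bit just means adding 'noatime' to the array's options in
    /etc/fstab, or 'mount -o remount,noatime /mnt/newspace' on a live
    system.)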
    That was my sense as well; I suspect network saturation-by-disk will
    only cease to be an issue when we all hit the 10GigE world.

    (Actually, the 7506 cards are 66MHz PCI-X, so they don't take full
    advantage of the theoretical bandwidth available on the slots,
    anyway.)
    Appreciate the report. Fortunately, as a home user, performance (or,
    given that I'm only recording TV episodes, even data integrity; hence
    no backup plans for the array, even if backing up 2.8TB were practical
    in any way budget-wise) isn't my prime consideration. Were I after
    that, I'd probably have gone with the 9000-series controllers and SATA
    drives, but my wallet's busted enough with what I already have!
     
    Yeechang Lee, Feb 21, 2005
    #13
  14. I noticed that, too, but then noticed that the OP seemed to be running
    three copies of Bonnie++ in parallel. His command line was:

    'bonnie++ -s 4G -m 3ware-swraid5-type -p 3 ; \
    bonnie++ -s 4G -m 3ware-swraid5-type-c1 -y & \
    bonnie++ -s 4G -m 3ware-swraid5-type-c2 -y & \
    bonnie++ -s 4G -m 3ware-swraid5-type-c3 -y &'

    I'm no expert, but if he's running three in parallel on the same
    software RAID, I'd suspect that the total performance should be taken as
    the *sum* of those three---or over 60 MB/sec.
    As another point of comparison: 5x73GB SCSI drives, software RAID-5,
    one U160 SCSI channel, 32-bit/33-MHz bus, dual 1GHz P-III: writes at 36
    MB/sec and reads at 74 MB/sec.
     
    John-Paul Stewart, Feb 21, 2005
    #14
  15. Peter

    Peter Guest

    (Actually, the 7506 cards are 66MHz PCI-X, so they don't take full...)
    There is no 66MHz PCI-X.
    3Ware 7506 cards are PCI 2.2 compliant 64-bit/66MHz bus master.
     
    Peter, Feb 21, 2005
    #15
  16. The PCI-SIG seems to think differently. Perhaps you know better, then?
    And contrary to what you say elsewhere, they say there is no 100MHz
    spec; that was added by the industry.
     
    Folkert Rienstra, Feb 21, 2005
    #16
  17. Yeechang Lee

    Yeechang Lee Guest

    I wrote earlier:
    As it turns out, it proved straightforward to use either 'smartctl -a
    --device=3ware,[0-3] /dev/twe[0-1]' or 3Ware's 3dm2 and tw_cli
    (available on the Web site) tools to read the serial numbers of the
    drives. So mystery solved.
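
    (A quick loop along these lines prints every port's serial number,
    assuming the two cards show up as /dev/twe0 and /dev/twe1:

    'for c in /dev/twe0 /dev/twe1; do for p in 0 1 2 3; do
       echo "$c port $p:"; smartctl -a --device=3ware,$p $c | grep -i "serial number"
    done; done'

    Matching those against the labels on the drives themselves settles
    which is which.)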
     
    Yeechang Lee, Feb 21, 2005
    #17
  18. Yeechang Lee

    Yeechang Lee Guest

    What's the difference? I thought 64-bit/66MHz PCI *was* PCI-X.
     
    Yeechang Lee, Feb 21, 2005
    #18
  19. Rod Speed

    Rod Speed Guest

    That's measuring the power INTO the power supply, not what it's
    supplying, so it isn't very useful for checking how close you are
    getting to the PSU rating.
     
    Rod Speed, Feb 21, 2005
    #19
  20. Steve Wolfe

    Steve Wolfe Guest

    I noticed that, too, but then noticed that the OP seemed to be running...
    Good point- I missed that!

    steve
     
    Steve Wolfe, Feb 22, 2005
    #20
