Unimpressive performance of large MD raid

Discussion in 'Storage' started by kkkk, Apr 22, 2009.

  1. kkkk

    kkkk Guest

    Hi there,
    we have a big "storage" computer: dual Xeon E5345 @ 2.33GHz (8 cores
    total) with lots of disks, some connected to a 3ware 9650SE controller
    and some to the SATA/SAS controllers on the motherboard.
    The hard disks are Western Digital WD7500AYYS 750GB drives.

    We are using an ext3 filesystem (default mount options) on top of LVM +
    MD RAID 6. The RAID 6 spans 12 disks (so 10 disks for data and 2 for
    parity). 6 of those disks are on the motherboard controller, the others
    on the 3ware.

    I hoped I would get something like 1 GB/sec sequential write out of 10
    data disks :p but instead I see MUCH lower performance.

    I can't figure out where the bottleneck is!

    In sequential read with separate instances of "dd", one per drive
    (reading directly from the block device), I can reach at least 800 MB/sec
    without problems (I could probably go much higher, I just haven't tried).
    So I would rule out a bus bandwidth problem (it's PCI Express in any
    case, and the 3ware is in an 8x slot).

    Here are my write results:

    I am writing a sequential 14GB file with dd:
    time dd if=/dev/zero of=zerofile count=28160000 conv=notrunc ; time sync
    (the throughput I report is not the one printed by dd: it is adjusted
    by hand after also accounting for the time sync takes, so it is close to
    the real throughput. I confirm the drive LEDs are off after sync
    finishes.) There is no other I/O activity. The disk scheduler is deadline
    for all drives.
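
    (For completeness, an equivalent test with a larger block size would look
    roughly like the sketch below; the bs/count values are just one way to get
    the same ~14GB, and I believe GNU dd's conv=fdatasync makes dd itself do
    the final flush before it reports, so the separate "time sync" would not be
    needed. If this dd doesn't support fdatasync, the time+sync method above
    works fine.)

    # sketch: ~14GB in 1MB blocks; dd flushes to disk before printing its rate
    dd if=/dev/zero of=zerofile bs=1M count=14336 conv=fdatasync,notrunc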

    All caches enabled, on both the 3ware and the disks attached to the motherboard:
    first write = 111 MB/sec
    overwrite = 194 MB/sec

    Write cache enabled only on the disks connected to the motherboard (6 of the 12):
    first write = 95 MB/sec
    overwrite = 120 MB/sec

    Cache disabled everywhere (this makes the final flush take an incredibly
    long time):
    first write = 63 MB/sec
    overwrite = 75 MB/sec


    I have watched in top and htop what happens. htop reports LOTS of red
    bars (iowait?), practically 50% red bars on every core (all 8 cores).
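
    (To see which devices are actually busy during the test, something like
    the following should help; it needs the sysstat package and shows
    per-disk throughput, queue size and utilization once per second:)

    # extended per-device statistics, refreshed every second
    iostat -x 1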

    Here is what happens in a few of those situations:
    - Cache all enabled, overwrite:
    dd is constantly at 100% CPU (question: shouldn't it be near 0% CPU,
    always waiting on blocking I/O?). Depending on the moment, either
    kjournald or pdflush is at about 75%; most of the time it is kjournald.
    md1_raid5 (RAID 6 in fact) is around 35%.

    - Cache all enabled, first write:
    like above, but there are often moments in which neither kjournald nor
    pdflush is running; hence the speed difference. dd is always at nearly
    100% CPU.

    - Cache only on the disks attached to the motherboard, overwrite:
    similar to "cache all enabled, overwrite", except that in this case dd
    never reaches 100%; it is around 40%, and the other processes are down
    accordingly, hence the lower speed. There are more red bars shown in
    htop, on all cores.

    - Cache only on the disks attached to the motherboard, first write:
    dd reaches 100%, but kjournald reaches 40% max and pdflush 15% max.
    md1_raid5 is down to about 15%.

    - Cache all disabled, overwrite:
    dd reaches about 30%, kjournald 20% max and md1_raid5 10% max.
    Actually dd alone does reach 100%, but only in the first 20 seconds or
    so, and at that time kjournald and md1_raid5 are still at 20% and 10%.

    - Cache all disabled, first write:
    similar to the above.


    So I don't understand how the whole thing works here.
    I don't understand why dd's CPU is at 100% (caches on) instead of 0%.
    I don't understand why kjournald doesn't go to 100%, and I don't
    understand what kjournald has to do in the overwrite case (there is no
    significant journal traffic on overwrites, right? I am using the
    defaults, which should be data=ordered).
    I don't understand why the caches change sequential-write performance so
    much...
    Also, a question: if I had the most powerful hardware RAID, would
    performance still be limited to 200 MB/sec because of kjournald?


    Then I have another question: "sync" from bash really seems to work, in
    the sense that it takes time, and after that time I can confirm that the
    activity LEDs of the drives are really off. But I have MD RAID 6 + LVM
    here! Weren't both MD RAID 5/6 AND LVM supposed NOT to pass write
    barriers downstream to the disks? Doesn't sync rely exactly on those
    barriers (implemented with device cache flushes)? Yet sync here seems to
    work!
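
    (One thing I suppose I could check is whether ext3 actually managed to
    use barriers on this MD+LVM stack; if the lower layers refuse them, the
    kernel normally complains in the log. The exact message text varies by
    kernel version, so this is only a sketch:)

    # look for barrier-related complaints from the journaling layer
    dmesg | grep -i barrier
    # typically something like "JBD: barrier-based sync failed ... -
    # disabling barriers" when they are not supported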

    Thanks for your help
     
    kkkk, Apr 22, 2009
    #1

  2. calypso

    calypso Guest

    First of all, you've got a best-of-breed 3Ware 9650SE controller, which
    has the best RAID6 of all SATA RAID controllers... and you're mixing it
    with the onboard controller to build a software RAID6?! WHY?!!!

    Second, 12 drives for RAID6 is suboptimal... Go with 8 or 16 drives and
    attach them directly to 3Ware...

    Do this:
    Attach 8 drives to the 3Ware 9650SE (it is a 9650SE-8, right?) and 4
    drives to the onboard controller... Build a hardware RAID6 on the 3ware
    and, if possible, a hardware RAID6 on the onboard controller... That
    gives you 2 logical drives... If you really want, concatenate them via
    LVM, but I wouldn't suggest it...

    One other thing: RAID6 has double distributed parity (distributed just
    like RAID5's single parity is)... So every drive holds data, and every
    drive is also used for parity... With your mixed configuration - very
    unlikely to work out well! :/

    --
    "Divovskis li pijetaou maltretiru ?" upita majmuna pjeva Rudio mirise.
    "Ne znam ja nista !" rece skupstinaa maltretira "Ja samo plivaco zvace divovskim !"
    By runf

    Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
    http://inovator.blog.hr
    http://calypso-innovations.blogspot.com/
     
    calypso, Apr 23, 2009
    #2

  3. kkkk

    kkkk Guest

    We are a research entity and the funding comes at unknown times. We
    decided to build the system so that any component can be replaced with
    any other similar component available in shops at any time, e.g. the
    9650SE can be replaced with multiple controllers in the future. If the
    3ware breaks some day, there might not be a compatible controller in
    production at that time. (We cannot buy from eBay!)
    With the current setup, in any emergency the disks can be connected via
    any controller, or even USB, and the Linux MD RAID will still work and
    we will be able to get the data out. Furthermore, we trust the visible,
    open, old and well-tested Linux MD code more than any embedded RAID code
    that nobody knows except 3ware. What if there were a bug in the 9650SE
    firmware? It was a recent controller when we bought it, and we would
    have found out only later, maybe years after setting up our array.
    Also, we were already proficient with Linux MD.

    Anyway, since the Linux MD RAID thread never uses more than 35% CPU (of
    a single core!) in any test, I don't think it is the bottleneck. But
    this is part of my question.
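
    (As a sanity check on the raw parity speed: the kernel benchmarks its
    xor/RAID6 routines at boot, so those numbers should be in the log; the
    exact wording differs between kernel versions:)

    # boot-time benchmark of the parity code used by MD
    dmesg | grep -iE 'raid6|xor'
    # e.g. lines like "raid6: using algorithm sse2x4 (.... MB/s)"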

    We already have lots of data and virtual machines loaded in there. Even
    if it were possible to attach everything to the 3ware controller (it
    actually might be, since it is a 16ML [we have 24 drives in the
    machine]), we wouldn't have used the 3ware's RAID, for the reasons
    explained above.

    With MD RAID it shouldn't make a difference, unless you mean that the
    larger cache on the 3ware speeds things up. This is again part of my
    question: the caches have a dramatic effect on sequential I/O which I do
    not completely understand. It must be something related to bus overhead
    or to CPU context switching (for servicing the interrupts), but I would
    like a confirmation. Also consider that with 8 cores and a PCI Express
    bus, both overheads should be negligible. And the drives' own caches
    should be enough to minimize that overhead (I mean for the motherboard
    drives), so I would not expect a tremendous speedup from putting all the
    drives behind the 3ware cache (still with MD, I mean).
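
    (For reference, this is roughly how I toggle the caches between the test
    runs above; the device and unit names are placeholders, and the tw_cli
    syntax is from memory, so check it against the 3ware CLI manual:)

    # per-disk write cache on the motherboard ports (sdX is an example name)
    hdparm -W1 /dev/sdX    # enable the drive's write cache
    hdparm -W0 /dev/sdX    # disable it
    # 3ware unit write cache via tw_cli (controller/unit numbers are examples)
    tw_cli /c0/u0 set cache=on
    tw_cli /c0/u0 set cache=off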
    LVM concatenation looks very unsafe...
    I know. My sentence was meant to explain the exact chunk/stride size we
    have.
    What performance would you expect from a 3ware hardware RAID6 of 12
    disks with ext3 (default mount options) on a sequential dd write?

    Thank you
     
    kkkk, Apr 23, 2009
    #3
  4. kkkk

    kkkk Guest

    Read my other reply.
    Had we found a cheap non-RAID 16-port SATA controller for PCI Express,
    we would have bought it. If you know of any, please tell me.

    Thank you
     
    kkkk, Apr 23, 2009
    #4
  5. David Brown

    David Brown Guest

    I entirely understand your reasons for wanting to use Linux software
    raid rather than a hardware raid. But I have a couple of other points
    and questions - as much for my own learning as anything else.

    If you have so many disks connected, did you consider having at least
    one as a hot spare? If one of your disks dies and it takes time to
    replace it, the system will be very slow while running degraded.

    Secondly, did you consider raid 10 as an alternative? Obviously it is
    less efficient in terms of disk space, but it should be much faster. It
    may also be safer (depending on the likely rates of different kinds of
    failures) since there is no "raid 5 write hole". Raid 6, on the other
    hand, is probably the slowest raid choice. Any writes that don't cover
    a complete stripe will need reads from several of the disks, followed by
    parity calculations - and the more disks you have, the higher the
    chances of hitting such incomplete stripe writes.

    <http://www.enterprisenetworkingplanet.com/nethub/article.php/10950_3730176_1>
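
    For what it's worth, a Linux raid 10 array with the "far" layout would
    be created with something along these lines (the device names, chunk
    size and array name below are only placeholders, not a recommendation):

    # sketch: 12-disk Linux raid10, "far 2" layout, 64K chunks
    mdadm --create /dev/md2 --level=10 --layout=f2 --chunk=64 \
          --raid-devices=12 /dev/sd[b-m]1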
     
    David Brown, Apr 23, 2009
    #5
  6. kkkk

    kkkk Guest

    Of course! We have 4 spares shared among all the arrays.
    I wouldn't expect the performance of RAID 10 via MD to be higher than
    the RAID 6 of my original post (it might even be much slower with the
    same number of drives) because, as I mentioned, the "md1_raid5" (RAID 6
    actually) process never goes above 35% CPU. And the read+checksum+write
    problem of RAID 5/6 for small writes shouldn't apply here, because I am
    doing a sequential write.

    Also, the overhead you mention only occurs when the stripe is not in
    cache, but with large amounts of RAM I expect the stripes to be in cache
    (especially the stripes holding the file/directory metadata... the rest
    doesn't matter since the write is sequential). Yesterday during the
    tests the free RAM on that machine was 33GB out of a total of 48GB...
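
    (One MD-specific knob I still want to try is the raid5/6 stripe cache,
    which as far as I understand is separate from the page cache and fairly
    small by default; md1 is my array name here:)

    # current number of stripe-cache entries (memory used is roughly
    # page_size * nr_disks * this value)
    cat /sys/block/md1/md/stripe_cache_size
    # try a larger value and re-run the sequential dd test
    echo 8192 > /sys/block/md1/md/stripe_cache_size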
     
    kkkk, Apr 23, 2009
    #6
  7. calypso

    calypso Guest

    For data recovery purposes, anyone can spare $50 and buy one as a
    private person from eBay...
    There is very good support for 3Ware controllers on www.3ware.com, just
    check the knowledge base...

    The 3Ware 9650SE is at least a 4-year-old controller and all the bugs
    have been fixed (there were some incompatibilities at first with
    chipsets, OSes and such, but now it works as it should)...
    RAID ASIC + onboard cache vs. a software implementation? RAID ASIC,
    always... Because! :)

    8 or 16 drives is the optimal number for RAID5 and RAID6...

    How do you calculate the block size of a stripe with a number of drives
    other than 8 or 16?

    64/12 = ?
    256/10 = ?

    I don't know how the cache is organized in 3Ware controllers, but in EMC
    storage systems (CLARiiON) you can choose the memory page size (4kB, 8kB
    or 16kB) to optimize it for certain applications...
    Your reasons are quite paranoid... And using a 9650SE-16ML (a $1000
    controller) as a plain SATA controller is, sorry for the term, stupid... ;)

    http://store.3ware.com/?category=10&subcategory=8&productid=9650SE-16ML

    Put 16 drives on this controller and build a RAID6 from them... If you
    really want, buy yourself another one as a spare, but the company I
    worked for has sold many servers and workstations built on
    Supermicro+3Ware combinations, and I haven't yet heard of a controller
    going dead... That was around 5 years ago when I first started working
    for that company; back then I used 3Ware 8506 controllers...
    256MB of cache is quite a lot for a RAID controller, and it is used
    mostly as a write cache (and as local memory for the RAID ASIC chip that
    does the RAID5 and RAID6 calculations)...

    And why a RAID controller is much faster than your software RAID is
    very simple... A RAID controller has its own firmware with many
    optimized RAID features (many years of research into RAID algorithms
    have gone into the hardware; the 9650SE uses the 8th generation of
    StorSwitch technology) and has onboard cache for the RAID functions...
    I haven't tested RAID6 on the 9650SE, but I have tested RAID5 on a
    9650SE (the older PCI-X generation), and IIRC got around 250MB/s write
    from 15x160GB Hitachi 7200rpm SATA drives... So with this 9650SE I would
    expect at least around 350MB/s from 16 of today's SATA drives...
    Consider that bandwidth is not what you'll be worried about; it's more
    the RAID6 write penalty, which the cache memory cancels out (it's 6 IOPS
    per write)...


    Just use this 3Ware the way it is meant to be used and stop worrying
    about 'what will happen' scenarios... :) The 3Ware won't die; the drives
    are far more likely to die...

    --

    Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
    http://inovator.blog.hr
    http://calypso-innovations.blogspot.com/
     
    calypso, Apr 23, 2009
    #7
  8. David Brown

    David Brown Guest

    You didn't mention it, so I thought I'd check, since I don't know your
    background or experience. I've heard of people using raid 6 because
    then they don't need hot spares - the array will effectively run as raid
    5 until they replace the dud drive...
    Linux raid 10 with "far" layout normally gives sequential read
    performance around equal to a pure striped array. It will be a little
    faster than raid 6 for the same number of drives, but not a huge
    difference (with "near" layout raid 10, you'll get much slower
    sequential reads). Sequential write for raid 10 will be a little slower
    than for raid 6 (since you are not cpu bound at all). But random
    writes, especially of small sizes, will be much better, as will the
    performance of multiple simultaneous reads (sequential or random). Of
    course this depends highly on your workload, and is based on how the
    data is laid out on the disk.

    Where you will really see the difference is when a disk fails and you
    are running in degraded mode and rebuilding. Replacing a disk and
    rebuilding takes about a tenth of the disk activity with raid 10
    compared to raid 6 - it only needs to read through a single disk to do
    the copy. With raid 6, the rebuild involves reading *all* the data off
    *all* the other disks. And according to some articles I've read, the
    chance of hitting an unrecoverable sector read error during this rebuild
    with many large disks is very high, leading to a second disk failure.
    This is, of course, totally independent of whether you are using
    software raid or (as others suggest) hardware raid.

    It looks quite likely that your performance issues are some sort of IO
    bottleneck, but I don't have the experience to help here.
    You're right here - caching the data will make a very big difference.
    And this could be an area where software raid on Linux will do much
    better than hardware raid on the card - the software raid can use all of
    that 48 GB for such caching, not just the memory on the raid card.

    Thanks for your comments - as I said, I'm learning about this myself
    (currently mostly theory - when I get the time, I can put it into practice).
     
    David Brown, Apr 24, 2009
    #8
  9. David Brown

    David Brown Guest

    An alternative to consider, especially if you are working mainly with
    large files, is xfs rather than ext3. xfs works better with large files
    (mainly due to its support of extents), and has good support for working
    with raid (it matches its data and structures to the raid stripes).
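
    As an illustration, the stripe geometry can be given to xfs explicitly
    at mkfs time (the values below assume the 64K chunk and 10 data disks
    mentioned earlier, and the device path is just a placeholder; I believe
    recent mkfs.xfs can often detect the geometry from MD by itself):

    # sketch: 64K stripe unit, 10 data disks -> 640K stripe width
    mkfs.xfs -d su=64k,sw=10 /dev/vg0/bigvolume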
    There is a lot more information about linux raid5 than raid6. I think
    that reflects usage. Raid 6 is typically used when you have a larger
    number of drives - say, 8 or more. People using such large arrays are
    much more likely to be looking for higher-end solutions with strong
    support contracts, and are thus more likely to be using something with
    high-end hardware raid cards. Raid 5 needs only 3 disks, and is a very
    common solution for small servers. If you search around for
    configuration how-tos, benchmarks, etc., you'll find relatively few that
    have more than 4 disks, and therefore few that use raid 6. There's also
    a trend (so I've read) towards raid 10 (whether it be linux raid10, or
    standard raid 1 + 0) rather than raid 5/6 because of better recovery.
     
    David Brown, Apr 24, 2009
    #9
  10. calypso

    calypso Guest

    A RAID3 implementation doesn't exist on 3Ware controllers... So far I've
    seen RAID3 only on some storage arrays (only EMC, to tell you the truth,
    I'd have to check the others) and on an old SCSI RAID controller - an
    AMI MegaRAID Enterprise 1300 that I once had...

    RAID3 is very similar to RAID5, but it doesn't have distributed parity;
    instead it has a dedicated parity drive, and yes, it's used only for
    special purposes where big sequential read/write speed is needed... But,
    like I said already, 3Ware doesn't support it...

    It seems I was partially right about 8 or 16 drives being the optimal
    number... For RAID6 it seems optimal to have 6, 10 or 18 drives (4+2,
    8+2, 16+2)... Here's a nice text from an EMC guy (look at "Stripe size
    of a LUN"):

    http://clariionblogs.blogspot.com/



    --

    Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
    http://inovator.blog.hr
    http://calypso-innovations.blogspot.com/
     
    calypso, Apr 24, 2009
    #10
  11. kkkk

    kkkk Guest

    With an 8x PCI-e bus there should be room for 2 GB/sec of transfer...
    That is true only for non-sequential writes.

    In my case the system starts writing 5 seconds after dd begins pushing
    data out (dirty_writeback_centisecs = 500). By then there is so much
    sequential data queued that it fills many stripes completely.
     
    kkkk, Apr 24, 2009
    #11
  12. kkkk

    kkkk Guest

    I still don't understand. Why should 4+2, 8+2, 16+2 be more optimal?
    Please note that one RAID chunk is NOT one block (512 bytes) long. In
    fact on my RAID 6 it is 64KB long, so the stripes are 64*10 = 640KB
    long. What's wrong with that? Why should that perform worse than a
    512KB-long stripe? Please also note that ext3 has a 4K block size, so
    there are 160 and 128 ext3 blocks per stripe in the two configurations
    respectively. I don't see why 128 blocks should be significantly better
    than 160 blocks..!?
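
    (For reference, this is how that geometry maps onto ext3 in my case:
    stride = chunk / block = 64K / 4K = 16, and stripe width = stride * 10
    data disks = 160 blocks. At mkfs time that would be something like the
    line below; the device path is a placeholder, and I'm not sure whether
    this e2fsprogs version spells the option stripe-width or stripe_width:)

    # sketch: align ext3 allocation to the 640KB RAID6 stripe
    mkfs.ext3 -b 4096 -E stride=16,stripe-width=160 /dev/vg0/bigvolume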
     
    kkkk, Apr 24, 2009
    #12
  13. Michel Talon

    Michel Talon Guest

    RAID3 you can get with FreeBSD and its geom module, if you need it.
     
    Michel Talon, Apr 24, 2009
    #13
  14. kkkk

    kkkk Guest

    You mean on *Linux MD* raid5? That could be good. Where?

    The RAID 6 algorithms are practically equivalent to RAID 5, except for
    the parity computation, obviously.
     
    kkkk, Apr 24, 2009
    #14
  15. kkkk

    kkkk Guest

    In my case dd pushes 5 seconds of data before the disks start writing
    (dirty_writeback_centisecs = 500). dd always stays at least 5 seconds
    ahead of the writes. This should fill all stripes completely, causing no
    reads. I even tried raising dirty_writeback_centisecs, with no
    measurable performance benefit.

    Where is this 5 seconds' worth of data stored? Is it at the ext3 layer,
    at the LVM layer (I doubt this one; I also notice there is no LVM kernel
    thread running) or at the MD layer?
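
    (For reference, these are the writeback knobs I have been looking at;
    reading them is just:)

    # current page-cache writeback tunables
    sysctl vm.dirty_writeback_centisecs vm.dirty_expire_centisecs \
           vm.dirty_ratio vm.dirty_background_ratio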

    Why do you think dd stays at 100% CPU (with the disk/3ware caches
    enabled)? Shouldn't that be 0%?

    Do you think the CPU usage is high due to a memory-copy operation? If it
    were that, I suppose dd from /dev/zero to /dev/null should go at
    200MB/sec; instead it goes at 1.1GB/sec (with 100% CPU occupation
    indeed, 65% of which is in kernel mode). That would mean the number of
    copies performed by dd while writing to the ext3-on-RAID is 5 times
    greater than when copying from /dev/zero to /dev/null. Hmmm... a bit
    hard to believe; there must be other work done in the ext3 case that
    hogs the CPU. Does the ext3 code run within the dd process when dd
    writes?
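
    (A quick way to see how much of dd's CPU is just per-call overhead would
    be to repeat the null copy with a much larger block size; the sizes
    below are arbitrary, chosen only so both commands move about 2GB:)

    # default 512-byte blocks: millions of read()/write() calls
    dd if=/dev/zero of=/dev/null count=4000000
    # 1MB blocks: roughly 2000x fewer syscalls for the same amount of data
    dd if=/dev/zero of=/dev/null bs=1M count=2000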

    I think this overhead should affect the first writes but not the
    rewrite performance with ext3's default mount options (the defaults
    should be data=ordered, which I think means no journal is written for
    rewrites, correct?). Am I correct?

    Hmm, probably not, because kjournald had significant CPU occupation.
    What is the role of the journal during file overwrites?

    Agreed.

    Thanks for your answer
     
    kkkk, Apr 24, 2009
    #15
  16. kkkk

    kkkk Guest

    What filesystem and operating system? This is important...

    I assume you mean "first write of a sequential file"?
    (overwrites, as you can see, are much faster)
    "6 IOPS per write"? Could you explain this?

    Thank you
     
    kkkk, Apr 24, 2009
    #16
  17. calypso

    calypso Guest

    Windows XP, NTFS...
    Those results are from a benchmarking tool used with Blackmagic Design
    video capture cards...
    That's the normal write penalty for small writes in RAID6; RAID5 has a
    4-IOPS write penalty... (a small RAID5 write has to read the old data
    and old parity, then write the new data and new parity = 4 I/Os; RAID6
    adds a read and a write for the second parity block, hence 6)

    http://www.slichke.com/viewer.php?id=rgh1240568505h.png

    This picture is taken from Berkeley lectures about RAID by Prof.
    Patterson (one of the inventors of RAID)...

    --
    Biljkaa rascvjetava debeli krekero pije navecer pod stolom.
    By runf

    Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
    http://inovator.blog.hr
    http://calypso-innovations.blogspot.com/
     
    calypso, Apr 24, 2009
    #17
  18. calypso

    calypso Guest

    Because you're thinking in decimal instead of binary/hexadecimal...

    Cache memory optimizations and firmware optimizations are done in base
    2, not base 10...

    --
    Iza kuce cigano siluje crven Crnogorkaog gladija
    za pet minuta. By runf

    Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
    http://inovator.blog.hr
    http://calypso-innovations.blogspot.com/
     
    calypso, Apr 24, 2009
    #18
  19. kkkk

    kkkk Guest

    I suspected that. I suspect NTFS is much faster than ext3; it is
    probably comparable to XFS on Linux (and also less safe, e.g. in case of
    power loss, just like XFS). Speed depends, among other things, on how
    paranoid the journal behaviour is.
     
    kkkk, Apr 24, 2009
    #19
  20. calypso

    calypso Guest

    NTFS unsafe in case of power loss? You've missed something; we're not
    talking about FAT here (which is faster than NTFS)...

    --
    "Bradats li mackau farbu ?" upita Dzonia pasira Miskoo podmazuje.
    "Nisam ja nikog bombardiro !" rece bombao udise "Ja samo Zidovo hoce cokoladanm !" By runf

    Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
    http://inovator.blog.hr
    http://calypso-innovations.blogspot.com/
     
    calypso, Apr 24, 2009
    #20
